Memory mapped IO for fun and profit
TweetWhat I am going to describe below is a fairly straightforward application of memory mapped IO to get huge benefits versus normal IO when loading static, but large, data structures. You can find the code for it here (note: linux only, though it would be an easy windows port). For our face recognition SDK, our model files are large binary files that are full of various bits of (odd-sized) data. We recently flattened everything out and made our initialization, effectively, zero-copy. This is a write-up of that process on a contrived example to make it a bit simpler. If you’ve ever used mmap to persist a large static data structure, you’ll probably find nothing of value here, but for the rest, read on.
For this example, consider a very large list of strings (and their lengths) that we need to persist to disk. Or, a very large array of these:
typedef struct {
char *data;
int len;
} data_t;
To make the problem non-trivial, we will consider the case of different length strings (in my tests I was using a million strings around 150 bytes in length — plus or minus a few to keep them from all being the same size).
Let’s get started
Here is a naive read/write for this data structure:
void naive_write(char* filename, data_t* data, int n)
{
FILE *f;
int i;
f = fopen (filename, "wb" );
fwrite(&n, sizeof(n), 1, f);
for (i=0;i<n;i++)
{
fwrite(&data[i].len, sizeof(data[i].len), 1, f);
fwrite(data[i].data, sizeof(char), data[i].len, f);
}
fclose(f);
}
data_t* naive_read(char* filename)
{
data_t* answer;
FILE *f;
int i, n;
f = fopen (filename, "rb");
fread(&n, sizeof(n), 1, f);
answer = malloc(n * sizeof(data_t));
for (i=0;i<n;i++)
{
fread(&answer[i].len, sizeof(answer[i].len), 1, f);
answer[i].data = malloc(sizeof(char) * answer[i].len);
fread(answer[i].data, sizeof(char), answer[i].len, f);
}
return answer;
}
We first write the number of strings, and then for each string we write its length followed by the string data. When reading, we read the number of strings (and malloc the array of strings) and then for each string we read its length (to malloc the string data itself) and then read the string. This implementation does many mallocs() and it does many freads(). It is, consequently, fairly slow.
With this data format on disk, though, this would be difficult to (easily) optimize much more. So, let’s change the layout.
A better file format
void flat_write(char* filename, data_t* data, int n)
{
FILE *f;
int i;
f = fopen (filename, "wb" );
fwrite(&n, sizeof(n), 1, f);
for (i=0;i<n;i++)
fwrite(&data[i].len, sizeof(data[i].len), 1, f);
for (i=0;i<n;i++)
fwrite(data[i].data, sizeof(char), data[i].len, f);
fclose(f);
}
We made a very minor change to how we write the data structure to disk. This time, we put all of the lengths up front, and then all of the strings after that. This is going to allow us two rather large simplifications: 1) we can malloc() once for all of the data strings, 2) we can read them all with a single fread(). This is a pretty big change because it effectively removes all malloc() pressure and puts everything into fread(). The read function does, though, become slightly more complicated:
data_t* onecopy_read(char* filename)
{
data_t* answer;
FILE *f;
int i, n, total;
f = fopen (filename, "rb");
fread(&n, sizeof(n), 1, f);
answer = malloc(n * sizeof(data_t));
total = 0;
for (i=0;i<n;i++)
{
fread(&answer[i].len, sizeof(answer[i].len), 1, f);
total += answer[i].len;
}
answer[0].data = malloc(sizeof(char)*total);
fread(answer[0].data, sizeof(char), total, f); /* one giant fread */
/* set all the ptrs */
total=0;
for (i=0;i<n;i++)
{
answer[i].data = answer[0].data + sizeof(char)*total;
total += answer[i].len;
}
return answer;
}
This is a decent optimization but not ideal. fread, still, involves a copy (copying from disk into our program’s address space). The only way around this problem is to not use fread.
Towards zero-copy
Using the exact same file, we can open it using mmap() (windows users see: MappedViewOfFile). This maps the file directly into our address space without having to do any copying whatsoever. We can then just setup all the data pointers to point to the memory mapped file, and we’ve initialized our data structure with zero copies.
data_t* zerocopy_read(char* filename)
{
data_t* answer;
int f;
int i, n, total;
char *data_ptr;
int* hdr_ptr;
struct stat sb;
f = open (filename, O_RDONLY);
fstat(f, &sb);
hdr_ptr = mmap(0, sb.st_size, PROT_READ, MAP_PRIVATE, f, 0);
n = *hdr_ptr++;
answer = malloc(n*sizeof(data_t));
for (i=0;i<n;i++)
answer[i].len = *hdr_ptr++;
data_ptr = (char*) hdr_ptr;
total=0;
for (i=0;i<n;i++)
{
answer[i].data = data_ptr + sizeof(char)*total;
total += answer[i].len;
}
return answer;
}
Timings
So the moral of the story, then, is how much faster are the various versions. Here’s some timings:
(for num_strings=1,000,000 and str_len= ~150 bytes) naive_read time: 0.373289 1copy_read time: 0.202408 0copy_read time: 0.011233
It’s important to understand what we are actually measuring here. I ran this test on Linux (Windows isn’t much different) which has a file cache. The actual disk wasn’t ever really involved in this program or these timings. We are really testing the summed speed of our processing, malloc() (in the first case), fread(), and the kernel’s handling of cached files. (note: if you modified this program to only read the files, and then reboot your machine to clear the file-cache, the first run would be much slower as the file was read in from disk and cached).
Benefits
Having these large static data structures persist using memory mapped files has some rather significant advantages beyond just initializing extremely fast. We’ve shifted the entire burden of managing this memory onto the operating system. If multiple processes are being run, this memory will be shared amongst them. This is not true of the fread() solution. The OS’s file cache will effectively keep the file in memory and ready to go whenever it’s been recently run. If there are portions of your data you aren’t using (in face recognition imagine a scenario where you only do face detection), the rest will never get loaded (or will eventually be swapped out). And last, but not least, the flattening process puts all of the bits contiguous in memory (compare with the naive approach), allowing for the OS to handle the pages of data much more efficiently.
Tweet
twitter
November 15th, 2010 at 4:57 pm
Maybe
sync
echo 3 > /proc/sys/vm/drop_caches
…
should result in re-reading file from disk.
November 16th, 2010 at 3:19 am
really interesting and elegant I think, this solution would have been great in the past for some companies where I worked with GBs of data in initialization… (actually even only the fread and data layout optimization is a great improvement in some cases). anyway I never had the opportunity to do that kind of changes
November 17th, 2010 at 12:59 am
First of all, good article – this is a good use of mmap() to speed up your code. Mmap has a lot of great uses, thanks for pointing this out.
However, I’m fairly certain the speedup does not come from “zero copy” – I believe mmap simply modifies the page tables for your process such that you fault on first access to a page. Then the kernel fault handler initiates a block transfer from the device into the page cache, and updates your page table entry to point to the (cached) data from the storage device.
Rather, the speedup comes from removing the plethora of system calls (fread) and replacing it with a fault.
So it’s not so much zero copy as it is less software overhead.
Great post, happy coding
November 17th, 2010 at 8:46 am
Ok, so please ignore my stupid comment above… I see now why you’re calling it a zero copy operation. Mmap’d pages aren’t copied from the page cache into the processes stack or heap, so you are correct there is less copying. The data is read from device (except for XIP-capable block devices and hardware, such as NOR) and copied into the page cache, so the “zero copy” tripped me up.
As you pointed out, another reason people use mmap is because it’s a simpler software interface. You can treat files like a chunk of memory, and let the OS worry about the IO behind the scenes. Beyond the obvious performance implications, it means your software can be simpler – you don’t have to use fread/fwrite, or (gasp) async IO
Great post, keep ‘em coming
November 26th, 2010 at 11:06 am
It’s quite off-topic, but sometimes the input file needs to be ASCII. For example, consider a file where each line is made of the same number of fields of various types (int, float, string…) printed in ASCII.
Could similar techniques be used to speed-up the load of such a file ?
December 7th, 2010 at 9:45 am
The downside is that the data files will not be very portable to processors/compilers with different endianess or alignment requirements.
February 7th, 2012 at 8:23 am
Calories burned by walking depend on the metabolism rate of the person, but on an average, a 30 minute brisk walk can help you burn around 200 calories. Ideal brisk walking speed differs for each individual. But generally, you should walk at a speed which is quite fast but not very exhausting.
places to stay hervey baypijnbestrijding
February 9th, 2012 at 4:33 am
You just have to collect the money for it. Air still try to take it for cleaner, but clean, saying that they do not need more people.
климатици
February 21st, 2012 at 9:32 am
Window could not challenge this claim and will therefore neshte-rode the bike with your 1000lv dress, shoes for 250lv, bag of 200 BGN and shoved his hair in 50lv helmet. влагоуловители
February 22nd, 2012 at 9:55 am
Unclogging the sink a few streets. Ideally, it was close and could go without a car. When he got to drain the house and knocked on his open-collar a pretty chick who for a moment made him forget about sanitation, forget even his own name. отпушване на канали