Computational performance — a beginner’s case study
TweetSo, not too long ago, I wrote this post. In it, I asked the question which bit of code was faster:
/* going vertical */
for (x=0;x<width;x++)
for (y=0;y<height;y++)
a[y*width+x] = b[y*width+x] + c[y*width+x]
/* going horizontal */
for (y=0;y<height;y++)
for (x=0;x<width;x++)
a[y*width+x] = b[y*width+x] + c[y*width+x]
I made the unfortunate comment, at the time, that every programmer should know which is faster. It turns out that I spoke too soon and maybe was a bit shortsighted. I found this post on stack overflow. In it the questioner re-asks the question: which is faster, and why?
Sadly, the answer he is given, and the answer he chose as the “best”, is almost completely wrong [UPDATE: the "community" eventually overturned the accepted answer and chose a better one]. Since I was the cause of all this confusion, I figured I’d go ahead and give a detailed explanation of which is faster and why.
The 2d matrices are mapped in row-order, meaning you put each row into memory one after the other. (Contrasted with column order where each column is put into memory, one after the other.). It’s really important you understand that fact before we proceed because the entire post hinges on that. If you were to store your matrices in column order, the entire logic of this post flips.
Science
Let them never say we programmers are not scientists. This is the scientific method at work. Let’s test these hypotheses with some real experiments. I’ve uploaded the code I’ve used to github and you can find it here.
For most of my experiments, I added two matrices of size 4096×4096 and stored the result into a third one. This is large enough to wash out any “cache memory” from our experiments. If your matrices are sufficiently small, you can fit the entire thing into large L2 caches, ruining the experiment. These matrices are about 67 MB which ensures that by the time you iterate through one, the beginnings of that matrix will have already been ejected from cache.
Let’s start easy — which is faster, traversing horizontally or vertically?
The answer is: horizontal, by a landslide. I’ve compiled two versions of the program, one with optimizations (-O3 -funroll-loops) and one without optimizations, and run both.
louis@noisy:~/simple-optimization-test$ ./slow horiz: 1.123245 vert: 14.359954 louis@noisy:~/simple-optimization-test$ ./fast horiz: 0.525076 vert: 15.049684
It’s not even close, really. Without any compiler optimizations, the horizontal method is about 15x as fast while with optimizations, it’s closer to 30x. It’s also clear (and possibly bewildering) that the compiler optimizations do help the horizontal but do not help the vertical.
So now that we know which is faster, the question moves to why.
Hypothesis 1: The stack-overflow hypothesis. Are we CPU bound?
Let me quote the “chosen” stack overflow answer:
So if you’re scanning vertically, the next index is calculated as:
index = row * numColumns + col;however, if you’re scanning horizontally then the next index is just as follows:
index = index++;A single addition is going to be fewer op codes for the CPU then a multiplication AND addition, and thus horizontal scanning is faster because of the architecture of computer memory.
Now, let’s remember that the horizontal version is between 15 and 30 times faster depending on our compiler optimization settings. Immediately, you should be highly skeptical of this answer. How could a single instruction, or two, make such a staggering difference? This is a fairly easy to test hypothesis, as well. If we look at the inner loop assembly of the unoptimized version of the code, we see the follow code snippets.
| Vertical_Add (unopt.) | Horiz_Add (unopt.) |
|---|---|
movl -4(%ebp), %eax
sall $12, %eax
addl -8(%ebp), %eax
movl %eax, -12(%ebp)
movl -12(%ebp), %eax
sall $2, %eax
movl %eax, %edx
addl 16(%ebp), %edx
movl -12(%ebp), %eax
sall $2, %eax
addl 8(%ebp), %eax
flds (%eax)
movl -12(%ebp), %eax
sall $2, %eax
addl 12(%ebp), %eax
flds (%eax)
faddp %st, %st(1)
fstps (%edx)
addl $1, -8(%ebp)
|
movl -4(%ebp), %eax
sall $12, %eax
addl -8(%ebp), %eax
movl %eax, -12(%ebp)
movl -12(%ebp), %eax
sall $2, %eax
movl %eax, %edx
addl 16(%ebp), %edx
movl -12(%ebp), %eax
sall $2, %eax
addl 8(%ebp), %eax
flds (%eax)
movl -12(%ebp), %eax
sall $2, %eax
addl 12(%ebp), %eax
flds (%eax)
faddp %st, %st(1)
fstps (%edx)
addl $1, -4(%ebp)
|
You do not need to be an assembly expert (or even a novice) to realize that these two inner loops are absolutely 100% identical. The only difference occurs outside the computational portion of the loop where the loop variables are incremented, and in which order.
So while it is true that a perfectly optimized horizontal add will have less instructions than a vertical add, that is most certainly not the end of the story.
Hypothesis #2: My hypothesis. Are we cache bound?
(note: other stack overflow respondents share my hypothesis, and they deserve credit, as well). Horizontal adds are better because they make better use of cache. Specifically, they make accesses to memory in sequential order. A horizontal add accesses memory in order from start to finish. A vertical add accesses memory in a strided fashion incrementing by large chunks through memory. This is not a cache-friendly behavior.
A quick cache primer
When you access a memory address, the computer loads it into cache, along with its neighbors. This is called a cache line. On my CPU, the cache line size is 64 bytes, or 16 floating point numbers. That means when you load up a particular address, 15 of his neighbors are coming along for the ride. Since we store our matrix in row order, that means when I ask for a particular element, the CPU will cache 15 neighboring elements to his left and right. These cache lines are 64-byte aligned, which means if you ask for 0×07, you are going to get 0×00 through 0x3F (these numbers are multiples of 64).
(note: My example has a few contrived elements to minimize complexity. I used valloc() in my test program, and I made sure my width was a multiple of 64. The net effect of these assumptions is that the beginning of each row is always the beginning of a cache line, and each row is always a clean multiple of the cache line size.)
A cache-eyed view of our algorithms
Horizontal. You will load up the first element and you get the 15 following elements (the cache line) “for free”. You then proceed to use those 15 in order, after the first. At this point, you request the next element (the 17th), and get the next cache line, which you then compute on, and so on. This fits quite nicely.
Vertical. You will load up the first element, like before, and get the 15 elements to his right “for free”. You promptly ignore those 15 elements, and proceed to jump down to the next row (say, 4096 elements away, if that is the width of the row). You will then load up the second element of the first column, including his cache line (the 15 elements to his right), compute only him, and proceed to the third element. And so on.
In the horizontal case, you load up all 16 elements and use them. In the vertical case, you are loading up 16 elements to compute only one of them. On average, each element is loaded ONCE in the horizontal case, and each element is loaded 16 times in the vertical case (computed once, ignored 15x). This means you are doing 16x as many memory to cache loads in the vertical case.
This hypothesis is immediately appealing because 16x is on the same order of magnitude of the timing differences we’ve seen. It also explains why compiler optimizations don’t help the vertical. If we are entirely cache bound, removing unnecessary instructions isn’t going to help.
Can we test this more fully? Well. Yes. Let’s use those extra guys we’ve loaded. Let’s try this bit of code:
void vert_add2(float *in1, float *in2, float *out)
{
int i,j;
for (j=0;j<WIDTH/16;j++)
for (i=0;i<HEIGHT;i++)
{
int index = i * WIDTH + j*16;
out[index]=in1[index]+in2[index];
out[index+1]=in1[index+1]+in2[index+1];
out[index+2]=in1[index+2]+in2[index+2];
out[index+3]=in1[index+3]+in2[index+3];
out[index+4]=in1[index+4]+in2[index+4];
out[index+5]=in1[index+5]+in2[index+5];
out[index+6]=in1[index+6]+in2[index+6];
out[index+7]=in1[index+7]+in2[index+7];
out[index+8]=in1[index+8]+in2[index+8];
out[index+9]=in1[index+9]+in2[index+9];
out[index+10]=in1[index+10]+in2[index+10];
out[index+11]=in1[index+11]+in2[index+11];
out[index+12]=in1[index+12]+in2[index+12];
out[index+13]=in1[index+13]+in2[index+13];
out[index+14]=in1[index+14]+in2[index+14];
out[index+15]=in1[index+15]+in2[index+15];
}
}
In this case, I am still traveling “vertically”, but by cache lines, not elements. I load up the first cache line of the first row, compute it. I then load up the first cache line of the second row, compute it. And so on. If this hypothesis is true, this should be much faster than the normal vertical addition.
louis@noisy:~/simple-optimization-test$ ./fast horiz: 0.581028 vert1: 15.797579 vert_add2: 1.531041
With this simple change, we’ve made the vertical code 10x faster. This is strong evidence that our code is cache, not CPU bound.
Follow-up #1: Can the vertical code be as fast as the horizontal?
Probably not. As the stack overflow patron rightly pointed out, in the end, vertical code will need a marginal amount of extra code. I suspect the effect of these extra instructions will turn out to be extremely trivial in reality.
The real reason it’s difficult to ever make the vertical code as fast as the horizontal code is because CPUs have built-in hardware prefetchers. The behavior of modern hardware prefetchers is fairly complicated and their behavior, alone, could fill many posts of this size. The short version is that a CPU that loads one cache line after the next, in order, is very likely to have some hardware mechanism built-in that begins prefetching cache lines to minimize the wait time. Accessing one line after the next is an extremely common operation. For example, adjacent-line prefetching is a fairly common strategy.
It would be fairly difficult to make the CPU prefetch cache lines optimally in the vertical case. Likewise, the CPU is almost certainly prefetching optimally in the horizontal case, without any help at all.
Follow-up #2: Can’t this horizontal code be made even faster?
Many, many, many people (including people on stack overflow) will see my code:
void horiz_add(float *in1, float *in2, float *out)
{
int i,j;
for (i=0;i<HEIGHT;i++)
for (j=0;j<WIDTH;j++)
{
int index = i * WIDTH + j;
out[index]=in1[index]+in2[index];
}
}
… and feel the uncontrollable urge to “optimize” this…
void fast(float *in1, float *in2, float*out)
{
int i;
for (i=0;i<HEIGHT*WIDTH;i++)
out[i]=in1[i]+in2[i];
}
This is obviously faster. Right? Certainly, without compiler optimizations, it will be. But is the compiler smart enough to do away with the unnecessary calculations and do what is “right”?
louis@noisy:~/optomization$ ./slow horiz: 1.669533 fast: 1.273352 louis@noisy:~/optomization$ ./fast horiz: 0.530805 fast: 0.531471
Laugh. I ran this about 10 times and the “fast” version won sometimes and lost others. On average and unscientifically, I’d say it might be a little faster. At best, this is an extremely marginal improvement.
In the grand scheme of things, this optimization is completely and totally unnecessary. It provides, at best, a barely measurable improvement. And if this bit of code turns out to be a bottleneck…
Follow-up #3: No seriously, can this be made even faster?
Yes, it can. That is left, however, as an exercise for the reader.
twitter
July 6th, 2009 at 9:14 am
“us programmers” should be “we programmers”
July 6th, 2009 at 9:18 am
This reminds me of:
http://blogs.msdn.com/oldnewthing/archive/2005/08/05/448073.aspx
July 6th, 2009 at 11:10 am
Very detailed explanation. Thanks.
July 7th, 2009 at 12:15 am
Very good explanation, thank you
I have a weird output on my MacBook Pro, though:
$> ./fast
horiz: -0.352413
vert1: 0.166096
vert2: 127.975822
fast: -0.330315
Also, I guess this might be optimized by some compiler option, but by default, isn’t this:
for (i = 0; i < HEIGHT * WIDTH; i++)
slower that this:
int max = HEIGHT * WIDTH;
for (i = 0; i < max; i++)
?
I mean, in the logic, computing the fixed maximum amount of loops to go through is not correct, but I am not sure how it is optimized in the end
July 7th, 2009 at 6:19 am
What about interlacing the arrays?
July 8th, 2009 at 1:36 am
Naixn: The weird output is fixed in my github fork: http://github.com/Kobold/simple-optimization-test/tree/master
Louis: Check your fork queue.
July 8th, 2009 at 8:59 am
I merged those changes. Thanks guys.
July 14th, 2009 at 3:29 pm
One big takeaway from all this is that you need a good profiler.
There are so many levels of optimization (compiler, OS, CPU, other hardware) that it’s very hard to get enough intuition to predict performance just from the code. To get the best performance, you need to experiment with different methods, and choose the one that performs the best. But that takes a lot of time, and to justify that time investment you need to know that it’s the bottle neck of your program. So, you need a good profiler.
The other thing I wanted to mention is that when I develop on windows I miss the OSX performance tools. Shark is great, but so are the libraries, like the Accelerate framework. This example would be one function call, and the framework uses SIMD and threading as appropriate. Amazing to work with.
February 24th, 2010 at 10:14 am
You should also look into the effect of page faulting on the problem.
If your arrays are larger than your working set, that some of your memory can be faulted in/out. The resulting disk io can add significant costs to your result.
February 24th, 2010 at 12:37 pm
Your conclusion also depends on the language you’re working in if I recall my High Performance Computing book correctly. I don’t see it specified, but I believe you’re using C or C++. Fortran, I think, would be faster if you go vertical instead of horizontal.
February 24th, 2010 at 1:45 pm
Looking at the ASM output for gcc -O3 -msse2 I noticed that horiz_add() and fast() use the SIMD 4-float vector instructions already (movaps, addps and friends.) And there’s some funky business going on as the fast() and a manual SSE version written with SSE intrinsics are slower than horiz_add() on my Pentium M.