You can’t beat a good compiler… right?
Right! Sort of.
It is extremely well worn conventional wisdom that for your “average” programmer, it’s pretty much impossible to beat a good optimizing compiler at writing fast code. For those who skip the “understanding” part, and jump straight into the “application” part, this rule of thumb tends to be grossly misunderstood, and misused. For these people, compilers are so smart, and so good, that micro-optimization is basically impossible (uh, no), or too hard (no), or even that they make assembly knowledge unnecessary or obsolete (not even close). This tends to lead to a mentality that optimization can be left to the compiler and often results in people warning others to not bother with micro-optimizations at all.
Optimization, today, is largely about writing some naive code, looking at the assembly output, figuring out what you want to improve on, and then getting (or, often, tricking) the compiler into doing what you want. After several iterations of this, you end up with the code you want, and you gain all the benefits of having a compiler in the loop (portability, future-proofing, and so on). This process, though, obviously involves a great deal of low-level knowledge: assembly, cpu architecture, micro-optimizations that are likely to work, etc.
As an aside, I have a question: how exactly does one go about learning how to go beyond standard C into hardcore optimization? For me, it’s been a long road of trial & error, and random discovery. It seems to me that there is probably good resources out there for getting started on optimizing code (let’s say in gcc). Know any?
All the code I used can be found here: http://github.com/lbrandy/simple-optimization-test (see signal_blur.c). I ran all of these tests on Linux using gcc 4.3.3 and -O3. UPDATED: I changed the code to use c99’s restrict keyword, instead of gcc’s __restrict__
Getting yourself into trouble
Here is a naive (and contrived) one dimensional filter routine.
#define LENGTH 5000
// a bad impersonation of a gaussian filter
const float filter[] = {0.01f, 0.2f, 0.58f, 0.2f, 0.01f};
void naive(float* in, float* out)
{
int i,j;
for (i=0;i<LENGTH-4;i++)
{
out[i]=0.0f;
for (j = 0;j<5;j++)
out[i] += filter[j] * in[i+j];
}
}
This is an interesting example because it contains a fatal flaw that might not be obvious. Take a second to understand what this code is doing. It’s fairly simple, but critical to the rest of the discussion. Let’s look at the assembly output of this, built using -O3.
08048430: 8048430: 55 push %ebp 8048431: 31 c0 xor %eax,%eax 8048433: 89 e5 mov %esp,%ebp 8048435: 8b 4d 08 mov 0x8(%ebp),%ecx 8048438: 8b 55 0c mov 0xc(%ebp),%edx 804843b: 90 nop 804843c: 8d 74 26 00 lea 0x0(%esi),%esi 8048440: d9 ee fldz 8048442: d9 14 82 fsts (%edx,%eax,4) 8048445: d9 05 10 89 04 08 flds 0x8048910 804844b: d9 04 81 flds (%ecx,%eax,4) 804844e: d8 c9 fmul %st(1),%st 8048450: de c2 faddp %st,%st(2) 8048452: d9 c9 fxch %st(1) 8048454: d9 14 82 fsts (%edx,%eax,4) 8048457: d9 05 14 89 04 08 flds 0x8048914 804845d: d9 44 81 04 flds 0x4(%ecx,%eax,4) 8048461: d8 c9 fmul %st(1),%st 8048463: de c2 faddp %st,%st(2) 8048465: d9 c9 fxch %st(1) 8048467: d9 14 82 fsts (%edx,%eax,4) 804846a: d9 05 18 89 04 08 flds 0x8048918 8048470: d8 4c 81 08 fmuls 0x8(%ecx,%eax,4) 8048474: de c1 faddp %st,%st(1) 8048476: d9 14 82 fsts (%edx,%eax,4) 8048479: d9 44 81 0c flds 0xc(%ecx,%eax,4) 804847d: de ca fmulp %st,%st(2) 804847f: de c1 faddp %st,%st(1) 8048481: d9 14 82 fsts (%edx,%eax,4) 8048484: d9 c9 fxch %st(1) 8048486: d8 4c 81 10 fmuls 0x10(%ecx,%eax,4) 804848a: de c1 faddp %st,%st(1) 804848c: d9 1c 82 fstps (%edx,%eax,4) 804848f: 83 c0 01 add $0x1,%eax 8048492: 3d 84 13 00 00 cmp $0x1384,%eax 8048497: 75 a7 jne 8048440 8048499: 5d pop %ebp 804849a: c3 ret 804849b: 90 nop 804849c: 8d 74 26 00 lea 0x0(%esi),%esi
From the assembly, we see that the compiler has done quite a bit (if you don’t know assembly, don’t worry, I’ll do my best to explain). First we see all of the address calculation involved in loading the filter coefficients has vanished (it is loading each one directly, e.g. flds 0x8048918). Second, notice that the (fixed sized) inner loop has been completely unrolled, resulting in each of the 5 multiplications and additions per iteration. So far so good.
There is, however, a very alarming surprise is this code. That is the quantity of store instructions (also loads). After every iteration of our inner loop (each filter coefficient), the result is stored. You can see 5 different store instructions (fstps, fsts) per iteration of this loop. Why? Let’s have a look at the code again:
void naive(float* in, float* out)
{
int i,j;
for (i=0;i<LENGTH-4;i++)
{
out[i]=0.0f;
for (j = 0;j<5;j++)
out[i] += filter[j] * in[i+j];
}
}
To the inexperienced, it might be bewildering why the optimizing compiler would generate 5 store instructions for out[i]= in the inner loop. Why wouldn’t it just accumulate the answer in a register, and then store only the final result? The answer is: aliasing. The problem here is that compiler cannot assume that the pointers *in and *out are disjoint. It must store the result into out[i] each iteration because out[i] may be in[i+j] in the next iteration of the inner loop. With a bit of thought, it becomes clear how this code requires these stores to be correct in cases like *out pointing one float ahead of *in.
Another hard-learned tidbit: wasteful store instructions are terrible because stores can be incredibly expensive (far worse than an extra add or multiply).
Fixing it with restricted pointers
There are several ways to fix this problem, but in the spirit of teaching the art of optimization, I’ll go with the use of the __restrict__ qualifier (this is a gcc directive, but most compilers have some support for restricted pointers). [note from comments: restricted pointers are part of the C99 standard now using the keyword restrict. ]. The only change I made is to add __restrict__ to the function declaration:
void naive_restrict(float *__restrict__ in, float *__restrict__ out)
{
int i,j;
for (i=0;i<LENGTH-4;i++)
{
out[i]=0.0f;
for (j = 0;j<5;j++)
out[i] += filter[j] * in[i+j];
}
}
This directive tells the compiler that you, the programmer, promise that *in and *out are disjoint, and no aliasing will occur. If you break your promise, don’t expect your code to be correct. Here is the assembly of that output:
080484a0: 80484a0: 55 push %ebp 80484a1: 31 c0 xor %eax,%eax 80484a3: 89 e5 mov %esp,%ebp 80484a5: 8b 55 08 mov 0x8(%ebp),%edx 80484a8: 8b 4d 0c mov 0xc(%ebp),%ecx 80484ab: 90 nop 80484ac: 8d 74 26 00 lea 0x0(%esi),%esi 80484b0: d9 05 10 89 04 08 flds 0x8048910 80484b6: d9 04 82 flds (%edx,%eax,4) 80484b9: d8 c9 fmul %st(1),%st 80484bb: d8 05 0c 89 04 08 fadds 0x804890c 80484c1: d9 05 14 89 04 08 flds 0x8048914 80484c7: d9 44 82 04 flds 0x4(%edx,%eax,4) 80484cb: d8 c9 fmul %st(1),%st 80484cd: de c2 faddp %st,%st(2) 80484cf: d9 05 18 89 04 08 flds 0x8048918 80484d5: d8 4c 82 08 fmuls 0x8(%edx,%eax,4) 80484d9: de c2 faddp %st,%st(2) 80484db: d8 4c 82 0c fmuls 0xc(%edx,%eax,4) 80484df: de c1 faddp %st,%st(1) 80484e1: d9 c9 fxch %st(1) 80484e3: d8 4c 82 10 fmuls 0x10(%edx,%eax,4) 80484e7: de c1 faddp %st,%st(1) 80484e9: d9 1c 81 fstps (%ecx,%eax,4) 80484ec: 83 c0 01 add $0x1,%eax 80484ef: 3d 84 13 00 00 cmp $0x1384,%eax 80484f4: 75 ba jne 80484b0 80484f6: 5d pop %ebp 80484f7: c3 ret 80484f8: 90 nop 80484f9: 8d b4 26 00 00 00 00 lea 0x0(%esi),%esi
Now, you do not have to be an assembly expert to see how much more streamlined this code is. It consists almost exclusively of loads, multiplies, and adds with a final store at the end. This does exactly what we’d originally hoped. It realizes it can keep a temporary running sum and only store once at the end. We should also note that if you violate the restricted pointer promise, this code will not be correct!
It shouldn’t surprise you to see how much faster this version is, either:
naive: 0.672957 naive_restrict: 0.160432
It’s almost 5 times faster than the non-restricted version.
Going Forward
Though I haven’t actually done it, my overwhelming suspicion is that this code still has a ways to go in terms of speed. The next step would be the use of SSE instructions. Maybe next post.
The take-away from examples like this should be that optimization cannot be “left” to the optimizing compiler. You have to know what you are doing to get a compiler to make fast code. And if you really know what you are doing (which means knowing assembly and the underlying architecture quite well), you can usually get a modern optimizing compiler to do exactly what you want. You have to be in the loop. Truly optimized C code ends up looking nothing like C at all.
With careful hand holding, you can get a compiler to make fast code. In that case, it can become difficult to beat a compiler with hand-optimized code. But that is not because the compiler is so good, but because you are so good at getting the compiler to make the code you want.
June 7th, 2010 at 8:16 am
They actually tried to introduce a keyword into standard C to deal with things like pointer aliasing. But it never made it into the standard. See http://www.lysator.liu.se/c/dmr-on-noalias.html for Dennis Ritchie’s thoughts on the proposed “noalias” keyword.
June 7th, 2010 at 9:53 am
The “restrict” keyword is in C99; #define-ing it away for old compilers makes sense, but there’s no need to use the (obviously unportable) “__restrict__” extension.
On the other hand, “naive” unrolling would work, too: out[i] = filter[0] * in[i] + … + filter[4] * in[i + 4] should generate good code (not tested, sorry!)
Not that it matters to the present discussion, but your “naive implementation” has an off-by-one error, accessing in[LENGTH].
But this is just nitpicking: nice article!
June 7th, 2010 at 9:56 am
restrict is a valid C pointer qualifier and works.
paragraph 6.7.3.1 in the C99 Standard.
It is not needed to use the __restrict__ , just use restrict.
June 7th, 2010 at 10:14 am
This totally, completely, and utterly misses the point of compiler optimization, and optimization in general.
First, yes, per current state-of-the-art, a master-level human assembler can assist the compiler to produce better, safer, faster code in these cases. However, who cares? Seriously, you are talking about, what, 0.6 second. It’s not worth the effort! Oh, did you even realize that you are burning time on 10 lines of code? Seriously, a toy program has thousands of lines of C code already. Good luck trying to optimize every single line. And it’s not like you can get out 0.6 sec every time. Maybe 0.01 next time. Remember, with a 3GHz CPU, 1% of a second is a LONG time.
Oh, and did anyone tell you that mucking with non-standard stuffs are bound to disaster? And micro optimization may not work with future things? For example, multithreaded. Try really hard to minimize the IO of a single thread, but if the OS swaps out your process, destroy your cache, than the cost of swapping is so huge that your little optimization is not even worth mentioning.
Meanwhile, why don’t you invest time in something with higher return. For example, data structures that minimize cache hits and can be used over and over. Or, algorithms that are magnitudes faster than your current ones? Or, a better programming paradigm? Or, JIT with selected compilation? Structural stuffs, not little tiny details.
For some reasons, this reminds me of Vietnam war. America had all details right. They had bigger guns, bigger bombs, bigger airplanes, bigger budget; they won just about anything that they mucked into. And they got kicked out. Micro optimization, you think?
June 7th, 2010 at 10:19 am
It’s not that optimizations like these are too hard, it’s that for the vast majority of applications they’re completely unnecessary. There are very few problem domains where dropping down to assembly is worth the time it takes to optimize, let alone the cost of readability and maintainability. If you’re one of those few engineers working on a kernel or some other piece of performance-critical system code, then more power to you. But suggesting that knowledge of assembly is necessary for the truly average programmer is ridiculous.
That said, props on restrict – I’d never seen that before.
June 7th, 2010 at 10:19 am
@magice, thanks for the predictable response.
Nowhere have I argued that you should be optimizing “all” of your code, or optimizing indiscriminately (or prematurely), or without exhausting other, better options. Nowhere have I said this type of optimization is preferable to cache optimizations, or other high level concerns. Knowing -when- to bust out micro-optimizations is, in and of itself, a skill worth discussing.
However, if you are actually arguing that it is never correct to spend time doing micro-optimization, you are wrong. It’s just that simple.
June 7th, 2010 at 10:27 am
As pointed out to me at http://news.ycombinator.com/item?id=1410755, the off-by-one error is nonsense. Sorry!
June 7th, 2010 at 10:53 am
Very nice article, thanks.
Looking forward to the next post.
June 7th, 2010 at 11:11 am
0.672957 / 0.160432 is about 4.2 — quite a bit short of “almost 5″. I nitpick here because numbers really matter when optimizing. Too many a time I’ve made similar leaps thinking that I had previously obtained “nearly 5 times speedup” and looked down upon something that was 4.5 times faster. Only to find out that the “nearly 5″ was actually slower because I wasn’t being careful.
June 7th, 2010 at 11:56 am
Lots of resources on optimizing here : http://www.agner.org/optimize/
June 7th, 2010 at 11:56 am
Nice article! Out of curiosity, I tried adding -funroll-loops, which made the naive function run faster than the naive_restrict function.
Before adding -funroll-loops:
naive: 0.461494
naive_restrict: 0.143920
sum_variable: 0.105645
After adding -funroll-loops:
naive: 0.107970
naive_restrict: 0.125535
sum_variable: 0.096111
Looking forward to an SSE post.
June 7th, 2010 at 12:21 pm
Funny, I couldn’t recreate your results with gcc 4.3.3 on my machine. The O3 code I got used SSE, fully unrolled both loops, and did the elimination of the unnecessary stores. It also inlined naive() into main().
I imagine the problem is that the assembly code you looked at was for just the naive() function itself. If you call that function from somewhere, then GCC will use alias analysis to detect that the pointers don’t alias, and probably inline, obviating the need for the restrict keyword. Granted, GCC doesn’t have the best alias analysis in the world, but I’d be shocked if this example is one that breaks it.
June 7th, 2010 at 12:38 pm
You can get the same output (if you really don’t have overlapping pointers) with a simple temporal variable
float temp= 0.0f;
for (j = 0;j<5;j++)
temp += filter[j] * in[i+j];
out[i] = temp;
you clearly tell the compiler what to do, with this anti-optimization (I declared more variables than needed)
June 7th, 2010 at 12:40 pm
When you look at the cases where hand optimization can beat the compiler it normally comes down to the human knowing things the compiler can’t know. In this case you can inform the compiler but you can’t always.
June 7th, 2010 at 12:49 pm
@nate,
Did you run the code as I have it on github, or just copy/paste this into your own file? I ask because I just double checked everything, and am not seeing the same things as you. It would seem, to me, that the function pointers as I have it setup keep gcc from inlining it. I’ve been running gcc 4.3.3 on a relatively clean ubuntu 9.04 box.
Is it generating vectorized SSE, or just using the SSE stack for scalar fp operations? I can’t really get gcc to autovectorize this code at all, by playing with flags.
I can, though, get decent speedups by using ‘-O3 -msse3 -mfpmath=sse’ which will force it to generate sse scalar instructions instead of fpu instructions.
June 7th, 2010 at 12:57 pm
@magice
To thee I quote my professor from back in my undergraduate days:
“Hardware exists to provide scalability and high availability, not to cater to poorly optimized code.”
That being said, what @louis said is a perfectly acceptable compromise. Sure application developers will always want to be lazy, so we shall cut them some slack and allow them to optimize just the critical segments of code.
Remember, just as you as an application developer want to be lazy and finish your work quickly so you can go out and play, the people using that shitty-end-result of an application also want to do the same.
P.S. when I say “you” I mean the general masses of application developers who have no sense of pride in their work.
June 7th, 2010 at 1:09 pm
The reason so many program runs slow and hog memory is because ignorant people quit worrying about optimizing code. They just assume it is a non-issue with a 3GHz processor, but they are wrong.
The idiots don’t get it. LENGTH of 5000 might be trivial, but what if you had to do calculations on LENGTH of 100 MILLION or many different number tables, then savings would add up to be very large amounts of time.
June 7th, 2010 at 1:32 pm
Great article! It’s not often that you see someone brave enough to wander into the dark depths of assembly programming. To those of you who respond with frustration and anger: I can only assume that you have a hard time reading/writing assembly.
Saying that micro-optimizations aren’t useful is just silly. Most of you should be aware that some functions are called repeatedly, and those “2 seconds saved” per run may amount to a lot of time.
Also, measuring the need for optimization based on the number of lines of code, indicates a terrible lack of understanding.
June 7th, 2010 at 2:03 pm
@Enlightenment: With LENGTH=10^8, you would be memory bandwidth limited.
June 7th, 2010 at 2:05 pm
Being someone who spends most of my working day optimizing code, I have to agree that the idea that the compiler can optimize better than a human is a limited concept.
In the author’s example, we have a perfect case where the compiler doesn’t have enough information to perform specific optimizations.
In other cases, the code is just too complex for the compiler to properly deconstruct and optimize. This was a big problem with some compiler when dealing with C++ templates.
That being said, I still find that most optimizations are higher level optimizations. For example, using proper containers or data structures. Repeating work is usually a big cause of problems.
Good article.
June 7th, 2010 at 2:05 pm
I was kinda suspicious of your results so I tried your code on cygwin using actual timed tests. I bumped the LENGTH up to 100*1000*1000 to give it something to chew on, and checked the assembler to make sure it replicated your results (which it did). Using time ./a.exe. The profile time fluctuates from 0.577s to 0.655s on the fast version and from 0.545s to 0.670s on the slow version. Frankly, the error seems higher than the effect of the optimization.
I may be wrong, but I suspect the memory timings are all important here. The most expensive operation is loading a chunk of data from main memory into the cache. The additional stores the you managed to optimize out only use the local cache so they are very fast and are effectively hidden by the load from main memory. The fluctuation in timings is caused by the cache getting flushed by other apps in system.
Feel free to criticize my results but I suspect your optimization would only work on older hardware and embedded systems that aren’t so dependent on memory.
I would be interested to see your own results if you tried to time the optimizations. I personally believe that your article is fundamentally flawed because you didn’t profile your results.
June 7th, 2010 at 2:15 pm
Thanks for the good article on basic optimization.
@magice – Listen to the responses on this article – you could learn a thing or two.
As a video game programmer back in the early 90’s, I used to manually unroll inner loops and convert them to assembly by hand. Done correctly, it can change the video performance from 5-7fps to 30+ fps.
The entire program does not need to be optimized, just the slowest, most-run code. That’s what a profiler is for.
-PMF
June 7th, 2010 at 2:16 pm
I apologize – I didn’t see the timings at the end of the article.
June 7th, 2010 at 2:20 pm
@Fenris: That isn’t really an “anti-optimization” per se, as that variable would be put on the stack, which doesn’t really add extra memory (since it doesn’t call any functions within). Unless this call was very deep in the stack, I doubt it would affect anything.
June 7th, 2010 at 2:42 pm
Thanks for the article.
I come from an embedded background where I have found bad C compilers producing painfully bad code.
Sheeple are repeatedly quoting ‘Gotos Bad’, ‘Assembly Optimization Bad’, wondering what the next mantra will be…
June 7th, 2010 at 3:38 pm
A good introductory-level textbook that sets the context for this topic is Computer Systems: A Programmer’s Perspective, Bryant & O’Halloran. Uses X86, Linux (with appropriate Windows asides), and GCC, covers 32-bit and 64-bit architectures. It doesn’t dive into detail on hard-core optimizations, but it does address memory aliasing and other compiler optimization blockers, memory caches, etc. Written for undergraduate students who are new to both C and X86 assembly.
http://csapp.cs.cmu.edu/
June 7th, 2010 at 3:56 pm
I see nothing about you trying to avoid stalling the instruction pipeline in the CPU. The days of non-pipelined, CISC chips are looong gone, my friend.
June 7th, 2010 at 4:25 pm
I understand the point of your article, but it’s just plain silly not to mention that you will on average get more optimization by picking the most efficient algorithm and profiling than you ever will with premature optimization. Get it to work -right- first and then get it to work -fast- if profiling tells you it’s necessary. You should always mention this as beginners will come across your article and think “well now I have to back and optimize everything”.
June 7th, 2010 at 4:33 pm
@nate clark: note that such analysis doesn’t work if it’s not in the same source file; even with GCC’s new whole-program optimization, it won’t work if the function is in a shared library.
June 7th, 2010 at 5:17 pm
@Magice: oh really? 0.6 seconds?
The largest performance boost I’ve achieved solely from microoptimizations, from trying to “beat the compiler” is… over 800%. The application as a whole ran 8 times as fast when we were finished optimizing.
I think that’s worthwhile. It certainly was in that specific case.
Your assumption that it is not worth it to consider performance on a lower level than choice of algorithm is simply foolish.
But bad programmers will always try to make a case for why being a bad programmer, why not understanding the hardware you’re running on, or the code you’re executing, is a good thing.
Guess what. At the end of the day, they’re still bad programmers. They’re missing some vital tools in their toolbox.
A good programmer certainly knows that messing around with low-level optimizations isn’t the first thing you should do, but he *also* has the option available to him. He has the skills required to understand his code, and the performance of that code, because *sometimes* that is essential in getting good perfomance.
June 7th, 2010 at 5:42 pm
http://www.amazon.com/Computer-Architecture-Quantitative-Approach-4th/dp/0123704901 was an invaluable help in my case. It’s not about optimization, but it’s an amazing resource for understanding the CPU. And when you do that, most optimization tricks become pretty obvious.
Also look up the documentation for your target processor (both AMD and Intel have these available in PDF form on their websites)
Knowing which instructions are pipelined and which ones aren’t, and their relative latencies and all that good stuff helps too. Reading the assembly output from the compiler isn’t enough. You also have to know how expensive the instructions it chose are.
Humans certainly can beat a good compiler. And you don’t have to be a “master-level human assembler” to do it. You just have to think. Choose your battles. The compiler is unbeatable on its home turf. don’t try to optimize on the things it does well.
But in many other cases, a human programmer knows things about the code that the compiler doesn’t. And that can make a huge difference. Using “restrict” as shown in this blog post is one example. The compiler can’t safely do this because it doesn’t know if it’ll break the program. The human programmer *does* know that this assumption is safe to make.
June 7th, 2010 at 5:55 pm
this whole article could be written in five lines.
June 7th, 2010 at 6:08 pm
Rather than blame developers for failing to compensate for problematic language features, why not use a language that doesn’t have those problems in the first place (excluding kernel devs, etc)? The simple fact is that pointers eliminate many kinds of automatic optimization. Fran Allen noted this decades ago. I think it’s time to stop apologizing for C.
June 7th, 2010 at 6:26 pm
Use -fpmath=sse to use the SSE unit instead of x87 and -ftree-vectorize (might already be part of -O3) to generate vectorized SSE code.
June 7th, 2010 at 11:30 pm
yeah… go ahead … optimize…
I’m not saying that optimizing is never useful, quite the contrary. However from experience Optimizing has almost always (note.. almost) been harmful. It introduces bugs that are hard to fix because the code is harder to read, it makes the system much more fragile and looses a lot in resilience over external changes. Especially when the optimization is done through compiler tricks and exploiting obscure corner cases.
Optimization such as this should be as isolated as possible, writing the code directly in assembly language would have avoided some of the pitfalls. Burying the code in a library that is properly unit tested and provide a proper interface to it so that the optimization does not spread beyond it`s original intended use would be even better.
However, was it really necessary ? Sure we can come up with a thousand cases where optimization yielded sizeable benefits. But there are far more cases where optimization was done prematurely. No amount of optimization on a bubble sort will beat using a more efficient algorithm. But worst of it is the hidden cost which, with time becomes not so hidden.
Many times I have seen corporations with systems many millions of lines of codes that are so hard to maintain because of the systematic optimization done. In fact I have seen optimization techniques that became laws, your code HAD to do it this way. The end result was that the systematic application of this without regard of the actual impact actually made the code run MUCH slower, I’m not speaking 0.6 seconds or 8x slower I’m speaking orders of magnitude slower.
To make matters worst the code is not so convoluted and over optimized that it is nearly unmanageable. A simple feature that should take a few weeks to implement now takes over a year because of the fragility these optimization brought to the code base. Fix something here and five other things break…
YES optimization is a good thing when used properly, so is goto, firearms, insecticide and land mines. However it is because they are so easy to mis-use that we recommend that they are not be used at all. Most (99%) people are not smart enough to use them properly and in doing it make life impossible for the rest of us.
If you are to optimize then do it proper, roll up your sleeves and get the assembly code directly. but please by the love of good do not do it by tricking the compiler. If it is not smart enough for you then remove it for the bits you think you can do better and have fun but do not try to inject smart behaviour in stupid things it will come back to haunt you.
June 8th, 2010 at 2:05 am
For compilers without restrict support, I believe this would work:
void naive_tempvar(float *in, float *out)
{
int i,j;
for (i=0;i<LENGTH-4;i++)
{
float sum = 0.0f;
for (j = 0;j<5;j++)
sum += filter[j] * in[i+j];
out[i] = sum;
}
}
By fixing the location of out[i] as a temporary variable, the aliasing problem is removed, and the compiler should optimize that code into near-identical assembly.
Please correct me if I'm wrong.
June 8th, 2010 at 2:30 am
@Louis: Good article.
@Drifter: Testing such optimizations stand-alone won’t give you valid numbers. At least with N == 5000. The loading of the exe alone takes too much time.
@No-hand-optimization folks:
Understanding where and when to optimize is IMHO essential for a good programmer. And since the article does not state directly optimizing for a 3GHz-PC you really should consider the vast number of programmers working on embedded systems, which are often restricted in speed _and_ memory.
Sure: Writing assembly for todays IA32/IA64 cpus is overkill. One just cannot write code which is optimal for a Pentium 4 and a i7.
But knowing how to help the compiler to do the best job for you is the trick.
And BTW: Ever wondered why Thunderbird takes longer to start on a 3GHz machine than Netscape on a 200MHz computer ?
I suggest, every application developer must be forced to program on 500MHz computers, it’ll help the users
June 8th, 2010 at 2:31 am
Nice article, it will be interesting to see your SSE-version.
But, there’s always one, to truly optimize your code you should leave high-level code and rewrite the algorithm in assembly (but this you already knew). Trying to squeeze some extra oompf out of the code generated gives, most of the times, just a few percent — occasionally more. The drawback are that you tends to get bugged down in the generated code instead of seeing the whole algorithm and translating that.
That said, it’s not a total waste of time scurrying through the output from the compiler — when hunting for speed bugs, as your example clearly shows.
June 8th, 2010 at 2:36 am
@Graycode : Tested with PowerPC-gcc -O3: The stores are gone, but the restricted version is better unrolled using a lot multiply-add instructions. But compiling with size optimization your code gives the better result (size sometimes matters).
June 8th, 2010 at 2:50 am
Louis,
an interesting and relevant article for those involved in optimizations.
A comment to Magice: my bread and butter is real-time image processing. I can be requested to handle, say, 4 Mpixels images 60 times per second. So my time scale is the millisecond. In such a context, optimization really matters
Unfortunately for us, the behavior of modern processors becomes less and less predictable and repeatable. This slowly turns the art of programming to black magic…
June 8th, 2010 at 3:06 am
Hi,
Interesting article but missing a warning, because on real cases (I mean professional work), the use of “restrict” will certainly lead to a nice big crash
when a new developper arrives and changes the parameters, without knowing that *in and *out should not be overlapping (or do you really expect him to check all the code before changing 3 lines, or knowing what the “restrict” keyword means ?)
In fact I’ll even bet that the code writer himself will eventually break the code some months later because he would have forget about this optimization…
In my opinion the temporary variable solution mentionned by Graycode is far better because it does not introduce unnecessary risks and is finally much more clear on what the processor is really doing.
June 8th, 2010 at 3:16 am
Read a lot of comments from people who go at great depths to say: just leave it up to the compiler. Bad idea. I don’t care if you don’t check the optimization into source control; stick around at work for another 1/2 hour and optimize a few of your methods.
The reasoning is simple: as a programmer you should not trust a black box. You need to understand how the tooling you are using works; where the nuts and bolts are, how everything fits together all the way down to the metal.
Doing anything less is giving the rest of us a bad name. These people are the type who don’t know when to use an enumerator and when to use an array. These are the type of people who think that the year 2000 will never happen.
June 8th, 2010 at 4:18 am
Choosing algorithms and data structures wisely to avoid combinatorial explosion is the very best optimization of human time and safety. Micro-optimization is non-sense.
June 8th, 2010 at 4:25 am
A nice article!
I suspect the source of contention between those advocating and those not preferring micro-optimisation is due to the type of applications being developed. I do a lot of high performance computing and can assure you when you have to run heavy computation with long loops over multi-million element vectors and your application might be running hours on end, micro-optimisation can definitely be worth the effort. This of course means you would first thoroughly profile the application and have a good understanding of the bottlenecks before setting out to optimise.
June 8th, 2010 at 4:26 am
The example given in this article *isn’t* micro-optimisation. Micro-optimisation is doing things like changing
q = x[a + b] * y[a + b];intotemp = a + b; q = x[temp] + y[temp];, or changing * 2 into << 1.Adding appropriate "restrict" qualifiers isn't any kind of overengineering, either; as you've noted, it's something that allows the compiler to optimise better, without affecting the readability or maintainability of the code. In my book, correct use of the "restrict" qualifier is just as valuable and praiseworthy as correct use of the "const" qualifier.
June 8th, 2010 at 5:32 am
I suggest that you just have a poor compiler. Visual Studio 2010 has naive running on a LENGTH array in 14.1754 nanoseconds. Also, I really didn’t like how you didn’t give example inputs or other use cases, or just your general setup.
June 8th, 2010 at 8:52 am
Good article, haven’t played with C/asm optimisation for a while. I have to do a lot of optimisation in my job, in C#, and most of my stuff is limited to algorithms. The problem is in maintenance. If I can’t get the monkeys to understand what I’ve done – it never gets used again. The problem is we all have to deal with marginal coders, and we can’t just blow them off – because they ARE the majority.
But it was good to read about some real optimisation again!
June 8th, 2010 at 9:49 am
Some pretty unfair comments in what predictably turned into a flame war (be it a fairly civil one so far). The fast majority of slow response times in all applications today are from one of 3 factors 1) Blocking on I/O inefficiently (especially networked) 2) Allocating memory inefficiently (especially string construction) 3) Context/Thread switching (inefficiently?)
So my vote would be to teach all of the app developers in the world how to optimize those 3 things before we get into micro optimization religious wars. Also, when app developers don’t micro optimize their code, it’s usually not because they don’t take pride in their work, it’s more often because they have to balance features and quality against performance gains.
June 8th, 2010 at 9:50 am
And by fast majority I meant vast majority. Limited editing abilities….
June 8th, 2010 at 10:43 am
OK, the point of adding the restrict keyword is to get it to accumulate results in a temporary variable in the loop.
Why don’t you just do that?
void naive(float* in, float* out)
{
int i,j;
for (i=0;i<LENGTH-4;i++)
{
float sum = 0.0f;
for (j = 0;j<5;j++)
sum += filter[j] * in[i+j];
out[i] = sum;
}
}
This not only lets the compiler know it doesn't need to worry about aliasing, it's more readable once the expressions (as they tend to) become more complex.
June 8th, 2010 at 5:55 pm
@jon and @jalf
- Pipelining may not be so much an issue on lower end / embedded processors, where you would be looking exactly to make the kind of optimisations demonstrated in this article.
Even the Atom, if I understand it correctly, will not particularly negatively affect your expected execution order/speed from it’s in-line pipeline.
June 8th, 2010 at 10:03 pm
I like the explanation…
btw i believe this was compiled with gcc and not a MS product
June 9th, 2010 at 11:29 am
I’d like to see the same code with fixed-precision integers instead of floats — what’s the speed difference?
I know FPUs are great these days, so maybe there’s no difference. Last time I was doing similar code (90’s) it was a huge no-no to do floating point because of the performance problems. I’m curious!
June 11th, 2010 at 11:18 am
@AlainT – Kudos. I was reading this entire comment thread waiting for someone to mention exactly what you said. In a theoretical world of “Hey let’s see if I can make my code go faster / make the smallest PE possible / win the obfuscated C contest”, this is pretty neat. In the real world of someone else maintaining your code or, as you said, returning to your own code in several years with a different compiler/build environment and rusty knowledge, techniques that require specific programmer knowledge and specific compiler behavior have the potential to create some really painful and hard to find bugs.
June 19th, 2010 at 5:29 pm
fdsdfsgfd
June 28th, 2010 at 1:13 am
very helpful,useful!Sometimes ,it’s miserable that your codes are just running slowly, but u dont know why
July 23rd, 2010 at 6:59 pm
did you ever consider looking at gcc instead of looking at your code (whether C or assembly)?
more specifically do you understand how flex and bison work? do you know what they are doing (in characteristic american “micro-detail”? that’s for you magice
gcc is a mess. it’s not clean code. and this makes it extremely difficult to fix things.
and writing compilers is not easy.
someone suggested reading a book on “good code”.
perhaps you should be reading a book on compilers?
programmers are not only lazy but, today, they are also ignorant. and therein lies the problem. history has been overlooked.
knowing assembly is not enough to understand the problem. you need to understand the steps taken to get from C to assembly, and how they are implemented.
and that’s a lot of work.
note also: there are alternatives to gcc. but you need to investigate the “ancient” past to find them.
August 19th, 2010 at 3:02 pm
Great post. Good job.