{ on programming and the internets }


by Louis Brandy

You can’t beat a good compiler… right?

Right! Sort of.

It is extremely well worn conventional wisdom that for your “average” programmer, it’s pretty much impossible to beat a good optimizing compiler at writing fast code. For those who skip the “understanding” part, and jump straight into the “application” part, this rule of thumb tends to be grossly misunderstood, and misused. For these people, compilers are so smart, and so good, that micro-optimization is basically impossible (uh, no), or too hard (no), or even that they make assembly knowledge unnecessary or obsolete (not even close). This tends to lead to a mentality that optimization can be left to the compiler and often results in people warning others to not bother with micro-optimizations at all.

Optimization, today, is largely about writing some naive code, looking at the assembly output, figuring out what you want to improve on, and then getting (or, often, tricking) the compiler into doing what you want. After several iterations of this, you end up with the code you want, and you gain all the benefits of having a compiler in the loop (portability, future-proofing, and so on). This process, though, obviously involves a great deal of low-level knowledge: assembly, cpu architecture, micro-optimizations that are likely to work, etc.

As an aside, I have a question: how exactly does one go about learning how to go beyond standard C into hardcore optimization? For me, it’s been a long road of trial & error, and random discovery. It seems to me that there is probably good resources out there for getting started on optimizing code (let’s say in gcc). Know any?

All the code I used can be found here: http://github.com/lbrandy/simple-optimization-test (see signal_blur.c). I ran all of these tests on Linux using gcc 4.3.3 and -O3. UPDATED: I changed the code to use c99’s restrict keyword, instead of gcc’s __restrict__

Getting yourself into trouble

Here is a naive (and contrived) one dimensional filter routine.

#define LENGTH 5000

// a bad impersonation of a gaussian filter
const float filter[] = {0.01f, 0.2f, 0.58f, 0.2f, 0.01f};

void naive(float* in, float* out)
{
  int i,j;
  for (i=0;i<LENGTH-4;i++)
  {
    out[i]=0.0f;
    for (j = 0;j<5;j++)
      out[i] += filter[j] * in[i+j];
  }
}

This is an interesting example because it contains a fatal flaw that might not be obvious. Take a second to understand what this code is doing. It’s fairly simple, but critical to the rest of the discussion. Let’s look at the assembly output of this, built using -O3.

08048430 :
 8048430:       55                      push   %ebp
 8048431:       31 c0                   xor    %eax,%eax
 8048433:       89 e5                   mov    %esp,%ebp
 8048435:       8b 4d 08                mov    0x8(%ebp),%ecx
 8048438:       8b 55 0c                mov    0xc(%ebp),%edx
 804843b:       90                      nop
 804843c:       8d 74 26 00             lea    0x0(%esi),%esi
 8048440:       d9 ee                   fldz
 8048442:       d9 14 82                fsts   (%edx,%eax,4)
 8048445:       d9 05 10 89 04 08       flds   0x8048910
 804844b:       d9 04 81                flds   (%ecx,%eax,4)
 804844e:       d8 c9                   fmul   %st(1),%st
 8048450:       de c2                   faddp  %st,%st(2)
 8048452:       d9 c9                   fxch   %st(1)
 8048454:       d9 14 82                fsts   (%edx,%eax,4)
 8048457:       d9 05 14 89 04 08       flds   0x8048914
 804845d:       d9 44 81 04             flds   0x4(%ecx,%eax,4)
 8048461:       d8 c9                   fmul   %st(1),%st
 8048463:       de c2                   faddp  %st,%st(2)
 8048465:       d9 c9                   fxch   %st(1)
 8048467:       d9 14 82                fsts   (%edx,%eax,4)
 804846a:       d9 05 18 89 04 08       flds   0x8048918
 8048470:       d8 4c 81 08             fmuls  0x8(%ecx,%eax,4)
 8048474:       de c1                   faddp  %st,%st(1)
 8048476:       d9 14 82                fsts   (%edx,%eax,4)
 8048479:       d9 44 81 0c             flds   0xc(%ecx,%eax,4)
 804847d:       de ca                   fmulp  %st,%st(2)
 804847f:       de c1                   faddp  %st,%st(1)
 8048481:       d9 14 82                fsts   (%edx,%eax,4)
 8048484:       d9 c9                   fxch   %st(1)
 8048486:       d8 4c 81 10             fmuls  0x10(%ecx,%eax,4)
 804848a:       de c1                   faddp  %st,%st(1)
 804848c:       d9 1c 82                fstps  (%edx,%eax,4)
 804848f:       83 c0 01                add    $0x1,%eax
 8048492:       3d 84 13 00 00          cmp    $0x1384,%eax
 8048497:       75 a7                   jne    8048440 
 8048499:       5d                      pop    %ebp
 804849a:       c3                      ret
 804849b:       90                      nop
 804849c:       8d 74 26 00             lea    0x0(%esi),%esi

From the assembly, we see that the compiler has done quite a bit (if you don’t know assembly, don’t worry, I’ll do my best to explain). First we see all of the address calculation involved in loading the filter coefficients has vanished (it is loading each one directly, e.g. flds 0x8048918). Second, notice that the (fixed sized) inner loop has been completely unrolled, resulting in each of the 5 multiplications and additions per iteration. So far so good.

There is, however, a very alarming surprise is this code. That is the quantity of store instructions (also loads). After every iteration of our inner loop (each filter coefficient), the result is stored. You can see 5 different store instructions (fstps, fsts) per iteration of this loop. Why? Let’s have a look at the code again:

void naive(float* in, float* out)
{
  int i,j;
  for (i=0;i<LENGTH-4;i++)
  {
    out[i]=0.0f;
    for (j = 0;j<5;j++)
      out[i] += filter[j] * in[i+j];
  }
}

To the inexperienced, it might be bewildering why the optimizing compiler would generate 5 store instructions for out[i]= in the inner loop. Why wouldn’t it just accumulate the answer in a register, and then store only the final result? The answer is: aliasing. The problem here is that compiler cannot assume that the pointers *in and *out are disjoint. It must store the result into out[i] each iteration because out[i] may be in[i+j] in the next iteration of the inner loop. With a bit of thought, it becomes clear how this code requires these stores to be correct in cases like *out pointing one float ahead of *in.

Another hard-learned tidbit: wasteful store instructions are terrible because stores can be incredibly expensive (far worse than an extra add or multiply).

Fixing it with restricted pointers

There are several ways to fix this problem, but in the spirit of teaching the art of optimization, I’ll go with the use of the __restrict__ qualifier (this is a gcc directive, but most compilers have some support for restricted pointers). [note from comments: restricted pointers are part of the C99 standard now using the keyword restrict. ]. The only change I made is to add __restrict__ to the function declaration:

void naive_restrict(float *__restrict__ in, float *__restrict__ out)
{
  int i,j;
  for (i=0;i<LENGTH-4;i++)
  {
    out[i]=0.0f;
    for (j = 0;j<5;j++)
      out[i] += filter[j] * in[i+j];
  }
}

This directive tells the compiler that you, the programmer, promise that *in and *out are disjoint, and no aliasing will occur. If you break your promise, don’t expect your code to be correct. Here is the assembly of that output:

080484a0 :
 80484a0:       55                      push   %ebp
 80484a1:       31 c0                   xor    %eax,%eax
 80484a3:       89 e5                   mov    %esp,%ebp
 80484a5:       8b 55 08                mov    0x8(%ebp),%edx
 80484a8:       8b 4d 0c                mov    0xc(%ebp),%ecx
 80484ab:       90                      nop
 80484ac:       8d 74 26 00             lea    0x0(%esi),%esi
 80484b0:       d9 05 10 89 04 08       flds   0x8048910
 80484b6:       d9 04 82                flds   (%edx,%eax,4)
 80484b9:       d8 c9                   fmul   %st(1),%st
 80484bb:       d8 05 0c 89 04 08       fadds  0x804890c
 80484c1:       d9 05 14 89 04 08       flds   0x8048914
 80484c7:       d9 44 82 04             flds   0x4(%edx,%eax,4)
 80484cb:       d8 c9                   fmul   %st(1),%st
 80484cd:       de c2                   faddp  %st,%st(2)
 80484cf:       d9 05 18 89 04 08       flds   0x8048918
 80484d5:       d8 4c 82 08             fmuls  0x8(%edx,%eax,4)
 80484d9:       de c2                   faddp  %st,%st(2)
 80484db:       d8 4c 82 0c             fmuls  0xc(%edx,%eax,4)
 80484df:       de c1                   faddp  %st,%st(1)
 80484e1:       d9 c9                   fxch   %st(1)
 80484e3:       d8 4c 82 10             fmuls  0x10(%edx,%eax,4)
 80484e7:       de c1                   faddp  %st,%st(1)
 80484e9:       d9 1c 81                fstps  (%ecx,%eax,4)
 80484ec:       83 c0 01                add    $0x1,%eax
 80484ef:       3d 84 13 00 00          cmp    $0x1384,%eax
 80484f4:       75 ba                   jne    80484b0 
 80484f6:       5d                      pop    %ebp
 80484f7:       c3                      ret
 80484f8:       90                      nop
 80484f9:       8d b4 26 00 00 00 00    lea    0x0(%esi),%esi

Now, you do not have to be an assembly expert to see how much more streamlined this code is. It consists almost exclusively of loads, multiplies, and adds with a final store at the end. This does exactly what we’d originally hoped. It realizes it can keep a temporary running sum and only store once at the end. We should also note that if you violate the restricted pointer promise, this code will not be correct!

It shouldn’t surprise you to see how much faster this version is, either:

  naive: 0.672957
  naive_restrict: 0.160432

It’s almost 5 times faster than the non-restricted version.

Going Forward

Though I haven’t actually done it, my overwhelming suspicion is that this code still has a ways to go in terms of speed. The next step would be the use of SSE instructions. Maybe next post.

The take-away from examples like this should be that optimization cannot be “left” to the optimizing compiler. You have to know what you are doing to get a compiler to make fast code. And if you really know what you are doing (which means knowing assembly and the underlying architecture quite well), you can usually get a modern optimizing compiler to do exactly what you want. You have to be in the loop. Truly optimized C code ends up looking nothing like C at all.

With careful hand holding, you can get a compiler to make fast code. In that case, it can become difficult to beat a compiler with hand-optimized code. But that is not because the compiler is so good, but because you are so good at getting the compiler to make the code you want.

trackback

58 Responses to “You can’t beat a good compiler… right?”

  1. June 7th, 2010 at 8:16 am

    Hallvard Norheim Bø says:

    They actually tried to introduce a keyword into standard C to deal with things like pointer aliasing. But it never made it into the standard. See http://www.lysator.liu.se/c/dmr-on-noalias.html for Dennis Ritchie’s thoughts on the proposed “noalias” keyword.

  2. June 7th, 2010 at 9:53 am

    Joachim Schipper says:

    The “restrict” keyword is in C99; #define-ing it away for old compilers makes sense, but there’s no need to use the (obviously unportable) “__restrict__” extension.

    On the other hand, “naive” unrolling would work, too: out[i] = filter[0] * in[i] + … + filter[4] * in[i + 4] should generate good code (not tested, sorry!)

    Not that it matters to the present discussion, but your “naive implementation” has an off-by-one error, accessing in[LENGTH].

    But this is just nitpicking: nice article!

  3. June 7th, 2010 at 9:56 am

    Peter Varga says:

    restrict is a valid C pointer qualifier and works.
    paragraph 6.7.3.1 in the C99 Standard.

    It is not needed to use the __restrict__ , just use restrict.

  4. June 7th, 2010 at 10:14 am

    Magice says:

    This totally, completely, and utterly misses the point of compiler optimization, and optimization in general.

    First, yes, per current state-of-the-art, a master-level human assembler can assist the compiler to produce better, safer, faster code in these cases. However, who cares? Seriously, you are talking about, what, 0.6 second. It’s not worth the effort! Oh, did you even realize that you are burning time on 10 lines of code? Seriously, a toy program has thousands of lines of C code already. Good luck trying to optimize every single line. And it’s not like you can get out 0.6 sec every time. Maybe 0.01 next time. Remember, with a 3GHz CPU, 1% of a second is a LONG time.

    Oh, and did anyone tell you that mucking with non-standard stuffs are bound to disaster? And micro optimization may not work with future things? For example, multithreaded. Try really hard to minimize the IO of a single thread, but if the OS swaps out your process, destroy your cache, than the cost of swapping is so huge that your little optimization is not even worth mentioning.

    Meanwhile, why don’t you invest time in something with higher return. For example, data structures that minimize cache hits and can be used over and over. Or, algorithms that are magnitudes faster than your current ones? Or, a better programming paradigm? Or, JIT with selected compilation? Structural stuffs, not little tiny details.

    For some reasons, this reminds me of Vietnam war. America had all details right. They had bigger guns, bigger bombs, bigger airplanes, bigger budget; they won just about anything that they mucked into. And they got kicked out. Micro optimization, you think?

  5. June 7th, 2010 at 10:19 am

    Breck Fresen says:

    It’s not that optimizations like these are too hard, it’s that for the vast majority of applications they’re completely unnecessary. There are very few problem domains where dropping down to assembly is worth the time it takes to optimize, let alone the cost of readability and maintainability. If you’re one of those few engineers working on a kernel or some other piece of performance-critical system code, then more power to you. But suggesting that knowledge of assembly is necessary for the truly average programmer is ridiculous.

    That said, props on restrict – I’d never seen that before.

  6. June 7th, 2010 at 10:19 am

    louis says:

    @magice, thanks for the predictable response.

    Nowhere have I argued that you should be optimizing “all” of your code, or optimizing indiscriminately (or prematurely), or without exhausting other, better options. Nowhere have I said this type of optimization is preferable to cache optimizations, or other high level concerns. Knowing -when- to bust out micro-optimizations is, in and of itself, a skill worth discussing.

    However, if you are actually arguing that it is never correct to spend time doing micro-optimization, you are wrong. It’s just that simple.

  7. June 7th, 2010 at 10:27 am

    Joachim Schipper says:

    As pointed out to me at http://news.ycombinator.com/item?id=1410755, the off-by-one error is nonsense. Sorry!

  8. June 7th, 2010 at 10:53 am

    Pau Fernández says:

    Very nice article, thanks.
    Looking forward to the next post.

  9. June 7th, 2010 at 11:11 am

    George Phillips says:

    0.672957 / 0.160432 is about 4.2 — quite a bit short of “almost 5″. I nitpick here because numbers really matter when optimizing. Too many a time I’ve made similar leaps thinking that I had previously obtained “nearly 5 times speedup” and looked down upon something that was 4.5 times faster. Only to find out that the “nearly 5″ was actually slower because I wasn’t being careful.

  10. June 7th, 2010 at 11:56 am

    Sebi says:

    Lots of resources on optimizing here : http://www.agner.org/optimize/

  11. June 7th, 2010 at 11:56 am

    Alan McIntyre says:

    Nice article! Out of curiosity, I tried adding -funroll-loops, which made the naive function run faster than the naive_restrict function.

    Before adding -funroll-loops:
    naive: 0.461494
    naive_restrict: 0.143920
    sum_variable: 0.105645

    After adding -funroll-loops:
    naive: 0.107970
    naive_restrict: 0.125535
    sum_variable: 0.096111

    Looking forward to an SSE post. :)

  12. June 7th, 2010 at 12:21 pm

    nate clark says:

    Funny, I couldn’t recreate your results with gcc 4.3.3 on my machine. The O3 code I got used SSE, fully unrolled both loops, and did the elimination of the unnecessary stores. It also inlined naive() into main().

    I imagine the problem is that the assembly code you looked at was for just the naive() function itself. If you call that function from somewhere, then GCC will use alias analysis to detect that the pointers don’t alias, and probably inline, obviating the need for the restrict keyword. Granted, GCC doesn’t have the best alias analysis in the world, but I’d be shocked if this example is one that breaks it.

  13. June 7th, 2010 at 12:38 pm

    Fenris says:

    You can get the same output (if you really don’t have overlapping pointers) with a simple temporal variable

    float temp= 0.0f;
    for (j = 0;j<5;j++)
    temp += filter[j] * in[i+j];
    out[i] = temp;

    you clearly tell the compiler what to do, with this anti-optimization (I declared more variables than needed)

  14. June 7th, 2010 at 12:40 pm

    Loren Pechtel says:

    When you look at the cases where hand optimization can beat the compiler it normally comes down to the human knowing things the compiler can’t know. In this case you can inform the compiler but you can’t always.

  15. June 7th, 2010 at 12:49 pm

    louis says:

    @nate,

    Did you run the code as I have it on github, or just copy/paste this into your own file? I ask because I just double checked everything, and am not seeing the same things as you. It would seem, to me, that the function pointers as I have it setup keep gcc from inlining it. I’ve been running gcc 4.3.3 on a relatively clean ubuntu 9.04 box.

    Is it generating vectorized SSE, or just using the SSE stack for scalar fp operations? I can’t really get gcc to autovectorize this code at all, by playing with flags.

    I can, though, get decent speedups by using ‘-O3 -msse3 -mfpmath=sse’ which will force it to generate sse scalar instructions instead of fpu instructions.

  16. June 7th, 2010 at 12:57 pm

    Drako says:

    @magice
    To thee I quote my professor from back in my undergraduate days:

    “Hardware exists to provide scalability and high availability, not to cater to poorly optimized code.”

    That being said, what @louis said is a perfectly acceptable compromise. Sure application developers will always want to be lazy, so we shall cut them some slack and allow them to optimize just the critical segments of code.

    Remember, just as you as an application developer want to be lazy and finish your work quickly so you can go out and play, the people using that shitty-end-result of an application also want to do the same.

    P.S. when I say “you” I mean the general masses of application developers who have no sense of pride in their work.

  17. June 7th, 2010 at 1:09 pm

    Enlightenment says:

    The reason so many program runs slow and hog memory is because ignorant people quit worrying about optimizing code. They just assume it is a non-issue with a 3GHz processor, but they are wrong.

    The idiots don’t get it. LENGTH of 5000 might be trivial, but what if you had to do calculations on LENGTH of 100 MILLION or many different number tables, then savings would add up to be very large amounts of time.

  18. June 7th, 2010 at 1:32 pm

    Gypsy says:

    Great article! It’s not often that you see someone brave enough to wander into the dark depths of assembly programming. To those of you who respond with frustration and anger: I can only assume that you have a hard time reading/writing assembly.

    Saying that micro-optimizations aren’t useful is just silly. Most of you should be aware that some functions are called repeatedly, and those “2 seconds saved” per run may amount to a lot of time.

    Also, measuring the need for optimization based on the number of lines of code, indicates a terrible lack of understanding.

  19. June 7th, 2010 at 2:03 pm

    Jed says:

    @Enlightenment: With LENGTH=10^8, you would be memory bandwidth limited.

  20. June 7th, 2010 at 2:05 pm

    Tim Smith says:

    Being someone who spends most of my working day optimizing code, I have to agree that the idea that the compiler can optimize better than a human is a limited concept.

    In the author’s example, we have a perfect case where the compiler doesn’t have enough information to perform specific optimizations.

    In other cases, the code is just too complex for the compiler to properly deconstruct and optimize. This was a big problem with some compiler when dealing with C++ templates.

    That being said, I still find that most optimizations are higher level optimizations. For example, using proper containers or data structures. Repeating work is usually a big cause of problems.

    Good article.

  21. June 7th, 2010 at 2:05 pm

    Drifter says:

    I was kinda suspicious of your results so I tried your code on cygwin using actual timed tests. I bumped the LENGTH up to 100*1000*1000 to give it something to chew on, and checked the assembler to make sure it replicated your results (which it did). Using time ./a.exe. The profile time fluctuates from 0.577s to 0.655s on the fast version and from 0.545s to 0.670s on the slow version. Frankly, the error seems higher than the effect of the optimization.

    I may be wrong, but I suspect the memory timings are all important here. The most expensive operation is loading a chunk of data from main memory into the cache. The additional stores the you managed to optimize out only use the local cache so they are very fast and are effectively hidden by the load from main memory. The fluctuation in timings is caused by the cache getting flushed by other apps in system.

    Feel free to criticize my results but I suspect your optimization would only work on older hardware and embedded systems that aren’t so dependent on memory.

    I would be interested to see your own results if you tried to time the optimizations. I personally believe that your article is fundamentally flawed because you didn’t profile your results.

  22. June 7th, 2010 at 2:15 pm

    PacManfan says:

    Thanks for the good article on basic optimization.
    @magice – Listen to the responses on this article – you could learn a thing or two.
    As a video game programmer back in the early 90’s, I used to manually unroll inner loops and convert them to assembly by hand. Done correctly, it can change the video performance from 5-7fps to 30+ fps.
    The entire program does not need to be optimized, just the slowest, most-run code. That’s what a profiler is for.
    -PMF

  23. June 7th, 2010 at 2:16 pm

    Drifter says:

    I apologize – I didn’t see the timings at the end of the article.

  24. June 7th, 2010 at 2:20 pm

    Kevin H says:

    @Fenris: That isn’t really an “anti-optimization” per se, as that variable would be put on the stack, which doesn’t really add extra memory (since it doesn’t call any functions within). Unless this call was very deep in the stack, I doubt it would affect anything.

  25. June 7th, 2010 at 2:42 pm

    angelom says:

    Thanks for the article.

    I come from an embedded background where I have found bad C compilers producing painfully bad code.

    Sheeple are repeatedly quoting ‘Gotos Bad’, ‘Assembly Optimization Bad’, wondering what the next mantra will be…

  26. June 7th, 2010 at 3:38 pm

    Dan says:

    A good introductory-level textbook that sets the context for this topic is Computer Systems: A Programmer’s Perspective, Bryant & O’Halloran. Uses X86, Linux (with appropriate Windows asides), and GCC, covers 32-bit and 64-bit architectures. It doesn’t dive into detail on hard-core optimizations, but it does address memory aliasing and other compiler optimization blockers, memory caches, etc. Written for undergraduate students who are new to both C and X86 assembly.

    http://csapp.cs.cmu.edu/

  27. June 7th, 2010 at 3:56 pm

    Jon says:

    I see nothing about you trying to avoid stalling the instruction pipeline in the CPU. The days of non-pipelined, CISC chips are looong gone, my friend.

  28. June 7th, 2010 at 4:25 pm

    dude says:

    I understand the point of your article, but it’s just plain silly not to mention that you will on average get more optimization by picking the most efficient algorithm and profiling than you ever will with premature optimization. Get it to work -right- first and then get it to work -fast- if profiling tells you it’s necessary. You should always mention this as beginners will come across your article and think “well now I have to back and optimize everything”.

  29. June 7th, 2010 at 4:33 pm

    Joachim Schipper says:

    @nate clark: note that such analysis doesn’t work if it’s not in the same source file; even with GCC’s new whole-program optimization, it won’t work if the function is in a shared library.

  30. June 7th, 2010 at 5:17 pm

    jalf says:

    @Magice: oh really? 0.6 seconds?

    The largest performance boost I’ve achieved solely from microoptimizations, from trying to “beat the compiler” is… over 800%. The application as a whole ran 8 times as fast when we were finished optimizing.

    I think that’s worthwhile. It certainly was in that specific case.

    Your assumption that it is not worth it to consider performance on a lower level than choice of algorithm is simply foolish.

    But bad programmers will always try to make a case for why being a bad programmer, why not understanding the hardware you’re running on, or the code you’re executing, is a good thing.

    Guess what. At the end of the day, they’re still bad programmers. They’re missing some vital tools in their toolbox.

    A good programmer certainly knows that messing around with low-level optimizations isn’t the first thing you should do, but he *also* has the option available to him. He has the skills required to understand his code, and the performance of that code, because *sometimes* that is essential in getting good perfomance.

  31. June 7th, 2010 at 5:42 pm

    jalf says:

    As an aside, I have a question: how exactly does one go about learning how to go beyond standard C into hardcore optimization? For me, it’s been a long road of trial & error, and random discovery. It seems to me that there is probably good resources out there for getting started on optimizing code (let’s say in gcc). Know any?

    http://www.amazon.com/Computer-Architecture-Quantitative-Approach-4th/dp/0123704901 was an invaluable help in my case. It’s not about optimization, but it’s an amazing resource for understanding the CPU. And when you do that, most optimization tricks become pretty obvious.

    Also look up the documentation for your target processor (both AMD and Intel have these available in PDF form on their websites)

    Knowing which instructions are pipelined and which ones aren’t, and their relative latencies and all that good stuff helps too. Reading the assembly output from the compiler isn’t enough. You also have to know how expensive the instructions it chose are.

    Humans certainly can beat a good compiler. And you don’t have to be a “master-level human assembler” to do it. You just have to think. Choose your battles. The compiler is unbeatable on its home turf. don’t try to optimize on the things it does well.
    But in many other cases, a human programmer knows things about the code that the compiler doesn’t. And that can make a huge difference. Using “restrict” as shown in this blog post is one example. The compiler can’t safely do this because it doesn’t know if it’ll break the program. The human programmer *does* know that this assumption is safe to make.

  32. June 7th, 2010 at 5:55 pm

    Anonymous says:

    this whole article could be written in five lines.

  33. June 7th, 2010 at 6:08 pm

    Anonymous says:

    Rather than blame developers for failing to compensate for problematic language features, why not use a language that doesn’t have those problems in the first place (excluding kernel devs, etc)? The simple fact is that pointers eliminate many kinds of automatic optimization. Fran Allen noted this decades ago. I think it’s time to stop apologizing for C.

  34. June 7th, 2010 at 6:26 pm

    Rick says:

    Use -fpmath=sse to use the SSE unit instead of x87 and -ftree-vectorize (might already be part of -O3) to generate vectorized SSE code.

  35. June 7th, 2010 at 11:30 pm

    Kelexel says:

    yeah… go ahead … optimize…

    I’m not saying that optimizing is never useful, quite the contrary. However from experience Optimizing has almost always (note.. almost) been harmful. It introduces bugs that are hard to fix because the code is harder to read, it makes the system much more fragile and looses a lot in resilience over external changes. Especially when the optimization is done through compiler tricks and exploiting obscure corner cases.

    Optimization such as this should be as isolated as possible, writing the code directly in assembly language would have avoided some of the pitfalls. Burying the code in a library that is properly unit tested and provide a proper interface to it so that the optimization does not spread beyond it`s original intended use would be even better.

    However, was it really necessary ? Sure we can come up with a thousand cases where optimization yielded sizeable benefits. But there are far more cases where optimization was done prematurely. No amount of optimization on a bubble sort will beat using a more efficient algorithm. But worst of it is the hidden cost which, with time becomes not so hidden.

    Many times I have seen corporations with systems many millions of lines of codes that are so hard to maintain because of the systematic optimization done. In fact I have seen optimization techniques that became laws, your code HAD to do it this way. The end result was that the systematic application of this without regard of the actual impact actually made the code run MUCH slower, I’m not speaking 0.6 seconds or 8x slower I’m speaking orders of magnitude slower.

    To make matters worst the code is not so convoluted and over optimized that it is nearly unmanageable. A simple feature that should take a few weeks to implement now takes over a year because of the fragility these optimization brought to the code base. Fix something here and five other things break…

    YES optimization is a good thing when used properly, so is goto, firearms, insecticide and land mines. However it is because they are so easy to mis-use that we recommend that they are not be used at all. Most (99%) people are not smart enough to use them properly and in doing it make life impossible for the rest of us.

    If you are to optimize then do it proper, roll up your sleeves and get the assembly code directly. but please by the love of good do not do it by tricking the compiler. If it is not smart enough for you then remove it for the bits you think you can do better and have fun but do not try to inject smart behaviour in stupid things it will come back to haunt you.

  36. June 8th, 2010 at 2:05 am

    Graycode says:

    For compilers without restrict support, I believe this would work:

    void naive_tempvar(float *in, float *out)
    {
    int i,j;
    for (i=0;i<LENGTH-4;i++)
    {
    float sum = 0.0f;
    for (j = 0;j<5;j++)
    sum += filter[j] * in[i+j];
    out[i] = sum;
    }
    }

    By fixing the location of out[i] as a temporary variable, the aliasing problem is removed, and the compiler should optimize that code into near-identical assembly.

    Please correct me if I'm wrong.

  37. June 8th, 2010 at 2:30 am

    42Bastian says:

    @Louis: Good article.

    @Drifter: Testing such optimizations stand-alone won’t give you valid numbers. At least with N == 5000. The loading of the exe alone takes too much time.

    @No-hand-optimization folks:
    Understanding where and when to optimize is IMHO essential for a good programmer. And since the article does not state directly optimizing for a 3GHz-PC you really should consider the vast number of programmers working on embedded systems, which are often restricted in speed _and_ memory.

    Sure: Writing assembly for todays IA32/IA64 cpus is overkill. One just cannot write code which is optimal for a Pentium 4 and a i7.
    But knowing how to help the compiler to do the best job for you is the trick.

    And BTW: Ever wondered why Thunderbird takes longer to start on a 3GHz machine than Netscape on a 200MHz computer ?

    I suggest, every application developer must be forced to program on 500MHz computers, it’ll help the users :-)

  38. June 8th, 2010 at 2:31 am

    Aldurs says:

    Nice article, it will be interesting to see your SSE-version.
    But, there’s always one, to truly optimize your code you should leave high-level code and rewrite the algorithm in assembly (but this you already knew). Trying to squeeze some extra oompf out of the code generated gives, most of the times, just a few percent — occasionally more. The drawback are that you tends to get bugged down in the generated code instead of seeing the whole algorithm and translating that.

    That said, it’s not a total waste of time scurrying through the output from the compiler — when hunting for speed bugs, as your example clearly shows.

  39. June 8th, 2010 at 2:36 am

    42Bastian says:

    @Graycode : Tested with PowerPC-gcc -O3: The stores are gone, but the restricted version is better unrolled using a lot multiply-add instructions. But compiling with size optimization your code gives the better result (size sometimes matters).

  40. June 8th, 2010 at 2:50 am

    Yves says:

    Louis,

    an interesting and relevant article for those involved in optimizations.

    A comment to Magice: my bread and butter is real-time image processing. I can be requested to handle, say, 4 Mpixels images 60 times per second. So my time scale is the millisecond. In such a context, optimization really matters :)

    Unfortunately for us, the behavior of modern processors becomes less and less predictable and repeatable. This slowly turns the art of programming to black magic…

  41. June 8th, 2010 at 3:06 am

    AlainT says:

    Hi,

    Interesting article but missing a warning, because on real cases (I mean professional work), the use of “restrict” will certainly lead to a nice big crash :-( when a new developper arrives and changes the parameters, without knowing that *in and *out should not be overlapping (or do you really expect him to check all the code before changing 3 lines, or knowing what the “restrict” keyword means ?)

    In fact I’ll even bet that the code writer himself will eventually break the code some months later because he would have forget about this optimization…

    In my opinion the temporary variable solution mentionned by Graycode is far better because it does not introduce unnecessary risks and is finally much more clear on what the processor is really doing.

  42. June 8th, 2010 at 3:16 am

    Jonathan Chayce Dickinson (Mr.) says:

    Read a lot of comments from people who go at great depths to say: just leave it up to the compiler. Bad idea. I don’t care if you don’t check the optimization into source control; stick around at work for another 1/2 hour and optimize a few of your methods.

    The reasoning is simple: as a programmer you should not trust a black box. You need to understand how the tooling you are using works; where the nuts and bolts are, how everything fits together all the way down to the metal.

    Doing anything less is giving the rest of us a bad name. These people are the type who don’t know when to use an enumerator and when to use an array. These are the type of people who think that the year 2000 will never happen.

  43. June 8th, 2010 at 4:18 am

    jcm says:

    Choosing algorithms and data structures wisely to avoid combinatorial explosion is the very best optimization of human time and safety. Micro-optimization is non-sense.

  44. June 8th, 2010 at 4:25 am

    Kia says:

    A nice article!

    I suspect the source of contention between those advocating and those not preferring micro-optimisation is due to the type of applications being developed. I do a lot of high performance computing and can assure you when you have to run heavy computation with long loops over multi-million element vectors and your application might be running hours on end, micro-optimisation can definitely be worth the effort. This of course means you would first thoroughly profile the application and have a good understanding of the bottlenecks before setting out to optimise.

  45. June 8th, 2010 at 4:26 am

    caf says:

    The example given in this article *isn’t* micro-optimisation. Micro-optimisation is doing things like changing q = x[a + b] * y[a + b]; into temp = a + b; q = x[temp] + y[temp];, or changing * 2 into << 1.

    Adding appropriate "restrict" qualifiers isn't any kind of overengineering, either; as you've noted, it's something that allows the compiler to optimise better, without affecting the readability or maintainability of the code. In my book, correct use of the "restrict" qualifier is just as valuable and praiseworthy as correct use of the "const" qualifier.

  46. June 8th, 2010 at 5:32 am

    Mark says:

    I suggest that you just have a poor compiler. Visual Studio 2010 has naive running on a LENGTH array in 14.1754 nanoseconds. Also, I really didn’t like how you didn’t give example inputs or other use cases, or just your general setup.

  47. June 8th, 2010 at 8:52 am

    dvader says:

    Good article, haven’t played with C/asm optimisation for a while. I have to do a lot of optimisation in my job, in C#, and most of my stuff is limited to algorithms. The problem is in maintenance. If I can’t get the monkeys to understand what I’ve done – it never gets used again. The problem is we all have to deal with marginal coders, and we can’t just blow them off – because they ARE the majority.
    But it was good to read about some real optimisation again! :)

  48. June 8th, 2010 at 9:49 am

    Matt says:

    Some pretty unfair comments in what predictably turned into a flame war (be it a fairly civil one so far). The fast majority of slow response times in all applications today are from one of 3 factors 1) Blocking on I/O inefficiently (especially networked) 2) Allocating memory inefficiently (especially string construction) 3) Context/Thread switching (inefficiently?)

    So my vote would be to teach all of the app developers in the world how to optimize those 3 things before we get into micro optimization religious wars. Also, when app developers don’t micro optimize their code, it’s usually not because they don’t take pride in their work, it’s more often because they have to balance features and quality against performance gains.

  49. June 8th, 2010 at 9:50 am

    Matt says:

    And by fast majority I meant vast majority. Limited editing abilities….

  50. June 8th, 2010 at 10:43 am

    Peter da Silva says:

    OK, the point of adding the restrict keyword is to get it to accumulate results in a temporary variable in the loop.

    Why don’t you just do that?

    void naive(float* in, float* out)
    {
    int i,j;
    for (i=0;i<LENGTH-4;i++)
    {
    float sum = 0.0f;
    for (j = 0;j<5;j++)
    sum += filter[j] * in[i+j];
    out[i] = sum;
    }
    }

    This not only lets the compiler know it doesn't need to worry about aliasing, it's more readable once the expressions (as they tend to) become more complex.

  51. June 8th, 2010 at 5:55 pm

    Jonty says:

    @jon and @jalf

    - Pipelining may not be so much an issue on lower end / embedded processors, where you would be looking exactly to make the kind of optimisations demonstrated in this article.

    Even the Atom, if I understand it correctly, will not particularly negatively affect your expected execution order/speed from it’s in-line pipeline.

  52. June 8th, 2010 at 10:03 pm

    ricardo saracino says:

    I like the explanation…

    btw i believe this was compiled with gcc and not a MS product

  53. June 9th, 2010 at 11:29 am

    Jason Cohen says:

    I’d like to see the same code with fixed-precision integers instead of floats — what’s the speed difference?

    I know FPUs are great these days, so maybe there’s no difference. Last time I was doing similar code (90’s) it was a huge no-no to do floating point because of the performance problems. I’m curious!

  54. June 11th, 2010 at 11:18 am

    shegget says:

    @AlainT – Kudos. I was reading this entire comment thread waiting for someone to mention exactly what you said. In a theoretical world of “Hey let’s see if I can make my code go faster / make the smallest PE possible / win the obfuscated C contest”, this is pretty neat. In the real world of someone else maintaining your code or, as you said, returning to your own code in several years with a different compiler/build environment and rusty knowledge, techniques that require specific programmer knowledge and specific compiler behavior have the potential to create some really painful and hard to find bugs.

  55. June 19th, 2010 at 5:29 pm

    Anonymous says:

    fdsdfsgfd

  56. June 28th, 2010 at 1:13 am

    IvenJ says:

    very helpful,useful!Sometimes ,it’s miserable that your codes are just running slowly, but u dont know why

  57. July 23rd, 2010 at 6:59 pm

    argv says:

    did you ever consider looking at gcc instead of looking at your code (whether C or assembly)?

    more specifically do you understand how flex and bison work? do you know what they are doing (in characteristic american “micro-detail”? that’s for you magice ;)

    gcc is a mess. it’s not clean code. and this makes it extremely difficult to fix things.

    and writing compilers is not easy.

    someone suggested reading a book on “good code”.
    perhaps you should be reading a book on compilers?

    programmers are not only lazy but, today, they are also ignorant. and therein lies the problem. history has been overlooked.

    knowing assembly is not enough to understand the problem. you need to understand the steps taken to get from C to assembly, and how they are implemented.

    and that’s a lot of work.

    note also: there are alternatives to gcc. but you need to investigate the “ancient” past to find them.

  58. August 19th, 2010 at 3:02 pm

    Nadav says:

    Great post. Good job.

Leave a Reply


Need a new job?