{ on programming and the internets }


by Louis Brandy

Bugs of Doom (aka the Heisenbugs)

I think anyone who has programmed for any length of time has some stories to tell about the most ridiculous and convoluted bugs that they’ve ever encountered. Here’s my little hall of fame. Pretty much all of the most convoluted bugs have something in common: they are incredibly difficult to reproduce. In low level languages, this typically involves one of three issues: memory management, concurrency, or rarely a bug in something you should be trusting (your compiler, your OS, etc.). In higher level languages, the most convoluted bugs tend to happen when the auto-magic of your language hits some corner case and goes haywire.

The absolute worst of the worst are the heisenbugs. That is, the bugs that only happen when you aren’t looking for them. The harder you try to find them, the less reproducible they become. The very act of trying to debug them, makes them disappear.

Concurrency troubles

What happened: We created a live demonstration of our face detection and face tracking. We decided to stress test it and leave it running for awhile. It would dead lock after running for about 2 days.

How we attacked it: It was relatively simple to determine that a thread was crashing. The problem was altering the code in any meaningful way ended up making the bug irreproducible. You can imagine the pain in testing this, however, because it takes a good 4 to 5 days to decide the bug isn’t going to happen, and even then you aren’t sure.

What caused it: It ended up, no surprise, being a subtle race condition. We set up a thread-safe queue to have the mother process dump all the work to be done onto, and then had the worker threads reading off that queue. In our implementation, the queue could be connected to by worker threads, it would give them things to do, and be disconnected from once the worker thread was done. It would also clean itself up once all the connections died and there was nothing left on the queue.

Well, that last little bit is the root of the problem: the program crashed in the pathological condition when one of the threads was blocked in such a way that the other threads were all able to connect to the queue, process all of the data, and then disconnect. The queue would clean itself up, then, and the final thread would awake and try to connect to the now vanished queue. Crash.

The fallout: It probably took us an off-and-on three weeks to figure this one out. Most of that time was spent reproducing the original crash before we started looking seriously. Threading is no joke. Even simple implementations are wrought with potential problems. From that point on, we made some (obvious) rules regarding our threading constructions. First, we have a “prove this is too slow, first” approach to extra complexity (is there an off the shelf solution? Does an extremely simple solution work, first?). Next, if you really need a customized synchronization method for performance reasons, you make it generic (so it can be reused) and you test it with extreme predjudice. We’ve built up a nice library of extremely well-tested thread-safe data structures for our use cases. We rarely have to write new ones, these days.

Memory, memory, memory

What happened: Again we decide to stress test an application and it seg-faulted after a few days.

How we attacked it: We turned off threading, and it ran for 2 weeks, no problems. We removed the assembly code (replacing it with our non-assembly versions) and it ran for 2 weeks, no problems. Sigh. Okay, now what? Was the assembly code bugged and trouncing out of bounds overwriting important information? Another race condition? Or some memory issues that only showed up in certain circumstances? It was the last one.

What caused it: I did. In order to maximize speed for SIMD assembly, you want your data structures to be aligned to cache-line boundaries. This typically means you want all of your primary data structures to start on addresses mod 16 = 0. In order to do this, I had created an aligned allocator. The assembly code was using the aligned allocator, while the non-assembly code defaulted to the normal allocator. As it turns out, you really shouldn’t use valloc() to get this kind of alignment. That was just boneheaded.

valloc() gives not just 16-byte alignment but (on our machines) 4096 byte alignment. That’s like 4080 bytes better! It also has the horrible side effect of potentially fragmenting memory into pieces. This is exactly what was happening in our case. Switching to a more sane memalign() caused the problem to go away (or, at least, cause the half-life of the problem to be pushed beyond our acceptance horizon).

The fallout: This bug actually took close to two months to finally track down. The reason it took so long was because of the other experiments we had to run for a week each to convince ourselves it wasn’t going to crash in the same way. And then the experiments seemed to point in the wrong direction (an assembly or threading bug). Anyway, we decided to add some real memory recovery code to our application. More importantly, we learned the danger of memory fragmentation first hand.

Memory, again, maybe

What happened: I was improving one of our internal programs that operated on huge amounts of data in batch. You couldn’t fit all of the uncompressed data in memory at once, so you had to uncompress a chunk, do your work, free the uncompressed chunk, and start on the next chunk. Very occasionally, the free() call appeared to fail. It appeared that that chunk just stayed in memory, and it would go to expand the next chunk, causing the system to start swapping and crawling to a halt. It appeared to be some kind of memory leak.

How we attacked it: This was the ultimate heisenbug. Every single time I tried to probe the executable, the bizarre behavior would vanish. Run it through valgrind, it vanished. Change the compile flags, it vanished. Use a malloc() debug replacement, it vanished. Write a custom version of malloc() to track simple issues, it vanished. Put a print statement in the offending code, it vanished. Attach gdb, of course, it vanished.

What caused it: Well, uhm, that’s a funny story. After about two weeks of trying everything under the sun, it stopped happening. I don’t know what I changed. I tried reverting my changes bit by bit. I tried a fresh check out. I did everything I could to reproduce this bug, and I couldn’t. And I can’t fix a bug I can’t reproduce.

The fallout: Maybe it was a bad object file getting into the build by accident. Maybe it was a pathological linux kernel issue. Maybe it was a pathological malloc issue. Maybe I am crazy. Maybe. Maybe there’s a landmine sitting in that program waiting to explode at some later date.

How about you? Got any stories?

trackback

11 Responses to “Bugs of Doom (aka the Heisenbugs)”

  1. February 16th, 2009 at 12:52 pm

    Kevin Gadd says:

    At a previous employer I had the misfortune of discovering a bug in the custom in-house build system. Incremental rebuilds of the executable would occasionally miss changes to dependencies. The result was that our app would get linked together but constants would have different values in different modules (among other things). I suspect this also meant that things like inline functions could get mismatched too, but I never saw proof of that.

    It took months of intermittent, inexplicable crashes in local debug builds that always went away if I rebuilt my executable from scratch. The kind of failures this produced were so bizarre that we ended up replacing my workstation’s RAM and hard disk before I finally was lucky enough to get a crash that was debuggable. In the end I hit ’step into’ in a debugger and watched the value of a constant change from 5 to 6 in the watch window.

    Moral of the story: Don’t write your own build scripts, use scons or something instead. :D

  2. February 16th, 2009 at 3:17 pm

    James says:

    wait ..what? your face detection program takes two days to detect a face?

  3. February 16th, 2009 at 3:53 pm

    louis says:

    @Kevin,

    I’ve gotten in the habit of wiping clean and doing full re-builds when bizarre stuff starts happening. It’s rare but stray object files can cause so much badness.

    @James

    That wouldn’t be terribly useful, would it?

    We hooked it up to a video feed. We used either a camera or an antenna for local television.

  4. February 16th, 2009 at 7:35 pm

    David says:

    The worst heisenbug I saw was caused by a Makefile-based multi-step build that needed to be run twice. The 2nd, 3rd, 4th, etc. builds would be correct… but the first build created a program with internal binary inconsistencies caused by inconsistent defs being passed to the compiler. The corrupted version could run for days without crashing, or a couple of hours under intense simulated usage. Needless to say, it took a long time for the programmers to figure out why the automated production builds were always unstable when it was so hard to reproduce the bug in development. Every time you thought you had figured out how to reproduce it, you would try again and the bug would disappear….

  5. February 16th, 2009 at 9:31 pm

    Bill King says:

    I wrote at that time a simple program to listen at the top of a directory tree for change notifications, and scan down/copy over to mirror version. :/ Memory fragmentation fail (after 2 days). Windows (or was it delphi?) doesn’t like cleaning up the blocks it allocated, so over time, it’d just grow, and grow, and grow. Plugged in gc, and bug “solved”. It was supposed to be a simple one-day to write application *rolls eyes*.

  6. February 16th, 2009 at 9:53 pm

    Matthew Crumley says:

    When I first started programming on Linux, I was playing around with a program (I don’t even remember what it did). I named the executable “test”. When I ran it, I was expecting some output, but didn’t get any, so I assumed it was crashing. When I stepped though it in the debugger, everything was fine, and I got the expected answer. After cutting out more and more code, until I *knew* it couldn’t be crashing, I found out that there’s an existing program called test. I didn’t know at the time to use “./test”, so I was running the existing command. When I ran it with gdb of course, it found the local file and worked.

    I learned a couple good lessons about Unix/Linux programming from that…

  7. February 16th, 2009 at 10:02 pm

    dar7yl says:

    I had an exasperating experience trying to track down a heisenbug in a medical display device I was working on. (l33t project!)

    The device would randomly lock up, frazzle the display, reboot, or randomly perform uncommanded functions.

    There was no way to reproduce the problem. When traces were enabled, it would go away. If you watched it for hours, nothing happened, but when you turned your back, BAM!

    I realized it was a heisenbug, so I bring out my heisentools.
    First, check for uninitialized variables, the cause of 95% of HB’s (IMHO). Well, I found lots of those (oops), but didn’t fix the problem.

    Ok, so next look for race conditions. The process model was fairly simple, and easy to instrument. There were a few weird interactions, but no deadlock potential.

    Finally, I had a bit of luck, and managed to get a snapshot from just before the crash. There I noticed that some data wasn’t in the expected location. That data was supposed to be written from the device driver, under interrupt control.

    Once spotted, it was easy to track down where I had neglected to set the data direction bit during a block copy, inside that interrupt routine. Turns out that the interrupt inherits the DD from the interrupted process. Most of the time, it is set correctly, but rarely, it is reversed. Then that errant data goes on to scramble control structures and bring the machine down, usually much later.

    The hunt took 3 days, which really messed up the schedule. On the plus side, the system was vastly improved afterword. There’s nothing like a weird bug to make you go through your code with a fine-tooth comb.

  8. February 16th, 2009 at 10:35 pm

    Paul Betts says:

    The way I try to solve these tricky race conditions is through a circular in-memory log – just have it write a 64-bit timestamp, the current thread ID, and a 2-byte tag (like a Pool tag in the Windows kernel). Make sure that the index to the next item is atomic, or else your log will be corrupted; just keep InterlockedIncrement’ing it and mod by LOG_SIZE.

    This way, when the program dies, you can dump out the log and see the sequence of events that led to this happening.

  9. February 17th, 2009 at 4:28 am

    Dexter says:

    It took me two weeks to figure out why the pricelist.ini (a very basic INI file) that the web application produced for the barcode reader (a WinCE handheld) was not recognized by the gadget’s software.
    I could make it read a manually made file, but not a genereated one. And they looked identical!
    Only with the hex editor I realized that the gadget ran a little-endian ARM CPU, so it needed the UTF bytes swapped!
    A single change in the output encoding did the trick…

    Dexter

  10. February 17th, 2009 at 7:11 am

    Steve says:

    Back in 1998 I was working in a big team on a complex C/C++ settlement system at a clearing bank in Luxembourg. We used Clearcase for version control, which presented version-specific trees of files to our workstations using a modified NFS mechanism. Builds took a long time, so somebody turned on a clever Clearcase feature that wrapped our compiler and magically popped up .o files originally compiled on other developers’ workstations with compatible file versions, saving us the need to compile every file ourselves to make our executables.

    Clever idea in principle, except that after a couple of months, everybody’s executables started crashing. It turned out, after a great deal of investigation, that environment variables on one of the developer’s machines had influenced a C++ template expansion strategy, but the smart compiler wrapper had happily propagated the resulting incompatible .o files to the central Clearcase server, and therefore to all the other developers’ machines.

    If I recall correctly, the peculiar choice of solution was to periodically clean out the server-cached .o files. Ah, good times, good times.

  11. February 18th, 2009 at 1:54 pm

    Chris Wash says:

    This I wouldn’t call a bug but an issue that warranted debugging no less. It’s a good story.

    http://www.ibiblio.org/harris/500milemail.html

Leave a Reply


Need a new job?