<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Bugs of Doom (aka the Heisenbugs)</title>
	<atom:link href="http://lbrandy.com/blog/2009/02/bugs-of-doom-aka-the-heisenbugs/feed/" rel="self" type="application/rss+xml" />
	<link>http://lbrandy.com/blog/2009/02/bugs-of-doom-aka-the-heisenbugs/</link>
	<description>{ on programming and the internets, every monday }</description>
	<lastBuildDate>Mon, 23 Aug 2010 09:39:45 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Chris Wash</title>
		<link>http://lbrandy.com/blog/2009/02/bugs-of-doom-aka-the-heisenbugs/comment-page-1/#comment-2975</link>
		<dc:creator>Chris Wash</dc:creator>
		<pubDate>Wed, 18 Feb 2009 18:54:09 +0000</pubDate>
		<guid isPermaLink="false">http://lbrandy.com/blog/?p=540#comment-2975</guid>
		<description>This I wouldn&#039;t call a bug but an issue that warranted debugging no less.  It&#039;s a good story.

http://www.ibiblio.org/harris/500milemail.html</description>
		<content:encoded><![CDATA[<p>This I wouldn&#8217;t call a bug but an issue that warranted debugging no less.  It&#8217;s a good story.</p>
<p><a href="http://www.ibiblio.org/harris/500milemail.html" rel="nofollow">http://www.ibiblio.org/harris/500milemail.html</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Steve</title>
		<link>http://lbrandy.com/blog/2009/02/bugs-of-doom-aka-the-heisenbugs/comment-page-1/#comment-2901</link>
		<dc:creator>Steve</dc:creator>
		<pubDate>Tue, 17 Feb 2009 12:11:22 +0000</pubDate>
		<guid isPermaLink="false">http://lbrandy.com/blog/?p=540#comment-2901</guid>
		<description>Back in 1998 I was working in a big team on a complex C/C++ settlement system at a clearing bank in Luxembourg.  We used Clearcase for version control, which presented version-specific trees of files to our workstations using a modified NFS mechanism.  Builds took a long time, so somebody turned on a clever Clearcase feature that wrapped our compiler and magically popped up .o files originally compiled on other developers&#039; workstations with compatible file versions, saving us the need to compile every file ourselves to make our executables.

Clever idea in principle, except that after a couple of months, everybody&#039;s executables started crashing. It turned out, after a great deal of investigation, that environment variables on one of the developers&#039; machines had influenced a C++ template expansion strategy, but the smart compiler wrapper had happily propagated the resulting incompatible .o files to the central Clearcase server, and therefore to all the other developers&#039; machines.

If I recall correctly, the peculiar choice of solution was to periodically clean out the server-cached .o files. Ah, good times, good times.</description>
		<content:encoded><![CDATA[<p>Back in 1998 I was working in a big team on a complex C/C++ settlement system at a clearing bank in Luxembourg.  We used Clearcase for version control, which presented version-specific trees of files to our workstations using a modified NFS mechanism.  Builds took a long time, so somebody turned on a clever Clearcase feature that wrapped our compiler and magically popped up .o files originally compiled on other developers&#8217; workstations with compatible file versions, saving us the need to compile every file ourselves to make our executables.</p>
<p>Clever idea in principle, except that after a couple of months, everybody&#8217;s executables started crashing. It turned out, after a great deal of investigation, that environment variables on one of the developers&#8217; machines had influenced a C++ template expansion strategy, but the smart compiler wrapper had happily propagated the resulting incompatible .o files to the central Clearcase server, and therefore to all the other developers&#8217; machines.</p>
<p>If I recall correctly, the peculiar choice of solution was to periodically clean out the server-cached .o files. Ah, good times, good times.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dexter</title>
		<link>http://lbrandy.com/blog/2009/02/bugs-of-doom-aka-the-heisenbugs/comment-page-1/#comment-2897</link>
		<dc:creator>Dexter</dc:creator>
		<pubDate>Tue, 17 Feb 2009 09:28:14 +0000</pubDate>
		<guid isPermaLink="false">http://lbrandy.com/blog/?p=540#comment-2897</guid>
		<description>It took me two weeks to figure out why the pricelist.ini (a very basic INI file) that the web application produced for the barcode reader (a WinCE handheld) was not recognized by the gadget&#039;s software.
I could make it read a manually made file, but not a generated one. And they looked identical!
Only with a hex editor did I realize that the gadget ran a little-endian ARM CPU, so it needed the UTF bytes swapped!
A single change in the output encoding did the trick...

Dexter</description>
		<content:encoded><![CDATA[<p>It took me two weeks to figure out why the pricelist.ini (a very basic INI file) that the web application produced for the barcode reader (a WinCE handheld) was not recognized by the gadget&#8217;s software.<br />
I could make it read a manually made file, but not a generated one. And they looked identical!<br />
Only with a hex editor did I realize that the gadget ran a little-endian ARM CPU, so it needed the UTF bytes swapped!<br />
A single change in the output encoding did the trick&#8230;</p>
<p>Dexter</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Paul Betts</title>
		<link>http://lbrandy.com/blog/2009/02/bugs-of-doom-aka-the-heisenbugs/comment-page-1/#comment-2874</link>
		<dc:creator>Paul Betts</dc:creator>
		<pubDate>Tue, 17 Feb 2009 03:35:08 +0000</pubDate>
		<guid isPermaLink="false">http://lbrandy.com/blog/?p=540#comment-2874</guid>
		<description>The way I try to solve these tricky race conditions is through a circular in-memory log - just have it write a 64-bit timestamp, the current thread ID, and a 2-byte tag (like a Pool tag in the Windows kernel). Make sure that the index to the next item is atomic, or else your log will be corrupted; just keep InterlockedIncrement&#039;ing it and mod by LOG_SIZE. 

This way, when the program dies, you can dump out the log and see the sequence of events that led to this happening.</description>
		<content:encoded><![CDATA[<p>The way I try to solve these tricky race conditions is through a circular in-memory log &#8211; just have it write a 64-bit timestamp, the current thread ID, and a 2-byte tag (like a Pool tag in the Windows kernel). Make sure that the index to the next item is atomic, or else your log will be corrupted; just keep InterlockedIncrement&#8217;ing it and mod by LOG_SIZE. </p>
<p>This way, when the program dies, you can dump out the log and see the sequence of events that led to this happening.</p>
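<p>A minimal sketch of the log described above, assuming C11 atomics (<code>atomic_fetch_add</code>) in place of <code>InterlockedIncrement</code>. Apart from <code>LOG_SIZE</code>, every name here (<code>trace_event</code>, <code>trace_log</code>, <code>log_entry</code>) is hypothetical, and the caller is assumed to supply the timestamp and thread ID:</p>

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical names throughout; only LOG_SIZE comes from the comment. */
#define LOG_SIZE 1024u  /* power of two, so indexing stays seamless if the counter wraps */
static_assert((LOG_SIZE & (LOG_SIZE - 1u)) == 0, "LOG_SIZE must be a power of two");

struct log_entry {
    uint64_t timestamp;  /* caller-supplied, e.g. a tick counter */
    uint32_t thread_id;  /* caller-supplied thread identifier */
    char     tag[2];     /* 2-byte event tag */
};

static struct log_entry trace_log[LOG_SIZE];
static atomic_uint next_slot;  /* zero-initialized; only ever incremented */

/* Atomically claim the next slot, then fill it in. Returns the slot index used. */
static unsigned trace_event(uint64_t timestamp, uint32_t thread_id, const char tag[2])
{
    unsigned slot = atomic_fetch_add(&next_slot, 1u) % LOG_SIZE;
    trace_log[slot].timestamp = timestamp;
    trace_log[slot].thread_id = thread_id;
    trace_log[slot].tag[0] = tag[0];
    trace_log[slot].tag[1] = tag[1];
    return slot;
}
```

<p>Each call costs one atomic add plus a few stores. Only the slot claim is atomic in this sketch: a writer that stalls long enough to be lapped can leave a mixed entry, which is usually tolerable for a post-mortem dump.</p>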
]]></content:encoded>
	</item>
	<item>
		<title>By: dar7yl</title>
		<link>http://lbrandy.com/blog/2009/02/bugs-of-doom-aka-the-heisenbugs/comment-page-1/#comment-2873</link>
		<dc:creator>dar7yl</dc:creator>
		<pubDate>Tue, 17 Feb 2009 03:02:15 +0000</pubDate>
		<guid isPermaLink="false">http://lbrandy.com/blog/?p=540#comment-2873</guid>
		<description>I had an exasperating experience trying to track down a heisenbug in a medical display device I was working on. (l33t project!)

The device would randomly lock up, frazzle the display, reboot, or randomly perform uncommanded functions.

There was no way to reproduce the problem.  When traces were enabled, it would go away.  If you watched it for hours, nothing happened, but when you turned your back, BAM!

I realized it was a heisenbug, so I brought out my heisentools.
First, check for uninitialized variables, the cause of 95% of HBs (IMHO).  Well, I found lots of those (oops), but fixing them didn&#039;t fix the problem.

Ok, so next look for race conditions.  The process model was fairly simple, and easy to instrument.  There were a few weird interactions, but no deadlock potential.

Finally, I had a bit of luck, and managed to get a snapshot from just before the crash.  There I noticed that some data wasn&#039;t in the expected location.  That data was supposed to be written from the device driver, under interrupt control.

Once spotted, it was easy to track down where I had neglected to set the data direction bit during a block copy, inside that interrupt routine.  Turns out that the interrupt inherits the DD from the interrupted process.  Most of the time, it is set correctly, but rarely, it is reversed.  Then that errant data goes on to scramble control structures and bring the machine down, usually much later.

The hunt took 3 days, which really messed up the schedule.  On the plus side, the system was vastly improved afterward.  There&#039;s nothing like a weird bug to make you go through your code with a fine-tooth comb.</description>
		<content:encoded><![CDATA[<p>I had an exasperating experience trying to track down a heisenbug in a medical display device I was working on. (l33t project!)</p>
<p>The device would randomly lock up, frazzle the display, reboot, or randomly perform uncommanded functions.</p>
<p>There was no way to reproduce the problem.  When traces were enabled, it would go away.  If you watched it for hours, nothing happened, but when you turned your back, BAM!</p>
<p>I realized it was a heisenbug, so I brought out my heisentools.<br />
First, check for uninitialized variables, the cause of 95% of HBs (IMHO).  Well, I found lots of those (oops), but fixing them didn&#8217;t fix the problem.</p>
<p>Ok, so next look for race conditions.  The process model was fairly simple, and easy to instrument.  There were a few weird interactions, but no deadlock potential.</p>
<p>Finally, I had a bit of luck, and managed to get a snapshot from just before the crash.  There I noticed that some data wasn&#8217;t in the expected location.  That data was supposed to be written from the device driver, under interrupt control.</p>
<p>Once spotted, it was easy to track down where I had neglected to set the data direction bit during a block copy, inside that interrupt routine.  Turns out that the interrupt inherits the DD from the interrupted process.  Most of the time, it is set correctly, but rarely, it is reversed.  Then that errant data goes on to scramble control structures and bring the machine down, usually much later.</p>
<p>The hunt took 3 days, which really messed up the schedule.  On the plus side, the system was vastly improved afterward.  There&#8217;s nothing like a weird bug to make you go through your code with a fine-tooth comb.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Matthew Crumley</title>
		<link>http://lbrandy.com/blog/2009/02/bugs-of-doom-aka-the-heisenbugs/comment-page-1/#comment-2872</link>
		<dc:creator>Matthew Crumley</dc:creator>
		<pubDate>Tue, 17 Feb 2009 02:53:42 +0000</pubDate>
		<guid isPermaLink="false">http://lbrandy.com/blog/?p=540#comment-2872</guid>
		<description>When I first started programming on Linux, I was playing around with a program (I don&#039;t even remember what it did). I named the executable &quot;test&quot;. When I ran it, I was expecting some output, but didn&#039;t get any, so I assumed it was crashing. When I stepped through it in the debugger, everything was fine, and I got the expected answer. After cutting out more and more code, until I *knew* it couldn&#039;t be crashing, I found out that there&#039;s an existing program called test. I didn&#039;t know at the time to use &quot;./test&quot;, so I was running the existing command. When I ran it with gdb of course, it found the local file and worked.

I learned a couple good lessons about Unix/Linux programming from that...</description>
		<content:encoded><![CDATA[<p>When I first started programming on Linux, I was playing around with a program (I don&#8217;t even remember what it did). I named the executable &#8220;test&#8221;. When I ran it, I was expecting some output, but didn&#8217;t get any, so I assumed it was crashing. When I stepped through it in the debugger, everything was fine, and I got the expected answer. After cutting out more and more code, until I *knew* it couldn&#8217;t be crashing, I found out that there&#8217;s an existing program called test. I didn&#8217;t know at the time to use &#8220;./test&#8221;, so I was running the existing command. When I ran it with gdb of course, it found the local file and worked.</p>
<p>I learned a couple good lessons about Unix/Linux programming from that&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Bill King</title>
		<link>http://lbrandy.com/blog/2009/02/bugs-of-doom-aka-the-heisenbugs/comment-page-1/#comment-2871</link>
		<dc:creator>Bill King</dc:creator>
		<pubDate>Tue, 17 Feb 2009 02:31:29 +0000</pubDate>
		<guid isPermaLink="false">http://lbrandy.com/blog/?p=540#comment-2871</guid>
		<description>I wrote at that time a simple program to listen at the top of a directory tree for change notifications, and scan down/copy over to a mirror version. :/ Memory fragmentation fail (after 2 days). Windows (or was it Delphi?) doesn&#039;t like cleaning up the blocks it allocated, so over time, it&#039;d just grow, and grow, and grow. Plugged in gc, and bug &quot;solved&quot;. It was supposed to be a simple one-day-to-write application *rolls eyes*.</description>
		<content:encoded><![CDATA[<p>I wrote at that time a simple program to listen at the top of a directory tree for change notifications, and scan down/copy over to a mirror version. :/ Memory fragmentation fail (after 2 days). Windows (or was it Delphi?) doesn&#8217;t like cleaning up the blocks it allocated, so over time, it&#8217;d just grow, and grow, and grow. Plugged in gc, and bug &#8220;solved&#8221;. It was supposed to be a simple one-day-to-write application *rolls eyes*.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
