Parallel programming is hard. Right?
TweetIt’s a fairly oft-repeated mantra of programmers. Parallel programming is hard. Right? Right! Almost everyone will tell you so! I was at a job fair not too long ago and even one of the students told me so. Everyone knows parallel programming is hard.
Sadly, though, I disagree with the vast majority on why. The most common answer, by far, will involve some combination of the following words: deadlock, livelock, synchronization, shared memory, race condition, thread safety and so on. You can find umpteen blog posts explaining how difficult it is to avoid race conditions, and how some of the most pernicious bugs come from synchronization issues. Yes, indeed.
(note: this post is mostly about data-parallel programming, or other parallel optimizations for computational performance purposes. There are other reasons to use concurrency, but I’m not sure my argument will hold in most of those cases.)
Avoiding race conditions is hard, but it’s not that hard.
There, I said it. I think most people, especially those who’ve only learned about parallel programming in a classroom setting, believe that race conditions, synchronization, and dead-locks are the crux of the parallel programming problem. There are a whole bunch of really solid reasons for me to say they are wrong:
- You can almost always use a simple locked data-structure for synchronization fairly painlessly (and with a bit of practice, you can usually write one fairly painlessly, too).
- You can almost always abstract away the synchronization primitives and unit-test them really well.
- There are tons of good off-the-shelf and well-tested synchronization libraries for most platforms, and most languages.
- The truly difficult synchronization problems have many, many smart people working on them, and these are things you should not be reinventing.
- And last but not least, and most importantly, no matter how difficult your synchronization issues are, I can assure you that bigger problems are lurking.
Why parallel programming is actually hard
There is probably no one correct answer to this question, but most will probably involve words like: granularity, throughput, response-time, cache size, memory bottleneck, and so on.
Let’s start with the core of the problem: granularity. When parallelizing some computation, you must decide “at what level” to parallelize the problem. Choosing the correct granularity is almost never immediately obvious and, almost always, there is more than one correct answer. Trade-offs abound.
I’ll invoke a problem from my daily work that will help explain this nicely. A video file is nothing but a sequence of images. If you want to perform face detection on a video stream, you can parallelize in several ways. A coarse grain approach might be to give each worker its own image, and collect the results at the end. A finer grained approach might throw all of the threads at a single image hoping you can process that image n-times faster.
Here are some general (yet frequently broken) rules of thumb:
1. Finer grained algorithms will thread less efficiently than coarse grained. Usually, it’s because any natural parallelism in your problem tends to improve as you become more coarse grained (do 20 images in each thread, instead of one). Furthermore, coarse grained algorithms do less communication and synchronization per unit of work. This means coarse grain approaches will tend to have better throughput (work done per unit of time) and less administrative overhead. There is an important caveat here, if you are too coarse grained, or if your jobs vary in size too much, you run smack into scheduling problems that will certainly leave some workers with nothing to do.
2. Finer grained algorithms will have better response times. In this case, response time means from submission of a work unit, until the result is ready. Since a fine grained algorithm is more frequently synchronizing the state, you have more opportunities to prioritize things that need to get done, and then declare them done.
3. Finer grained algorithms will tend to use less cache/memory. This is because, in general, less data being worked on.
4. Finer grained problems tend to be more painful to maintain. You will assume parallelism along some dimension that really needs to be serialized for some new feature.
Let’s go back to my real-world example. Sometimes throughput is king, like when you are processing a movie file. In this scenario, sending a frame to each worker thread and collecting the results later is fairly easy to implement, and threads extremely well. Other times, response time is king. For example, when tracking a face with a pan-tilt-zoom camera. Sending each frame to a different worker is completely inappropriate for this use-case because this doesn’t improve response time at all (each thread only has one worker, still). As soon as a frame is submitted, you need an answer as quickly as possible to have stable control. In this case, it is far better to throw all of the worker threads at a single frame (thread at the sub-frame level), and hope to improve the computation of a single frame.
So what do you do when the frame-level approach threads more efficiently than the sub-frame level treading? Well, you’ve run into a situation where one approach is better for throughput, and the other for response time. If you want to support both use-cases, you need both approaches (and some good documentation!). I think this demonstrates my original point that bigger problems than synchronization are lurking. No matter how difficult your synchronization problems are, I just doubled them.
Trade-offs in every direction
So lets say you can sort your way through that nest of trade-offs to find a place that gives you the right balance of response-time and throughput, stays within your memory requirements, and threads at an acceptable efficiency, you are home free! Right? Wrong. Now add in the fact that you cannot predict the future, and you don’t know what features will be needed (and what will then need to be serialized), how your response-time vs throughput needs might change, or how you will ever maintain this mess.
If you’ve survived this far, just remember the computer has one last trump card to play on you. All of these threads run on all real machines, with real memory bandwidths, real caches, connected over networks with fixed bandwidth. Each of these constraints has the ability to trip up your beautifully parallel problem and make it painfully serial. Let’s not even mention the idea that it might need to run on different machines with different capabilities.
In truth, all of these difficulties come from the same source. Mapping a non-trivial problem onto a parallel machine optimally borders on the impossible when you fully consider all of the different requirements along all of the different dimensions.
And so here we are. When I think of parallel programming, synchronization is difficult, but managable. It is not, however, what makes parallel programming hard.
Tweet
twitter
February 24th, 2010 at 10:42 am
Hi Louis,
You have some great stuff to say, but I respectfully disagree with your conclusions. Granularity is indeed very important on concurrent and parallel computing, but I think you’ve hand-waved a bit too much with the rest saying it’s not that hard.
The reality is that languages abstract enough away from you to prevent a full understanding of the concurrent usage of resources in a system, such that you end up thinking everything is far easier than it actually is.
I submit to you and your readers my own thoughts on the subject when I compared Java to Google’s Go and why concurrency is a hard problem to crack, even with the right tools and abstractions:
http://www.lessonsoffailure.com/software/google-go-not-getting-us-anywhere/
http://www.lessonsoffailure.com/software/googles-go-not-getting-us-anywhere-part-2/
February 24th, 2010 at 11:16 am
To start with, Java and Google Go are not “the right tools” for concurrency.
You must have programming models that enable you to express parallel code at the same level as you express sequential code. You must have communication primitives built into the language that provide safe communication and synchronization between concurrent threads.
Look at process calculi such as CSP, CCS and pi-calculus and languages such as occam and more recently XC (from XMOS). With the right languages built on the right models, parallel processing isn’t that hard. With the languages targeted towards the right architectures (that are designed for concurrency and therefore provide deterministic performance on common concurrent patterns) it’s even easier.
You still need to know your application, end user, and data-set well. But that has always been the case – sequential or parallel.
Per Brinch Hansen is another brilliant reference – check out his work on SuperPascal etc.
February 24th, 2010 at 11:18 am
Dave, I think what he’s saying is not that these problems are trivial, but that they’ve been (by and large) solved already. The difficulty of parallel programming arises in problems that can not possibly be answered ahead of time. Depending on your particular project, you have to make extremely difficult decisions that significantly impact your program’s performance.
February 24th, 2010 at 11:37 am
They may have been solved in other (far more obscure) languages, but not by any mainstream languages, which are the bulk of how people are trying to work these problems out in real applications. That’s my point. It’s not just an abstraction issue, it’s a tool and understanding issue as well.
February 24th, 2010 at 11:44 am
@Jonathan: “To start with, Java and Google Go are not “the right tools” for concurrency.”
If you read my posts, you’d see that I agree with you…
February 24th, 2010 at 1:26 pm
I’m not sure if this is intentional or not, but there’s one area that’s neglected here: situations where threads might not all be doing the same thing. Rather than having each thread do the entire amount of work for a single unit of work, it’s possible to have each thread do a different step in that unit of work.
This is an alternative way of doing things that (in my experience) tends to be more efficient for very IO-bound operations. Also (again in my experience), granularity tends to not matter much in these models.
But then again, the problems that I’m solving sound to be very different from the ones you’re taking on.
February 24th, 2010 at 2:27 pm
The hardness of concurrent programming has been demonstrated in “Threads Are Evil” at http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.html
February 24th, 2010 at 2:51 pm
Please take a look at
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf
The “13 dwarves” of parallel computing. It pretty much sums it up.
February 24th, 2010 at 8:30 pm
I think the real problem isn’t if concurrency is hard or not, it’s more that you have to start from scratch with a language or code base that has the right properties to properly be concurrent. Most code in popular languages and libraries is not even thread safe, let along having more advanced properties such as those that functional or concurrency oriented languages guarantee from the get go.
February 25th, 2010 at 4:30 am
I kind of agree with you:)
http://www.components4developers.com
February 26th, 2010 at 2:44 pm
Stick with C and pthreads and semaphores. Debugging is the hardest part for me with multi-threadded programming. Need to find a good debugger that can dig into multiple running threads.
August 9th, 2010 at 2:54 pm
It’s not that hard? When performance isn’t matter at all, it’s not hard at all. But I somehow agree that unpredictable situation or unknown runtime information are some important causes to parallel programming is hard.
Domain knowledge would help speculating more accurately, but it’s still limited in many cases.
December 12th, 2011 at 8:54 am
Parallel computer programs are more difficult to write than sequential ones because concurrency introduces several new classes of potential software bugs, of which race conditions are the most common. APIs and tools in .Net 4.0 can make parallel programming much simpler, improving application performance.