This is a war story from my real day job. I work at a start-up company in Pittsburgh that does face detection, tracking, and recognition software. This particular problem involved optimization of the face recognizer.
I'm going to try to simplify our face recognition algorithm as much as possible to get to the point I'm trying to make. If you are reading this to try to learn our algorithmic secrets, you'll be sorely disappointed. What you'll see here is in the first few chapters of a pattern recognition textbook.
Recently, we came out with a brand-new version of our face recognizer. As is usually the case, once the researchers are happy with the accuracy, they hand it off to me to see if I can speed it up. At its core, the recognizer compares every face in one list against every face in another:
for i in range(len(face_list1)):
    for j in range(len(face_list2)):
        answer[i, j] = compare(face_list1[i], face_list2[j])
When you look at this code, notice how threadable this problem is. The comparison operations are completely independent, they are numerous, and each involves a sizable amount of computation. This is a concurrent programmer's DREAM problem, what one might refer to as an embarrassingly parallel problem. So when it came time to parallelize this algorithm, I wasn't worried. Our speed-up should be identical to the number of CPU cores.
So I sat down, as I've done many times, whipped up a little thread pool, and instantiated my thread-safe queue with a mama-process feeding the workers. We generate a to-do list, dump it all on the queue, and have the worker threads go to town on the other end. This is a pretty standard solution.
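In the same pythonic pseudo-code spirit as the rest of this post, the setup looked roughly like this. This is a minimal sketch, not our actual code: the names are mine, and it assumes compare() is a native routine that releases the GIL, so the threads genuinely run in parallel.

# Sketch of the worker-pool setup (illustrative names, not our code).
# compare(), answer, face_list1, and face_list2 are as in the
# pseudo-code above.
import queue
import threading

NUM_WORKERS = 8  # one per core

def worker(todo, answer):
    while True:
        task = todo.get()
        if task is None:          # sentinel: no more work
            todo.task_done()
            return
        i, j, face1, face2 = task
        answer[i, j] = compare(face1, face2)
        todo.task_done()

todo = queue.Queue()
threads = [threading.Thread(target=worker, args=(todo, answer))
           for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

# The "mama" feeder: dump the entire to-do list on the queue.
for i, face1 in enumerate(face_list1):
    for j, face2 in enumerate(face_list2):
        todo.put((i, j, face1, face2))

for _ in threads:                 # one sentinel per worker
    todo.put(None)
for t in threads:
    t.join()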
Before lunch I was ready to fire it up, and I saw the most beautiful thing a concurrent programmer can ever see:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+   COMMAND
32603 louis     16   0  246m  24m 4620 S  800  4.8   0:09.12 recognize

Just in case you aren't familiar with the output of top, this shows my program using all 8 CPU cores (800% CPU on a dual quad-core machine). Problem solved. I went to lunch.

When I got back, the celebration was over: despite pegging all eight cores, the actual speed-up was nowhere near 8x. The threads were fighting over something, and that something turned out to be the cache. To see why, look at what each comparison actually does:
# pythonic pseudo code
for face1 in face_list_a:  # on the order of 1000
    for face2 in face_list_b:  # on the order of 1000
        for feature in features:
            feature_list[feature] = extract_feature(feature, face1, face2)
        answer[face1][face2] = classify(classifier, feature_list)
This may require some explanation for people without any pattern recognition background. First, every face is made up of a set of features. Think of a feature as some measurement (like eye color). How are features chosen or generated? Go take that pattern rec class. We extract all of the features from our faces and dump them into our classifier, which determines the likelihood of a match. The trouble is in that inner loop: every extract_feature() call pulls its own code and data through the cache, evicting whatever the previous feature was using, and with a million face pairs the same data gets dragged back in over and over. The fix was to reorder the loops so that the feature loop is outermost:
for f in features:
    for face1 in face_list_a:  # on the order of 1000
        for face2 in face_list_b:  # on the order of 1000
            feature_lists[face1][face2][f] = extract_feature(f, face1, face2)

for face1 in face_list_a:
    for face2 in face_list_b:
        answer[face1][face2] = classify(classifier, feature_lists[face1][face2])
This means we use the same feature for every input pair first, and once we are done with it, we move on to the next one. This allows each successive comparison to take advantage of the caching done by previous comparisons: the code and data for the current feature stay hot in cache instead of being re-fetched for every pair.
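You can see the same effect in miniature without any face-recognition code at all. This standalone demo is my example, not anything from the recognizer: it sums a large NumPy array twice, doing identical arithmetic both times, with only the memory-access order changed.

# Standalone demonstration of access-order cache effects (illustrative
# only). NumPy arrays are row-major by default, so walking rows is
# sequential in memory while walking columns strides across it.
import time
import numpy as np

a = np.random.rand(5000, 5000)  # ~200 MB, far larger than any cache

start = time.perf_counter()
total = sum(a[i, :].sum() for i in range(a.shape[0]))  # sequential access
print("row-major:    %.2fs" % (time.perf_counter() - start))

start = time.perf_counter()
total = sum(a[:, j].sum() for j in range(a.shape[1]))  # strided access
print("column-major: %.2fs" % (time.perf_counter() - start))

Same work, but the strided version runs noticeably slower because almost every access misses the cache.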
Not surprisingly, the above algorithm improved the performance of the single-threaded recognizer by a factor of 6. It also removed the cache bottleneck: with the algorithm no longer cache-bound, our speed-up factor equaled our core count. Once we combined all of the optimizations, we ended up with a 25x speed-up for the 8-threaded version on our octo-core machine.
Do you see which abstraction died? The compare(face1, face2) function.
The loop reordering needs the whole set of input faces up front to work; a single compare(face1, face2) call only ever sees one pair, so it is stuck with the slower, cache-inefficient ordering. It would be natural for a programmer to decide that compare(face1, face2) is the right interface. Unfortunately, implementing multiple comparisons by making multiple calls to that single compare() function leads straight back to the misuse of cache.
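If I were to sketch the replacement interface (hypothetical names; our real API differs), it would take both lists at once, leaving the implementation free to order the work cache-efficiently behind the scenes:

# Hypothetical batch interface (illustrative; not our actual API).
# Because it receives whole lists, it can hoist the feature loop
# outermost and keep each feature's code and data hot in cache.
def compare_all(face_list_a, face_list_b, features, classifier):
    # one feature dict per (face1, face2) pair
    feature_lists = {(f1, f2): {} for f1 in face_list_a for f2 in face_list_b}

    # cache-friendly order: same feature across every pair first
    for feature in features:
        for f1 in face_list_a:
            for f2 in face_list_b:
                feature_lists[(f1, f2)][feature] = extract_feature(feature, f1, f2)

    return {(f1, f2): classify(classifier, feature_lists[(f1, f2)])
            for f1 in face_list_a for f2 in face_list_b}

A one-pair compare() can survive as a convenience wrapper over compare_all(), but it can no longer be the primitive everything else is built on.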
Writing cache-friendly algorithms isn't necessarily easy. Usually, it involves breaking some nice abstractions. This cache-friendliness problem is exacerbated, greatly, in the multi-core generation, where cache contention can act as a hidden serializing force: you are doubly punished for misusing the cache in this new world. Add to this the fact that writing cache-friendly, highly scalable, concurrent algorithms is a black, black art. For an embarrassingly parallel problem like mine, the solution might be a bit messy, but it'll work. For more difficult problems, good luck.