How We Made Our Face Recognizer 25x Faster
This is a war story from my real day job. I work at a start-up company in Pittsburgh that does face detection, tracking, and recognition software. This particular problem involved optimization of the face recognizer.
I’m going to try to simplify our face recognition algorithm as much as possible to get to the point I’m trying to make. If you are reading this to try to learn our algorithmic secrets, you’ll be sorely disappointed. What you’ll see here is in the first few chapters of a pattern recognition textbook.
Recently, we came out with a brand new version of our face recognizer. As is usually the case, once they are happy with the accuracy they hand it off to me to see if I can speed it up.
The Problem
Comparing faces tends to grow quadratically with the number of input faces. When you imagine comparing 1000 faces to another 1000, that equals a million comparisons. Here’s the pseudo-code of that operation:
for i in range(face_list1):
for j in range(face_list2):
answer[i,j] = compare(face_list1[i], face_list2[j])
When you look at this code, notice how threadable this problem is. These comparison operations are completely independent, numerous, and sizable amounts of computation. This is a concurrent programmer’s DREAM problem. So when it came time to parallelize this algorithm, I wasn’t worried. This is what one might refer to as an embarrassingly parallel problem. Our speed-up should be identical to the number of CPU cores.
So I sat down, as I’ve done many times, whipped up a little thread pool, instantiated my thread-safe queue with a mama-process for feeding the workers. We generate a to-do list, dump them all on the queue, and have the worker threads go to town on the other end of that queue. This is a pretty standard solution. Before lunch I was ready to fire it up and I saw the most beautiful thing a concurrent programmer can ever see:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 32603 louis 16 0 246m 24m 4620 S 800 4.8 0:09.12 recognize
Just in case you aren’t familiar with the output of top, this is showing my program using all 8 CPU cores (dual quad-core). Problem solved. I went to lunch.
Not So Fast
A funny thing happened over lunch. My test finished running and the results weren’t what I expected. Having an embarrassingly parallel problem and a CPU pegged at 800%, you’d expect an 8x speed-up. That’s not at all what happened. We ended up with about 3x improvement. Since our CPU is pegged at 800%, it’s not a synchronization problem. Having been down this road a time or two, I was fairly certain of the culprit but took a moment to prove it. It wasn’t due to file I/O or anything like that (since none of that stuff was present). With a little use of oprofile (which deserves its own post, one day), I could tell the time was spent in user space, and not the kernel, which removes things like page zeroing or page faulting. Cache misses, on the other hand, are “assigned” to the user process.
Our code was cache bound.
The Inner Loop
I’m going to strip out all of the layers and show you the entire algorithm, simplified. Take my word for it, however, that in our ordinary code base, these loops were separated by numerous layers of abstraction, function calls, and a gigantic mass of book-keeping. It should be fairly trivial to understand what the algorithm is doing. How and why it works, though, are topics for a pattern recognition class.
for face1 in face_list_a: # on the order of 1000
for face2 in face_list_b: # on the order of 1000
for feature in features:
feature_list[feature] = extract_feature(feature, face1, face2)
answer[face1][face2] = classify(classifier, feature_list)
This may require some explanation for people without any pattern recognition background. First, every face is made up of a set of features. Think of a feature as some measurement (like eye color). How are features chosen, or generated? Go take that pattern rec class. We extract all of the features from our faces and dump them into our classifier which is able to determine a likelihood of a match.
The Solution – Becoming cache-friendly
The reason this is thrashing cache was pretty straight forward for me. The inner loop is iterating over features. Since each feature computation is relatively expensive, “iterating” over features is awful for cache performance. The solution is a very simple reordering of the loops:
for feature in feature_list:
for face1 in face_list_a: # on the order of 1000
for face2 in face_list_b: # on the order of 1000
feature_lists[face1][face2][feature] = extract_feature(feature, face1, face2)
for face1 in face_list_a:
for face2 in face_list_b:
answer[face1][face2] = classify(classifier, feature_lists[face1][face2])
This means we use the same feature for every input first and once we are done with it, we move on to the next one. This allows each successive comparison between faces to take advantage of the caching done in previous comparisons.
Not surprisingly, the above algorithm improved the performance of the single threaded recognizer by a factor of 6. It also removed the cache bottleneck. With the algorithm no longer cache bound, our speed-up factor equaled our core count. Once we combined all of the optimizations, we ended up with a 25x speed-up of our 8-threaded version on our octo-core machine.
Dead Abstractions
When I put the three loops next to each other, it appears like a textbook case of loop reordering. Sometimes, however, reordering loops isn’t as easy as you’d like to believe. In this case, these loops were scattered over multiple levels of abstractions.
Do you see which abstraction died? The compare(face1, face2) function.
The loop reordering requires a set of input faces to work properly so that a single compare() function goes back to the slower, cache-inefficient algorithm. It would be natural for a programmer to decide that you need a compare(face1, face2) function. Unfortunately, however, implementing multiple comparisons by making multiple calls to a single compare() function leads to the misuse of cache.
Writing cache-friendly algorithms isn’t necessarily easy. Usually, it involves breaking some nice abstractions. This cache-friendliness problem is exacerbated, greatly, in the multi-core generation where cache performance can act as a hidden serializing force. You are doubly punished for misusing cache in this new world. Added to this is the fact that writing cache friendly, highly scalable, concurrent algorithms is a black, black art. For any embarassingly parallel problem, like mine, the solution might be bit messy but it’ll work. For more difficult problems, good luck.
October 27th, 2008 at 10:33 am
You’re using Python for the pseudocode; I presume that the original program is written in C? It’d be interesting to see if you can do measurements like that for a high-level language like Python.
October 27th, 2008 at 11:20 am
Don’t you feel bad about what your doing? Can you see how facial recognition has huge potential for misuse?
Whenever I read a story about a new invasive technology or killer drones, I always wonder about what goes through the minds of those who created it.
I doubt it’s “I Am Become Death, Destroyer of Worlds”.
Is it “I Am Become Rich”?
October 27th, 2008 at 11:26 am
Luke: Don’t worry, facial recognition works much better at extracting money from clueless bureaucracies than at recognizing faces.
October 27th, 2008 at 11:27 am
@Luke: get a life
October 27th, 2008 at 11:29 am
Yeah, I had the same thought as Luke. I feel fortunate that the most unethical thing I’ve done as a programmer was to help develop systems that could be used to send junk mail. I’ve had offers, though, to work on software in the field of \law enforcement\, which seems very much like a double-edged sword…
October 27th, 2008 at 11:30 am
I would just like to point out here that the ethical abuses being raised about facial recognition cuts both ways for all kinds of technology. Facial recognition could as easily be used for refugee management as it can for “big brother activities”.
In fact, if you back up a step and consider this rationally, you’ll see that ALL technology has the potential for this abuse, right down to the most basic discovery/invention: fire. Fire can be used to cook food and make accessible nutrition which would otherwise not be safe to consume (e.g. potatoes). Fire can also be used to burn down your rivals home/village/city as has been demonstrated everywhere from Africa to Germany.
I was following this thread to see how this conversation developed because I also was interested in Python vs. C for implementation and I wanted to see the response(s) to the ethics argument. But in the process, I see that the author has chosen to remove ethical objections so far, for whatever reason. I think that’s a mistake. I may only be a temporary visitor from reddit, but you could do us all a service by welcoming real discussion rather than dismissing it.
Thanks!
October 27th, 2008 at 11:35 am
Hi,
I work at a company that sells imaging API and hardware acceleration cards. I have read through a few of your posts and find them very beneficial to my own code. Just to let you know to keep up the good work and selfless sharing of your knowledge. I will be checking back often
October 27th, 2008 at 11:43 am
The ethics concern me as well. Any programmer worth his salt knows the potential for abuse of these kinds of systems is fantastic, and once it exists, it will be used. The problem isn’t just that it -can- happen here, the problem is that if it does, it’s going to be much, much more worse due to this kind of development.
Not that you stopping would stop it. Not that it is stoppable. Maybe we should be thinking about what the technological answer is to some of these questions. Police having to wear cameras would be a start, and the equivalent of that sort of transparency should extend in all directions.
October 27th, 2008 at 11:46 am
Thanks, nice story!
October 27th, 2008 at 12:10 pm
I used to work on a defense contract that did multiple kinds of biometric comparison and matching. To all those who think it is unethical: you are all hypocrites. Every one of you. Almost every common technology we now enjoy has been highly furthered by military funding. Genetically enhanced crops, jet propulsion, the *Internet* – all greatly enhanced from military spending on research and development. So if you are on the Internet or have ever flown in a modern airplane? Stop talking about the sky falling. You are running on hypothetical fumes, and they are annoyingly close-minded.
October 27th, 2008 at 12:12 pm
@Orestis Markou
Yes, the real code is all written in low-level languages. I chose Python-esque pseudo-code because I’m a big fan of Python. I’m not sure which ‘measurements’ you are referring to. Are you referring to counting cache misses and things like that? I’m not terribly familiar with using oprofile in languages like python (which have considerably more complicated run-times). I’m sure using oprofile with python is doable but I have never done it.
@everyone else,
Yes, there are grave ethical concerns working on face recognition. We talk about them frequently. I just added that to my list of topics to write about. The short version of my opinion is that those of you worried about face recognition are worried for very good reasons but the CSI/alias-effect in facial recognition is strong and you don’t have as much to worry about as you think. I’ll cover my opinion in detail in some future post because it’s a good topic.
October 27th, 2008 at 12:57 pm
This is one step closer to helping SkyNet achieve sentience.
October 27th, 2008 at 1:01 pm
Nice article. Are you using assembly code for the inner loops? The SSE code generated by gcc isn’t that great, so that might be another avenue for optimization.
October 27th, 2008 at 1:41 pm
Suppression of knowledge rarely helps to better humanity because even when you’ve eliminated a particular means of performing some technical feat, you still havenot eliminated the incentive for it. Suppression only serves to hinder, not prevent – not to mention it would be most effective against those who are law-abiding or complacent anyway.
Take for example, genetic research and the question of human cloning. We may beg our politicians to prohibit it, but it will go on nonetheless under less restrictive governments! Researchers who find no avenue to perform their work in this country, will likely go elsewhere where they are able.
Let us instead foster an environment with ethical, productive outlets for research so there is an incentive to do good when there would otherwise only remain an opportunity to do the opposite.
October 27th, 2008 at 2:39 pm
Greetings.
Would you say that this is only marginally different than most database issues, where reordering the way data sets are joined (large set picking small sets vs. small sets picking large sets) can drastically increase the performance of the slow query?
Just trying to find a closer analogue I can relate to. Thank you for the excellent lunchtime reading!
October 27th, 2008 at 2:52 pm
Nice post. Your presentation of a fairly complex subject comes across straightforward and easy to understand.
Re: ethics in facial recognition – when’s this stuff going to cheap and available so I can organize all my photos? Maybe even scarier/convenient, when’s Facebook going to buy it and start auto-tagging photos.
October 27th, 2008 at 3:24 pm
Wouldn’t a 1000 faces compared be more like 499,499 as you only need to compare each face to each other face once and not itself either; think of a 1000×1000 triangular grid. It looks like in your pseudo code that you are comparing every item to every other item twice and that is definitely not needed.
So if the loops where
for feature in feature_list
for face1 in face_list_a
for face2 in face_list_b
change it to something like (psuedo code)
for feature in feature_list
for n = 0 to face_list_a.size( )
face1 = face_list_a[n];
for m = n+1 to face_list_b.size( )
face2 = face_list_b[m];
Unless my understanding is wrong, but something like this would cut your problem size in half whether or not it was threaded.
October 27th, 2008 at 3:31 pm
Have you looked at GPGPU, CUDA or OpenCL for even more cheap performance gains?
October 27th, 2008 at 3:39 pm
oh yeah, it sounds like you rolled your own kernel vector machine. You should check out these two sites for better performance/accuracy:
http://kernel-machines.org/
http://web.mit.edu/dhlin/www/research/index.html
October 27th, 2008 at 3:44 pm
@Ajay: Absolutely. We use hand-optimized SSE whenever its needed & applicable especially in the image processing.
@Pat: Our recognizer is still very much ‘beta’ but right now its good enough for that application. The problem is really the integration step.
@Darrell: That would be absolutely true only if the lists were the same. In my example above I said they were different lists precisely to avoid that optimization (because it would have confused from the point I wanted to make about cache friendliness).
@mycall: I’ve done a fair bit of work with GPGPU albeit with the previous generation nVidia cards and using our face tracking, not recognition. It’s on our list to revisit. I haven’t looked, in depth, at CUDA or OpenCL yet but I will when I revisit the topic.
October 28th, 2008 at 9:23 am
@Pat: You can use it to organize your photos now: http://ub0.cc/7Y/0c – thanks to google one scary company if you want to go into ethics…
October 28th, 2008 at 10:35 am
Wouldn’t a checksum (hash) give you a constant time comparison?
My big O notation is very bad, but it should be o(n).
You calculate the hash value of all your faces and afterwards you can compare the hashes.
October 28th, 2008 at 11:50 am
Checksums do operate in constant time, but that’s O(1), not O(n) which is linear and much worse:
http://en.wikipedia.org/wiki/Big_O_notation#Orders_of_common_functions
October 29th, 2008 at 4:58 pm
Yay for python. Our whole app is written in it. yay for face recognition, who wants to carry a stupid key around anywhere in the first place. i wish my car would just start when i walked up to it.
October 29th, 2008 at 5:07 pm
@big curt: Then car thieves would be cutting off people’s heads to get into their cars.
October 29th, 2008 at 5:18 pm
Don’t think face recognition will ever be as good as a good password
October 29th, 2008 at 5:19 pm
You can’t do a checksum on a face…
October 29th, 2008 at 5:25 pm
Just wondering why you don’t store the features of the face that are extracted and then just extract from the compared face to those extracted features? I have never done any facial recognition stuff so this may not even be possible.
October 29th, 2008 at 5:28 pm
smarttowers,
The terminology gets a bit confusing. In normal English, “features” tend to be something on a particular face, but in pattern recognition the term is much more general to be anything you can classify. In this case, the “features” are a single measurement of difference between the two faces. Any math we can do on the face, alone, we do once and definitely don’t repeat.
So, in other words, features in this case are “joint” and depend on both faces. We generate a “feature” between the two faces which is really a measure of how similar they are. Every feature, then, is measuring the similarity of the two faces in a different way. So once you have all of these features, you can toss them into a classifier.
Now how do you “measure” a feature and how to measure “difference”… these are all substantially more complicated problems.
October 29th, 2008 at 5:35 pm
Don’t they teach algorithms any more? red black tree dude. or voxels.
October 29th, 2008 at 5:59 pm
Have you tried logarithms?
October 29th, 2008 at 7:13 pm
OHHH WE SHOULD NOT DESIGN NOR BUILD CARS, THEY CAN BE MISUSED TO KILL PEOPLE!!!! NOOOO
October 29th, 2008 at 7:25 pm
“As is usually the case, once they are happy with the accuracy they hand it off to me to see if I can speed it up.”
Sounds like a job I’d love. Compare to my current, which is “they hand it off to me to see if I can fix it.” Hell, any work I spend accelerating a function does not feel like work at all, it feels like crack (when crack feels good, you know). Sometimes, I cheat and spend a little time tweaking a sluggish function. Shaved off hours of processing in some cases.
Great blog, Louis. Welcome to my favourites
October 29th, 2008 at 8:44 pm
What?!?!?! YOU MEAN METAL GEAR LAUNCHES NUKES?!?!?!?!?!
Who cares? Leave the issues like that to the people in charge.
And I will take a pattern revognition class now.
October 29th, 2008 at 10:37 pm
Thanks for sharing the analysis and resolving of your problem. Bookmarking you for the day I might actually get to try this type and level of programming.
As far as the ethics, etc… VPLATT is absolutely correct; ALL technology can be used for good or evil… the general technology itself is not, just the specifics of how it is implemented and for what motives. Ditto for societal constructs; ie. corporations, governments, etc…
What makes things “good” or “evil”, are the choices made by the people controlling those “things”.
Of course a good percentage of our neural nets are now conditioned (by self and others) to not accept reality as it is, and instead we filter events through our experiences as some twisted thing much more complicated than it really is.
I wonder if we’ll be able to program our AIs to avoid that and stick with “what is”?
Facial recognition by the way, is simply ONE routine our brains know how to do at birth.
HA! Duh! See! The technology that was decried by others as being inherently EVIL is in fact 100% NATURAL existing in (almost) EVERY HUMAN baby at birth!
October 29th, 2008 at 10:41 pm
I don’t think it can be emphasized enough that technology and even many of its applications are ethically neutral. One thing that I’ve missed in this particular discussion is someone pointing out that technology WILL be mishandled and abused if good men are unable or unwilling to “dirty” their hands with it.
“All that is necessary for evil to triumph is for good men to do nothing.” — attrib. Edmund Burke
“I am only one, but still I am one. I cannot do everything, but still I can do something; and because I cannot do everything I will not refuse to do the something I can do.” — Edward Everett Hale
October 29th, 2008 at 10:42 pm
“Since each feature computation is relatively expensive, “iterating” over features is awful for cache performance.”
Are you referring to the instruction cache or the data cache?
October 30th, 2008 at 6:54 am
Working in image processing myself… you got me to subscribe your feed!
Thanks for sharing – I’ll soon be heading down that road, optimizing years of “single threaded” code to make use of newer multi-core CPUs.
October 30th, 2008 at 7:34 am
I’m a Computer Science Major, 3rd year, absolutely fascinated by facial recognition.
A: I don’t suppose you can mention what company it is you work for? And are you interested in hiring summer students?
B: What books can you recommend for pattern recognition (especially facial)? My school works hard to continue to dumb down my program (thus increasing tuition income), so I’m now in the position of needing to learn some of this stuff myself.
October 30th, 2008 at 7:50 am
Mabbo,
My ‘about’ page has a link to my place of work. It’s called Pittsburgh Pattern Recognition and it’s a CMU spin-off (www.pittpatt.com).
And yes, as a matter of fact, we are looking for summer interns. Feel free to send a resume to careers@pittpatt.com.
As for books, I can’t make a good recommendation. The book I used in grad school was “Pattern Classification” by Duda, Hart, and Stork. To be perfectly honest, I have no idea how it would work as a self-learning textbook. My suggestion is to hit up the websites of pattern rec/computer vision grad classes at “big” universities, check out their books, and then hit up amazon to see if any other self-learners have reviewed them. You will need to have taken math up to linear algebra to really understand it.
Hope that helps.
October 30th, 2008 at 11:12 am
First of all, excellent post about a highly complex issue! Keep that text coming, because I’m subscribing it!
@Mabbo: have a look at the Stanford Lectures on Machine Learning (includes the very good videos of the classes), freely available at:
http://see.stanford.edu/SEE/courseinfo.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1
Gt
October 31st, 2008 at 7:55 am
First: Super nice quick understandable read. I had not thought about “cache bound” on code performance in parallelized code. I will certainly follow this blog a bit longer to see what else pops up.
Second: As a physicist who worked on laser surgery research (in the 80s), I felt I was doing good for mankind. However, I should point out that the Armed Forces paid for this research. Ultimately, the reason this was funded was so that they could do a better quicker job of repairing their “soldiers”. It’s a twisted notion – doing good to do bad, or in the case of war, doing bad to do good (rarely if ever). All technologies have some potential for good and some potential for bad. Sometimes the potential for good is worth the risk of the bad. Ultimately, developing and “enforcing”
a strong ethical framework is what is needed for all human endeavors, technology free or not. The potential for abuse of a technology should always be considered. It’s almost never a simplistic black and white case, and treating it that way will not enable you to make progress in warning people about the potential abuse.
Third: I am not absolutely sure anyone should be anonymous in public. (Think back to small towns where everyone knew everyone else, and the “strangers” were clearly identifiable.) Yes, this could suppress some forms of political dissent, but that’s why we have The First Amendment.
I’m not certain I understand exactly how facial recognition is anything more than an enhancement in public presence of people who already exist. Not all “police” are bad and the bad ones can already do plenty of bad with the technology they already have. Their behavior is a question of morality and not of technology. The real problem might come when people relinquish their judgment to a machine.
November 1st, 2008 at 2:16 pm
Nice post. If you were using a language like C++, you might want to look at an article by Herb Sutter on Dr. Dobb’s Journal regarding cache issues and how to help your parallel/multi-threaded application scale in newer multi-core systems.
You might also want to think about “zipping” through the data and storing the features to be compared adjacent to one another (using tuples or a struct that contains both elements) and then looping through the zipped tuples one by one. This is also cache-friendly and allows you to write a better more readable loop (of course, this means you’ll need to copy features instead of refer to them by pointer).
You might also get better performance out of an optimizing compiler (something better than GCC, like Intel’s C++ compiler (again only if you’re using C++)).
November 6th, 2008 at 4:19 am
Thanks for correcting my big O notation.
Why can’t you do a Checksum on a face?
I guess you could do a checksum on the proportions of a face. If you make that general enough, you should be able to group people so you can find a number of similar people in one go. Then do the same on a more specific feature of the face, going down three or four levels (face, nose, eyes, ears…). The last step would then be to cycle through the last subgroup an see their specifics.
My 0,02€
January 1st, 2010 at 10:44 pm
Every new technology is a double-edged sword when placed in the hands of humanity. It’s not the technology that is the problem. It is simply human nature. There will always be those who use something for good and those that use the same thing for evil.
If we didn’t develop new technologies we wouldn’t be human, because our curiosity and ingenuity is part of what makes us human. Evolution results from need. For further explanation refer to Maslow’s hierarchy of needs (Wikipedia has an article).
A good example of the argument between whether something that could be developed should be developed is the case for cloning. Ian Malcolm in the Jurassic Park movie made this argument, but he is a little hypocritical as a mathemetician using all the combined knowledge of those mathemeticians before him to develop his chaos theory, with no concern for how his theory will be used (which may have hypothetically been the new doomsday weapons of tomorrow).
Technology is a natural human pursuit, and saying that one technology is more evil than another is just demonstrating ignorance as to how each technology could possibly be used. It would be easy to argue that nuclear weapons are worse than farming crops, but the technology tree leading to nuclear weapons would not have been possible without the development of fertilizers. All technology development is interrelated by ancestor technologies. If you oppose one technology, you may as well oppose all technologies, but to do that without being a hypocrit you would have to forfeit all use of technology (including the use of a PC and the Internet required to post in this forum).