{on programming and the internets, every monday}

Go-To-Market Plans (Gone Awry)

I work for a tech start-up. I’ve taken a very active interest in the business half of our start-up company. Consequently, I try to keep up with start-up news. I read start-up blogs. I follow other start-ups. One of my absolute favorite guilty pleasures is learning about a particular start-up’s “business guy’s” version of their go-to-market plan.

A go-to-market plan is an important part of any start-up company, whether the plan is formalized or not. You work for awhile building something you can sell and then you need to figure out how to get the ball rolling in your market of choice. Sometimes this is a simple problem. Sometimes it’s incredibly difficult. I don’t envy people who need to come up with solutions when this is a difficult problem. And those people who solve it successfully deserve their success. Those guys certainly earn their money. This story, though, is not about those people. I’ve developed an extremely strong distrust of marketing and marketing speak from non-technical people. I freely admit that it might be more about my own cynicism than their methods. I’ve tried my best to disabuse myself of this bias but stories like what follows keep adding evidence to my prejudice.

I want to relay to you a real example of a “to-market” plan I was exposed to from a real company, from their non-technical founder. This particular company was doing both software and a gadget that were useful separately but also had synergy. The actual details aren’t really relevant. Here was their two-step “to-market” plan, phrased exactly as they did:

  1. Be popular on tech and gadget blogs
  2. Make viral videos

Immediately, I was struck by the certainty. They aren’t going to to “try” to be popular, they just will be. Maybe that’s that start-up confidence. I waited for the specifics. Someone asked what demos or cool things they’d do to “be popular” or “become viral”. They froze up a little. Oh my god, I thought. They didn’t seem to have specifics. They actually thought it was this easy.

Break Time

I need to make something clear before I continue: I don’t have a lot of real world experience in this arena. I haven’t started multiple successful start-ups or anything like that. I’m just a guy who reads a lot. That’s the extent of my expertise. So while I do hold strong opinions, I understand fully that I might have missed something critical in my reasoning. I am always open to correction.

So, all that being said, someone please explain to me how this is not the dumbest conceivable plan. I was stunned when I heard this. It blew my mind. They actually thought it was this easy!

Being Popular is Not Trivial

Shouldn’t this go without saying? Listen start-up guy: if you think “being popular” on blogs or “making viral videos” count as a marketing plan, you need to stop, right now. You aren’t any good at this. I’m serious. If it was that simple, you’d be King Midas. Every pile of nonsense you touched would turn to gold. If you really can wave a magic wand and turn videos viral, you don’t need to make whatever stupid widget you are trying to make. Your skills can be leveraged elsewhere.

Chances are good, however, that you aren’t King Midas. You are just a guy who has absolutely no idea how much of a non-plan being “viral” is. You are in way over your head. Pretending you can just “go viral” is like believing in magic. Or that you can make money by stealing people’s underpants. Going viral is something that happens to a large extent at random. You can study all the real-world cases you want. There might be some lessons to be learned but at the end of the day most of it is just pure dumb luck. It’s basically akin to rounding up all of last’s years lotto winners and trying to find lessons. Ooohhhhh, they all eat cabbage.

The lesson is simple. If you had specific ideas for demos, blogs, or other marketing things that you think has a good chance to be well-spread and fairly viral, that can constitute a plan. “Be popular” isn’t a plan unless you can answer the question: how? These specific ideas for how represent your “to-market” plan. If you don’t have any specifics, you don’t have a plan.

OProfile: Profiling in Linux for Fun and Profit

Profiling code to find performance bottlenecks is a relatively common operation. My goal here isn’t to give you a detailed understanding of OProfile. I just want to convey to you that it’s incredibly easy to use and extremely powerful. If you don’t know much about profiling, or use only gprof because it’s the only profiler you know, please read on.

OProfile is a system-profiler for the Linux platform that has been my absolute favorite tool for profiling code. I have no idea why it’s not more popular or so unknown (maybe I just don’t run in the right circles). OProfile does have somewhat onerous setup requirements but it has become standard fair for many distributions that now have excellent support for it. It’s become substantially easier to get set up over the past few iterations of my favorite distributions (why am I lying? I only really use Ubuntu… but it’s still true!). For many distributions, it’s become trivial to setup and use. It should be a standard part of any Linux developer’s toolkit. (Windows and Mac developers have similar tools that work on similar principles: V-Tune and Shark.)

The most commonly used profiler is gprof. Most programmers tend to know about gprof and can probably profile something given enough googling. In my opinion, given the choice, OProfile is a far superior tool. I’d like to first discuss the primary differences between profilers like gprof and profilers like OProfile. If you happen to be a gprof ninja and notice any mistakes, please let me know. Hopefully, though, I can convince you that OProfile is both technically superior and easier to use.

OProfile vs gprof

The Inner Workings

Both OProfile and gprof work based on a statistical sampling method. They periodically poke into your program(s), figure out what code is currently being called, and increment the counter for that symbol. If you let this run long enough, with a high enough sample rate, and you’ll get a pretty accurate distribution of how the code works. The primary difference, however, between OProfile and gprof, is what triggers these samples.

Gprof uses a special system call that will periodically sample the currently running process only. This means gprof will only be aware of things that happen in “user time” for your process. You will not see bottlenecks that occur in external shared libraries (like libc) or the kernel. This can result in gprof’s results being very skewed for certain types of bottlenecks (page faults, file i/o, memory fragmentation, etc.).

If you are using gprof and the ‘time’ command doesn’t agree with gprof’s cumulative total time, you’ve likely hit this limitation.

OProfile, on the other hand, is a system-wide profiler that triggers on hardware events and will record not only all of the current symbols being executed but also which process they belong to. This means OProfile sees all. If your code is causing a major bottleneck in the kernel or in libc, OProfile will see it and gprof will not. The downside of this kind of omniscience is that OProfile requires special permissions. Typically, this means the sudo password. If you are doing computational optimization, this is usually not a problem.

Call Graphs

Gprof requires code be built with the -pg flag. This instruments the actual code with information that will help build an accurate call graph. The upside here is you get your call graph and you also have cumulative timings (how much time did we spend inside this function, and all of its children functions). The downside is that you have to instrument your code and you also need to recompile it with special flags. There’s two issues here that are worth noting. First, the -pg flag can interfere with other flags (like -fomit-frame-pointer). This can make certain bits of code no longer compile or work correctly (for example, inline assembly). Second, adding instrumentation to the code can cause fundamental changes. There is a Heisenberg effect that by trying to measure your code, you affect its true performance.

OProfile, on the other hand, requires no instrumentation of the code. This means OProfile does not need your code to be recompiled with any special flags (so long as the binary in question hasn’t had the symbols stripped). Since OProfile’s only source of information is the hardware-based event sampling, it doesn’t have any real information about call graphs. OProfile is able to build “statistical” call graphs where it can make guesses about which functions are calling which. This typically requires some interpretation on your part to fully decipher. If you need an accurate call graph, OProfile might not be the best tool.

Multithreaded Code

I have no idea if this issue is resolved in gprof yet but gprof does not (or, at least, did not) support multithreaded code. Oprofile does.

Summary

OProfile pros:

  1. Supports hardware based events (more than just CPU clock cycles: cache misses, etc.)
  2. Supports multi-threaded code.
  3. Sees all processes and will find bottlenecks in other places (like kernel and libc).
  4. Does not require any special compilation flags, or even recompiled code.

Cons:

  1. Requires root access
  2. Requires kernel support
  3. Does not provide (precise) call graphs nor cumulative timings
  4. Not portable (Linux only)

Setting Up OProfile

Most popular distributions have an OProfile package that can be easily installed. The only caveat is that some distribution flavors come with kernels that don’t have OProfile support built in. This means you’ll need a new kernel (they might offer an alternate in their package system, or else you will have to… compile it yourself… gah). I’ve found over the last few years many of the most popular distributions are shipping with kernels that support OProfile. Ubuntu, for example, has for awhile (while the server version, I believe, does not).

Installing the tool is usually fairly trivial (vanilla Ubuntu users need only: sudo apt-get install oprofile). Once installed the only setup required is:

sudo opcontrol --no-vmlinux

This tells OProfile that you do not have an uncompressed binary of your kernel (your vmlinux file). This means that OProfile will assign all kernel samples to a black-box called “no-vmlinux”. If you see “no-vmlinux” high on the charts, you will know you are having bottlenecks inside the kernel. You can peer into the kernel by telling OProfile where your vmlinux for the kernel can be found. If you build your own kernel, you should already know what to do. By default, however, most distributions ship with only a compressed kernel (vmlinuz). Typically, there is a package that you can download that will also give you an uncompressed version of the kernel. For example, Ubuntu (Hardy Heron) calls it linux-image-debug-{version}. If you install that package, it will put a vmlinux file in /boot/. If you point oprofile to that file, you will be able to get a breakdown of what is happening inside your kernel.

Usually, by default, your system’s prebuilt binaries are stripped of their symbols. This means that all of these applications and libraries will be black boxes to OProfile as well. If, for whatever reason, a particular program seems to be bottlenecked in libc for example, and you want the profiler to break down whats happening in libc, you will need to install the version of libc that contains symbols (look for the -dbg package). Furthermore, if you want to profile some pre-built binary, you will need a version without the symbols stripped out.

Using OProfile

It’s really this simple:

sudo opcontrol --reset
sudo opcontrol --start
./run_my_code
sudo opcontrol --shutdown
opreport -lt1

Those are petty much all of the commands you need to know to use oprofile. You reset and start the profiler, run your code, and then shut it down. The final step is viewing the results. The opreport command in my example has taken two flags. The -l flag shows you inside each process (otherwise you will get just an overview of each process and its respective usage of the system resources). The -t1 flag sets a threshold at 1% so you don’t see every symbol.

Advanced Beginner Usage

Cache Misses

On x86 machines, it’s also fairly easy to track cache misses using OProfile. By default, OProfile tracks CPU_CLK_UNHALTED hardware events. This is really a measure of how long your code takes to run. You can change the hardware event (sudo opcontrol –list-events to see all the other options). In particular, you can switch to the L2_REQUESTS event with a mask that includes only ‘INVALID’ requests. This requires a bit of internet searching, or the Intel optimization manuals. I’ve gone ahead and looked it up for you, though:

sudo opcontrol --event=L2_RQSTS:1000:0xf1

Events are specified in this way with the first number being the count (how many of these events cause a sample to occur, lower is better but requires more overhead). The second number is the unit mask. In this case, the 0xF1 means across all CPUs, but only INVALID L2_RQSTS.

WTF moments

Have you ever built in some complicated new package and aren’t getting results you think you should be? If you are anything like me your first thought is “Am I even running the new code?”. Maybe you open your editor, dig down, and drop in a few printf()s to figure out if you are even calling the new super fantastic function. This obviously isn’t the smartest way to do this. OProfile solves this problem trivially by letting you sample any compiled code. You can just fire up the profiler and actually look at all the symbols that were caught by the profiler. This is usually the fastest way to figure out if the code you wanted to be called is being called.

Immortality

It is commonly said that the human body replaces its cells every 7 years. It is less commonly known that 98% of the atoms of your body are replaced every year. The actual absolute truth of either notion is probably disputable, but they are close enough to serve my point. After a few years, virtually everything that made you you is gone. This leaves an intriguing question: what exactly are you? While this is also an interesting philosophical question, I’m really working on the more practical angle. Here’s basically the answer: you are a low entropy depression in the universe. You suck good quality energy from your surroundings and spit it back out as heat and waste all in an effort to… exist. Thinking like a universe, you are essentially a weather phenomenon, a low-pressure front.

I consider this interesting because you aren’t really this well-defined thing that exists from birth until death. You aren’t so much a solid object but a place in the universe where things are flowing in and out in some orderly way. The primary difference between you and the weather, then, is if you are fit, in the natural sense, you might be lucky enough to spin off a few thunderstorm copies of yourself. Through this incredibly ephemeral existence your genes are passed down. So even though the atoms are replaced continuously, the pattern in your genetic code lives on. Your genes are potentially immortal.

Immortality may be a strong word but if there is anything of this Earth that deserves the title, it would be our genes. The most stable (and important) bits of genetic code can have their lifespans measured in the hundreds of millions of years, possibly billions. To put this into perspective, our universe is about 15 billion years old. In your DNA is a self-replicating pattern that has existed for a non-trivial portion of the history of the universe (where 150 million years represents about 1% of the age of the universe). That’s pretty impressive for a bunch of carbon chains.

The Internet

This all brings me to my point. I’m pretty sure the Internet is close to a similar level of immortality. What exactly is the internet? It’s not the wires, nor the servers, nor the hard-drives (nor the tubes and it’s certainly not the dumptrucks). It’s analogous to the view of the human body that I made at the beginning. The internet is a gigantic pile of information. The internet probably changes out all its “cells” every 10 years or so, on the same order as the human body.

With all of these physical pieces of the internet being constantly replaced, it still comes as no surprise that the information is preserved. Sites like Google’s cache and the Internet Archive have saved large sections of the internet. The biggest difference in our little analogy is that the Internet and its information, unlike us, is not self-aware and probably won’t be self-replicating any time soon. As long as we keep those hard drives from failing, though, the information contained therein is going to live on.

I saw one of my recent posts in Google cache. This got me thinking. Like the billion-year-old bits of genetic code in my DNA, how long could I reasonably expect my post to live? What about something that’s been saved in the Internet Archive?  Do you think its a stretch to say it might live as long as the human race? I don’t think it’s a stretch at all. The Internet Archive project might run out of money or suffer some catastrophic data loss, but chances are good that someone will step up to preserve the archive and/or a backup exists somewhere to replace the lost data. As time goes on, it will become increasingly more important to save this for historic purposes. When does it end? Maybe never.

If you don’t already know the term, now is a good time to refer you to Wikipedia’s entry on memes. Memes are units of cultural information that are passed around our collective consciousness. Memes in the cultural world are equivalent to genes in the biologic world. They undergo natural selection, evolution, and all the other tenants of Darwinism. The fittest survive and are replicated. The bad ones go away. The internet has given memes the type of immortality previously only afforded to genes.

Every time you and I post to the internet, someone somewhere might archive it. And those archives will keep getting saved. Saved and copied and saved again. These archives may seem, now, to be only a novelty, but their importance as a historical tool will only grow as time moves forward. They will become increasingly more sacred. It’s kind of funny to think about historians digging through the early Internet Archives and writing a paper on the Rick Roll meme. What’s interesting, though, is that I’m pretty certain almost every participant in these “early” internet times (ie, us) is going to leave some mark in that archive. That means every time you write something, it might get saved for the rest of humanity to read, ever after. I don’t think ever before has every person had this much access to this type of longevity. If the Internet is immortal, then we in it shall be remembered.

How We Made Our Face Recognizer 25x Faster

This is a war story from my real day job. I work at a start-up company in Pittsburgh that does face detection, tracking, and recognition software. This particular problem involved optimization of the face recognizer.

I’m going to try to simplify our face recognition algorithm as much as possible to get to the point I’m trying to make. If you are reading this to try to learn our algorithmic secrets, you’ll be sorely disappointed. What you’ll see here is in the first few chapters of a pattern recognition textbook.

Recently, we came out with a brand new version of our face recognizer. As is usually the case, once they are happy with the accuracy they hand it off to me to see if I can speed it up.

The Problem

Comparing faces tends to grow quadratically with the number of input faces. When you imagine comparing 1000 faces to another 1000, that equals a million comparisons. Here’s the pseudo-code of that operation:

for i in range(face_list1):
   for j in range(face_list2):
      answer[i,j] = compare(face_list1[i], face_list2[j])

When you look at this code, notice how threadable this problem is. These comparison operations are completely independent, numerous, and sizable amounts of computation. This is a concurrent programmer’s DREAM problem. So when it came time to parallelize this algorithm, I wasn’t worried. This is what one might refer to as an embarrassingly parallel problem. Our speed-up should be identical to the number of CPU cores.

So I sat down, as I’ve done many times, whipped up a little thread pool, instantiated my thread-safe queue with a mama-process for feeding the workers. We generate a to-do list, dump them all on the queue, and have the worker threads go to town on the other end of that queue. This is a pretty standard solution. Before lunch I was ready to fire it up and I saw the most beautiful thing a concurrent programmer can ever see:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
32603 louis     16   0  246m  24m 4620 S  800  4.8   0:09.12 recognize

Just in case you aren’t familiar with the output of top, this is showing my program using all 8 CPU cores (dual quad-core). Problem solved. I went to lunch.

Not So Fast

A funny thing happened over lunch. My test finished running and the results weren’t what I expected. Having an embarrassingly parallel problem and a CPU pegged at 800%, you’d expect an 8x speed-up. That’s not at all what happened. We ended up with about 3x improvement. Since our CPU is pegged at 800%, it’s not a synchronization problem. Having been down this road a time or two, I was fairly certain of the culprit but took a moment to prove it. It wasn’t due to file I/O or anything like that (since none of that stuff was present). With a little use of oprofile (which deserves its own post, one day), I could tell the time was spent in user space, and not the kernel, which removes things like page zeroing or page faulting. Cache misses, on the other hand, are “assigned” to the user process.

Our code was cache bound.

The Inner Loop

I’m going to strip out all of the layers and show you the entire algorithm, simplified. Take my word for it, however, that in our ordinary code base, these loops were separated by numerous layers of abstraction, function calls, and a gigantic mass of book-keeping. It should be fairly trivial to understand what the algorithm is doing. How and why it works, though, are topics for a pattern recognition class.

for face1 in face_list_a:                                    # on the order of 1000
   for face2 in face_list_b:                                 # on the order of 1000
     for feature in features:                           
         feature_list[feature] = extract_feature(feature, face1, face2)                                
     answer[face1][face2] = classify(classifier, feature_list)

This may require some explanation for people without any pattern recognition background. First, every face is made up of a set of features. Think of a feature as some measurement (like eye color). How are features chosen, or generated? Go take that pattern rec class. We extract all of the features from our faces and dump them into our classifier which is able to determine a likelihood of a match.

The Solution - Becoming cache-friendly

The reason this is thrashing cache was pretty straight forward for me. The inner loop is iterating over features. Since each feature computation is relatively expensive, “iterating” over features is awful for cache performance. The solution is a very simple reordering of the loops:

for feature in feature_list:   
   for face1 in face_list_a:                                    # on the order of 1000
      for face2 in face_list_b:                                 # on the order of 1000                            
         feature_lists[face1][face2][feature] = extract_feature(feature, face1, face2)                                
for face1 in face_list_a:
   for face2 in face_list_b:
      answer[face1][face2] = classify(classifier, feature_lists[face1][face2])

This means we use the same feature for every input first and once we are done with it, we move on to the next one. This allows each successive comparison between faces to take advantage of the caching done in previous comparisons.

Not surprisingly, the above algorithm improved the performance of the single threaded recognizer by a factor of 6. It also removed the cache bottleneck. With the algorithm no longer cache bound, our speed-up factor equaled our core count. Once we combined all of the optimizations, we ended up with a 25x speed-up of our 8-threaded version on our octo-core machine.

Dead Abstractions

When I put the three loops next to each other, it appears like a textbook case of loop reordering. Sometimes, however, reordering loops isn’t as easy as you’d like to believe. In this case, these loops were scattered over multiple levels of abstractions.

Do you see which abstraction died? The compare(face1, face2) function.

The loop reordering requires a set of input faces to work properly so that a single compare() function goes back to the slower, cache-inefficient algorithm. It would be natural for a programmer to decide that you need a compare(face1, face2) function. Unfortunately, however, implementing multiple comparisons by making multiple calls to a single compare() function leads to the misuse of cache.

Writing cache-friendly algorithms isn’t necessarily easy. Usually, it involves breaking some nice abstractions. This cache-friendliness problem is exacerbated, greatly, in the multi-core generation where cache performance can act as a hidden serializing force. You are doubly punished for misusing cache in this new world. Added to this is the fact that writing cache friendly, highly scalable, concurrent algorithms is a black, black art. For any embarassingly parallel problem, like mine, the solution might be bit messy but it’ll work. For more difficult problems, good luck.

Algorithms In Real Life

Sometimes I notice computer science or other assorted math nerdery in real life. These are their stories.

Alphabetizing Papers

Have you ever seen a teacher alphabetizing a couple dozen papers? This is basically a sorting problem, right? And you are a computer scientist, so sorting problems should be interesting to you. Have you ever analyzed how people alphabetize papers? How does she (for the sake of brevity I’ve made the sexist assumption that this teacher is female) do it? Almost always, she puts all the ‘A’ names in this pile, ‘B’ names in that one, and so on. The group ranges my vary (maybe A through C in this pile, etc.), but the algorithm is almost always the same. Once she’s done with that, what does she do? She tends to go letter by letter and use some other algorithm on each group. In my experience, she uses an insertion sort.

Did it ever strike you as odd that human beings used such an algorithm? Did you ever, as a young, naive, computer programmer, snicker to yourself thinking that she was using an inferior algorithm? I did. Everyone knows that quicksort is the best and fastest way to sort, right? Why don’t teachers use quicksort? My original answer to this question was that unlike computers, the human brain doesn’t do all comparisons equally. It’s just “easier” for our brains this way. Splitting into letter groups makes each smaller problem more manageable.  I was certain, in those times, there might exist some faster algorithm that teachers could use based on a quicksort. I was wrong.

Back to the computer science part: do you know the name of the sorting algorithm that these teachers use? It’s called a bucket sort. A bucket sort followed by individual insertion sorts (exactly what teachers tend to do) is a linear time sorting algorithm. Yes, linear time. Did you know there are linear time sorting algorithms? But you thought n*log(n) was the best possible sorting algorithm? That’s only the bound for comparison based sorting. When you have some notion of the distribution of the items to be sorted, you can break through that boundary and do linear time sorting. The catch with linear time sorting is that the input must follow some known distribution. This, by the way, is why teachers instinctively break the piles into various types of groupings. If there are alot of papers, you need finer group ranges. Furthermore, the ideal bucket setup would distribute the papers roughly evenly. The letter S might need its own bucket, but you can put all the letters up through F in their own bucket. Teachers have alot of experience with both the general problem and their specific problem (for example, the peculiarities of a particular class’ name distribution) and so they are optimizing the algorithm given the known distribution. They are setting up the parameters of the linear time sort (number of buckets, bucket ranges, etc.) exactly as they should to optimize the sort time.

The main downside to these linear sort algorithms is that they require alot of extra space (versus comparison-based sorting). You tend need to need an auxiliary bookkeeping array on the order of the original problem to do them. This isn’t a problem in real life! She just needs a larger table. In a very real sense, this supposedly “naive” algorithm that teachers use is among the very best possible. And woe to any computer science student who thinks to himself that she’s doing it wrong.

The Coffee Problem

So I woke up the other day, took my shower, got dressed and got on the elevator. My apartment complex has complimentary coffee in the lobby area (that’s why the rent is so high). This particular morning, when I got there, I noticed all of the little tags that tell me which dispenser was which were missing. So there I stood, with five identical coffee dispensers, virtually certain at least one of them was decaf. I don’t want decaf. What do I do?

If I was David Carruso or that dude from CSI: Las Vegas, I could have smelled the coffee, detected the bean’s nation of origin and deduced which was decaf based on the trade distribution of that country’s beans throughout the continental United States. Or maybe I could have used some nearby chemical MacGuyver style to determine which was caffeine. I’m just a dude needing caffeine, so I thought for a second, and then went for the middle one. How did they likely set these up? It seemed logical to me that since there were five, no more than two would be decaf. That seems like a good assumption. Furthermore, my “opponent”, being a rational person, likely put the decafs together, and on one end. That means the middle one is probably caffenieted.

(Another good answer I received from someone else is just to mix them all up. At worst you get 60% caffeine.)