The Day I Got Git (with some help from github)
Unlike other programmer holy wars, distributed version control versus centralized version control actually has interesting underlying issues. Distributed version control systems are all the rage these days and git seems to be leading the charge. I use both svn and git so I’ll use them as my examples of the two paradigms.
When you first learn about distributed version control, it takes a while to figure out why you’d ever use something like it. I understand why the Linux kernel people need it (or other very distributed development teams). But why would I need it? Seems like overkill. It’s not until that first use-case hits you that, all of a sudden, the world explodes. What I’m presenting here is a particular use-case in which git shines and where subversion seems quite lacking.
This is not to say, however, that git is superior to subversion (that tired argument can be found elsewhere). There are most certainly situations where subversion is clearly the correct choice. Even at our office, our main code base is in subversion because we have a work-flow that maps so well onto subversion.
Frame-accurate seeking
A while back we decided that since we work on face recognition, we really need a good solution for video input. Our requirements were a little different than most, though. We don’t care much about audio or time information because we aren’t doing playback. We have more of a “video processing” need. That meant we needed a video input solution that:
- was cross platform
- supported a wide range of codecs and formats
- supported frame accurate seeking
The only video i/o solution that comes close (you pretty much lose everything else at “cross platform”) is ffmpeg.
You wouldn’t think this was a big list of requirements. I didn’t. And boy was I wrong. As it turns out (that’s code for “there’s a lot more to this painful, painful story, but the wounds are still too fresh to talk about”), it’s quite difficult to do frame accurate seeking in a format or codec agnostic fashion. And by “difficult” of course I mean “essentially impossible”. It struck me as amazing, at first, that simple questions like “how many frames does this video have?” aren’t exactly easy to answer (the only foolproof way to do that is to decode the frames and count them, and even that doesn’t always work). Even worse, trying to seek to a particular frame of a particular video (let’s say frame 192) turns out to be even more impossible. However, the utter painfulness of modern digital video is a topic for a different day.
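To give a flavor of why “seek to frame 192” is awkward, here is a toy sketch in Python. It assumes something not spelled out above: that a decoder can only jump directly to keyframes, so reaching an arbitrary frame means jumping to the nearest earlier keyframe and decoding forward from there. This is not ffmpeg’s API or our wrapper’s code; the `plan_seek` helper and the keyframe positions are made up for illustration.

```python
from bisect import bisect_right


def plan_seek(keyframes, target):
    """Return (keyframe to jump to, frames to decode and throw away).

    keyframes: sorted frame indices that can be decoded on their own.
    These are hypothetical values; a real file would have to be
    scanned to find them, which is part of the pain.
    """
    i = bisect_right(keyframes, target) - 1
    if i < 0:
        raise ValueError("no keyframe at or before the target frame")
    start = keyframes[i]
    return start, target - start


# Hypothetical stream with a keyframe every 50 frames.
keyframes = list(range(0, 500, 50))
start, discard = plan_seek(keyframes, 192)
print(start, discard)
```

Even in this idealized model you never land on the frame you asked for; you land somewhere before it and have to count your way forward. Real files add the problem that the frame numbers themselves may not be trustworthy.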
So we sat down and decided to write a little wrapper library that was reasonably capable of doing what we needed. We did a pretty good job in the end, but I’m certain there are still a few bugs. If you ever need frame accurate seeking, I urge you to give it a try.
Throwing our project “over the wall”
The point of this whole story is simple: this is a perfect project to open source. So I sat my bosses down and made the easiest pitch ever. It’s not our core competency (we do face recognition), we have a vested interest in it working as well as possible, and we can’t devote the time and effort to make it perfect (it would be a Herculean effort). Let’s open source it and if just one person fixes one bug, it’ll be better for us in the end. They agreed, I was happy, so let’s do it. I went ahead and made a Sourceforge page for my Frame Accurate Seeking Extension to ffmpeg and uploaded the code.
The project wasn’t really a success. Only a handful of people used it, but that was alright. It was useful. Early on, I enjoyed reading emails and answering questions from the few people who were using it. Over time, though, I moved back to working on what it is I normally do, and the video stuff became a bit of a burden. The problem was, though, I was the “owner” of this project on Sourceforge, it didn’t have any sort of critical mass, and I had no interest in all the administrative nonsense that went along with such a position. I didn’t really want to answer people’s emails, or test people’s patches, etc. I wanted to “let it go”. The problem was, when using centralized version control through Sourceforge, “letting go” of such a small project is essentially equivalent to killing it.
I wanted to take this corporate project and toss it over the proverbial wall.
Enter git (with the help of github)
There is a very obvious parallel between git/svn and the whole web 2.0 craze. Subversion is Slashdot. Git is Digg/Reddit. Does that make sense? There is no centralized governor in git. In general, git repositories end up with very obvious centralized branches that form the backbone of the project but this is arrived at organically. Git, in this sense, is a much more “social” tool where it’s very easy for anyone to contribute and the make-up of the full project is the sum of all these individuals. Because git is so much more “social” than subversion, it makes sense that a “social coding” site like github would spring up.
The beauty of distributed version control is that there is no owner. There is no red-tape or permissions or any of that stuff. It’s the perfect place for a single person to work on whatever it is they want but within public view. That way, if someone else is interested, they can come along, with no permission from the first, and add whatever it is they want. This is incredibly desirable for situations just like mine.
Git and github work together to form some bizarre social distributed version control system. The way it works is simply that you host your git projects on github, such that anyone can fork them, and all of the branches are tracked through a nice, visible online interface. You can then go back at some later date, see all the forks and all the patches in those forks, and incorporate whatever you want back into your project. It’s a very slick solution to both hosting and tracking forks.
It’s hard to explain to someone who hasn’t “seen the light” how perfect this synergy is. The best way I can describe it is that git and github make open sourcing and collaborating easier than it has ever been. It greases the wheels between individuals to such a degree that you can (almost) remove all of the organizational overhead in your average open source project. It lowers the bar of “life” for a project so far that projects cannot really “die”; they will just sit there waiting for someone to fork them and improve them.
Letting go with git + github
So since I realized I didn’t really want to actively manage my little project, I wanted to leave it in a state that anyone could easily take over, if they wanted. Git and github provided a substantially better mechanism for this than subversion and sourceforge. So, I moved the project over to github. It may die a quiet death there (which will be alright) but at least this way I’ve provided it for anyone who might find it useful and I’ve removed all the painful red tape of anyone else creating, modifying, and building on what I started.
We all have those projects that would otherwise die on our hard drive. They aren’t worth uploading to some “heavyweight” hosting site. This is where github kicks in to perfection. Just throw them up there. Chances are really good that no one will ever look at them. But what’s the alternative? Let them die when you next wipe the hard drive? Let’s say you share a project and 6 months later someone wants to change it. All they have to do is fork it and change it. They don’t need to email you to talk about it and send you a patch that you need to test (can you trust their code?) etc, etc. And if you don’t care anymore, don’t care. Ignore it, if you want. If 6 months later, you come back to the problem, you can see the work they did when you are ready, and pull in any changes you want.
By lowering the friction coefficient of administration and collaboration, git + github have created an entirely new class of open source projects that are “healthy” despite their smallness (whether in audience or even usefulness). This is a place where people can upload the smallest of their personal work and, together with a few other people separated by a great deal of time, cobble together small but useful things. Connection and collaboration at this level were just too onerous to be useful in the past.
Got a New Dog
I haven’t had time to write anything this week because we just got a new dog. So, I’ll make this short and sweet:
Have you ever noticed that the ld command has three magic flags? omagic, nmagic, and qmagic. I think the world would be a better place if all our programs had a few magic flags.
The Dumbest Interview Question
There are many dumb interview questions. Programming, itself, has a wide range of extra potential for stupidity. But the question that irks me the most is universal to just about every field:
What is your greatest weakness?
Well, my greatest weakness is a complete and utter intolerance for inane questions. You see, this stupid question is designed to be rightly answered by someone who has carefully crafted an answer that turns some apparent weakness into a perceived strength. Maybe I work too hard. Maybe I care too much. Maybe sometimes I am too passionate. Blah, blah, blah. What have we learned about each other? Well, first, we’ve learned that you somehow believe in this stupid question. Secondly, we’ve learned that I’ve read a website on how to give a trite, non-committal answer that conveys absolutely no information. Do you honestly believe that I’m a perfectionist?
It’s a game. You are asking me a question to see if I’ve learned how to properly not answer it. What a joke. This question is just a vapid mating ritual, void of any real information. We are like a pair of insects. You shake your little antennae and I gotta make sure I do the proper mating dance or you’re just going to eat me.
Why are we playing this game? Is it just to make sure we are all “in” on the game? Is this the business world’s equivalent of a modem handshake? You just want to make sure I haven’t been living under a rock? Maybe we can replace this waste of time with some secret hand signals or something. You tap your foot three times, and I respond with two eyebrow raises. It’ll save time and communicate the same information. We both know the ritual.
On second thought, my greatest weakness is that I don’t have the fortitude of my principles to tell you straight-away that this question is asinine and I feel the need to give a long detailed explanation of why such inanity has no value.
Now that we’ve established that you can ask the standard questions and I’ve prepared carefully crafted (and likely untrue) responses, can we move on? Or do I need to keep dancing?
In the Negative?
Some people might argue that the question (and others like it) is valuable in the “negative” direction. In other words, bad answers help you eliminate bad people. I guess. I haven’t spent enough years asking these types of questions to know how often you receive horrible answers. I wonder how many times, honestly, a candidate would have gotten a job except that he named his greatest weakness as clubbing baby seals.
My suspicion is that anyone incapable of getting the basic interview rituals correct had red flags all over the place.
Congratulations to Us
Today we add to the growing list of start-up success stories. For those of you unaware, I work at a start-up company in Pittsburgh. It’s called Pittsburgh Pattern Recognition. We are a spin-off of CMU doing face recognition.
Before we get to the good news, let me tell you a story. At some point just after I started working there, I distinctly remember us receiving annoying phone calls from the Yellow Pages. Sometime thereafter, we ended up getting listed. Where do the Yellow Pages put a face recognition company named Pittsburgh Pattern Recognition? Under Pattern Makers, naturally. We found this particular error humorous but never bothered to fix it. Why should we? It’s not like anyone is looking up “face recognition” in the yellow pages.
We pretty much forgot about it with the exception of the (very) occasional calls we’ve received on the topics of quilting and knitting. What was the point of this story? Read on.
Hard Work Pays Off
A few years on and we’ve borne so much fruit from our labor: face detection, face tracking, landmark localization, face recognition, speed-up factors in the 100s, highly threadable versions, and so on. After so much hard work and dedication, it finally happened. On December 4th, a day sure to live on in our company’s memory forever, we received the following email:
I am pleased to announce that Pittsburgh Pattern Recognition has been selected for the 2008 Best of Pittsburgh Award in the Pattern Makers category by the U.S. Local Business Association (USLBA).
In recognition of your achievement, a 2008 Best of Pittsburgh Award plaque has been designed for display at your place of business:
Hot damn! We’re #1!
Given our new standing as the #1 pattern makers in the Pittsburgh area, I’ve decided to go ahead and drop some knowledge on you lesser mortals. Check it out.
Whoa damm.
In the spirit of teaching (because when we are gone, who will make the patterns?), I’ll divulge my secret recipe:
- Open GIMP. Create new Image (any size).
- Use the fill tool, with a warning pattern (yellow and black diagonal stripes)
- Use Filters/Enhance/Antialias (a few times, remove the hard edges)
- Use Filters/Distort/Emboss (default settings)
- Use Filters/Artistic/Clothify (default settings)
Let us take a moment and pause to reflect on the genius.
Seriously
Ok, seriously. Can this be rightly called a scam? This question has been bothering me.
On the one hand, it’s deceptive, for sure. But on the other, if you were a local mom & pop shop, it would give you something to put on the wall. You can see this type of stuff on the walls of restaurants and shops all over the place. Does that type of wall art actually influence customers and win you business? Maybe. Does a “real” award carry more weight than one of these? Probably not.
I just can’t believe this is someone’s model. Go through the yellow pages, email everyone there and declare them the winner. Give them a sweet press release they can quote, and let them buy a plaque from you. My disbelief slowly became amazement. This is actually genius.
Sad to say this story has an unhappy ending. Despite my pleading, our company will not be purchasing our plaque. It appears we have our sights set higher.
Algorithms in Real Life: Gradient Descent in Social Media (3/3)
While the last two posts were about “usual” gradient descent, this one is about a concept that is very close by analogy. It’s not quite the same but I think the grouping is natural. In this case, the guy in charge is much more in control of the terrain than the direction of travel. This is about social online communities. For the purposes of our discussion, I’m referring specifically to any product that relies on the community for its content (usually with both community contribution, and community voting as a filter). In our analogy, the point we are trying to optimize is the community that develops around our product.
When your product depends on the crowd for content, you are quite literally on a slippery slope. Since the crowd creates the content and determines what is popular, the things that become popular also attract the type of people who like that content. There is positive feedback in this loop. If we think in terms of a gradient descent, the situation is changed. You don’t get to choose which way is downhill because the crowd pushes you there. Your site becomes like a ball on a hill. It’s going to go where the hill takes it. When a social site starts out, and is in its formative stage, the community is sort of like a ball on the top of a hill. With a tiny push in any particular direction, you can drastically alter the eventual outcome. In this case, the outcome is the particular community that develops. Small decisions early on can have drastic consequences. Once a community has its particular set of groupthinks, it’s extremely difficult to push that ball out of that valley and into some (hopefully) better valley.
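The ball-on-a-hilltop picture is easy to play with in a few lines of Python. This is only a sketch of the analogy, not a model of any real site: the terrain f(x) = x**4 - 2*x**2 is my own pick, chosen because it has a hilltop at x = 0 and two valleys at x = -1 and x = +1, and `roll_downhill` is a made-up helper that does plain gradient descent on it.

```python
def roll_downhill(x0, step=0.01, iters=5000):
    """Plain gradient descent on the toy terrain f(x) = x**4 - 2*x**2.

    The terrain is an assumption for illustration: hilltop at x = 0,
    valleys at x = -1 and x = +1.
    """
    x = x0
    for _ in range(iters):
        grad = 4 * x**3 - 4 * x  # f'(x)
        x -= step * grad
    return x


# Two starts that differ by a hair, one on each side of the hilltop.
left = roll_downhill(-1e-6)
right = roll_downhill(+1e-6)
print(round(left, 3), round(right, 3))
```

The two starting points are practically indistinguishable, yet they settle in opposite valleys. Once the ball is sitting at the bottom of one of them, a nudge of that size does nothing at all.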
From a founder’s perspective, the community that develops is of paramount importance. Good communities can dominate some niche, be extremely large, and be an extremely attractive crowd for advertisers. For one reason or another, though, founders need to think about how to best guide a ball to wherever it is they want to be.
The initial conditions help to determine that initial community. And from there, everything can follow.
Positive Reinforcement
It is important when trying to build a community based site to understand this principle. One particular community can all but remove the “opposing” community from existence. People like to go where other people agree with them. And there are certain issues that cause people to become extremely tribalistic and do their best to make opposing views as unwelcome as possible. These issues very quickly “break symmetry” and become dominated by a single viewpoint. There are obviously other reasons why certain viewpoints begin with a large majority, but the fact that positive reinforcement very quickly drives that ball downhill needs to be understood. If it is your business, it might be in your interest to steer that ball away from certain valleys.
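Here is a rough Python sketch of that kind of symmetry breaking. The rule is entirely my assumption, not a measurement of any real community: newcomers arrive in batches and favor the larger camp more than proportionally (they split according to the squared shares of the two camps). The `majority_share` helper and all the numbers are hypothetical.

```python
def majority_share(a, b, rounds=2000, newcomers=100):
    """Share of camp A after repeated batches of newcomers join.

    Assumed rule (for illustration only): newcomers prefer the larger
    camp more than proportionally, splitting by squared shares.
    """
    for _ in range(rounds):
        p = a / (a + b)
        w = p**2 / (p**2 + (1 - p) ** 2)  # fraction of newcomers joining A
        a += newcomers * w
        b += newcomers * (1 - w)
    return a / (a + b)


print(majority_share(50, 50))  # perfectly balanced start
print(majority_share(51, 49))  # camp A starts two people ahead
```

Under this rule a perfectly balanced start stays balanced, while a two-person head start grows into a near-total majority.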
For example, Hacker News has done this by using moderation to simply declare a site-wide topic and stick with it. By and large, the community that has developed is there to read about that topic. They steered that ball early and attracted the type of community they wanted. Consequently, it has become largely self policing, now. Even still, the moderators have to step in occasionally.
There is another way to solve the same problem. A really good example of this comes from reddit.com. Many people who’ve been around reddit for any length of time remember how frequent and how pervasive the 9/11 conspiracy theories were on the site. They were common in the political reddits and often made their way on to the front page. Once a ball like that gets rolling, it becomes extremely difficult to stop. People who want to read conspiracy theory stuff can go there and look at it, and those that don’t will stop going. The more momentum the movement has, the more difficult it becomes to stop. (I think) reddit was in real danger of falling off that conspiracy cliff and becoming a full-fledged conspiracy clearing house. The positive reinforcement can be extremely strong, in this type of situation. It turned out, however, not to go down that path.
What saved reddit? (Has it even been saved?) Well, there were certainly external factors (like the election, etc.) that made the topic diminish in importance in the political reddits. I think there’s more to the story, though.
Reddit is pretty well-separated and easily customizable. If you only go to programming.reddit, you won’t see much politics. Same thing with lolcats. Or bacon (’nuff said). Many of these well-separated communities thrived. So while the front-page of reddit (and the political sections of reddit) had high conspiracy content, these other communities were largely left alone. Reddit also made it extremely simple to customize (many people are presumably like me and use reddit with all the political stuff turned off). This lets reddit build multiple overlapping communities.
In the end, these sub-communities grew faster than the conspiracy theorists and eventually crowded them off the front page. Over time, fewer and fewer 9/11 conspiracy theories have appeared on the front page of reddit (or even politics.reddit). To this day, though, because of this early movement, reddit is one of the most conspiracy-friendly places on the “main” internet. People who go to reddit are at least tolerant of conspiracy theories because those that aren’t haven’t survived this long.
digg vs reddit
Another really interesting application of this kind of thought is how much the layout of a site affects who eventually uses it. digg is much more visual than reddit. Early on, when the only social site was digg, it was an extremely technical site. This has changed a lot as its popularity grew. Since digg is so visual, it lends itself much more to pictures and videos. Consequently, digg has become much more open to general internet hilarity. LOLcats and funny videos are par for the course on digg. As this popularity grew, it became difficult to stop, until the highly technical articles could no longer compete with funny pictures of cats for front page space. The highly technical things moved to other places. Like programming.reddit. Since that community moved on, highly technical articles are almost never seen on the digg front page anymore.
The second half of this argument centers on why programming.reddit became one of the best places for technical information (and gave digg users a place to flee towards). Well, reddit, unlike digg, has (or at least had) a lot of information crammed into a small space. There was no room for pretty graphics. One of my personal rules of thumb is that junkies love information density (the quintessential example is the Drudge Report). Reddit is one of the most dense sites in terms of information. The type of people who enjoy high density, together with the digg refugees, made programming.reddit ripe to become the hub of highly technical information.
As these sites travel downhill, they leave voids that others fill on their way. Over time, they each settle into niches. Any time any of these sites is “on the move”, pretty much everyone who was there “before” sees it as the site’s “death”. This is another interesting phenomenon in its own right. I’ve often wondered where these sites are headed as they become more popular. Are they converging? To what? I think I’ve figured out the answer to this question. As a site becomes more and more popular, its user base becomes indistinguishable from 4chan.
