This MUST already exist
I encounter this feeling frequently. I have a sudden need (or idea) and I realize that such and such a tool or service must already exist. As a software engineer, I’ve learned not to let that feeling pass. Usually, it’s a chance to learn something (because usually it does exist) and sometimes it’s a chance to build something useful and new.
The time that my google-fu failed
In our work, we tend to chain a bunch of commands together like this:
./face_detector input.mpg | ./face_tracker | ./visualize
This is a great formalism. The problem is that sometimes something goes horribly, horribly wrong. Let’s say this time the error occurred between step #2 and step #3. This isn’t so bad unless, say, step #1 takes a few hours to run, and now its standard output has vanished into the never-ending ether. We’ve lost the output of step #1 and need to rerun it. This isn’t a complicated problem. All I really need is a program that takes what comes in on standard input, writes it to a file, and then pipes it straight back out to standard output. How does this not exist? I had this problem a while ago and searched the internet. Nothing. I wrote a 5-line Python script to do it for me.
Six months later I’m wandering around reddit reading some topic about “nifty unix programs that no one knows about” and someone mentions tee. Well, I’ll be. There it is! I guess I don’t need my little python script anymore.
Four more months go by, and I finally need tee again. Except now I don’t remember what it is called. And I can’t find it or the reddit post that contained it!
Two more months go by and I decide to write a post about it. I still can’t find the reddit post. I suck at the internet.
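For the record, here is roughly how tee would have saved the pipeline above. This is a minimal sketch: stage1 and stage2 are made-up stand-ins for the real programs, and stage1.out is an arbitrary file name.

```shell
# Hypothetical stand-ins for the slow first stage and a later stage.
stage1() { printf 'frame 1\nframe 2\n'; }
stage2() { tr 'a-z' 'A-Z'; }

# tee copies its standard input to stage1.out and to standard output.
stage1 | tee stage1.out | stage2

# If a later stage blows up, restart from the saved copy instead of rerunning stage1.
stage2 < stage1.out
```

The only cost is the disk space for the saved copy.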
Someone already built it
This is an actual IM conversation I had recently:
me: I've had this song stuck in my head for like 5 days....
me: I think it's a christmas song... but I can't remember any of the words
me: no words = no google = pure torture
me: i looked through lists of xmas songs... nothing
me: i even searched for sites that let me hum the song and they tell me the name..
him: there is one, saw it on reddit
him: you can "tap it out" on some webpage
him: and it tells you what it is
me: this is a lie
him: there is a webpage i have no idea how well it works
<tick-tock tick-tock>
me: HOLY
me: http://www.bored.com/songtapper/s/tappingmain.bin?dotap=1
me: it worked!
me: are you fucking kidding me?
It was Greensleeves, by the way.
This MUST exist — Parallelizing commands on the command line
One thing that constantly comes up at our office is the need to fire off a whole bunch of really time-consuming jobs. Here is an example. Let’s say you want to test how the face detector behaves at various resolutions. So, you want to resize a directory full of jpegs to some other resolution. Here’s how you might do it:
$ ls -1 *.jpg | awk -F. '{printf "convert %s.jpg -resize 320x240 %s-sm.jpg\n",$1,$1}'
convert 92.jpg -resize 320x240 92-sm.jpg
convert 93.jpg -resize 320x240 93-sm.jpg
convert 94.jpg -resize 320x240 94-sm.jpg
...
$ !! | sh
Quick word about what this does: The first command creates the commands and just prints them out. The second line pipes the output (!! means repeat the previous command) into sh. In other words, it executes the previous output as commands.
The problem with this solution is that it is single-threaded. [update: as I write this, I realize a second problem and yet another 'this must already exist moment'. Using awk in this manner is something I tend to do constantly and it seems a bit awkward. Maybe there is a cleaner way to do this?]. On most of our machines we have 4 cores, and it would be really nice to have something like this finish 4x as quickly by using all of the cores on my machine. What I want to do is this:
$ ls *.jpg | awk -F. '{printf "convert %s.jpg -resize 320x240 %s-sm.jpg\n",$1,$1}'
convert 92.jpg -resize 320x240 92-sm.jpg
convert 93.jpg -resize 320x240 93-sm.jpg
convert 94.jpg -resize 320x240 94-sm.jpg
...
$ !! | gogo -t4
The mythical ‘gogo’ program would take a list of commands on std-in and do them N at a time (4, in this case). It is apparently a well-kept secret that (GNU’s) xargs is capable of this:
$ ls *.jpg | awk -F. '{printf "%s.jpg -resize 320x240 %s-sm.jpg\n",$1,$1}'
92.jpg -resize 320x240 92-sm.jpg
93.jpg -resize 320x240 93-sm.jpg
94.jpg -resize 320x240 94-sm.jpg
...
$ !! | xargs -0 -n 1 -P 4 convert
You’ll notice you have to move the command (convert) to the xargs command, and then remember all of the parameters (-0, -n, -P). I find the above command to be extremely awkward to use, in practice, and almost never do it. The problem is that xargs allows you to do so many other things that if you don’t use it frequently, it’s terribly easy to forget all the parameters you need to do this properly.
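For what it's worth, the mythical gogo can be approximated with a small wrapper around xargs. This is only a sketch, assuming GNU xargs (-d and -r are GNU options) and assuming every input line is a complete shell command; the function name gogo is just the made-up name from above.

```shell
# gogo N: read one shell command per line on stdin and run up to N at a time.
gogo() {
    # -d '\n'  split input on newlines only, so each line stays in one piece
    # -r       do nothing if the input is empty
    # -n 1     hand one line to each sh invocation
    # -P "$1"  keep up to N invocations running at once
    xargs -d '\n' -r -n 1 -P "$1" sh -c
}

# Demo with harmless commands; the output order may vary between runs.
printf 'echo job 1\necho job 2\necho job 3\n' | gogo 4
```

With that in place, the awk line that prints full convert commands could be piped straight into gogo 4.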
October 12th, 2009 at 8:19 am
what’s with the -0? i don’t see anything putting nulls in your input stream
xargs can also do find -exec-style token replacement with -i, so no need for awk
ls *.jpg | xargs -i -n 1 -P4 convert '{}' -resize 320x240 '{}'-sm.jpg
October 12th, 2009 at 8:40 am
I think you forgot a tr '\n' '\0' before the xargs
Also have a look at http://github.com/erikfrey/bashreduce
October 12th, 2009 at 9:05 am
@aaron, the -0 ensures i get each line as a whole instead of having to do -n N where N is the number of arguments on each line… in the multi-argument example I showed, you need it. I guess you don’t need -n 1 in that case. I don’t put nulls in manually but it is apparently done because the -0 does what I want, which is treat each line of input as a single job for xargs to parallelize.
Also, I know you can use the replacement string but I often use awk with -F to parse out things like paths or file extensions. Your solution would work fine if I wanted “asdf.jpg-sm.jpg” instead of the cleaner “asdf-sm.jpg”. Would there be a better awk-less way to do this?
@padraig, not sure what you mean about removing newlines… and thanks for the link, I’ll go look at bashreduce now.
October 12th, 2009 at 5:13 pm
My comment is confusing I think as wordpress mangled it. I’ve posted my suggestion using my own comment engine which I wrote in 20 mins
http://comments.pixelbeat.org/lbrandy.com
Summarizing that linked command:
I used `find` as it’s better for piping,
whereas `ls` is generally best for human consumption.
You need the `tr` to delimit each line with NUL so that xargs treats each line as a separate parameter.
I also passed a -r to xargs so that the command is not run if there are no arguments.
I also used sed as an alternative to the awk manipulation, just for illustration.
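The linked command is not reproduced in this thread, so the following is only a guess at its shape based on the summary above: find lists the files, sed strips the extension, tr turns newlines into NULs, and xargs -r -0 fans the work out. The scratch directory, the sample file names, the resize_all function name and the leading echo (which prints the convert commands instead of running them) are assumptions added for illustration.

```shell
# Work in a throwaway directory with two empty sample files.
demo_dir=$(mktemp -d)
touch "$demo_dir/92.jpg" "$demo_dir/93.jpg"

# find lists the files, sed drops the .jpg suffix, tr makes the list NUL-delimited,
# and xargs passes one base name at a time to sh as $1, up to four jobs at once.
# Remove the echo to actually run convert.
resize_all() {
    find "$1" -maxdepth 1 -name '*.jpg' |
        sed 's/\.jpg$//' |
        tr '\n' '\0' |
        xargs -r -0 -n 1 -P 4 sh -c 'echo convert "$1.jpg" -resize 320x240 "$1-sm.jpg"' sh
}

resize_all "$demo_dir"
```

Like any newline-based pipeline, this still breaks on file names that contain newlines, because sed and tr see them as record separators.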
October 13th, 2009 at 9:10 am
the intended use of -0 is to secure against buggy/hacked filenames by terminating on nulls, which are guaranteed by POSIX never to appear in filenames (cf find -print0)
this will get the filenames as you want, though it’s getting a bit complicated:
ls *.jpg|xargs -i -n1 basename '{}' .jpg|xargs -i -n1 -P4 convert '{}'.jpg -resize 320x240 '{}'-sm.jpg
note the “secret mode” of basename
unfortunately basename is archaic and has really awkward argument parsing (a sane version would allow “basename --strip-extension .jpg 92.jpg 93.jpg 94.jpg”, and could be called here with a plain “xargs basename -s .jpg”)
you could also use cut instead of basename:
ls *.jpg|cut -d. -f1|xargs -i -n1 -P4 convert '{}'.jpg -resize 320x240 '{}'-sm.jpg
for correct use of -0, consider pathological pix with newlines in the name
(at this point, i have to give up on most of the normal shell tools, very few of which can handle null-terminated input, and fall back on perl)
$ ls -b1 *.jpg
92\ \nc2.jpg
92.jpg
93.jpg
94.jpg
$ ls *.jpg|cut -d. -f1|xargs -i -n1 -P4 echo convert '{}'.jpg -resize 320x240 '{}'-sm.jpg
convert 92 .jpg -resize 320x240 92 -sm.jpg
convert c2.jpg -resize 320x240 c2-sm.jpg
convert 93.jpg -resize 320x240 93-sm.jpg
convert 92.jpg -resize 320x240 92-sm.jpg
convert 94.jpg -resize 320x240 94-sm.jpg
$ find -maxdepth 1 -name \*.jpg -print0|perl -ap0le 's/(.*)\..*/\1/s'|xargs -0 -i -n1 -P4 echo convert '{}'.jpg -resize 320x240 '{}'-sm.jpg
convert ./92.jpg -resize 320x240 ./92-sm.jpg
convert ./92
c2.jpg -resize 320x240 ./92
c2-sm.jpg
convert ./93.jpg -resize 320x240 ./93-sm.jpg
convert ./94.jpg -resize 320x240 ./94-sm.jpg
October 14th, 2009 at 6:08 pm
It’s funny how this post builds on itself. What I really needed was a way to make xargs treat each line as one input (and do one at a time). I get the feeling “this MUST exist” and start trying xargs parameters until I find -0, which does what I want. I’ve been doing that for a long time now. I never bothered to learn what it’s really for or why it works. I’m guessing from ya’lls reaction my use of -0 is extremely non-standard and/or not even correct (despite the fact that it apparently works fine for what I’ve been doing)
At any rate, none of the solutions are as “nice” or clean as I’d like… even using xargs to parallelize jobs seems awkward to me.
October 14th, 2009 at 6:09 pm
I’ve had to run those kinds of jobs before and wanted exactly that kind of tool, not being able to find it at the time, I wrote one:
http://asymmetrical-view.com/personal/code/perl/parallel-jobs.readme
http://asymmetrical-view.com/personal/code/perl/parallel-jobs
Example ‘command file’:
# specify several downloads to be run in parallel
wget http://some.host/software.tar.gz
wget http://some.host/database.mdb
wget http://some.host/movie-trailer.mpeg
wget http://some.host/linux-distrubtion.iso
Run as:
parallel-jobs --cmdfile=file.cmd --maxjobs=4
October 14th, 2009 at 6:42 pm
You can also put the commands into a makefile and use make -j
For example, I have a shell script which builds a makefile from a list of URLs, then calls make -j 4 to download four URLs in parallel using curl. e.g.
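#!/bin/sh
# Build a Makefile with one download target per URL, then let make run them in parallel.
(
echo "all: download_all"
echo "objects ="
for link in "$@"; do
file="${link##*/}" # basename
echo "${file}:"
printf "\tcurl -s -S -O '%s'\n" "$link"
echo "objects += $file"
done
# download_all depends on every file collected above
echo 'download_all: $(objects)'
) > Makefile
make -j 4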
Let’s hope my comment makes it through unscathed…
October 14th, 2009 at 6:43 pm
Ugh, sorry for the smartquotes. It mostly made it through unscathed. There’s a trailing quote missing on one line, but you get the idea.
October 14th, 2009 at 7:09 pm
Guess you haven’t seen the mogrify command in ImageMagick
My understanding of ImageMagick is that a mogrify command on a set of files will be executed in series, but that ImageMagick may be able to parallelize each operation (resizing in this case) to make good use of multiple cores:
http://www.imagemagick.org/script/architecture.php#threads
October 14th, 2009 at 7:28 pm
Using ls or find to get filenames seems wasteful; the * expansion will do that for you anyway. I’d do something like this:
for i in *.jpg; do echo convert "$i" -resize 320x240 "${i%.jpg}-sm.jpg"; done
If I wanted to run these commands in parallel, I’d just take out the echo and use & (which won’t restrict the number of processes to the number of cores, but ah well).
for i in *.jpg; do convert "$i" -resize 320x240 "${i%.jpg}-sm.jpg" & done; wait
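If the unbounded & is a concern, one portable way to cap the number of jobs is to wait after every batch. A rough sketch, assuming a batch size of 4; resize is a hypothetical stand-in that just copies the file so the example runs without ImageMagick, and the scratch directory and its file names are only there for the demo.

```shell
# Hypothetical stand-in for: convert "$1" -resize 320x240 "${1%.jpg}-sm.jpg"
resize() { cp "$1" "${1%.jpg}-sm.jpg"; }

# Scratch directory with a few empty sample files.
batch_dir=$(mktemp -d)
touch "$batch_dir/1.jpg" "$batch_dir/2.jpg" "$batch_dir/3.jpg" "$batch_dir/4.jpg" "$batch_dir/5.jpg"

max_jobs=4
running=0
for i in "$batch_dir"/*.jpg; do
    resize "$i" &
    running=$((running + 1))
    # Once a full batch is in flight, wait for all of it before starting more.
    if [ "$running" -ge "$max_jobs" ]; then
        wait
        running=0
    fi
done
wait    # pick up the final partial batch
```

Batching is cruder than a real job pool, since one slow file holds up the whole batch, but it needs nothing beyond the shell.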
October 14th, 2009 at 7:39 pm
Came across this the other day which looks like it’s exactly what you want:
parallelize: Shell utility to execute command batches in parallel:
http://www.marco.org/36737831
October 14th, 2009 at 8:35 pm
screen?
October 14th, 2009 at 9:44 pm
A kind chap wrote this Perl script for me, after I complained about a similar situation you posed in your last question:
http://gist.github.com/131021
I call it parallellpool, runs -s number of -c processes at a time, using the file passed to -i as arguments (the output of ls works too)
Although I wish it had some better output (progress bar? pv?) and monitoring. Although at that point I usually just graduate to an actual work queue server.
October 14th, 2009 at 11:07 pm
I’ve faced similar issues in the past. A couple of points,
----
Don’t pipe from ls, instead do:
for i in *; do echo "$i" ; done
----
Use GraphicsMagick ( http://www.graphicsmagick.org/ ) instead of ImageMagick.
October 14th, 2009 at 11:20 pm
tee?
October 14th, 2009 at 11:22 pm
Sorry, duh, I posted before I read the whole post. Please delete, I’m “and idiot”.
October 15th, 2009 at 7:27 am
I think you want runN (http://aaroncrane.co.uk/2008/07/runN/). It’s, I think, exactly what you were looking for. (Look at the bottom of the blog post for a link to the source).
October 15th, 2009 at 8:04 am
afaik, “treat each line as one input” is the default mode of xargs. “do one at a time” is -n1. the intended purpose of -0 is to change the notion of a “line” from newline-termination to null-termination, so that files with newlines in their names can be processed properly.
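A quick experiment makes the splitting rules concrete. Without -0, xargs separates arguments on blanks as well as newlines, so a line containing a space becomes two arguments; with -0, only NUL bytes separate them. The sample strings are made up.

```shell
# Default mode: blanks and newlines both separate arguments,
# so echo runs three times, once each for a, b and c.
printf 'a b\nc\n' | xargs -n 1 echo

# With -0 only NUL separates arguments,
# so echo runs twice, once for "a b" and once for "c".
printf 'a b\0c\0' | xargs -0 -n 1 echo
```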
October 15th, 2009 at 4:34 pm
See also mdm.
January 27th, 2012 at 12:13 am
It’s been forever since I posted a comment here, and just was reminded of this post when I was going through bookmarks.
Check out GNU Parallel for doing exactly this, and so much more! http://www.gnu.org/software/parallel/