{ on programming and the internets }


by Louis Brandy

This MUST already exist

I encounter this feeling frequently. I have a sudden need (or idea) and I realize that such and such a tool or service must already exist. As a software engineer, I’ve learned not to let that feeling pass. Usually, it’s a chance to learn something (because usually it does exist) and sometimes it’s a chance to build something useful and new.

The time that my google-fu failed

In our work, we tend to chain a bunch of commands together like this:

./face_detector input.mpg | ./face_tracker | ./visualize

This is a great formalism. The problem is that sometimes something goes horribly, horribly wrong. Let’s say this time the error occurred between step #2 and step #3. This isn’t so bad unless, say, step #1 takes a few hours to run and its standard output has vanished into the never-ending ether. Now we’ve lost the output of step #1 and need to rerun it. This isn’t a complicated problem. All I really need is a program that takes what comes in on standard input, writes it to a file, and then pipes it straight back out to standard output. How does this not exist? I had this problem a while ago and searched the internet. Nothing. I wrote a 5-line Python script to do it for me.
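The script itself isn't reproduced here, but something along these lines would do the job. This is a sketch rather than the original: the function name copy_through, the file name save_stdin.py and the 64 KiB chunk size are all illustrative choices.

```python
import sys


def copy_through(src, saved, out, chunk_size=64 * 1024):
    """Copy everything from src to both saved and out, chunk by chunk."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty read means end of input
            break
        saved.write(chunk)  # keep a copy on disk
        out.write(chunk)    # and pass the data along unchanged
    out.flush()


# Hypothetical usage:
#   ./face_detector input.mpg | python save_stdin.py step1.out | ./face_tracker
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], "wb") as saved_file:
        copy_through(sys.stdin.buffer, saved_file, sys.stdout.buffer)
```

Working on the binary buffers rather than on text keeps the data byte-for-byte identical, which matters if step #1 emits anything other than plain text.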

Six months later I’m wandering around reddit reading some topic about “nifty unix programs that no one knows about” and someone mentions tee. Well, I’ll be. There it is! I guess I don’t need my little python script anymore.

Four more months go by, and I finally need tee again. Except now I don’t remember what it is called. And I can’t find it or the reddit post that contained it!

Two more months go by and I decide to write a post about it. I still can’t find the reddit post. I suck at the internet.

Someone already built it

This is an actual IM conversation I had recently:

me: I've had this song stuck in my head for like 5 days....
me: I think it's a christmas song... but I can't remember any of the words
me: no words = no google = pure torture
me: i looked through lists of xmas songs... nothing
me: i even searched for sites that let me hum the song and they tell me the name..
him: there is one, saw it on reddit
him: you can "tap it out" on some webpage
him: and it tells you what it is
me: this is a lie
him: there is a webpage i have no idea how well it works
<tick-tock tick-tock>
me: HOLY
me: http://www.bored.com/songtapper/s/tappingmain.bin?dotap=1
me: it worked!
me: are you fucking kidding me?

It was Greensleeves, by the way.

This MUST exist — Parallelizing commands on the command line

At our office we constantly need to fire off a whole bunch of really time-consuming jobs. Here is an example. Let’s say you want to test how the face detector behaves at various resolutions. So, you want to resize a directory full of jpegs to some other resolution. Here’s how you might do it:

$ ls -1 *.jpg | awk -F. '{printf "convert %s.jpg -resize 320x240 %s-sm.jpg\n",$1,$1}'
convert 92.jpg -resize 320x240 92-sm.jpg
convert 93.jpg -resize 320x240 93-sm.jpg
convert 94.jpg -resize 320x240 94-sm.jpg
...
$ !! | sh

Quick word about what this does: The first command creates the commands and just prints them out. The second line pipes the output (!! means repeat the previous command) into sh. In other words, it executes the previous output as commands.

The problem with this solution is that it is single-threaded. [update: as I write this, I realize a second problem and yet another 'this must already exist' moment. I use awk in this manner constantly, and it seems a bit awkward. Maybe there is a cleaner way to do this?] Most of our machines have 4 cores, and it would be really nice to have something like this finish 4x as quickly by using all of them. What I want to do is this:

$ ls *.jpg | awk -F. '{printf "convert %s.jpg -resize 320x240 %s-sm.jpg\n",$1,$1}'
convert 92.jpg -resize 320x240 92-sm.jpg
convert 93.jpg -resize 320x240 93-sm.jpg
convert 94.jpg -resize 320x240 94-sm.jpg
...
$ !! | gogo -t4 

The mythical ‘gogo’ program would take a list of commands on standard input and run them N at a time (4, in this case). It is apparently a well-kept secret that (GNU’s) xargs is capable of this:

$ ls *.jpg | awk -F. '{printf "%s.jpg -resize 320x240 %s-sm.jpg\n",$1,$1}'
92.jpg -resize 320x240 92-sm.jpg
93.jpg -resize 320x240 93-sm.jpg
94.jpg -resize 320x240 94-sm.jpg
...
$ !! | xargs -0 -n 1 -P 4 convert

You’ll notice you have to move the command (convert) to the xargs command, and then remember all of the parameters (-0, -n, -P). I find the above command to be extremely awkward to use, in practice, and almost never do it. The problem is that xargs allows you to do so many other things that if you don’t use it frequently, it’s terribly easy to forget all the parameters you need to do this properly.
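For comparison, the mythical gogo is small enough to sketch in Python. This is an illustration under assumptions, not an existing tool: the names resize_command and run_commands and the file name gogo.py are made up here, each input line is handed to the shell as-is, and convert is assumed to be the same ImageMagick command used in the examples above.

```python
import shlex
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def resize_command(name, size="320x240"):
    """Build one shell-quoted convert command, e.g. 92.jpg -> 92-sm.jpg."""
    source = Path(name)
    target = source.with_name(source.stem + "-sm" + source.suffix)
    return shlex.join(["convert", str(source), "-resize", size, str(target)])


def run_commands(commands, jobs=4):
    """Run shell command lines, at most `jobs` at a time.

    Blank lines are skipped. Exit codes are returned in input order.
    """
    lines = [c.strip() for c in commands if c.strip()]
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # Each worker thread only waits on a child process, so threads suffice.
        return list(pool.map(lambda c: subprocess.run(c, shell=True).returncode, lines))


# Hypothetical usage: ls *.jpg | awk ... | python gogo.py 4
if __name__ == "__main__" and len(sys.argv) == 2:
    codes = run_commands(sys.stdin, jobs=int(sys.argv[1]))
    print(f"{codes.count(0)} of {len(codes)} commands succeeded", file=sys.stderr)
```

The resize_command helper stands in for the awk step: it splits off the extension with pathlib and quotes each argument, so a name containing spaces still produces a valid command line.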


20 Responses to “This MUST already exist”

  1. October 12th, 2009 at 8:19 am

    Aaron Davies says:

    what’s with the -0? i don’t see anything putting nulls in your input stream

    xargs can also do find -exec-style token replacement with -i, so no need for awk

    ls *.jpg | xargs -i -n 1 -P4 convert '{}' -resize 320x240 '{}'-sm.jpg

  2. October 12th, 2009 at 8:40 am

    Pádraig Brady says:

    I think you forgot a tr ‘\n’ ” before the xargs :)

    Also have a look at http://github.com/erikfrey/bashreduce

  3. October 12th, 2009 at 9:05 am

    louis says:

    @aaron, the -0 ensures i get each line as a whole instead of having to do -n N where N is the number of arguments on each line… in the multi-argument example I showed, you need it. I guess you don’t need -n 1 in that case. I don’t put nulls in manually but it is apparently done because the -0 does what I want, which is treat each line of input as a single job for xargs to parallelize.

    Also, I know you can use the replacement string but I often use awk with -F to parse out things like paths or file extensions. Your solution would work fine if I wanted “asdf.jpg-sm.jpg” instead of the cleaner “asdf-sm.jpg”. Would there be a better awk-less way to do this?

    @padraig, not sure what you mean about removing newlines… and thanks for the link, I’ll go look at bashreduce now.

  4. October 12th, 2009 at 5:13 pm

    Pádraig Brady says:

    My comment is confusing I think as wordpress mangled it. I’ve posted my suggestion using my own comment engine which I wrote in 20 mins :)
    http://comments.pixelbeat.org/lbrandy.com
    Summarizing that linked command:

    I used `find` as it’s better for piping,
    whereas `ls` is generally best for human consumption.

    You need the `tr` to delimit each line with NUL so that xargs treats each line as a separate parameter.

    I also passed a -r to xargs so that the command is not run if there are no arguments.

    I also used sed as an alternative to the awk manipulation, just for illustration.

  5. October 13th, 2009 at 9:10 am

    Aaron Davies says:

    the intended use of -0 is to secure against buggy/hacked filenames by terminating on nulls, which are guaranteed by POSIX never to appear in filenames (cf find -print0)

    this will get the filenames as you want, though it’s getting a bit complicated:

    ls *.jpg|xargs -i -n1 basename '{}' .jpg|xargs -i -n1 -P4 convert '{}'.jpg -resize 320x240 '{}'-sm.jpg

    note the “secret mode” of basename :)

    unfortunately basename is archaic and has really awkward argument parsing (a sane version would allow “basename --strip-extension .jpg 92.jpg 93.jpg 94.jpg”, and could be called here with a plain “xargs basename -s .jpg”)

    you could also use cut instead of basename:

    ls *.jpg|cut -d. -f1|xargs -i -n1 -P4 convert '{}'.jpg -resize 320x240 '{}'-sm.jpg

    for correct use of -0, consider pathological pix with newlines in the name

    (at this point, i have to give up on most of the normal shell tools, very few of which can handle null-terminated input, and fall back on perl)

    $ ls -b1 *.jpg
    92\ \nc2.jpg
    92.jpg
    93.jpg
    94.jpg
    $ ls *.jpg|cut -d. -f1|xargs -i -n1 -P4 echo convert '{}'.jpg -resize 320x240 '{}'-sm.jpg
    convert 92 .jpg -resize 320x240 92 -sm.jpg
    convert c2.jpg -resize 320x240 c2-sm.jpg
    convert 93.jpg -resize 320x240 93-sm.jpg
    convert 92.jpg -resize 320x240 92-sm.jpg
    convert 94.jpg -resize 320x240 94-sm.jpg
    $ find -maxdepth 1 -name \*.jpg -print0|perl -ap0le 's/(.*)\..*/\1/s'|xargs -0 -i -n1 -P4 echo convert '{}'.jpg -resize 320x240 '{}'-sm.jpg
    convert ./92.jpg -resize 320x240 ./92-sm.jpg
    convert ./92
    c2.jpg -resize 320x240 ./92
    c2-sm.jpg
    convert ./93.jpg -resize 320x240 ./93-sm.jpg
    convert ./94.jpg -resize 320x240 ./94-sm.jpg

  6. October 14th, 2009 at 6:08 pm

    louis says:

    It’s funny how this post builds on itself. What I really needed was a way to make xargs treat each line as one input (and do one at a time). I got the feeling “this MUST exist” and started trying xargs parameters until I found -0, which does what I want. I’ve been doing that for a long time now. I never bothered to learn what it’s really for or why it works. I’m guessing from y’all’s reaction that my use of -0 is extremely non-standard and/or not even correct (despite the fact that it apparently works fine for what I’ve been doing).

    At any rate, none of the solutions are as “nice” or clean as I’d like… even using xargs to parallelize jobs seems awkward to me.

  7. October 14th, 2009 at 6:09 pm

    Kyle Burton says:

    I’ve had to run those kinds of jobs before and wanted exactly that kind of tool. Not being able to find one at the time, I wrote one:

    http://asymmetrical-view.com/personal/code/perl/parallel-jobs.readme

    http://asymmetrical-view.com/personal/code/perl/parallel-jobs

    Example ‘command file’:
    # specify several downloads to be run in parallel
    wget http://some.host/software.tar.gz
    wget http://some.host/database.mdb
    wget http://some.host/movie-trailer.mpeg
    wget http://some.host/linux-distrubtion.iso

    Run as:

    parallel-jobs --cmdfile=file.cmd --maxjobs=4

  8. October 14th, 2009 at 6:42 pm

    Jay Soffian says:

    You can also put the commands into a makefile and use make -j

    For example, I have a shell script which builds a makefile from a list of URLs, then calls make -j 4 to download four URLs in parallel using curl. e.g.

    #!/bin/sh
    (
    echo “all: download_all”
    echo “objects =“
    for link in “$@“; do
    file=”${link##*/}” # basename
    echo “${file}:”
    echo “\tcurl -s -S -O ‘$link’\”
    echo “objects += $file
    done
    ) > Makefile
    make -j 4

    Let’s hope my comment makes it through unscathed…

  9. October 14th, 2009 at 6:43 pm

    Jay Soffian says:

    Ugh, sorry for the smartquotes. It mostly made it through unscathed. There’s a trailing quote missing on one line, but you get the idea.

  10. October 14th, 2009 at 7:09 pm

    Mike Dillon says:

    Guess you haven’t seen the mogrify command in ImageMagick ;)

    My understanding of ImageMagick is that a mogrify command on a set of files will be executed in series, but that ImageMagick may be able to parallelize each operation (resizing in this case) to make good use of multiple cores:

    http://www.imagemagick.org/script/architecture.php#threads

  11. October 14th, 2009 at 7:28 pm

    Zededarian says:

    Using ls or find to get filenames seems wasteful; the * expansion will do that for you anyway. I’d do something like this:

    for i in *.jpg; do echo convert "$i" -resize 320x240 "${i%.jpg}-sm.jpg"; done

    If I wanted to run these commands in parallel, I’d just take out the echo and use & (which won’t restrict the number of processes to the number of cores, but ah well).

    for i in *.jpg; do convert "$i" -resize 320x240 "${i%.jpg}-sm.jpg" & done; wait

  12. October 14th, 2009 at 7:39 pm

    Duncan Lock says:

    Came across this the other day which looks like it’s exactly what you want:

    parallelize: Shell utility to execute command batches in parallel:

    http://www.marco.org/36737831

  13. October 14th, 2009 at 8:35 pm

    Dude says:

    screen?

  14. October 14th, 2009 at 9:44 pm

    Kevin Marsh says:

    A kind chap wrote this Perl script for me after I complained about a situation similar to the one in your last question:

    http://gist.github.com/131021

    I call it parallellpool, runs -s number of -c processes at a time, using the file passed to -i as arguments (the output of ls works too)

    I do wish it had better output (progress bar? pv?) and monitoring, although at that point I usually just graduate to an actual work queue server.

  15. October 14th, 2009 at 11:07 pm

    Imran Fanaswala says:

    I’ve faced similar issues in the past. A couple of points,

    ----
    Don’t pipe from ls, instead do:
    for i in *; do echo "$i" ; done

    ----
    Use GraphicsMagick ( http://www.graphicsmagick.org/ ) instead of ImageMagick.

  16. October 14th, 2009 at 11:20 pm

    Steve says:

    tee?

  17. October 14th, 2009 at 11:22 pm

    Steve says:

    Sorry, duh, I posted before I read the whole post. Please delete, I’m “and idiot”. ;)

  18. October 15th, 2009 at 7:27 am

    brunov says:

    I think you want runN (http://aaroncrane.co.uk/2008/07/runN/). It’s, I think, exactly what you were looking for. (Look at the bottom of the blog post for a link to the source).

  19. October 15th, 2009 at 8:04 am

    Aaron Davies says:

    afaik, “treat each line as one input” is the default mode of xargs. “do one at a time” is -n1. the intended purpose of -0 is to change the notion of a “line” from newline-termination to null-termination, so that files with newlines in their names can be processed properly.

  20. October 15th, 2009 at 4:34 pm

    Sam says:

    See also mdm.
