GNU Parallel can do anything, but scripting may be the better option

GNU Parallel is a utility that lets you run command jobs in parallel, both locally and on remote hosts over the network. It’s incredibly powerful when you need something more flexible than xargs, and it’s especially useful with small computer clusters.

You don’t need to set up and configure anything. There is no long-lived daemon process or central scheduler, as is customary in distributed compilers. You only need to set up key-based authentication for each remote host (so parallel can log in automatically), and install Perl and Parallel on each system. That’s it.

You can configure it to transfer files from anywhere on your local system to a remote host. However, it can’t transfer files back to any directory other than the current working directory. For whatever reason, my odd jobs often involve moving files between different directories, and GNU Parallel complicates that tremendously. You’ll need to introduce more complications in the form of another tool to move files elsewhere on your system.

GNU Parallel can help you get jobs done quicker, but you need to do a cost-benefit analysis before you use it. Its command syntax is unique and non-intuitive for command-line users. Getting it up and running can be a real head-scratcher. I spend hours in the command line every week, but I can’t make heads or tails of Parallel’s unique syntax.

I recently wanted to use GNU Parallel for a time-consuming job that was well suited to parallelization. I needed something to send files to a remote cluster, execute a few commands, wait for them to finish, and transfer the results back. GNU Parallel is perfectly suited to this, but it took me hours to get going, and even longer to get it just right.

Most of that time was spent reading and re-reading the manuals for parallel, rsync, and ssh. GNU Parallel is built on top of the latter two tools, and you need to be very familiar with both to intuit how parallel will behave in different situations. The different arguments you give parallel often influence the underlying tools in ways not detailed in the manual.

Over the years, I’ve often expected to be able to quickly set up GNU Parallel and put it to use for odd jobs now and then. However, I never seem to develop any familiarity with it and always run into unique problems that I’ve never experienced before. My requirements change just enough every time that my cheatsheet and notes from our previous encounter are of no help to me in the present.

Its manual page is excellently written, but also practically incomprehensible unless you’re already very familiar with the tool. The tool is meant to solve complex problems, so its workings and documentation are necessarily also complex. However, the examples often demonstrate how one command works in conjunction with other commands and special syntaxes. The examples make me feel like I’m watching a magician pull tricks out of a hat, and I still don’t have a firm grasp of how the command is supposed to work.

There are also many situations where GNU Parallel behaves unexpectedly, or where combining arguments causes the underlying tools (primarily rsync) to change behavior in unexpected ways. For example, the --cleanup argument is outright dangerous! It tries to recursively remove the files and directories it has transferred files into instead of just the files themselves. You can easily end up instructing it to delete system-critical directories like /tmp or personal directories like ~/Documents. Use it with great care!

As many times before, I managed to get GNU Parallel to work in the end, but it felt like it fought me every step of the way. I ended up spending way more time than I’d expected, and it needed a metric ton of scaffolding and helper scripts. I needed multiple scripts to prepare the data sources and arguments into a format usable by GNU Parallel. I needed a script to move the returned result files to the right locations. It’s tricky to use GNU Parallel with files that live outside the current working directory. GNU Parallel can do it, but the specifics require digging through the manual for an hour.

I believe part of the problem is that I’ve given GNU Parallel more credit than it’s due. I often try to shoehorn it into my workflow, but it just never seems to fit. In essence, GNU Parallel handles queueing of multiple input data sources, transferring files to remote hosts with rsync, and executing commands remotely with ssh. It’s frankly easier to do this with a quick Ruby script than with GNU Parallel.
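To make that concrete, here is a rough sketch of what such a script might look like. It is written in Python rather than Ruby, and it is only an illustration under a few assumptions: key-based ssh logins already work, rsync is installed on both ends, and paths contain no spaces. The Job class, the function names, and the ".out" suffix for results are all made up for this example. Note that the result can land in any local directory, not just the current one.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Job:
    """One unit of work (hypothetical structure for this sketch)."""
    host: str          # ssh host, assumed to accept key-based logins
    local_input: str   # local file to send
    remote_dir: str    # scratch directory on the remote host
    command: str       # command template with {input} and {output} placeholders
    local_output: str  # where the result should land locally (any directory)


def build_steps(job):
    """Return the ssh/rsync command lines for one job, in execution order."""
    name = PurePosixPath(job.local_input).name
    remote_in = f"{job.remote_dir}/{name}"
    remote_out = f"{remote_in}.out"  # assumed naming convention for results
    remote_cmd = job.command.format(
        input=shlex.quote(remote_in), output=shlex.quote(remote_out)
    )
    return [
        ["ssh", job.host, f"mkdir -p {shlex.quote(job.remote_dir)}"],
        ["rsync", "-a", job.local_input, f"{job.host}:{remote_in}"],
        ["ssh", job.host, remote_cmd],
        ["rsync", "-a", f"{job.host}:{remote_out}", job.local_output],
    ]


def run_job(job, run=subprocess.run):
    """Run every step of one job; check=True aborts the job on the first failure."""
    for step in build_steps(job):
        run(step, check=True)
    return job.local_output


def run_all(jobs, workers=4, run=subprocess.run):
    """Run jobs concurrently; threads are enough because the work is in subprocesses."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda job: run_job(job, run), jobs))
```

Calling run_all with a list of Job objects pushes each input, runs the command remotely, and pulls the result back to wherever local_output points. The run parameter exists only so the command lines can be inspected or dry-run without touching the network.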

This is the core of my problem with GNU Parallel: The simple tasks it can solve can also be achieved with a simpler tool like xargs, and the more difficult tasks are easier to accomplish with a full scripting language. I end up writing scripts to pre-process and prepare the tasks anyway. I can just as well execute the tasks through the same task generator script, without complicating the process by involving GNU Parallel. I don’t believe I’ll reach for GNU Parallel again in the future.