
Yet another "parallelize your Python* workflow" tutorial

Python and parallelization have an awkward relationship, full of 'gotchas' and the GIL. This post takes a slightly different approach to speeding up your Python* workflows through parallelization.

*Also works great for other programming languages!

Python has an internal lock called the global interpreter lock (GIL) that allows only one thread to execute Python bytecode at a time within a single interpreter process. So, despite what libraries like threading or asyncio seem to promise, unless you are careful (yes, it can be done, e.g. with multiprocessing) you might unknowingly end up with 'parallelized' code that blocks on the GIL and effectively runs sequentially. Here, we are going to let the GIL do its job and shift the parallel abstraction to a slightly different level -- the shell.

My favorite way to run Python code in parallel is a three-step process:

  1. Identify variables in your script that you can parallelize
  2. Wrap the script/program in a thin command line interface (CLI)
  3. Run the script from the shell using GNU parallel

Note, this method covers cases that are commonly called "embarrassingly parallel", meaning that we are not doing any fancy inter-process communication or sharing any in-memory data. This approach is useful when you have a task and you want to do that task many times -- similar to the split-apply-combine approach.

The reason I like this method is that it off-loads each piece of the work into a realm where the tool is good at what it does: Python does the heavy lifting and has great libraries for making CLIs, the shell is great at invoking processes, and GNU parallel has some very nice syntax for running things, well, in parallel.

I have a program called Camoco that calculates the 'overlap' between Genome Wide Association Studies (GWAS) and gene co-expression networks. Don't worry about what Camoco is doing under the hood -- it doesn't matter here. What is important is that Camoco takes as input two mandatory arguments: a GWAS dataset and a network. It also has some numerical parameters that allow for fine tuning the analysis.

Python has great built-in libraries for quickly adding a command line interface to a script, with the added benefit that you get nicely auto-generated help pages and usage strings. Running the camoco overlap command with no parameters gives us:

$ camoco overlap       
usage: camoco overlap [-h] [--genes [GENES [GENES ...]]] [--gwas GWAS]
                      [--go GO] [--terms [TERMS [TERMS ...]]]
                      [--skip-terms [SKIP_TERMS [SKIP_TERMS ...]]]
                      [--min-term-size 2] [--max-term-size None]
                      [--snp2gene effective] [--strongest-attr pval]
                      [--strongest-higher] [--candidate-window-size 1]
                      [--candidate-flank-limit 0] [--num-bootstraps auto]
                      [--out OUT] [--dry-run]
                      cob {density,locality}
camoco overlap: error: the following arguments are required: cob, method
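For reference, a thin CLI like that can be sketched with argparse from the standard library. This is not Camoco's actual code -- just a minimal, hypothetical version of the same interface, with most of the optional flags omitted:

```python
import argparse

def build_parser():
    # Mirror the (hypothetical) interface shown in the usage string above.
    parser = argparse.ArgumentParser(prog="camoco overlap")
    parser.add_argument("cob", help="co-expression network to use")
    parser.add_argument("method", choices=["density", "locality"])
    parser.add_argument("--gwas", help="GWAS dataset to overlap")
    parser.add_argument("--candidate-window-size", type=int, default=1)
    parser.add_argument("--candidate-flank-limit", type=int, default=0)
    parser.add_argument("--out", help="output file name")
    return parser

# Normally you would call parse_args() with no arguments to read sys.argv;
# here we pass an explicit list so the sketch is self-contained.
args = build_parser().parse_args(
    ["ZmPAN", "density", "--gwas", "ZmWallace",
     "--candidate-window-size", "50000",
     "--candidate-flank-limit", "1",
     "--out", "ZmWallace_ZmPAN_density_50000_1.csv"]
)
print(args.cob, args.method, args.candidate_window_size)
```

The positional/optional split and the auto-generated usage text come for free from argparse; the script itself stays a thin wrapper around whatever function does the real work.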

Again, the specifics of these parameters aren't important. What is important is that after some tweaking, I can get an instance of Camoco to calculate overlap for a specific set of parameters:

$ camoco overlap ZmPAN density --gwas ZmWallace --candidate-window-size 50000 --candidate-flank-limit 1 --out ZmWallace_ZmPAN_density_50000_1.csv

Here, I have several parameters that run a specific analysis. We can break down the syntax of the command like so:

$ camoco overlap <Network> <Method> --gwas <GWAS_Dataset> --candidate-window-size <Window_Size> --candidate-flank-limit <Flank_Limit> --out <Output_File_Name>

Suppose I wanted to run several analyses and compare the results from each run. Here are the parameter lists I want to generate commands with:

<Network> = {ZmPAN, ZmSAM, ZmRoot}
<Method> = {density, locality}
<GWAS_Dataset> = {ZmWallace, ZmIonome}
<Window_Size> = {50000, 100000, 500000}
<Flank_Limit> = {1, 2, 5}
<Output_File_Name> = ???

Given these parameters, it looks like I want to do 3x2x2x3x3 = 108 different runs of this program. Also, in the original command, notice how the output filename is a function of all the variables used to build the command? This means that the output name is a composite variable and needs to be dynamically generated. For now we've just labeled that set with ???.
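If you are curious where 108 comes from, the cross product of those parameter sets can be enumerated in a few lines of Python (a sketch; the command template and filename convention mirror the ones above):

```python
from itertools import product

networks = ["ZmPAN", "ZmSAM", "ZmRoot"]
methods = ["density", "locality"]
gwas_datasets = ["ZmWallace", "ZmIonome"]
window_sizes = [50000, 100000, 500000]
flank_limits = [1, 2, 5]

commands = []
for net, method, gwas, window, flank in product(
        networks, methods, gwas_datasets, window_sizes, flank_limits):
    # The output filename is a composite of every other parameter.
    out = f"{gwas}_{net}_{method}_{window}_{flank}.csv"
    commands.append(
        f"camoco overlap {net} {method} --gwas {gwas} "
        f"--candidate-window-size {window} "
        f"--candidate-flank-limit {flank} --out {out}"
    )

print(len(commands))  # 3 * 2 * 2 * 3 * 3 = 108
```

You could pipe a list like this straight into the shell, but as we'll see next, GNU parallel builds the same cross product with far less ceremony.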

Enter GNU Parallel

GNU parallel is a program that is used to build and execute shell commands in parallel. It has tons of ways to accomplish this, but today I am going to focus on its ability to wrap a shell command and interpolate variables. Here is our first example.

$ parallel --dry-run camoco overlap {1} density --gwas ZmWallace --candidate-window-size 50000 --candidate-flank-limit 1 --out ZmWallace_ZmPAN_density_50000_2.csv ::: ZmSAM ZmPAN ZmRoot
camoco overlap ZmSAM density --gwas ZmWallace --candidate-window-size 50000 --candidate-flank-limit 1 --out ZmWallace_ZmPAN_density_50000_2.csv
camoco overlap ZmPAN density --gwas ZmWallace --candidate-window-size 50000 --candidate-flank-limit 1 --out ZmWallace_ZmPAN_density_50000_2.csv
camoco overlap ZmRoot density --gwas ZmWallace --candidate-window-size 50000 --candidate-flank-limit 1 --out ZmWallace_ZmPAN_density_50000_2.csv

First notice the general syntax of parallel:

$ parallel --dry-run <command> ::: {arguments}

Notice that, with the --dry-run flag, parallel will not execute the command; instead it will print it out to the shell so you can see if everything got built correctly. Also notice that since parallel isn't executing the camoco command (it just prints the command it would have run), you can try this out on your own without having Camoco installed.

If we look closely, we can see that parallel interpolates instances of {1} with each argument after the :::, namely ZmSAM, ZmPAN and ZmRoot. If you look even closer, you can see our first bug! We forgot to replace the network name in the --out portion of the command. This can be fixed by substituting {1} for the hard-coded ZmPAN in the output filename.

$ parallel --dry-run camoco overlap {1} density --gwas ZmWallace --candidate-window-size 50000 --candidate-flank-limit 1 --out ZmWallace_{1}_density_50000_2.csv ::: ZmSAM ZmPAN ZmRoot
camoco overlap ZmSAM density --gwas ZmWallace --candidate-window-size 50000 --candidate-flank-limit 1 --out ZmWallace_ZmSAM_density_50000_2.csv
camoco overlap ZmPAN density --gwas ZmWallace --candidate-window-size 50000 --candidate-flank-limit 1 --out ZmWallace_ZmPAN_density_50000_2.csv
camoco overlap ZmRoot density --gwas ZmWallace --candidate-window-size 50000 --candidate-flank-limit 1 --out ZmWallace_ZmRoot_density_50000_2.csv

That's better! Now let's tackle the next variable interpolation. We do this by adding an additional set of ::: arguments and using the replacement token {2}.

$ parallel --dry-run camoco overlap {1} {2} --gwas ZmWallace --candidate-window-size 50000 --candidate-flank-limit 1 --out ZmWallace_{1}_{2}_50000_2.csv ::: ZmSAM ZmPAN ZmRoot ::: density locality
camoco overlap ZmSAM density --gwas ZmWallace --candidate-window-size 50000 --candidate-flank-limit 1 --out ZmWallace_ZmSAM_density_50000_2.csv
camoco overlap ZmSAM locality --gwas ZmWallace --candidate-window-size 50000 --candidate-flank-limit 1 --out ZmWallace_ZmSAM_locality_50000_2.csv
camoco overlap ZmPAN density --gwas ZmWallace --candidate-window-size 50000 --candidate-flank-limit 1 --out ZmWallace_ZmPAN_density_50000_2.csv
camoco overlap ZmPAN locality --gwas ZmWallace --candidate-window-size 50000 --candidate-flank-limit 1 --out ZmWallace_ZmPAN_locality_50000_2.csv
camoco overlap ZmRoot density --gwas ZmWallace --candidate-window-size 50000 --candidate-flank-limit 1 --out ZmWallace_ZmRoot_density_50000_2.csv
camoco overlap ZmRoot locality --gwas ZmWallace --candidate-window-size 50000 --candidate-flank-limit 1 --out ZmWallace_ZmRoot_locality_50000_2.csv

Here, {1} gets replaced with values from the first set of ::: arguments, {2} with values from the second set, and so on. Let's skip ahead and look at the fully built parallel command.

$ parallel --dry-run camoco overlap {1} {2} --gwas {3} --candidate-window-size {4} --candidate-flank-limit {5} --out {3}_{1}_{2}_{4}_{5}.csv ::: ZmSAM ZmPAN ZmRoot ::: density locality ::: ZmWallace ZmIonome ::: 50000 100000 500000 ::: 1 2 5 

I don't show the output here since it is 108 lines long. But again, since we are using the --dry-run argument, you can try this yourself and see the output.

Notice how we dynamically built the --out filename, even when the numbers were out of order!

Controlling your new parallel powers

If we just remove the --dry-run argument from the command, the commands will be executed instead of printed -- all in parallel. I'm not sure what kind of computer you are running, but if I kicked off 108 instances of Camoco on my computer it would quickly use up all the available resources. We can control the number of processes that run at any given time with the -j flag. Let's be conservative and only run 4 instances at a time.

$ parallel -j 4 --dry-run camoco overlap {1} {2} --gwas {3} --candidate-window-size {4} --candidate-flank-limit {5} --out {3}_{1}_{2}_{4}_{5}.csv ::: ZmSAM ZmPAN ZmRoot ::: density locality ::: ZmWallace ZmIonome ::: 50000 100000 500000 ::: 1 2 5 

Now, a maximum of 4 processes will run at a time. Once a job finishes, the next one is started. Jobs are started as fast as they can spawn, and in the case of Camoco, firing off parallel jobs at the same time causes some of the database tables to get spammed, which makes them unhappy. No worries though! Parallel can stagger the start of its commands with the --delay flag.

$ parallel --delay 5 -j 4 --dry-run camoco overlap {1} {2} --gwas {3} --candidate-window-size {4} --candidate-flank-limit {5} --out {3}_{1}_{2}_{4}_{5}.csv ::: ZmSAM ZmPAN ZmRoot ::: density locality ::: ZmWallace ZmIonome ::: 50000 100000 500000 ::: 1 2 5

Now parallel will wait 5 seconds between spawning a new process.

Future steps

One last thing to keep in mind is the amount of memory each process will require. We limited the number of processes parallel kicks off with the -j flag, which is usually bounded by the number of CPUs your computer has. But if each process started by parallel requires a large amount of memory (say, 50% of what's available), running 4 processes at the same time will quickly eat up all the memory on the system. One way to mitigate this is to reduce the number of processes running (i.e. set -j to 2).

Another way (and the way Camoco solves this) is to store large data structures in a database that can be read in parallel. While building these databases is out of the scope of this post, the idea remains similar to how this post was structured. Since Python was not built with parallelism in mind, it doesn't have very friendly tools (that I know of) for sharing data in memory. To solve this, we simply do not use Python for that part. Instead we shift the problem to another tool: store the data in a database (e.g. SQLite), then use a nice Python API to access it in parallel. The database handles the dirty work of fielding requests from all the different processes asking for data. This does not come without limitations; we had to delay our processes so as to not overload the database.
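As a concrete (and very simplified) illustration of that last idea, here is a sketch using the sqlite3 module from the standard library. The database, table, and scores below are made up for the example -- this is not Camoco's actual storage layer -- but the pattern is the same: one process writes the data up front, and each parallel worker opens its own read-only connection.

```python
import sqlite3

# One writer builds a small on-disk database up front.
db = "genes.db"  # hypothetical example database, not Camoco's schema
con = sqlite3.connect(db)
con.execute("CREATE TABLE IF NOT EXISTS genes (name TEXT, score REAL)")
con.execute("DELETE FROM genes")
con.executemany(
    "INSERT INTO genes VALUES (?, ?)",
    [("gene_a", 0.1), ("gene_b", 0.5), ("gene_c", 0.9)],
)
con.commit()
con.close()

def top_genes(threshold):
    # Each parallel process opens its own read-only connection;
    # SQLite happily serves many concurrent readers.
    ro = sqlite3.connect(f"file:{db}?mode=ro", uri=True)
    rows = ro.execute(
        "SELECT name FROM genes WHERE score > ? ORDER BY name", (threshold,)
    ).fetchall()
    ro.close()
    return [name for (name,) in rows]

print(top_genes(0.4))
```

Because every process only reads, there is no locking to worry about on the Python side; the writes that do happen are confined to a setup step before the parallel jobs start.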

By shifting the task of parallelizing our script to the shell and leveraging the concise and powerful syntax of parallel, we can quickly run our Python script in parallel. Unfortunately, all of our processes are isolated, which makes this approach a poor fit for complicated tasks that require inter-process communication or shared data -- though the same is true of many of Python's built-in libraries for parallelization. The recipe described here covers a large number of the use cases you might run into when you want to run your Python scripts in parallel, and the strategy is not limited to just Python. We can run an appropriate number of processes, let them do their jobs, and control them at a high level using flags built into parallel.

Happy hacking!


Cover Photo by Dmitri Popov on Unsplash