Loops in Bioinformatic Pipelines
A common requirement of many bioinformatics pipelines is completing the same analysis on a list of files. This is simple to do with a bit of Bash scripting:
#!/bin/bash
#SBATCH --job-name=ustacks
#SBATCH --output=ustacks.log
#SBATCH --time=360:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=8G
source ~/miniconda3/etc/profile.d/conda.sh
conda activate stacksM
M=4
reads_dir=./prort/concat/
out_dir=ustacks
threads=8
while read SAMPLE; do
    echo "$SAMPLE"
    ustacks -t gzfastq -f ${reads_dir}${SAMPLE}.1.fq.gz \
        -o ${out_dir} -m 3 \
        --name ${SAMPLE} -M $M -p ${threads}
done < SAMPLES.txt
This script reads each line of the file SAMPLES.txt and passes that value to the program ustacks.
That’s fine, but it can require a lot of cluster time. In this case, I have 96 files listed in SAMPLES.txt, and each of them will take three or four hours to process. That’s 15 days. In reality, it may take much longer, as the cluster may not be able to schedule my job immediately given its length (at least on the cluster I use!).
Replacing Loops with Array Jobs
However, each of these 96 jobs can be processed at the same time. There’s no interaction between them. If, instead of asking for a single 15-day job, I ask for 96 six-hour jobs, it’s easier for the cluster to schedule them, and my total run time drops to something closer to the length of a single job (maybe 6-12 hours, depending on how quickly they get scheduled).
This is what array jobs do. I can replace the script above with this:
#!/bin/bash
#SBATCH --job-name=ustacks%a
#SBATCH --output=ustacks_%a.log
#SBATCH --time=6:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=8G
source ~/miniconda3/etc/profile.d/conda.sh
conda activate stacksM
M=4
reads_dir=./prort/concat/
out_dir=ustacks
threads=8
SAMPLE=$(sed "${SLURM_ARRAY_TASK_ID}q;d" SAMPLES.txt)
echo "$SAMPLE"
ustacks -t gzfastq -f ${reads_dir}${SAMPLE}.1.fq.gz \
    -o ${out_dir} -m 3 \
    --name ${SAMPLE} -M $M -p ${threads}
It’s nearly the same as the original. The key difference is that I’ve replaced the while loop with a single variable, set using a line of sed:
SAMPLE=$(sed "${SLURM_ARRAY_TASK_ID}q;d" SAMPLES.txt)
Sed is a very powerful tool for processing text. In this case, I’ve asked it to search through the file SAMPLES.txt, skip forward to line ${SLURM_ARRAY_TASK_ID}, print that line, and then quit.
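If the q;d incantation looks cryptic, sed -n with the p command does the same job (I use that form later in this post); the q;d version just stops reading the file as soon as it has printed the line it wants. With a hard-coded line number, the two look like this:
sed "5q;d" SAMPLES.txt     # print line 5, then quit
sed -n "5p" SAMPLES.txt    # suppress auto-printing (-n) and print only line 5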
Where does SLURM_ARRAY_TASK_ID come from? That variable is set for us automatically when we submit our script as an array job. Instead of the usual invocation of Slurm:
sbatch my_script.sh
We’ll do this:
sbatch --array=1-96 my_script.sh
When we do this, it tells Slurm that we want to submit 96 jobs. Each job will run identical code, except for one thing: Slurm will set the variable SLURM_ARRAY_TASK_ID to a different value in each one.
That value is also inserted into my job name and log file, using the %a placeholder:
#SBATCH --job-name=ustacks%a
#SBATCH --output=ustacks_%a.log
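Slurm’s output file names can also include %A, the parent array job’s ID. Combining the two is a small, optional refinement that keeps logs from different submissions of the same script separate:
#SBATCH --output=ustacks_%A_%a.log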
With that little change, instead of waiting a month for my 15-day job to run, I wait a day for 96 six-hour jobs to run.
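Another optional refinement: if launching all 96 tasks at once is more than your cluster (or its administrators) will appreciate, the --array syntax also accepts a throttle after a % sign that limits how many tasks run simultaneously:
# Same 96 tasks, but never more than 10 running at a time
sbatch --array=1-96%10 my_script.sh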
Selectively Repeating Failed Jobs
Another benefit of this approach is that it allows you to easily resubmit only a subset of your jobs. In the example above I’ve asked for six hours for each file. If some of those files happen to require more time than that, they will fail with a timeout. You’ll see this in your job logs. Here’s an example from an array where I requested 36 hours per job:
date
sacct --jobs=3069231 --format=jobid,jobname,state,elapsed | grep -v 'batch\|extern'
: Fri 24 Jan 2025 12:10:04 PM EST
: JobID JobName State Elapsed
: ------------ ---------- ---------- ----------
: 3069231_1 ustacks23+ COMPLETED 19:14:14
: 3069231_2 ustacks23+ COMPLETED 17:51:58
: 3069231_3 ustacks23+ COMPLETED 13:20:56
: 3069231_4 ustacks23+ COMPLETED 23:43:24
: 3069231_5 ustacks23+ TIMEOUT 1-12:00:01
: 3069231_6 ustacks23+ COMPLETED 17:43:17
: 3069231_7 ustacks23+ COMPLETED 17:07:07
You can see that 36 hours wasn’t enough time for the fifth job. I can rerun just that job (after increasing the requested time) with:
sbatch --array=5 my_script.sh
Or if a group of jobs failed, I can combine them like this:
sbatch --array=5,7,20-23 my_script.sh
Much easier than repeating the analysis of all 96 samples just because one or two of them needed more time.
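When more than a handful of tasks time out, typing that list by hand gets tedious. Here’s a rough sketch of how you might let sacct build it for you; treat it as a starting point rather than a recipe, since sacct’s defaults vary between sites and, depending on how long ago the job ran, you may also need a --starttime window when filtering by state.
# Collect the array indices of TIMEOUT tasks from job 3069231 and join them
# with commas so they can be handed straight to --array.
failed=$(sacct --jobs=3069231 --state=TIMEOUT --noheader --parsable2 --format=jobid \
    | grep -v 'batch\|extern' \
    | cut -d_ -f2 \
    | paste -sd, -)
echo "Resubmitting tasks: ${failed}"
sbatch --array=${failed} my_script.sh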
Tips for Using Array Jobs
Automatically Making Lists to Use in Array Jobs
You can create the lists of samples or files you want to cycle over by hand. That’s probably the fastest way to do it, if you only need to do it once. But as soon as you start repeating an analysis on different files, you might benefit from some automation.
Bash scripting provides a huge variety of useful tools for creating lists of files or samples you want to cycle through with SLURM_ARRAY_TASK_ID. I won’t provide a full tutorial, but a few examples may be useful in showing you why it might be worth investing some time to learn more about this.
Filtering Filenames with Bash Pipes
I have a directory containing files like this:
B12A10.1.fq.gz BWA11.rem.2.fq.gz HHA19.rem.1.fq.gz RBB8.1.fq.gz
B12A10.2.fq.gz BWA1.2.fq.gz HHA19.rem.2.fq.gz RBB8.2.fq.gz
B12A10.rem.1.fq.gz BWA13.1.fq.gz HHA1.rem.1.fq.gz RBB8.rem.1.fq.gz
B12A10.rem.2.fq.gz BWA13.2.fq.gz HHA1.rem.2.fq.gz RBB8.rem.2.fq.gz
Each sample has a file of matched forward reads (.1.fq.gz) and reverse reads (.2.fq.gz), and files of forward and reverse unmatched reads (.rem.1.fq.gz and .rem.2.fq.gz). In my processing I’ll want to loop over each individual sample, not over all four files for each sample. I do this with the following line:
ls reads/*.1.fq.gz | grep -v rem | \
xargs -n 1 basename -s .1.fq.gz > 2023_list.txt
The first clause, ls reads/*.1.fq.gz, lists all the forward read files. This will include both B12A10.1.fq.gz and B12A10.rem.1.fq.gz.
Next, I use grep -v rem. That tells grep to filter the values we give it for the string rem. The -v argument tells grep to ‘invert’ the search; in other words, to remove all the values that include rem. At this point, we have a single file name per sample, e.g. B12A10.1.fq.gz and RBB8.1.fq.gz.
The next step is a combination. First, we pass the values from grep to the program xargs with the argument -n 1. That tells xargs to pass the values it receives one at a time to the following command. That command is basename, which will strip off the suffix we identify with the -s argument: .1.fq.gz.
Finally, the results are stored in a new file, 2023_list.txt, which will look like this:
B12A10
BWA11
HHA19
RBB8
...
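One nice side effect of keeping the list in a file is that the file can also tell you how many array tasks to request, rather than counting samples by hand. Something like:
# Request one array task per line of the sample list
N=$(wc -l < 2023_list.txt)
sbatch --array=1-${N} my_script.sh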
If you’d rather not have another file cluttering up your directory, you can include the entire pipeline in your Slurm script:
#!/bin/bash
#SBATCH --job-name=ustacks%a
#SBATCH --output=ustacks_%a.log
#SBATCH --time=6:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=8G
source ~/miniconda3/etc/profile.d/conda.sh
conda activate stacksM
M=4
reads_dir=./prort/concat/
out_dir=ustacks
threads=8
SAMPLE=$(ls reads/*.1.fq.gz | grep -v rem | \
    xargs -n 1 basename -s .1.fq.gz | \
    sed "${SLURM_ARRAY_TASK_ID}q;d")
echo "$SAMPLE"
ustacks -t gzfastq -f ${reads_dir}${SAMPLE}.1.fq.gz \
    -o ${out_dir} -m 3 \
    --name ${SAMPLE} -M $M -p ${threads}
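The trade-off of skipping the list file is that you need some other way to know how many tasks to request when you submit. One option (a sketch; it assumes the directory contents don’t change between submission and run time) is to compute the count on the command line:
sbatch --array=1-$(ls reads/*.1.fq.gz | grep -v rem | wc -l) my_script.sh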
Parameter Substitutions
Another useful set of tools is shell parameter substitutions. They can be a convenient way of trimming prefixes and suffixes off of names, particularly if you use (or receive) consistently formatted file names. For example, if my samples are coded as population_individual.suffix, I can isolate the various bits like this:
FILE=NS01_01.1.fq.gz
SAMPLE=${FILE%.1.fq.gz}
POP=${FILE%_*}
IND=${SAMPLE#${POP}_}
echo File: ${FILE}
echo Sample: ${SAMPLE}, Pop: ${POP}, Ind: ${IND}
: File: NS01_01.1.fq.gz
: Sample: NS01_01, Pop: NS01, Ind: 01
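One detail worth knowing before leaning on these heavily: a single # or % removes the shortest match, while ## and %% remove the longest. With one underscore and a fixed suffix it makes no difference, but it matters once a name contains several dots or underscores. For example, with a hypothetical .rem file for the NS01_01 sample:
FILE=NS01_01.rem.1.fq.gz
echo ${FILE%.*}      # shortest suffix removed: NS01_01.rem.1.fq
echo ${FILE%%.*}     # longest suffix removed:  NS01_01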
Tips for Emacs Users
I’ve written the above for use by anyone working on a cluster that uses Slurm for job submission. However, as I explained in my previous post, Emacs snippets provide an extra layer of convenience for managing Slurm scripts. Following the approach I outline there, I write my Slurm scripts as code blocks in Org mode (https://orgmode.org/). In this context, an array job will look something like this:
#+begin_src bash :results output
date
sbatch --array=1-96 <<SUBMITSCRIPT
#!/bin/bash
#SBATCH --job-name=my_job_%a
#SBATCH --output=my_job_%a.log
#SBATCH --time=1-00:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=96GB
source ~/miniconda3/etc/profile.d/conda.sh
conda activate stacksM
THREADS=8
FILE=\$(ls stacks23_out/*alleles.tsv.gz | grep -v catalog | sed -n "\${SLURM_ARRAY_TASK_ID}p")
DIR=\$(dirname \$FILE)
SAMPLE=\$(basename -s .alleles.tsv.gz \$FILE)
echo "\$DIR : \$SAMPLE"
cd \${DIR}
sstacks -c ../\${DIR} -s ../\${DIR}/\$SAMPLE -p \$THREADS
SUBMITSCRIPT
#+end_src
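A note on all the backslashes: the here-document delimiter above is unquoted, so the outer shell would expand $(...) and ${...} before the script ever reaches Slurm; escaping each $ defers that expansion until the job runs. If you never need to splice variables from the surrounding Org session into the script, you can instead quote the delimiter, which disables expansion entirely and lets you drop the backslashes:
sbatch --array=1-96 <<'SUBMITSCRIPT'
# ...same script as above, with the backslashes before each $ removed...
SUBMITSCRIPT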