Loops in Bioinformatic Pipelines
A common requirement of many bioinformatics pipelines is completing the same analysis on a list of files. This is simple to do with a bit of Bash scripting:
#!/bin/bash
#SBATCH --job-name=ustacks
#SBATCH --output=ustacks.log
#SBATCH --time=360:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=8G
source ~/miniconda3/etc/profile.d/conda.sh
conda activate stacksM
M=4
reads_dir=./prort/concat/
out_dir=ustacks
threads=8
while read SAMPLE; do
    echo "$SAMPLE"
    ustacks -t gzfastq -f ${reads_dir}${SAMPLE}.1.fq.gz \
        -o ${out_dir} -m 3 \
        --name ${SAMPLE} -M $M -p ${threads}
done < SAMPLES.txt
This script reads each line of the file SAMPLES.txt and passes that value to the program ustacks.
That’s fine, but it can require a lot of cluster time. In this case, I have 96 files listed in SAMPLES.txt, and each of them will take three or four hours to process. That’s 15 days. In reality, it may take much longer, as the cluster may not be able to schedule my job immediately given its length (at least on the cluster I use!).
Replacing Loops with Array Jobs
However, each of these 96 jobs can be processed at the same time. There’s no interaction between them. If, instead of asking for a single 15-day job, I ask for 96 six-hour jobs, it’s easier for the cluster to schedule them, and my total run time drops to something closer to the length of a single job (maybe 6-12 hours, depending on how quickly they get scheduled).
This is what array jobs do. I can replace the script above with this:
#!/bin/bash
#SBATCH --job-name=ustacks%a
#SBATCH --output=ustacks_%a.log
#SBATCH --time=6:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=8G
source ~/miniconda3/etc/profile.d/conda.sh
conda activate stacksM
M=4
reads_dir=./prort/concat/
out_dir=ustacks
threads=8
SAMPLE=$(sed "${SLURM_ARRAY_TASK_ID}q;d" SAMPLES.txt)
echo "$SAMPLE"
ustacks -t gzfastq -f ${reads_dir}${SAMPLE}.1.fq.gz \
    -o ${out_dir} -m 3 \
    --name ${SAMPLE} -M $M -p ${threads}
It’s nearly the same as the original. The key difference is that I’ve replaced the while loop with a single variable, set using a line of sed:
SAMPLE=$(sed "${SLURM_ARRAY_TASK_ID}q;d" SAMPLES.txt)
Sed is a very powerful tool for processing text. In this case, I’ve asked it to search through the file SAMPLES.txt, skip forward to line ${SLURM_ARRAY_TASK_ID}, print that line, and then quit.
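If the q;d incantation looks cryptic, sed -n with the p command does the same job (I use that form later in this post); the q;d version just stops reading the file as soon as it has printed the line it wants. With a hard-coded line number, the two look like this:
sed "5q;d" SAMPLES.txt     # print line 5, then quit
sed -n "5p" SAMPLES.txt    # suppress auto-printing (-n) and print only line 5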
Where does SLURM_ARRAY_TASK_ID come from? That variable is set for us automatically when we submit our script as an array job. Instead of the usual invocation of Slurm:
sbatch my_script.sh
We’ll do this:
sbatch --array=1-96 my_script.sh
When we do this, it tells Slurm that we want to submit 96 jobs. Each job will run identical code, except for one thing: Slurm will set the variable SLURM_ARRAY_TASK_ID to a different value in each one.
That value is also inserted into my job name and log file, using the %a placeholder:
#SBATCH --job-name=ustacks%a
#SBATCH --output=ustacks_%a.log
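Slurm’s output file names can also include %A, the parent array job’s ID. Combining the two is a small, optional refinement that keeps logs from different submissions of the same script separate:
#SBATCH --output=ustacks_%A_%a.log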
With that little change, instead of waiting a month for my 15-day job to run, I wait a day for 96 six-hour jobs to run.
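Another optional refinement: if launching all 96 tasks at once is more than your cluster (or its administrators) will appreciate, the --array syntax also accepts a throttle after a % sign that limits how many tasks run simultaneously:
# Same 96 tasks, but never more than 10 running at a time
sbatch --array=1-96%10 my_script.sh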
Selectively Repeating Failed Jobs
Another benefit of this approach is that it allows you to easily resubmit only a subset of your jobs. In the example above I’ve asked for six hours for each file. If some of those files happen to require more time than that, they will fail with a timeout. You’ll see this in your job logs. Here’s an example from an array where I requested 36 hours per job:
date
sacct --jobs=3069231 --format=jobid,jobname,state,elapsed | grep -v 'batch\|extern'
: Fri 24 Jan 2025 12:10:04 PM EST
: JobID JobName State Elapsed
: ------------ ---------- ---------- ----------
: 3069231_1 ustacks23+ COMPLETED 19:14:14
: 3069231_2 ustacks23+ COMPLETED 17:51:58
: 3069231_3 ustacks23+ COMPLETED 13:20:56
: 3069231_4 ustacks23+ COMPLETED 23:43:24
: 3069231_5 ustacks23+ TIMEOUT 1-12:00:01
: 3069231_6 ustacks23+ COMPLETED 17:43:17
: 3069231_7 ustacks23+ COMPLETED 17:07:07
You can see that 36 hours wasn’t enough time for the fifth job. I can rerun just that job (after increasing the requested time) with:
sbatch --array=5 my_script.sh
Or if a group of jobs failed, I can combine them like this:
sbatch --array=5,7,20-23 my_script.sh
Much easier than repeating the analysis of all 96 samples just because one or two of them needed more time.
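When more than a handful of tasks time out, typing that list by hand gets tedious. Here’s a rough sketch of how you might let sacct build it for you; treat it as a starting point rather than a recipe, since sacct’s defaults vary between sites and, depending on how long ago the job ran, you may also need a --starttime window when filtering by state.
# Collect the array indices of TIMEOUT tasks from job 3069231 and join them
# with commas so they can be handed straight to --array.
failed=$(sacct --jobs=3069231 --state=TIMEOUT --noheader --parsable2 --format=jobid \
    | grep -v 'batch\|extern' \
    | cut -d_ -f2 \
    | paste -sd, -)
echo "Resubmitting tasks: ${failed}"
sbatch --array=${failed} my_script.sh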
Tips for Using Array Jobs
Automatically Making Lists to Use in Array Jobs
You can create the lists of samples or files you want to cycle over by hand. That’s probably the fastest way to do it, if you only need to do it once. But as soon as you start repeating an analysis on different files, you might benefit from some automation.
Bash scripting provides a huge variety of useful tools for creating lists of files or samples you want to cycle through with SLURM_ARRAY_TASK_ID. I won’t provide a full tutorial, but a few examples may be useful in showing you why it might be worth investing some time to learn more about this.
Filtering Filenames with Bash Pipes
I have a directory containing files like this:
B12A10.1.fq.gz BWA11.rem.2.fq.gz HHA19.rem.1.fq.gz RBB8.1.fq.gz
B12A10.2.fq.gz BWA1.2.fq.gz HHA19.rem.2.fq.gz RBB8.2.fq.gz
B12A10.rem.1.fq.gz BWA13.1.fq.gz HHA1.rem.1.fq.gz RBB8.rem.1.fq.gz
B12A10.rem.2.fq.gz BWA13.2.fq.gz HHA1.rem.2.fq.gz RBB8.rem.2.fq.gz
Each sample has a file of matched forward reads (.1.fq.gz) and reverse reads (.2.fq.gz), and files of forward and reverse unmatched reads (.rem.1.fq.gz and .rem.2.fq.gz). In my processing I’ll want to loop over each individual sample, not over all four files for each sample. I do this with the following line:
ls reads/*.1.fq.gz | grep -v rem | \
xargs -n 1 basename -s .1.fq.gz > 2023_list.txt
The first clause, ls reads/*.1.fq.gz, lists all the forward read files. This will include both B12A10.1.fq.gz and B12A10.rem.1.fq.gz.
Next, I use grep -v rem. That tells grep to filter the values we give it for the string rem. The -v argument tells grep to ‘invert’ the search; in other words, to remove all the values that include rem. At this point, we have a single file name per sample, e.g. B12A10.1.fq.gz and RBB8.1.fq.gz.
The next step is a combination. First, we pass the values from grep to the program xargs with the argument -n 1. That tells xargs to pass the values it receives one at a time to the following command. That command is basename, which will strip off the suffix we identify with the -s argument: .1.fq.gz.
Finally, the results are stored in a new file, 2023_list.txt, which will look like this:
B12A10
BWA11
HHA19
RBB8
...
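One nice side effect of keeping the list in a file is that the file can also tell you how many array tasks to request, rather than counting samples by hand. Something like:
# Request one array task per line of the sample list
N=$(wc -l < 2023_list.txt)
sbatch --array=1-${N} my_script.sh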
If you’d rather not have another file cluttering up your directory, you can include the entire pipeline in your Slurm script:
#!/bin/bash
#SBATCH --job-name=ustacks%a
#SBATCH --output=ustacks_%a.log
#SBATCH --time=6:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=8G
source ~/miniconda3/etc/profile.d/conda.sh
conda activate stacksM
M=4
reads_dir=./prort/concat/
out_dir=ustacks
threads=8
SAMPLE=$(ls reads/*.1.fq.gz | grep -v rem | \
    xargs -n 1 basename -s .1.fq.gz | \
    sed "${SLURM_ARRAY_TASK_ID}q;d")
echo "$SAMPLE"
ustacks -t gzfastq -f ${reads_dir}${SAMPLE}.1.fq.gz \
    -o ${out_dir} -m 3 \
    --name ${SAMPLE} -M $M -p ${threads}
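The trade-off of skipping the list file is that you need some other way to know how many tasks to request when you submit. One option (a sketch; it assumes the directory contents don’t change between submission and run time) is to compute the count on the command line:
sbatch --array=1-$(ls reads/*.1.fq.gz | grep -v rem | wc -l) my_script.sh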
Parameter Substitutions
Another useful set of tools is shell parameter substitutions. They can be a convenient way of trimming prefixes and suffixes off of names, particularly if you use (or receive) consistently formatted file names. For example, if my samples are coded as population_individual.suffix, I can isolate the various bits like this:
FILE=NS01_01.1.fq.gz
SAMPLE=${FILE%.1.fq.gz}
POP=${FILE%_*}
IND=${SAMPLE#${POP}_}
echo File: ${FILE}
echo Sample: ${SAMPLE}, Pop: ${POP}, Ind: ${IND}
: File: NS01_01.1.fq.gz
: Sample: NS01_01, Pop: NS01, Ind: 01
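One detail worth knowing before leaning on these heavily: a single # or % removes the shortest match, while ## and %% remove the longest. With one underscore and a fixed suffix it makes no difference, but it matters once a name contains several dots or underscores. For example, with a hypothetical .rem file for the NS01_01 sample:
FILE=NS01_01.rem.1.fq.gz
echo ${FILE%.*}      # shortest suffix removed: NS01_01.rem.1.fq
echo ${FILE%%.*}     # longest suffix removed:  NS01_01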
Tips for Emacs Users
I’ve written the above for use by anyone working on a cluster that uses Slurm for job submission. However, as I explained in my previous post, Emacs snippets provide an extra layer of convenience for managing Slurm scripts. Following the approach I outline there, I write my Slurm scripts as code blocks in Org mode (https://orgmode.org/). In this context, an array job will look something like this:
#+begin_src bash :results output
date
sbatch --array=1-96 <<SUBMITSCRIPT
#!/bin/bash
#SBATCH --job-name=my_job_%a
#SBATCH --output=my_job_%a.log
#SBATCH --time=1-00:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=96GB
source ~/miniconda3/etc/profile.d/conda.sh
conda activate stacksM
THREADS=8
FILE=\$(ls stacks23_out/*alleles.tsv.gz | grep -v catalog | sed -n "\${SLURM_ARRAY_TASK_ID}p")
DIR=\$(dirname \$FILE)
SAMPLE=\$(basename -s .alleles.tsv.gz \$FILE)
echo "\$DIR : \$SAMPLE"
cd \${DIR}
sstacks -c ../\${DIR} -s ../\${DIR}/\$SAMPLE -p \$THREADS
SUBMITSCRIPT
#+end_src
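A note on all the backslashes: the here-document delimiter above is unquoted, so the outer shell would expand $(...) and ${...} before the script ever reaches Slurm; escaping each $ defers that expansion until the job runs. If you never need to splice variables from the surrounding Org session into the script, you can instead quote the delimiter, which disables expansion entirely and lets you drop the backslashes:
sbatch --array=1-96 <<'SUBMITSCRIPT'
# ...same script as above, with the backslashes before each $ removed...
SUBMITSCRIPT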