Routine data maintenance
Processing large numbers of (large) files is a common task when working on a cluster. For operations that take only a few seconds, maybe a few minutes total for the full set of files, you can work right on the head node. If you need to, you can run a job on the head node for a few hours. But anything that takes more than 15 or 30 minutes is worth submitting as a job to the scheduler. Among the many advantages, this allows you to use multiple CPUs to speed up your work.
A simple Bash loop
Processing files often involves loops. A previous post explains how to use array jobs to submit each cycle through a loop as a separate job.
Another option, especially handy for Bash scripts, is to execute each command as a background process. For example, let's start with this loop:
for FILE in raw_data/*
do
    FILE=$(basename $FILE)
    echo ${FILE}
    processor raw_data/${FILE} > clean_data/${FILE}
done
This loops over all files in the raw_data directory, runs the program processor on each one, and writes the output to the clean_data directory. That works, but it only processes one file at a time.
Running commands in the background
If we run processor in the background, we can process multiple files at once:
for FILE in raw_data/*
do
    (
        FILE=$(basename $FILE)
        echo ${FILE}
        processor raw_data/${FILE} > clean_data/${FILE}
    ) &
done
The & tells the shell to run the preceding expression in the background and to move immediately on to the next command. In this case, that means the next trip through the loop.
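To see the effect in isolation, here is a toy example (not part of the processing script): launch a few tasks with &, then use a bare wait to block until all of them have finished.

```shell
#!/usr/bin/env bash
# Launch three tasks in the background; the loop does not pause between them.
for i in 1 2 3
do
    (
        sleep 0.1            # stand-in for real work
        echo "task $i done"
    ) &
done
wait                         # block until every background job has exited
echo "all tasks finished"
```

All three sleeps run concurrently, so the whole script takes roughly 0.1 seconds rather than 0.3.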
Controlling the number of background jobs
That’s an improvement, but if we have 1000 files to process, we probably don’t have enough CPUs to actually run them all at once. The job may fail, or the operating system may be left juggling the competing processes, which will be slow.
Better to request the number of CPUs we want, and then to limit our loop to never use more than that many at a time. We can do this with the following:
N=8
for FILE in raw_data/*
do
    (
        FILE=$(basename $FILE)
        echo ${FILE}
        processor raw_data/${FILE} > clean_data/${FILE}
    ) &
    if [[ $(jobs -r -p | wc -l) -ge $N ]]; then
        wait -n
    fi
done
The jobs command lists the process IDs (the -p flag) of the running jobs (the -r flag), one ID per line. Piping that to wc -l counts the lines, i.e., how many jobs are running. The -ge $N test checks whether that number is greater than or equal to the value of N, which we set to 8 on the first line of the script.
If that’s true (there are already 8 jobs running), the wait command pauses all further commands until any one of the jobs finishes (the -n flag, which requires Bash 4.3 or later).
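To watch the throttle work without a real processor program, here is a toy version of the loop (again, not the author's script) that runs six short tasks with at most two in flight. Note the extra bare wait after the loop, which lets the final tasks finish before the script exits:

```shell
#!/usr/bin/env bash
# Toy throttled loop: six tasks, at most N=2 running at a time.
# (wait -n requires Bash 4.3 or later.)
N=2
for i in 1 2 3 4 5 6
do
    (
        sleep 0.1                      # stand-in for real processing
        echo "task $i done"
    ) &
    if [[ $(jobs -r -p | wc -l) -ge $N ]]; then
        wait -n                        # pause until one background job exits
    fi
done
wait                                   # wait for the last tasks as well
```

With two tasks at a time, the six 0.1-second sleeps take about 0.3 seconds total instead of 0.6.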
That’s exactly what we needed. Now we can ask for the number of CPUs we want in our SLURM script:
#SBATCH --cpus-per-task=8
And make sure that number matches the value we set for N in our Bash loop. The higher the number, the more files will get processed at once, but it may also take longer for your job to get scheduled, depending on the size of your cluster and how busy it is.
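Putting the pieces together, a complete submission script might look like the sketch below. The job name and time limit are placeholders for your own setup, and processor stands in for whatever program you actually run. The final wait keeps the job alive until the last background processes have finished; otherwise SLURM would typically tear them down when the script exits.

```shell
#!/usr/bin/env bash
#SBATCH --job-name=clean-files      # placeholder name
#SBATCH --cpus-per-task=8
#SBATCH --time=02:00:00             # placeholder: adjust for your workload

N=8                                 # must match --cpus-per-task
mkdir -p clean_data

for FILE in raw_data/*
do
    (
        FILE=$(basename $FILE)
        echo ${FILE}
        processor raw_data/${FILE} > clean_data/${FILE}
    ) &
    if [[ $(jobs -r -p | wc -l) -ge $N ]]; then
        wait -n
    fi
done

wait    # let the final background processes finish before the job ends
```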
References
I found this approach on StackOverflow.