Routine data maintenance
Processing large numbers of (large) files is a common task when working on a cluster. For operations that take only a few seconds, maybe a few minutes total for the full set of files, you can work right on the head node. If you need to, you can run a job on the head node for a few hours. But anything that takes more than 15 or 30 minutes is worth submitting as a job to the scheduler. Among the many advantages, this allows you to use multiple CPUs to speed up your work.
A simple Bash loop
Processing files often involves loops. A previous post explains how to use array jobs to submit each cycle through a loop as a separate job.
Another option, especially handy for Bash scripts, is to execute each command as a background process. For example, let's start with this loop:
for FILE in raw_data/*
do
    FILE=$(basename $FILE)
    echo ${FILE}
    processor raw_data/${FILE} > clean_data/${FILE}
done
This loops over all files in the raw_data directory, runs the program processor on each one, and writes the output to the clean_data directory. That works, but it only processes one file at a time.
Running commands in the background
If we run processor in the background, we can process multiple files at once:
for FILE in raw_data/*
do
    (
        FILE=$(basename $FILE)
        echo ${FILE}
        processor raw_data/${FILE} > clean_data/${FILE}
    ) &
done
The & tells the shell to run the preceding expression in the background and to move immediately on to the next command. In this case, that means the next trip through the loop.
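To see the effect in isolation, here is a toy example (not part of the processing script): launch a few tasks with &, then use a bare wait to block until all of them have finished.

```shell
#!/usr/bin/env bash
# Launch three tasks in the background; the loop does not pause between them.
for i in 1 2 3
do
    (
        sleep 0.1            # stand-in for real work
        echo "task $i done"
    ) &
done
wait                         # block until every background job has exited
echo "all tasks finished"
```

All three sleeps run concurrently, so the whole script takes roughly 0.1 seconds rather than 0.3.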
Controlling the number of background jobs
That’s an improvement, but if we have 1000 files to process, we probably don’t have enough CPUs to actually run them all at once. The job may fail, or the operating system may be left juggling the competing processes, which will be slow.
Better to request the number of CPUs we want, and then to limit our loop to never use more than that many at a time. We can do this with the following:
N=8
for FILE in raw_data/*
do
    (
        FILE=$(basename $FILE)
        echo ${FILE}
        processor raw_data/${FILE} > clean_data/${FILE}
    ) &
    if [[ $(jobs -r -p | wc -l) -ge $N ]]; then
        wait -n
    fi
done
The jobs command lists the process IDs (the -p flag) of the running jobs (the -r flag), one ID per line. Piping that to wc -l counts the lines, i.e., how many jobs are running. The -ge $N test checks whether that number is greater than or equal to the value of N, which we set to 8 on the first line of the script.
If that’s true (there are already 8 jobs running), the wait command pauses all further commands until any one of the jobs finishes (the -n flag, which requires Bash 4.3 or later).
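To watch the throttle work without a real processor program, here is a toy version of the loop (again, not the author's script) that runs six short tasks with at most two in flight. Note the extra bare wait after the loop, which lets the final tasks finish before the script exits:

```shell
#!/usr/bin/env bash
# Toy throttled loop: six tasks, at most N=2 running at a time.
# (wait -n requires Bash 4.3 or later.)
N=2
for i in 1 2 3 4 5 6
do
    (
        sleep 0.1                      # stand-in for real processing
        echo "task $i done"
    ) &
    if [[ $(jobs -r -p | wc -l) -ge $N ]]; then
        wait -n                        # pause until one background job exits
    fi
done
wait                                   # wait for the last tasks as well
```

With two tasks at a time, the six 0.1-second sleeps take about 0.3 seconds total instead of 0.6.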
That’s exactly what we needed. Now we can ask for the number of CPUs we want in our SLURM script:
#SBATCH --cpus-per-task=8
And make sure that number matches the value we set for N in our Bash loop. The higher the number, the more files will get processed at once, but it may also take longer for your job to get scheduled, depending on the size of your cluster and how busy it is.
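Putting the pieces together, a complete submission script might look like the sketch below. The job name and time limit are placeholders for your own setup, and processor stands in for whatever program you actually run. The final wait keeps the job alive until the last background processes have finished; otherwise SLURM would typically tear them down when the script exits.

```shell
#!/usr/bin/env bash
#SBATCH --job-name=clean-files      # placeholder name
#SBATCH --cpus-per-task=8
#SBATCH --time=02:00:00             # placeholder: adjust for your workload

N=8                                 # must match --cpus-per-task
mkdir -p clean_data

for FILE in raw_data/*
do
    (
        FILE=$(basename $FILE)
        echo ${FILE}
        processor raw_data/${FILE} > clean_data/${FILE}
    ) &
    if [[ $(jobs -r -p | wc -l) -ge $N ]]; then
        wait -n
    fi
done

wait    # let the final background processes finish before the job ends
```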
References
I found this approach on StackOverflow.