
Introduction to Org mode for cluster computing

§emacs §bioinformatics

Getting Started

This tutorial builds on my previous Emacs posts (i.e., the introduction, Orgmode, and R and ESS). If you’re not familiar with Emacs, you might want to look through those first, particularly Orgmode.

In this post, I will show you how to set up an Org mode (or Orgmode, org-mode) file on your local machine (laptop or office desktop) to manage and run a cluster computing project.

While not absolutely necessary, working this way is much easier if you’ve configured your machine to provide passwordless access to the server. I’ve previously talked about setting up your .ssh/config file to manage usernames and hostnames/addresses, as well as setting up RSA keys so you can securely access remote servers without a password.
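For reference, a minimal ~/.ssh/config entry along those lines might look like the following (gpsc is the alias I’ll use in the examples below; the hostname and username here are placeholders for your own details):

Host gpsc
    HostName hostname.tld.ca
    User myusername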

In addition, within Emacs you’ll likely want to set the variable org-confirm-babel-evaluate to nil, so that you don’t get asked for confirmation each time you try to evaluate a code block, and also to configure which languages you would like to use in code blocks, via the Org Babel Load Languages setting. I step through this in my previous post.
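If you’d rather set these in your init file than through the customize interface, a minimal sketch looks like this (I’m assuming you want shell and R blocks; adjust the language list to match what you use):

(setq org-confirm-babel-evaluate nil)
(org-babel-do-load-languages
 'org-babel-load-languages
 '((shell . t)
   (R . t)))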

Org file headers

First off, we want to tell Org mode where to run our code. Putting this information in the file header ensures it applies to all the code blocks in the file (unless we explicitly change it for a particular block, as we’ll see in a moment).

My file template includes the following two lines as the first two lines of my .org file:

# -*- org-export-babel-evaluate: nil -*-
#+PROPERTY: header-args:bash :results output :dir /ssh:gpsc:./path/to/my/project

The first line tells Orgmode not to evaluate code blocks if/when we export our source file (i.e., if we want to create a pdf or html report). If we don’t do this, then every time we export the document it will resubmit all of our cluster jobs, which is almost certainly not what we want.

The second line sets the default header arguments for bash code blocks. :results output means we want the text printed by the code to be inserted into the file as results. :dir /ssh:gpsc:./path/to/my/project means we want these code blocks to run on the remote host gpsc, which we access through ssh, in the directory ./path/to/my/project (relative to my home directory).

When this is properly configured, we can edit the org file on our local machine, and send all the code to the server to run!
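These defaults can also be overridden for an individual block by giving it its own header arguments. For example, this hypothetical block would run in a different directory on the same host:

#+begin_src bash :dir /ssh:gpsc:./path/to/another/project
  pwd
#+end_src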

If you make a mistake in the header lines, or just want to change the path or other details, be sure to update the settings by pressing C-c C-c with the cursor on the header line, or, in the menus, select ‘Org’ -> ‘Refresh/Reload’ -> ‘Refresh setup current buffer’.

Executing code on the server

Now we’re ready to run some code. To create a block, enter the following text in the file:

#+begin_src bash
  date
  hostname
  ls
#+end_src

The lines #+begin_src ... and #+end_src delimit the code block, with the argument bash specifying the language of the block.

With the cursor (point) anywhere in the block, hit C-c C-c to evaluate the code. The results will show up in a few moments:

#+RESULTS:
: Thu 28 Mar 2024 09:31:21 PM UTC
: hostname.tld.ca
: bin  data  man	miniconda3  ovbin  rubus  slurm_sample.sh  srcs  wp2

The last line lists the contents of my home directory on hostname.tld.ca - success!

Write & submit job requests (with slurm)

Now we can run bash scripts on the head node. That’s fine for installing packages with conda/mamba (or whatever system you’re using), but for real work we’ll need to submit a job request. On my cluster, that means the slurm scheduler. I use two different methods to submit these scripts.

Heredocs

In most cases, I include the entire script as a heredoc:

#+begin_src bash :results output
  sbatch <<SUBMITSCRIPT
  #!/bin/bash
  #SBATCH --job-name=myjob
  #SBATCH --output=myjob.log
  #SBATCH --time=24:00:00
  #SBATCH --cpus-per-task=1
  #SBATCH --mem-per-cpu=1G

  VARIABLE="My variable"

  source ~/miniconda3/etc/profile.d/conda.sh
  conda activate bowtie2

  echo \$VARIABLE

  bowtie2 --help

  SUBMITSCRIPT
#+end_src

(Of course, the actual details will depend on how your server is configured, whether it uses slurm, qsub, or something else.)

Note that when you include bash variables in scripts submitted as heredocs, you need to escape the $ when you reference them (e.g., \$VARIABLE); otherwise, the shell on the head node expands them before the script text ever reaches sbatch.
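If your script doesn’t need to reference any local variables, you can sidestep the escaping entirely by quoting the heredoc delimiter, which disables expansion altogether. A minimal sketch:

#+begin_src bash
  # With a quoted delimiter, $VARIABLE passes through to slurm unexpanded,
  # so no backslash is needed.
  sbatch <<'SUBMITSCRIPT'
  #!/bin/bash
  #SBATCH --job-name=myjob

  VARIABLE="My variable"
  echo $VARIABLE

  SUBMITSCRIPT
#+end_src

(I’ll stick with the escaped form for the rest of this post.)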

Again, we can submit this script by pressing C-c C-c when the cursor is anywhere in the code block. In this case, the output we get is from sbatch:

#+RESULTS:
: Submitted batch job 2075188

Having a record of the job number allows us to track its progress with another code block:

#+begin_src bash
  sacct --jobs=2075188 --format=jobid,jobname,state,elapsed,ReqMem,MaxRSS
#+end_src

Running this (C-c C-c again) returns the status of my job:

#+RESULTS:
: JobID           JobName      State    Elapsed     ReqMem     MaxRSS 
: ------------ ---------- ---------- ---------- ---------- ---------- 
: 2075188         ustacks    RUNNING   00:00:07        64G            
: 2075188.bat+      batch    RUNNING   00:00:07                       
: 2075188.ext+     extern    RUNNING   00:00:07                       

Now we’re getting somewhere! In this one file we can keep our freehand notes and metadata, alongside the scripts we used in the analysis. One file per project means we don’t need to keep track of different script versions for each step of a long workflow.

Standalone scripts

Most job scripts are pretty simple, so the heredoc approach is fine, and keeps all the details together in the org file. However, if you have a more involved script, you may want to keep it in a file of its own. That will give you access to the language-specific editing features of Emacs, and may help keep your main org file to a manageable length.

If you choose this approach, you can include a link to the external script with the format:

[[file:/ssh:servername:path/to/script.sh][script.sh]]

You’ll want this file to be in the same directory you specified in the org file header.

Using this link format, pressing C-c C-o (or Enter, if you have org-return-follows-link enabled) with the cursor on the link text will open that file for you, allowing you to read and edit it with ease.

To submit the script, we can use a much smaller code block:

#+begin_src bash
  sbatch script.sh
#+end_src

This is also how I would submit jobs written in other languages, whether you use Perl, Python, or R in your analyses.
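For example, a hypothetical standalone R job script (the script, environment, and file names here are made up) could put the slurm directives in comments at the top and invoke the interpreter at the end:

#!/bin/bash
#SBATCH --job-name=r-analysis
#SBATCH --output=r-analysis.log
#SBATCH --time=04:00:00
#SBATCH --mem-per-cpu=4G

source ~/miniconda3/etc/profile.d/conda.sh
conda activate r-env

Rscript analysis.R

You’d then submit it with sbatch, exactly as above.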

Streamlining all the text entry

This system works very well, and being based on plain text files it’s robust and portable. However, there’s a lot of boilerplate to type, and remembering all the syntax can be tedious. Orgmode provides some conveniences to simplify this.

To enter a new code block, you can use the shortcut C-c C-, to open the block menu, then select source code with s. Add the language bash to the header and you’re ready to go. You can do the same thing via the menus with ‘Org’ -> ‘Editing’ -> ‘Add block structure’.

The headers for slurm scripts can be quite involved as well. I use the yasnippet extension to insert templates for these. I’ll leave the details for another post.

What about R?

You can run R scripts in batch mode on a server just like any other script. But you might want to run an interactive process for exploring the results of your scripts. I do this by passing the output from code blocks run on the server to a local instance of R. I’ll explain that in a future post as well.