Managing Absolute Paths in Reproducible Analyses

In a previous post on reproducible analysis, I explained why you should use relative paths in your scripts and keep your data in a single directory: doing so keeps the analysis portable. You want to be able to pack up your analysis in a zip file, or upload it as a single directory to GitHub or Dropbox, so that you can share it with colleagues or transfer it to a new computer.

This is best practice, but it is not always easy to achieve. One challenge is dealing with large datasets that you will use in multiple analyses. For example, we use WorldClim for a lot of our distribution work. Each copy of the global 30s dataset fills nearly 10 GB (compressed). That would quickly fill my laptop hard drive if I stored a separate copy for each project.

I have created a separate directory to store such datasets, so that I only need to maintain a single copy on each computer. All analyses that use the WorldClim data will look for it in ~/data/worldclim/ on my laptop. On my workstation, I store the same data in ~/data/enm/worldclim/. I could simplify this by using the same absolute path on both machines, but that wouldn’t help anyone else trying to use my script on their own machine.

The approach I’ve come up with for managing this requires a few lines of code to set the paths appropriately:

## Pick the data directory that matches this machine's hostname.
worldClimPath <- switch(system2("hostname", stdout = TRUE),
       LAPTOP.HOSTNAME = "~/data/worldclim/",           ## my laptop
       WORKSTATION.HOSTNAME = "~/data/enm/worldclim/",  ## my workstation
       "dl/worldclim/")                                 ## default

First, I look up the machine name with system2("hostname", stdout = TRUE). This calls the hostname command on the underlying operating system (which should work on Linux, Windows, and macOS). hostname returns the network hostname for your computer, which should be unique (at least within your organization). The switch function then compares this value to the names of my different machines, which I’ve already looked up. I can then use that information to set the correct path for my shared data.
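The same lookup can be wrapped in a small function, which makes it easier to reuse across projects. This is only a sketch: data_path_for_host, known_paths, and the machine names "my-laptop" and "my-workstation" are made up for illustration. It reads the name from Sys.info()[["nodename"]], a base R alternative that usually reports the same name as the hostname command without calling out to the operating system.

```r
## Hypothetical helper: look up a data directory by hostname.
## `known` is a named character vector (hostname -> path); `default` is the
## relative fallback used on any machine that is not listed.
data_path_for_host <- function(known, default,
                               host = Sys.info()[["nodename"]]) {
  if (host %in% names(known)) known[[host]] else default
}

## Example machine names (assumptions); replace them with whatever
## Sys.info()[["nodename"]] prints on your own machines.
known_paths <- c("my-laptop"      = "~/data/worldclim/",
                 "my-workstation" = "~/data/enm/worldclim/")

worldClimPath <- data_path_for_host(known_paths, default = "dl/worldclim/")
```

One advantage of a named vector is that hostnames containing hyphens work as ordinary quoted strings, whereas in switch they would need to be quoted or wrapped in backticks.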

In the case that I’m not on either of my machines, I set a default, relative path. That will allow other people to use my script without using my hard-coded paths.
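A relative default such as dl/worldclim/ is resolved against the current working directory, and on a collaborator's machine it probably won't exist yet. A minimal sketch of a guard for that case follows; ensure_dir is a hypothetical helper name, not part of any package.

```r
## Hypothetical helper: make sure a data directory exists before using it.
ensure_dir <- function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)  ## also creates missing parent directories
  }
  normalizePath(path)                   ## absolute form, handy for log messages
}
```

Calling ensure_dir("dl/worldclim/") from the project root would create the dl/worldclim directory inside the project and return its full path.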

In the case of WorldClim, the geodata package provides a convenient way to download the rasters:

library(geodata)
## res is given in minutes of a degree: 10, 5, 2.5, or 0.5 (= 30 seconds)
bio <- worldclim_global("bio", res = 10,
                        path = worldClimPath)

The function worldclim_global will check the path argument. If it finds the requested data there, it loads the local copy. If the data isn’t there, it downloads a new copy from the internet, and stores it there.
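For data that doesn't come with such a helper, the same "use the local copy, otherwise fetch it" behaviour can be sketched by hand. This is not how geodata implements it; cached_file and fake_fetch are hypothetical names, and the "download" here just writes a small text file so the example runs offline. In real use, fetch would be a wrapper around something like download.file().

```r
## Hypothetical helper illustrating "use the local copy, else fetch it".
## `fetch` is any function(dest) that writes the file to `dest`.
cached_file <- function(data_dir, filename, fetch) {
  dest <- file.path(data_dir, filename)
  if (!file.exists(dest)) {
    dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)
    fetch(dest)              ## only runs when no local copy exists
  }
  dest
}

## Offline demonstration: count how often the "download" actually happens.
demo_dir <- tempfile("cache-demo")
calls <- 0
fake_fetch <- function(dest) {
  calls <<- calls + 1
  writeLines("pretend raster data", dest)
}

cached_file(demo_dir, "bio.txt", fake_fetch)  ## fetches the file
cached_file(demo_dir, "bio.txt", fake_fetch)  ## reuses the local copy
calls                                         ## 1
```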

This makes for a convenient solution: running on my computers, all of my analyses will use the same shared data, and I won’t have to wait for downloads or exhaust my hard drives. But I can also share my code as-is with collaborators, and it will just work, without their having to change any paths.