In a previous post on reproducible analysis, I explained the importance of using relative paths in your scripts and organizing your data in a single directory, in order to keep your analysis portable. You want to be able to pack up your analysis in a zip file, or upload it as a single directory to GitHub or Dropbox, so you can share it with colleagues or transfer it to a new computer.
This is best practice, but you may run into problems achieving it. One challenge is dealing with large datasets that you will use in multiple analyses. For example, we use WorldClim for a lot of our distribution work. Each copy of the global 30s dataset fills nearly 10 GB (compressed). That would quickly fill my laptop's hard drive if I stored a separate copy for each project.
I have created a separate directory to store such datasets, so that I only need to maintain a single copy on my computer. All analyses that use the WorldClim data will look for it in ~/data/worldclim/ on my laptop. On my workstation, I store the same data in ~/data/enm/worldclim/. I could simplify this by using the same absolute path on both machines, but that wouldn't help anyone else trying to use my script on their own machine.
The approach I’ve come up with for managing this requires a few lines of code to set the paths appropriately:
switch(system2("hostname", stdout = TRUE),
       LAPTOP.HOSTNAME = ## my laptop:
         {worldClimPath <- "~/data/worldclim/"},
       WORKSTATION.HOSTNAME = ## my workstation:
         {worldClimPath <- "~/data/enm/worldclim/"},
       {worldClimPath <- "dl/worldclim/"}) ## default
First, I check for the machine name with the function system2("hostname", stdout = TRUE). This calls the hostname command on the underlying operating system (which should work on Linux, Windows, and Mac). hostname returns the network hostname for your computer, which should be unique (at least within your organization). The switch function then compares this value against the names of my different machines, which I've already looked up, and I use that information to set the correct path for my shared data. If I'm not on either of my machines, switch falls through to the default: a relative path, which allows other people to use my script without having to change my hard-coded paths.
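As an aside, base R can report the same machine name without spawning an external process, via the "nodename" element of Sys.info(). Here is a minimal sketch of that alternative, with the same placeholder hostnames standing in for real machine names:

## Alternative sketch using base R instead of the hostname command.
## Sys.info()[["nodename"]] holds the machine's network name; the
## hostnames below are placeholders -- substitute your own.
host <- Sys.info()[["nodename"]]
worldClimPath <- switch(host,
                        LAPTOP.HOSTNAME = "~/data/worldclim/",
                        WORKSTATION.HOSTNAME = "~/data/enm/worldclim/",
                        "dl/worldclim/") ## default for everyone else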
In the case of WorldClim, the geodata package provides a convenient way to download the rasters:
library(geodata)
bio <- worldclim_global("bio", res = 10,
                        path = worldClimPath)
The function worldclim_global will check the path argument. If it finds the requested data there, it loads the local copy. If the data isn't there, it downloads a new copy from the internet and stores it there.
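Once the call returns, bio is a terra SpatRaster (geodata is built on terra) holding the 19 bioclim layers, so the usual raster tools apply. A quick sketch of inspecting the result, assuming the download or local load succeeded:

library(geodata)                 ## attaches terra as a dependency
worldClimPath <- "dl/worldclim/" ## or one of the machine-specific paths above
bio <- worldclim_global("bio", res = 10, path = worldClimPath)
nlyr(bio)      ## 19 bioclim variables
names(bio)     ## layer names such as "wc2.1_10m_bio_1"
plot(bio[[1]]) ## bio1: annual mean temperature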
This makes for a convenient solution: running on my computers, all of my analyses will use the same shared data, and I won’t have to wait for downloads or exhaust my hard drives. But I can also share my code as-is with collaborators, and it will just work, without their having to change any paths.