
Data Management for Reproducible Science


Introduction

Research is reproducible when others can reproduce the results of a scientific study given only the original data, code, and documentation (Alston and Rick, 2021).

Benefits to the Author:

  1. Clear and complete documentation of your work makes it easier to share, write up, and extend in future work, including responding to reviewers and developing new projects.
  2. Conscientious documentation involves a great deal of error-checking, which reassures you that you haven’t missed anything or misremembered what you did, and reassures your readers that you have conducted your work in a rigorous manner.
  3. Reproducible work gets cited more, and developing a data archive creates a new citable product from your research.

Benefits to the Community:

  1. Increases the speed and fidelity with which we can learn and apply new approaches.
  2. Makes it easier to avoid mistakes (through the care and attention required to create the archive), and to detect and correct them if they do happen (by allowing others to critically review your work).

How to make your work reproducible

There are two related goals in producing a reproducible analysis: portability and reproducibility. Reproducibility is perhaps the obvious one, but it’s not enough that you can reproduce your analysis on your own computer; you are the only person with access to it. Your work should be reproducible on anyone’s computer.

Portability

To achieve portability, we need to know which files are needed for an analysis, and to organize them in a way that they can be readily moved from one computer to another. In practice, this means they’ll all be in a single directory, and that directory will only contain files for that particular analysis. You may have several related analyses that share a directory. That’s ok, but take time to organize them in a sensible way.

While you may have related analyses together in a directory, you don’t want to mix unrelated files and data in this directory. That will make it harder to keep track of what is needed and what isn’t, and will waste space on the computers where this work is ultimately archived.

README.txt

There are many ways you can organize your work within this directory. One absolute requirement is that there be a clear guide to your files, a “Table of Contents”. This should be a ‘plain text’ file, something that can be opened by any text editor. Your archive might be around for decades, and you don’t know whether your readers will be able to find a copy of MS Word 97 when they need to read it. We can be reasonably confident that plain text files will be accessible for a long time to come.

By convention, this file is called README.txt, and some data archiving services (DataDryad) require that you include a file with this name. You should probably use this name, unless you have a good reason not to.

One minor exception: if you use markdown, or RMarkdown, you can use these formats for your README. They are both plain text, and even if your audience doesn’t use them, they can open a README.md or README.Rmd file in any text editor. There are other, similar simple markup formats used in different coding communities. As long as they are saved as plain text files they will meet our requirements.

The contents of your README should describe as clearly as possible the contents of your archive. Here is an excerpt from one of mine:

Start with trich.Rmd to read our draft manuscript. See trich-prep.Rmd for the bulk of the code used in generating this manuscript.

File List:

  • trich.Rmd : main manuscript file

  • plantarum.json : bibliography, exported from Zotero

  • trich-prep.Rmd : The bulk of the code used in the analysis. Loaded from trich.Rmd to regenerate the figures and tables there.

  • data/
    • data/ssr_raw.csv : raw microsatellite data. See trich-prep.Rmd (Loading Data) for code to load and translate this to a genind object

    • data/survey-pops.csv : coordinates of sampled populations

    • data/eval.opt.2020-06-24.Rda : the output of the Maxent modeling, saved as a binary R Data object. Load into R with the load() function, so you don’t need to repeat the lengthy Maxent analysis.

    • data/trich-gbif.csv : GBIF records used in the Maxent analysis

    • data/trich_soil.csv : Soil analysis for each sampled population

    • data/maps : Maps (shapefiles and rasters) used in the Maxent analysis, and for some of the manuscript plots

I prefer to use RMarkdown to develop my manuscripts, as this allows me to keep the code for figures, and the resulting images, together with the text that describes the methods and interprets the results. If you prefer to manage your code separately from your writing that’s fine too, but you may end up structuring your archive a little differently.

Directory Organization

If you only have a few files, you may not need to do any further organization of your archive. I find it helpful to use subfolders to keep things organized. Depending on the needs of a project, I use:

  • data/ : unprocessed data used in the analysis
  • processed/ : intermediate data files, generated from data/ and not stored permanently
  • downloads/ : storage for large external datasets used in the analysis, but not stored permanently in the archive (e.g., WorldClim climate data); be sure to include links to the source of any external data you are not archiving yourself!
  • plots/ : images generated by the analysis
  • code/ : code used in the analysis, if it’s not in the top-level directory

These are not hard rules. You can use whatever structure suits your project. Just be sure to explain it in your README.
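
If you like to script even this setup step, here is a minimal sketch of creating that skeleton from R. The folder names are just the suggestions above; use whichever your project actually needs:

## Create the archive skeleton from the top-level project directory.
## Adjust the folder names to suit your project.
for (dir in c("data", "processed", "downloads", "plots", "code")) {
  dir.create(dir, showWarnings = FALSE)
}

## Start the table of contents right away:
file.create("README.txt")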

Reproducibility

File organization gets us most of the way to portability. There are a few things we need to do in our coding to complete this arrangement, and of course to ensure we can reproduce the analysis once we move it to a new computer.

Use Relative Paths

An absolute path is the location of a file on a particular computer. The absolute path to the file I’m editing right now is /home/smithty/blogdown/content/tutorials/2022-10-17-data-management/index.md. That location will only ever exist on a Linux computer with a user named smithty. If I moved it to a Windows computer, it might be located at C:\Users\smithty\Documents\blogdown\content\tutorials\2022-10-17-data-management\index.md. If I try to refer to this file on the Windows machine using its Linux absolute path, I won’t find it.

On the other hand, from the top directory of my blog, blogdown/, this file will have the same relative path on both machines: content/tutorials/2022-10-17-data-management/index.md. That means links to this file using the relative path will work just fine on both machines.

We should use relative paths in our analyses too.

For example, I could use an absolute path to load my data:

samples <-
  read.csv("/home/smithty/nextcloud/trich/2020-06-25/data/survey-pops.csv") 

But this won’t load on anyone else’s computer. If we do this instead:

samples <-
  read.csv("data/survey-pops.csv") 

Anyone with our archive can run the code as it is, as long as they have the working directory set properly. For my projects, I do this by running my code from the top directory of the archive. In this case, I have the directory structure:

├── data
│   └── survey-pops.csv
├── README.md
├── trich-prep.Rmd
└── trich.Rmd

When I run code from trich.Rmd, I set the working directory to the location of that file. RStudio manages this for you with its project support. If you don’t use that feature, you can tell R to use that location when it starts. After that, stick to relative file paths, and you don’t need to worry about your code breaking when you move to a different computer.
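
If you don’t use RStudio projects, a minimal sketch of the same idea is to set the working directory once, at the start of an interactive session, and keep that one machine-specific line out of the scripts you archive (the path below is from my machine; yours will differ):

## Run once per session, in the console, not in the archived script:
setwd("/home/smithty/nextcloud/trich/2020-06-25")

## Everything in the scripts themselves uses relative paths:
samples <- read.csv("data/survey-pops.csv")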

Structure Your Code to Run On Its Own

Two common problems that make it hard to run your code are mixing your ‘good’ code with non-working code, and writing code that requires you to update it by hand to finish your analysis.

Mixing good and bad code might look like this:

## Load in our data:
myData <- read.csv("data/myfile.csv")
myDataScaled <- scale(myData)
myDataScaled <- scale(myData, center = FALSE)

myData <- myData[, -1]
myDataScaled <- scale(myData, scale = FALSE)

It’s not unusual to accumulate various versions of code (should I scale this, center it, both? Do I need the first column?). While you’re actively working on your code, you may find you have multiple versions in the same file. They need to be clearly commented, for your own benefit! But more importantly, when you decide which version you’re going to use, remove the rest. Don’t expect yourself (and certainly don’t expect anyone else) to figure out which lines they should run, and which they should skip.

## Load in our data:
myData <- read.csv("data/myfile.csv")

## drop the name column
myData <- myData[, colnames(myData) != "name"]

## don't scale, use raw data
## myDataScaled <- scale(myData)

If you have a few lines of code that you might want to revisit, comment them out and leave yourself a note in a comment. If you have large blocks of code you aren’t using, but want to keep a record of, put them in a separate file. The end product should be a file that you can run from start to finish, without deciding which lines to skip and which to run.

A similar problem occurs when you have code that works, but requires you to manually update it in order to complete your analysis. e.g.,

myData <- read.csv("data/experiments.csv")

myExperiment <- subset(myData, exp == 1)

myExperimentResult <- processingCode(myExperiment)

## repeat for exp 1-20

Code structured like this requires you to edit and re-edit the code many times to complete your work. That is tedious, and it’s easy to make a mistake. And when you update processingCode, you need to rerun everything by hand. You can avoid this with loops and lists.
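
For example, here is a sketch of the same analysis restructured as a loop; processingCode() stands in for whatever function you apply to each experiment, as in the example above:

myData <- read.csv("data/experiments.csv")

## Process every experiment in one pass, collecting the results in a list:
experiments <- unique(myData$exp)
results <- lapply(experiments, function(i) {
  processingCode(subset(myData, exp == i))
})
names(results) <- paste0("exp", experiments)

## When processingCode() changes, rerun this block and every result is updated.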

R Programming: Do Not Save Your Workspace!

R offers to save your current workspace when you quit. This is sometimes convenient, but it can cause hard-to-detect problems with your analysis. If you alter one of the objects you are working with in an R session, but don’t capture the code in your script file, you won’t have a record of what you’ve done. If you then save your workspace, that modified object will still be there the next time you work on your code, even though your script no longer produces it. At this point, your code and data are out of sync, with nothing to indicate how they differ.

At best, this is inconvenient. At worst, if undetected, you can waste weeks or months analyzing the wrong data! Better to avoid the risk and set R to never save your workspace.
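
You can also give yourself a small safety net in code. This is only a sketch, but a script can at least warn you when a stale workspace file is sitting in the working directory:

## Objects restored from a saved workspace may be out of sync with this script.
if (file.exists(".RData")) {
  warning(".RData found in the working directory; ",
          "delete it and rerun this script from a clean session.")
}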

Metadata: data about your data

If your archive includes data files, these should also be in an open format, such as comma-separated or tab-separated text files (typically with names like FILE.csv, DATA.txt, or RECORDS.txt). That ensures that anyone can open your data without needing a proprietary program to do so.

Data Tables

In addition, you need to document how your data is coded. This includes things like:

  • Variable names and descriptions
  • Definition of codes and classification schemes
  • Codes of, and reasons for, missing values
  • Definitions of specialty terminology and acronyms

Be sure to include things like units (meters or feet?), and what your special codes mean (is pet_l the petal length or the petiole length? What are lf1, lf2, and lf3?). This is also a chance to review your data for consistency: are you using both NA and -1 for missing values? Do you have multiple different phrasings for the same thing?
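
A few quick checks in R can catch these issues before they get baked into your analysis. This is only a sketch: the file name comes from the example archive above, but the column names here are hypothetical:

soil <- read.csv("data/trich_soil.csv")

## Are missing values coded consistently, or do we have both NA and -1?
table(soil$ph, useNA = "ifany")

## Are categories phrased consistently ("forest" vs. "Forest" vs. "woods")?
unique(soil$habitat)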

Depending on your project, you may also need to distinguish between absent evidence (you didn’t sample on a day) and evidence of absence (you sampled on a day, but didn’t find any events/individuals). If your analysis will include sampling events that didn’t result in any observations, you’ll need to document these ‘true negatives’ in your data table.

Data Sources

If you’re using data from outside sources, like GBIF.org or WorldClim.org, be sure to record their citation details, including a DOI if available, when you download them. This will ensure you can properly cite them later, and that your readers will be able to access the same data if they want to reproduce your work. Most data providers have clear policies on how you should cite them, and on what you’re allowed to do with the data they share (i.e., whether you can share or archive it yourself). For example, here are the policies for WorldClim and GBIF.

How to get there from here

Now that we have an idea of what we want our data archive to look like, how do we get there? Start by setting up your directory structure and making a README file. If you’re starting from scratch, make it a practice to review your directory regularly: see what you’ve done, note what you’re no longer using, and update your README to capture that. This kind of regular reflection is useful for tracking your progress, and it helps you keep on top of your archive work so you don’t have a huge mess to wrangle at the end of your project.

If you’re already well along in your project, it may be easier to create an ‘aspirational’ directory and populate it from your existing work. I do this regularly! I often find I’ve charged into an analysis without thinking about archiving, and after a few weeks I have an unholy mess of files and data to deal with. In that case, I create a new directory and a README, and copy my main code file into that directory. As I work through that code, when I reach the first data file it loads, I copy it into the data/ directory of the new archive, adjust the code to use the relative path if necessary, and continue. This can be very helpful in clarifying what code and files you actually need and use, and what can be left behind.

This is also a good opportunity to ensure that your analysis is structured so that it can run start-to-finish without manual editing.

You don’t need to delete anything in the old directory; keep it in case you later decide you want to revisit some ideas in there.

What to do with it all

One of the benefits of structuring your code in a single directory is that there are lots of tools you can use to manage it. git and GitHub are popular, and very powerful for managing code, especially for tracking different versions of the same files. However, they require a certain amount of discipline to get the full benefit, and they can be challenging to learn. RStudio does have good support for GitHub repositories.

You can also keep it simple, and sync your directory to Google Drive, Dropbox, Nextcloud, or many other options. Once your work is published, you can archive it permanently on Zenodo, DataDryad, or other online services.

All of these options will support housing a single directory and its subdirectories. None of these options will be easy to deal with if you have files spread across multiple directories and mixed with files from other projects!

Examples

I’ve been doing some version of this for years, but have only recently moved to permanent, public archives of my work. Here are three recent examples: Foster et al. (2022), Hayes et al. (2022), and Nowell et al. (2022).

I’m still figuring out how best to do this, and my practice will definitely continue to change and evolve. Regardless of the specifics, I have benefited enormously from investing the time needed to make coherent archives of my projects.

References

Alston, J. M., and J. A. Rick. 2021. A Beginner’s Guide to Conducting Reproducible Research. The Bulletin of the Ecological Society of America 102: e01801.

Foster, S. L., H. M. Kharouba, and T. W. Smith. 2022. Testing the assumption of environmental equilibrium in an invasive plant species over a 130-year history. Ecography: e06284.

Hayes, A., S. Wang, A. T. Whittemore, and T. W. Smith. 2022. The Genetic Diversity of Triploid Celtis pumila and its Diploid Relatives C. occidentalis and C. laevigata (Cannabaceae). Systematic Botany 47: 441–451.

Nowell, V. J., S. Wang, and T. W. Smith. 2022. Conservation assessment of a range-edge population of Trichophorum planifolium (Cyperaceae) reveals range-wide inbreeding and locally divergent environmental conditions. Botany 100: 631–642.