Reproducible Research with R

Tyler Smith
February 22, 2014

Overview

  • What is reproducible research?
  • Setting up a RR project
  • Using R Markdown and knitr

Acknowledgements

This presentation uses material prepared by Roger D. Peng, including two of his images and a lot of knitr slides. Source:

https://github.com/rdpeng/courses

https://www.coursera.org/specialization/jhudatascience/1

Dr. Peng's material and this presentation may be shared, subject to the Creative Commons Attribution NonCommercial ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_GB).

Motivation

  • years of data collated in an Excel sheet
  • data cleaned & corrected in Excel
    • data1.xls, data2.xls, data07.xls …
  • analysis spread across multiple programs, different parameters
    • analysis1.txt, analysis2.stat, final.an.R, real-final.an.R
  • files scattered across several folders/computers

Motivation

  • six months later: collaborators finally send you their data; you need to revise the analysis
  • a year later: reviewer requests changes
  • 3 years later: you complete a follow-up study, need to revisit analysis

Which data?
Which analysis?
Which files?

Reproducible Research: Definition

  • true replication is rarely feasible
    • collecting data is difficult & expensive

Reproducible research involves the careful, annotated preservation of data, analysis code, and associated files, such that statistical procedures, output, and published results can be directly and fully replicated

ropensci.org/blog/2014/02/20/dvn-dataverse-network/

Requirements

  • data are available
  • code is available
  • code and data are documented
  • code, data and documentation are stored in a format that enables storage & distribution

Getting Started

  • keep your project in one directory
    • more complex projects may require nested subdirectories
    • monophyletic structure is best!
  • use relative file names for portability
    • bad: C:\Documents\Tyler\project1\myfile.txt
    • good: .\myfile.txt
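
As a rough illustration of the relative-path advice above, the sketch below builds a path with file.path(), which joins components with a forward slash on every platform. The "data" subdirectory is a hypothetical layout, not something the slides prescribe, and the working directory is assumed to be the project directory.

```r
# Hypothetical layout: the working directory is the project directory,
# and it contains a "data" subdirectory (an assumption for illustration).
data_file <- file.path("data", "myfile.txt")
print(data_file)

# normalizePath() shows what the relative path resolves to on this machine;
# mustWork = FALSE avoids an error if the file does not exist yet.
print(normalizePath(data_file, mustWork = FALSE))
```

Because the script only ever refers to the relative path, the whole project directory can be moved or shared without editing the code.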

Processing Data

  • clean & process your data with scripts, not GUIs
## read the raw data (assumes sample IDs are in the first column)
mydat <- read.csv("2014-02-20_data.csv", row.names = 1)

## correct pH data: recode zeros as missing
mydat$pH[mydat$pH == 0] <- NA

## exclude rows with missing data:
mydat <- mydat[complete.cases(mydat), ]

## exclude sample 67, failed reaction:
mydat <- subset(mydat, rownames(mydat) != "samp67")
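
To see what each of these steps does, here is a small self-contained sketch run on a toy data frame. The values and the temp column are invented for illustration, and sample IDs are assumed to be stored as row names.

```r
# Toy data, invented for illustration only; sample IDs are the row names.
raw <- data.frame(
  pH   = c(6.8, 0, 7.1, 6.5, 6.9),
  temp = c(12.1, 11.8, 12.0, NA, 12.4),
  row.names = c("samp65", "samp66", "samp67", "samp68", "samp69")
)

clean <- raw

## recode zero pH readings as missing
clean$pH[clean$pH == 0] <- NA

## exclude rows with missing data (complete.cases() indexes rows)
clean <- clean[complete.cases(clean), ]

## exclude sample 67, failed reaction
clean <- subset(clean, rownames(clean) != "samp67")

print(clean)
```

The raw object is never modified, so every exclusion can be traced back and undone by editing the script and running it again.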

Processing Data

  • Permanent audit trail
    • what you did and why you did it
    • reversing your actions is possible & straightforward

Getting Started

  • keep your project in one directory
  • use relative file names for portability
  • clean & process your data with scripts, not GUIs
  • keep analytic code and data together
  • link analytic code and presentation
    • eliminate cutting & pasting
    • literate programming

Literate Programming

  • one document contains text and code
  • from this single source:
    • extract code
    • extract text
    • compile code and weave results into text
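
The following is a minimal sketch of that workflow using the knitr package, assuming knitr is installed. The file names and the content of the tiny document are made up for illustration. knitr::purl() extracts the code, and knitr::knit() runs the code and weaves its results into the text.

```r
# Sketch only: file names and document content are placeholders.
fence <- strrep("`", 3)  # chunk delimiter, built here rather than typed literally

# A single source document containing both text and code
rmd <- c(
  "A tiny report",
  "",
  "The mean of the numbers 1 to 10 is computed below.",
  "",
  paste0(fence, "{r mean-chunk}"),
  "x <- 1:10",
  "mean(x)",
  fence
)
src <- file.path(tempdir(), "example.Rmd")
writeLines(rmd, src)

if (requireNamespace("knitr", quietly = TRUE)) {
  # Extract the code only
  r_file <- knitr::purl(src, output = file.path(tempdir(), "example.R"),
                        quiet = TRUE)
  # Run the code and weave its results into the text
  md_file <- knitr::knit(src, output = file.path(tempdir(), "example.md"),
                         quiet = TRUE)
  print(readLines(md_file))
} else {
  message("knitr is not installed; only the source document was written.")
}
```

The woven Markdown file contains the original text, the code, and the computed result, all generated from the one source file.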

knitr

  • R package written by Yihui Xie
    • available on CRAN, integrated with RStudio
  • Supports mixing R code and markup languages like HTML, LaTeX and Markdown
  • Export to PDF, HTML (-> .doc), .R

knitr Requirements

  • a recent version of R
  • RStudio (any text editor will do, but RStudio is the easiest to start with)
  • understanding of Markdown (which you'll soon have!)

Markdown

  • a very simple markup language
  • described in the RStudio Help -> Markdown Quick Reference