Using package repositories to
recreate the past, distribute the present, and protect against the future
by Gabriel Becker (@groundwalkergmb)
Bioinformatics and Computational Biology
Genentech Research and Early Development
1. Have you ever needed to reach into the distant past …
to recreate a years old result? Take - as an arbitrary example - Anders and Huber's paper on Differential expression analysis for sequence count data from 2010. "2010", the excitable among you might exclaim, "were they programming their R scripts on punch-cards and running them on coal-powered computers??" Perhaps they were, my hyperbolic readers, but hardware is not our concern today.
Replicable scientific results are the engine for advancing human knowledge; non-replicable results, simply put, are not. This holds true regardless of how many eons - in computer years - have passed since the results were originally generated.
2. They got what??
Anders and Huber provide Sweave output (a PDF) that contains the code for the plots and output directly reported in their paper, as well as all the datasets used by the code. So let's reproduce their results! Getting the code into an executable form for this is painful, but that is a subject for another time.
Below is their code (I added the supressMessages() call) for identifying differentially expressed genes for their fly data with a False Discovery Rate (FDR) cutoff of .1
suppressMessages(library(DESeq)) countsTableFly <- read.delim( "fly_RNA_counts.tsv" ) condsFly <- c( "A", "A", "B", "B" ) # Add dummy names to avoid confusion later rownames( countsTableFly ) <- paste( "Gene", 1:nrow(countsTableFly), sep="_" )
cdsFly <- newCountDataSet( countsTableFly, condsFly ) cdsFly <- estimateSizeFactors( cdsFly ) cdsFly <- estimateVarianceFunctions( cdsFly ) #oops! error! resFly <- nbinomTest( cdsFly, "A", "B" )
Error: 'estimateVarianceFunctions' is defunct. Use 'estimateDispersions' instead. See help("Defunct") Error in nbinomTest(cdsFly, "A", "B") : Call 'estimateDispersions' first.
It didn't work! We got an error when running their code.The Defunct error suggests that we use the estimateDispersions function instead of the removed estimateSizeFactors function, so lets do that …
359?!? The paper (and Sweave output) report that this code gives the result of 864!
Before we go lighting our pitchforks and sharpening our torches, however, we need to remember a simple fact: just because it doesn't work now doesn't mean it didn't work then.
At this point you may be wondering what I'm proposing. Are we supposed to take a quick trip back in time and rerun their code in 2010? Well, sort of. The GRANBase package - among many other things - provides a sort of time machine for your computing environment.
3. Climbing in through the sessionInfo() window
Anders and Huber provided a window into the computing environment they used in the form of sessionInfo() output in their SWeave output PDF. "Windows are great", I'm sure you're thinking, "except that I can't use them to get inside and so I'm left out in the cold!"
Well not with that attitude you can't. We're going to use GRANBase to knock out that window and install a door in its place. One that we - and everyone else - can use to reproduce Anders and Huber's results. No more pressing your face up against the glass, drooling at the delicious old software tantalizingly arranged on the other side.
First we have to get our hands on the right version of R - 2.12 in this case. That isn't the easiest thing to do, but for those of you who don't want to install and manage an old R installation you can use an Amazon EC2 instance with this lovingly hand-crafted AMI that contains R 2.12.1 and has switchr installed: ami-94a670fc
NOTE: the authors actually ran their code in 2.12.0, but in general we can usually assume that matching the major (2) and minor (12) versions is sufficient, as we see that it is in this case. The end result (run in R 2.12.1):
library(switchr) doiRepo = "http://research-pub.gene.com/gran/10.1186/gb-2010-11-10-r106" switchTo(doiRepo, name = "10.1186/gb-2010-11-10-r106")
Switched to the '10.1186/gb-2010-11-10-r106' computing environment. 54 packages are currently available. Packages installed in your site library ARE suppressed. To switch back to your previous environment type switchBack()
If we had never switched to a computing environment based on this particular repository within the version of R being used, the above statement would create the computing environment - by installing all the packages contained in that repository - and remember it for the future. I have spared you the dozens of lines of package installation output, because that's just the type of person I am. Now then, let's try that again.
suppressMessages(library(DESeq)) countsTableFly <- read.delim( "fly_RNA_counts.tsv" ) condsFly <- c( "A", "A", "B", "B" ) # Add dummy names to avoid confusion later rownames( countsTableFly ) <- paste( "Gene", 1:nrow(countsTableFly), sep="_" ) cdsFly <- newCountDataSet( countsTableFly, condsFly ) cdsFly <- estimateSizeFactors( cdsFly ) cdsFly <- estimateVarianceFunctions( cdsFly ) #oops! error! resFly <- nbinomTest( cdsFly, "A", "B" ) length( which( resFly$padj < .1 ) )
And just like that we've recreated their result. Now, if we were staying in the same R, we would revert to our original computing environment and go about our day.
4. So what just happened?
The short answer: there is a repository associated with the paper's DOI that contains all the exact package versions used by the authors. This repository encapsulates the R computing environment - up to the actual version of R used - that Anders and Huber used. And it renders that environment recreatable, and thus - along with the original data they provide - their results reproducible.
The repository mechanism provides a natural way to both define specific sets of package versions that make up an R computing environment and make them instantly distributable.
5. Sure, but how did it get there?
"Well that's great", you might be thinking, "when we have such a repository; but we're out of luck if the authors of an article don't create one at the time of publication." And you would have largely been right, until now. GRANBase can retrieve the source version of any non-base package that has ever been on CRAN or Bioconductor.
Allow me to repeat that for emphasis. GRANBase can retrieve and build sources for any version of any non-base package that has ever been released on CRAN or BioConductor. For example - assuming you have the svn command-line utility installed - this will get you the DESeq version Anders and Huber used:
Furthermore, GRANBase will happily parse sessionInfo() output - either the R object or the text -, find and build source versions of the specified packages, and build the repository for you. That is how I got the repository I used to recreate Anders and Huber's work.
"The world is saved!", you're probably shouting, and "GRANBase for President!" You guys sure are excitable. I appreciate the sentiment, but I'm fairly certain that R packages can't hold elected office.
6. So what was that switching stuff?
The switchr package provides a formal abstraction for creating, managing, tracking, and switching between "R computing environments" built on top of R's existing library location mechanism. So instead of managing library path's directly, we switched to the appropriate computing environment, ran the authors' computations, confirmed that we got the same results, and then switched back to our normal R installation. Also, as we have seen, switchr is confirmed to work at least as far back as R 2.12.1 (2010), with earlier R versions likely supported but untested.
7. But won't someone think of the harddrives?
Every time a paper, blog post, or internal analysis gets done, we copy the tarballs to a new repository and move on with our day. Sounds good. "Madness!", comes the refrain, "Why would you copy a file 920345982430589423509823452340589345 times???" Well, first of all that number seems a bit high to me, even for popular packages. In principle, though, I agree (and thus so does GRANBase).
GRANBase gets around this with the concept of having many "virtual repositories". Virtual repositories are ones that have passed Kindergarten and thus know how to share. Specifically, they share tarballs of exact package versions, so that communally they only need one copy of each version of each package.
The details of how this sharing happens does this are both unimportant and out of scope here, but the take-away is that there is no duplication of files when many virtual repositories hosted in the same place share package versions. Plus, like the cooks at any reasonable diner, GRANBase can make your packages to order using the version location functionality I mentioned above. This means that with code, data, sessionInfo() output, and a suitable R installation, we can assess and preserve the reproducibility of every R analysis run in the past, present, or future ^.
^Assuming no dependencies not encompassed by R packages
8. Other uses for this framework
- Ensuring the R environment on your local machine is equivalent to the one on a particular computing cluster to the one on your local machine
- Ensuring everyone collaborating on package development is programming against the same API (set of dependency versions)
- Allowing package maintainers to easily recreate the exact environment a reported bug was detected in
- Having side-by-side release and devel installations of BioConductor
9. Other things GRANBase does, very briefly
9.1 Risk reports
- Compare your currently installed packages to currently available versions
- Asseses possible "ripple" effects of updating packages
- Parses NEWS - when possible - to summary # of fixed bugs you are subject to
- Helps reformulate whether to upgrade or not as a strategic decision
9.2 Incremental, multi-source Repository building
- Build repositories using packages sources from many disparate sources (SVN, github, etc) based on the concept of a package manifest
- Only rebuilds updated packages (version bump) and their reverse dependnecies. No wasted testing of unaltered pkgs
- CI-ready design, we use Jenkins
END SHAMELESS PLUG
10. Related work, and further reading
- Gentleman and Temple Lang for a discussion of why and how software dependencies should be distributed packaged with dynamic documents.
- RStudio's packrat is an implementation of Gentleman and Temple Lang's ideas centered around RStudio's concept of an R Project.
- Andrie de Vries' miniCRAN R package explores the concept of creating frozen, self-consistent subsets of CRAN.
Starting with R 2.13, the Bioconductor project provides AMIs that encapsulate core BioC packages, but which also necessarily encapsulate historical R versions. And no, before you ask, starting at 2.13 isn't because there is a vast and highly prescient conspiracy against us … probably.
Dirk Eddelbuettel's work with Docker-based R builds offer the potential for another way of mitigating the pain of the Old-R-versions requirement, though currently that is not the focus of his work. (See ~pg 50)
Cory Barr created an earlier GRAN package which served as a stepping-off point for development of GRANBase.
Thanks to David Smith, Joseph Rickert and Revolutions Analytics for allowing me use of their soapbox.
GRANBase and switchr were developed by Gabriel Becker under the leadership of Michael Lawrence and Robert Gentleman at Genentech Research and Early Development.
Thanks to Michael and Robert for their advice and support throughout the project, and for allowing me to grow it's scope beyond what was initially envisioned.