Creating reproducible research is an important topic for many users of R, so important that several groups in the R community have tackled the problem, notably packrat from RStudio and gRAN from Genentech (see our previous blog post).
The Reproducible R Toolkit is a new open-source initiative from Revolution Analytics. It takes a simple approach to dealing with R package versions, and consists of an R package, checkpoint, and an associated daily CRAN snapshot archive, checkpoint-server. Here's one illustration of the problem it solves (with apologies to xkcd):
checkpoint-server
To achieve reproducibility, we store daily snapshots of all CRAN packages. At midnight UTC each day we refresh the CRAN mirror and then store a snapshot of CRAN as it exists at that very moment. You can access these daily snapshots using the checkpoint package, which installs and consistently uses these packages just as they existed at the snapshot date. Daily snapshots exist starting from 2014-09-17.
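As a rough illustration of the snapshot range, here is a small base R sketch that checks whether a date has a snapshot. The helper has_snapshot() is hypothetical and not part of checkpoint; it only encodes the two facts stated above (daily snapshots, starting 2014-09-17) and assumes the latest snapshot is today's date in UTC.

```r
# Hypothetical helper, not part of the checkpoint package: checks whether a
# date falls inside the range of daily snapshots described above.
first_snapshot <- as.Date("2014-09-17")

has_snapshot <- function(date, today = as.Date(Sys.time(), tz = "UTC")) {
  date <- as.Date(date)
  # A snapshot is assumed to exist for every day from the first one up to today
  !is.na(date) & date >= first_snapshot & date <= today
}

has_snapshot("2014-09-16")                                 # FALSE: before the first snapshot
has_snapshot("2014-10-13", today = as.Date("2014-10-13"))  # TRUE
```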
checkpoint package
The goal of the checkpoint package is to solve the problem of package reproducibility in R. Since packages get updated on CRAN all the time, it can be difficult to recreate an environment where all your packages are consistent with some earlier state. To solve this issue, checkpoint allows you to install packages locally as they existed on a specific date from the corresponding snapshot (stored on the checkpoint server) and it configures your R session to use only these packages. Together, the checkpoint package and the checkpoint server act as a "CRAN time machine", so that anyone using checkpoint can ensure the reproducibility of scripts or projects at any time.
How to use checkpoint
Once you have the checkpoint package installed, using the checkpoint() function is as simple as adding the following lines to the top of your script:
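A minimal sketch of those lines, assuming the checkpoint package is installed. The date shown is just an example, taken from the first available snapshot mentioned above:

```r
library(checkpoint)
# Example date only; use the snapshot date you want your packages to match
checkpoint("2014-09-17")
```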
Typically, you will use the date you created the script as the argument to checkpoint. The first time you run the script, checkpoint will inspect your script (and other R files in the same project folder) for the packages used, and install the required packages with versions as of the specified date. This first run will take some time while the packages download and install, but subsequent runs will reuse the previously installed package versions.
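To give a feel for what "inspecting your script for the packages used" might involve, here is a simplified base R sketch that pulls package names out of library() and require() calls. It is an illustration only, not checkpoint's actual implementation: the function name find_packages() is hypothetical, it catches only the first call on each line, and it does not skip commented-out code.

```r
# Illustrative stand-in for the package scan described above;
# not the code checkpoint itself uses.
find_packages <- function(code_lines) {
  # Match library(pkg) or require(pkg), with optional quotes around the name
  pattern <- "(library|require)\\(\\s*[\"']?([A-Za-z][A-Za-z0-9.]*)[\"']?"
  hits <- regmatches(code_lines, regexec(pattern, code_lines))
  # Each hit is c(full match, function name, package name), or empty if no match
  pkgs <- vapply(
    hits,
    function(m) if (length(m) == 3) m[3] else NA_character_,
    character(1),
    USE.NAMES = FALSE
  )
  unique(pkgs[!is.na(pkgs)])
}

script <- c("library(MASS)", "require(\"foreach\")", "x <- 1", "library(MASS)")
find_packages(script)  # "MASS" "foreach"
```

A real scan would read every R file in the project folder (for example with readLines()) and pass the lines through a function like this before installing anything.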
The checkpoint package installs the packages in a folder specific to the current project (in a subfolder of
If you want to update the packages you use at a later date, just update the date in the checkpoint() call and checkpoint() will automatically update the locally-installed packages.
Installing the checkpoint package
The checkpoint package is available on CRAN:
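Assuming a CRAN mirror is already configured for your R session, the standard install call would be:

```r
# Installs the current release of checkpoint from your configured CRAN mirror
install.packages("checkpoint")
```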
Worked example
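A minimal sketch of a complete session, assuming the checkpoint package is installed and the snapshot server is reachable. The project folder, the script name analysis.R and its contents are hypothetical, chosen only to give checkpoint a package to find:

```r
# Hypothetical project folder and script name, used only for illustration
project_dir <- file.path(tempdir(), "checkpoint_example")
dir.create(project_dir, recursive = TRUE)
old_wd <- setwd(project_dir)

# A small script that uses one package, so the scan has something to detect
writeLines(
  c("library(MASS)", "summary(lm(medv ~ lstat, data = Boston))"),
  "analysis.R"
)

library(checkpoint)
checkpoint("2014-09-17")  # install and use packages as of this snapshot date

getOption("repos")  # inspect which repository the session now points to
.libPaths()         # inspect which library folders the session now uses

# Run the analysis with the packages from the snapshot
source("analysis.R", echo = TRUE)

# Clean up the example project
setwd(old_wd)
unlink(project_dir, recursive = TRUE)
```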
To find out more:
Feedback and Thanks
The Reproducible R Toolkit was created by the Open Source Solutions group at Revolution Analytics. Special thanks go to Scott Chamberlain who helped with early development.
We'd love to know what you think about checkpoint. Leave comments here on the blog, or via the checkpoint GitHub page.
Why do you have require(checkpoint) in your graphic and library(checkpoint) in your code? I know require just tries to load the package and, if it fails, keeps going on its merry way. Is there a reason for it?
Posted by: Mark | October 13, 2014 at 10:01
There's no particular reason. I have a bad habit of using require() in my scripts, but library() is actually the better one to use.
Posted by: David Smith | October 13, 2014 at 14:34
This looks great!
Currently checkpoint depends on R>=3.1.1, is that a hard requirement?
Thanks.
Posted by: Gad Abraham | October 13, 2014 at 20:23
Hi,
really awesome. Was looking for something like that. One issue:
We use bioconductor a lot. It would be nice to either:
- have an interface to bioconductor
- expose the function that scans for all library and require statements
- filter package names that do not exist on cran (which are then hopefully the packages that exist on bioconductor)
Posted by: Holger Hoefling | October 15, 2014 at 04:51
Hi Holger, we have some ideas on how to incorporate Bioconductor packages, GitHub packages and private packages into the checkpoint system -- stay tuned. In the meantime, you might want to check out gRAN or packrat (see the links at the top of this post).
Posted by: David Smith | October 15, 2014 at 05:33
David, could you say a little about advantages/disadvantages of this vs. Packrat?
Posted by: Clayton Yochum | October 15, 2014 at 14:37
@Clayton, packrat's more flexible, in that you can mix-and-match R packages from different epochs. checkpoint is designed to be less flexible, but much simpler. It's easier to share reproducible scripts with checkpoint -- the script alone is sufficient. With packrat, you need to share the packages as well (or have your script configured in a GitHub repo). Depending on what you need, either checkpoint or packrat might be the best fit.
Posted by: David Smith | October 15, 2014 at 20:40