by Joseph Rickert
What makes for a good R package? With over 8,000 packages up on CRAN the quantity of packages is clearly not an issue for R users. Developing an instinct to recognize quality, however, both requires and deserves some effort. I regularly spend time on Dirk Eddelbuettel’s CRANberries site investigating new packages and monitoring changes in old favorites in order to recommend packages for inclusion in MRAN’s Package Spotlight page. As a consequence, I think I’m getting a feel for quality and I believe it comes down to this: A good R package clearly says what it does and then really does what it says.
It should not be surprising that documentation is the key. For an R package, the first obvious place for an author to provide quality documentation is the vignette. Hadley Wickham writes:
A vignette is a long-form guide to your package. . . A vignette is like a book chapter or an academic paper: it can describe the problem that your package is designed to solve, and then show the reader how to solve it. . . Vignettes are also useful if you want to explain the details of your package. For example, if you have implemented a complex statistical algorithm, you might want to describe all the details in a vignette so that users of your package can understand what’s going on under the hood, and be confident that you’ve implemented the algorithm correctly.
Unfortunately, less than 25% of all R packages have vignettes!
As a person who is in the habit of writing, I find it astounding that anyone would do all the creative work to develop the contents of a package, make the effort to get it through the CRAN process and release it into the open source domain, and yet not take the basic step of explaining its value, and maybe even discreetly singing its praises. Nevertheless, most package authors, even some who have otherwise done great work, balk at writing engaging documentation.
Although they are not visible in the plot above, there are a few packages whose authors have made extravagant attempts at documentation. The following table lists all packages with 10 or more vignettes.
Name Date Vignettes
1 caschrono 2014-03-21 10
2 catdata 2014-11-11 45
3 copula 2015-10-26 13
4 gamclass 2015-08-20 11
5 ggvis 2015-06-06 10
6 HSAUR 2015-07-28 17
7 HSAUR2 2015-07-28 19
8 HSAUR3 2015-07-29 22
9 Sleuth2 2016-01-08 13
10 Sleuth3 2016-01-08 13
11 tigerstats 2015-09-23 20
(Yes, catdata does really have 45 vignettes, but the package is the documentation for a book.)
To be fair and thorough, though, I should mention that there are some pretty lame vignettes out there. I sometimes find myself putting a new package on the short list for a Spotlight evaluation just because it has a vignette, only to be disappointed when I get around to looking at it.
I should also note that writing a vignette is neither the only way nor the most sophisticated way to document an R package. Top shelf packages implementing a statistical algorithm or new computational method are often complemented by a paper published in the Journal of Statistical Software or some other peer-reviewed publication. kernlab, for example, uses a version of its JSS paper as a vignette. Other packages, such as statnet, are backed up by informative websites, and it is also becoming common for package authors to provide links to their GitHub development sites. See data.table's GitHub page, for example, or Dirk's Rcpp CRAN page, which provides links to extensive documentation listing multiple vignettes as well as multiple websites. I find it particularly convenient that the package pdf links to these sites and also to the supporting JSS paper.
So, what about the other part of my definition: A good R package ... really does what it says. I find that good documentation does correlate positively with good code, but beyond that the best way to make a quick assessment of the quality of a package is to see if it is included in any of the CRAN Task Views. These are lists of packages organized and curated by experts who make heroic efforts to keep them current and comprehensive, if not complete. Amazingly, 27% of CRAN packages are listed in at least one Task View! This is higher than the percentage of packages that have vignettes.
MRAN provides a lookup feature that is convenient for checking a package's quality potential. With one click you can see the names of any vignettes that may be associated with a package along with any Task Views in which it may be listed.
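For packages already installed locally, a similar check can be made from within R itself. The following is only a minimal sketch and not the MRAN lookup: it uses the base function vignette() to list the vignettes that ship with one package. The choice of "grid" is an assumption made purely because that package comes with R; the variable names are illustrative.

```r
# List the vignettes shipped with an installed package.
# "grid" is used here only because it is part of a standard R installation.
pkg <- "grid"

# vignette(package = ...) returns an object whose $results element is a
# character matrix with one row per vignette
vig <- vignette(package = pkg)$results

# Number of vignettes found for this package (0 if none are installed)
n_vig <- nrow(vig)

# Show the vignette names and titles
vig[, c("Item", "Title"), drop = FALSE]
```

Calling browseVignettes(pkg) instead opens the same information in a browser, which may be more convenient for interactive use.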
Of course, the crucial step for checking quality and utility of a package is to try the package yourself. Doing this also provides an opportunity for you the user to contribute to the common good by just using the package in your work, talking about its merits, and maybe even providing constructive feedback to the author if you think that would be helpful.
The data for this post comes from a JSON file available on the MRAN website. The code below was used for the post.
library(jsonlite)
library(ggplot2)

# Read in package data as JSON file and form into a data frame
json_file <- "https://mran.revolutionanalytics.com/packagedata/allpackages.json"
json_data <- as.data.frame(fromJSON(paste(readLines(json_file), collapse = "")))
dF <- json_data[, c(1, 2, 6, 7)]
names(dF) <- c("Name", "Date", "Task_View", "Vignettes")
dF$Vignettes <- ifelse(is.na(dF$Vignettes), 0, dF$Vignettes)
dF$Task_View <- ifelse(is.na(dF$Task_View), "None", dF$Task_View)
head(dF)

# Look at some summary statistics
summary(dF$Vignettes)
table(dF$Vignettes)

# Find number of packages with at least one vignette
have_vig <- sum(dF$Vignettes > 0)
have_vig  # 1961
pct_vig <- have_vig / length(dF$Vignettes)
pct_vig   # 0.2353859

# Find packages with >= 10 vignettes
dF[dF$Vignettes >= 10, ]

# Modify data frame for printing (drop the Task_View column)
dF10 <- dF[dF$Vignettes >= 10, -3]
row.names(dF10) <- NULL
dF10

# Plot histogram
p <- ggplot(dF, aes(x = Vignettes))
p + geom_histogram(binwidth = 1) +
  ggtitle("The sad tale but long tail of vignettes")

# Find proportion of packages listed in at least one Task View
have_TV <- dF[dF$Task_View != "None", ]
pct_TV <- dim(have_TV)[1] / length(dF$Vignettes)
pct_TV    # 0.2724763
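To try the same calculations without downloading the JSON file, the core steps can be sketched on a small made-up data frame. Everything in this example is an assumption for illustration: the package names and counts are invented and are not MRAN data. Only the column names follow the post's code.

```r
# Hypothetical stand-in for the MRAN package data; all values are invented
dF <- data.frame(
  Name      = c("pkgA", "pkgB", "pkgC", "pkgD"),
  Task_View = c("Bayesian", NA, NA, "TimeSeries"),
  Vignettes = c(2, NA, 0, 12),
  stringsAsFactors = FALSE
)

# Replace missing values: no vignette count means 0, no Task View means "None"
dF$Vignettes[is.na(dF$Vignettes)] <- 0
dF$Task_View[is.na(dF$Task_View)] <- "None"

# Proportion of packages with at least one vignette
pct_vig <- mean(dF$Vignettes > 0)

# Proportion of packages listed in at least one Task View
pct_TV <- mean(dF$Task_View != "None")

# Packages with 10 or more vignettes
many_vig <- dF[dF$Vignettes >= 10, c("Name", "Vignettes")]
```

Taking mean() of a logical vector gives the proportion of TRUE values directly, which is a compact alternative to dividing a count by the number of rows.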
Here are the things I tell people to look out for as potential indicators for the "quality" of a package (in no particular order):
- written by a known expert in the field
- package has been around for some time
- package has been updated
- listed under a task view
- has a vignette or other supporting documentation
- there is a package website
- paper/book about package has been published
- help files are comprehensive and free of errors
- has been cited in papers
None of these is a necessary or sufficient condition for showing that a package is of high quality (which is difficult to define in the first place), but they can be helpful indicators.
Posted by: Wolfgang Viechtbauer | May 12, 2016 at 09:05
I am curious, does the 25% figure include those packages that are on Bioconductor, or just those on CRAN? A requirement for inclusion on Bioconductor is that a package must have at least one vignette, which I think is a very good policy.
Posted by: Rmflight | May 12, 2016 at 09:16
You write "For an R package, the first obvious place for an author to provide quality documentation is the vignette." I would argue this is not the first obvious place to look for quality documentation. READMEs are the first place, as they are what get displayed first on a GitHub page. The very historical nature of READMEs tells the user to "read this first". You make what is a faulty initial assumption in the vignette == quality (I agree with this) but your logic also means !vignette == low quality (which you indicate is also an assumption based on the remainder of your post). This is likely not a tenable assumption given the prevalence of READMEs and GitHub as the popular dev environment. If this were a regression model you have one variable and an R^2 that shows vignettes are predictors, but your model falls short of explaining package quality (i.e., significance but great amounts of unexplained variance). It's pretty tedious to maintain separate READMEs and vignettes, so many choose READMEs as the natural obvious place to provide quality information. Adding README length (nchar) as a variable to your model may be more work but is likely a better model of package quality. Granted, you do give a paragraph disclaimer that vignettes aren't the only measure, but your title implies !vignette == bad. Perhaps your model is measuring developers' valuing of vignettes as the primary form of communication, not package quality as your title suggests. I suspect your model is a measure of developer groups (e.g., academics, business, etc.) and their choices of mode of communication, not quality.
Posted by: tyler rinker | May 12, 2016 at 14:31
You have raised an important topic. As Bill Venables wrote on R-Help in 2007 "Most packages are very good, but I regret to say some are pretty inefficient and others downright dangerous."
Having good examples in the help with real datasets is another plus point. If authors have never tried their own code in practice, they may well have missed something.
Posted by: Antony Unwin | May 13, 2016 at 02:25
Tyler Rinker suggests Github README files are more important than vignettes. I disagree. Github is a developer site, not a user site. Users see the documentation that comes with the package, and is displayed within R.
Maintaining a web page to support a package is a good thing, and the README file would be the place to hold the content if the package was hosted on Github, but requiring Github as the host is too limiting, and depending on an external web site for documentation is too fragile. Packages should contain their documentation.
I think Joe Rickert underestimates the importance of the help page documentation. Antony Unwin said that having good examples with real datasets is a plus. Those probably belong in vignettes, but I'd add that having short illustrative examples on the help pages is really important.
Posted by: Duncan Murdoch | May 13, 2016 at 06:35
Duncan, thanks for your insights. I think I may have overstated the importance of a README. I didn't mean to suggest it is more important, more so that it is an alternative place to display package usage and can also demonstrate quality. The README is also typically part of the package build, and CRAN displays it in a pretty way if it's an .md, though it is not as readily accessible within R as a vignette or help page.
I think there's room for a solution to the problem of having an easy way to maintain a README that also serves as a vignette without maintaining 2 documents. This is especially true for smaller packages with a small number of functions, where the use could easily be described with a single vignette. It'd be nice to just be able to maintain a README and vignette, but the two documents are most likely just a tad bit different. A Google search brings up this related Twitter discussion on the topic: https://twitter.com/JennyBryan/status/724391823385354240
Posted by: tyler rinker | May 13, 2016 at 07:24
Vignettes are less important than manuals IMO. Vignettes are just a way to wrap what is in the manuals. A vignette is important to show the bigger picture of how to combine multiple functions from a package, but it is just an example workflow. The real documentation is the manuals.
Posted by: jangorecki | May 15, 2016 at 04:33