A few weeks ago Joseph Rickert wrote an excellent post about using the igraph package, illustrating many concepts of using graphs.
His post reminded me of another excellent blog entry by Antonio Piccolboni, where he used the page.rank() function in the igraph package to determine the essential R packages. Unfortunately, Antonio does not show the code he used, which intrigued me enough to recreate his analysis.
In this post I illustrate:
- Using the miniCRAN package to build a graph of package dependencies (see previous blog post)
- Using page.rank() to compute the most relevant packages
- Incidentally, I also make use of the %>% pipes, exposed by the magrittr package (previous blog post).
You can view the code at the bottom of this blog post, but first let's review the big idea.
The idea
The big idea behind this analysis is that package authors write their code to depend only on packages that they view as useful and stable. In a sense, each time a package is included in the dependency list of another package, it is as if the author casts a vote of confidence.
What if you had some kind of algorithm that calculates a statistic, where the number increases with each link to a package?
This is what the famous PageRank algorithm does. It is one of the mechanisms that Google uses to determine the importance of a web page. In essence, a page (or package) is deemed to be more important if many other pages (packages) link to it.
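To make the idea concrete, here is a minimal sketch of PageRank in base R, using power iteration on a tiny made-up dependency graph. The package names (pkgA, core, util and so on), the pagerank() helper and the damping factor of 0.85 are my own assumptions for illustration; this is not the code behind the results in this post.

```r
# Hypothetical dependency edges: each row reads "from depends on to".
deps <- data.frame(
  from = c("pkgA", "pkgB", "pkgC", "pkgC", "pkgD"),
  to   = c("core", "core", "core", "util", "util"),
  stringsAsFactors = FALSE
)

# Illustrative helper (not part of any package): PageRank by power iteration.
pagerank <- function(edges, damping = 0.85, tol = 1e-10, max_iter = 1000) {
  nodes <- sort(unique(c(edges$from, edges$to)))
  n <- length(nodes)
  # Adjacency matrix: A[i, j] = 1 when node i links to node j.
  A <- matrix(0, n, n, dimnames = list(nodes, nodes))
  A[cbind(edges$from, edges$to)] <- 1
  out_degree <- rowSums(A)
  rank <- rep(1 / n, n)
  names(rank) <- nodes
  for (i in seq_len(max_iter)) {
    # Each node splits its current rank equally among the nodes it links to.
    share <- ifelse(out_degree > 0, rank / out_degree, 0)
    # Nodes without outgoing links spread their rank evenly over all nodes.
    dangling <- sum(rank[out_degree == 0])
    new_rank <- (1 - damping) / n + damping * (drop(share %*% A) + dangling / n)
    converged <- sum(abs(new_rank - rank)) < tol
    rank <- new_rank
    if (converged) break
  }
  rank
}

pr <- pagerank(deps)
print(round(sort(pr, decreasing = TRUE), 3))
```

In this toy graph, core is depended on by three packages and util by two, so core ends up with the highest score, followed by util.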
The good news is that the igraph package has a built-in function to compute PageRank, called page.rank().
Only one thing remains, and that is to construct the actual graph. Here is the second piece of good news: the miniCRAN package contains a function, makeDepGraph(), that constructs the dependency graph for a given set of packages.
For example, in this simple graph, node 4 has many incoming links and thus has a high page rank. However, although node 1 has only 2 incoming links, one of those links is from a very popular node, so node 1 has the highest page rank in the graph.
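The same effect can be reproduced with igraph on a small hypothetical graph. The edges below are my own assumption, chosen to match the description, and are not necessarily those of the graph described above. I use page_rank(), the current name for page.rank() in recent igraph releases.

```r
library(igraph)

# Hypothetical edges, written as from, to pairs:
# nodes 2, 3, 5 and 6 link to node 4; nodes 4 and 7 link to node 1.
g <- make_graph(edges = c(2, 4,  3, 4,  5, 4,  6, 4,  4, 1,  7, 1),
                directed = TRUE)

in_links <- degree(g, mode = "in")
pr <- page_rank(g)$vector   # uses the default damping factor of 0.85

# Incoming links and PageRank per node.
print(data.frame(node = seq_along(pr),
                 in_links = in_links,
                 pagerank = round(pr, 3)))
```

Node 4 collects four incoming links, but node 1 comes out on top with only two, because one of them comes from node 4.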
The results
Starting with the answer, here are the top 25 packages, sorted by page rank. I added a high-level description for each package:
Rank | Package | PageRank | Description |
1 | MASS | 0.0210 | Functions and datasets to support Venables and Ripley, 'Modern Applied Statistics with S' (4th edition, 2002). |
2 | Rcpp | 0.0166 | Interface to use C++ code in R |
3 | Matrix | 0.0100 | Sparse matrix engine |
4 | lattice | 0.0096 | Lattice (trellis) graphics; a recommended package shipped with R |
5 | mvtnorm | 0.0088 | Multivariate normal distributions |
6 | survival | 0.0083 | Time-to-event analysis |
7 | ggplot2 | 0.0073 | Graphics engine |
8 | plyr | 0.0067 | Group-by operations |
9 | XML | 0.0047 | Parse and manipulate documents in XML format |
10 | igraph | 0.0047 | Analyse graph structures |
11 | RCurl | 0.0043 | Wrapper around Curl, to process HTTP requests |
12 | sp | 0.0043 | Spatial analysis |
13 | coda | 0.0042 | Output analysis and diagnostics for Markov Chain Monte Carlo simulations |
14 | nlme | 0.0041 | Linear and non-linear mixed effects modeling |
15 | boot | 0.0038 | Functions and datasets for bootstrapping |
16 | stringr | 0.0038 | String operations |
17 | rgl | 0.0034 | Interface to gl library to create interactive 3D graphics |
18 | rJava | 0.0033 | Interface to Java |
19 | reshape2 | 0.0033 | Reshape data from wide to long format and vice-versa |
20 | RcppArmadillo | 0.0032 | Extension of Rcpp to include Armadillo libraries |
21 | ape | 0.0031 | Provides functions for reading, writing, plotting, and manipulating phylogenetic trees |
22 | zoo | 0.0031 | Functions to work with irregularly ordered time series data |
23 | Hmisc | 0.0027 | Frank Harrell's miscellaneous functions for data analysis |
24 | numDeriv | 0.0027 | Methods for calculating (usually) accurate numerical first and second order derivatives |
25 | mgcv | 0.0026 | Routines for GAMs and other generalized ridge regression |
The code
The combination of miniCRAN::makeDepGraph() and igraph::page.rank() means the complete analysis takes fewer than 50 lines of R code.
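As a rough illustration of how the two functions fit together, here is a minimal sketch rather than the original script. The function name rank_cran_packages(), the mirror URL and the argument choices are assumptions on my part. It needs the miniCRAN, igraph and magrittr packages plus network access, and building the graph for all of CRAN can take a while.

```r
# Sketch only: rank_cran_packages() is a hypothetical helper, and the default
# mirror URL is an assumption. Nothing is downloaded until it is called.
rank_cran_packages <- function(repos = "https://cloud.r-project.org", top = 25) {
  `%>%` <- magrittr::`%>%`

  # Matrix of package metadata, one row per package on the mirror.
  available <- miniCRAN::pkgAvail(repos = repos, type = "source")

  available[, "Package"] %>%
    # Dependency graph over every package listed on the mirror.
    miniCRAN::makeDepGraph(availPkgs = available, repos = repos,
                           type = "source",
                           suggests = FALSE, enhances = FALSE) %>%
    # Edge direction is ignored here (an assumption of this sketch), so a
    # package's score grows with the number of packages connected to it.
    igraph::page_rank(directed = FALSE) %>%
    getElement("vector") %>%
    sort(decreasing = TRUE) %>%
    utils::head(top)
}
```

Calling rank_cran_packages() should return a named numeric vector with the highest-scoring packages first. The exact values depend on the state of the mirror at the time, so they will differ from the table above.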