by Andrie de Vries
My experience of UseR!2015 drew to an end shortly after I gave a Kaleidoscope presentation discussing "The Network Structure of CRAN".
My talk drew heavily on two previous blog posts, "Finding the essential R packages using the pagerank algorithm" and "Finding clusters of CRAN packages using igraph".
However, in this talk I went further, attempting to create a single visualization of all ~6,700 packages on CRAN. To do this, I did all the analysis in R, then exported a GraphML file, and used Gephi to create a network visualization.
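The R-to-Gephi hand-off described above can be sketched roughly as follows. This is a minimal illustration, not the code from the talk: the `deps` edge list, its package names (`pkgA`, `core`, and so on) and the output file name are all hypothetical stand-ins for the real CRAN dependency data.

```r
library(igraph)

# Hypothetical toy edge list: each row means "from depends on to".
# The real analysis would build this from CRAN package metadata.
deps <- data.frame(
  from = c("pkgA", "pkgB", "pkgC", "pkgC", "pkgD"),
  to   = c("core", "core", "pkgA", "core", "pkgC"),
  stringsAsFactors = FALSE
)

# Build a directed graph: one vertex per package, one edge per dependency.
g <- graph_from_data_frame(deps, directed = TRUE)

# Write the graph as GraphML, a format that Gephi can open.
# The file name and location are assumptions for this sketch.
out_file <- file.path(tempdir(), "cran-toy.graphml")
write_graph(g, out_file, format = "graphml")
```

Opening the resulting file in Gephi gives a starting point for layout and styling.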
My first version of the graph was in a single colour, where each node is a package, and each edge is a dependency on another package. Although this graph indicates dense areas, it reveals little of the deeper structure of the network.
To examine the structure more closely, I did two things:
- Used the page.rank() algorithm to compute package importance, then changed the font size so that more "important" packages have a bigger font.
- Used the walktrap.community() algorithm to assign colours to "clusters". This algorithm uses random walks of a short length to find clusters of densely connected nodes.
The resulting image quite clearly highlights several clusters:
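As a rough sketch of these two steps, the snippet below runs both algorithms on a small made-up graph. It assumes a recent igraph release, where the functions are named `page_rank()` and `cluster_walktrap()`. The edge list and the vertex attribute names `importance` and `cluster` are hypothetical, chosen only for illustration.

```r
library(igraph)

# Hypothetical toy dependency graph with two loose groups around
# "hubA" and "hubB"; each row means "from depends on to".
deps <- data.frame(
  from = c("a1", "a2", "a3", "a2", "b1", "b2", "b3", "b2", "b1"),
  to   = c("hubA", "hubA", "hubA", "a1", "hubB", "hubB", "hubB", "b1", "hubA"),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(deps, directed = TRUE)

# PageRank: packages that many others depend on score higher.
pr <- page_rank(g)$vector

# Walktrap: short random walks (here 4 steps) tend to stay inside
# densely connected groups, which are reported as communities.
wt <- cluster_walktrap(g, steps = 4)

# Store both results as vertex attributes so that a tool such as
# Gephi could map them to font size and colour after export.
V(g)$importance <- pr
V(g)$cluster <- as.integer(membership(wt))

# Show the highest-ranked packages first.
head(sort(pr, decreasing = TRUE), 3)
```

Because the attributes live on the vertices, they travel with the graph when it is written out to GraphML.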
- MASS, in yellow. This is a large cluster of packages that includes lattice and Matrix, together with many others that seem to expose statistical functionality
- Rcpp, in light blue. Rcpp allows any package or script to call C++ code for high performance
- ggplot2, in darker blue. This cluster, sometimes called the Hadleyverse, contains packages such as plyr, dplyr and their dependencies, e.g. scales and RColorBrewer.
- sp, in green. This cluster contains a large number of packages that expose spatial statistics features, including spatstat, maps and mapproj
It turns out that Rcpp has a slightly higher page rank than MASS. This made Dirk Eddelbuettel very happy.
You can find my slides at SlideShare and my source code on GitHub.
Finally, my thanks to Gabor Csardi, maintainer of the igraph package, who listened to my ideas and gave helpful hints prior to the presentation.
Try multilevel.community(). I've found it to give better results than walktrap.community, and it runs much faster.
Posted by: zach | July 08, 2015 at 11:56
It would be nice to see whether the network structure or page ranking changes significantly now that CRAN policy requires dependencies on recommended packages to be made explicit.
Also I think we should call the upcoming red group on the left the "Jeroeniverse" :)
Posted by: MarkPJvanderLoo | July 13, 2015 at 05:36
This is very, very nice and useful. Congrats to the package creators for being generous with their time and energy to the cause of global scientific computing.
Posted by: Ajay | July 13, 2015 at 21:53