by Andrie de Vries
In a previous post I demonstrated how to use the igraph package to create a network diagram of CRAN packages and compute the page rank.
Now I extend this analysis and try to find clusters of packages that are close to one another.
Method
In graph terminology, clusters are called communities. Several community detection algorithms exist, but only some of these are suitable for directed graphs. In particular, I was only able to apply the walktrap community algorithm.
For a rather wonderful discussion of the various community detection algorithms, take a look at this StackOverflow answer by Tamás Nepusz.
From the igraph package help, the walktrap algorithm "tries to find densely connected subgraphs, also called communities in a graph via random walks. The idea is that short random walks tend to stay in the same community".
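To make this concrete, here is a minimal sketch on a toy graph rather than the CRAN network. The vertex names are made up, and I assume a recent igraph release where the function is called cluster_walktrap() (older releases call it walktrap.community()).

```r
library(igraph)

# Toy directed graph: two densely connected groups joined by a single edge.
# The vertex names are hypothetical and stand in for package names.
group_a <- c("a1", "a2", "a3", "a4")
group_b <- c("b1", "b2", "b3", "b4")
edges <- rbind(
  t(combn(group_a, 2)),  # every pair within group A
  t(combn(group_b, 2)),  # every pair within group B
  c("a4", "b1")          # one bridge between the groups
)
g <- graph_from_edgelist(edges, directed = TRUE)

# Run walktrap with random walks of length 4 (the igraph default).
wt <- cluster_walktrap(g, steps = 4)

# Community id assigned to each vertex.
membership(wt)
```

The steps argument sets the length of the random walks. Trying a few values is a cheap way to see how stable the communities are.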
Clusters and size
The walktrap algorithm finds more than 25 clusters. The largest has more than 1,350 packages. Each of the top 10 largest clusters has more than 50 members.
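Counting the clusters and their sizes might look like the sketch below. The graph is a random stand-in generated with sample_pa(), not the CRAN dependency graph, so the numbers it prints will not match the ones above.

```r
library(igraph)

set.seed(42)

# Hypothetical stand-in for the package dependency graph:
# a random directed graph with 200 vertices.
g <- sample_pa(200, directed = TRUE)

wt <- cluster_walktrap(g)

# Number of communities found.
length(wt)

# Community sizes, largest first; show the top 10.
cluster_sizes <- sort(sizes(wt), decreasing = TRUE)
head(cluster_sizes, 10)
```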
Results
With clusters this large, it's quite brazen (and possibly just wrong) to try to interpret the clusters for meaning. To help with this task, I sorted the cluster members according to their pagerank. This means the packages with the highest pagerank are at the top of each list. The package names are the actual packages in each cluster, but the heading of each column is my own interpretation.
Foundations | Hadleyverse | Interface | Statistical learning | Spatial |
MASS | ggplot2 | XML | cluster | sp |
Matrix | plyr | RCurl | glmnet | raster |
lattice | reshape2 | stringr | nnet | maptools |
mvtnorm | foreach | httr | rpart | spatstat |
survival | shiny | RJSONIO | randomForest | rgdal |
igraph | scales | rjson | pls | rgeos |
nlme | reshape | data.table | class | matrixStats |
coda | digest | lubridate | slam | splancs |
boot | DBI | jsonlite | survey | plotKML |
rgl | dplyr | knitr | clue | Deducer |
Hmisc | iterators | assertthat | Rglpk | gstat |
numDeriv | doParallel | spocc | mice | RandomFields |
mgcv | gWidgetsRGtk2 | bitops | mboost | ecospat |
RColorBrewer | gridExtra | taxize | caret | pedometrics |
car | png | devtools | betareg | geoR |
gtools | gWidgets | rplos | gbm | GSIF |
xtable | RGtk2 | rgbif | bootfs | biomod2 |
lme4 | RSQLite | rnoaa | party | classInt |
fields | proto | Causata | modeltools | wux |
e1071 | opm | markdown | rminer | gfcanalysis |
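The ranking behind lists like these can be sketched as follows: compute the pagerank of every vertex, split the scores by community, and sort each group in decreasing order. This is an illustration on a toy graph with made-up vertex names, assuming a recent igraph release. It is not the original analysis.

```r
library(igraph)

# Toy directed graph with hypothetical vertex names (not the CRAN network).
group_a <- c("a1", "a2", "a3", "a4")
group_b <- c("b1", "b2", "b3", "b4")
edges <- rbind(t(combn(group_a, 2)), t(combn(group_b, 2)), c("a4", "b1"))
g <- graph_from_edgelist(edges, directed = TRUE)

# Pagerank score per vertex, named by vertex.
pr <- page_rank(g)$vector

# Community id per vertex, in the same vertex order as pr.
wt <- cluster_walktrap(g)
cluster_id <- as.vector(membership(wt))

# Group the scores by community, then order each group by decreasing pagerank.
by_cluster <- split(pr, cluster_id)
ranked <- lapply(by_cluster, function(x) names(sort(x, decreasing = TRUE)))
ranked
```

Each element of ranked is one cluster, with its highest-pagerank member first.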
The code
Here is the code. Perhaps you want to reproduce this and provide a different interpretation of the clusters.