by Joseph Rickert
In a recent post on creating JavaScript network graphs directly from R, my colleague and fellow blogger, Andrie de Vries, included a link to a saved graph of CRAN. Here, I will use that same graph (network) to build a simple exponential random graph model using functions from the igraph package, and the network and ergm packages included in the statnet suite of R packages. Each node (vertex) in the saved graph represents a package on CRAN, and a directed link or edge between two nodes A -> B indicates that package A depends on package B. Since Andrie's CRAN graph does not have any external attributes associated with either the nodes or edges, the idea is to see if we can develop a model using only structural aspects of the network itself as predictors. In general, this is not an easy thing to do and we are only going to have limited success here. However, the process will illustrate some basic concepts.
A fundamental statistic associated with each node in a network is its degree, the number of edges that connect to it. (Each edge connects a node to some other node in the network forming a tie.) The degree distribution (distribution of the degree associated with each node of the network) is one way to think about the local connectivity structure of the network. The following plots illustrate different aspects CRAN's degree distribution.
The top left histogram shows that although most nodes have degree less than 5, there is a long tail with some nodes having hundreds of incident edges. (Of the 6,867 nodes in the network 1,156 have degree greater than 5). Additionally, we have:
summary(cran_deg)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.000 1.000 2.000 4.296 4.000 764.000
The bottom two histograms provide greater detail for the beginning and tail-end of the degree distribution.
The purpose of the fourth plot in the panel, the log-log plot, is to look for evidence that the CRAN network may exhibit the kind of "scale free" structure that is common to social network graphs. (A straight line would provide some evidence that the network follows a power law distribution.) That is not quite what is going on here. Nevertheless, the structure exhibited in the degree distribution and the clustering clearly visible in CRAN package network plots indicate that we are not dealing with a purely random graph either.
The following shows the code to specify and fit an ergm model to the data comprising the CRAN network with just two predictors edges and degree(c(1,2)).
# Model edges form <- formula(cran_net ~ edges + degree(c(1,2))) summary.statistics(form) # edges degree1 degree2 # 14293 1400 1119 # Fit ergm fit <- ergm(form,control=control.ergm(MCMLE.maxit = 60)) summary.ergm(fit) # ========================== # Summary of model fit # ========================== # # Formula: cran_net ~ edges + degree(c(1, 2)) # # Iterations: 22 out of 60 # # Monte Carlo MLE Results: # Estimate Std. Error MCMC % p-value # edges -6.89814 0.01124 1 <1e-04 *** # degree1 2.46835 0.04519 0 <1e-04 *** # degree2 1.25087 0.03884 0 <1e-04 *** # --- # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 # # Null Deviance: 32681073 on 23574411 degrees of freedom # Residual Deviance: 238828 on 23574408 degrees of freedom # # AIC: 238834 BIC: 238879 (Smaller is better.) #
Fitting an ergm with just edges as a predictor would be an attempt to fit a random graph to the data. (See Hunter et al. for details.) Adding the term degree(c(1,2) attempts to control for the nodes of degrees 1 and 2 which appear to deviate from the straight line in the log-log plot. The small p values associated with the coefficients and the fact that both the AIC and BIC are slightly smaller than the those resulting from a fit to a random model (AIC = 240349, BIC = 240364) indicate that the simple model has some explanatory power. But how good is the fit?
The ergm package provides a method of assessing goodness of fit through the computationally intensive scheme of simulating networks from the model and comparing statistics from the simulated networks with those calculated from the actual network. The following plot illustrates this comparison for the degree distribution, edge-wise shared partners distribution and the proportion of dyads at values of minimum geodesic distance. The fit for the first two distributions appears to be quite good, and the fit for the third distribution, while far from perfect, is not bad either.
(in the plots, the solid black lines represent the statistics computed from the GRAN graph, the light lines with box plots are the simulated data.)
The take away here is that a very simple model appears to have a surprising good fit. Constructing a better fitting and more interesting model would need to take into account characteristics of the package design and construction process. These features would then be used as additional predictors. I think a hint as to how one might go about this is contained in Andrie's observation that a many of the packages of very high degree such as Rcpp, MASS, ggplot2 etc. are concerned with providing tools or reusable infrastructure. Exploring this idea, however, and modeling other package networks such as Bioconductor are subjects for future work.
The code used for this post may be obtained here: Download CRAN_ergm_blog
To go further into ergm modeling with statnet and R have a look at benjamin's most informative and entertaining analysis of Grey's Anatomy "hook-ups".
Comments
You can follow this conversation by subscribing to the comment feed for this post.