by Andrie de Vries
This week at JSM2015, the annual conference of the American Statistical Association, Joseph Rickert and I gave a presentation on the topic of "The network structure of CRAN and BioConductor" (link to abstract).
Our work tested whether one can detect statistical differences in the network graphs formed by the dependencies between packages. In the dependency graph, each package is a vertex and each dependency is an edge connecting two vertices.
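As a minimal sketch of that definition, the snippet below builds a vertex set and an edge set from a package-to-dependencies mapping. The package names and the `build_graph` helper are hypothetical and purely illustrative; they are not real CRAN metadata, and the analysis described in this post used the igraph package in R rather than this code.

```python
# Hypothetical toy data: package -> packages it depends on.
# These names are made up for illustration, not real CRAN metadata.
dependencies = {
    "pkgA": ["pkgB", "pkgC"],
    "pkgB": ["pkgC"],
    "pkgC": [],
    "pkgD": [],  # no dependencies and nothing depends on it
}

def build_graph(deps):
    """Return (vertices, edges) for a directed dependency graph."""
    vertices = set(deps)
    edges = set()
    for pkg, required in deps.items():
        for dep in required:
            vertices.add(dep)      # a dependency is a vertex too
            edges.add((pkg, dep))  # edge points from a package to its dependency
    return vertices, edges

vertices, edges = build_graph(dependencies)
print(sorted(vertices))
print(sorted(edges))
```

Here `pkgD` ends up as a vertex with no edges, which is the kind of unconnected package that matters for the degree distribution discussed later.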
Building on previous work
This presentation combines earlier work that we have discussed in blog posts during the year:
- The network structure of CRAN
- A simple statnet model of CRAN
- Finding the essential R packages using the pagerank algorithm
- Contracting and simplifying a network graph
- Finding clusters of CRAN packages using igraph
Before starting the work, we formed a hypothesis that CRAN and BioConductor have discernibly different package network structures.
This hypothesis is based on the intuition that these two repositories have different management structures:
- On CRAN, packages of almost any type are welcome. The CRAN maintainers have some strict policies on how a package should behave to get on CRAN (have documentation, have examples, build without warnings, etc.). However, CRAN does not prescribe anything about the subject matter or content of any package.
- In contrast, BioConductor is more focused and centrally managed. Packages must add something to the topic of high-throughput genomic data. For a great introduction, read Peter Hickey's contributed blog post, A Short Introduction to Bioconductor.
What we found
We used the igraph package to compute descriptive network statistics. Among these, we found the clustering coefficient and the degree distribution most illuminating.
First, we found that BioConductor has a higher clustering coefficient than CRAN. The clustering coefficient (also called transitivity) measures the probability that two vertices adjacent to the same vertex are themselves connected.
You can see this visually in the network graphs. It appears as if the BioConductor graph is more compact, while the CRAN graph has many packages on the perimeter that are only loosely connected to the rest of the graph.
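To make the clustering coefficient concrete, here is a small sketch that computes the global transitivity of an undirected graph: the number of closed triples divided by the number of connected triples. The `transitivity` function and the four-vertex toy graph are assumptions for illustration only; the results in this post came from igraph, not from this code.

```python
from itertools import combinations

def transitivity(adj):
    """Global clustering coefficient of an undirected graph.

    adj maps each vertex to the set of its neighbours.
    """
    closed = 0   # pairs of neighbours that are themselves connected
    triples = 0  # all pairs of neighbours around a centre vertex
    for v, nbrs in adj.items():
        for a, b in combinations(sorted(nbrs), 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else float("nan")

# Toy graph: a triangle a-b-c with one extra vertex d attached to c.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(transitivity(toy))
```

The loosely attached vertex `d` adds connected triples without adding triangles, so it pulls the coefficient below 1. Many such perimeter packages would have the same effect on a repository graph.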
We used a simple bootstrapping algorithm to simulate the local clustering coefficient of induced subgraphs. In this plot, the distribution of clustering coefficients for CRAN (in red) lies well below that for BioConductor (in blue).
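One way such a simulation could look is sketched below. It repeatedly draws a random subset of vertices, takes the induced subgraph, and records the mean local clustering coefficient. Several choices here are assumptions rather than the method used in the talk: vertices are sampled without replacement, vertices with fewer than two neighbours are skipped when averaging, and the sample size, number of replicates, and random toy graph are arbitrary.

```python
import random
from itertools import combinations

def induced_subgraph(adj, keep):
    """Keep only the chosen vertices and the edges between them."""
    keep = set(keep)
    return {v: adj[v] & keep for v in keep}

def mean_local_clustering(adj):
    """Average local clustering coefficient over vertices with degree >= 2."""
    values = []
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # undefined for fewer than two neighbours; skipped (assumption)
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        values.append(links / (k * (k - 1) / 2))
    return sum(values) / len(values) if values else float("nan")

def bootstrap_clustering(adj, sample_size, replicates, seed=0):
    """Clustering coefficients of randomly sampled induced subgraphs."""
    rng = random.Random(seed)
    vertices = sorted(adj)
    results = []
    for _ in range(replicates):
        keep = rng.sample(vertices, sample_size)  # without replacement
        results.append(mean_local_clustering(induced_subgraph(adj, keep)))
    return results

def random_graph(n, p, seed=0):
    """Toy undirected graph: each possible edge is present with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for a, b in combinations(range(n), 2):
        if rng.random() < p:
            adj[a].add(b)
            adj[b].add(a)
    return adj

toy = random_graph(60, 0.15, seed=1)
estimates = bootstrap_clustering(toy, sample_size=30, replicates=200, seed=2)
defined = [x for x in estimates if x == x]  # drop NaN replicates, if any
print(len(defined))
```

Running the same procedure on two graphs gives two sets of estimates whose distributions can then be compared, for example with overlaid density plots.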
The second statistical summary is the degree distribution. The degree of a node is the number of edges incident to it. Note in particular the part of the degree distribution at degree zero, i.e. unconnected nodes.
BioConductor has a much lower fraction of packages with zero connections. It seems that the BioConductor policy encourages package authors to re-use existing material and write packages that work better together.
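The fraction of unconnected packages can be read directly from the degree distribution. The sketch below counts total degree (incoming plus outgoing edges) for each vertex and reports the share of vertices at each degree. The `degree_distribution` helper and the five hypothetical packages are illustrative assumptions, not data from either repository.

```python
from collections import Counter

def degree_distribution(vertices, edges):
    """Map each observed degree to the fraction of vertices with that degree.

    Direction is ignored: an edge adds one to the degree of both endpoints.
    """
    degree = {v: 0 for v in vertices}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    counts = Counter(degree.values())
    n = len(degree)
    return {d: counts[d] / n for d in sorted(counts)}

# Hypothetical toy repository: three connected packages, two isolated ones.
toy_vertices = ["pkgA", "pkgB", "pkgC", "pkgD", "pkgE"]
toy_edges = [("pkgA", "pkgB"), ("pkgA", "pkgC"), ("pkgB", "pkgC")]

dist = degree_distribution(toy_vertices, toy_edges)
print(dist)
print("fraction with degree zero:", dist.get(0, 0.0))
```

Comparing `dist.get(0, 0.0)` between two repositories is the comparison described above: a lower value means fewer packages stand entirely alone.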
The presentation is available on slideshare.
The scripts we used are available on GitHub. We think this is an important topic to study, since it could help to discover:
- Better search algorithms for finding packages that are useful to solve a specific problem
- Recommendations for packages to use