by Andrie de Vries
A few weeks ago I wrote about the growth of CRAN packages, where I demonstrated how to scrape CRAN archives to get an estimate of the number of packages over time. In that post I briefly mentioned that the Ecdat package contains a dataset, CRANpackages, with snapshots recorded by John Fox and Spencer Graves.
Here is a plot of the data they collected. The dataset contains data through 2014, so I manually added the package count as of today (8,329).
Is there a decline in the rate of growth?
In my previous post, I asked the question: "are there indications that the contribution rate is steady, accelerating or decelerating?"
This alludes to an analysis by John Fox, in which he says "The number of packages on CRAN ... has grown roughly exponentially, with residuals from the exponential trend ... showing a recent decline in the rate of growth" (Fox, 2009).
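To make the idea of "residuals from the exponential trend" concrete, here is a minimal sketch in base R. The counts are made up for illustration (they are not the CRAN data), and the growth rates of 40% and 20% per year are assumptions chosen only to make the pattern visible.

```r
# Synthetic yearly counts (made up, not CRAN data):
# about 40% growth per year for ten years, then about 20% per year.
year  <- 0:15
rate  <- ifelse(year < 10, 0.40, 0.20)
count <- 100 * cumprod(c(1, 1 + rate[-length(rate)]))

# An exponential trend is a straight line on the log scale.
trend <- lm(log(count) ~ year)

# Residuals from that trend. Negative values at both ends with positive
# values in the middle mean the series bends below a single exponential,
# i.e. the growth rate has declined.
res <- unname(residuals(trend))
round(res, 2)
```

If growth were exactly exponential, the residuals would scatter around zero with no systematic shape.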
Segmented regression
In my previous post Using segmented regression to analyse world record running times I used segmented regression to estimate a model that is piece-wise linear.
I used the same process to fit a segmented regression line through the CRAN package data.
By default the segmented package fits a single break point. The results of this analysis indicate a break point occurring some time during 2008. This is entirely consistent with the observation by John Fox that the rate of growth is slowing down.
However, note that the segmented regression line doesn't fit the data very well during the period 2008 to 2012.
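To illustrate what a single break point means, here is a brute-force sketch in base R. It is not the algorithm the segmented package uses; it simply tries a grid of candidate break points and keeps the one with the smallest residual sum of squares. The data are synthetic and noise-free, and the names `rss_at`, `candidates` and `best_psi` are my own.

```r
# Synthetic data (not CRAN data): a straight line whose slope drops at x = 8.
x <- 0:15
y <- 5 + 0.35 * x - 0.15 * pmax(x - 8, 0)

# Residual sum of squares of a piecewise-linear fit with a break at psi.
# The term pmax(x - psi, 0) lets the slope change once x passes psi.
rss_at <- function(psi) {
  fit <- lm(y ~ x + pmax(x - psi, 0))
  sum(residuals(fit)^2)
}

# Try candidate break points strictly inside the range of the data.
candidates <- seq(2, 13, by = 0.5)
rss <- sapply(candidates, rss_at)
best_psi <- candidates[which.min(rss)]
best_psi
```

With real, noisy data the minimum is much less sharp, which is one reason to look at the fitted line as well as the estimated break point.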
With a small amount of extra work you can fit segmented models with multiple break points. To do this, you simply have to specify initial values for the search. Here I show the results of a simple model with two break points. This model finds the first break point during 2007 and the second break point during 2011.
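As a rough sketch of how starting values are supplied, the example below fits two break points to synthetic data, assuming the segmented package is installed. The data, the true break points (7 and 11) and the starting values (5 and 12) are all made up for illustration; this is not the model fitted to the CRAN data.

```r
library(segmented)  # assumed to be installed from CRAN

# Synthetic data (not CRAN data): slope changes at x = 7 and x = 11.
set.seed(1)
x <- seq(0, 15, by = 0.25)
y <- 5 + 0.40 * x - 0.15 * pmax(x - 7, 0) - 0.10 * pmax(x - 11, 0) +
  rnorm(length(x), sd = 0.02)
dat <- data.frame(x = x, y = y)

# Start from an ordinary linear model ...
base_fit <- lm(y ~ x, data = dat)

# ... then ask for two break points in x by giving two starting values.
two_breaks <- segmented(base_fit, seg.Z = ~x, psi = list(x = c(5, 12)))

# Matrix with the starting values, the estimates and their standard errors.
two_breaks$psi
```

The starting values only need to be in the right neighbourhood; the search moves them to the estimated break points.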
Conclusion
Natural systems cannot maintain exponential growth forever. There are always some limits on the system that will ultimately inhibit any further growth. This is why many systems display some kind of sigmoid curve, or S curve.
Although the growth curve of CRAN packages shows signs of slowing down, it does not seem as if there is an inflexion point in the data. An inflexion point is where the curve transitions from being convex to being concave.
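For readers who want to see what an inflexion point looks like numerically, here is a small base R sketch. It uses a logistic curve as a stand-in for an S curve (an assumption for illustration, not a model of CRAN) and looks for the place where the second differences change sign.

```r
# A logistic (S-shaped) curve as a stand-in for saturating growth.
x <- seq(-5.25, 5.25, by = 0.5)
y <- plogis(x)

# Second differences approximate the curvature at the interior grid points.
d2 <- diff(y, differences = 2)
centres <- x[2:(length(x) - 1)]

# The curvature is positive (convex) before the inflexion point and
# negative (concave) after it; find where the sign flips.
flip <- which(diff(sign(d2)) != 0)

# The inflexion point lies between these two grid points.
c(centres[flip], centres[flip + 1])
```

A purely exponential curve is convex everywhere, so its second differences never change sign.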
Thus it seems the growth of CRAN packages will appear to be exponential for quite some time in the future!
The code
As usual, here is the R code I used.
Is it reasonable to assume that the log-number of CRAN packages is trend stationary? Wouldn't it be more sensible to assume that the process is integrated and model the growth rates directly? These are more likely to be level stationary. Thus:
This also shows some decline in the growth rate from around 0.1% to 0.06% per day - but it also exposes the large variance. Neither a linear model nor a single structural break would appear to be significant here. Possibly a higher sampling frequency could help to work out the differences between these possible data-generating processes.
Posted by: Achim Zeileis | April 28, 2016 at 14:31
But there are 10,000 R packages on github
http://rpkg.gepuro.net/
So growth is probably only increasing
Posted by: Alex | May 02, 2016 at 17:24
@Alex nice finding but it lists < 50% of my pkgs on github, so 10K isn't very accurate I think. Decentralized index for R packages would be a nice solution for that.
Posted by: jangorecki | May 05, 2016 at 14:03