by Joseph Rickert

In a recent post I talked about the information that can be developed by fitting a Tweedie GLM to a 143 million record version of the airlines data set. Since I started working with them about a year ago, I now see Tweedie models everywhere. Basically, any time I come across a histogram that looks like it might be a sample from a gamma distribution except for a big spike at zero, I see a candidate for a Tweedie model. (Having a Tweedie hammer makes lots of things look like Tweedie nails.) Nevertheless, it seems lots of people are seeing Tweedie these days: even the scholarly citations for Maurice Tweedie's original paper are up.

Tweedie distributions are a subset of what are called Exponential Dispersion Models. EDMs are two-parameter distributions from the linear exponential family that also have a dispersion parameter f. Statistician Bent Jørgensen solidified the concept of EDMs in a 1987 paper, and named the following class of EDMs after Tweedie.

An EDM random variable Y follows a Tweedie distribution if

var(Y) = f * V(m)

where m is the mean of the distribution, f is the dispersion parameter, V is a function describing the mean/variance relationship of the distribution, and p is a constant such that:

V(m) = m^p

Some very familiar distributions fall into the Tweedie family. Setting p = 0 gives a normal distribution. p = 1 is Poisson. p = 2 gives a gamma distribution and p = 3 yields an inverse Gaussian. However, much of the action for fitting Tweedie GLMs is for values of p between 1 and 2. In this interval, closed form distribution functions don’t exist, but as it turns out, Tweedies in this interval are compound Poisson distributions. (A compound Poisson random variable Y is the sum of N independent gamma random variables where N follows a Poisson distribution and N and the gamma random variates are independent.)

This last fact helps to explain why Tweedies are so popular. For example, one might model the insurance claims for a customer as a series of independent gamma random variables and the number of claims in some time interval as a Poisson random variable. Or, the gamma random variables could be models for precipitation, and the total rainfall resulting from N rainstorms would follow a Tweedie distribution. The possibilities are endless.
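The compound Poisson construction is easy to simulate directly. Here is a minimal base-R sketch (the function name rcompois and the parameter values are illustrative, not from the post):

```r
# Simulate Y = sum of N iid gamma variates, with N ~ Poisson(lambda).
# For 1 < p < 2 a Tweedie random variable has exactly this form,
# including the point mass at zero that occurs whenever N = 0.
rcompois <- function(nsim, lambda, shape, rate) {
  sapply(seq_len(nsim), function(i) {
    N <- rpois(1, lambda)
    if (N == 0) 0 else sum(rgamma(N, shape = shape, rate = rate))
  })
}

set.seed(42)
y <- rcompois(10000, lambda = 2, shape = 3, rate = 1)
mean(y == 0)          # proportion of exact zeros, close to exp(-2)
hist(y, breaks = 50)  # gamma-like shape with a spike at zero
```

A histogram of y shows exactly the pattern described above: a gamma-like body with a spike at zero.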

R has quite a few resources for working with Tweedie models. Here are just a few. You can fit a Tweedie GLM with the tweedie family function in the statmod package.

# Fit an inverse-Gaussian glm with log link

glm(y~x,family=tweedie(var.power=3,link.power=0))

The tweedie package has several interesting functions for working with Tweedie models, including a function to generate random samples. The following graph shows four different Tweedie histograms as the power parameter moves from 1.2 to 1.9.

It is apparent that increasing the power shifts mass away from zero towards the right.

(The code for producing these plots, which includes some nice code from Stephen Turner for putting all four ggplots on a single graph, is available. Download Plot_tweedie)

Package poistweedie also provides functions for simulating Tweedie models. Package HDtweedie implements an iteratively reweighted least squares algorithm for computing solution paths for grouped lasso and grouped elastic net Tweedie models. And, package cplm provides both likelihood and Bayesian functions for working with compound Poisson models. Be sure to have a look at the vignette for this package to see compound Poisson distributions in action.

Finally, two very readable references for both the math underlying Tweedie models and the algorithms to compute them are a couple of papers by Dunn and Smyth: here and here.

by Daniel Hanson, with contributions by Steve Su (author of the GLDEX package). Part 1 of a series.

As most readers are well aware, market return data tends to have heavier tails than a normal distribution can capture, and the normal distribution cannot capture skewness either. For this reason, a four parameter distribution such as the Generalized Lambda Distribution (GLD) can give us a more realistic representation of the behavior of market returns, including a more accurate measure of expected loss in risk management applications than the normal distribution provides.

This is not to say that the normal distribution should be thrown in the dustbin, as the underlying stochastic calculus, based on Brownian Motion, remains a very convenient tool in modeling derivatives pricing and risk exposures (see earlier blog article here), but like all modeling methods, it has its strengths and weaknesses.

As noted in the book Financial Risk Modelling and Portfolio Optimization with R (Pfaff, Ch 6: Suitable distributions for returns) (publisher information provided here), the GLD is one of the recommended distributions to consider in order “to model not just the tail behavior of the losses, but the entire return distribution. This need arises when, for example, returns have to be sampled for Monte Carlo type applications.” The author provides descriptions and examples of several R packages freely available on the CRAN website, namely Davies, fBasics, gld, and lmomco. Another package, also freely available on CRAN, is the GLDEX package, which is the package we will use in the current article. It contains a rich offering of functions and is well documented. In addition, the author of the GLDEX package, Dr Steve Su, has kindly provided assistance in the writing of this article. He has also published a very useful and related article in the Journal of Statistical Software (JSS) (2007), to which we will refer in the discussion below.

The four parameters of the GLD are, not surprisingly, λ1, λ2, λ3, and λ4. Without going into theoretical details, suffice it to say that λ1 and λ2 are measures of location and scale respectively, while the skewness and kurtosis of the distribution are determined by λ3 and λ4.

Furthermore, there are two forms of the GLD that are implemented in GLDEX, namely those of Ramberg and Schmeiser (1974), and Freimer, Mudholkar, Kollia, and Lin (1988). These are commonly abbreviated as RS and FMKL. As the FMKL form is the more modern of the two, we will focus on it in the discussion that follows. An additional reference frequently cited in the literature related to the GLD in finance is the paper by Chalabi, Scott, and Wurtz, freely available here on the rmetrics website.

As Steve Su points out in his 2007 JSS article on the GLDEX package (see link above), there are three basic steps that are useful in determining the quality of the GLD fit. The first two, as we shall see, can be competing objectives in determining the fit. The GLDEX package provides functionality for each.

- Comparing the mean, variance, skewness and kurtosis of the fitted distribution with the empirical data
- The Kolmogorov-Smirnov (KS) resample test and goodness of fit
- Graphical outputs

Remark: The list of options has been presented here in opposite order of that in the JSS article in order to assist in the development of the discussion, as we shall see.

*Market Returns Data*

Let’s first obtain some market data to use. The Wilshire 5000 index is commonly used as a measure of the total US equity market -- comprising large, medium, and small cap stocks -- so we call once again upon our old friend the quantmod package to access the past 20 years of daily closing prices of the Vanguard Total Stock Market Index Fund (VTSMX).

require(quantmod) # quantmod package must be installed

getSymbols("VTSMX", from = "1994-09-01")

VTSMX.Close <- VTSMX[,4] # Closing prices

VTSMX.vector <- as.vector(VTSMX.Close)

# Calculate log returns

Wilsh5000 <- diff(log(VTSMX.vector), lag = 1)

Wilsh5000 <- 100 * Wilsh5000[-1] # Remove the NA in the first position,

# and put in percent format

*Method of Moments*

Appealing to step (1) above, the following function uses the FMKL form to fit the data to a GLD with the method of moments,

wshLambdaMM <- fun.RMFMKL.mm(Wilsh5000)

This returns the estimated values of λ1, λ2, λ3, and λ4 in the vector wshLambdaMM :

[1] 0.04882924 1.98442097 -0.16423899 -0.13470102

Remark: Warning messages such as the following may occur when running this function:

Warning messages:

1: In beta(a, b) : NaNs produced

2: In beta(a, b) : NaNs produced

…

These may be ignored.

We can then compare the four moments of the fitted distribution with those of the market data using the following functions respectively:

# Moments of fitted distribution:

fun.theo.mv.gld(wshLambdaMM[1], wshLambdaMM[2], wshLambdaMM[3], wshLambdaMM[4],

param = "fmkl", normalise="Y")

# Results:

# mean variance skewness kurtosis

# 0.02824672 1.50214919 -0.30413445 7.8210743

# Moments of Wilshire 5000 market returns:

fun.moments.r(Wilsh5000,normalise="Y") # normalise="Y" -- subtracts the 3

# from the normal dist value.

# Results:

# mean variance skewness kurtosis

# 0.02824676 1.50214916 -0.30413445 7.82107430

We’re basically spot-on here, and things are looking pretty good; however, we haven’t looked at a goodness of fit test yet, and unfortunately, this will tell a different story. We will first look at the Kolmogorov-Smirnov (KS) resample test, as shown in the 2007 JSS article. The test is based on the Kolmogorov-Smirnov distance (D) between the data in the sample and the fitted distribution. The null hypothesis is, simply speaking, that the sample data is drawn from the same distribution as the fitted distribution.
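As a generic illustration of how a KS goodness-of-fit test behaves, here is a small sketch using base R's ks.test against a normal distribution (not the GLDEX functions used below; the sample and parameters are illustrative):

```r
set.seed(123)
x <- rnorm(500)  # a sample actually drawn from N(0,1)

# Null hypothesis holds: D is small and the p-value is typically large
ks.test(x, "pnorm")

# Null hypothesis fails: testing the same sample against a shifted normal
# gives a large D and a tiny p-value
ks.test(x, "pnorm", mean = 3)
```

The same logic applies below, with the GLD distribution function in place of pnorm.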

The function here, from the GLDEX package, samples a proportion (default = 90%) of the data and fitted distribution and calculates the KS test p-value 1000 times (no.test argument), and returns the number of times that the p-value is not significant. The higher the number, the more confident we can be that the fitted distribution is reasonable.

fun.diag.ks.g(result = wshLambdaMM, data = Wilsh5000, no.test = 1000,

param = "fmkl")

Our result here is 53/1000, which suggests that the fit is pretty far off. A more recent addition to the GLDEX package, which was not available at the time the 2007 JSS article was written, is the following:

ks.gof(Wilsh5000, "pgl", wshLambdaMM, param="fmkl")

where pgl is the GLD distribution function included in the GLDEX package (the analog of the pnorm normal distribution function included in Base R).

With a p-value of 1.912e-05, it is pretty safe to reject the hypothesis that the sample data is drawn from the fitted distribution.

*Method of Maximum Likelihood (ML)*

As Steve Su points out in his JSS article, “The maximum likelihood estimation is usually the preferred method” for “providing definite fits to a data set using the GLD”. The function in the GLDEX package, again for the FMKL parameterization, is

wshLambdaML <- fun.RMFMKL.ml(Wilsh5000)

Checking our goodness of fit tests,

fun.diag.ks.g(result = wshLambdaML, data = Wilsh5000, no.test = 1000, param = "fmkl")

We get a result of 825/1000, and for

ks.gof(Wilsh5000,"pgl", wshLambdaML, param="fmkl")

we get D = 0.0151, p-value = 0.2

This p-value, while not spectacular, is far better than what we saw for the method of moments case, and the KS resample test is also much more convincing. But now, the “bad news”: if we look at the four moments of the ML fit,

fun.theo.mv.gld(wshLambdaML[1], wshLambdaML[2], wshLambdaML[3],

wshLambdaML[4], param = "fmkl", normalise="Y")

we get

# mean variance skewness kurtosis

# 0.02850058 1.64456695 -2.13494680 NA

While the mean and variance are reasonably close to their empirical counterparts, skewness is off by about 60%, and kurtosis can’t be determined by the algorithm.

*Graphical Comparison of Method of Moments and Maximum Likelihood*

Now, invoking step 3, let’s compare the plots resulting from the two different methods, using the fun.plot.fit(.) function provided in the GLDEX package, by overlaying the pdf curve of the fitted distribution on top of the histogram of the returns data. In order to assure a meaningful plot, however, we should first determine the optimal number of bins in the histogram using the Freedman-Diaconis Rule with the following R function:

bins <- nclass.FD(Wilsh5000) # We get 158

Then, pass nclass = bins to the plotting function in the GLDEX package:

# Method of Moments

fun.plot.fit(fit.obj = wshLambdaMM, data = Wilsh5000, nclass = bins,

param = "fmkl",

xlab = "Returns", main = "Method of Moments Fit")

# Method of Maximum Likelihood

fun.plot.fit(fit.obj = wshLambdaML, data = Wilsh5000,

nclass = bins, param = "fmkl",

xlab = "Returns", main = "Method of Maximum Likelihood")

Visual inspection of the plots is consistent with our findings above: the method of maximum likelihood results in a better fit of the data than the method of moments, despite the fact that the moments line up almost exactly in the case of the latter.

One more set of plots that one should inspect is the set of quantile (“QQ”) plots:

qqplot.gld(fit=wshLambdaMM,data=Wilsh5000,param="fkml", type = "str.qqplot",

main = "Method of Moments")

qqplot.gld(fit=wshLambdaML,data=Wilsh5000,param="fkml", type = "str.qqplot",

main = "Method of Maximum Likelihood")

Now, if we were to look at these two plots in a vacuum, so to speak, with none of the other prior information available, there is a good case to be made that the QQ plot for the Method of Moments indicates a better fit. However, note that at about -4 along the horizontal (empirical data) axis, the plotted points start to drift above the line indicating where the horizontal and vertical axis values are equal. This implies that our fit is underestimating market losses as we move out toward the left tail of the distribution. The QQ plot for Maximum Likelihood is more conservative, erring on the side of caution, with the fitted distribution indicating a greater risk of loss than the Method of Moments fit. As Steve Su puts it, the general recommendation is to look at the QQ plot and KS test results together to determine the goodness of fit. The QQ plot alone is not a foolproof method.

**Conclusion**

We have seen, using R and the GLDEX package, how a four parameter distribution such as the Generalized Lambda Distribution can be used to fit a distribution to market data. While the Method of Moments as a fitting algorithm is highly appealing due to its preserving the moments of the empirical distribution, we sacrifice goodness of fit that can be obtained by using the Method of Maximum Likelihood.

In our next article, we will look at an alternative GLD fitting method known as the Method of L-Moments as a compromise between the two methods discussed here, and then conclude with a comparison with the normal distribution, which will exhibit quite clearly the advantages of the GLD when it comes to fitting financial returns data.

*Very special thanks are due to Dr Steve Su for his contributions and guidance in presenting this topic.*

Hadley Wickham's dplyr package is a great toolkit for getting data ready for analysis in R. If you haven't yet taken the plunge to using dplyr, Kevin Markham has put together a great hands-on video tutorial for his Data School blog, which you can see below. The video covers the five main data-manipulation "verbs" that dplyr provides: filter, select, arrange, mutate and summarise/group_by. (It also introduces the glimpse function, a handy alternative to str, that I had overlooked before.)

The video also provides an introduction to the %>% ("then") operator from magrittr, which you'll likely find useful for many other applications in addition to dplyr. Also, Kevin's video works from an Rmarkdown script to show how dplyr works, and so serves as a mini-tutorial for Rmarkdown as well. It's well worth 40 minutes of your time. Also, check out Kevin's blog post linked below for links to many other useful dplyr resources.

Data School: Hands-on dplyr tutorial for faster data manipulation in R (via Peter Aldhous)

by Joseph Rickert

While preparing for the DataWeek R Bootcamp that I conducted this week I came across the following gem. This code, based directly on a Max Kuhn presentation of a couple years back, compares the efficacy of two machine learning models on a training data set.

#-----------------------------------------
# SET UP THE PARAMETER SPACE SEARCH GRID
ctrl <- trainControl(method="repeatedcv",   # use repeated 10-fold cross-validation
                     repeats=5,             # do 5 repetitions of 10-fold cv
                     summaryFunction=twoClassSummary, # Use AUC to pick the best model
                     classProbs=TRUE)
# Note that the default search grid selects 3 values of each tuning parameter
#
grid <- expand.grid(.interaction.depth = seq(1,7,by=2), # look at tree depths from 1 to 7
                    .n.trees=seq(10,100,by=5),          # let iterations go from 10 to 100
                    .shrinkage=c(0.01,0.1))             # Try 2 values of the learning rate parameter

# BOOSTED TREE MODEL
set.seed(1)
names(trainData)
trainX <- trainData[,4:61]
registerDoParallel(4)   # Register a parallel backend for train
getDoParWorkers()
system.time(gbm.tune <- train(x=trainX, y=trainData$Class,
                              method = "gbm",
                              metric = "ROC",
                              trControl = ctrl,
                              tuneGrid=grid,
                              verbose=FALSE))

#---------------------------------
# SUPPORT VECTOR MACHINE MODEL
#
set.seed(1)
registerDoParallel(4,cores=4)
getDoParWorkers()
system.time(
  svm.tune <- train(x=trainX, y=trainData$Class,
                    method = "svmRadial",
                    tuneLength = 9,                 # 9 values of the cost function
                    preProc = c("center","scale"),
                    metric="ROC",
                    trControl=ctrl)                 # same as for gbm above
)

#-----------------------------------
# COMPARE MODELS USING RESAMPLING
# Having set the seed to 1 before running gbm.tune and svm.tune we have generated
# paired samples for comparing models using resampling.
#
# The resamples function in caret collates the resampling results from the two models
rValues <- resamples(list(svm=svm.tune, gbm=gbm.tune))
rValues$values

#---------------------------------------------
# BOXPLOTS COMPARING RESULTS
bwplot(rValues, metric="ROC")  # boxplot

After setting up a grid to search the parameter space of a model, the train() function from the caret package is used to train a generalized boosted regression model (gbm) and a support vector machine (svm). Setting the seed produces paired samples and enables the two models to be compared using the resampling technique described in Hothorn et al., *"The design and analysis of benchmark experiments"*, Journal of Computational and Graphical Statistics (2005) vol 14 (3) pp 675-699.

The performance metric for the comparison is the area under the ROC curve. From examining the boxplots of the sampling distributions for the two models it is apparent that, in this case, the gbm has the advantage.

Also, notice that the call to registerDoParallel() permits parallel execution of the training algorithms. (The taskbar showed all four cores of my laptop maxed out at 100% utilization.)

I chose this example because I wanted to show programmers coming to R for the first time that the power and productivity of R comes not only from the large number of machine learning models implemented, but also from the tremendous amount of infrastructure that package authors have built up, making it relatively easy to do fairly sophisticated tasks in just a few lines of code.

All of the code for this example, along with the rest of my code from the DataWeek R Bootcamp, is available on GitHub.

by Seth Mottaghinejad

Let's review the Collatz conjecture, which says that given a positive integer n, the following recursive algorithm will always terminate:

- if n is 1, stop, otherwise recurse on the following
- if n is even, then divide it by 2
- if n is odd, then multiply it by 3 and add 1

In our last post, we created a function called 'cpp_collatz', which given an integer vector, returns an integer vector of their corresponding stopping times (number of iterations for the above algorithm to reach 1). For example, when n = 5 we have

5 -> 3*5+1 = 16 -> 16/2 = 8 -> 8/2 = 4 -> 4/2 = 2 -> 2/2 = 1,

giving us a stopping time of 5 iterations.
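For readers without a C++ toolchain handy, the same stopping-time computation can be written in a few lines of plain R. This is a slow but direct stand-in for the post's cpp_collatz (the name collatz_r is mine):

```r
# Count the iterations of the Collatz map needed to reach 1
collatz_r <- function(n) {
  steps <- 0
  while (n != 1) {
    n <- if (n %% 2 == 0) n / 2 else 3 * n + 1
    steps <- steps + 1
  }
  steps
}

collatz_r(5)  # 5, matching the trace above
```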

In today's article, we want to perform some exploratory data analysis to see if we can find any pattern relating an integer to its stopping time. As part of the analysis, we will extract some features out of the integers that could help us explain any differences in stopping times.

Here are some examples of potentially useful features:

- Is the integer a prime number?
- What are its proper divisors?
- What is the remainder upon dividing the integer by some other number?
- What is the sum of its digits?
- Is the integer a triangular number?
- Is the integer a square number?
- Is the integer a pentagonal number?

In case you are encountering these terms for the first time, a triangular number is any number m that can be written as m = n(n+1)/2, where n is some positive integer. To determine if a number is triangular, we can rewrite the above equation as n^2 + n - 2m = 0, and use the quadratic formula to get n = (-1 + sqrt(1 + 8m))/2 and (-1 - sqrt(1 + 8m))/2. Since n must be a positive integer, we ignore the latter solution, leaving us with (-1 + sqrt(1 + 8m))/2.

Thus, if plugging m in the above formula results in an integer, we can say that m is a triangular number. Similar rules exist to determine if an integer is square or pentagonal, but I will refer you to Wikipedia for the details.
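The triangular-number test translates directly into R. A minimal sketch (the tolerance-based integer check anticipates the floating-point caveat discussed later in the post):

```r
# m is triangular iff n = (-1 + sqrt(1 + 8m))/2 is a positive integer
is_triangular <- function(m, eps = 1e-9) {
  n <- (sqrt(8 * m + 1) - 1) / 2
  abs(n - round(n)) < eps
}

is_triangular(c(1, 3, 6, 10, 11))  # TRUE TRUE TRUE TRUE FALSE
```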

For the purpose of conducting our analysis, we created some other functions in C++ and R to help us. Let's take a look at these functions:

cat(paste(readLines(file.path(directory, "collatz.cpp")), collapse = "\n"))

#include <Rcpp.h>
#include <vector>
using namespace Rcpp;

// [[Rcpp::export]]
int collatz(int nn) {
  int ii = 0;
  while (nn != 1) {
    if (nn % 2 == 0) nn /= 2;
    else nn = 3 * nn + 1;
    ii += 1;
  }
  return ii;
}

// [[Rcpp::export]]
IntegerVector cpp_collatz(IntegerVector ints) {
  return sapply(ints, collatz);
}

// [[Rcpp::export]]
bool is_int_prime(int nn) {
  if (nn < 1) stop("int must be greater than 1.");
  else if (nn == 1) return FALSE;
  else if (nn % 2 == 0) return (nn == 2);
  for (int ii = 3; ii * ii <= nn; ii += 2) {
    if (nn % ii == 0) return false;
  }
  return true;
}

// [[Rcpp::export]]
LogicalVector is_prime(IntegerVector ints) {
  return sapply(ints, is_int_prime);
}

// [[Rcpp::export]]
NumericVector gen_primes(int n) {
  if (n < 1) stop("n must be greater than 0.");
  std::vector<int> primes;
  primes.push_back(2);
  int i = 3;
  while (primes.size() < unsigned(n)) {
    if (is_int_prime(i)) primes.push_back(i);
    i += 2;
  }
  return Rcpp::wrap(primes);
}

// [[Rcpp::export]]
NumericVector gen_divisors(int n) {
  if (n < 1) stop("n must be greater than 0.");
  std::vector<int> divisors;
  divisors.push_back(1);
  for (int i = 2; i <= sqrt(n); i++) {
    if (n % i == 0) {
      divisors.push_back(i);
      divisors.push_back(n / i);
    }
  }
  sort(divisors.begin(), divisors.end());
  divisors.erase(unique(divisors.begin(), divisors.end()), divisors.end());
  return Rcpp::wrap(divisors);
}

// [[Rcpp::export]]
bool is_int_perfect(int nn) {
  if (nn < 1) stop("int must be greater than 0.");
  return nn == sum(gen_divisors(nn));
}

// [[Rcpp::export]]
LogicalVector is_perfect(IntegerVector ints) {
  return sapply(ints, is_int_perfect);
}

Here is a list of other helper functions in collatz.cpp:

- 'is_prime': given an integer vector, returns a logical vector
- 'gen_primes': given some integer input, n, generates the first n prime numbers
- 'gen_divisors': given an integer n, returns an integer vector of all its proper divisors
- 'is_perfect': given an integer vector, returns a logical vector

sum_digits <- function(x) {
  # returns the sum of the individual digits that make up x
  # x must be an integer vector
  f <- function(xx) sum(as.integer(strsplit(as.character(xx), "")[[1]]))
  sapply(x, f)
}

As you can guess, many of the above functions (such as 'is_prime' and 'gen_divisors') rely on loops, which makes C++ the ideal place to perform the computation. So we farmed out as much of the heavy-duty computations to C++ leaving R with the task of processing and analyzing the resulting data.

Let's get started. We will perform the analysis on all integers below 10^5, since R is memory-bound and we can run into a bottleneck quickly. But next time, we will show you how to overcome this limitation using the 'RevoScaleR' package, which will allow us to scale the analysis to much larger integers.

One small caveat before we start: I enjoy dabbling in mathematics but I know very little about number theory. The analysis we are about to perform is not meant to be rigorous. Instead, we will attempt to approach the problem using EDA the same way that we approach any data-driven problem.

maxint <- 10^5
df <- data.frame(int = 1:maxint)            # the original number
df <- transform(df, st = cpp_collatz(int))  # stopping times
df <- transform(df,
                sum_digits = sum_digits(int),
                inv_triangular = (sqrt(8*int + 1) - 1)/2,  # inverse triangular number
                inv_square = sqrt(int),                    # inverse square number
                inv_pentagonal = (sqrt(24*int + 1) + 1)/6  # inverse pentagonal number
)

To determine whether a numeric value is an integer, we need to be careful not to use the '==' operator in R, as it is not guaranteed to work because of minute rounding errors. Here's an example:

.3 == .4 - .1 # we expect to get TRUE but get FALSE instead

[1] FALSE

The solution is to check whether the absolute difference between the above numbers is smaller than some tolerance threshold.

eps <- 1e-9

abs(.3 - (.4 - .1)) < eps # returns TRUE

[1] TRUE

df <- transform(df,
                is_triangular = abs(inv_triangular - as.integer(inv_triangular)) < eps,
                is_square = abs(inv_square - as.integer(inv_square)) < eps,
                is_pentagonal = abs(inv_pentagonal - as.integer(inv_pentagonal)) < eps,
                is_prime = is_prime(int),
                is_perfect = is_perfect(int)
)
df <- df[ , names(df)[-grep("^inv_", names(df))]]

Finally, we will create a variable listing all of the integer's proper prime divisors. Every composite integer can be reconstructed out of these basic building blocks, a mathematical result known as the *'unique factorization theorem'*. We can use the function 'gen_divisors' to get a vector of an integer's proper divisors, and the 'is_prime' function to keep only the ones that are prime. Finally, because the return object must be a singleton, we can use 'paste' with the 'collapse' argument to join all of the prime divisors into a single comma-separated string.

Lastly, on its own, we may not find the variable 'all_prime_divs' especially helpful. Instead, we can generate multiple flag variables out of it, indicating whether or not a specific prime number is a divisor of the integer. We will generate 25 flag variables, one for each of the first 25 prime numbers.
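A minimal base-R sketch of these two steps (prime_divs is an illustrative stand-in for the gen_divisors/is_prime combination; the real analysis uses the faster C++ versions):

```r
# Proper prime divisors of n
prime_divs <- function(n) {
  divs <- which(n %% seq_len(n - 1) == 0)  # proper divisors of n
  is_pr <- function(k) k > 1 && sum(k %% seq_len(k) == 0) == 2
  divs[sapply(divs, is_pr)]
}

dfx <- data.frame(int = c(21721, 36084))

# Join the prime divisors into one comma-separated string per integer
dfx$all_prime_divs <- sapply(dfx$int,
                             function(n) paste(prime_divs(n), collapse = ","))

# Flag variables, one per prime, as described above (first few primes only)
for (p in c(2, 3, 5, 7)) dfx[[paste0("is_div_by_", p)]] <- dfx$int %% p == 0

dfx$all_prime_divs  # "7,29,107" "2,3,31,97"
```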

There are many more features that we can extract from the underlying integers, but we will stop here. As we mentioned earlier, our goal is not to provide a rigorous mathematical work, but show you how the tools of data analysis can be brought to bear to tackle a problem of such nature.

Here's a sample of 10 rows from the data:

df[sample.int(nrow(df), 10), ]

        int  st sum_digits is_triangular is_square is_pentagonal is_prime
21721 21721 162         13         FALSE     FALSE         FALSE    FALSE
36084 36084 142         21         FALSE     FALSE         FALSE    FALSE
40793 40793 119         23         FALSE     FALSE         FALSE    FALSE
3374   3374  43         17         FALSE     FALSE         FALSE    FALSE
48257 48257  44         26         FALSE     FALSE         FALSE    FALSE
42906 42906  49         21         FALSE     FALSE         FALSE    FALSE
37283 37283  62         23         FALSE     FALSE         FALSE    FALSE
55156 55156  60         22         FALSE     FALSE         FALSE    FALSE
6169   6169 111         22         FALSE     FALSE         FALSE    FALSE
77694 77694 231         33         FALSE     FALSE         FALSE    FALSE
      is_perfect all_prime_divs is_div_by_2 is_div_by_3 is_div_by_5 is_div_by_7
21721      FALSE       7,29,107       FALSE       FALSE       FALSE        TRUE
36084      FALSE      2,3,31,97        TRUE        TRUE       FALSE       FALSE
40793      FALSE         19,113       FALSE       FALSE       FALSE       FALSE
3374       FALSE        2,7,241        TRUE       FALSE       FALSE        TRUE
48257      FALSE      11,41,107       FALSE       FALSE       FALSE       FALSE
42906      FALSE       2,3,7151        TRUE        TRUE       FALSE       FALSE
37283      FALSE        23,1621       FALSE       FALSE       FALSE       FALSE
55156      FALSE        2,13789        TRUE       FALSE       FALSE       FALSE
6169       FALSE         31,199       FALSE       FALSE       FALSE       FALSE
77694      FALSE     2,3,23,563        TRUE        TRUE       FALSE       FALSE
      is_div_by_11 is_div_by_13 is_div_by_17 is_div_by_19 is_div_by_23
21721        FALSE        FALSE        FALSE        FALSE        FALSE
36084        FALSE        FALSE        FALSE        FALSE        FALSE
40793        FALSE        FALSE        FALSE         TRUE        FALSE
3374         FALSE        FALSE        FALSE        FALSE        FALSE
48257         TRUE        FALSE        FALSE        FALSE        FALSE
42906        FALSE        FALSE        FALSE        FALSE        FALSE
37283        FALSE        FALSE        FALSE        FALSE         TRUE
55156        FALSE        FALSE        FALSE        FALSE        FALSE
6169         FALSE        FALSE        FALSE        FALSE        FALSE
77694        FALSE        FALSE        FALSE        FALSE         TRUE
      is_div_by_29 is_div_by_31 is_div_by_37 is_div_by_41 is_div_by_43
21721         TRUE        FALSE        FALSE        FALSE        FALSE
36084        FALSE         TRUE        FALSE        FALSE        FALSE
40793        FALSE        FALSE        FALSE        FALSE        FALSE
3374         FALSE        FALSE        FALSE        FALSE        FALSE
48257        FALSE        FALSE        FALSE         TRUE        FALSE
42906        FALSE        FALSE        FALSE        FALSE        FALSE
37283        FALSE        FALSE        FALSE        FALSE        FALSE
55156        FALSE        FALSE        FALSE        FALSE        FALSE
6169         FALSE         TRUE        FALSE        FALSE        FALSE
77694        FALSE        FALSE        FALSE        FALSE        FALSE
      is_div_by_47 is_div_by_53 is_div_by_59 is_div_by_61 is_div_by_67
21721        FALSE        FALSE        FALSE        FALSE        FALSE
36084        FALSE        FALSE        FALSE        FALSE        FALSE
40793        FALSE        FALSE        FALSE        FALSE        FALSE
3374         FALSE        FALSE        FALSE        FALSE        FALSE
48257        FALSE        FALSE        FALSE        FALSE        FALSE
42906        FALSE        FALSE        FALSE        FALSE        FALSE
37283        FALSE        FALSE        FALSE        FALSE        FALSE
55156        FALSE        FALSE        FALSE        FALSE        FALSE
6169         FALSE        FALSE        FALSE        FALSE        FALSE
77694        FALSE        FALSE        FALSE        FALSE        FALSE
      is_div_by_71 is_div_by_73 is_div_by_79 is_div_by_83 is_div_by_89
21721        FALSE        FALSE        FALSE        FALSE        FALSE
36084        FALSE        FALSE        FALSE        FALSE        FALSE
40793        FALSE        FALSE        FALSE        FALSE        FALSE
3374         FALSE        FALSE        FALSE        FALSE        FALSE
48257        FALSE        FALSE        FALSE        FALSE        FALSE
42906        FALSE        FALSE        FALSE        FALSE        FALSE
37283        FALSE        FALSE        FALSE        FALSE        FALSE
55156        FALSE        FALSE        FALSE        FALSE        FALSE
6169         FALSE        FALSE        FALSE        FALSE        FALSE
77694        FALSE        FALSE        FALSE        FALSE        FALSE
      is_div_by_97
21721        FALSE
36084         TRUE
40793        FALSE
3374         FALSE
48257        FALSE
42906        FALSE
37283        FALSE
55156        FALSE
6169         FALSE
77694        FALSE

We can now move on to looking at various statistical summaries to see if we notice any differences between the stopping times (our response variable) when we break up the data in different ways. We will look at the counts, mean, median, standard deviation, and trimmed mean (after throwing out the highest 10 percent) of the stopping times, as well as the correlation between stopping times and the integers. This is by no means a comprehensive list, but it can serve as a guidance for deciding which direction to go to next.

my_summary <- function(df) {
  with(df, data.frame(
    count = length(st),
    mean_st = mean(st),
    median_st = median(st),
    tmean_st = mean(st[st < quantile(st, .9)]),
    sd_st = sd(st),
    cor_st_int = cor(st, int, method = "spearman")
  ))
}

To create the above summaries broken up by the flag variables in the data, we will use the 'ddply' function in the 'plyr' package. For example, the following will give us the summaries asked for in 'my_summary', grouped by 'is_prime':

ddply(df, ~ is_prime, my_summary)

To avoid having to manually type every formula, we can pull the flag variables from the data set, generate the strings that will make up the formula, wrap it inside 'as.formula' and pass it to 'ddply'.

flags <- names(df)[grep("^is_", names(df))]
res <- lapply(flags, function(nm) ddply(df, as.formula(sprintf('~ %s', nm)), my_summary))
names(res) <- flags
res

$is_triangular
  is_triangular count   mean_st median_st tmean_st    sd_st cor_st_int
1         FALSE 99554 107.58511        99 96.23178 51.36153  0.1700491
2          TRUE   446  97.11211        94 85.97243 51.34051  0.3035063

$is_square
  is_square count   mean_st median_st tmean_st    sd_st cor_st_int
1     FALSE 99684 107.59743        99 96.25033 51.34206  0.1696582
2      TRUE   316  88.91772        71 77.29577 55.43725  0.4274504

$is_pentagonal
  is_pentagonal count   mean_st median_st tmean_st    sd_st cor_st_int
1         FALSE 99742 107.56948      99.0 96.22466 51.35580  0.1702560
2          TRUE   258  95.52326      83.5 83.86638 53.91514  0.3336478

$is_prime
  is_prime count  mean_st median_st  tmean_st    sd_st cor_st_int
1    FALSE 90408 107.1227        99  95.95392 51.29145  0.1693013
2     TRUE  9592 111.4569       103 100.39643 51.90186  0.1953000

$is_perfect
  is_perfect count  mean_st median_st tmean_st    sd_st cor_st_int
1      FALSE 99995 107.5419        99 96.20092 51.36403  0.1707839
2       TRUE     5  37.6000        18 19.50000 45.06440  0.9000000

$is_div_by_2
  is_div_by_2 count  mean_st median_st  tmean_st    sd_st cor_st_int
1       FALSE 50000 113.5745       106 102.58324 52.25180  0.1720108
2        TRUE 50000 101.5023        94  90.68234 49.73778  0.1737786

$is_div_by_3
  is_div_by_3 count  mean_st median_st tmean_st    sd_st cor_st_int
1       FALSE 66667 107.4415        99 96.15832 51.30838  0.1705745
2        TRUE 33333 107.7322        99 96.94426 51.48104  0.1714432

$is_div_by_5
  is_div_by_5 count  mean_st median_st tmean_st    sd_st cor_st_int
1       FALSE 80000 107.5676        99 96.21239 51.36466  0.1713552
2        TRUE 20000 107.4214        99 96.13873 51.37206  0.1688929

. . .

$is_div_by_89
  is_div_by_89 count  mean_st median_st tmean_st    sd_st cor_st_int
1        FALSE 98877 107.5381        99 96.19965 51.37062  0.1707754
2         TRUE  1123 107.5628        97 96.02096 50.97273  0.1803568

$is_div_by_97
  is_div_by_97 count  mean_st median_st  tmean_st    sd_st cor_st_int
1        FALSE 98970 107.4944        99  96.15365 51.36880 0.17187453
2         TRUE  1030 111.7660       106 101.35853 50.93593 0.07843996

As we can see from the results above, most comparisons appear to be non-significant (although, in mathematics, even tiny differences can be meaningful, so we will avoid leaning on statistical significance here). Here's a summary of the trends that stand out as we go over the results:

- On average, stopping times are slightly higher for prime numbers compared to composite numbers.
- On average, stopping times are slightly lower for triangular numbers, square numbers, and pentagonal numbers compared to their corresponding counterparts.
- Despite having lower average stopping times, triangular numbers, square numbers and pentagonal numbers are more strongly correlated with their stopping times than their corresponding counterparts. A larger sample size may help in this case.
- On average, odd numbers have a higher stopping time than even numbers.

This last item could just be a restatement of the first, however. Let's take a closer look:

ddply(df, ~ is_prime + is_div_by_2, my_summary)
  is_prime is_div_by_2 count  mean_st median_st  tmean_st    sd_st cor_st_int
1    FALSE       FALSE 40409 114.0744       106 102.91178 52.32495  0.1658988
2    FALSE        TRUE 49999 101.5043        94  90.68434 49.73624  0.1737291
3     TRUE       FALSE  9591 111.4685       103 100.40796 51.89231  0.1950482
4     TRUE        TRUE     1   1.0000         1       NaN       NA         NA

When limiting the analysis to odd numbers only, prime numbers now have a lower average stopping time compared to composite numbers, which reverses the trend seen in 1.

And one final point: since having a prime number as a proper divisor is a proxy for being a composite number, it is difficult to read too much into whether divisibility by a specific prime number affects the average stopping time. In any case, no specific prime number stands out. Once again, a larger sample size would give us more confidence in the results.

The above analysis is perhaps too crude to offer any significant leads, so here are some possible improvements:

- We could think of other features that we could add to the data.
- We could think of other statistical summaries to include in the analysis.
- We could try to scale up by looking at all integers from 1 to 1 billion instead of 1 to 1 million.

In the next post, we will see how we can use the 'RevoScaleR' package to go from a memory-bound R 'data.frame' to an external data frame stored in an XDF file. In doing so, we will achieve the following improvements:

- as the data will no longer be bound by available memory, we will be able to scale with the size of the data,
- since the XDF format is also a distributed data format, we will use the multiple cores available on a single machine to distribute the computation itself by having each core process a separate chunk of the data. On a cluster of machines, the same analysis could then be distributed over the different nodes of the cluster.

Hilary Parker has contributed a lovely article to *Significance*, the magazine of the American Statistical Association and the Royal Statistical Society, on using R to set your Google calendar to mark the time of sunsets. Hilary details the process in the article, but the basic idea is to use the sunrise.set function from the StreamMetabolism package to calculate sunset times at future dates, and then create an R function to write a file of calendar appointments which you can import into Google Calendar. (You can use a similar process to set up calendar events for anything you like — just adapt the R code accordingly.) Now that her phone buzzes to alert her to the upcoming sunset, Hilary has been able to capture some beautiful sunsets framed by the NYC skyline, like this one:
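
As a rough sketch of the idea (this is not Hilary's actual code; the coordinates, dates, and CSV column layout below are illustrative assumptions):

```r
# Illustrative sketch: compute upcoming sunset times for New York City
# with StreamMetabolism::sunrise.set(), then write a CSV in the simple
# Subject / Start Date / Start Time layout that Google Calendar can import.
library(StreamMetabolism)

# sunrise.set(lat, long, date, ...) returns a data frame of sunrise and
# sunset times, one row per day
sun <- sunrise.set(40.7128, -74.0060, "2014/09/01",
                   timezone = "America/New_York", num.days = 30)

events <- data.frame(
  Subject      = "Sunset",
  "Start Date" = format(sun$sunset, "%m/%d/%Y"),
  "Start Time" = format(sun$sunset, "%I:%M %p"),
  check.names  = FALSE)

# Import this file via Google Calendar's "Import" dialog
write.csv(events, "sunsets.csv", row.names = FALSE)
```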

Hilary is a data analyst at Etsy, and writes interesting articles about R and statistics at her blog, Not So Standard Deviations. If — like Hilary — you use R at work, you should also check out her post, Writing an R package from scratch, which includes some great tips on organizing your R functions using personal packages. Hilary also has some great advice on testing R scripts for production use in her useR! 2014 poster, testdat: An R Package for Unit Testing of Tabular data.

Significance: Never miss another sunset with R

*by Nick Elprin, Co-Founder of Domino Data Lab*

We built a platform that lets analysts deploy R code to an HTTP server with one click, and we describe it in detail below. If you have ever wanted to invoke your R model with a simple HTTP call, without dealing with any infrastructure setup or asking for help from developers — imagine Heroku for your R code — we hope you’ll enjoy this.

**Introduction**

Across industries, analytical models are powering core business processes and applications as more companies realize that analytics are key to their competitiveness. R is particularly well suited to developing and expressing such models, but unfortunately, the final step of integrating R code into existing software systems remains difficult. This post describes our solution to this problem: “one-click” publishing of R code to an API server, allowing easy integration between R and other languages, and freeing data scientists to change their models on their own, without help from any developers or engineers.

Today, two problems — one technical, and one organizational — create friction when trying to integrate R code into existing software applications. First, while R is a great language for analytical code, most enterprise software systems are written in more general purpose languages, such as Java, PHP, C#, C++, or even data pipeline tools such as Informatica or Microsoft’s SSIS. Invoking R code from these languages requires some non-trivial technical work, or translation to another language. This leads to the second problem: in most companies, software engineering teams are separate from analytics teams, so when analysts need engineering help, they are forced to compete against other priorities, or they must do their own engineering. Even after an initial deployment of R code, when the model is updated, the deployment process must be repeated, resulting in a painful iteration cycle.

**A Solution: Domino and API Endpoints**

Domino is a platform for doing data science in the enterprise: it provides turnkey functionality for job distribution, version control, collaboration, and model deployment, so that data science teams can be productive without their own engineers and developers. We built our “API Endpoints” feature to address the use case I describe above, reducing the friction associated with integrating R (or Python) models into production systems. Here’s how it works:

Let’s say we are building a library for arithmetic. We have a file, arithmetic.R, with this code:

add <- function(a, b) {
  a + b
}

multiply <- function(a, b) {
  a * b
}

Now we’d like to make that code accessible to external applications. You can upload this file to Domino (via our web interface, command-line client, or our R package). Once uploaded, you can define a new “API Endpoint” by specifying this file, and the name of the function to invoke when the API is used.

When we hit “publish,” Domino deploys the script to a low-latency server and listens for incoming HTTP requests. If your script performs any one-time initialization (for example, installing custom R packages, or calculating any model parameters), it would run once upon publishing. When a request comes in, Domino passes the request parameters to your specified R function, and returns the results.

It’s that simple. We can test this with a simple curl command (or the equivalent operation in any modern programming language):

curl -v -X POST http://rt.dominoup.com:9000/v1/nick/demo/rt \
  -H "Content-Type:application/json" \
  -d '{"parameters": [10,20] }'

Our two parameters (10, 20) are the inputs to the function. Domino pulls them out of the incoming HTTP request, passes them to the R function, and marshals the “result” value back in the HTTP response, along with some status info:

< HTTP/1.1 200 OK
< Content-Type: text/plain; charset=utf-8
< Content-Length: 70
<
* Connection #0 to host rt.dominoup.com left intact
{"runId":"53d3c2986fe0206fee536283","status":"Succeeded","result":200}

This technique works for more complex return types, as well, including strings and arrays of multiple values.

Now our R code is accessible via an HTTP call, providing a clean interface to any other system or language. Moreover, analysts can deploy updates to the R code on their own, enabling a faster iteration cycle.

**Training Steps with Scheduled Runs**

A common workflow in machine learning tasks is to create a training step, which might take hours to run, and a classify step, which uses the output of the training step to classify inputs very quickly. Domino lets you schedule recurring tasks, which can automatically publish their results to an API Endpoint you have defined. In this example, we have a “training_task.R” script that runs a computationally intensive regression and saves the resulting parameters to an RData file.

We can then create a classify function (and API Endpoint) which reads in our parameters and quickly classifies incoming requests. Because our scheduled task is set to update the API Endpoint upon completion, the API will get the latest training parameters each night when the training script runs.
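
To make the split concrete, here is a minimal sketch of the two scripts. The lm() fit on mtcars is only a stand-in for the computationally intensive regression, and the "classify.R" file name and model formula are my own placeholders:

```r
## training_task.R -- scheduled nightly: fit the model and save its parameters
model <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in for the expensive step
save(model, file = "model.RData")

## classify.R -- published as the API Endpoint. The load() happens once at
## publish time; each incoming request then only pays for a fast predict().
load("model.RData")
classify <- function(wt, hp) {
  as.numeric(predict(model, newdata = data.frame(wt = wt, hp = hp)))
}
```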

**Change Management: Production vs Development**

Once your R code is consumed by production processes, it’s critical to be careful when changing it. Domino facilitates this by automatically keeping a revisioned history of all your files, and by keeping your API Endpoints “pinned” to specific revisions of your code until you explicitly update it. As an example, this screenshot shows Domino’s view of the history of our API releases. We could edit our R files all we want, but the most recent API release (v8) would remain pointed to the exact version of the code we had when we published it. And if we ever needed to debug a production issue from a past release, we could go back to the exact version of our code associated with earlier releases simply by clicking the “commit” link.

**Implementation**

Domino actually lets you publish R or Python code as an API Endpoint, but since this is an R blog, I will focus on how we have implemented this for R code. Our API functionality is driven by two powerful tools under the hood:

- Rserve, along with its Java client, lets us programmatically control an in-memory R session.
- Docker lets us isolate users’ code in separate containers, so they cannot interfere with each other or with our host machines.

With these tools, the implementation of our API functionality works roughly like this:

When we “publish” a new endpoint, we get a source_file and a function_name from the user. Then we:

- Create a new Docker container
- Write a conf file telling Rserve to source the source_file script
- In the container, start an Rserve session and tell it to use the conf file we just created

To start Rserve, we invoke R from the command line (e.g., R --no-save --no-environ) and pass it the following STDIN content:

library(Rserve)
Rserve(debug=TRUE, port=$rservePort, '--vanilla --RS-conf $conf_file')

Where $conf_file is the path to a file containing:

remote enable
source $source_file

By deploying the script and running any initialization when the user hits “publish,” we minimize the work necessary each time the API is invoked. In fact, the overhead of an API request to this published endpoint is only about 150ms. This is critical to support production applications with low latency requirements.

When an HTTP request comes in, we can then do the following:

- Lookup the Docker container that corresponds to the requested endpoint
- Build a command_string from the function_name corresponding to the requested endpoint, and the POST parameters from the HTTP request. E.g., “multiply(10, 20)”
- Invoke the Rserve API to execute this command in the process running inside the specific Docker container. In Scala, this is how that looks:

import org.rosuda.REngine.REXP
import org.rosuda.REngine.Rserve.RConnection

val c = new RConnection(null, 9999) // second parameter is the port
try {
  new RResult(c.eval(command_string))
} finally {
  c.close()
}

RConnection.eval() returns an REXP object. Our helper class, RResult, knows how to translate this object into appropriate JSON, so that we can return it in our HTTP response:

class RResult(val rexp: REXP) {
  def toJson = {
    if (rexp.isString) {
      Json.toJson(rexp.asString)
    } else if (rexp.isVector && rexp.length > 1) {
      Json.toJson(rexp.asDoubles())
    } else if (rexp.isList) {
      val list = rexp.asList
      if (list.isNamed) {
        JsObject(
          for {
            key <- list.keys
          } yield {
            key -> Json.toJson(new RResult(list.at(key)))
          }
        )
      } else {
        JsArray(
          for {
            i <- 0 to list.capacity
          } yield {
            Json.toJson(new RResult(list.at(i)))
          }
        )
      }
    } else if (rexp.isNumeric) {
      JsNumber(rexp.asDouble)
    } else {
      JsString(rexp.asNativeJavaObject.toString)
    }
  }
}

**Conclusion**

We are excited to see more organizations making analytical models a central part of their business processes, and we hope that Domino can empower data scientists to accelerate this trend. We are always eager for feedback, so please check our free trial, including API Endpoints, and let us know what you think at info@dominodatalab.com or on Twitter at @dominodatalab.

*The latest in a series by Daniel Hanson*

**Introduction**

Correlations between holdings in a portfolio are of course a key component in financial risk management. Borrowing a tool common in fields such as bioinformatics and genetics, we will look at how to use heat maps in R for visualizing correlations among financial returns, and examine behavior in both a stable and down market.

While base R contains its own heatmap(.) function, the reader will likely find the heatmap.2(.) function in the R package gplots to be a bit more user friendly. A very nicely written companion article entitled A short tutorial for decent heat maps in R (Sebastian Raschka, 2013), which covers more details and features, is available on the web; we will also refer to it in the discussion below.

We will present the topic in the form of an example.

**Sample Data**

As in previous articles, we will make use of R packages Quandl and xts to acquire and manage our market data. Here, in a simple example, we will use returns from the following global equity indices over the period 1998-01-05 to the present, and then examine correlations between them:

S&P 500 (US)

RUSSELL 2000 (US Small Cap)

NIKKEI (Japan)

HANG SENG (Hong Kong)

DAX (Germany)

CAC (France)

KOSPI (Korea)

First, we gather the index values and convert to returns:

library(xts)
library(Quandl)

my_start_date <- "1998-01-05"
SP500.Q     <- Quandl("YAHOO/INDEX_GSPC",  start_date = my_start_date, type = "xts")
RUSS2000.Q  <- Quandl("YAHOO/INDEX_RUT",   start_date = my_start_date, type = "xts")
NIKKEI.Q    <- Quandl("NIKKEI/INDEX",      start_date = my_start_date, type = "xts")
HANG_SENG.Q <- Quandl("YAHOO/INDEX_HSI",   start_date = my_start_date, type = "xts")
DAX.Q       <- Quandl("YAHOO/INDEX_GDAXI", start_date = my_start_date, type = "xts")
CAC.Q       <- Quandl("YAHOO/INDEX_FCHI",  start_date = my_start_date, type = "xts")
KOSPI.Q     <- Quandl("YAHOO/INDEX_KS11",  start_date = my_start_date, type = "xts")

# Depending on the index, the final price for each day is either
# "Adjusted Close" or "Close Price". Extract this single column for each:
SP500     <- SP500.Q[, "Adjusted Close"]
RUSS2000  <- RUSS2000.Q[, "Adjusted Close"]
DAX       <- DAX.Q[, "Adjusted Close"]
CAC       <- CAC.Q[, "Adjusted Close"]
KOSPI     <- KOSPI.Q[, "Adjusted Close"]
NIKKEI    <- NIKKEI.Q[, "Close Price"]
HANG_SENG <- HANG_SENG.Q[, "Adjusted Close"]

# The xts merge(.) function will only accept two series at a time.
# We can, however, merge multiple columns by downcasting to *zoo* objects.
# Remark: "all = FALSE" uses an inner join to merge the data.
z <- merge(as.zoo(SP500), as.zoo(RUSS2000), as.zoo(DAX), as.zoo(CAC),
           as.zoo(KOSPI), as.zoo(NIKKEI), as.zoo(HANG_SENG), all = FALSE)

# Set the column names; these will be used in the heat maps:
myColnames <- c("SP500","RUSS2000","DAX","CAC","KOSPI","NIKKEI","HANG_SENG")
colnames(z) <- myColnames

# Cast back to an xts object:
mktPrices <- as.xts(z)

# Next, calculate log returns:
mktRtns <- diff(log(mktPrices), lag = 1)
head(mktRtns)
mktRtns <- mktRtns[-1, ]  # Remove resulting NA in the 1st row

**Generate Heat Maps**

As noted above, heatmap.2(.) is the function in the gplots package that we will use. For convenience, we’ll wrap this function inside our own generate_heat_map(.) function, as we will call this parameterization several times to compare market conditions.

As for the parameterization, the comments should be self-explanatory, but we’re keeping things simple by eliminating the dendrogram, and leaving out the trace lines inside the heat map and the density plot inside the color legend. Note also the setting Rowv = FALSE; this ensures the ordering of the rows and columns remains consistent from plot to plot. We’re also just using the default color settings; for customized colors, see the Raschka tutorial linked above.

require(gplots)
generate_heat_map <- function(correlationMatrix, title) {
  heatmap.2(x = correlationMatrix,       # the correlation matrix input
            cellnote = correlationMatrix, # places correlation value in each cell
            main = title,                 # heat map title
            symm = TRUE,                  # configure diagram as standard correlation matrix
            dendrogram = "none",          # do not draw a row dendrogram
            Rowv = FALSE,                 # keep ordering consistent
            trace = "none",               # turns off trace lines inside the heat map
            density.info = "none",        # turns off density plot inside color legend
            notecol = "black")            # set font color of cell labels to black
}

Next, let’s calculate three correlation matrices using the data we have obtained:

- Correlations based on the entire data set from 1998-01-05 to the present
- Correlations of market indices during a reasonably calm period -- January through December 2004
- Correlations of falling market indices in the midst of the financial crisis - October 2008 through May 2009
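
The code for building these three matrices is not shown in the original post, but they can be computed along the following lines. The xts date-range subsetting, the conversion to percentages (to match the values quoted later), and the rounding are my assumptions, not the author's exact code:

```r
# Correlation matrices over the full history, the calm 2004 period,
# and the 2008-09 crisis period, expressed as percentages and rounded
# so the heat map cell labels stay readable
corr1 <- round(cor(mktRtns) * 100, digits = 2)
corr2 <- round(cor(mktRtns["2004-01/2004-12"]) * 100, digits = 2)
corr3 <- round(cor(mktRtns["2008-10/2009-05"]) * 100, digits = 2)
```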

Now, let’s call our heat map function using the total market data set:

generate_heat_map(corr1, "Correlations of World Market Returns, Jan 1998 - Present")

And then, examine the result:

As expected, we trivially have correlations of 100% down the main diagonal. Note that, as shown in the color key, the darker the color, the lower the correlation. Per the heatmap.2(.) parameters above, the title is set with main = title, and the correlations are shown in black thanks to the notecol="black" setting.

Next, let’s look at a period of relative calm in the markets, namely the year 2004:

generate_heat_map(corr2, "Correlations of World Market Returns, Jan - Dec 2004")

This gives us:


Note that in this case, a glance at the darker colors in each of the cells shows that we have even lower correlations than those from our entire data set. This may of course be verified by comparing the numerical values.

Finally, let’s look at the opposite extreme, during the upheaval of the financial crisis in 2008-2009:

generate_heat_map(corr3, "Correlations of World Market Returns, Oct 2008 - May 2009")

This yields the following heat map:

Note that in this case, again just at first glance, we can tell the correlations have increased compared to 2004, by the colors changing from dark to light nearly across the board. While there are some correlations that do not increase all that much, such as the SP500/Nikkei and the Russell 2000/Kospi values, there are others across international and capitalization categories that jump quite significantly, such as the SP500/Hang Seng correlation going from about 21% to 41%, and that of the Russell 2000/DAX moving from 43% to over 57%. So, in other words, portfolio diversification can take a hit in down markets.

**Conclusion**

In this example, we only looked at seven market indices, but for a closer look at how correlations were affected during 2008-09 -- and how heat maps among a greater number of market sectors compared -- the article entitled *Diversification is Broken* is a recommended and interesting read.

**Using package repositories to recreate the past, distribute the present, and protect against the future**

*by Gabriel Becker (@groundwalkergmb)*

**1. Have you ever needed to reach into the distant past …** to recreate a years-old result? Take - as an arbitrary example - Anders and Huber's paper on Differential expression analysis for sequence count data from 2010. "2010", the excitable among you might exclaim, "were they programming their R scripts on punch-cards and running them on coal-powered computers??" Perhaps they were, my hyperbolic readers, but hardware is not our concern today.

Replicable scientific results are the engine for advancing human knowledge; non-replicable results, simply put, are not. This holds true regardless of how many eons - in computer years - have passed since the results were originally generated.

**2. They got what??**

Anders and Huber provide Sweave output (a PDF) that contains the code for the plots and output directly reported in their paper, as well as all the datasets used by the code. So let's reproduce their results! Getting the code into an executable form for this is painful, but that is a subject for another time.

Below is their code (I added the suppressMessages() call) for identifying differentially expressed genes for their fly data with a False Discovery Rate (FDR) cutoff of .1:

suppressMessages(library(DESeq))
countsTableFly <- read.delim( "fly_RNA_counts.tsv" )
condsFly <- c( "A", "A", "B", "B" )
# Add dummy names to avoid confusion later
rownames( countsTableFly ) <- paste( "Gene", 1:nrow(countsTableFly), sep="_" )

cdsFly <- newCountDataSet( countsTableFly, condsFly )
cdsFly <- estimateSizeFactors( cdsFly )
cdsFly <- estimateVarianceFunctions( cdsFly ) # oops! error!
resFly <- nbinomTest( cdsFly, "A", "B" )

Error: 'estimateVarianceFunctions' is defunct.
Use 'estimateDispersions' instead.
See help("Defunct")
Error in nbinomTest(cdsFly, "A", "B") : Call 'estimateDispersions' first.

It didn't work! We got an error when running their code. The Defunct error suggests that we use the estimateDispersions function instead of the defunct estimateVarianceFunctions function, so let's do that …

359?!? The paper (and Sweave output) report that this code gives the result of 864!

Before we go lighting our pitchforks and sharpening our torches, however, we need to remember a simple fact: just because it doesn't work now doesn't mean it didn't work then.

At this point you may be wondering what I'm proposing. Are we supposed to take a quick trip back in time and rerun their code in 2010? Well, sort of. The GRANBase package - among many other things - provides a sort of time machine for your computing environment.

**3. Climbing in through the sessionInfo() window**

Anders and Huber provided a window into the computing environment they used in the form of sessionInfo() output in their Sweave output PDF. "Windows are great", I'm sure you're thinking, "except that I can't use them to get inside and so I'm left out in the cold!"

Well not with that attitude you can't. We're going to use GRANBase to knock out that window and install a door in its place. One that we - and everyone else - can use to reproduce Anders and Huber's results. No more pressing your face up against the glass, drooling at the delicious old software tantalizingly arranged on the other side.

First we have to get our hands on the right version of R - 2.12 in this case. That isn't the easiest thing to do, but for those of you who don't want to install and manage an old R installation you can use an Amazon EC2 instance with this lovingly hand-crafted AMI that contains R 2.12.1 and has switchr installed: ami-94a670fc

NOTE: the authors actually ran their code in 2.12.0, but in general we can usually assume that matching the major (2) and minor (12) versions is sufficient, as we see that it is in this case. The end result (run in R 2.12.1):

library(switchr)
doiRepo = "http://research-pub.gene.com/gran/10.1186/gb-2010-11-10-r106"
switchTo(doiRepo, name = "10.1186/gb-2010-11-10-r106")

Switched to the '10.1186/gb-2010-11-10-r106' computing environment.
54 packages are currently available.
Packages installed in your site library ARE suppressed.
To switch back to your previous environment type switchBack()

If we had never switched to a computing environment based on this particular repository within the version of R being used, the above statement would create the computing environment - by installing all the packages contained in that repository - and remember it for the future. I have spared you the dozens of lines of package installation output, because that's just the type of person I am. Now then, let's try that again.

suppressMessages(library(DESeq))
countsTableFly <- read.delim( "fly_RNA_counts.tsv" )
condsFly <- c( "A", "A", "B", "B" )
# Add dummy names to avoid confusion later
rownames( countsTableFly ) <- paste( "Gene", 1:nrow(countsTableFly), sep="_" )
cdsFly <- newCountDataSet( countsTableFly, condsFly )
cdsFly <- estimateSizeFactors( cdsFly )
cdsFly <- estimateVarianceFunctions( cdsFly ) # works in the 2010 environment
resFly <- nbinomTest( cdsFly, "A", "B" )
length( which( resFly$padj < .1 ) )

[1] 864

And just like that we've recreated their result. Now, if we were staying in the same R, we would revert to our original computing environment and go about our day.

switchBack()

Reverted to the 'original' computing environment.
33 packages are currently available.
Packages installed in your site library ARE NOT suppressed.
To switch back to your previous environment type switchBack()

**4. So what just happened?**

The short answer: there is a repository associated with the paper's DOI that contains all the exact package versions used by the authors. This repository encapsulates the R computing environment - up to the actual version of R used - that Anders and Huber used. And it renders that environment recreatable, and thus - along with the original data they provide - their results reproducible.

The repository mechanism provides a natural way to both define specific sets of package versions that make up an R computing environment and make them instantly distributable.

**5. Sure, but how did it get there?**

"Well that's great", you might be thinking, "when we have such a repository; but we're out of luck if the authors of an article don't create one at the time of publication." And you would have largely been right, until now. GRANBase can retrieve the source version of any non-base package that has ever been on CRAN or Bioconductor.

Allow me to repeat that for emphasis. GRANBase can retrieve and build sources **for any version of any non-base package that has ever been released** on CRAN or BioConductor. For example - assuming you have the svn command-line utility installed - this will get you the DESeq version Anders and Huber used:

Furthermore, GRANBase will happily parse sessionInfo() output - either the R object or the text - find and build source versions of the specified packages, and build the repository for you. That is how I got the repository I used to recreate Anders and Huber's work.

"The world is saved!", you're probably shouting, and "GRANBase for President!" You guys sure are excitable. I appreciate the sentiment, but I'm fairly certain that R packages can't hold elected office.

**6. So what was that switching stuff?**

The switchr package provides a formal abstraction for creating, managing, tracking, and switching between "R computing environments" built on top of R's existing library location mechanism. So instead of managing library paths directly, we switched to the appropriate computing environment, ran the authors' computations, confirmed that we got the same results, and then switched back to our normal R installation. Also, as we have seen, switchr is confirmed to work at least as far back as R 2.12.1 (2010), with earlier R versions likely supported but untested.

**7. But won't someone think of the harddrives?**

Every time a paper, blog post, or internal analysis gets done, we copy the tarballs to a new repository and move on with our day. Sounds good. "Madness!", comes the refrain, "Why would you copy a file 920345982430589423509823452340589345 times???" Well, first of all that number seems a bit high to me, even for popular packages. In principle, though, I agree (and thus so does GRANBase).

GRANBase gets around this with the concept of having many "virtual repositories". Virtual repositories are ones that have passed Kindergarten and thus know how to share. Specifically, they share tarballs of exact package versions, so that communally they only need one copy of each version of each package.

The details of how this sharing happens are both unimportant and out of scope here, but the take-away is that there is no duplication of files when many virtual repositories hosted in the same place share package versions. Plus, like the cooks at any reasonable diner, GRANBase can make your packages to order using the version location functionality I mentioned above. This means that with code, data, sessionInfo() output, and a suitable R installation, we can assess and preserve the reproducibility of **every R analysis run in the past, present, or future** ^.

*^Assuming no dependencies not encompassed by R packages*

**8. Other uses for this framework**

- Ensuring the R environment on your local machine is equivalent to the one on a particular computing cluster
- Ensuring everyone collaborating on package development is programming against the same API (set of dependency versions)
- Allowing package maintainers to easily recreate the exact environment a reported bug was detected in
- Having side-by-side release and devel installations of BioConductor

**9. Other things GRANBase does, very briefly**

*SHAMELESS PLUG*

**9.1 Risk reports**

- Compare your currently installed packages to currently available versions
- Assesses possible "ripple" effects of updating packages
- Parses NEWS - when possible - to summarize the number of fixed bugs you are subject to
- Helps reformulate whether to upgrade or not as a strategic decision

**9.2 Incremental, multi-source Repository building**

- Build repositories using packages sources from many disparate sources (SVN, github, etc) based on the concept of a package manifest
- Only rebuilds updated packages (version bump) and their reverse dependencies. No wasted testing of unaltered packages
- CI-ready design, we use Jenkins

*END SHAMELESS PLUG*

**10. Related work, and further reading**

See:

- Gentleman and Temple Lang for a discussion of why and how software dependencies should be packaged and distributed with dynamic documents.
- RStudio's packrat is an implementation of Gentleman and Temple Lang's ideas centered around RStudio's concept of an R Project.
- Andrie de Vries' miniCRAN R package explores the concept of creating frozen, self-consistent subsets of CRAN.

Starting with R 2.13, the Bioconductor project provides AMIs that encapsulate core BioC packages, but which also necessarily encapsulate historical R versions. And no, before you ask, starting at 2.13 isn't because there is a vast and highly prescient conspiracy against us … probably.

Dirk Eddelbuettel's work with Docker-based R builds offer the potential for another way of mitigating the pain of the Old-R-versions requirement, though currently that is not the focus of his work. (See ~pg 50)

**11. Acknowledgements**

Cory Barr created an earlier GRAN package which served as a stepping-off point for development of GRANBase.

Thanks to David Smith, Joseph Rickert and Revolutions Analytics for allowing me use of their soapbox.

The GRANBase and switchr R packages are Copyright Genentech, Inc. and have been released under the Artistic 2.0 open-source license here and here, respectively.

GRANBase and switchr were developed by Gabriel Becker under the leadership of Michael Lawrence and Robert Gentleman at Genentech Research and Early Development.

Thanks to Michael and Robert for their advice and support throughout the project, and for allowing me to grow its scope beyond what was initially envisioned.

by Andrie de Vries

In my previous post I wrote about how to identify and visualize package dependencies. Within hours, Duncan Murdoch (a member of R-core) identified some discrepancies between my list of dependencies and the visualisation. Since then, I have fixed the discrepancies. In this blog post I attempt to clarify the issues involved in listing package dependencies.

In miniCRAN I expose two functions that provide information about dependencies:

- The function **pkgDep()** returns a character vector with the names of dependencies. Internally, pkgDep() is a wrapper around **tools::package_dependencies()**, a base R function that, well, tells you about package dependencies. My new function is in one way a convenience, but more importantly it sets different defaults (more about this later).
- The function **makeDepGraph()** creates an **igraph** representation of the dependencies.

Take a look at some examples. I use the package **chron**, because it neatly illustrates the different roles of Imports, Suggests and Enhances:

- chron **Imports** the base packages **graphics** and **stats**. This means that chron internally makes use of graphics and stats and will always load these packages.
- chron **Suggests** the packages **scales** and **ggplot2**. This means that chron uses some functions from these packages in its examples or vignettes. However, these functions are not necessary to use chron.
- chron **Enhances** the package **zoo**, meaning that it adds something to the zoo package. These enhancements are made available to you if you have zoo installed.
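These relationships are declared in a package's DESCRIPTION file. Based on the fields described above, chron's relevant entries would look roughly like this (an illustrative fragment, not the verbatim file):

```
Package: chron
Imports: graphics, stats
Suggests: scales, ggplot2
Enhances: zoo
```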

The function **pkgDep()** exposes not only these dependencies, but also all recursive dependencies. In other words, it answers the question: which packages need to be installed to satisfy all dependencies of the dependencies?

This means that the algorithm is as follows:

- First retrieve a list of Suggests and Enhances, using a non-recursive dependency search
- Next, perform a recursive search for all Imports, Depends and LinkingTo

The resulting list of packages should then contain the complete list necessary to satisfy all dependencies. In code:

```r
> library(miniCRAN)
> tags <- "chron"

> pkgDep(tags, suggests = FALSE, enhances = FALSE, includeBasePkgs = TRUE)
[1] "chron"    "graphics" "stats"

> pkgDep(tags, suggests = TRUE, enhances = FALSE)
 [1] "chron"        "RColorBrewer" "dichromat"    "munsell"      "plyr"         "labeling"
 [7] "colorspace"   "Rcpp"         "digest"       "gtable"       "reshape2"     "scales"
[13] "proto"        "MASS"         "stringr"      "ggplot2"

> pkgDep(tags, suggests = TRUE, enhances = TRUE)
 [1] "chron"        "RColorBrewer" "dichromat"    "munsell"      "plyr"         "labeling"
 [7] "colorspace"   "Rcpp"         "digest"       "gtable"       "reshape2"     "scales"
[13] "proto"        "MASS"         "stringr"      "lattice"      "ggplot2"      "zoo"
```
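Under the hood, the two-step algorithm described above can be sketched with base R's **tools::package_dependencies()**. This is an illustrative approximation, not pkgDep()'s actual implementation:

```r
# Illustrative sketch of the two-step dependency search;
# pkgDep() handles more edge cases than this
db <- available.packages()  # needs access to a CRAN mirror

# Step 1: non-recursive search for Suggests and Enhances
soft <- tools::package_dependencies("chron", db = db,
          which = c("Suggests", "Enhances"), recursive = FALSE)[["chron"]]

# Step 2: recursive search for Depends, Imports and LinkingTo,
# applied to chron plus the soft dependencies found in step 1
hard <- tools::package_dependencies(c("chron", soft), db = db,
          which = c("Depends", "Imports", "LinkingTo"), recursive = TRUE)

# Combine and deduplicate the results
sort(unique(c("chron", soft, unlist(hard, use.names = FALSE))))
```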


To create an igraph plot of the dependencies, you can use the function **makeDepGraph()** and plot the results:
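For example, a minimal sketch using chron; the legend and vertex parameters mirror those used in the larger example later in the post:

```r
library(miniCRAN)

# Build the dependency graph for chron, including suggested
# and enhanced packages, then plot it
dg <- makeDepGraph("chron", suggests = TRUE, enhances = TRUE)
set.seed(1)
plot(dg, legendPosEdge = c(-1, -1), legendPosVertex = c(1, -1),
     vertex.size = 20)
```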


Note how the dependencies expand to zoo (enhanced), scales and ggplot2 (suggested) and then recursively from there to pick up all the Imports and LinkingTo dependencies.

In my previous post I tried to plot the most popular package tags on StackOverflow. Using the updated functionality in the miniCRAN functions, it is easier to understand the structure of the dependencies:

```r
> tags <- c("ggplot2", "data.table", "plyr", "knitr",
+           "shiny", "xts", "lattice")
> pkgDep(tags, suggests = TRUE, enhances = FALSE)
 [1] "ggplot2"      "data.table"   "plyr"         "knitr"        "shiny"        "xts"
 [7] "lattice"      "digest"       "gtable"       "reshape2"     "scales"       "proto"
[13] "MASS"         "Rcpp"         "stringr"      "RColorBrewer" "dichromat"    "munsell"
[19] "labeling"     "colorspace"   "evaluate"     "formatR"      "highr"        "markdown"
[25] "mime"         "httpuv"       "caTools"      "RJSONIO"      "xtable"       "htmltools"
[31] "bitops"       "zoo"          "SparseM"      "survival"     "Formula"      "latticeExtra"
[37] "cluster"      "maps"         "sp"           "foreign"      "mvtnorm"      "TH.data"
[43] "sandwich"     "nlme"         "Matrix"       "bit"          "codetools"    "iterators"
[49] "timeDate"     "quadprog"     "Hmisc"        "BH"           "quantreg"     "mapproj"
[55] "hexbin"       "maptools"     "multcomp"     "testthat"     "mgcv"         "chron"
[61] "reshape"      "fastmatch"    "bit64"        "abind"        "foreach"      "doMC"
[67] "itertools"    "testit"       "rgl"          "XML"          "RCurl"        "Cairo"
[73] "timeSeries"   "tseries"      "its"          "fts"          "tis"          "KernSmooth"

> set.seed(1)
> plot(makeDepGraph(tags, includeBasePkgs = FALSE, suggests = TRUE, enhances = TRUE),
+      legendPosEdge = c(-1, -1), legendPosVertex = c(1, -1), vertex.size = 10, cex = 0.5)
```


After my previous post, Duncan Murdoch pointed out that the package **rgl**, suggested by **knitr**, appeared in the list but not in the plot. This new version of the function fixes that bug, which was introduced because I retrieved the suggests dependencies incorrectly.

EDIT:

A few hours ago miniCRAN went live on CRAN. Find miniCRAN at http://cran.r-project.org/web/packages/miniCRAN/index.html