by Joseph Rickert

Just about two and a half years ago I wrote about some resources for doing Bayesian statistics in R. Motivated by the tutorial Modern Bayesian Tools for Time Series Analysis by Harte and Weylandt that I attended at R/Finance last month, and the upcoming tutorial An Introduction to Bayesian Inference using R Interfaces to Stan that Ben Goodrich is going to give at useR!, I thought I'd look into what's new. Well, Stan is what's new! Yes, Stan has been under development and available for some time. But somehow, while I wasn't paying close attention, two things happened: (1) the rstan package evolved to make the mechanics of doing Bayesian analysis in R really easy and (2) the Stan team produced and/or organized an amazing amount of documentation.

My impressions of doing Bayesian analysis in R were set in the WinBUGS era. The separate WinBUGS installation was always tricky, and then moving between the BRugs and R2WinBUGS packages presented some additional challenges. My recent Stan experience was nothing like this. I had everything up and running in just a few minutes. The directions for getting started with rstan are clear and explicit about making sure that you have the right tool chain in place for your platform. Since I am running R 3.3.0 on Windows 10, I installed Rtools34. This went quickly and as expected, except that C:\Rtools\gcc-4.x-y\bin did not show up in my path variable. Not a big deal: I used the menus in the Windows System Properties box to edit the Path statement by hand. After this, rstan installed like any other R package and I was able to run the 8schools example from the package vignette. The following 10 minute video by Ehsan Karim takes you through the install process and the vignette example.

The Stan documentation includes four major components: (1) The Stan Language Manual, (2) Examples of fully worked out problems, (3) Contributed Case Studies and (4) both slides and video tutorials. This is an incredibly rich cache of resources that makes a very credible case for the ambitious project of teaching people with some R experience both Bayesian Statistics and Stan at the same time. The "trick" here is that the documentation operates at multiple levels of sophistication with entry points for students with different backgrounds. For example, a person with some R and the modest statistics background required for approaching Gelman and Hill's extraordinary text, Data Analysis Using Regression and Multilevel/Hierarchical Models, can immediately begin running rstan code for the book's examples. To run the rstan version of the example in section 5.1, Logistic Regression with One Predictor, with no changes, a student needs only to copy the R scripts and data into her local environment. In this case, she would need the R script: 5._LogisticRegressionWithOnePredictor.R, the data: nes1992_vote.data.R and the Stan code: nes_logit.stan. The Stan code for this simple model is about as straightforward as it gets: variable declarations, parameter identification and the model itself.

```
data {
  int<lower=0> N;
  vector[N] income;
  int<lower=0,upper=1> vote[N];
}
parameters {
  vector[2] beta;
}
model {
  vote ~ bernoulli_logit(beta[1] + beta[2] * income);
}
```
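To see what this model is estimating without installing anything, here is a minimal base R sketch. It is not the rstan workflow: it uses simulated stand-in data (the values are made up, not the NES data) and only finds the posterior mode. Because the Stan program puts no explicit prior on `beta`, that mode should coincide with the ordinary maximum likelihood fit from `glm`.

```r
# Simulated stand-in for the nes1992_vote data (illustrative values only).
set.seed(1992)
N <- 500
income <- sample(1:5, N, replace = TRUE)
vote <- rbinom(N, size = 1, prob = plogis(-1.4 + 0.33 * income))

# Log-likelihood of: vote ~ bernoulli_logit(beta[1] + beta[2] * income)
log_lik <- function(beta) {
  eta <- beta[1] + beta[2] * income
  sum(dbinom(vote, size = 1, prob = plogis(eta), log = TRUE))
}

# With flat priors the posterior mode is the maximum likelihood estimate.
mode_fit <- optim(c(0, 0), function(b) -log_lik(b), method = "BFGS")
glm_fit <- glm(vote ~ income, family = binomial)

round(mode_fit$par, 3)            # posterior mode from direct optimization
round(unname(coef(glm_fit)), 3)   # should agree closely with the line above
```

What Stan adds on top of this point estimate is a full set of posterior draws for `beta`, which is where the Bayesian summaries come from.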

Running the script will produce the iconic logistic regression plot:

I'll wind down by curbing my enthusiasm just a little by pointing out that Stan is not the only game in town. JAGS is a popular alternative, and there is plenty that can be done with unaugmented R code alone as the Bayesian Inference Task View makes abundantly clear.

If you are a book person and new to Bayesian statistics, I highly recommend Bayesian Essentials with R by Jean-Michel Marin and Christian Robert. The authors provide a compact introduction to Bayesian statistics that is backed up with numerous R examples. Also, the new book by Richard McElreath, Statistical Rethinking: A Bayesian Course with Examples in R and Stan looks like it is going to be an outstanding read. The online supplements to the book are certainly worth a look.

Finally, if you are a Bayesian or are thinking about becoming one and you are going to useR!, be sure to catch the following talks:

- Bayesian analysis of generalized linear mixed models with JAGS, by Martyn Plummer
- bamdit: An R Package for Bayesian meta-Analysis of diagnostic test data by Pablo Emilio Verde
- Fitting complex Bayesian models with R-INLA and MCMC by Virgilio Gómez-Rubio
- bayesboot: An R package for easy Bayesian bootstrapping by Rasmus Arnling Bååth
- An Introduction to Bayesian Inference using R Interfaces to Stan by Ben Goodrich
- DiLeMMa - Distributed Learning with Markov Chain Monte Carlo Algorithms Using the ROAR Package by Ali Zaidi

The forecast package for R, created and maintained by Professor Rob Hyndman of Monash University, is one of the more useful R packages available on CRAN. Statistical forecasting (the process of predicting the future value of a time series) is used in just about every realm of data analysis, whether it's trying to predict a future stock price or trying to anticipate changes in the weather. If you're looking to learn about forecasting, a great place to start is the online book Forecasting: Principles and Practice (by Hyndman and George Athanasopoulos), which walks you through the theory and practice, with many examples in R based on the forecast package. Topics covered include multiple regression, time series decomposition, exponential smoothing, and ARIMA models.

The forecast package itself recently received a major update, to version 7. One major new capability is the ability to easily chart forecasts using the ggplot2 package with the new autoplot function. For example:

```
library(forecast)
library(ggplot2)
fc <- forecast(fdeaths)
autoplot(fc)
```

You can also add forecasts to any ggplot using the new geom_forecast geom provided by the forecast package:

```
autoplot(mdeaths) + geom_forecast(h=36, level=c(50,80,95))
```

There have been several updates to the forecasting functions as well. The function for fitting linear models to time series data, tslm, has been rewritten to be more compatible with the standard lm function. It's now possible to forecast means (as well as medians) when using Box-Cox transformations. And you can now apply neural networks to time series data by building a nonlinear autoregressive model with the new nnetar function.
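The distinction between means and medians under a Box-Cox transformation is easy to miss, so here is a small base R simulation of it. This is a sketch of the underlying idea only, not the forecast package's code: the helper names `box_cox` and `inv_box_cox` and the values of `lambda`, `mu`, and `sigma` are assumptions chosen for illustration.

```r
# Box-Cox transform and its inverse (lambda = 0 is the log transform).
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}
inv_box_cox <- function(w, lambda) {
  if (lambda == 0) exp(w) else (lambda * w + 1)^(1 / lambda)
}

set.seed(7)
lambda <- 0.5
mu <- 10      # forecast mean on the transformed scale (illustrative)
sigma <- 1    # forecast standard deviation on the transformed scale

# Simulate the forecast distribution on the transformed scale,
# then move every draw back to the original scale.
w <- rnorm(1e5, mean = mu, sd = sigma)
y <- inv_box_cox(w, lambda)

c(back_transformed_mu = inv_box_cox(mu, lambda),
  median_of_y = median(y),
  mean_of_y = mean(y))
```

Simply back-transforming the point forecast lands on the median of the original-scale distribution. The mean sits higher when the distribution is right-skewed, which is why a separate adjustment is needed to forecast means.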

Those are just some of the highlights of the updates to the forecast package in version 7. For complete details, follow the links to Rob Hyndman's blog, below.

Hyndsight: forecast v7 and ggplot2 graphics ; Forecast v7 (part 2) (via traims)

by Max Kuhn: Director, Nonclinical Statistics, Pfizer

Many predictive and machine learning models have structural or *tuning* parameters that cannot be directly estimated from the data. For example, when using a *K*-nearest neighbor model, there is no analytical estimator for *K* (the number of neighbors). Typically, resampling is used to get good performance estimates of the model for a given set of values for *K* and the one associated with the best results is used. This is basically a grid search procedure. However, there are other approaches that can be used. I’ll demonstrate how Bayesian optimization and Gaussian process models can be used as an alternative.

To demonstrate, I’ll use the regression simulation system of Sapp et al. (2014) where the predictors (i.e. `x`’s) are independent Gaussian random variables with mean zero and a variance of 9. The prediction equation is:

x_1 + sin(x_2) + log(abs(x_3)) + x_4^2 + x_5*x_6 + I(x_7*x_8*x_9 < 0) + I(x_10 > 0) + x_11*I(x_11 > 0) + sqrt(abs(x_12)) + cos(x_13) + 2*x_14 + abs(x_15) + I(x_16 < -1) + x_17*I(x_17 < -1) - 2 * x_18 - x_19*x_20

The random error here is also Gaussian with mean zero and a variance of 9. This simulation is available in the `caret` package via a function called `SLC14_1`. First, we’ll simulate a training set of 250 data points and also a larger set that we will use to elucidate the true parameter surface:

```
> library(caret)
> set.seed(7210)
> train_dat <- SLC14_1(250)
> large_dat <- SLC14_1(10000)
```
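For readers who want to see the simulation spelled out, the prediction equation above can be written directly in base R. This is a sketch, not the `caret` source: `slc14_sketch` and the column names are hypothetical, and it will not reproduce the exact random draws that `SLC14_1` makes for the same seed.

```r
# Hypothetical re-implementation of the simulation described above.
# A variance of 9 corresponds to a standard deviation of 3.
slc14_sketch <- function(n, sd_x = 3, sd_err = 3) {
  x <- matrix(rnorm(n * 20, sd = sd_x), ncol = 20)
  y <- x[, 1] + sin(x[, 2]) + log(abs(x[, 3])) + x[, 4]^2 + x[, 5] * x[, 6] +
    (x[, 7] * x[, 8] * x[, 9] < 0) + (x[, 10] > 0) + x[, 11] * (x[, 11] > 0) +
    sqrt(abs(x[, 12])) + cos(x[, 13]) + 2 * x[, 14] + abs(x[, 15]) +
    (x[, 16] < -1) + x[, 17] * (x[, 17] < -1) - 2 * x[, 18] - x[, 19] * x[, 20] +
    rnorm(n, sd = sd_err)   # Gaussian random error
  dat <- data.frame(x)
  names(dat) <- sprintf("Var%02d", 1:20)   # column names are an assumption
  dat$y <- y
  dat
}

set.seed(7210)
train_sketch <- slc14_sketch(250)
dim(train_sketch)
```

The logical comparisons play the role of the indicator function `I()` in the equation, since R treats `TRUE` as 1 and `FALSE` as 0 in arithmetic.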

We will use a radial basis function support vector machine to model these data. For a fixed epsilon, the model will be tuned over the cost value and the radial basis kernel parameter, commonly denoted as `sigma`. Since we are simulating the data, we can figure out a good approximation to the relationship between these parameters and the root mean squared error (RMSE) of the model. Given our specific training set and the larger simulated sample, here is the RMSE surface for a wide range of values:

There is a wide range of parameter values that are associated with very low RMSE values in the northwest.

A simple way to get an initial assessment is to use random search where a set of random tuning parameter values are generated across a “wide range”. For an RBF SVM, `caret`’s `train` function defines wide as cost values between `2^c(-5, 10)` and `sigma` values inside the range produced by the `sigest` function in the `kernlab` package. This code will do 20 random sub-models in this range:

```
> rand_ctrl <- trainControl(method = "repeatedcv", repeats = 5,
+ search = "random")
>
> set.seed(308)
> rand_search <- train(y ~ ., data = train_dat,
+ method = "svmRadial",
+ ## Create 20 random parameter values
+ tuneLength = 20,
+ metric = "RMSE",
+ preProc = c("center", "scale"),
+ trControl = rand_ctrl)
```

`> rand_search`

```
Support Vector Machines with Radial Basis Function Kernel
250 samples
20 predictor
Pre-processing: centered (20), scaled (20)
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 226, 224, 224, 225, 226, 224, ...
Resampling results across tuning parameters:
sigma C RMSE Rsquared
0.01161955 42.75789360 10.50838 0.7299837
0.01357777 67.97672171 10.71276 0.7212605
0.01392676 828.08072944 10.75235 0.7195869
0.01394119 0.18386619 18.56921 0.2109284
0.01538656 0.05224914 19.33310 0.1890599
0.01711920 228.59215128 11.09522 0.7047713
0.01790202 0.78835920 16.78597 0.3217203
0.01936110 0.91401289 16.45485 0.3492278
0.02023763 0.07658831 19.03987 0.2081059
0.02690269 0.04128731 19.33974 0.2126950
0.02780880 0.64865483 16.52497 0.3545042
0.02920113 974.08943821 12.22906 0.6508754
0.02963586 1.19350198 15.46690 0.4407725
0.03370625 31.45179445 12.60653 0.6314384
0.03561750 0.04970422 19.23564 0.2306298
0.03752561 0.06592800 19.07130 0.2375616
0.03783570 398.44599747 12.92958 0.6143790
0.04534046 3.91017571 13.56612 0.5798001
0.05171719 296.65916049 13.88865 0.5622445
0.06482201 47.31716568 14.66904 0.5192667
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.01161955 and C
= 42.75789.
```

`> ggplot(rand_search) + scale_x_log10() + scale_y_log10()`

`> getTrainPerf(rand_search)`

```
TrainRMSE TrainRsquared method
1 10.50838 0.7299837 svmRadial
```

There are other approaches that we can take, including a more comprehensive grid search or using a nonlinear optimizer to find better values of cost and `sigma`. Another approach is to use Bayesian optimization to find good values for these parameters. This is an optimization scheme that uses Bayesian models based on Gaussian processes to predict good tuning parameters.

Gaussian Process (GP) regression is used to facilitate the Bayesian analysis. It creates a regression model to formalize the relationship between the outcome (RMSE, in this application) and the SVM tuning parameters. The standard assumption regarding normality of the residuals is used and, being a Bayesian model, the regression parameters also gain a prior distribution that is multivariate normal. The GP regression model uses a kernel basis expansion (much like the SVM model does) in order to allow the model to be nonlinear in the SVM tuning parameters. To do this, a radial basis function kernel is used for the covariance function of the multivariate normal prior and maximum likelihood is used to estimate the kernel parameters of the GP.

In the end, the GP regression model can take the current set of resampled RMSE values and make predictions over the entire space of potential cost and `sigma` parameters. The Bayesian machinery allows this prediction to have a *distribution*; for a given set of tuning parameters, we can obtain the estimated mean RMSE values as well as an estimate of the corresponding prediction variance. For example, if we were to use our data from the random search to build a GP model, the predicted mean RMSE would look like:

The darker regions indicate smaller RMSE values given the current resampling results. The predicted standard deviation of the RMSE is:

The prediction noise becomes larger (i.e. darker) as we move away from the current set of observed values.

(The `GPfit` package was used to create these models.)
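To make the mean and standard deviation predictions concrete, here is a one-dimensional GP regression sketch in base R. It is not how `GPfit` works internally: the helper names `rbf_kernel` and `gp_predict` are hypothetical, the kernel parameters are fixed by hand rather than estimated by maximum likelihood, and the observations are a toy function rather than resampled RMSE values.

```r
# Radial basis function covariance between two sets of 1-D points.
rbf_kernel <- function(a, b, length_scale = 1, signal_var = 1) {
  d2 <- outer(a, b, function(u, v) (u - v)^2)
  signal_var * exp(-d2 / (2 * length_scale^2))
}

# GP posterior mean and standard deviation at new points.
gp_predict <- function(x_obs, y_obs, x_new, noise_var = 1e-4, ...) {
  K <- rbf_kernel(x_obs, x_obs, ...) + diag(noise_var, length(x_obs))
  K_star <- rbf_kernel(x_new, x_obs, ...)
  y_mean <- mean(y_obs)                       # constant mean function
  pred_mean <- y_mean + as.vector(K_star %*% solve(K, y_obs - y_mean))
  pred_var <- diag(rbf_kernel(x_new, x_new, ...)) -
    rowSums((K_star %*% solve(K)) * K_star)
  list(mean = pred_mean, sd = sqrt(pmax(pred_var, 0)))
}

# Toy observations standing in for (tuning parameter, RMSE) pairs.
x_obs <- c(-4, -1, 0, 3, 6)
y_obs <- (x_obs - 2)^2
x_new <- seq(-5, 10, by = 0.5)

fit <- gp_predict(x_obs, y_obs, x_new, length_scale = 2, signal_var = 100)
head(data.frame(x = x_new, mean = fit$mean, sd = fit$sd))
```

The predicted standard deviation is close to zero at the observed points and grows with distance from them, which is the same pattern the standard deviation plot shows in two dimensions.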

To find good parameters to test, there are several approaches. This paper (pdf) outlines several but we will use the *confidence bound* approach. For any combination of cost and `sigma`, we can compute the lower confidence bound of the predicted RMSE. Since this takes the uncertainty of prediction into account it has the potential to produce better directions to take the optimization. Here is a plot of the confidence bound using a single standard deviation of the predicted mean:

Darker values indicate better conditions to explore. Since we know the true RMSE surface, we can see that the best region (the northwest) is estimated to be an interesting location to take the optimization. The optimizer would pick a good location based on this model and evaluate this as the next parameter value. This most recent configuration is added to the GP’s training set and the process continues for a pre-specified number of iterations.
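The selection step itself is simple once the GP has produced a mean and a standard deviation for each candidate. The sketch below uses made-up prediction values purely for illustration (they are not output from any model in this post), and `lower_confidence_bound` is a hypothetical helper.

```r
# Lower confidence bound for a quantity we want to minimize (RMSE).
lower_confidence_bound <- function(pred_mean, pred_sd, kappa = 1) {
  pred_mean - kappa * pred_sd
}

# Made-up GP predictions for four candidate settings (illustrative only).
candidates <- data.frame(
  logC      = c(0, 5, 10, 15),
  pred_mean = c(14.0, 10.5, 10.0, 10.8),
  pred_sd   = c(0.2, 0.3, 0.4, 1.5)
)
candidates$lcb <- lower_confidence_bound(candidates$pred_mean,
                                         candidates$pred_sd)

# The next point to evaluate is the one with the smallest bound.
next_point <- candidates[which.min(candidates$lcb), ]
next_point
```

In this toy table the candidate with the best predicted mean is `logC = 10`, but the bound prefers `logC = 15` because its prediction is much less certain. Minimizing `mean - kappa * sd` for RMSE is equivalent to maximizing an upper confidence bound on the negative RMSE, which is the form an optimizer that maximizes its objective would use.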

Yachen Yan created an R package for Bayesian optimization. He also made a modification so that we can use our initial random search as the substrate to the first GP used. To search a much wider parameter space, our code looks like:

```
> ## Define the resampling method
> ctrl <- trainControl(method = "repeatedcv", repeats = 5)
>
> ## Use this function to optimize the model. The two parameters are
> ## evaluated on the log scale given their range and scope.
> svm_fit_bayes <- function(logC, logSigma) {
+ ## Use the same model code but for a single (C, sigma) pair.
+ txt <- capture.output(
+ mod <- train(y ~ ., data = train_dat,
+ method = "svmRadial",
+ preProc = c("center", "scale"),
+ metric = "RMSE",
+ trControl = ctrl,
+ tuneGrid = data.frame(C = exp(logC), sigma = exp(logSigma)))
+ )
+ ## The function wants to _maximize_ the outcome so we return
+ ## the negative of the resampled RMSE value. `Pred` can be used
+ ## to return predicted values but we'll avoid that and use zero
+ list(Score = -getTrainPerf(mod)[, "TrainRMSE"], Pred = 0)
+ }
>
> ## Define the bounds of the search.
> lower_bounds <- c(logC = -5, logSigma = -9)
> upper_bounds <- c(logC = 20, logSigma = -0.75)
> bounds <- list(logC = c(lower_bounds[1], upper_bounds[1]),
+ logSigma = c(lower_bounds[2], upper_bounds[2]))
>
> ## Create a grid of values as the input into the BO code
> initial_grid <- rand_search$results[, c("C", "sigma", "RMSE")]
> initial_grid$C <- log(initial_grid$C)
> initial_grid$sigma <- log(initial_grid$sigma)
> initial_grid$RMSE <- -initial_grid$RMSE
> names(initial_grid) <- c("logC", "logSigma", "Value")
>
> ## Run the optimization with the initial grid and do
> ## 30 iterations. We will choose new parameter values
> ## using the upper confidence bound using 1 std. dev.
>
> library(rBayesianOptimization)
>
> set.seed(8606)
> ba_search <- BayesianOptimization(svm_fit_bayes,
+ bounds = bounds,
+ init_grid_dt = initial_grid,
+ init_points = 0,
+ n_iter = 30,
+ acq = "ucb",
+ kappa = 1,
+ eps = 0.0,
+ verbose = TRUE)
```

```
20 points in hyperparameter space were pre-sampled
elapsed = 1.53 Round = 21 logC = 5.4014 logSigma = -5.8974 Value = -10.8148
elapsed = 1.54 Round = 22 logC = 4.9757 logSigma = -5.0449 Value = -9.7936
elapsed = 1.42 Round = 23 logC = 5.7551 logSigma = -5.0244 Value = -9.8128
elapsed = 1.30 Round = 24 logC = 5.2754 logSigma = -4.9678 Value = -9.7530
elapsed = 1.39 Round = 25 logC = 5.3009 logSigma = -5.0921 Value = -9.5516
elapsed = 1.48 Round = 26 logC = 5.3240 logSigma = -5.2313 Value = -9.6571
elapsed = 1.39 Round = 27 logC = 5.3750 logSigma = -5.1152 Value = -9.6619
elapsed = 1.44 Round = 28 logC = 5.2356 logSigma = -5.0969 Value = -9.4167
elapsed = 1.38 Round = 29 logC = 11.8347 logSigma = -5.1074 Value = -9.6351
elapsed = 1.42 Round = 30 logC = 15.7494 logSigma = -5.1232 Value = -9.4243
elapsed = 25.24 Round = 31 logC = 14.6657 logSigma = -7.9164 Value = -8.8410
elapsed = 32.60 Round = 32 logC = 18.3793 logSigma = -8.1083 Value = -8.7139
elapsed = 1.86 Round = 33 logC = 20.0000 logSigma = -5.6297 Value = -9.0580
elapsed = 0.97 Round = 34 logC = 20.0000 logSigma = -1.5768 Value = -19.2183
elapsed = 5.92 Round = 35 logC = 17.3827 logSigma = -6.6880 Value = -9.0224
elapsed = 18.01 Round = 36 logC = 20.0000 logSigma = -7.6071 Value = -8.5728
elapsed = 114.49 Round = 37 logC = 16.0079 logSigma = -9.0000 Value = -8.7058
elapsed = 89.31 Round = 38 logC = 12.8319 logSigma = -9.0000 Value = -8.6799
elapsed = 99.29 Round = 39 logC = 20.0000 logSigma = -9.0000 Value = -8.5596
elapsed = 106.88 Round = 40 logC = 14.1190 logSigma = -9.0000 Value = -8.5150
elapsed = 4.84 Round = 41 logC = 13.4694 logSigma = -6.5271 Value = -8.9728
elapsed = 108.37 Round = 42 logC = 19.0216 logSigma = -9.0000 Value = -8.7461
elapsed = 52.43 Round = 43 logC = 13.5273 logSigma = -8.5130 Value = -8.8728
elapsed = 39.69 Round = 44 logC = 20.0000 logSigma = -8.3288 Value = -8.4956
elapsed = 5.99 Round = 45 logC = 20.0000 logSigma = -6.7208 Value = -8.9455
elapsed = 113.01 Round = 46 logC = 14.9611 logSigma = -9.0000 Value = -8.7576
elapsed = 27.45 Round = 47 logC = 19.6181 logSigma = -7.9872 Value = -8.6186
elapsed = 116.00 Round = 48 logC = 17.3060 logSigma = -9.0000 Value = -8.6820
elapsed = 2.26 Round = 49 logC = 14.2698 logSigma = -5.8297 Value = -9.1837
elapsed = 64.50 Round = 50 logC = 20.0000 logSigma = -8.6438 Value = -8.6914
Best Parameters Found:
Round = 44 logC = 20.0000 logSigma = -8.3288 Value = -8.4956
```
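Since the search ran on the natural-log scale, the values in the "Best Parameters Found" line need to be exponentiated before they can be read as a cost and a kernel parameter. A quick conversion, with the two numbers copied from that line of output:

```r
# Values copied from the "Best Parameters Found" line of the output.
best <- c(logC = 20.0000, logSigma = -8.3288)

# Back-transform from the log scale to the original parameter scale.
best_original <- exp(best)
names(best_original) <- c("C", "sigma")
format(best_original, scientific = FALSE, digits = 4)
```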

Animate the search!

The final settings were found at iteration 44 with a cost setting of 485,165,195 and `sigma` = 0.000241 (that is, exp(-8.3288)). I would have never thought to evaluate a cost parameter so large and the algorithm wants to make it even larger. Does it really work?

We can fit a model based on the new configuration and compare it to random search in terms of the resampled RMSE and the RMSE on the test set:

```
> set.seed(308)
> final_search <- train(y ~ ., data = train_dat,
+ method = "svmRadial",
+ tuneGrid = data.frame(C = exp(ba_search$Best_Par["logC"]),
+ sigma = exp(ba_search$Best_Par["logSigma"])),
+ metric = "RMSE",
+ preProc = c("center", "scale"),
+ trControl = ctrl)
> compare_models(final_search, rand_search)
```

```
One Sample t-test
data: x
t = -9.0833, df = 49, p-value = 4.431e-12
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-2.34640 -1.49626
sample estimates:
mean of x
-1.92133
```

`> postResample(predict(rand_search, large_dat), large_dat$y)`

```
RMSE Rsquared
10.1112280 0.7648765
```

`> postResample(predict(final_search, large_dat), large_dat$y)`

```
RMSE Rsquared
8.2668843 0.8343405
```

Much Better!

Thanks to Yachen Yan for making the `rBayesianOptimization` package.

by John Mount Ph. D.

Data Scientist at Win-Vector LLC

In her series on principal components analysis for regression in R, Win-Vector LLC's Dr. Nina Zumel broke the demonstration down into the following pieces:

- Part 1: the proper preparation of data and use of principal components analysis (particularly for supervised learning or regression).
- Part 2: the introduction of *y*-aware scaling to direct the principal components analysis to preserve variation correlated with the outcome we are trying to predict.
- And now Part 3: how to pick the number of components to retain for analysis.

In the earlier parts Dr. Zumel demonstrates common poor practice versus best practice and quantifies the degree of available improvement. In part 3, she moves from the usual "pick the number of components by eyeballing it" non-advice and teaches decisive decision procedures. For picking the number of components to retain for analysis there are a number of standard techniques in the literature including:

- Pick 2, as that is all you can legibly graph.
- Pick enough to cover some fixed fraction of the variation (say 95%).
- (for variance scaled data only) Retain components with singular values at least 1.0.
- Look for a "knee in the curve" (the curve being the plot of the singular value magnitudes).
- Perform a statistical test to see which singular values are larger than we would expect from an appropriate null hypothesis or noise process.
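The last technique in the list can be sketched as a permutation test in base R. This is a generic *x*-only version on simulated data, not Dr. Zumel's *y*-aware procedure: each column is shuffled independently to destroy the correlation structure, and a component is kept only while its singular value exceeds the 95th percentile of the shuffled results. The data layout (two latent signals plus four noise columns) is an assumption made for the example.

```r
set.seed(2016)
n <- 200
z1 <- rnorm(n)   # first latent signal
z2 <- rnorm(n)   # second latent signal

# Three noisy copies of each signal plus four pure-noise columns.
x <- cbind(
  sapply(1:3, function(i) z1 + 0.3 * rnorm(n)),
  sapply(1:3, function(i) z2 + 0.3 * rnorm(n)),
  matrix(rnorm(n * 4), ncol = 4)
)

observed <- prcomp(x, center = TRUE, scale. = TRUE)$sdev

# Null distribution: shuffle every column independently, then redo the PCA.
permuted <- replicate(200, {
  x_perm <- apply(x, 2, sample)
  prcomp(x_perm, center = TRUE, scale. = TRUE)$sdev
})
threshold <- apply(permuted, 1, quantile, probs = 0.95)

# Keep the leading run of components that beat their null threshold.
significant <- observed > threshold
n_keep <- if (all(significant)) length(observed) else which(!significant)[1] - 1
n_keep
```

Because the whole procedure is just a few function calls, it is easy to rerun with a different null model or threshold and compare the results.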

Dr. Zumel shows that the last method (designing a formal statistical test) is particularly easy to encode as a permutation test in the *y*-aware setting (there is also an obvious, similarly good bootstrap test). This is well-founded and pretty much state of the art. It is also a great example of why to use a scriptable analysis platform (such as R), as it is easy to wrap arbitrarily complex methods into functions and then directly perform empirical tests on these methods. This "broken stick" type test yields the following graph, which identifies five principal components as being significant:

However, Dr. Zumel goes on to show that in a supervised learning or regression setting we can further exploit the structure of the problem and replace the traditional component magnitude tests with simple model fit significance pruning. The significance method in this case gets the stronger result of finding the two principal components that encode the known even and odd loadings of the example problem:

In fact that is sort of her point: significance pruning either on the original variables or on the derived latent components is enough to give us the right answer. In general, we get much better results when (in a supervised learning or regression situation) we use knowledge of the dependent variable (the "*y*" or outcome) and do *all* of the following:

- Fit model and significance prune incoming variables.
- Convert incoming variables into consistent response units by *y*-aware scaling.
- Fit model and significance prune resulting latent components.

The above will become much clearer and much more specific if you click here to read part 3.
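As a rough illustration of the *y*-aware scaling step, here is a base R sketch on simulated data. The idea is to regress the outcome on each variable separately and rescale the centered variable by the fitted slope, so that a unit change in any rescaled variable corresponds to a unit change in *y*. The helper `y_aware_scale` and the two toy variables are assumptions for this example, not code from the series.

```r
set.seed(42)
n <- 300
signal <- rnorm(n)
dat <- data.frame(
  x_small = signal + rnorm(n),   # related to y, measured in small units
  x_big   = 1000 * rnorm(n)      # unrelated to y, measured in large units
)
y <- 2 * signal + rnorm(n)

# Rescale one variable into "y units" using its single-variable slope.
y_aware_scale <- function(x, y) {
  slope <- coef(lm(y ~ x))[["x"]]
  slope * (x - mean(x))
}
scaled <- as.data.frame(lapply(dat, y_aware_scale, y = y))

# Unscaled x-only PCA versus PCA on the y-aware scaled variables.
pca_x_only  <- prcomp(dat, center = TRUE, scale. = FALSE)
pca_y_aware <- prcomp(scaled, center = FALSE, scale. = FALSE)

abs(pca_x_only$rotation[, 1])    # first component follows x_big
abs(pca_y_aware$rotation[, 1])   # first component follows x_small
```

In the *x*-only analysis the first component is driven by whichever variable has the largest units, while after *y*-aware scaling it is driven by the variable that actually moves with the outcome.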

by John Mount Ph. D.

Data Scientist at Win-Vector LLC

In part 2 of her series on Principal Components Regression Dr. Nina Zumel illustrates so-called *y*-aware techniques. These often neglected methods use the fact that for predictive modeling problems we know the dependent variable, outcome or *y*, so we can use this during data preparation *in addition to* using it during modeling. Dr. Zumel shows the incorporation of *y*-aware preparation into Principal Components Analyses can capture more of the problem structure in fewer variables. Such methods include:

- Effects based variable pruning
- Significance based variable pruning
- Effects based variable scaling.

This recovers more domain structure and leads to better models. Using the foundation set in the first article Dr. Zumel quickly shows how to move from a traditional *x*-only analysis that fails to preserve a domain-specific relation of two variables to outcome to a *y*-aware analysis that preserves the relation. Or in other words how to move away from a middling result where different values of y (rendered as three colors) are hopelessly intermingled when plotted against the first two found latent variables as shown below.

Dr. Zumel shows how to perform a decisive analysis where *y* is somewhat sortable by each of the first two latent variables *and* the first two latent variables capture complementary effects, making them good mutual candidates for further modeling (as shown below).

Click here (part 2 *y*-aware methods) for the discussion, examples, and references. Part 1 (*x* only methods) can be found here.

Apache Spark, the open-source cluster computing framework, will soon see a major update with the upcoming release of Spark 2.0. This update promises to be faster than Spark 1.6, thanks to a run-time compiler that generates optimized bytecode. It also promises to be easier for developers to use, with streamlined APIs and a more complete SQL implementation. (Here's a tutorial on using SQL with Spark.) Spark 2.0 will also include a new "structured streaming" API, which will allow developers to write algorithms for streaming data without having to worry about the fact that streaming data is always incomplete; algorithms written for complete DataFrame objects will work for streams as well.

This update also includes some news for R users. First, the DataFrame object continues to be the primary interface for R (and Python) users. Although the DataSets structure in Spark is more general, using the single-table DataFrame construct makes sense for R and Python, which have analogous native (or near-native, in Python's case) data structures. In addition, Spark 2.0 is set to add a few new distributed statistical modeling algorithms: generalized linear models (in addition to the Normal least-squares and logistic regression models in Spark 1.6); Naive Bayes; survival (censored) regression; and K-means clustering. The addition of survival regression is particularly interesting. It's a type of model used in situations where the outcome isn't always completely known: for example, some (but not all) patients may have experienced remission by the end of a cancer study. It's also used for reliability analysis and lifetime estimation in manufacturing, where some (but not all) parts may have failed by the end of the observation period. To my knowledge, this is the first distributed implementation of the survival regression algorithm.
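To see why censoring needs special handling, consider a small base R simulation of the manufacturing case. This is not the Spark algorithm: it assumes a much simpler exponential lifetime model, and the failure rate and observation window are made-up values. The point is only that ignoring the parts that have not yet failed badly distorts the estimate.

```r
set.seed(20)
n <- 2000
true_rate <- 0.1     # failures per unit time (illustrative)
study_end <- 8       # observation stops here (illustrative)

lifetime <- rexp(n, rate = true_rate)
observed_time <- pmin(lifetime, study_end)   # what we actually get to see
failed <- lifetime <= study_end              # FALSE means censored

# Naive estimate: use only the parts that failed during the study.
naive_rate <- 1 / mean(observed_time[failed])

# Censoring-aware maximum likelihood estimate for the exponential model:
# number of observed failures divided by total time at risk.
censored_mle <- sum(failed) / sum(observed_time)

c(true = true_rate, naive = naive_rate, censored_mle = censored_mle)
```

The naive estimate is far too high because it only sees the short-lived parts, while the censoring-aware estimate also counts the time survived by parts that were still working when observation ended.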

For R users, these models can be applied to large data sets stored in Spark DataFrames, and then computed using the Spark distributed computing framework. Access to the algorithms is via the SparkR package, which hews closely to standard R interfaces for model training. You'll need to first create a SparkDataFrame object in R, as a reference to a Spark DataFrame. Then, to perform logistic regression (for example), you'll use R's standard glm function, using the SparkDataFrame object as the data argument. (This elegantly uses R's object-oriented dispatch architecture to call the Spark-specific GLM code.) The example below creates a Spark DataFrame, and then uses Spark to fit a logistic regression to it:

```
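df <- createDataFrame(sqlContext, iris)
# Logistic regression needs a two-class outcome, so keep two of the three species.
training <- filter(df, df$Species != "setosa")
model <- glm(Species ~ Sepal_Length + Sepal_Width, data = training, family = "binomial")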
```

The object model contains most (but not all) of the output created by R's traditional glm algorithm, so most standard R functions that work with GLM objects work here as well.
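For comparison, the same kind of fit can be run locally with R's ordinary `glm` on the built-in `iris` data frame. This sketch does not use Spark at all. It assumes the outcome is restricted to two species so that a binomial model applies, and it uses the local column names (`Sepal.Length` rather than `Sepal_Length`).

```r
# Keep two of the three species so the outcome is binary.
two_species <- droplevels(subset(iris, Species != "setosa"))

# Local logistic regression with the same formula structure.
local_model <- glm(Species ~ Sepal.Length + Sepal.Width,
                   data = two_species, family = binomial)

coef(local_model)
```

Functions such as `summary()` and `predict()` work on `local_model` in the usual way, which gives a baseline for checking which of them also behave as expected on the Spark-backed model object.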

For more on what's coming in Spark 2.0 check out the DataBricks blog post below, and the preview documentation for SparkR contains more info as well. Also, you might want to check out how you can use Spark on HDInsight with Microsoft R Server.

Databricks: Preview of Apache Spark 2.0 now on Databricks Community Edition

by John Mount Ph. D.

Data Scientist at Win-Vector LLC

Win-Vector LLC's Dr. Nina Zumel has just started a two part series on Principal Components Regression that we think is well worth your time. You can read her article here. Principal Components Regression (PCR) is the use of Principal Components Analysis (PCA) as a dimension reduction step prior to linear regression. It is one of the best known dimensionality reduction techniques and a staple procedure in many scientific fields. PCA is used because:

- It can find important latent structure and relations.
- It can reduce over fit.
- It can ease the curse of dimensionality.
- It is used in a ritualistic manner in many scientific disciplines. In some fields it is considered ignorant and uncouth to regress using original variables.

We often find ourselves having to remind readers that this last reason is not actually positive. The standard derivation of PCA involves trotting out the math and showing the determination of eigenvector directions. It yields visually attractive diagrams such as the following.

Wikipedia: PCA

And this leads to a deficiency in much of the teaching of the method: glossing over the operational consequences and outcomes of applying the method. The mathematics is important to the extent it allows you to reason about the appropriateness of the method, the consequences of the transform, and the pitfalls of the technique. The mathematics is also critical to the correct implementation, but that is what one hopes is *already* supplied in a reliable analysis platform (such as R).

Dr. Zumel uses the expressive and graphical power of R to work through the *use* of Principal Components Regression in an operational series of examples. She works through how Principal Components Regression is typically mis-applied and continues on to how to correctly apply it. Taking the extra time to work through the all too common errors allows her to demonstrate and quantify the benefits of correct technique. Dr. Zumel will soon follow part 1 with a shorter part 2 article demonstrating important "*y*-aware" techniques that squeeze much more modeling power out of your data in predictive analytic situations (which is what regression actually is). Some of the methods are already in the literature, but are still not used widely enough. We hope the demonstrated techniques and included references will give you a perspective to improve how you use or even teach Principal Components Regression. Please read on here.

*by Katherine Zhao, Hong Lu, Zhongmou Li, Data Scientists at Microsoft*

Bicycle rental has become popular as a convenient and environmentally friendly transportation option. Accurate estimation of bike demand at different locations and different times would help bicycle-sharing systems better meet rental demand and allocate bikes to locations.

In this blog post, we walk through how to use Microsoft R Server (MRS) to build a regression model to predict bike rental demand. In the example below, we demonstrate an end-to-end machine learning solution development process in MRS, including data importing, data cleaning, feature engineering, parameter sweeping, and model training and evaluation.

The Bike Rental UCI dataset is used as the input raw data for this sample. This dataset is based on real-world data from the Capital Bikeshare company, which operates a bike rental network in Washington DC in the United States.

The dataset contains 17,379 rows and 17 columns, with each row representing the number of bike rentals within a specific hour of a day in the years 2011 or 2012. Weather conditions (such as temperature, humidity, and wind speed) are included in the raw data, and the dates are categorized as holiday vs. weekday, etc.

The field to predict is **cnt**, which contains a count value ranging from 1 to 977, representing the number of bike rentals within a specific hour.

In this example, we use historical bike rental counts as well as the weather condition data to predict the number of bike rentals within a specific hour in the future. We approach this problem as a regression problem, since the label column (number of rentals) contains continuous real numbers.

Along this line, we split the raw data into two parts: data records in year 2011 to learn the regression model, and data records in year 2012 to score and evaluate the model. Specifically, we employ the Decision Forest Regression algorithm as the regression model and build two models on different feature sets. Finally, we evaluate their prediction performance. We will elaborate the details in the following sections.

We build the models using the `RevoScaleR` library in MRS. The `RevoScaleR` library provides extremely fast statistical analysis on terabyte-class datasets without needing specialized hardware. Its distributed computing capabilities let the same `RevoScaleR` commands that manage and analyze local data run in a different (possibly remote) compute context. A wide range of `rx`-prefixed functions provide functionality for:

- Accessing external data sets (SAS, SPSS, ODBC, Teradata, and delimited and fixed format text) for analysis in R.
- Efficiently storing and retrieving data in a high-performance data file.
- Cleaning, exploring, and manipulating data.
- Fast, basic statistical analysis.
- Training and scoring advanced machine learning models.

Overall, there are five major steps in building this example with Microsoft R Server:

- Step 1: Import and Clean Data
- Step 2: Perform Feature Engineering
- Step 3: Prepare Training, Test and Score Datasets
- Step 4: Sweep Parameters and Train Regression Models
- Step 5: Test, Evaluate, and Compare Models

**Step 1: Import and Clean Data**

First, we import the Bike Rental UCI dataset. Since a small portion of the records in the dataset are missing, we use `rxDataStep()` to replace the missing records with the latest non-missing observations. `rxDataStep()` is a commonly used function for data manipulation. It transforms the input dataset chunk by chunk and saves the results to the output dataset.
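To make "replace the missing records with the latest non-missing observations" concrete, here is a minimal base R sketch of last observation carried forward. The `locf()` helper and the `temp` vector are hypothetical and only illustrate the idea; the post itself relies on `zoo::na.locf` inside `rxDataStep()`.

```r
# Last observation carried forward (LOCF) in base R.
# Illustrative only: locf() is a hypothetical helper, not part of RevoScaleR.
locf <- function(x) {
  # Position of the most recent non-missing value at each index (0 if none yet).
  idx <- cummax(ifelse(is.na(x), 0L, seq_along(x)))
  # Leading NAs have no earlier observation, so they stay NA.
  idx[idx == 0] <- NA
  x[idx]
}

temp <- c(0.24, NA, NA, 0.30, NA)  # hypothetical hourly temperatures
locf(temp)
```

Each `NA` is filled with the closest earlier value, so the two gaps after 0.24 take the value 0.24 and the final gap takes 0.30.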

    # Define the transformation function for rxDataStep.
    xform <- function(dataList) {
      # Identify the features with missing values.
      featureNames <- c("weathersit", "temp", "atemp", "hum", "windspeed", "cnt")
      # Use the "na.locf" function to carry forward the last observation.
      dataList[featureNames] <- lapply(dataList[featureNames], zoo::na.locf)
      # Return the data list.
      return(dataList)
    }

    # Use rxDataStep to replace missings with the latest non-missing observations.
    cleanXdf <- rxDataStep(inData = mergeXdf, outFile = outFileClean, overwrite = TRUE,
                           # Apply the "last observation carried forward" operation.
                           transformFunc = xform,
                           # Identify the features to apply the transformation.
                           transformVars = c("weathersit", "temp", "atemp", "hum", "windspeed", "cnt"),
                           # Drop the "dteday" feature.
                           varsToDrop = "dteday")

**Step 2: Perform Feature Engineering**

In addition to the original features in the raw data, we add the number of bikes rented in each of the previous 12 hours as features to provide better predictive power. We create a `computeLagFeatures()` helper function to compute the 12 lag features and use it as the transformation function in `rxDataStep()`.

Note that `rxDataStep()` processes data chunk by chunk, while lag feature computation requires data from previous rows. In `computeLagFeatures()`, we use the internal function `.rxSet()` to save the last *n* rows of a chunk to a variable **lagData**. When processing the next chunk, we use another internal function, `.rxGet()`, to retrieve the values stored in **lagData**.

    # Add number of bikes that were rented in each of
    # the previous 12 hours as 12 lag features.
    computeLagFeatures <- function(dataList) {
      # Total number of lags that need to be added.
      numLags <- length(nLagsVector)
      # Lag feature names as lagN.
      varLagNameVector <- paste("cnt_", nLagsVector, "hour", sep = "")

      # Set the value of an object "storeLagData" in the transform environment.
      if (!exists("storeLagData")) {
        lagData <- mapply(rep, dataList[[varName]][1], times = nLagsVector)
        names(lagData) <- varLagNameVector
        .rxSet("storeLagData", lagData)
      }

      if (!.rxIsTestChunk) {
        for (iL in 1:numLags) {
          # Number of rows in the current chunk.
          numRowsInChunk <- length(dataList[[varName]])
          nlags <- nLagsVector[iL]
          varLagName <- paste("cnt_", nlags, "hour", sep = "")
          # Retrieve lag data from the previous chunk.
          lagData <- .rxGet("storeLagData")
          # Concatenate lagData and the "cnt" feature.
          allData <- c(lagData[[varLagName]], dataList[[varName]])
          # Take the first N rows of allData, where N is
          # the total number of rows in the original dataList.
          dataList[[varLagName]] <- allData[1:numRowsInChunk]
          # Save last nlag rows as the new lagData to be used
          # to process in the next chunk.
          lagData[[varLagName]] <- tail(allData, nlags)
          .rxSet("storeLagData", lagData)
        }
      }
      return(dataList)
    }

    # Apply the "computeLagFeatures" on the bike data.
    lagXdf <- rxDataStep(inData = cleanXdf, outFile = outFileLag,
                         transformFunc = computeLagFeatures,
                         transformObjects = list(varName = "cnt",
                                                 nLagsVector = seq(12)),
                         transformVars = "cnt", overwrite = TRUE)
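The carry-over between chunks is the subtle part, so here is a small base R sketch of the same idea without RevoScaleR. The function `lag_in_chunks()` and the sample counts are hypothetical; a plain list of vectors stands in for the chunks that `rxDataStep()` would pass to the transform.

```r
# Compute a lag-k feature chunk by chunk, carrying the last k values
# of each chunk forward. All names are hypothetical; this only mimics
# the idea behind computeLagFeatures().
lag_in_chunks <- function(chunks, k) {
  # Seed the carry with the first observed value, as the post does.
  carry <- rep(chunks[[1]][1], k)
  out <- vector("list", length(chunks))
  for (i in seq_along(chunks)) {
    x <- chunks[[i]]
    all_data <- c(carry, x)
    # The first length(x) values are x shifted down by k rows.
    out[[i]] <- all_data[seq_along(x)]
    # Keep the last k values for the next chunk.
    carry <- tail(all_data, k)
  }
  unlist(out)
}

cnt <- c(16, 40, 32, 13, 1, 1, 2, 3)       # hypothetical hourly counts
chunks <- split(cnt, rep(1:2, each = 4))    # two chunks of four rows
lag_in_chunks(chunks, k = 2)
```

The result should be the same whether the data arrives in one chunk or several, which is exactly what the stored lag data is meant to guarantee.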

**Step 3: Prepare Training, Test and Score Datasets**

Before training the regression model, we split data into two parts: data records in year 2011 to learn the regression model, and data records in year 2012 to score and evaluate the model. In order to obtain the best combination of parameters for regression models, we further divide year 2011 data into training and test datasets: 80% of the records are randomly selected to train regression models with various combinations of parameters, and the remaining 20% are used to evaluate the models obtained and determine the optimal combination.

    # Split data by "yr" so that the training and test data contains records
    # for the year 2011 and the score data contains records for 2012.
    rxSplit(inData = lagXdf,
            outFilesBase = paste0(td, "/modelData"),
            splitByFactor = "yr",
            overwrite = TRUE,
            reportProgress = 0,
            verbose = 0)

    # Point to the .xdf files for the training & test and score set.
    trainTest <- RxXdfData(paste0(td, "/modelData.yr.0.xdf"))
    score <- RxXdfData(paste0(td, "/modelData.yr.1.xdf"))

    # Randomly split records for the year 2011 into training and test sets
    # for sweeping parameters.
    # 80% of data as training and 20% as test.
    rxSplit(inData = trainTest,
            outFilesBase = paste0(td, "/sweepData"),
            outFileSuffixes = c("Train", "Test"),
            splitByFactor = "splitVar",
            overwrite = TRUE,
            transforms = list(splitVar = factor(sample(c("Train", "Test"),
                                                       size = .rxNumRows,
                                                       replace = TRUE,
                                                       prob = c(.80, .20)),
                                                levels = c("Train", "Test"))),
            rngSeed = 17,
            consoleOutput = TRUE)

    # Point to the .xdf files for the training and test set.
    train <- RxXdfData(paste0(td, "/sweepData.splitVar.Train.xdf"))
    test <- RxXdfData(paste0(td, "/sweepData.splitVar.Test.xdf"))

**Step 4: Sweep Parameters and Train Regression Models**

In this step, we construct two training datasets based on the same raw input data, but with different sets of features:

- Set A = weather + holiday + weekday + weekend features for the predicted day
- Set B = Set A + number of bikes rented in each of the previous 12 hours, which captures very recent demand for the bikes

In order to perform parameter sweeping, we create a helper function to evaluate the performance of a model trained with a given combination of number of trees and maximum depth. We use *Root Mean Squared Error (RMSE)* as the evaluation metric.

    # Define a function to train and test models with given parameters
    # and then return Root Mean Squared Error (RMSE) as the performance metric.
    TrainTestDForestfunction <- function(trainData, testData, form, numTrees, maxD) {
      # Build decision forest regression models with given parameters.
      dForest <- rxDForest(form, data = trainData,
                           method = "anova",
                           maxDepth = maxD,
                           nTree = numTrees,
                           seed = 123)
      # Predict the number of bike rentals on the test data.
      rxPredict(dForest, data = testData,
                predVarNames = "cnt_Pred",
                residVarNames = "cnt_Resid",
                overwrite = TRUE,
                computeResiduals = TRUE)
      # Calculate the RMSE.
      result <- rxSummary(~ cnt_Resid, data = testData,
                          summaryStats = "Mean",
                          transforms = list(cnt_Resid = cnt_Resid^2)
                          )$sDataFrame
      # Return a vector of the number of trees, maximum depth and RMSE.
      return(c(numTrees, maxD, sqrt(result[1,2])))
    }

The following is another helper function to sweep and select the optimal parameter combination. Under a local parallel compute context (`rxSetComputeContext(RxLocalParallel())`), `rxExec()` executes multiple runs of model training and evaluation with different parameters in parallel, which significantly speeds up parameter sweeping. When used in a compute context with multiple nodes, e.g. high-performance computing clusters and Hadoop, `rxExec()` can distribute a large number of tasks to the nodes and run the tasks in parallel.

    # Define a function to sweep and select the optimal parameter combination.
    findOptimal <- function(DFfunction, train, test, form, nTreeArg, maxDepthArg) {
      # Sweep different combination of parameters.
      sweepResults <- rxExec(DFfunction, train, test, form,
                             rxElemArg(nTreeArg), rxElemArg(maxDepthArg))
      # Sort the nested list by the third element (RMSE) in the list in ascending order.
      sortResults <- sweepResults[order(unlist(lapply(sweepResults, `[[`, 3)))]
      # Select the optimal parameter combination.
      nTreeOptimal <- sortResults[[1]][1]
      maxDepthOptimal <- sortResults[[1]][2]
      # Return the optimal values.
      return(c(nTreeOptimal, maxDepthOptimal))
    }

A large number of parameter combinations are usually swept in a real modeling process. For demonstration purposes, we use 9 combinations of parameters in this example.
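The post does not show how the 9 combinations are built, so the following base R sketch is only one plausible way to do it. The grid values and the stand-in scoring function `fake_train_test()` are made up for illustration, and `Map()` plays the role that, as far as I understand it, `rxExec()` with `rxElemArg()` plays in MRS: one call per pair of argument values.

```r
# Build a 3 x 3 grid of hypothetical parameter values (9 combinations).
grid <- expand.grid(numTrees = c(10, 20, 40), maxDepth = c(4, 8, 16))

# Stand-in for a train-and-evaluate step: returns c(numTrees, maxDepth, RMSE).
# The RMSE formula below is invented purely so the example runs.
fake_train_test <- function(numTrees, maxD) {
  c(numTrees, maxD, 100 / sqrt(numTrees) + abs(maxD - 8))
}

# Evaluate every combination; Map pairs the two argument vectors
# element by element.
sweep_results <- Map(fake_train_test, grid$numTrees, grid$maxDepth)

# Sort by the third element (RMSE) and keep the best combination.
sorted <- sweep_results[order(sapply(sweep_results, `[[`, 3))]
best <- sorted[[1]][1:2]
best
```

Expanding the grid first and then passing the two columns in parallel keeps the sweep function itself simple: it only ever sees one pair of values at a time.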

Next, we find the best parameter combination and get the optimal regression model for each training dataset. For simplicity, we only present the process for Set A.

    # Set A = weather + holiday + weekday + weekend features for the predicted day.
    # Build a formula for the regression model and remove the "yr",
    # which is used to split the training and test data.
    newHourFeatures <- paste("cnt_", seq(12), "hour", sep = "")  # Define the hourly lags.
    formA <- formula(train, depVars = "cnt", varsToDrop = c("splitVar", newHourFeatures))

    # Find the optimal parameters for Set A.
    optimalResultsA <- findOptimal(TrainTestDForestfunction,
                                   train, test, formA,
                                   numTreesToSweep,
                                   maxDepthToSweep)

    # Use the optimal parameters to fit a model for feature Set A.
    nTreeOptimalA <- optimalResultsA[[1]]
    maxDepthOptimalA <- optimalResultsA[[2]]
    dForestA <- rxDForest(formA, data = trainTest,
                          method = "anova",
                          maxDepth = maxDepthOptimalA,
                          nTree = nTreeOptimalA,
                          importance = TRUE, seed = 123)

Finally, we plot the dot charts of the variable importance and the out-of-bag error rates for the two optimal decision forest models.

**Step 5: Test, Evaluate, and Compare Models**

In this step, we use the `rxPredict()` function to predict the bike rental demand on the score dataset, and compare the two regression models over three performance metrics: *Mean Absolute Error (MAE)*, *Root Mean Squared Error (RMSE)*, and *Relative Absolute Error (RAE)*.

    # Set A: Predict the number of rentals on the score dataset.
    rxPredict(dForestA, data = score,
              predVarNames = "cnt_Pred_A",
              residVarNames = "cnt_Resid_A",
              overwrite = TRUE, computeResiduals = TRUE)

    # Set B: Predict the number of rentals on the score dataset.
    rxPredict(dForestB, data = score,
              predVarNames = "cnt_Pred_B",
              residVarNames = "cnt_Resid_B",
              overwrite = TRUE, computeResiduals = TRUE)

    # Calculate three statistical metrics:
    # Mean Absolute Error (MAE),
    # Root Mean Squared Error (RMSE), and
    # Relative Absolute Error (RAE).
    sumResults <- rxSummary(~ cnt_Resid_A_abs + cnt_Resid_A_2 + cnt_rel_A +
                              cnt_Resid_B_abs + cnt_Resid_B_2 + cnt_rel_B,
                            data = score, summaryStats = "Mean",
                            transforms = list(cnt_Resid_A_abs = abs(cnt_Resid_A),
                                              cnt_Resid_A_2 = cnt_Resid_A^2,
                                              cnt_rel_A = abs(cnt_Resid_A)/cnt,
                                              cnt_Resid_B_abs = abs(cnt_Resid_B),
                                              cnt_Resid_B_2 = cnt_Resid_B^2,
                                              cnt_rel_B = abs(cnt_Resid_B)/cnt)
                            )$sDataFrame

    # Add row names.
    features <- c("baseline: weather + holiday + weekday + weekend features for the predicted day",
                  "baseline + previous 12 hours demand")

    # List all metrics in a data frame.
    metrics <- data.frame(Features = features,
                          MAE = c(sumResults[1, 2], sumResults[4, 2]),
                          RMSE = c(sqrt(sumResults[2, 2]), sqrt(sumResults[5, 2])),
                          RAE = c(sumResults[3, 2], sumResults[6, 2]))
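For readers without MRS at hand, the same metrics can be sketched in base R on a handful of made-up numbers. The vectors `cnt` and `pred` are hypothetical. Note that the third metric in the code above is the mean of |residual| / observed count; a common textbook definition of Relative Absolute Error divides the total absolute error by the total absolute deviation from the mean instead, so both are shown here for comparison.

```r
# Hypothetical observed counts and predictions from some model.
cnt  <- c(16, 40, 32, 13, 110, 250)
pred <- c(20, 35, 30, 25, 100, 230)
resid <- cnt - pred

mae  <- mean(abs(resid))        # Mean Absolute Error
rmse <- sqrt(mean(resid^2))     # Root Mean Squared Error
# The post's third metric: mean of |residual| / observed count.
rel  <- mean(abs(resid) / cnt)
# A common textbook definition of RAE, shown for comparison:
# total absolute error relative to always predicting the mean.
rae_textbook <- sum(abs(resid)) / sum(abs(cnt - mean(cnt)))

round(c(MAE = mae, RMSE = rmse, Rel = rel, RAE_textbook = rae_textbook), 4)
```

Because the post's version divides by the observed count, it only works when no count is zero, which holds for this dataset since **cnt** starts at 1.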

Based on all three metrics listed below, the regression model built on feature set B outperforms the one built on feature set A. This result is not surprising: the variable importance chart shows that the lag features play a critical part in the regression model, so adding this set of features leads to better performance.

Feature Set | MAE | RMSE | RAE |
---|---|---|---|
A | 101.34848 | 146.9973 | 0.9454142 |
B | 62.48245 | 105.6198 | 0.3737669 |

Follow this link for source code and datasets: Bike Rental Demand Estimation with Microsoft R Server

by Yuzhou Song, Microsoft Data Scientist

R is an open source, statistical programming language with millions of users in its community. However, a well-known weakness of R is that it is both single threaded and memory bound, which limits its ability to process big data. With Microsoft R Server (MRS), the enterprise grade distribution of R for advanced analytics, users can continue to work in their preferred R environment with the following benefits: the ability to scale to data of any size, and potential speed increases of up to one hundred times over open source R.

In this article, we give a walk-through on how to build a gradient boosted tree using MRS. We use a simple fraud data set having approximately 1 million records and 9 columns. The last column “fraudRisk” is the tag: 0 stands for non-fraud and 1 stands for fraud. The following is a snapshot of the data.

**Step 1: Import the Data**

First, load the “RevoScaleR” package and specify the directory and name of the data file. Note that this demo is done in local R, so I need to load the “RevoScaleR” package explicitly.

    library(RevoScaleR)
    data.path <- "./"
    file.name <- "ccFraud.csv"
    fraud_csv_path <- file.path(data.path, file.name)

Next, we make a data source using RxTextData function. The output of RxTextData function is a data source declaring type of each field, name of each field and location of the data.

    colClasses <- c("integer", "factor", "factor", "integer",
                    "numeric", "integer", "integer", "numeric", "factor")
    names(colClasses) <- c("custID", "gender", "state", "cardholder",
                           "balance", "numTrans", "numIntlTrans",
                           "creditLine", "fraudRisk")
    fraud_csv_source <- RxTextData(file = fraud_csv_path, colClasses = colClasses)

With the data source (“fraud_csv_source” above), we are able to take it as input to other functions, just like inputting a data frame into regular R functions. For example, we can put it into rxGetInfo function to check the information of the data:

    rxGetInfo(fraud_csv_source, getVarInfo = TRUE, numRows = 5)

and you will get the following:

**Step 2: Process Data**

Next, we demonstrate how to use an important function, rxDataStep, to process data before training. In general, the rxDataStep function transforms data from an input data set to an output data set. It has three main arguments: inData, outFile and transformFunc. inData takes either a data source (made by RxTextData as shown in step 1) or a data frame. outFile takes a data source that specifies the output file name, schema and location. If outFile is empty, rxDataStep returns a data frame. transformFunc takes a function that rxDataStep uses to do the transformation. If that function needs objects beyond the input data, you may supply them in the transformObjects argument.

Here, we make a training flag using rxDataStep function as an example. The data source fraud_csv_source made in step 1 will be used for inData. We create an output data source specifying output file name “ccFraudFlag.csv”:

    fraud_flag_csv_source <- RxTextData(file = file.path(data.path,
                                                         "ccFraudFlag.csv"))

Also, we create a simple transformation function called “make_train_flag”. It creates a training flag which will be used to split the data into training and testing sets:

    make_train_flag <- function(data) {
      data <- as.data.frame(data)
      set.seed(34)
      data$trainFlag <- sample(c(0, 1), size = nrow(data),
                               replace = TRUE, prob = c(0.3, 0.7))
      return(data)
    }

Then, use rxDataStep to complete the transformation:

    rxDataStep(inData = fraud_csv_source,
               outFile = fraud_flag_csv_source,
               transformFunc = make_train_flag, overwrite = TRUE)

Again, we can check the output file information by using the rxGetInfo function:

    rxGetInfo(fraud_flag_csv_source, getVarInfo = TRUE, numRows = 5)

We can see that the trainFlag column has been appended as the last column:

Based on the trainFlag, we split the data into training and testing sets. To do so, we need to specify the data sources for the output:

    train_csv_source <- RxTextData(
      file = file.path(data.path, "train.csv"))
    test_csv_source <- RxTextData(
      file = file.path(data.path, "test.csv"))

Instead of creating a transformation function, we can simply specify the rowSelection argument in rxDataStep function to select rows satisfying certain conditions:

    rxDataStep(inData = fraud_flag_csv_source,
               outFile = train_csv_source,
               reportProgress = 0,
               rowSelection = (trainFlag == 1),
               overwrite = TRUE)

    rxDataStep(inData = fraud_flag_csv_source,
               outFile = test_csv_source,
               reportProgress = 0,
               rowSelection = (trainFlag == 0),
               overwrite = TRUE)

A well-known problem for fraud data is the extremely skewed distribution of labels, i.e., most transactions are legitimate while only a very small proportion are fraudulent. In the original data, the good/bad ratio is about 15:1. Training a model directly on the original data results in poor performance, since the model is unable to find the proper boundary between “good” and “bad”. A simple but effective solution is to randomly down sample the majority class. The following down_sample transformation function down samples the majority to a good/bad ratio of 4:1. The down sample ratio is pre-selected based on prior knowledge but can be tuned by cross validation as well.

    down_sample <- function(data) {
      data <- as.data.frame(data)
      data_bad <- subset(data, fraudRisk == 1)
      data_good <- subset(data, fraudRisk == 0)
      # good to bad ratio 4:1
      rate <- nrow(data_bad) * 4 / nrow(data_good)
      set.seed(34)
      data_good$keepTag <- sample(c(0, 1), replace = TRUE,
                                  size = nrow(data_good), prob = c(1 - rate, rate))
      data_good_down <- subset(data_good, keepTag == 1)
      data_good_down$keepTag <- NULL
      data_down <- rbind(data_bad, data_good_down)
      data_down$trainFlag <- NULL
      return(data_down)
    }

Then, we specify the down sampled training data source and use rxDataStep function again to complete the down sampling process:

    train_downsample_csv_source <- RxTextData(
      file = file.path(data.path,
                       "train_downsample.csv"),
      colClasses = colClasses)

    rxDataStep(inData = train_csv_source,
               outFile = train_downsample_csv_source,
               transformFunc = down_sample,
               reportProgress = 0,
               overwrite = TRUE)

**Step 3: Training**

In this step, we take the down sampled data to train a gradient boosted tree. We first use rxGetVarNames function to get all variable names in training set. The input is still the data source of down sampled training data. Then we use it to create a formula which will be used later:

    training_vars <- rxGetVarNames(train_downsample_csv_source)
    training_vars <- training_vars[!(training_vars %in%
                                       c("fraudRisk", "custID"))]
    formula <- as.formula(paste("fraudRisk~",
                                paste(training_vars, collapse = "+")))

The rxBTrees function is used for building a gradient boosted tree model. The formula argument specifies the label column and predictor columns. The data argument takes a data source as input for training. The lossFunction argument specifies the distribution of the label column, i.e., “bernoulli” for numerical 0/1 regression, “gaussian” for numerical regression, and “multinomial” for classification with two or more classes. Here we choose “multinomial” since this is a 0/1 classification problem. Other parameters are pre-selected, not finely tuned:

    boosted_fit <- rxBTrees(formula = formula,
                            data = train_downsample_csv_source,
                            learningRate = 0.2,
                            minSplit = 10,
                            minBucket = 10,
                            # small number of tree for testing purpose
                            nTree = 20,
                            seed = 5,
                            lossFunction = "multinomial",
                            reportProgress = 0)

**Step 4: Prediction and Evaluation**

We use rxPredict function to predict on testing data set, but first use rxImport function to import testing data set:

    test_data <- rxImport(test_csv_source)

Then, we take the imported testing set and fitted model object as input for rxPredict function:

    predictions <- rxPredict(modelObject = boosted_fit,
                             data = test_data,
                             type = "response",
                             overwrite = TRUE,
                             reportProgress = 0)

In the rxPredict function, type="response" outputs predicted probabilities. Finally, we pick 0.5 as the threshold and evaluate the performance:

    threshold <- 0.5
    predictions <- data.frame(predictions$X1_prob)
    names(predictions) <- c("Boosted_Probability")
    predictions$Boosted_Prediction <- ifelse(
      predictions$Boosted_Probability > threshold, 1, 0)
    predictions$Boosted_Prediction <- factor(
      predictions$Boosted_Prediction,
      levels = c(1, 0))
    scored_test_data <- cbind(test_data, predictions)

    evaluate_model <- function(data, observed, predicted) {
      confusion <- table(data[[observed]],
                         data[[predicted]])
      print(confusion)
      # Index by label so that class 1 (fraud) is the positive class
      # regardless of the order of the rows and columns in the table.
      tp <- confusion["1", "1"]
      fn <- confusion["1", "0"]
      fp <- confusion["0", "1"]
      tn <- confusion["0", "0"]
      accuracy <- (tp + tn) / (tp + fn + fp + tn)
      precision <- tp / (tp + fp)
      recall <- tp / (tp + fn)
      fscore <- 2 * (precision * recall) / (precision + recall)
      metrics <- c("Accuracy" = accuracy,
                   "Precision" = precision,
                   "Recall" = recall,
                   "F-Score" = fscore)
      return(metrics)
    }

    roc_curve <- function(data, observed, predicted) {
      data <- data[, c(observed, predicted)]
      data[[observed]] <- as.numeric(
        as.character(data[[observed]]))
      rxRocCurve(actualVarName = observed,
                 predVarNames = predicted,
                 data = data)
    }

    boosted_metrics <- evaluate_model(data = scored_test_data,
                                      observed = "fraudRisk",
                                      predicted = "Boosted_Prediction")

    roc_curve(data = scored_test_data,
              observed = "fraudRisk",
              predicted = "Boosted_Probability")

The confusion matrix:

              0       1
    0 2752288   67659
    1   77117  101796
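As a quick check on these counts, the metrics can be worked out by hand in base R. This sketch assumes that rows are the observed fraudRisk values and columns are the predictions (the order used by `table(observed, predicted)`), and that class 1 (fraud) is the positive class; if the printed matrix was laid out the other way round, precision and recall would swap.

```r
# Counts copied from the confusion matrix printed above.
# Assumption: rows are observed fraudRisk, columns are predictions,
# and class 1 (fraud) is the positive class.
confusion <- matrix(c(2752288, 77117, 67659, 101796), nrow = 2,
                    dimnames = list(observed = c("0", "1"),
                                    predicted = c("0", "1")))

tp <- confusion["1", "1"]   # fraud predicted as fraud
fn <- confusion["1", "0"]   # fraud predicted as legitimate
fp <- confusion["0", "1"]   # legitimate predicted as fraud
tn <- confusion["0", "0"]   # legitimate predicted as legitimate

accuracy  <- (tp + tn) / sum(confusion)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
round(c(Accuracy = accuracy, Precision = precision, Recall = recall), 3)
```

Under these assumptions the accuracy is high mainly because legitimate transactions dominate, while precision and recall for the fraud class are noticeably lower, which is why accuracy alone is a weak summary for skewed data.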

ROC curve is (AUC=0.95):

**Summary:**

In this article, we demonstrate how to use MRS on a fraud data set. This includes how to create a data source with the RxTextData function, how to transform data with the rxDataStep function, how to import data with the rxImport function, how to train a gradient boosted tree model with the rxBTrees function, and how to predict with the rxPredict function.

by Joseph Rickert

When I first went to grad school, the mathematicians advised me to cultivate the habit of reading with a pencil. This turned into a lifelong habit and a useful skill for reading all sorts of things: literature, reports and newspapers, for example, not just technical papers. However, reading statistics and data science papers, or really anything that includes some data, considerably "ups the ante". For this sort of exercise, I need a tool to calculate, to try some variations that test my intuition and see how well I'm following the arguments. The idea here is not so much to replicate the paper but to accept the author's invitation to engage with the data and work through the analysis. Ideally, I'd want something not much more burdensome than a pencil (maybe a tablet based implementation of R), but standard R on my notebook comes pretty close to the perfect tool.

Recently, I sat down with Bradley Efron's 1987 paper "Logistic Regression, Survival Analysis, and the Kaplan-Meier Curve", the paper where he elaborates on the idea of using conditional logistic regression to estimate hazard rates and survival curves. This paper is classic Efron: drawing you in with a great story well before you realize how much work it's going to be to follow it to the end. Efron writes with a fairly informal style that encourages the reader to continue. Struggling to keep up with some of his arguments, I nevertheless get the feeling that Efron is doing his best to help me follow along, dropping hints every now and then about where to look if I lose the trail.

The basic idea of conditional logistic regression is to group the data into discrete time intervals with n_{i} patients at risk in each interval, i, and then assume that the intervals really are independent and that the s_{i} events (deaths or some other measure of "success") in each interval, follow a binomial distribution with parameters n_{i} and h_{i} where:

h_{i} = Prob(a patient dies during the i^{th} interval | the patient survives until the beginning of the i^{th} interval).

The modest goal of this post was to see if I could reproduce Efron's Figure 3, which shows survival curves for three different models for arm A of a clinical trial examining treatments for head and neck cancer. I figured that getting to Figure 3 represents the minimum amount of comprehension required to begin experimenting with conditional logistic regression.

I entered the data buried in the caption to Efron's Table 1 and was delighted when R's survival package replicated the survival times that also appear in the caption.

    library(survival)

    # Data for Efron's Table 1
    # Enter raw data for arm A
    Adays <- c(7, 34, 42, 63, 64, 74, 83, 84, 91, 108, 112, 129, 133, 133,
               139, 140, 140, 146, 149, 154, 157, 160, 160, 165, 173, 176, 185, 218,
               225, 241, 248, 273, 277, 279, 297, 319, 405, 417, 420, 440, 523, 523,
               583, 594, 1101, 1116, 1146, 1226, 1349, 1412, 1417)
    Astatus <- rep(1, 51)
    Astatus[c(6, 27, 34, 36, 42, 46, 48, 49, 50)] <- 0
    Aobj <- Surv(time = Adays, Astatus == 1)
    Aobj
    #  [1]    7    34    42    63    64    74+   83    84    91   108   112   129   133   133
    # [15]  139   140   140   146   149   154   157   160   160   165   173   176   185+  218
    # [29]  225   241   248   273   277   279+  297   319+  405   417   420   440   523   523+
    # [43]  583   594  1101  1116+ 1146  1226+ 1349+ 1412+ 1417

Doing the same with the data from arm B of the trial led to a set of Kaplan-Meier curves that pretty much match the curves in Efron's Figure 1.

All of this was straightforward, but I was puzzled that the summary of the Kaplan-Meier curve for arm A (see KM_A in the code below) doesn't match the values for month, n_{i} and s_{i} in Efron's Table 1, until I realized these values were for the beginning of the month. To match the table, I compute n_{i} by putting 51 in front of the vector KM_A$n.risk, and add a 1 to the end of the vector KM_A$n.event to get s_{i}. (See the "set up variables for models" section in the code below.)

After this, one more "trick" was required to get to Figure 3. Most of the time, I suppose those of us working with logistic regression to construct machine learning models are accustomed to specifying the outcome as a binary variable of "ones" and "zeros", or as a two level factor. But, how exactly does one specify the parameters n_{i} and s_{i} for the binomial models that comprise each outcome? After a close reading of the documentation (See the first paragraph under Details for ?glm) I was very pleased to see that glm() permits the dependent variable to be a matrix.

The formula for Efron's cubic model looks like this:

    Y <- matrix(c(si, failure), n, 2)  # Response matrix
    form <- formula(Y ~ t + t2 + t3)
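To see the matrix response in action without the trial data, here is a minimal sketch on made-up grouped counts. The vectors `t`, `n` and `s` are hypothetical and are not Efron's numbers; the first column of the response holds the events and the second the non-events, which is the layout `?glm` describes for binomial models.

```r
# Hypothetical grouped data: n at risk and s events in each of 6 intervals.
t <- 1:6
n <- c(50, 45, 38, 30, 24, 20)
s <- c(5, 6, 5, 3, 2, 1)

# Two-column response: successes (events) first, failures (survivors) second.
Y <- cbind(s, n - s)

# Binomial logistic regression of the interval hazard on a cubic in time.
fit <- glm(Y ~ t + I(t^2) + I(t^3), family = binomial)

# Fitted hazard probability for each interval.
h_hat <- fitted(fit)
round(h_hat, 3)
```

Writing the powers with `I()` inside the formula avoids having to build separate `t2` and `t3` columns, though precomputing them as in the post works just as well.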

The rest of the code is straightforward and leads to a pretty good reproduction of Figure 3.

In this figure, the black line represents the survival curve for a Life Table model where the hazard probabilities are estimated by h_{i} = s_{i} / n_{i}. The blue triangles map the survival curve for the cubic model given above, and the red curve with crosses plots a cubic spline model where h_{i} = t_{i} + (t_{i} - 11)^{2} + (t_{i} - 11)^{3}.
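The step from hazards to a survival curve is worth a tiny worked example. In discrete time, the probability of surviving through interval i is the product of (1 - h_{j}) over all intervals j up to i. The counts below are hypothetical and only illustrate the arithmetic for the Life Table estimate.

```r
# Hypothetical n at risk and s events for five intervals.
n <- c(51, 50, 48, 42, 40)
s <- c(1, 2, 5, 2, 8)

# Life-table hazard estimate for each interval.
h <- s / n

# Survival to the end of interval i: product of (1 - h_j) for j <= i.
S <- cumprod(1 - h)
round(S, 3)
```

The same `cumprod(1 - h)` step applies to fitted hazards from a parametric model, which is how one set of code can draw all three curves.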

What a delightful little diagram! In addition to illustrating the technique of using a very basic statistical tool to model time-to-event data, the process leading to Figure 3 reveals something about the intuition and care a professional statistician puts into the exploratory modeling process.

There is much more in Efron's paper. What I have shown here is just the "trailer". Efron presents a careful analysis of the data for both arms of the clinical trial, goes on to study maximum likelihood estimates for conditional logistic regression models and their standard errors, and proves a result about the average ratio of the asymptotic variance between parametric and non-parametric hazard rate estimates.

Enjoy this classic paper and write some "am I reading this right" code of your own!

Here is my code for the models and plots.