by Joseph Rickert

Just about two and a half years ago I wrote about some resources for doing Bayesian statistics in R. Motivated by the tutorial Modern Bayesian Tools for Time Series Analysis by Harte and Weylandt that I attended at R/Finance last month, and the upcoming tutorial An Introduction to Bayesian Inference using R Interfaces to Stan that Ben Goodrich is going to give at useR!, I thought I'd look into what's new. Well, Stan is what's new! Yes, Stan has been under development and available for some time. But somehow, while I wasn't paying close attention, two things happened: (1) the rstan package evolved to make the mechanics of doing Bayesian analysis in R really easy and (2) the Stan team produced and/or organized an amazing amount of documentation.

My impressions of doing Bayesian analysis in R were set in the WinBUGS era. The separate WinBUGS installation was always tricky, and then moving between the BRugs and R2WinBUGS packages presented some additional challenges. My recent Stan experience was nothing like this. I had everything up and running in just a few minutes. The directions for getting started with rstan are clear and explicit about making sure that you have the right tool chain in place for your platform. Since I am running R 3.3.0 on Windows 10, I installed Rtools34. This went quickly and as expected, except that C:\Rtools\gcc-4.x-y\bin did not show up in my path variable. Not a big deal: I used the menus in the Windows System Properties box to edit the Path statement by hand. After this, rstan installed like any other R package and I was able to run the 8schools example from the package vignette. The following 10-minute video by Ehsan Karim takes you through the install process and the vignette example.

The Stan documentation includes four major components: (1) The Stan Language Manual, (2) Examples of fully worked out problems, (3) Contributed Case Studies and (4) both slides and video tutorials. This is an incredibly rich cache of resources that makes a very credible case for the ambitious project of teaching people with some R experience both Bayesian Statistics and Stan at the same time. The "trick" here is that the documentation operates at multiple levels of sophistication with entry points for students with different backgrounds. For example, a person with some R and the modest statistics background required for approaching Gelman and Hill's extraordinary text, Data Analysis Using Regression and Multilevel/Hierarchical Models, can immediately begin running rstan code for the book's examples. To run the rstan version of the example in section 5.1, Logistic Regression with One Predictor, with no changes, a student needs only to copy the R scripts and data into her local environment. In this case, she would need the R script: 5._LogisticRegressionWithOnePredictor.R, the data: nes1992_vote.data.R and the Stan code: nes_logit.stan. The Stan code for this simple model is about as straightforward as it gets: variable declarations, parameter identification and the model itself.

```
data {
  int<lower=0> N;
  vector[N] income;
  int<lower=0,upper=1> vote[N];
}
parameters {
  vector[2] beta;
}
model {
  vote ~ bernoulli_logit(beta[1] + beta[2] * income);
}
```
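Since the Stan program above declares no explicit priors on `beta`, its posterior mode should line up with the ordinary maximum-likelihood logistic regression fit. The base R sketch below writes out the same likelihood and cross-checks it against `glm()`. The data are simulated stand-ins for the NES variables, and `true_beta` is a made-up value chosen only for illustration; this is a sanity check, not a replacement for sampling with rstan.

```r
set.seed(42)

# Simulated stand-in for the NES data (hypothetical values, not the real survey)
N <- 500
income <- sample(1:5, N, replace = TRUE)
true_beta <- c(-1.4, 0.33)  # assumed coefficients, for illustration only
vote <- rbinom(N, size = 1, prob = plogis(true_beta[1] + true_beta[2] * income))

# Log-likelihood encoded by: vote ~ bernoulli_logit(beta[1] + beta[2] * income)
log_lik <- function(beta) {
  eta <- beta[1] + beta[2] * income
  sum(dbinom(vote, size = 1, prob = plogis(eta), log = TRUE))
}

# Maximum-likelihood fit; with flat priors this is also the posterior mode
fit <- glm(vote ~ income, family = binomial)
beta_hat <- unname(coef(fit))

log_lik(beta_hat)               # value at the fitted coefficients
log_lik(beta_hat + c(0.1, 0))   # a perturbed beta scores lower
```

Comparing posterior means from a Stan run with `beta_hat` is a quick way to confirm that the model file and the data list are wired up as intended.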

Running the script will produce the iconic logistic regression plot:

I'll wind down by curbing my enthusiasm just a little by pointing out that Stan is not the only game in town. JAGS is a popular alternative, and there is plenty that can be done with unaugmented R code alone as the Bayesian Inference Task View makes abundantly clear.

If you are a book person and new to Bayesian statistics, I highly recommend Bayesian Essentials with R by Jean-Michel Marin and Christian Robert. The authors provide a compact introduction to Bayesian statistics that is backed up with numerous R examples. Also, the new book by Richard McElreath, Statistical Rethinking: A Bayesian Course with Examples in R and Stan looks like it is going to be an outstanding read. The online supplements to the book are certainly worth a look.

Finally, if you are a Bayesian or thinking about becoming one and you are going to useR!, be sure to catch the following talks:

- Bayesian analysis of generalized linear mixed models with JAGS, by Martyn Plummer
- bamdit: An R Package for Bayesian meta-Analysis of diagnostic test data by Pablo Emilio Verde
- Fitting complex Bayesian models with R-INLA and MCMC by Virgilio Gómez-Rubio
- bayesboot: An R package for easy Bayesian bootstrapping by Rasmus Arnling Bååth
- An Introduction to Bayesian Inference using R Interfaces to Stan by Ben Goodrich
- DiLeMMa - Distributed Learning with Markov Chain Monte Carlo Algorithms Using the ROAR Package by Ali Zaidi

by Lourdes O. Montenegro

*Lourdes O. Montenegro is a PhD candidate at the Lee Kuan Yew School of Public Policy, National University of Singapore. Her research interests cover the intersection of applied data science, technology, economics and public policy.*

Many of us now find it hard to live without a good quality internet connection. As a result, there is growing interest in characterizing and comparing internet performance metrics. For example, when planning to switch internet service providers or considering a move to a new city or country, internet users may want to research in advance what to expect in terms of download speed or latency. Cloud companies may want to provision adequately for different markets with varying levels of internet quality. And governments may want to benchmark their communications infrastructure and invest accordingly. Whatever the purpose, a consortium of research, industry and public interest organizations called Measurement Lab has made available the largest open and verifiable internet performance dataset on the planet. With the help of a combination of packages, R users can easily query, explore and visualize this large dataset at no cost.

In the example that follows, we use the *bigrquery* package to query and download results from the Network Diagnostic Tool (NDT) used by the U.S. FCC. The *bigrquery* package provides an interface to Google BigQuery, which hosts NDT results along with several other Measurement Lab (M-Lab) datasets. However, R users will first need to set up a BigQuery account and join the M-Lab mailing list to authenticate. Detailed instructions are provided on the M-Lab website. Once done, SQL-like queries can be run from within R. The results are saved as a data frame on which further analysis can be performed. Aside from the convenience of working within the R environment, the *bigrquery* package has another advantage: the only limit on the size of the query results that can be saved for further exploration is the amount of available RAM. In contrast, the BigQuery web interface only allows export to .csv format of query results that are at or below 16,000 rows.

The following R script gives us the average download speed (in Mbps) per country in 2015.[1] The SQL-like query can be modified to return other internet performance metrics that may be of interest to the R user such as upload speed, round-trip time (latency) and packet re-transmission rates.

```
# Querying average download speed per country in 2015
require(bigrquery)

downquery_template <- "SELECT
  connection_spec.client_geolocation.country_code AS country,
  AVG(8 * web100_log_entry.snap.HCThruOctetsAcked/
    (web100_log_entry.snap.SndLimTimeRwin +
     web100_log_entry.snap.SndLimTimeCwnd +
     web100_log_entry.snap.SndLimTimeSnd)) AS downloadThroughput,
  COUNT(DISTINCT test_id) AS tests,
FROM
  plx.google:m_lab.ndt.all
WHERE
  IS_EXPLICITLY_DEFINED(web100_log_entry.connection_spec.remote_ip)
  AND IS_EXPLICITLY_DEFINED(web100_log_entry.connection_spec.local_ip)
  AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.HCThruOctetsAcked)
  AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeRwin)
  AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeCwnd)
  AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeSnd)
  AND project = 0
  AND IS_EXPLICITLY_DEFINED(connection_spec.data_direction)
  AND connection_spec.data_direction = 1
  AND web100_log_entry.snap.HCThruOctetsAcked >= 8192
  AND (web100_log_entry.snap.SndLimTimeRwin +
       web100_log_entry.snap.SndLimTimeCwnd +
       web100_log_entry.snap.SndLimTimeSnd) >= 9000000
  AND (web100_log_entry.snap.SndLimTimeRwin +
       web100_log_entry.snap.SndLimTimeCwnd +
       web100_log_entry.snap.SndLimTimeSnd) < 3600000000
  AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.CongSignals)
  AND web100_log_entry.snap.CongSignals > 0
  AND (web100_log_entry.snap.State == 1
       OR (web100_log_entry.snap.State >= 5
           AND web100_log_entry.snap.State <= 11))
  AND web100_log_entry.log_time >= PARSE_UTC_USEC('2015-01-01 00:00:00') / POW(10, 6)
  AND web100_log_entry.log_time < PARSE_UTC_USEC('2016-01-01 00:00:00') / POW(10, 6)
GROUP BY
  country
ORDER BY
  country ASC;"

downresult <- query_exec(downquery_template, project="measurement-lab", max_pages=Inf)
```
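The throughput expression inside the query can be checked by hand in plain R. The sketch below assumes, as the 9,000,000 and 3,600,000,000 thresholds in the query suggest, that the three `SndLimTime*` counters are in microseconds; the function names and the sample numbers are hypothetical and exist only for this illustration.

```r
# 8 * bytes gives bits; bits per microsecond is numerically equal to
# megabits per second, which is why the query needs no extra scaling.
throughput_mbps <- function(octets_acked, time_rwin_us, time_cwnd_us, time_snd_us) {
  8 * octets_acked / (time_rwin_us + time_cwnd_us + time_snd_us)
}

# Mirror of the query's validity filters: at least 8192 bytes acknowledged
# and a total send time of at least 9 seconds but under one hour.
is_valid_test <- function(octets_acked, total_time_us) {
  octets_acked >= 8192 & total_time_us >= 9e6 & total_time_us < 3.6e9
}

# Hypothetical test: 12.5 MB acknowledged over 10 seconds of send time
example_mbps <- throughput_mbps(12.5e6, 4e6, 3e6, 3e6)
example_mbps

# A 2-second test would be dropped by the duration filter
is_valid_test(c(12.5e6, 12.5e6), c(10e6, 2e6))
```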

Once we have the query results in a data frame, we can proceed to visualize and map average download speeds for each country. To do this, we can use the *rworldmap* package, which offers a relatively simple way to map country-level and gridded user datasets. Mapping is done mainly through two functions: (1) *joinCountryData2Map* joins the query results with shapefiles of country boundaries; (2) *mapCountryData* plots the choropleth map. Note that the join is best effected using either two- or three-letter ISO country codes, although the *rworldmap* package also allows join columns filled with country names.

In order to make the choropleth map prettier and more comprehensible, we can augment it with a combination of the *classInt* package, to calculate natural breaks in the range of download speed results, and the *RColorBrewer* package, for a wider selection of color schemes. In the R script that follows, we specify the Jenks method to cluster download speed results in a way that minimizes deviation from the class mean within each class but maximizes deviation across class means. Compared to other methods for clustering download speed results, the Jenks method draws a sharper picture of countries clocking greater than 25 Mbps on average.

```
require(rworldmap)
require(classInt)
require(RColorBrewer)

downloadmap <- joinCountryData2Map(downresult
                                   , joinCode='ISO2'
                                   , nameJoinColumn='country'
                                   , verbose='TRUE')

par(mai=c(0,0,0.2,0),xaxs="i",yaxs="i")

#getting class intervals using a 'jenks' classification in classInt package
classInt <- classInt::classIntervals( downloadmap[["downloadThroughput"]], n=5, style="jenks")
catMethod = classInt[["brks"]]

#getting a colour scheme from the RColorBrewer package
colourPalette <- RColorBrewer::brewer.pal(5,'RdPu')

mapParams <- mapCountryData(downloadmap,
                            nameColumnToPlot="downloadThroughput",
                            mapTitle="Download Speed (mbps)",
                            catMethod=catMethod,
                            colourPalette=colourPalette,
                            addLegend=FALSE)

do.call( addMapLegend, c(mapParams,
                         legendWidth=0.5,
                         legendLabels="all",
                         legendIntervals="data",
                         legendMar = 2))
```
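To make the "natural breaks" idea concrete, here is a small brute-force sketch in base R. It tries every way of cutting a sorted vector into `k` contiguous classes and keeps the split with the smallest total within-class sum of squared deviations. This is an illustration of the objective only, not the algorithm that *classInt* uses, and the speed values are invented.

```r
# Sum of squared deviations from the class mean
within_ss <- function(x) sum((x - mean(x))^2)

# Exhaustive search over all contiguous k-class splits (small inputs only)
natural_breaks <- function(x, k) {
  x <- sort(x)
  n <- length(x)
  cuts <- combn(n - 1, k - 1)  # each column: positions after which to cut
  class_ids <- function(cut) findInterval(seq_len(n), cut + 1) + 1
  score <- apply(cuts, 2, function(cut) sum(tapply(x, class_ids(cut), within_ss)))
  best <- cuts[, which.min(score)]
  list(classes = split(x, class_ids(best)), total_within_ss = min(score))
}

# Hypothetical average download speeds (Mbps) for eight countries
speeds <- c(2, 3, 4, 10, 11, 12, 30, 32)
result <- natural_breaks(speeds, k = 3)
result$classes
result$total_within_ss
```

The exhaustive search grows combinatorially, so it is only practical for a handful of values; for a full country list a dedicated implementation such as `classIntervals(..., style="jenks")` is the sensible choice.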

Looking at the map, we see that the UK, Japan, Romania, Sweden, Taiwan, The Netherlands, Denmark and Singapore (if we squint!) are the best places to be for internet speed addicts. Until further investigation, we can safely discount the suspiciously high results for North Korea since the number of observations is too low. In contrast, average download speeds in South Korea might be grossly underestimated when measured from foreign servers, as may be the case with NDT results, since most Koreans access locally hosted content. There are, of course, a number of caveats worth mentioning before drawing any conclusions regarding the causes of varying internet performance between countries. Confounding factors such as distance from the client to the test server, the client's operating system, and the proportion of fixed broadband to wireless connections will need to be controlled for. Despite these caveats, this tentative exploration already reveals interesting patterns in global internet performance that are worth a closer look.

[1] Thanks to Stephen McInerney and Chris Ritzo for code advice.

by Max Kuhn: Director, Nonclinical Statistics, Pfizer

Many predictive and machine learning models have structural or *tuning* parameters that cannot be directly estimated from the data. For example, when using a *K*-nearest neighbor model, there is no analytical estimator for *K* (the number of neighbors). Typically, resampling is used to get good performance estimates of the model for a given set of values for *K* and the one associated with the best results is used. This is basically a grid search procedure. However, there are other approaches that can be used. I’ll demonstrate how Bayesian optimization and Gaussian process models can be used as an alternative.

To demonstrate, I’ll use the regression simulation system of Sapp et al. (2014) where the predictors (i.e. `x`’s) are independent Gaussian random variables with mean zero and a variance of 9. The prediction equation is:

x_1 + sin(x_2) + log(abs(x_3)) + x_4^2 + x_5*x_6 + I(x_7*x_8*x_9 < 0) + I(x_10 > 0) + x_11*I(x_11 > 0) + sqrt(abs(x_12)) + cos(x_13) + 2*x_14 + abs(x_15) + I(x_16 < -1) + x_17*I(x_17 < -1) - 2 * x_18 - x_19*x_20

The random error here is also Gaussian with mean zero and a variance of 9. This simulation is available in the `caret` package via a function called `SLC14_1`. First, we’ll simulate a training set of 250 data points and also a larger set that we will use to elucidate the true parameter surface:

```
> library(caret)
> set.seed(7210)
> train_dat <- SLC14_1(250)
> large_dat <- SLC14_1(10000)
```
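For readers who want to see what the simulation involves, the prediction equation above can be written directly in base R. This is a hand-rolled sketch that follows the formula as printed in this post; it is not the source of `SLC14_1`, and the names `sapp_mean`, `x_sim` and `y_sim` are hypothetical.

```r
# Mean function from the prediction equation; x is a numeric matrix with 20 columns.
# Logical comparisons play the role of the indicator function I().
sapp_mean <- function(x) {
  x[, 1] + sin(x[, 2]) + log(abs(x[, 3])) + x[, 4]^2 + x[, 5] * x[, 6] +
    (x[, 7] * x[, 8] * x[, 9] < 0) + (x[, 10] > 0) + x[, 11] * (x[, 11] > 0) +
    sqrt(abs(x[, 12])) + cos(x[, 13]) + 2 * x[, 14] + abs(x[, 15]) +
    (x[, 16] < -1) + x[, 17] * (x[, 17] < -1) - 2 * x[, 18] - x[, 19] * x[, 20]
}

set.seed(7210)
n <- 250

# Predictors: independent Gaussians with mean 0 and variance 9 (sd = 3)
x_sim <- matrix(rnorm(n * 20, mean = 0, sd = 3), ncol = 20)

# Outcome: mean function plus Gaussian noise with variance 9
y_sim <- sapp_mean(x_sim) + rnorm(n, mean = 0, sd = 3)

summary(y_sim)
```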

We will use a radial basis function support vector machine to model these data. For a fixed epsilon, the model will be tuned over the cost value and the radial basis kernel parameter, commonly denoted as `sigma`. Since we are simulating the data, we can figure out a good approximation to the relationship between these parameters and the root mean squared error (RMSE) of the model. Given our specific training set and the larger simulated sample, here is the RMSE surface for a wide range of values:

There is a wide range of parameter values that are associated with very low RMSE values in the northwest.

A simple way to get an initial assessment is to use random search where a set of random tuning parameter values are generated across a “wide range”. For an RBF SVM, `caret`’s `train` function defines wide as cost values between `2^c(-5, 10)` and `sigma` values inside the range produced by the `sigest` function in the `kernlab` package. This code will do 20 random sub-models in this range:

```
> rand_ctrl <- trainControl(method = "repeatedcv", repeats = 5,
+ search = "random")
>
> set.seed(308)
> rand_search <- train(y ~ ., data = train_dat,
+ method = "svmRadial",
+ ## Create 20 random parameter values
+ tuneLength = 20,
+ metric = "RMSE",
+ preProc = c("center", "scale"),
+ trControl = rand_ctrl)
```
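The random search above boils down to drawing candidate pairs from the ranges just described. The base R sketch below draws 20 candidates with cost uniform on the log2 scale between `2^-5` and `2^10`. The `sigma_range` values are a hypothetical stand-in for what `sigest` would return, and sampling `sigma` uniformly on the log scale is an assumption of this sketch rather than a statement about `caret`'s internals.

```r
set.seed(308)
n_candidates <- 20

# Hypothetical stand-in for the range kernlab::sigest() would estimate
sigma_range <- c(0.01, 0.07)

candidates <- data.frame(
  # sigma: uniform on the log scale inside the assumed range
  sigma = exp(runif(n_candidates, log(sigma_range[1]), log(sigma_range[2]))),
  # cost: uniform on the log2 scale between 2^-5 and 2^10
  C = 2^runif(n_candidates, min = -5, max = 10)
)

# Sort by sigma to mimic the layout of the printed results table
candidates <- candidates[order(candidates$sigma), ]
head(candidates)
```

Sampling on a log scale spreads the candidates evenly across orders of magnitude, which matters when cost can range from 0.03 to about 1,000.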

`> rand_search`

```
Support Vector Machines with Radial Basis Function Kernel
250 samples
20 predictor
Pre-processing: centered (20), scaled (20)
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 226, 224, 224, 225, 226, 224, ...
Resampling results across tuning parameters:
sigma C RMSE Rsquared
0.01161955 42.75789360 10.50838 0.7299837
0.01357777 67.97672171 10.71276 0.7212605
0.01392676 828.08072944 10.75235 0.7195869
0.01394119 0.18386619 18.56921 0.2109284
0.01538656 0.05224914 19.33310 0.1890599
0.01711920 228.59215128 11.09522 0.7047713
0.01790202 0.78835920 16.78597 0.3217203
0.01936110 0.91401289 16.45485 0.3492278
0.02023763 0.07658831 19.03987 0.2081059
0.02690269 0.04128731 19.33974 0.2126950
0.02780880 0.64865483 16.52497 0.3545042
0.02920113 974.08943821 12.22906 0.6508754
0.02963586 1.19350198 15.46690 0.4407725
0.03370625 31.45179445 12.60653 0.6314384
0.03561750 0.04970422 19.23564 0.2306298
0.03752561 0.06592800 19.07130 0.2375616
0.03783570 398.44599747 12.92958 0.6143790
0.04534046 3.91017571 13.56612 0.5798001
0.05171719 296.65916049 13.88865 0.5622445
0.06482201 47.31716568 14.66904 0.5192667
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.01161955 and C
= 42.75789.
```

`> ggplot(rand_search) + scale_x_log10() + scale_y_log10()`

`> getTrainPerf(rand_search)`

```
TrainRMSE TrainRsquared method
1 10.50838 0.7299837 svmRadial
```

There are other approaches that we can take, including a more comprehensive grid search or using a nonlinear optimizer to find better values of cost and `sigma`. Another approach is to use Bayesian optimization to find good values for these parameters. This is an optimization scheme that uses Bayesian models based on Gaussian processes to predict good tuning parameters.

Gaussian Process (GP) regression is used to facilitate the Bayesian analysis. It creates a regression model to formalize the relationship between the outcome (RMSE, in this application) and the SVM tuning parameters. The standard assumption regarding normality of the residuals is used and, being a Bayesian model, the regression parameters also gain a prior distribution that is multivariate normal. The GP regression model uses a kernel basis expansion (much like the SVM model does) in order to allow the model to be nonlinear in the SVM tuning parameters. To do this, a radial basis function kernel is used for the covariance function of the multivariate normal prior and maximum likelihood is used to estimate the kernel parameters of the GP.
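A minimal one-dimensional GP regression can be written in a few lines of base R, which may help make the description above less abstract. This sketch fixes the kernel length scale instead of estimating it by maximum likelihood, uses a zero prior mean with unit prior variance, and works on made-up observations. The function names are hypothetical, and none of this is the `GPfit` implementation.

```r
# Radial basis function (squared exponential) kernel between two vectors
rbf_kernel <- function(a, b, length_scale = 1) {
  exp(-outer(a, b, "-")^2 / (2 * length_scale^2))
}

# Posterior mean and standard deviation at x_new given observations
gp_predict <- function(x_obs, y_obs, x_new, length_scale = 1, noise = 1e-6) {
  K     <- rbf_kernel(x_obs, x_obs, length_scale) + diag(noise, length(x_obs))
  K_s   <- rbf_kernel(x_new, x_obs, length_scale)
  K_inv <- solve(K)
  post_mean <- as.vector(K_s %*% K_inv %*% y_obs)
  post_var  <- 1 - rowSums((K_s %*% K_inv) * K_s)  # prior variance is 1
  list(mean = post_mean, sd = sqrt(pmax(post_var, 0)))
}

# Made-up (centered) outcomes observed at three parameter values
x_obs <- c(-2, 0, 1.5)
y_obs <- c(1, -0.5, 0.3)

at_obs <- gp_predict(x_obs, y_obs, x_obs)  # predictions at the observed points
far    <- gp_predict(x_obs, y_obs, 10)     # prediction far from any data

at_obs$mean  # close to y_obs
far$sd       # close to the prior standard deviation of 1
```

The two calls show the behaviour the optimization relies on: the model is nearly certain where it has data and reverts to the prior, with large uncertainty, away from it.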

In the end, the GP regression model can take the current set of resampled RMSE values and make predictions over the entire space of potential cost and `sigma` parameters. The Bayesian machinery allows this prediction to have a *distribution*; for a given set of tuning parameters, we can obtain the estimated mean RMSE values as well as an estimate of the corresponding prediction variance. For example, if we were to use our data from the random search to build a GP model, the predicted mean RMSE would look like:

The darker regions indicate smaller RMSE values given the current resampling results. The predicted standard deviation of the RMSE is:

The prediction noise becomes larger (i.e. darker) as we move away from the current set of observed values.

(The `GPfit` package was used to create these models.)

To find good parameters to test, there are several approaches. This paper (pdf) outlines several but we will use the *confidence bound* approach. For any combination of cost and `sigma`, we can compute the lower confidence bound of the predicted RMSE. Since this takes the uncertainty of prediction into account, it has the potential to produce better directions to take the optimization. Here is a plot of the confidence bound using a single standard deviation of the predicted mean:

Darker values indicate better conditions to explore. Since we know the true RMSE surface, we can see that the best region (the northwest) is estimated to be an interesting location to take the optimization. The optimizer would pick a good location based on this model and evaluate this as the next parameter value. This most recent configuration is added to the GP’s training set and the process continues for a pre-specified number of iterations.
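The loop just described (fit the GP, pick the point with the lowest confidence bound, evaluate it, add it to the training set, repeat) can be sketched on a toy problem. Everything here is an illustrative assumption: the one-dimensional `objective` stands in for a resampled RMSE, the kernel length scale is fixed, outcomes are standardized before each fit, and candidates come from a fixed grid. It is not how `rBayesianOptimization` is implemented.

```r
rbf <- function(a, b, ell = 1) exp(-outer(a, b, "-")^2 / (2 * ell^2))

# Zero-mean, unit-variance GP posterior on a grid (fixed length scale)
gp_predict <- function(x, y, grid, ell = 1, noise = 1e-4) {
  K_inv <- solve(rbf(x, x, ell) + diag(noise, length(x)))
  K_s   <- rbf(grid, x, ell)
  list(mean = as.vector(K_s %*% K_inv %*% y),
       sd   = sqrt(pmax(1 - rowSums((K_s %*% K_inv) * K_s), 0)))
}

objective <- function(x) (x - 2)^2  # hypothetical stand-in for resampled RMSE
grid  <- seq(0, 5, by = 0.05)       # candidate parameter values
idx   <- c(1, 81, 101)              # initial design: x = 0, 4 and 5
kappa <- 1                          # one standard deviation, as in the text

for (iter in 1:8) {
  x <- grid[idx]
  y <- objective(x)
  y_std <- (y - mean(y)) / sd(y)          # standardize outcomes for the GP
  fit <- gp_predict(x, y_std, grid)
  lcb <- fit$mean - kappa * fit$sd        # lower confidence bound
  lcb[idx] <- Inf                         # never re-evaluate a visited point
  idx <- c(idx, which.min(lcb))           # evaluate the most promising point next
}

evaluated <- data.frame(x = grid[idx], value = objective(grid[idx]))
best <- evaluated[which.min(evaluated$value), ]
best
```

Raising `kappa` puts more weight on the uncertainty term and pushes the search toward unexplored regions; lowering it makes the search greedier around the current best value.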

Yachen Yan created an R package for Bayesian optimization. He also made a modification so that we can use our initial random search as the substrate to the first GP used. To search a much wider parameter space, our code looks like:

```
> ## Define the resampling method
> ctrl <- trainControl(method = "repeatedcv", repeats = 5)
>
> ## Use this function to optimize the model. The two parameters are
> ## evaluated on the log scale given their range and scope.
> svm_fit_bayes <- function(logC, logSigma) {
+ ## Use the same model code but for a single (C, sigma) pair.
+ txt <- capture.output(
+ mod <- train(y ~ ., data = train_dat,
+ method = "svmRadial",
+ preProc = c("center", "scale"),
+ metric = "RMSE",
+ trControl = ctrl,
+ tuneGrid = data.frame(C = exp(logC), sigma = exp(logSigma)))
+ )
+ ## The function wants to _maximize_ the outcome so we return
+ ## the negative of the resampled RMSE value. `Pred` can be used
+ ## to return predicted values but we'll avoid that and use zero
+ list(Score = -getTrainPerf(mod)[, "TrainRMSE"], Pred = 0)
+ }
>
> ## Define the bounds of the search.
> lower_bounds <- c(logC = -5, logSigma = -9)
> upper_bounds <- c(logC = 20, logSigma = -0.75)
> bounds <- list(logC = c(lower_bounds[1], upper_bounds[1]),
+ logSigma = c(lower_bounds[2], upper_bounds[2]))
>
> ## Create a grid of values as the input into the BO code
> initial_grid <- rand_search$results[, c("C", "sigma", "RMSE")]
> initial_grid$C <- log(initial_grid$C)
> initial_grid$sigma <- log(initial_grid$sigma)
> initial_grid$RMSE <- -initial_grid$RMSE
> names(initial_grid) <- c("logC", "logSigma", "Value")
>
> ## Run the optimization with the initial grid and do
> ## 30 iterations. We will choose new parameter values
> ## using the upper confidence bound using 1 std. dev.
>
> library(rBayesianOptimization)
>
> set.seed(8606)
> ba_search <- BayesianOptimization(svm_fit_bayes,
+ bounds = bounds,
+ init_grid_dt = initial_grid,
+ init_points = 0,
+ n_iter = 30,
+ acq = "ucb",
+ kappa = 1,
+ eps = 0.0,
+ verbose = TRUE)
```

```
20 points in hyperparameter space were pre-sampled
elapsed = 1.53 Round = 21 logC = 5.4014 logSigma = -5.8974 Value = -10.8148
elapsed = 1.54 Round = 22 logC = 4.9757 logSigma = -5.0449 Value = -9.7936
elapsed = 1.42 Round = 23 logC = 5.7551 logSigma = -5.0244 Value = -9.8128
elapsed = 1.30 Round = 24 logC = 5.2754 logSigma = -4.9678 Value = -9.7530
elapsed = 1.39 Round = 25 logC = 5.3009 logSigma = -5.0921 Value = -9.5516
elapsed = 1.48 Round = 26 logC = 5.3240 logSigma = -5.2313 Value = -9.6571
elapsed = 1.39 Round = 27 logC = 5.3750 logSigma = -5.1152 Value = -9.6619
elapsed = 1.44 Round = 28 logC = 5.2356 logSigma = -5.0969 Value = -9.4167
elapsed = 1.38 Round = 29 logC = 11.8347 logSigma = -5.1074 Value = -9.6351
elapsed = 1.42 Round = 30 logC = 15.7494 logSigma = -5.1232 Value = -9.4243
elapsed = 25.24 Round = 31 logC = 14.6657 logSigma = -7.9164 Value = -8.8410
elapsed = 32.60 Round = 32 logC = 18.3793 logSigma = -8.1083 Value = -8.7139
elapsed = 1.86 Round = 33 logC = 20.0000 logSigma = -5.6297 Value = -9.0580
elapsed = 0.97 Round = 34 logC = 20.0000 logSigma = -1.5768 Value = -19.2183
elapsed = 5.92 Round = 35 logC = 17.3827 logSigma = -6.6880 Value = -9.0224
elapsed = 18.01 Round = 36 logC = 20.0000 logSigma = -7.6071 Value = -8.5728
elapsed = 114.49 Round = 37 logC = 16.0079 logSigma = -9.0000 Value = -8.7058
elapsed = 89.31 Round = 38 logC = 12.8319 logSigma = -9.0000 Value = -8.6799
elapsed = 99.29 Round = 39 logC = 20.0000 logSigma = -9.0000 Value = -8.5596
elapsed = 106.88 Round = 40 logC = 14.1190 logSigma = -9.0000 Value = -8.5150
elapsed = 4.84 Round = 41 logC = 13.4694 logSigma = -6.5271 Value = -8.9728
elapsed = 108.37 Round = 42 logC = 19.0216 logSigma = -9.0000 Value = -8.7461
elapsed = 52.43 Round = 43 logC = 13.5273 logSigma = -8.5130 Value = -8.8728
elapsed = 39.69 Round = 44 logC = 20.0000 logSigma = -8.3288 Value = -8.4956
elapsed = 5.99 Round = 45 logC = 20.0000 logSigma = -6.7208 Value = -8.9455
elapsed = 113.01 Round = 46 logC = 14.9611 logSigma = -9.0000 Value = -8.7576
elapsed = 27.45 Round = 47 logC = 19.6181 logSigma = -7.9872 Value = -8.6186
elapsed = 116.00 Round = 48 logC = 17.3060 logSigma = -9.0000 Value = -8.6820
elapsed = 2.26 Round = 49 logC = 14.2698 logSigma = -5.8297 Value = -9.1837
elapsed = 64.50 Round = 50 logC = 20.0000 logSigma = -8.6438 Value = -8.6914
Best Parameters Found:
Round = 44 logC = 20.0000 logSigma = -8.3288 Value = -8.4956
```

Animate the search!

The final settings were found at iteration 44 with a cost setting of 485,165,195 and `sigma` = 0.0002415 (that is, `exp(20)` and `exp(-8.3288)`). I would never have thought to evaluate a cost parameter so large, and the algorithm wants to make it even larger. Does it really work?
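Because the search runs on the natural-log scale, the reported best values need to be exponentiated before they can be read as a cost and a `sigma`. The short sketch below back-transforms the two numbers printed for round 44 in the log above; the variable names are hypothetical.

```r
# Best values copied from the "Best Parameters Found" line of the output
best_log <- c(logC = 20.0000, logSigma = -8.3288)

# Undo the log transform used inside svm_fit_bayes()
best_natural <- exp(best_log)
names(best_natural) <- c("C", "sigma")

format(best_natural["C"], big.mark = ",", scientific = FALSE)
signif(best_natural["sigma"], 4)
```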

We can fit a model based on the new configuration and compare it to random search in terms of the resampled RMSE and the RMSE on the test set:

```
> set.seed(308)
> final_search <- train(y ~ ., data = train_dat,
+ method = "svmRadial",
+ tuneGrid = data.frame(C = exp(ba_search$Best_Par["logC"]),
+ sigma = exp(ba_search$Best_Par["logSigma"])),
+ metric = "RMSE",
+ preProc = c("center", "scale"),
+ trControl = ctrl)
> compare_models(final_search, rand_search)
```

```
One Sample t-test
data: x
t = -9.0833, df = 49, p-value = 4.431e-12
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-2.34640 -1.49626
sample estimates:
mean of x
-1.92133
```
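The output above is a one-sample t-test with 49 degrees of freedom, which is what one gets by testing the 50 paired differences (10 folds repeated 5 times) between the two models' resampled RMSE values. The sketch below reproduces that calculation on simulated numbers; the RMSE vectors are invented and only the structure of the test is meant to carry over.

```r
set.seed(49)
n_resamples <- 50  # 10 folds x 5 repeats

# Hypothetical resampled RMSE values for the two models, paired by resample
rmse_random <- rnorm(n_resamples, mean = 10.5, sd = 1.5)
rmse_bayes  <- rmse_random - rnorm(n_resamples, mean = 1.9, sd = 1.5)

# Paired comparison: one-sample t-test on the per-resample differences
diffs <- rmse_bayes - rmse_random
test  <- t.test(diffs)

test$parameter  # degrees of freedom: n_resamples - 1
test$conf.int   # interval for the mean difference in RMSE
```

Pairing only makes sense when both models were evaluated on identical resamples, which is why the seed is set to the same value before each call to `train`.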

`> postResample(predict(rand_search, large_dat), large_dat$y)`

```
RMSE Rsquared
10.1112280 0.7648765
```

`> postResample(predict(final_search, large_dat), large_dat$y)`

```
RMSE Rsquared
8.2668843 0.8343405
```
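The two statistics reported by `postResample` are easy to compute by hand. The sketch below uses tiny made-up vectors; it assumes, as I understand the caret documentation, that the R-squared value is the squared correlation between predictions and observations. The helper names are hypothetical.

```r
# Root mean squared error between predictions and observed values
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# R-squared as the squared Pearson correlation (assumed definition)
r_squared <- function(pred, obs) cor(pred, obs)^2

# Made-up observed values and predictions
obs  <- c(1, 2, 3, 4)
pred <- c(1.5, 2, 2.5, 4.5)

rmse(pred, obs)       # sqrt(mean(c(0.25, 0, 0.25, 0.25)))
r_squared(pred, obs)
```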

Much better!

Thanks to Yachen Yan for making the `rBayesianOptimization` package.

by Joseph Rickert

The model table on the caret package website lists more than 200 variations of predictive analytics models that are available within the caret framework. All of these models may be prepared, tuned, fit and evaluated with a common set of caret functions. All on its own, the table is an impressive testament to the utility and scope of the R language as a data science tool.

For the past year or so xgboost, the extreme gradient boosting algorithm, has been getting a lot of attention. The code below compares gbm with xgboost using the segmentationData set that comes with caret. The analysis presented here is far from the last word on comparing these models, but it does show how one might go about setting up a serious comparison using caret's functions to sweep through parameter space using parallel programming, and then use synchronized bootstrap samples to make a detailed comparison.

After reading in the data and dividing it into training and test data sets, caret's trainControl() and expand.grid() functions are used to set up training of the gbm model on all of the combinations represented in the data frame built by expand.grid(). Then the train() function does the actual training and fitting of the model. Notice that all of this happens in parallel. The Task Manager on my Windows 10 laptop shows all four cores maxed out at 100%.
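To see what train() is being asked to sweep over, it helps to look at the data frame that expand.grid() builds. The base R snippet below recreates the gbm grid used in the script for this post: two values for each of three parameters and a single value for the fourth, giving eight candidate models.

```r
# Same tuning grid as in the gbm section of the script
grid <- expand.grid(interaction.depth = c(1, 2),    # depth of variable interactions
                    n.trees = c(10, 20),            # number of trees to fit
                    shrinkage = c(0.01, 0.1),       # learning rate
                    n.minobsinnode = 20)            # minimum observations per node

nrow(grid)  # 2 * 2 * 2 * 1 = 8 combinations
grid
```

Each row is one candidate model, so the cost of a search grows multiplicatively with every parameter value added to the grid.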

After model fitting, predictions on the test data are computed and an ROC curve is drawn in the usual way. The AUC for gbm was computed to be 0.8731. Here is the ROC curve.
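The AUC has a simple interpretation that can be computed without any packages: it is the proportion of (positive, negative) pairs in which the positive case receives the higher score. The sketch below uses the rank-sum form of that calculation on six invented scores; `auc_rank` is a hypothetical helper, not a pROC function.

```r
# AUC via the rank-sum (Mann-Whitney) identity
auc_rank <- function(score, is_positive) {
  r <- rank(score)  # average ranks handle ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up predicted probabilities and true classes
score       <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4)
is_positive <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

auc_value <- auc_rank(score, is_positive)
auc_value  # 8 of the 9 positive/negative pairs are ordered correctly
```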

Next, a similar process for xgboost computes the AUC to be 0.8857, a fair improvement. The following plot shows how the ROC measure behaves with increasing tree depth for the two different values of the shrinkage parameter.

The final section of code shows how caret can be used to compare the two models using the bootstrap samples that were created in the process of constructing the two models. The boxplots show xgboost has the edge although gbm has a tighter distribution.

The next step, which I hope to take soon, is to rerun the analysis with more complete grids of tuning parameters. For a very accessible introduction to caret have a look at Max Kuhn's 2013 useR! tutorial.

```
# COMPARE XGBOOST with GBM
### Packages Required

library(caret)
library(corrplot)    # plot correlations
library(doParallel)  # parallel processing
library(dplyr)       # Used by caret
library(gbm)         # GBM Models
library(pROC)        # plot the ROC curve
library(xgboost)     # Extreme Gradient Boosting

### Get the Data
# Load the data and construct indices to divide it into training and test data sets.

data(segmentationData)  # Load the segmentation data set
dim(segmentationData)
head(segmentationData,2)
#
trainIndex <- createDataPartition(segmentationData$Case,p=.5,list=FALSE)
trainData <- segmentationData[trainIndex,-c(1,2)]
testData <- segmentationData[-trainIndex,-c(1,2)]
#
trainX <-trainData[,-1]  # Pull out the dependent variable
testX <- testData[,-1]
sapply(trainX,summary)   # Look at a summary of the training data

## GENERALIZED BOOSTED REGRESSION MODEL (GBM)

# Set up training control
ctrl <- trainControl(method = "repeatedcv",   # repeated k-fold cross validation
                     number = 5,              # number of folds (repeats is left at its default)
                     summaryFunction=twoClassSummary,  # Use AUC to pick the best model
                     classProbs=TRUE,
                     allowParallel = TRUE)

# Use the expand.grid to specify the search space
# Note that the default search grid selects multiple values of each tuning parameter

grid <- expand.grid(interaction.depth=c(1,2),  # Depth of variable interactions
                    n.trees=c(10,20),          # Num trees to fit
                    shrinkage=c(0.01,0.1),     # Try 2 values for learning rate
                    n.minobsinnode = 20)
#
set.seed(1951)  # set the seed

# Set up to do parallel processing
registerDoParallel(4)  # Register a parallel backend for train
getDoParWorkers()

gbm.tune <- train(x=trainX,y=trainData$Class,
                  method = "gbm",
                  metric = "ROC",
                  trControl = ctrl,
                  tuneGrid=grid,
                  verbose=FALSE)

# Look at the tuning results
# Note that ROC was the performance criterion used to select the optimal model.

gbm.tune$bestTune
plot(gbm.tune)  # Plot the performance of the training models
res <- gbm.tune$results
res

### GBM Model Predictions and Performance
# Make predictions using the test data set
gbm.pred <- predict(gbm.tune,testX)

#Look at the confusion matrix
confusionMatrix(gbm.pred,testData$Class)

#Draw the ROC curve
gbm.probs <- predict(gbm.tune,testX,type="prob")
head(gbm.probs)

gbm.ROC <- roc(predictor=gbm.probs$PS,
               response=testData$Class,
               levels=rev(levels(testData$Class)))
gbm.ROC$auc
#Area under the curve: 0.8731
plot(gbm.ROC,main="GBM ROC")

# Plot the probability of poor segmentation
histogram(~gbm.probs$PS|testData$Class,xlab="Probability of Poor Segmentation")
##----------------------------------------------
## XGBOOST
# Some stackexchange guidance for xgboost
# http://stats.stackexchange.com/questions/171043/how-to-tune-hyperparameters-of-xgboost-trees

# Set up for parallel processing
set.seed(1951)
registerDoParallel(4,cores=4)
getDoParWorkers()

# Train xgboost
xgb.grid <- expand.grid(nrounds = 500,          #the maximum number of iterations
                        eta = c(0.01,0.1),      # shrinkage
                        max_depth = c(2,6,10))

xgb.tune <-train(x=trainX,y=trainData$Class,
                 method="xgbTree",
                 metric="ROC",
                 trControl=ctrl,
                 tuneGrid=xgb.grid)

xgb.tune$bestTune
plot(xgb.tune)  # Plot the performance of the training models
res <- xgb.tune$results
res

### xgboostModel Predictions and Performance
# Make predictions using the test data set
xgb.pred <- predict(xgb.tune,testX)

#Look at the confusion matrix
confusionMatrix(xgb.pred,testData$Class)

#Draw the ROC curve
xgb.probs <- predict(xgb.tune,testX,type="prob")
#head(xgb.probs)

xgb.ROC <- roc(predictor=xgb.probs$PS,
               response=testData$Class,
               levels=rev(levels(testData$Class)))
xgb.ROC$auc
# Area under the curve: 0.8857

plot(xgb.ROC,main="xgboost ROC")
# Plot the probability of poor segmentation
histogram(~xgb.probs$PS|testData$Class,xlab="Probability of Poor Segmentation")

# Comparing Multiple Models
# Having set the same seed before running gbm.tune and xgb.tune
# we have generated paired samples and are in a position to compare models
# using a resampling technique.
# (See Hothorn at al, "The design and analysis of benchmark experiments
# -Journal of Computational and Graphical Statistics (2005) vol 14 (3)
# pp 675-699)

rValues <- resamples(list(xgb=xgb.tune,gbm=gbm.tune))
rValues$values
summary(rValues)

bwplot(rValues,metric="ROC",main="GBM vs xgboost")   # boxplot
dotplot(rValues,metric="ROC",main="GBM vs xgboost")  # dotplot
#splom(rValues,metric="ROC")
```

by John Mount Ph. D.

Data Scientist at Win-Vector LLC

In her series on principal components analysis for regression in R, Win-Vector LLC's Dr. Nina Zumel broke the demonstration down into the following pieces:

- Part 1: the proper preparation of data and use of principal components analysis (particularly for supervised learning or regression).
- Part 2: the introduction of *y*-aware scaling to direct the principal components analysis to preserve variation correlated with the outcome we are trying to predict.
- And now Part 3: how to pick the number of components to retain for analysis.

In the earlier parts Dr. Zumel demonstrates common poor practice versus best practice and quantifies the degree of available improvement. In part 3, she moves from the usual "pick the number of components by eyeballing it" non-advice and teaches decisive decision procedures. For picking the number of components to retain for analysis there are a number of standard techniques in the literature including:

- Pick 2, as that is all you can legibly graph.
- Pick enough to cover some fixed fraction of the variation (say 95%).
- (for variance scaled data only) Retain components with singular values at least 1.0.
- Look for a "knee in the curve" (the curve being the plot of the singular value magnitudes).
- Perform a statistical test to see which singular values are larger than we would expect from an appropriate null hypothesis or noise process.

Dr. Zumel shows that the last method (designing a formal statistical test) is particularly easy to encode as a permutation test in the *y*-aware setting (there is also an obvious similarly good bootstrap test). This is well-founded and pretty much state of the art. It is also a great example of why to use a scriptable analysis platform (such as R) as it is easy to wrap arbitrarily complex methods into functions and then directly perform empirical tests on these methods. The resulting "broken stick" type test yields the following graph, which identifies five principal components as being significant:

However, Dr. Zumel goes on to show that in a supervised learning or regression setting we can further exploit the structure of the problem and replace the traditional component magnitude tests with simple model fit significance pruning. The significance method in this case gets the stronger result of finding the two principal components that encode the known even and odd loadings of the example problem:

In fact that is sort of her point: significance pruning either on the original variables or on the derived latent components is enough to give us the right answer. In general, we get much better results when (in a supervised learning or regression situation) we use knowledge of the dependent variable (the "*y*" or outcome) and do *all* of the following:

- Fit model and significance prune incoming variables.
- Convert incoming variables into consistent response units by *y*-aware scaling.
- Fit model and significance prune resulting latent components.
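The *y*-aware scaling step, combined with a permutation test on the resulting singular values, can be sketched in base R. This is only a minimal illustration on simulated data, not Dr. Zumel's code: the helper names `y_aware_scale`, `pc_sdev` and `perm_test_sv`, the simulated variables, and the choice of 200 permutations are all assumptions made for the example.

```r
# Simulated data (hypothetical): x1 and x2 drive y, x3..x6 are pure noise.
set.seed(2016)
n <- 200
x <- matrix(rnorm(n * 6), nrow = n, dimnames = list(NULL, paste0("x", 1:6)))
y <- 2 * x[, "x1"] - 1.5 * x[, "x2"] + rnorm(n)

# y-aware scaling: center each column and multiply it by the slope of the
# single-variable regression y ~ column, so that a unit change in any scaled
# column corresponds to a unit change in y.
y_aware_scale <- function(x, y) {
  slopes <- apply(x, 2, function(col) coef(lm(y ~ col))[[2]])
  centered <- scale(x, center = TRUE, scale = FALSE)
  sweep(centered, 2, slopes, `*`)
}

# Standard deviations of the principal components of the y-aware scaled data.
pc_sdev <- function(x, y) {
  prcomp(y_aware_scale(x, y), center = FALSE)$sdev
}

# Permutation test: shuffling y breaks any relation between x and y, so the
# component magnitudes from shuffled data form a null distribution.
perm_test_sv <- function(x, y, n_perm = 200) {
  observed <- pc_sdev(x, y)
  null <- replicate(n_perm, pc_sdev(x, sample(y)))  # components x permutations
  p_value <- (rowSums(null >= observed) + 1) / (n_perm + 1)
  data.frame(component = seq_along(observed), sdev = observed, p_value = p_value)
}

result <- perm_test_sv(x, y)
print(result)
```

Components whose magnitude is rarely matched under permutation are the candidates to retain. A model-fit significance prune on the latent components, as described above, would be a further step that this sketch does not attempt.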

The above will become much clearer and much more specific if you click here to read part 3.

by Joseph Rickert

R / Finance 2016 lived up to expectations and provided the quality networking and learning experience that longtime participants have come to value. Eight years is a long time for a conference to keep its sparkle and pizzazz. But, the conference organizers and the UIC have managed to create a vibe that keeps people coming back. The fact that invited keynote speakers (e.g. Bernhard Pfaff 2012, Sanjiv Das 2013, and Robert McDonald 2014) regularly submit papers in subsequent years is a testimony to the quality and networking importance of the event. My guess is that the single track format, quality presentations, intense compact schedule and pleasant venues comprise a winning formula.

Since I have recently written about the content of this year's conference in preparation for the event, and since most of the presentations are already online for you to examine directly I'll just present a few personal highlights here.

My favorite single visual from the conference is Bryan Lewis' depiction of corporate "Big Data" architectures as a manifestation of the impulse for completeness, control and dominance that once drove Soviet style central planning. (If you don't read Russian, run google translate on the text in the first panel.)

In his presentation, R in practice, at scale, Bryan presents a lightweight, R-centric architecture built around Redis that is adequate for many "big data" tasks.

Matt Dziubinski's talk on Getting the Most out of Rcpp, High-Performance C++ in Practice, is probably not a talk I would have elected to attend in a multi-track conference, and I would have missed seeing a virtuoso performance. Matt got through over 120 of his 153 prepared slides in a single, lucid stream of clear (but loud) explanations in only 20 minutes. Never stopping to pause, he gave a mini-course in computer science performance evaluation (both hardware and software aspects) that addressed the Why, What and How of it all.

Ryan Hafen's presentation, Interactively Exploring Financial Trades in R, showed how to use a tool chain built around Tessera and the NxCore R package to perform exploratory data analysis on a large NxCore data set containing approximately 1.25 billion records of 47 variables without leaving the R environment. The following slide provides an example of the kinds of insights that are possible.

In his presentation, Quantitative Analysis of Dual Moving Average Indicators in Automated Trading, Douglas Service showed how to use stochastic differential equations and the Itô calculus to derive a closed form solution for expected log returns under the Luxor trading strategy and a baseline set of simplifying assumptions. If you like seeing the math you will be pleased to see that Doug provides all of the details.

Michael Kane (glmnetlib: A Low-Level Library for Regularized Regression) discussed the motivations for continuing to improve linear models and showed the progress he is making on re-implementing glmnet which, although very efficient, does not support arbitrary link family combinations or out of memory calculations and is written in the obscure Mortran flavor of Fortran. Kane's goal with his new package (renamed pirls: Penalized, Iteratively Reweighted Least Squares Regression) is to rectify these deficiencies while producing something fast enough to use.

In his presentation Community Finance Teaching Resources with R/Shiny, Matt Brigida showed off some open source resources for teaching quantitative finance that are based on the new paradigm of GitHub as the place for tech savvy people to hang out and Shiny as the teaching / presentation tool for making calculations come alive. Check out some of Matt's 5 minute mini lessons. Here is an example from his What is Risk module:

There is much more than I have presented here on the R / Finance conference site. If you are interested in deep Finance and not just the tools I have highlighted, be sure to check out the presentations by Sanjiv Das, Bernhard Pfaff, Matthew Dixon and others. There is plenty of useful R code to be mined in these presentations too.

I would be remiss without mentioning Patrick Burns' keynote presentation which was highly entertaining, novel and thought provoking on many levels: everything a keynote should be. Pat launched his talk by referring to the Sapir-Whorf hypothesis, which posits that language controls how we think, and assigned a similar role to model building. He went on to describe his agent-inspired R simulation model and showed how he calibrated this model to provide a useful tool for investigating ideas such as risk parity, variance targeting and strategies for taxing market pollution. The code for Pat's model is available here, but since his slides are not up on the conference site, and I was apparently too mesmerized to take useful notes, we will have to wait for Pat to post more on his website. (Pat's slides should be available soon.)

Finally, I would like to note that Doug Service and Sanjiv Das won the best paper prizes. This is the second year in a row for Sanjiv to win an R / Finance award. Congratulations to both Doug and Sanjiv!

by John Mount Ph.D.

Data Scientist at Win-Vector LLC

In part 2 of her series on Principal Components Regression Dr. Nina Zumel illustrates so-called *y*-aware techniques. These often neglected methods use the fact that for predictive modeling problems we know the dependent variable, outcome or *y*, so we can use this during data preparation *in addition to* using it during modeling. Dr. Zumel shows the incorporation of *y*-aware preparation into Principal Components Analyses can capture more of the problem structure in fewer variables. Such methods include:

- Effects based variable pruning
- Significance based variable pruning
- Effects based variable scaling.

This recovers more domain structure and leads to better models. Using the foundation set in the first article Dr. Zumel quickly shows how to move from a traditional *x*-only analysis that fails to preserve a domain-specific relation of two variables to outcome to a *y*-aware analysis that preserves the relation. Or in other words how to move away from a middling result where different values of y (rendered as three colors) are hopelessly intermingled when plotted against the first two found latent variables as shown below.

Dr. Zumel shows how to perform a decisive analysis where *y* is somewhat sortable by each of the first two latent variables *and* the first two latent variables capture complementary effects, making them good mutual candidates for further modeling (as shown below).

Click here (part 2 *y*-aware methods) for the discussion, examples, and references. Part 1 (*x* only methods) can be found here.

by Yuzhou Song, Microsoft Data Scientist

R is an open source, statistical programming language with millions of users in its community. However, a well-known weakness of R is that it is both single threaded and memory bound, which limits its ability to process big data. With Microsoft R Server (MRS), the enterprise grade distribution of R for advanced analytics, users can continue to work in their preferred R environment with the following benefits: the ability to scale to data of any size, and potential speed increases of up to one hundred times over open source R.

In this article, we give a walk-through on how to build a gradient boosted tree using MRS. We use a simple fraud data set having approximately 10 million records and 9 columns. The last column “fraudRisk” is the tag: 0 stands for non-fraud and 1 stands for fraud. The following is a snapshot of the data.

**Step 1: Import the Data**

To begin, load the “RevoScaleR” package and specify the directory and name of the data file. Note that this demo runs in a local R session, so the “RevoScaleR” package has to be loaded explicitly.

library(RevoScaleR)

data.path <- "./"

file.name <- "ccFraud.csv"

fraud_csv_path <- file.path(data.path,file.name)

Next, we make a data source using RxTextData function. The output of RxTextData function is a data source declaring type of each field, name of each field and location of the data.

colClasses <- c("integer","factor","factor","integer",

"numeric","integer","integer","numeric","factor")

names(colClasses)<-c("custID","gender","state","cardholder",

"balance","numTrans","numIntlTrans",

"creditLine","fraudRisk")

fraud_csv_source <- RxTextData(file=fraud_csv_path,colClasses=colClasses)

With the data source (“fraud_csv_source” above), we are able to take it as input to other functions, just like inputting a data frame into regular R functions. For example, we can put it into rxGetInfo function to check the information of the data:

rxGetInfo(fraud_csv_source, getVarInfo = TRUE, numRows = 5)

and you will get the following:

**Step 2: Process Data**

Next, we demonstrate how to use an important function, rxDataStep, to process data before training. In general, rxDataStep transforms data from an input data set to an output data set. The function has three main arguments: inData, outFile and transformFunc. inData takes either a data source (made by RxTextData as shown in step 1) or a data frame. outFile takes a data source that specifies the output file name, schema and location. If outFile is empty, rxDataStep returns a data frame. transformFunc takes a function that rxDataStep uses to do the transformation. If that function needs arguments beyond the input data source/frame, you may specify them in the transformObjects argument.

Here, we make a training flag using rxDataStep function as an example. The data source fraud_csv_source made in step 1 will be used for inData. We create an output data source specifying output file name “ccFraudFlag.csv”:

fraud_flag_csv_source <- RxTextData(file=file.path(data.path,

"ccFraudFlag.csv"))

Also, we create a simple transformation function called “make_train_flag”. It creates a training flag which will be used to split the data into training and testing sets:

make_train_flag <- function(data){

data <- as.data.frame(data)

set.seed(34)

data$trainFlag <- sample(c(0,1),size=nrow(data),

replace=TRUE,prob=c(0.3,0.7))

return(data)

}

Then, use rxDataStep to complete the transformation:

rxDataStep(inData=fraud_csv_source,

outFile=fraud_flag_csv_source,

transformFunc=make_train_flag,overwrite = TRUE)

Again, we can check the output file information by using the rxGetInfo function:

rxGetInfo(fraud_flag_csv_source, getVarInfo = TRUE, numRows = 5)

We can see that the trainFlag column has been appended as the last column:

Based on trainFlag, we split the data into training and testing sets. To do so, we first need to specify the data sources for the output:

train_csv_source <- RxTextData(

file=file.path(data.path,"train.csv"))

test_csv_source <- RxTextData(

file=file.path(data.path,"test.csv"))

Instead of creating a transformation function, we can simply specify the rowSelection argument in rxDataStep function to select rows satisfying certain conditions:

rxDataStep(inData=fraud_flag_csv_source,

outFile=train_csv_source,

reportProgress = 0,

rowSelection = (trainFlag == 1),

overwrite = TRUE)

rxDataStep(inData=fraud_flag_csv_source,

outFile=test_csv_source,

reportProgress = 0,

rowSelection = (trainFlag == 0),

overwrite = TRUE)

A well-known problem with fraud data is the extremely skewed distribution of labels, i.e., most transactions are legitimate while only a very small proportion are fraudulent. In the original data, the good/bad ratio is about 15:1. Training a model directly on the original data results in poor performance, since the model is unable to find the proper boundary between “good” and “bad”. A simple but effective solution is to randomly down sample the majority class. The following down_sample transformation function down samples the majority class to a good/bad ratio of 4:1. The down sample ratio is pre-selected based on prior knowledge but can also be tuned by cross validation.

down_sample <- function(data){

data <- as.data.frame(data)

data_bad <- subset(data,fraudRisk == 1)

data_good <- subset(data,fraudRisk == 0)

# good to bad ratio 4:1

rate <- nrow(data_bad)*4/nrow(data_good)

set.seed(34)

data_good$keepTag <- sample(c(0,1),replace=TRUE,

size=nrow(data_good),prob=c(1-rate,rate))

data_good_down <- subset(data_good,keepTag == 1)

data_good_down$keepTag <- NULL

data_down <- rbind(data_bad,data_good_down)

data_down$trainFlag <- NULL

return(data_down)

}
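The keep probability computed inside down_sample can be checked on a small example in base R. The sketch below uses a made-up data frame (1,000 "bad" and 15,000 "good" rows, chosen only to mimic the 15:1 ratio mentioned above) and reproduces the same arithmetic without any RevoScaleR calls. The object names are hypothetical.

```r
# Hypothetical toy data with a 15:1 good/bad ratio.
set.seed(34)
toy <- data.frame(fraudRisk = c(rep(1, 1000), rep(0, 15000)))

n_bad  <- sum(toy$fraudRisk == 1)
n_good <- sum(toy$fraudRisk == 0)

# Probability of keeping each "good" row so that about 4 good rows
# remain for every bad row: (4 * n_bad) / n_good.
rate <- n_bad * 4 / n_good

# Same sampling scheme as in down_sample: an independent keep/drop draw per row.
keep <- sample(c(0, 1), size = n_good, replace = TRUE, prob = c(1 - rate, rate))

# The realized ratio is random, so it is only approximately 4.
ratio_after <- sum(keep == 1) / n_bad
print(c(rate = rate, ratio_after = ratio_after))
```

Because each row is kept or dropped independently, the resulting ratio fluctuates around 4:1 rather than hitting it exactly. When rxDataStep processes a file in chunks, the rate would presumably be computed per chunk as well.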

Then, we specify the down sampled training data source and use rxDataStep function again to complete the down sampling process:

train_downsample_csv_source <- RxTextData(

file=file.path(data.path,

"train_downsample.csv"),

colClasses = colClasses)

rxDataStep(inData=train_csv_source,

outFile=train_downsample_csv_source,

transformFunc = down_sample,

reportProgress = 0,

overwrite = TRUE)

**Step 3: Training**

In this step, we take the down sampled data to train a gradient boosted tree. We first use rxGetVarNames function to get all variable names in training set. The input is still the data source of down sampled training data. Then we use it to create a formula which will be used later:

training_vars <- rxGetVarNames(train_downsample_csv_source)

training_vars <- training_vars[!(training_vars %in%

c("fraudRisk","custID"))]

formula <- as.formula(paste("fraudRisk~",

paste(training_vars, collapse = "+")))
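To see what this formula construction produces without an MRS installation, the same steps can be run in base R on a plain character vector. The variable names below are taken from the colClasses definition in step 1; `all_vars`, `predictors` and `model_formula` are hypothetical names used only for this sketch.

```r
# Column names as declared in step 1 (stand-in for the rxGetVarNames result).
all_vars <- c("custID", "gender", "state", "cardholder", "balance",
              "numTrans", "numIntlTrans", "creditLine", "fraudRisk")

# Drop the label and the ID column, keeping only the predictors.
predictors <- setdiff(all_vars, c("fraudRisk", "custID"))

# Build the formula from a string, as in the article.
model_formula <- as.formula(paste("fraudRisk ~", paste(predictors, collapse = " + ")))
print(model_formula)

# Base R's reformulate() is an alternative that avoids manual string pasting.
same_formula <- reformulate(predictors, response = "fraudRisk")
```

Storing the result under a name other than `formula` also avoids shadowing the base R function of the same name, although either works.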

The rxBTrees function is used for building a gradient boosted tree model. The formula argument specifies the label column and the predictor columns. The data argument takes a data source as input for training. The lossFunction argument specifies the distribution of the label column, i.e., “bernoulli” for numerical 0/1 regression, “gaussian” for numerical regression, and “multinomial” for classification with two or more classes. Here we choose “multinomial” since this is a 0/1 classification problem. Other parameters are pre-selected, not finely tuned:

boosted_fit <- rxBTrees(formula = formula,

data = train_downsample_csv_source,

learningRate = 0.2,

minSplit = 10,

minBucket = 10,

# small number of tree for testing purpose

nTree = 20,

seed = 5,

lossFunction ="multinomial",

reportProgress = 0)

**Step 4: Prediction and Evaluation**

We use rxPredict function to predict on testing data set, but first use rxImport function to import testing data set:

test_data <- rxImport(test_csv_source)

Then, we take the imported testing set and fitted model object as input for rxPredict function:

predictions <- rxPredict(modelObject = boosted_fit,

data = test_data,

type = "response",

overwrite = TRUE,

reportProgress = 0)

In the rxPredict function, type="response" outputs predicted probabilities. Finally, we pick 0.5 as the threshold and evaluate the performance:

threshold <- 0.5

predictions <- data.frame(predictions$X1_prob)

names(predictions) <- c("Boosted_Probability")

predictions$Boosted_Prediction <- ifelse(

predictions$Boosted_Probability > threshold, 1, 0)

predictions$Boosted_Prediction <- factor(

predictions$Boosted_Prediction,

levels = c(1, 0))

scored_test_data <- cbind(test_data, predictions)

evaluate_model <- function(data, observed, predicted) {

confusion <- table(data[[observed]],

data[[predicted]])

print(confusion)

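# index by label (rows = observed, columns = predicted) so that the

# result does not depend on the order of the factor levels

tp <- confusion["1", "1"]

fn <- confusion["1", "0"]

fp <- confusion["0", "1"]

tn <- confusion["0", "0"]

accuracy <- (tp+tn)/(tp+fn+fp+tn)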

precision <- tp / (tp + fp)

recall <- tp / (tp + fn)

fscore <- 2*(precision*recall)/(precision+recall)

metrics <- c("Accuracy" = accuracy,

"Precision" = precision,

"Recall" = recall,

"F-Score" = fscore)

return(metrics)

}


roc_curve <- function(data, observed, predicted) {

data <- data[, c(observed, predicted)]

data[[observed]] <- as.numeric(

as.character(data[[observed]]))

rxRocCurve(actualVarName = observed,

predVarNames = predicted,

data = data)

}


boosted_metrics <- evaluate_model(data = scored_test_data,

observed = "fraudRisk",

predicted = "Boosted_Prediction")

roc_curve(data = scored_test_data,

observed = "fraudRisk",

predicted = "Boosted_Probability")

The confusion matrix:

0 1

0 2752288 67659

1 77117 101796
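The metrics returned by evaluate_model are not printed in the article, but they can be recomputed from the matrix above in base R. The sketch below assumes that rows are the observed fraudRisk values and columns are the predicted values, which the output itself does not state. If the orientation is the other way around, precision and recall swap.

```r
# Confusion matrix as printed above. ASSUMPTION: rows = observed, columns = predicted.
confusion <- matrix(c(2752288, 77117, 67659, 101796), nrow = 2,
                    dimnames = list(observed = c("0", "1"),
                                    predicted = c("0", "1")))

# Index by label so the positive class ("1" = fraud) is unambiguous.
tp <- confusion["1", "1"]
fn <- confusion["1", "0"]
fp <- confusion["0", "1"]
tn <- confusion["0", "0"]

accuracy  <- (tp + tn) / sum(confusion)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
fscore    <- 2 * precision * recall / (precision + recall)

print(round(c(Accuracy = accuracy, Precision = precision,
              Recall = recall, "F-Score" = fscore), 4))
```

Under that assumption, accuracy is dominated by the large number of correctly classified legitimate transactions, which is why precision and recall on the fraud class are the more informative figures here.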

The ROC curve (AUC = 0.95):

**Summary:**

In this article, we demonstrated how to use MRS on a fraud data set: how to create a data source with the RxTextData function, how to transform data with the rxDataStep function, how to import data with the rxImport function, how to train a gradient boosted tree model with the rxBTrees function, and how to predict with the rxPredict function.

by Joseph Rickert

When I first went to grad school, the mathematicians advised me to cultivate the habit of reading with a pencil. This turned into a lifelong habit and useful skill for reading all sorts of things: literature, reports and newspapers for example; not just technical papers. However, reading statistics and data science papers, or really anything that includes some data, considerably "ups the ante". For this sort of exercise, I need a tool to calculate, to try some variations that test my intuition and see how well I'm following the arguments. The idea here is not so much to replicate the paper but to accept the author's invitation to engage with the data and work through the analysis. Ideally, I'd want something not much more burdensome than a pencil (maybe a tablet based implementation of R), but standard R on my notebook comes pretty close to the perfect tool.

Recently, I sat down with Bradley Efron's 1987 paper "Logistic Regression, Survival Analysis, and the Kaplan-Meier Curve", the paper where he elaborates on the idea of using conditional logistic regression to estimate hazard rates and survival curves. This paper is classic Efron: drawing you in with a great story well before you realize how much work it's going to be to follow it to the end. Efron writes with a fairly informal style that encourages the reader to continue. Struggling to keep up with some of his arguments, I nevertheless get the feeling that Efron is doing his best to help me follow along, dropping hints every now and then about where to look if I lose the trail.

The basic idea of conditional logistic regression is to group the data into discrete time intervals with n_{i} patients at risk in each interval, i, and then assume that the intervals really are independent and that the s_{i} events (deaths or some other measure of "success") in each interval, follow a binomial distribution with parameters n_{i} and h_{i} where:

h_{i} = Prob(a patient dies during the i^{th} interval | the patient survives until the beginning of the i^{th} interval).

The modest goal of this post was to see if I could reproduce Efron's Figure 3, which shows survival curves for three different models for arm A of a clinical trial examining treatments for head and neck cancer. I figured that getting to Figure 3 represents the minimum amount of comprehension required to begin experimenting with conditional logistic regression.

I entered the data buried in the caption to Efron's Table 1 and was delighted when R's survival package replicated the survival times that also appear in the caption.

library(survival)

# Data for Efron's Table 1
# Enter raw data for arm A
Adays <- c(7, 34, 42, 63, 64, 74, 83, 84, 91, 108, 112, 129, 133, 133,
           139, 140, 140, 146, 149, 154, 157, 160, 160, 165, 173, 176, 185, 218,
           225, 241, 248, 273, 277, 279, 297, 319, 405, 417, 420, 440, 523, 523,
           583, 594, 1101, 1116, 1146, 1226, 1349, 1412, 1417)

# 1 = event observed, 0 = censored
Astatus <- rep(1,51)
Astatus[c(6,27,34,36,42,46,48,49,50)] <- 0

Aobj <- Surv(time = Adays, Astatus==1)
Aobj
# [1] 7 34 42 63 64 74+ 83 84 91 108 112 129 133 133
# [15] 139 140 140 146 149 154 157 160 160 165 173 176 185+ 218
# [29] 225 241 248 273 277 279+ 297 319+ 405 417 420 440 523 523+
# [43] 583 594 1101 1116+ 1146 1226+ 1349+ 1412+ 1417

Doing the same with the data from arm B of the trial led to a set of Kaplan-Meier curves that pretty much match the curves in Efron's Figure 1.

All of this was straightforward, but I was puzzled that the summary of the Kaplan-Meier curve for arm A (see KM_A in the code below) doesn't match the values for month, n_{i} and s_{i} in Efron's Table 1, until I realized these values were for the beginning of the month. To match the table I compute n_{i} by putting 51 in front of the vector KM_A$n.risk, and add a 1 to the end of the vector KM_A$n.event to get s_{i}. (See the "set up variables for models" section in the code below.)

After this, one more "trick" was required to get to Figure 3. Most of the time, I suppose those of us working with logistic regression to construct machine learning models are accustomed to specifying the outcome as a binary variable of "ones" and "zeros", or as a two level factor. But, how exactly does one specify the parameters n_{i} and s_{i} for the binomial models that comprise each outcome? After a close reading of the documentation (See the first paragraph under Details for ?glm) I was very pleased to see that glm() permits the dependent variable to be a matrix.

The formula for Efron's cubic model looks like this:

Y <- matrix(c(si, failure),n,2) # Response matrix

form <- formula(Y ~ t + t2 + t3)
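For readers who have not used a matrix response with glm() before, here is a minimal sketch of the idea. The monthly counts below are made up for illustration and are NOT the values from Efron's Table 1. The names `h_cubic`, `h_life`, `surv_cubic` and `surv_life` are hypothetical, and `failure` is assumed to mean the patients at risk who did not have an event in the interval (n_{i} - s_{i}).

```r
# Hypothetical discrete-time data: 12 monthly intervals.
t  <- 1:12                                              # interval index (months)
ni <- c(50, 47, 43, 38, 34, 30, 27, 25, 22, 20, 19, 17) # at risk at start of interval
si <- c(2, 3, 4, 3, 3, 2, 1, 2, 1, 1, 1, 0)             # events during interval
failure <- ni - si                                      # at risk but no event
n  <- length(t)
t2 <- t^2
t3 <- t^3

# Two-column response: first column counts "successes" (events here),
# second column counts "failures" (no event), one row per interval.
Y <- matrix(c(si, failure), n, 2)

# Binomial (logistic) regression of the hazard on a cubic in time.
fit <- glm(Y ~ t + t2 + t3, family = binomial)

# Fitted hazards, and the life table hazards s_i / n_i for comparison.
h_cubic <- fitted(fit)
h_life  <- si / ni

# Survival curve: probability of surviving every interval up to and including i.
surv_cubic <- cumprod(1 - h_cubic)
surv_life  <- cumprod(1 - h_life)

print(round(cbind(t, h_life, h_cubic, surv_life, surv_cubic), 3))
```

With the two-column form, each row contributes a binomial likelihood with n_{i} trials and s_{i} events, which is the grouped-data model described earlier. The smoothed hazards then turn into a survival curve through the cumulative product.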

The rest of the code is straightforward and leads to a pretty good reproduction of Figure 3.

In this figure, the black line represents the survival curve for a Life Table model where the hazard probabilities are estimated by h_{i} = s_{i} / n_{i}. The blue triangles map the survival curve for the cubic model given above, and the red curve with crosses plots a cubic spline model where h_{i} is modeled, on the logit scale, as a linear combination of t_{i}, (t_{i} - 11)^{2} and (t_{i} - 11)^{3}.

What a delightful little diagram! In addition to illustrating the technique of using a very basic statistical tool to model time-to-event data, the process leading to Figure 3 reveals something about the intuition and care a professional statistician puts into the exploratory modeling process.

There is much more in Efron's paper. What I have shown here is just the "trailer". Efron presents a careful analysis of the data for both arms of the clinical trial, goes on to study maximum likelihood estimates for conditional logistic regression models and their standard errors, and proves a result about the average ratio of the asymptotic variance between parametric and non-parametric hazard rate estimates.

Enjoy this classic paper and write some "am I reading this right" code of your own!

Here is my code for the models and plots.

by Lixun Zhang, Data Scientist at Microsoft

As a data scientist, I have experience with R. Naturally, when I was first exposed to Microsoft R Open (MRO, formerly Revolution R Open) and Microsoft R Server (MRS, formerly Revolution R Enterprise), I wanted to know the answers for 3 questions:

- What do R, MRO, and MRS have in common?
- What’s new in MRO and MRS compared with R?
- Why should I use MRO or MRS instead of R?

The publicly available information on MRS either describes it at a high level or explains the specific functions and the underlying algorithms. When they compare R, MRO, and MRS, the materials tend to be high level without many details at the functions and packages level, with which data scientists are most familiar. And they don’t answer the above questions in a comprehensive way. So I designed my own tests (and the code behind the tests is available on GitHub). Below are my answers to the three questions above. MRO has an optional MKL library and unless noted otherwise the observations hold true, whether MKL is installed on MRO or not.

After installing R, MRO, and MRS, you'll notice that everything you can do in R can be done in MRO or MRS. For example, you can use *glm()* to fit a logistic regression and *kmeans()* to carry out cluster analysis. As another example, you can install packages from CRAN. In fact, a package installed in R can be used in MRO or MRS and vice versa if the package is installed in a library tree that's shared among them. You can use the command *.libPaths()* to set and get library trees for R, MRO and MRS. Finally, you can use your favorite IDEs such as RStudio and Visual Studio with RTVS for R, MRO or MRS. In other words, MRO and MRS are 100% compatible with R in terms of functions, packages, and IDEs.
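As a small illustration of the library-tree point, the following base R sketch reads the current library trees and prepends an extra one. The directory used here is a throwaway folder under tempdir() that stands in for a shared library location. It is a hypothetical path, not one that R, MRO or MRS creates.

```r
# Get the library trees currently searched by this R session.
current_paths <- .libPaths()
print(current_paths)

# Hypothetical shared library tree (a temporary folder for this sketch).
shared_lib <- file.path(tempdir(), "shared-r-library")
dir.create(shared_lib, showWarnings = FALSE)

# Set the library trees: the shared tree is searched first.
# Note that .libPaths() silently drops directories that do not exist.
.libPaths(c(shared_lib, current_paths))
paths_after <- .libPaths()
print(paths_after)

# Restore the original setting so the session is left unchanged.
.libPaths(current_paths)
```

Pointing each installation at the same existing directory in this way is one route to the shared library tree described above. Whether a given compiled package works across installations will still depend on the R versions involved.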

While everything you do in R can be done in MRO and MRS, the reverse is not true, due to the additional components in MRO and MRS. MRO allows users to install an optional math library MKL for multithreaded performance. This library shows up as a package named *"RevoUtilsMath"* in MRO.

MRS comes with more packages and functions than R. From the package perspective, most of the additional ones are not on CRAN and are available only after installing MRS. One such example is the RevoScaleR package. MRS also installs the MKL library by default. As for functions, MRS has High Performance Analysis (HPA) versions of many base R functions, which are included in the RevoScaleR package. For example, the HPA version of *glm()* is *rxGlm()* and for *kmeans()* it is *rxKmeans()*. These HPA functions can be used in the same way as their base R counterparts, with additional options. In addition, these functions can work with a special data format (XDF) that's customized for MRS.

In a nutshell, MRS solves two problems associated with using R: capacity (handling the size of datasets and models) and speed. And MRO solves the problem associated with speed.

The following table summarizes the performance comparisons for R, MRO, and MRS. In terms of capacity, using HPA in MRS increases the size of data that can be analyzed. From the speed perspective, certain matrix related base R functions can perform better in MRO and MRS than base R due to MKL. The HPA functions in MRS perform better than their base R counterparts for large datasets. More details on this comparison can be found in the notebook on GitHub.

It should be noted that while there are packages such as *“bigmemory”* and *“ff”* that help address some of the big data problems, they were not included in the benchmark tests.

For data scientists trying to determine which of these platforms should be used under different scenarios, the following table can be used as a reference. Depending on the amount of data and the availability of MRS's HPA functions, the table summarizes scenarios where R, MRO, and MRS can be used. It can be observed that whenever R can be used, MRO can be used with the additional benefit of multi-thread computation for certain matrix related computations. And MRS can be used whenever R or MRO can be used and it allows the possibility of using HPA functions that provide better performance in terms of both speed and capacity.

Follow the link below for my in-depth comparison of R, MRO and MRS.

Lixun Zhang: Introduction to Microsoft R Open and Microsoft R Server