by Joseph Rickert

Just about two and a half years ago I wrote about some resources for doing Bayesian statistics in R. Motivated by the tutorial Modern Bayesian Tools for Time Series Analysis by Harte and Weylandt that I attended at R/Finance last month, and the upcoming tutorial An Introduction to Bayesian Inference using R Interfaces to Stan that Ben Goodrich is going to give at useR! I thought I'd look into what's new. Well, Stan is what's new! Yes, Stan has been under development and available for some time. But somehow, while I wasn't paying close attention, two things happened: (1) the rstan package evolved to make the mechanics of doing Bayesian in R analysis really easy and (2) the Stan team produced and/or organized an amazing amount of documentation.

My impressions of doing Bayesian analysis in R were set in the WinBUGS era. The separate WinBUGs installation was always tricky, and then moving between the BRugs and R2WinBUGS packages presented some additional challenges. My recent Stan experience was nothing like this. I had everything up and running in just a few minutes. The directions for getting started with rstan are clear and explicit about making sure that you have the right tool chain in place for your platform. Since I am running R 3.3.0 on Windows 10 I installed Rtools34. This went quickly and as expected except that C:\Rtools\gcc-4.x-y\bin did not show up in my path variable. Not a big deal: I used the menus in the Windows System Properties box to edit the Path statement by hand. After this, rstan installed like any other R package and I was able to run the 8schools example from the package vignette. The following 10 minute video by Ehsan Karim takes you through the install process and the vignette example.

The Stan documentation includes four major components: (1) The Stan Language Manual, (2) Examples of fully worked out problems, (3) Contributed Case Studies and (4) both slides and video tutorials. This is an incredibly rich cache of resources that makes a very credible case for the ambitious project of teaching people with some R experience both Bayesian Statistics and Stan at the same time. The "trick" here is that the documentation operates at multiple levels of sophistication with entry points for students with different backgrounds. For example, a person with some R and the modest statistics background required for approaching Gelman and Hill's extraordinary text: Data Analysis Using Regression and Multilevel/Hierarchical Models can immediately beginning running rstan code for the book's examples. To run the rstan version of the example in section 5.1, Logistic Regression with One Predictor, with no changes a student only needs only to copy the R scripts and data into her local environment. In this case, she would need the R script: 5._LogisticRegressionWithOnePredictor. R, the data: nes1992_vote.data.R and the Stan code: nes_logit.stan**.** The Stan code for this simple model is about as straightforward as it gets: variable declarations, parameter identification and the model itself.

data { | |

int<lower=0> N; | |

vector[N] income; | |

int<lower=0,upper=1> vote[N]; | |

} | |

parameters { | |

vector[2] beta; | |

} | |

model { | |

vote ~ bernoulli_logit(beta[1] + beta[2] * income); | |

} |

Running the script will produce the iconic logistic regression plot:

I'll wind down by curbing my enthusiasm just a little by pointing out that Stan is not the only game in town. JAGS is a popular alternative, and there is plenty that can be done with unaugmented R code alone as the Bayesian Inference Task View makes abundantly clear.

If you are a book person and new to Bayesian statistics, I highly recommend Bayesian Essentials with R by Jean-Michel Marin and Christian Robert. The authors provide a compact introduction to Bayesian statistics that is backed up with numerous R examples. Also, the new book by Richard McElreath, Statistical Rethinking: A Bayesian Course with Examples in R and Stan looks like it is going to be an outstanding read. The online supplements to the book are certainly worth a look.

Finally, if you are a Bayesian or a thinking about becoming one and you are going to useR!, be sure to catch the following talks:

- Bayesian analysis of generalized linear mixed models with JAGS, by Martyn Plummer
- bamdit: An R Package for Bayesian meta-Analysis of diagnostic test data by Pablo Emilio Verde
- Fitting complex Bayesian models with R-INLA and MCMC by Virgilio Gómez-Rubio
- bayesboot: An R package for easy Bayesian bootstrapping by Rasmus Arnling Bååth
- An Introduction to Bayesian Inference using R Interfaces to Stan by Ben Goodrich
- DiLeMMa - Distributed Learning with Markov Chain Monte Carlo Algorithms Using the ROAR Package by Ali Zaidi

by Lourdes O. Montenegro

*Lourdes O. Montenegro is a PhD candidate at the Lee Kuan Yew School of Public Policy, National University of Singapore. Her research interests cover the intersection of applied data science, technology, economics and public policy.*

Many of us now find it hard to live without a good quality internet connection. As a result, there is growing interest in characterizing and comparing internet performance metrics. For example, when planning to switch internet service providers or considering a move to a new city or country, internet users may want to research in advance what to expect in terms of download speed or latency. Cloud companies may want to provision adequately for different markets with varying levels of internet quality. And governments may want to benchmark their communications infrastructure and invest accordingly. Whatever the purpose, a consortium of research, industry and public interest organizations called Measurement Lab has made available the largest open and verifiable internet performance dataset in the planet. With the help of a combination of packages, R users can easily query, explore and visualize this large dataset at no cost.

In the example that follows, we use the *bigrquery* package to query and download results from the Network Diagnostic Tool (NDT) used by the U.S. FCC. The *bigrquery* package provides an interface to Google BigQuery which hosts NDT results along with several other Measurement Lab (M-Lab) datasets. However, R users will need to first set-up a BigQuery account and join the M-Lab mailing list to authenticate. Detailed instructions are provided on the M-Lab website. Once done, SQL-like queries can be run from within R. The results are saved as a dataframe on which further analysis can be performed. Aside from the convenience of working within the R environment, the *bigrquery* package has another advantage: the only limitation to the size of the query results that can be saved for further exploration is the amount of available RAM. In contrast, the BigQuery web interface only allows export to .csv format of query results which are at or below 16,000 rows.

The following R script gives us the average download speed (in Mbps) per country in 2015.[1] The SQL-like query can be modified to return other internet performance metrics that may be of interest to the R user such as upload speed, round-trip time (latency) and packet re-transmission rates.

# Querying average download speed per country in 2015 require(bigrquery) downquery_template <- "SELECT connection_spec.client_geolocation.country_code AS country, AVG(8 * web100_log_entry.snap.HCThruOctetsAcked/ (web100_log_entry.snap.SndLimTimeRwin + web100_log_entry.snap.SndLimTimeCwnd + web100_log_entry.snap.SndLimTimeSnd)) AS downloadThroughput, COUNT(DISTINCT test_id) AS tests, FROM plx.google:m_lab.ndt.all WHERE IS_EXPLICITLY_DEFINED(web100_log_entry.connection_spec.remote_ip) AND IS_EXPLICITLY_DEFINED(web100_log_entry.connection_spec.local_ip) AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.HCThruOctetsAcked) AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeRwin) AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeCwnd) AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeSnd) AND project = 0 AND IS_EXPLICITLY_DEFINED(connection_spec.data_direction) AND connection_spec.data_direction = 1 AND web100_log_entry.snap.HCThruOctetsAcked >= 8192 AND (web100_log_entry.snap.SndLimTimeRwin + web100_log_entry.snap.SndLimTimeCwnd + web100_log_entry.snap.SndLimTimeSnd) >= 9000000 AND (web100_log_entry.snap.SndLimTimeRwin + web100_log_entry.snap.SndLimTimeCwnd + web100_log_entry.snap.SndLimTimeSnd) < 3600000000 AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.CongSignals) AND web100_log_entry.snap.CongSignals > 0 AND (web100_log_entry.snap.State == 1 OR (web100_log_entry.snap.State >= 5 AND web100_log_entry.snap.State <= 11)) AND web100_log_entry.log_time >= PARSE_UTC_USEC('2015-01-01 00:00:00') / POW(10, 6) AND web100_log_entry.log_time < PARSE_UTC_USEC('2016-01-01 00:00:00') / POW(10, 6) GROUP BY country ORDER BY country ASC;" downresult <- query_exec(downquery_template, project="measurement-lab", max_pages=Inf)

Once we have the query results in a dataframe, we can proceed to visualize and map average download speeds for each country. To do this, we can use the *rworldmap* package which offers a relatively simple way to map country level and gridded user datasets. Mapping is done mainly through two functions: (1) *joinCountryData2Map* joins the query results with shapefiles of country boundaries; (2) *mapCountryData* plots the chloropleth map. Note that the join is best effected using either two or three-letter ISO country codes, although the *rworldmap* package also allows join columns filled with country names.

In order to make the chloropleth map prettier and more comprehensible, we can augment with a combination of the *classInt* package to calculate natural breaks in the range of download speed results and the *RColorBrewer* package for a wider selection of color schemes. In the succeeding R script, we specify the Jenks method to cluster download speed results in such a way that minimizes deviation from the class mean within each class but maximizes deviation across class means. Compared to other methods for clustering download speed results, the Jenks method draws a sharper picture of countries clocking greater than 25 Mbps on average.

require(rworldmap) require(classInt) require(RColorBrewer) downloadmap <- joinCountryData2Map(downresult , joinCode='ISO2' , nameJoinColumn='country' , verbose='TRUE') par(mai=c(0,0,0.2,0),xaxs="i",yaxs="i") #getting class intervals using a 'jenks' classification in classInt package classInt <- classInt::classIntervals( downloadmap[["downloadThroughput"]], n=5, style="jenks") catMethod = classInt[["brks"]] #getting a colour scheme from the RColorBrewer package colourPalette <- RColorBrewer::brewer.pal(5,'RdPu') mapParams <- mapCountryData(downloadmap, nameColumnToPlot="downloadThroughput", mapTitle="Download Speed (mbps)", catMethod=catMethod, colourPalette=colourPalette, addLegend=FALSE) do.call( addMapLegend, c(mapParams, legendWidth=0.5, legendLabels="all", legendIntervals="data", legendMar = 2))

Looking at the map, we see that the UK, Japan, Romania, Sweden, Taiwan, The Netherlands, Denmark and Singapore (if we squint!) are the best places to be for internet speed addicts. Until further investigation, we can safely discount the suspiciously high results for North Korea since the number of observations are too low. In contrast, average download speeds in South Korea might be grossly underestimated when measured from foreign servers, as may be the case with NDT results, since most Koreans access locally hosted content. There are, of course, a number of caveats worth mentioning before drawing any conclusions regarding the causes of varying internet performance between countries. Confounding factors such as distance from the client to the test server, the client's operating system, and the proportion of fixed broadband to wireless connections will need to be controlled for. Despite these caveats, this tentative exploration already reveals interesting patterns in global internet performance that is worth a closer look.

[1] Thanks to Stephen McInerney and Chris Ritzo for code advice.

The forecast package for R, created and maintained by Professor Rob Hyndman of Monash University, is one of the more useful R packages available available on CRAN. Statistical forecasting — the process of predicting the future value of a time series — is used in just about every realm of data analysis, whether it's trying to predict a future stock price or trying to anticipate changes in the weather. If you're looking to learn about forecasting, a great place to start is the online book Forecasting: Principles and Practice (by Hyndman and George Athanasopoulos) which walks you through the theory and practice, with many examples in R based on the forecast package. Topics covered include multiple regression, Time series decomposition, exponential smoothing, and ARIMA models.

The forecast package itself recently received a major update, to version 7. One major new capability is the ability to easily chart forecasts using the ggplot2 package with the new autoplot function. For example:

fc <- forecast(fdeaths) autoplot(fc)

You can also add forecasts to any ggplot using the new geom_forecasts geom provided by the forecast package:

autoplot(mdeaths) + geom_forecast(h=36, level=c(50,80,95))

There have been several updates to the forecasting functions as well. The function for fitting linear models to time series data, tslm, has been rewritten to be more compatible with the standard lm function. It's now possible to forecast means (as well as medians) when using Box-Cox transformations. And you can now apply neural networks to time series data by building a nonlinear autoregressive model with the new nnetar function.

Those are just some of the highlights of the updates to the forecast package in version 7. For complete details, follow the links to Rob Hyndman's blog, below.

Hyndsight: forecast v7 and ggplot2 graphics ; Forecast v7 (part 2) (via traims)

by Joseph Rickert

It is always a delight to discover a new and useful R package, and it is especially nice when the discovery comes with at context and testimonial to its effectiveness. It is also satisfying to be able to check in once in awhile and get an idea of what people think is hot, or current or trending in the R world. The schedule for the upcoming useR! conference at Stanford is a touchstone for both of these purposes. It lists 128 contributed talks over 3 days. Vetted for both content and quality by the program committee, these talks represent a snapshot of topics and research areas that are of current interest to R developers and researchers and also catalog the packages and tools that support the research. As best as I can determine, the abstracts for the talks explicitly call out 154 unique packages.

Some of the talks are all about introducing new R packages that have been very recently released or are still under development on Github. Others, mention the packages that represent the current suite of tools for a particular research area. Obviously, the abstracts do not list all of the packages employed by the researchers in their work, but presumably they do mention the packages that the speakers think are most important or in describing their work and attracting an audience.

The following table maps packages to talks. There are multiple lines for packages when they map to more than one talk and vice versa. For me, browsing a list like this is akin to perusing a colleague's book shelves; noting old favorites and discovering the occasional new gem.

I am reticent to try any conclusions about the importance of the packages to the talks from the abstracts alone before the talks are given. However, it is interesting to note that all of the packages mentioned more than twice are from the RStudio suite. Moreover, shiny and rmarkdown are apparently still new and shiny. Mentioning them in an abstract still conveys some cachet. I suspect that ggplot2 will figure in more analyses than both of these packages put together, but ggplot2 has become so much a part of that fabric of R that there is no premium in promoting it.

If you are looking forward to useR! 2016, whether it attend in person or to watch the videos afterwards, gaining some familiarity with the R packages highlighted in the contributed talks is sure to enhance your experience.

by Max Kuhn: Director, Nonclinical Statistics, Pfizer

Many predictive and machine learning models have structural or *tuning* parameters that cannot be directly estimated from the data. For example, when using *K*-nearest neighbor model, there is no analytical estimator for *K* (the number of neighbors). Typically, resampling is used to get good performance estimates of the model for a given set of values for *K* and the one associated with the best results is used. This is basically a grid search procedure. However, there are other approaches that can be used. I’ll demonstrate how Bayesian optimization and Gaussian process models can be used as an alternative.

To demonstrate, I’ll use the regression simulation system of Sapp et al. (2014) where the predictors (i.e. `x`

’s) are independent Gaussian random variables with mean zero and a variance of 9. The prediction equation is:

x_1 + sin(x_2) + log(abs(x_3)) + x_4^2 + x_5*x_6 + I(x_7*x_8*x_9 < 0) + I(x_10 > 0) + x_11*I(x_11 > 0) + sqrt(abs(x_12)) + cos(x_13) + 2*x_14 + abs(x_15) + I(x_16 < -1) + x_17*I(x_17 < -1) - 2 * x_18 - x_19*x_20

The random error here is also Gaussian with mean zero and a variance of 9. This simulation is available in the `caret`

package via a function called `SLC14_1`

. First, we’ll simulate a training set of 250 data points and also a larger set that we will use to elucidate the true parameter surface:

```
> library(caret)
> set.seed(7210)
> train_dat <- SLC14_1(250)
> large_dat <- SLC14_1(10000)
```

We will use a radial basis function support vector machine to model these data. For a fixed epsilon, the model will be tuned over the cost value and the radial basis kernel parameter, commonly denotes as `sigma`

. Since we are simulating the data, we can figure out a good approximation to the relationship between these parameters and the root mean squared error (RMSE) or the model. Given our specific training set and the larger simulated sample, here is the RMSE surface for a wide range of values:

There is a wide range of parameter values that are associated with very low RMSE values in the northwest.

A simple way to get an initial assessment is to use random search where a set of random tuning parameter values are generated across a “wide range”. For a RBF SVM, `caret`

’s `train`

function defines wide as cost values between `2^c(-5, 10)`

and `sigma`

values inside the range produced by the `sigest`

function in the `kernlab`

package. This code will do 20 random sub-models in this range:

```
> rand_ctrl <- trainControl(method = "repeatedcv", repeats = 5,
+ search = "random")
>
> set.seed(308)
> rand_search <- train(y ~ ., data = train_dat,
+ method = "svmRadial",
+ ## Create 20 random parameter values
+ tuneLength = 20,
+ metric = "RMSE",
+ preProc = c("center", "scale"),
+ trControl = rand_ctrl)
```

`> rand_search`

```
Support Vector Machines with Radial Basis Function Kernel
250 samples
20 predictor
Pre-processing: centered (20), scaled (20)
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 226, 224, 224, 225, 226, 224, ...
Resampling results across tuning parameters:
sigma C RMSE Rsquared
0.01161955 42.75789360 10.50838 0.7299837
0.01357777 67.97672171 10.71276 0.7212605
0.01392676 828.08072944 10.75235 0.7195869
0.01394119 0.18386619 18.56921 0.2109284
0.01538656 0.05224914 19.33310 0.1890599
0.01711920 228.59215128 11.09522 0.7047713
0.01790202 0.78835920 16.78597 0.3217203
0.01936110 0.91401289 16.45485 0.3492278
0.02023763 0.07658831 19.03987 0.2081059
0.02690269 0.04128731 19.33974 0.2126950
0.02780880 0.64865483 16.52497 0.3545042
0.02920113 974.08943821 12.22906 0.6508754
0.02963586 1.19350198 15.46690 0.4407725
0.03370625 31.45179445 12.60653 0.6314384
0.03561750 0.04970422 19.23564 0.2306298
0.03752561 0.06592800 19.07130 0.2375616
0.03783570 398.44599747 12.92958 0.6143790
0.04534046 3.91017571 13.56612 0.5798001
0.05171719 296.65916049 13.88865 0.5622445
0.06482201 47.31716568 14.66904 0.5192667
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.01161955 and C
= 42.75789.
```

`> ggplot(rand_search) + scale_x_log10() + scale_y_log10()`

> getTrainPerf(rand_search)

```
TrainRMSE TrainRsquared method
1 10.50838 0.7299837 svmRadial
```

There are other approaches that we can take, including a more comprehensive grid search or using a nonlinear optimizer to find better values of cost and `sigma`

. Another approach is to use Bayesian optimization to find good values for these parameters. This is an optimization scheme that uses Bayesian models based on Gaussian processes to predict good tuning parameters.

Gaussian Process (GP) regression is used to facilitate the Bayesian analysis. If creates a regression model to formalize the relationship between the outcome (RMSE, in this application) and the SVM tuning parameters. The standard assumption regarding normality of the residuals is used and, being a Bayesian model, the regression parameters also gain a prior distribution that is multivariate normal. The GP regression model uses a kernel basis expansion (much like the SVM model does) in order to allow the model to be nonlinear in the SVM tuning parameters. To do this, a radial basis function kernel is used for the covariance function of the multivariate normal prior and maximum likelihood is used to estimate the kernel parameters of the GP.

In the end, the GP regression model can take the current set of resampled RMSE values and make predictions over the entire space of potential cost and `sigma`

parameters. The Bayesian machinery allows of this prediction to have a *distribution*; for a given set of tuning parameters, we can obtain the estimated mean RMSE values as well as an estimate of the corresponding prediction variance. For example, if we were to use our data from the random search to build a GP model, the predicted mean RMSE would look like:

The darker regions indicate smaller RMSE values given the current resampling results. The predicted standard deviation of the RMSE is:

The prediction noise becomes larger (e.g. darker) as we move away from the current set of observed values.

(The `GPfit`

package was used to create these models.)

To find good parameters to test, there are several approaches. This paper (pdf) outlines several but we will use the *confidence bound* approach. For any combination of cost and `sigma`

, we can compute the lower confidence bound of the predicted RMSE. Since this takes the uncertainty of prediction into account it has the potential to produce better directions to take the optimization. Here is a plot of the confidence bound using a single standard deviation of the predicted mean:

Darker values indicate better conditions to explore. Since we know the true RMSE surface, we can see that the best region (the northwest) is estimated to be an interesting location to take the optimization. The optimizer would pick a good location based on this model and evaluate this as the next parameter value. This most recent configuration is added to the GP’s training set and the process continues for a pre-specified number of iterations.

Yachen Yan created an R package for Bayesian optimization. He also made a modification so that we can use our initial random search as the substrate to the first GP used. To search a much wider parameter space, our code looks like:

```
> ## Define the resampling method
> ctrl <- trainControl(method = "repeatedcv", repeats = 5)
>
> ## Use this function to optimize the model. The two parameters are
> ## evaluated on the log scale given their range and scope.
> svm_fit_bayes <- function(logC, logSigma) {
+ ## Use the same model code but for a single (C, sigma) pair.
+ txt <- capture.output(
+ mod <- train(y ~ ., data = train_dat,
+ method = "svmRadial",
+ preProc = c("center", "scale"),
+ metric = "RMSE",
+ trControl = ctrl,
+ tuneGrid = data.frame(C = exp(logC), sigma = exp(logSigma)))
+ )
+ ## The function wants to _maximize_ the outcome so we return
+ ## the negative of the resampled RMSE value. `Pred` can be used
+ ## to return predicted values but we'll avoid that and use zero
+ list(Score = -getTrainPerf(mod)[, "TrainRMSE"], Pred = 0)
+ }
>
> ## Define the bounds of the search.
> lower_bounds <- c(logC = -5, logSigma = -9)
> upper_bounds <- c(logC = 20, logSigma = -0.75)
> bounds <- list(logC = c(lower_bounds[1], upper_bounds[1]),
+ logSigma = c(lower_bounds[2], upper_bounds[2]))
>
> ## Create a grid of values as the input into the BO code
> initial_grid <- rand_search$results[, c("C", "sigma", "RMSE")]
> initial_grid$C <- log(initial_grid$C)
> initial_grid$sigma <- log(initial_grid$sigma)
> initial_grid$RMSE <- -initial_grid$RMSE
> names(initial_grid) <- c("logC", "logSigma", "Value")
>
> ## Run the optimization with the initial grid and do
> ## 30 iterations. We will choose new parameter values
> ## using the upper confidence bound using 1 std. dev.
>
> library(rBayesianOptimization)
>
> set.seed(8606)
> ba_search <- BayesianOptimization(svm_fit_bayes,
+ bounds = bounds,
+ init_grid_dt = initial_grid,
+ init_points = 0,
+ n_iter = 30,
+ acq = "ucb",
+ kappa = 1,
+ eps = 0.0,
+ verbose = TRUE)
```

```
20 points in hyperparameter space were pre-sampled
elapsed = 1.53 Round = 21 logC = 5.4014 logSigma = -5.8974 Value = -10.8148
elapsed = 1.54 Round = 22 logC = 4.9757 logSigma = -5.0449 Value = -9.7936
elapsed = 1.42 Round = 23 logC = 5.7551 logSigma = -5.0244 Value = -9.8128
elapsed = 1.30 Round = 24 logC = 5.2754 logSigma = -4.9678 Value = -9.7530
elapsed = 1.39 Round = 25 logC = 5.3009 logSigma = -5.0921 Value = -9.5516
elapsed = 1.48 Round = 26 logC = 5.3240 logSigma = -5.2313 Value = -9.6571
elapsed = 1.39 Round = 27 logC = 5.3750 logSigma = -5.1152 Value = -9.6619
elapsed = 1.44 Round = 28 logC = 5.2356 logSigma = -5.0969 Value = -9.4167
elapsed = 1.38 Round = 29 logC = 11.8347 logSigma = -5.1074 Value = -9.6351
elapsed = 1.42 Round = 30 logC = 15.7494 logSigma = -5.1232 Value = -9.4243
elapsed = 25.24 Round = 31 logC = 14.6657 logSigma = -7.9164 Value = -8.8410
elapsed = 32.60 Round = 32 logC = 18.3793 logSigma = -8.1083 Value = -8.7139
elapsed = 1.86 Round = 33 logC = 20.0000 logSigma = -5.6297 Value = -9.0580
elapsed = 0.97 Round = 34 logC = 20.0000 logSigma = -1.5768 Value = -19.2183
elapsed = 5.92 Round = 35 logC = 17.3827 logSigma = -6.6880 Value = -9.0224
elapsed = 18.01 Round = 36 logC = 20.0000 logSigma = -7.6071 Value = -8.5728
elapsed = 114.49 Round = 37 logC = 16.0079 logSigma = -9.0000 Value = -8.7058
elapsed = 89.31 Round = 38 logC = 12.8319 logSigma = -9.0000 Value = -8.6799
elapsed = 99.29 Round = 39 logC = 20.0000 logSigma = -9.0000 Value = -8.5596
elapsed = 106.88 Round = 40 logC = 14.1190 logSigma = -9.0000 Value = -8.5150
elapsed = 4.84 Round = 41 logC = 13.4694 logSigma = -6.5271 Value = -8.9728
elapsed = 108.37 Round = 42 logC = 19.0216 logSigma = -9.0000 Value = -8.7461
elapsed = 52.43 Round = 43 logC = 13.5273 logSigma = -8.5130 Value = -8.8728
elapsed = 39.69 Round = 44 logC = 20.0000 logSigma = -8.3288 Value = -8.4956
elapsed = 5.99 Round = 45 logC = 20.0000 logSigma = -6.7208 Value = -8.9455
elapsed = 113.01 Round = 46 logC = 14.9611 logSigma = -9.0000 Value = -8.7576
elapsed = 27.45 Round = 47 logC = 19.6181 logSigma = -7.9872 Value = -8.6186
elapsed = 116.00 Round = 48 logC = 17.3060 logSigma = -9.0000 Value = -8.6820
elapsed = 2.26 Round = 49 logC = 14.2698 logSigma = -5.8297 Value = -9.1837
elapsed = 64.50 Round = 50 logC = 20.0000 logSigma = -8.6438 Value = -8.6914
Best Parameters Found:
Round = 44 logC = 20.0000 logSigma = -8.3288 Value = -8.4956
```

Animate the search!

The final settings were found at iteration 44 with a cost setting of 485,165,195 and `sigma`

=0.0002043. I would have never thought to evaluate a cost parameter so large and the algorithm wants to make it even larger. Does it really work?

We can fit a model based on the new configuration and compare it to random search in terms of the resampled RMSE and the RMSE on the test set:

```
> set.seed(308)
> final_search <- train(y ~ ., data = train_dat,
+ method = "svmRadial",
+ tuneGrid = data.frame(C = exp(ba_search$Best_Par["logC"]),
+ sigma = exp(ba_search$Best_Par["logSigma"])),
+ metric = "RMSE",
+ preProc = c("center", "scale"),
+ trControl = ctrl)
> compare_models(final_search, rand_search)
```

```
One Sample t-test
data: x
t = -9.0833, df = 49, p-value = 4.431e-12
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-2.34640 -1.49626
sample estimates:
mean of x
-1.92133
```

`> postResample(predict(rand_search, large_dat), large_dat$y)`

```
RMSE Rsquared
10.1112280 0.7648765
```

`> postResample(predict(final_search, large_dat), large_dat$y)`

```
RMSE Rsquared
8.2668843 0.8343405
```

Much Better!

Thanks to Yachen Yan for making the `rBayesianOptimization`

package.

by Joseph Rickert

The model table on the caret package website lists more that 200 variations of predictive analytics models that are available withing the caret framework. All of these models may be prepared, tuned, fit and evaluated with a common set of caret functions. All on its own, the table is an impressive testament to the utility and scope of the R language as data science tool.

For the past year or so xgboost, the extreme gradient boosting algorithm, has been getting a lot of attention. The code below compares gbm with xgboost using the segmentationData set that comes with caret. The analysis presented here is far from the last word on comparing these models, but it does show how one might go about setting up a serious comparison using caret's functions to sweep through parameter space using parallel programming, and then used synchronized bootstrap samples to make a detailed comparison.

After reading in the data and dividing it into training and test data sets, caret's trainControl() and expand.grid() functions are used to set up to train the gbm model on all of the combinations of represented in the data frame built by expand.grid(). Then train() function does the actual training and fitting of the model. Notice that all of this happens in parallel. The Task Manager on my Windows 10 laptop shows all four cores maxed out at 100%.

After model fitting, predictions on the test data are computed and an ROC curve is drawn in the usual way. The AUC for gbm was computed to be 0.8731. Here is the ROC curve.

Next, a similar process for xgboost computes the AUC to be 0.8857, a fair improvement. The following plot shows how the ROC measure behaves with increasing tree depth for the two different values of the shrinkage parameter.

The final section of code shows how to caret can be used to compare the two models using the bootstrap samples that were created in the process of constructing the two models. The boxplots show xgboost has the edge although the gbm has a tighter distribution.

The next step, which I hope to take soon, is to rerun the analysis with more complete grids of tuning parameters. For a very accessible introduction to caret have a look at Max Kuhn's 2013 useR! tutorial.

#COMPARE XGBOOST with GBM ### Packages Required library(caret) library(corrplot) # plot correlations library(doParallel) # parallel processing library(dplyr) # Used by caret library(gbm) # GBM Models library(pROC) # plot the ROC curve library(xgboost) # Extreme Gradient Boosting ### Get the Data # Load the data and construct indices to divied it into training and test data sets. data(segmentationData) # Load the segmentation data set dim(segmentationData) head(segmentationData,2) # trainIndex <- createDataPartition(segmentationData$Case,p=.5,list=FALSE) trainData <- segmentationData[trainIndex,-c(1,2)] testData <- segmentationData[-trainIndex,-c(1,2)] # trainX <-trainData[,-1] # Pull out the dependent variable testX <- testData[,-1] sapply(trainX,summary) # Look at a summary of the training data ## GENERALIZED BOOSTED RGRESSION MODEL (BGM) # Set up training control ctrl <- trainControl(method = "repeatedcv", # 10fold cross validation number = 5, # do 5 repititions of cv summaryFunction=twoClassSummary, # Use AUC to pick the best model classProbs=TRUE, allowParallel = TRUE) # Use the expand.grid to specify the search space # Note that the default search grid selects multiple values of each tuning parameter grid <- expand.grid(interaction.depth=c(1,2), # Depth of variable interactions n.trees=c(10,20), # Num trees to fit shrinkage=c(0.01,0.1), # Try 2 values for learning rate n.minobsinnode = 20) # set.seed(1951) # set the seed # Set up to do parallel processing registerDoParallel(4) # Registrer a parallel backend for train getDoParWorkers() gbm.tune <- train(x=trainX,y=trainData$Class, method = "gbm", metric = "ROC", trControl = ctrl, tuneGrid=grid, verbose=FALSE) # Look at the tuning results # Note that ROC was the performance criterion used to select the optimal model. gbm.tune$bestTune plot(gbm.tune) # Plot the performance of the training models res <- gbm.tune$results res ### GBM Model Predictions and Performance # Make predictions using the test data set gbm.pred <- predict(gbm.tune,testX) #Look at the confusion matrix confusionMatrix(gbm.pred,testData$Class) #Draw the ROC curve gbm.probs <- predict(gbm.tune,testX,type="prob") head(gbm.probs) gbm.ROC <- roc(predictor=gbm.probs$PS, response=testData$Class, levels=rev(levels(testData$Class))) gbm.ROC$auc #Area under the curve: 0.8731 plot(gbm.ROC,main="GBM ROC") # Plot the propability of poor segmentation histogram(~gbm.probs$PS|testData$Class,xlab="Probability of Poor Segmentation") ##---------------------------------------------- ## XGBOOST # Some stackexchange guidance for xgboost # http://stats.stackexchange.com/questions/171043/how-to-tune-hyperparameters-of-xgboost-trees # Set up for parallel procerssing set.seed(1951) registerDoParallel(4,cores=4) getDoParWorkers() # Train xgboost xgb.grid <- expand.grid(nrounds = 500, #the maximum number of iterations eta = c(0.01,0.1), # shrinkage max_depth = c(2,6,10)) xgb.tune <-train(x=trainX,y=trainData$Class, method="xgbTree", metric="ROC", trControl=ctrl, tuneGrid=xgb.grid) xgb.tune$bestTune plot(xgb.tune) # Plot the performance of the training models res <- xgb.tune$results res ### xgboostModel Predictions and Performance # Make predictions using the test data set xgb.pred <- predict(xgb.tune,testX) #Look at the confusion matrix confusionMatrix(xgb.pred,testData$Class) #Draw the ROC curve xgb.probs <- predict(xgb.tune,testX,type="prob") #head(xgb.probs) xgb.ROC <- roc(predictor=xgb.probs$PS, response=testData$Class, levels=rev(levels(testData$Class))) xgb.ROC$auc # Area under the curve: 0.8857 plot(xgb.ROC,main="xgboost ROC") # Plot the propability of poor segmentation histogram(~xgb.probs$PS|testData$Class,xlab="Probability of Poor Segmentation") # Comparing Multiple Models # Having set the same seed before running gbm.tune and xgb.tune # we have generated paired samples and are in a position to compare models # using a resampling technique. # (See Hothorn at al, "The design and analysis of benchmark experiments # -Journal of Computational and Graphical Statistics (2005) vol 14 (3) # pp 675-699) rValues <- resamples(list(xgb=xgb.tune,gbm=gbm.tune)) rValues$values summary(rValues) bwplot(rValues,metric="ROC",main="GBM vs xgboost") # boxplot dotplot(rValues,metric="ROC",main="GBM vs xgboost") # dotplot #splom(rValues,metric="ROC")

by Joseph Rickert

R / Finance 2016 lived up to expectations and provided the quality networking and learning experience that longtime participants have come to value. Eight years is a long time for a conference to keep its sparkle and pizzazz. But, the conference organizers and the UIC have managed to create a vibe that keeps people coming back. The fact that invited keynote speakers (e.g. Bernhard Pfaff 2012, Sanjiv Das 2013, and Robert McDonald 2014) regularly submit papers in subsequent years is a testimony to the quality and networking importance of the event. My guess is that the single track format, quality presentations, intense compact schedule and pleasant venues comprise a winning formula.

Since I have recently written about the content of this year's conference in preparation for the event, and since most of the presentations are already online for you to examine directly I'll just present a few personal highlights here.

My favorite single visual from the conference is Bryan Lewis' depiction of corporate "Big Data" architectures as a manifestation of the impulse for completeness, control and dominance that once drove Soviet style central planning. (If you don't read Russian, run google translate on the text in the first panel.)

In his presentation, R in practice, at scale, Bryan presents a lightweight, R-centric architecture built around Redis that is adequate fro many "big data" tasks.

Matt Dziubinski's talk on Getting the Most out of Rcpp, High-Performance C++ in Practice, is probably not a talk I would have elected to attend in a multi-track conference, and I would have missed seeing a virtuoso performance. Matt got through over 120 of his 153 prepared slides in a single, lucid stream of clear (but loud) explanations in only 20 minutes. Never stopping to pause, he gave a mini-course in computer science performance evaluation (both hardware and software aspects) that addressed the Why, What and How of it all.

Ryan Hafen's presentation, Interactively Exploring Financial Trades in R, showed how to use a tool chain built around Tessera and the NxCore R package to perform exploratory data analysis on a large NxCore data set containing approximately 1.25 billion records of 47 variables without leaving the R environment. The following slide provides an example of the kinds of insights that are possible.

In his presentation, Quantitative Analysis of Dual Moving Average Indicators in Automated Trading, Douglas Service showed how to use stochastic differential equations and the Itô calculus to derive a closed form solution for expected Log returns under the Luxor trading strategy and a baseline set of simplifying assumptions. If you like seeing the Math you will be pleased to see that Doug provides all of the details.

Michael Kane (glmnetlib: A Low-Level Library for Regularized Regression) discussed the motivations for continuing to improve linear models and showed the progress he is making on re-implementing glmnet which, although very efficient, does not support arbitrary link family combinations or out of memory calculations and is written in the obscure Mortran flavor of Fortran. Kane's goal with his new package (renamed pirls: Penalized, Iteratively Reweighted Least Squares Regression) is to rectify these deficiencies while producing something fast enough to use.

In his presentation Community Finance Teaching Resources with R/Shiny, Matt Brigida showed off some opensource resources for teaching quantitative finance that are based on the new paradigm of GitHub as the place for tech savvy people to hang out and Shiny as the teaching / presentation tool for making calculations come alive. Check out some of Matt's 5 minute mini lessons. Here is an example from his What is Risk module:

There is much more than I have presented here on the R / Finance conference site. If you are interested in deep Finance and not just the tools I have highlighted, be sure to check out the presentations by Sanjiv Das, Bernhard Pfaff, Matthew Dixon and others. There is plenty of useful R code to be mined in these presentations too.

I would be remiss without mentioning Patrick Burns' keynote presentation which was highly entertaining, novel and thought provoking on many levels: everything a keynote should be. Pat launched his talk by referring to the Sapir-Worf hypothesis which posits that language controls how we think and assigned a similar role to model building. He went on to describe his Agent inspired R simulation model and showed how he calibrated this model to provide a useful tool for investigating ideas such as risk parity, variance targeting and strategies for taxing market pollution. The code for Pat's model is available here, but since his slides are not up on the conference site, and I was apparently too mesmerized to take useful notes, we will have to wait for Pat to post more on his website. (Pat's slides should be available soon.)

Finally, I would like to note that Doug Service and Sanjiv Das won the best paper prizes. This is the second year in a row for Sanjiv to win an R / Finance award. Congratulations to both Doug and Sanjiv!

Unlike most other statistical software packages, R doesn't have a native data file format. You can certainly import and export data in any number of formats, but there's no native "R data file format". The closest equivalent is the `saveRDS`

/`loadRDS`

function pair, which allows you to serialize an R object to a file and then load it back into a later R session. But these files don't hew to a standardized format (it's essentially a dump of R in-memory representation of the object), and so you can't read the data with any software other than R.

The goal of the feather project, a collaboration of Wes McKinney and Hadley Wickham, is to create a standard data file format that can be used for data exchange by and between R, Python, and any other software that implements its open-source format. Data are stored in a computer-native binary format, which makes the files small (a 10-digit integer takes just 4 bytes, instead of the 10 ASCII characters required by a CSV file), and fast to read and write (no need to convert numbers to text and back again). Another reason why feather is fast is that it's a *column-oriented* file format, which matches R's internal representation of data. (In fact, feather is based on the Apache Arrow framework for working with columnar data stores.) When reading or writing traditional data files with R, it must spend signfican time translating the data from column format to row format and back again; with feather the entire second step in the process below is eliminated.

For users of R 3.3.0 and later, the feather package is now available on CRAN. (Users of older versions of R can install feather from GitHub.) With feather installed, you can read and write R data frames to feather files using simple functions:

`write_feather(mtcars. "mtcars.feather")`

`mtcars2 <- read_feather("mtcars.feather")`

Better yet, the mtcars.feather file can easily be read into Python, using its feather-format package. This example uses the small built-in mtcars data frame, but you should see a significant performance impact when working with larger data. Eduardo Ariño de la Rubia performed some benchmarking of feather, and found it to be significantly faster for ingesting data than other popular R functions. The chart below compares using feather, the data.table package, and `loadRDS`

to import 508Mb file of 8.5 million rows and 7 columns:

Feather wasn't the fastest function benchmarked for *writing* data — data.table's `fwrite`

function generally performed a bit better — but given that you typically read a file more often than writing it, the speedups should be very noticable in day-to-day data science activites.

For more on the feather package, check out its announcement from the RStudio blog linked below.

RStudio blog: Feather: A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow

by Joseph Rickert

What makes for a good R package? With over 8,000 packages up on CRAN the quantity of packages is clearly not an issue for R users. Developing an instinct to recognize quality, however, both requires and deserves some effort. I regularly spend time on Dirk Eddelbuettel’s CRANberries site investigating new packages and monitoring changes in old favorites in order to recommend packages for inclusion in MRAN’s Package Spotlight page. As a consequence, I think I’m getting a feel for quality and I believe it comes down to this: *A good R package clearly says what it does and then really does what it says.*

It should not be surprising that documentation is the key. For an R package, the first obvious place for an author to provide quality documentation is the vignette. Hadley Wickham writes:

A vignette is a long-form guide to your package. . . A vignette is like a book chapter or an academic paper: it can describe the problem that your package is designed to solve, and then show the reader how to solve it. . . Vignettes are also useful if you want to explain the details of your package. For example, if you have implemented a complex statistical algorithm, you might want to describe all the details in a vignette so that users of your package can understand what’s going on under the hood, and be confident that you’ve implemented the algorithm correctly.

Unfortunately, less than 25% of all R packages have vignettes!

As a person who is in the habit of writing, I find it astounding that anyone would do all the creative work to develop the contents of a package and then make the effort to get it through the CRAN process and release into the open source domain without taking the basic step to explain its value, and maybe even discretely sing its praises. Nevertheless, most package authors, even some who have otherwise done great work balk at writing engaging documentation.

Although they are not visible in the plot above, there are a few packages whose authors have made extravagant attempts at documentation. The following table lists all packages with 10 or more vignettes.

Name Date Vignettes

1 caschrono 2014-03-21 10

2 catdata 2014-11-11 45

3 copula 2015-10-26 13

4 gamclass 2015-08-20 11

5 ggvis 2015-06-06 10

6 HSAUR 2015-07-28 17

7 HSAUR2 2015-07-28 19

8 HSAUR3 2015-07-29 22

9 Sleuth2 2016-01-08 13

10 Sleuth3 2016-01-08 13

11 tigerstats 2015-09-23 20

(Yes, catdata does really have 45 vignettes, but the package is the documentation for a book.)

To be fair and thorough though, I should mention that are some pretty lame vignettes out there. I sometimes do find myself putting a new package on the short list for a Spotlight evaluation just because it has a vignette, only to be disappointed when it I get around to looking at it.

I should also note that writing a vignette is neither the only way, nor the most sophisticated way to document an R package. Top shelf packages implementing a statistical algorithm or new computational method are often complemented up by a paper published in the Journal of Statistical Software or some other peer reviewed publication. kernlab, for example, uses a version of its JSS paper as a vignette. Other packages, such as statnet are backed up by informative websites, and it is also becoming common for packages authors to provide links to their GitHub development sites. See data.table's GitHub page, for example, or Dirk's Rcpp CRAN page which provides links to extensive documentation listing multiple vignettes as well as multiple websites. I find it particularly convenient that the package pdf links to these sites and also to the supporting JSS paper.

So, what about the other part of my definition: A good R package ... really does what it says. I find that good documentation does correlate positively with good code, but beyond that the best way to make a quick assessment of the quality of a package is to see if it is included in any of the CRAN Task Views. These are lists of packages organized and curated by experts who make heroic efforts to keep them current and comprehensive, if not complete. Amazingly, 27% of CRAN packages are listed in at least one Task View! This is higher than the percentage of packages that have vignettes.

MRAN provides a lookup feature that is convenient for checking a package's quality potential. With one click you can see the names of any vignettes that may be associated with a package along with any Task Views in which it may be listed.

Of course, the crucial step for checking quality and utility of a package is to try the package yourself. Doing this also provides an opportunity for you the user to contribute to the common good by just using the package in your work, talking about its merits, and maybe even providing constructive feedback to the author if you think that would be helpful.

The data for this post comes from a JSON file available on the MRAN website. The code below was used for the post.

library(jsonlite) library(ggplot2) # Read in package data as JSON file and form into a data frame json_file <- "https://mran.revolutionanalytics.com/packagedata/allpackages.json" json_data <- as.data.frame(fromJSON(paste(readLines(json_file),collapse=""))) dF <- json_data[,c(1,2,6,7)] names(dF) <- c("Name","Date","Task_View","Vignettes") dF$Vignettes <- ifelse(is.na(dF$Vignettes)==TRUE,0,dF$Vignettes) dF$Task_View <- ifelse(is.na(dF$Task_View)==TRUE,"None",dF$Task_View) head(dF) # Look at some summary statistics summary(dF$Vignettes) table(dF$Vignettes) # Find number of vignettes have_vig <- sum(dF$Vignettes>0) have_vig #1961 pct_vig <- have_vig / length(dF$Vignettes) pct_vig #0.2353859 # Find package with >= 10 Task Views dF[dF$Vignettes>=10,] # Modify data frame for printing dF10 <- dF[dF$Vignettes>=10, -3] row.names(dF10) <- NULL dF10 # Plot histogram p <- ggplot(dF, aes(x=Vignettes)) p + geom_histogram(binwidth=1) + ggtitle("The sad tale but long tail of vignettes") # Find number of packages in task views have_TV <- dF[dF$Task_View!="None",] pct_TV <- dim(have_TV)[1] / length(dF$Vignettes) pct_TV #0.2724763

by Andrie de Vries

A few weeks ago I wrote about the growth of CRAN packages, where I demonstrated how to scrape CRAN archives to get an estimate of the number of packages over time. In this post I briefly mentioned that the Ecdat package contains a dataset, CRANpackages, with snapshots recorded by John Fox and Spencer Graves.

Here is a plot of the data they collected. The dataset contains data through 2014, so I manually added the package count as of today (8,329).

In my previous post, I asked the question: "are there indications that the contribution rate is steady, accelerating or decelerating?"

This hints to analysis by John Fox where he says "The number of packages on CRAN ... has grown roughly exponentially, with residuals from the exponential trend ... showing a recent decline in the rate of growth" (Fox, 2009).

In my previous post Using segmented regression to analyse world record running times I used segmented regression to estimate a model that is piece-wise linear.

I used the same process to fit a segmented regression line through the CRAN package data.

By default the segmented package fits a single break point through the data. The results of this analysis indicates a break point occurring some time during 2008. This is entirely consistent with the observation by John Fox that the rate of growth is slowing down.

However, note that the the segmented regression line doesn't fit the data very well during the period 2008 to 2012.

With a small amount of extra work you can fit segmented models with multiple break points. To do this, you simply have to specify initial values for the search. Here I show the results of a simple model with two break points. This model finds the first break point during 2007 and the second break point during 2011.

Natural systems can not maintain exponential growth forever. There are always some limits on the system that will ultimately inhibit any further growth. This is why many systems display some kind of sigmoid curve, or S curve.

Although the growth curve of CRAN packages shows signs of slowing down, it does not seem as if there is an inflexion point in the data. An inflexion point is where the curve transitions from being convex to being concave.

Thus it seems the grown of CRAN packages will appear to be exponential for quite some time in the future!

As usual, here is the R code I used.