by Joseph Rickert

New R packages keep rolling into CRAN at a prodigious rate: 184 in May, 195 in June, and July looks like it will continue the trend. I spent some time sorting through them and have picked out a few that are interesting from a data science point of view.

ANLP provides functions for building text prediction models. It contains functions for cleaning text data, building N-grams and more. The vignette works through an example.

fakeR generates fake data based on a given data set. Factors are sampled from contingency tables and numerical data is sampled from a multivariate normal distribution. The method works reasonably well for small data sets, producing fake data with the same correlation structure as the original. The following example uses the simulate_dataset() function on a subset of the mtcars data set. The correlation plots look pretty much identical.

```
library(fakeR)
library(corrplot)

data(mtcars)
df1 <- mtcars[, c(1:4, 6)]
set.seed(77)
df2 <- as.data.frame(simulate_dataset(df1))

par(mfrow = c(1, 2))
corrplot(cor(df1), method = "ellipse")
corrplot(cor(df2), method = "ellipse")
```

Note, however, that the simulated data might need some processing before being used in a model: the last row of the simulated df2 data set contains a negative value for displacement. fakeR contains functions for both time-dependent and time-independent variables. There is a vignette.
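That kind of clean-up can be done with a line of base R. A minimal sketch (the data frame below is a made-up stand-in for simulated output; column names are illustrative):

```r
# Hypothetical post-processing of simulated data: clamp negative values
# of a physically non-negative variable (displacement) at zero.
df2 <- data.frame(mpg = c(21.0, 18.1), disp = c(160, -12.3), hp = c(110, 95))
df2$disp <- pmax(df2$disp, 0)
df2$disp  # the negative displacement is now 0
```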

heatmaply produces interactive heatmaps. More than just being attractive graphics these increase the already considerable value of the heatmap as an exploratory analysis tool. The following code from the package vignette shows the correlation structure of variables in the mtcars data set.

```
library(heatmaply)

heatmaply(cor(mtcars),
          k_col = 2, k_row = 2,
          limits = c(-1, 1)) %>%
  layout(margin = list(l = 40, b = 40))
```

mscstexta4r provides an R client for the Microsoft Cognitive Services Text Analytics REST API, a suite of text analytics web services built with Azure Machine Learning that can be used to analyze unstructured text. The vignette explains the text processing capabilities as well as how to get going with it.

mvtboost extends the GBM model to fit boosted decision trees to multivariate continuous variables. The algorithm jointly fits models with multiple outcomes over a common set of predictors. According to the authors this makes it possible to: (1) choose the number of trees and shrinkage to jointly minimize prediction error in a test set over all outcomes, (2) compare tree models across outcomes, and (3) estimate the "covariance explained" by predictors in pairs of outcomes. For this last point, the authors explain:

The covariance explained matrix can then be organized in a p × q(q+1)/2 table, where q is the number of outcomes and p is the number of predictors. Each element is the covariance explained by a predictor for any pair of outcomes. When the outcomes are standardized to unit variance, each element can be interpreted as the correlation explained in any pair of outcomes by a predictor. Like the R² of the linear model, this decomposition is unambiguous only if the predictors are independent. Below we show the original covariance explained matrix.
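As a quick check on those dimensions: with q = 4 outcomes there are q(q+1)/2 = 10 outcome pairs (including each outcome paired with itself), so p = 10 predictors give a 10 × 10 table:

```r
# Dimensions of the covariance-explained table: p predictors by
# q(q+1)/2 outcome pairs (illustrative values of p and q).
q <- 4
p <- 10
n_pairs <- q * (q + 1) / 2  # number of outcome pairs, including self-pairs
c(p, n_pairs)               # the table is p rows by n_pairs columns
```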

The theory is explained in the paper "Finding Structure in Data". There are two vignettes: the mpg example and the Well-being example.

In a classification problem, label noise refers to the incorrect labeling of training instances. The NoiseFiltersR package presents an extensive collection of label noise filtering algorithms, along with impressive documentation and references for each algorithm. For example, the documentation for the Generalized Edition (GE) filter, a similarity-based filter, is organized like this:

The package and its vignette comprise a valuable resource for machine learning.

preprosim describes itself as a "lightweight data quality simulation for classification". It contains functions to add noise, missing values, outliers, irrelevant features and other transformations that are useful in evaluating classification accuracy. The vignette provides a small example of how missing values and noise can affect accuracy.

polmineR provides text mining tools for the analysis of large corpora using the IMS Open Corpus Workbench (CWB) as the backend. There is a vignette to get you started.

RFinfer provides functions that use the infinitesimal jackknife to generate predictions and prediction variances from random forest models. There are Introduction and Jackknife vignettes.

Finally, there are two packages I recommend installing just for the sake of sanity.

gaussfacts provides random "facts" about Carl Friedrich Gauss. One or two of these and you should feel fine again.

rmsfact invokes random quotes from Richard M. Stallman. To be used when gaussfacts doesn't quite do it for you.

by Konstantin Golyaev, Data Scientist at Microsoft

I recently participated in an internal one-day Microsoft R Server (MRS) hackathon. For an experienced base R user but a complete MRS novice, this turned out to be an interesting challenge. R has a fantastic and unparalleled set of tools for exploratory data analysis, as long as your data set is small enough to fit comfortably in memory. In particular, the dplyr package offers concise and expressive data manipulation semantics, and is wildly popular among R users.

Unlike base R, MRS is able to handle large datasets that do not fit in memory. This is achieved by splitting the data into chunks that are small enough to fit in RAM, applying operations sequentially to every chunk, and storing the results, as well as the data, on disk using the binary xdf (eXtended Data Frame) format. The downside to this flexibility is the need to use MRS-specific functions from the RevoScaleR package, something that would have definitely slowed me down. Thankfully, the dplyrXdf package bridges the syntax gap between base R and MRS, and enables out-of-memory operations on large datasets with the same dplyr syntax. In what follows, I will describe the problem I chose to address and share the code used to obtain the solution.
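To give a flavor of what this looks like, here is a hedged sketch of dplyr verbs running against an out-of-memory xdf file (the file name and column names are hypothetical; assumes RevoScaleR and dplyrXdf are installed):

```r
library(dplyrXdf)  # loads dplyr and builds on RevoScaleR

# Hypothetical xdf file with columns Origin, DayOfWeek, ArrDelay
flights <- RxXdfData("ontime.xdf")

# The same pipeline syntax you would use on a data frame,
# executed chunk-wise on disk rather than in memory
late <- flights %>%
  filter(Origin %in% c("SEA", "BOS", "SFO", "SJC")) %>%
  group_by(Origin, DayOfWeek) %>%
  summarise(mean_delay = mean(ArrDelay))
```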

Our data science team at Microsoft is split across locations in Redmond, the Bay Area, and Boston. For this reason, I decided to investigate whether planes departing from the corresponding major airports tend to arrive on time. To do this, I used the On-Time Performance Dataset from the Research and Innovative Technology Administration of the Bureau of Transportation Statistics. This dataset spans 26 years, 1987 through 2012, and is fairly large: over 148 million records, or 14 GB of raw information. The combined dataset can be obtained from the Revolution Analytics website. While it is definitely possible to work with such a dataset on a single machine using base R, unless one has access to a really powerful machine (e.g. an Azure VM Standard_DS14), it would have taken a while, and we only had several hours for the hackathon.

Instead, I used 2012 data for prototyping the code, and then leveraged the out-of-memory capabilities of MRS to execute the same code on the entire dataset. I tend to be a heavy dplyr user, which makes prototyping with small data a breeze. Unfortunately, dplyr did not work with MRS xdf datasets back then (and it still does not today), which would mean that I would have had to rewrite most of my code in a way that MRS would understand. Thankfully, the dplyrXdf package made it entirely unnecessary; I relied on it as a translation mechanism that relayed my dplyr-style logic to MRS.

The data analysis itself was quite straightforward, given the time constraints of the hackathon. I provisioned a Windows Data Science Virtual Machine on Azure, which comes with a trial version of MRS. I focused on the four major airports that our team uses most frequently: Seattle (SEA), Boston (BOS), San Francisco (SFO), and San Jose (SJC). I grouped the last two into a single entity called SVC (for ‘Silicon Valley Campus’). I eliminated a number of records with missing data, as well as the few records where arrival time was not within the two-hour window of scheduled time. I grouped the flights by departure day of week, as well as departure time, which I bucketed into four six-hour long groups: from midnight to 6:00 AM, from 6:01 AM to noon, and so on. The code for the analysis can be obtained from my GitHub repository.
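The bucketing step above can be sketched in base R with cut(); here scheduled departure times are assumed to be hhmm integers, and the bucket boundaries are illustrative:

```r
# Bucket departure times (hhmm integers, e.g. 1735) into four
# six-hour groups, as described in the text.
dep_time <- c(30, 615, 1130, 1735, 2359)
dep_hour <- dep_time %/% 100
bucket <- cut(dep_hour, breaks = c(-1, 5, 11, 17, 23),
              labels = c("00:00-06:00", "06:01-12:00",
                         "12:01-18:00", "18:01-24:00"))
table(bucket)
```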

Once the data was properly partitioned, I computed frequencies of untimely arrivals within each cell, and plotted the results on a grid. In total, I obtained 84 plots – 3 locations, 7 days a week, 4 departure time groups. The vertical red line on each plot indicates the time when the flights were supposed to arrive. This means any probability mass to the right of the red line is bad news, while mass to the left of the red line is good news.

The first thing to notice from the SEA and SVC plots is that, unless you travel on a plane that departs between midnight and 6:00 AM, you are not very likely to arrive late.

**Frequency of late arrivals for Seattle (SEA)**

It appears that “red-eye” flights from the West Coast are more likely to arrive later than scheduled, and being late by as much as 90 minutes is not out of the question for such flights.

**Frequency of late arrivals for San Francisco (SFO) and San Jose (SJC)**

For flights originating from BOS, however, the story is rather different. Namely, the flights leaving after 6:00 PM are also reasonably likely to be late, in addition to flights leaving between midnight and 6:00 AM. In fact, the “red-eyes” from BOS have a weird bi-modal distribution of relative arrival time: they are either 30 minutes too early or 90 minutes late. This is something to keep in mind if you travel from BOS and have a connecting flight on the way to your final destination.

**Frequency of late arrivals for Boston (BOS)**

In conclusion, a combination of Windows Data Science VM with MRS together with dplyrXdf turned out to be a powerhouse, for three reasons. First, it was trivial to get all the tools up and running, by provisioning the VM and installing dplyrXdf from GitHub. Second, the MRS component allowed me to scale operations to a pretty large dataset without having to worry about the implementation details. Third, the dplyrXdf component made me very productive because it eliminated the need to learn MRS-specific commands and let me use my dplyr skills.

by Joseph Rickert

Just about two and a half years ago I wrote about some resources for doing Bayesian statistics in R. Motivated by the tutorial Modern Bayesian Tools for Time Series Analysis by Harte and Weylandt that I attended at R/Finance last month, and the upcoming tutorial An Introduction to Bayesian Inference using R Interfaces to Stan that Ben Goodrich is going to give at useR!, I thought I'd look into what's new. Well, Stan is what's new! Yes, Stan has been under development and available for some time. But somehow, while I wasn't paying close attention, two things happened: (1) the rstan package evolved to make the mechanics of doing Bayesian analysis in R really easy and (2) the Stan team produced and/or organized an amazing amount of documentation.

My impressions of doing Bayesian analysis in R were set in the WinBUGS era. The separate WinBUGS installation was always tricky, and then moving between the BRugs and R2WinBUGS packages presented some additional challenges. My recent Stan experience was nothing like this. I had everything up and running in just a few minutes. The directions for getting started with rstan are clear and explicit about making sure that you have the right tool chain in place for your platform. Since I am running R 3.3.0 on Windows 10 I installed Rtools34. This went quickly and as expected, except that C:\Rtools\gcc-4.x-y\bin did not show up in my path variable. Not a big deal: I used the menus in the Windows System Properties box to edit the Path statement by hand. After this, rstan installed like any other R package and I was able to run the 8schools example from the package vignette. The following 10 minute video by Ehsan Karim takes you through the install process and the vignette example.

The Stan documentation includes four major components: (1) The Stan Language Manual, (2) Examples of fully worked out problems, (3) Contributed Case Studies and (4) both slides and video tutorials. This is an incredibly rich cache of resources that makes a very credible case for the ambitious project of teaching people with some R experience both Bayesian statistics and Stan at the same time. The "trick" here is that the documentation operates at multiple levels of sophistication, with entry points for students with different backgrounds. For example, a person with some R and the modest statistics background required for approaching Gelman and Hill's extraordinary text, Data Analysis Using Regression and Multilevel/Hierarchical Models, can immediately begin running rstan code for the book's examples. To run the rstan version of the example in section 5.1, Logistic Regression with One Predictor, with no changes, a student needs only to copy the R scripts and data into her local environment. In this case, she would need the R script 5._LogisticRegressionWithOnePredictor.R, the data nes1992_vote.data.R and the Stan code nes_logit.stan. The Stan code for this simple model is about as straightforward as it gets: variable declarations, parameter identification and the model itself.

```
data {
  int<lower=0> N;
  vector[N] income;
  int<lower=0,upper=1> vote[N];
}
parameters {
  vector[2] beta;
}
model {
  vote ~ bernoulli_logit(beta[1] + beta[2] * income);
}
```

Running the script will produce the iconic logistic regression plot:
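For readers who prefer to call Stan directly rather than through the book's script, a minimal rstan sketch (the data file is assumed to define N, income, and vote when sourced, as R dump files do):

```r
library(rstan)

# nes1992_vote.data.R is an R dump file defining N, income, and vote
source("nes1992_vote.data.R")

fit <- stan(file = "nes_logit.stan",
            data = list(N = N, income = income, vote = vote),
            iter = 2000, chains = 4)
print(fit, pars = "beta")
```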

I'll wind down by curbing my enthusiasm just a little by pointing out that Stan is not the only game in town. JAGS is a popular alternative, and there is plenty that can be done with unaugmented R code alone as the Bayesian Inference Task View makes abundantly clear.

If you are a book person and new to Bayesian statistics, I highly recommend Bayesian Essentials with R by Jean-Michel Marin and Christian Robert. The authors provide a compact introduction to Bayesian statistics that is backed up with numerous R examples. Also, the new book by Richard McElreath, Statistical Rethinking: A Bayesian Course with Examples in R and Stan looks like it is going to be an outstanding read. The online supplements to the book are certainly worth a look.

Finally, if you are a Bayesian, or thinking about becoming one, and you are going to useR!, be sure to catch the following talks:

- Bayesian analysis of generalized linear mixed models with JAGS, by Martyn Plummer
- bamdit: An R Package for Bayesian meta-Analysis of diagnostic test data by Pablo Emilio Verde
- Fitting complex Bayesian models with R-INLA and MCMC by Virgilio Gómez-Rubio
- bayesboot: An R package for easy Bayesian bootstrapping by Rasmus Arnling Bååth
- An Introduction to Bayesian Inference using R Interfaces to Stan by Ben Goodrich
- DiLeMMa - Distributed Learning with Markov Chain Monte Carlo Algorithms Using the ROAR Package by Ali Zaidi

by Lourdes O. Montenegro

*Lourdes O. Montenegro is a PhD candidate at the Lee Kuan Yew School of Public Policy, National University of Singapore. Her research interests cover the intersection of applied data science, technology, economics and public policy.*

Many of us now find it hard to live without a good quality internet connection. As a result, there is growing interest in characterizing and comparing internet performance metrics. For example, when planning to switch internet service providers or considering a move to a new city or country, internet users may want to research in advance what to expect in terms of download speed or latency. Cloud companies may want to provision adequately for different markets with varying levels of internet quality. And governments may want to benchmark their communications infrastructure and invest accordingly. Whatever the purpose, a consortium of research, industry and public interest organizations called Measurement Lab has made available the largest open and verifiable internet performance dataset on the planet. With the help of a combination of packages, R users can easily query, explore and visualize this large dataset at no cost.

In the example that follows, we use the *bigrquery* package to query and download results from the Network Diagnostic Tool (NDT) used by the U.S. FCC. The *bigrquery* package provides an interface to Google BigQuery, which hosts NDT results along with several other Measurement Lab (M-Lab) datasets. However, R users will need to first set up a BigQuery account and join the M-Lab mailing list to authenticate. Detailed instructions are provided on the M-Lab website. Once done, SQL-like queries can be run from within R. The results are saved as a data frame on which further analysis can be performed. Aside from the convenience of working within the R environment, the *bigrquery* package has another advantage: the only limitation on the size of the query results that can be saved for further exploration is the amount of available RAM. In contrast, the BigQuery web interface only allows query results of 16,000 rows or fewer to be exported to .csv format.

The following R script gives us the average download speed (in Mbps) per country in 2015.[1] The SQL-like query can be modified to return other internet performance metrics that may be of interest to the R user such as upload speed, round-trip time (latency) and packet re-transmission rates.

```
# Querying average download speed per country in 2015
require(bigrquery)

downquery_template <- "SELECT
  connection_spec.client_geolocation.country_code AS country,
  AVG(8 * web100_log_entry.snap.HCThruOctetsAcked /
      (web100_log_entry.snap.SndLimTimeRwin +
       web100_log_entry.snap.SndLimTimeCwnd +
       web100_log_entry.snap.SndLimTimeSnd)) AS downloadThroughput,
  COUNT(DISTINCT test_id) AS tests,
FROM plx.google:m_lab.ndt.all
WHERE IS_EXPLICITLY_DEFINED(web100_log_entry.connection_spec.remote_ip)
  AND IS_EXPLICITLY_DEFINED(web100_log_entry.connection_spec.local_ip)
  AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.HCThruOctetsAcked)
  AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeRwin)
  AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeCwnd)
  AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeSnd)
  AND project = 0
  AND IS_EXPLICITLY_DEFINED(connection_spec.data_direction)
  AND connection_spec.data_direction = 1
  AND web100_log_entry.snap.HCThruOctetsAcked >= 8192
  AND (web100_log_entry.snap.SndLimTimeRwin +
       web100_log_entry.snap.SndLimTimeCwnd +
       web100_log_entry.snap.SndLimTimeSnd) >= 9000000
  AND (web100_log_entry.snap.SndLimTimeRwin +
       web100_log_entry.snap.SndLimTimeCwnd +
       web100_log_entry.snap.SndLimTimeSnd) < 3600000000
  AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.CongSignals)
  AND web100_log_entry.snap.CongSignals > 0
  AND (web100_log_entry.snap.State == 1
       OR (web100_log_entry.snap.State >= 5
           AND web100_log_entry.snap.State <= 11))
  AND web100_log_entry.log_time >= PARSE_UTC_USEC('2015-01-01 00:00:00') / POW(10, 6)
  AND web100_log_entry.log_time < PARSE_UTC_USEC('2016-01-01 00:00:00') / POW(10, 6)
GROUP BY country
ORDER BY country ASC;"

downresult <- query_exec(downquery_template, project = "measurement-lab",
                         max_pages = Inf)
```

Once we have the query results in a data frame, we can proceed to visualize and map average download speeds for each country. To do this, we can use the *rworldmap* package, which offers a relatively simple way to map country-level and gridded user datasets. Mapping is done mainly through two functions: (1) *joinCountryData2Map* joins the query results with shapefiles of country boundaries; (2) *mapCountryData* plots the choropleth map. Note that the join is best effected using either two- or three-letter ISO country codes, although the *rworldmap* package also allows join columns filled with country names.

In order to make the choropleth map prettier and more comprehensible, we can augment it with a combination of the *classInt* package, to calculate natural breaks in the range of download speed results, and the *RColorBrewer* package, for a wider selection of color schemes. In the succeeding R script, we specify the Jenks method to cluster download speed results in a way that minimizes deviation from the class mean within each class but maximizes deviation across class means. Compared to other methods for clustering download speed results, the Jenks method draws a sharper picture of countries clocking greater than 25 Mbps on average.

```
require(rworldmap)
require(classInt)
require(RColorBrewer)

downloadmap <- joinCountryData2Map(downresult,
                                   joinCode = 'ISO2',
                                   nameJoinColumn = 'country',
                                   verbose = 'TRUE')

par(mai = c(0, 0, 0.2, 0), xaxs = "i", yaxs = "i")

# getting class intervals using a 'jenks' classification in classInt package
classInt <- classInt::classIntervals(downloadmap[["downloadThroughput"]],
                                     n = 5, style = "jenks")
catMethod <- classInt[["brks"]]

# getting a colour scheme from the RColorBrewer package
colourPalette <- RColorBrewer::brewer.pal(5, 'RdPu')

mapParams <- mapCountryData(downloadmap,
                            nameColumnToPlot = "downloadThroughput",
                            mapTitle = "Download Speed (Mbps)",
                            catMethod = catMethod,
                            colourPalette = colourPalette,
                            addLegend = FALSE)
do.call(addMapLegend, c(mapParams, legendWidth = 0.5, legendLabels = "all",
                        legendIntervals = "data", legendMar = 2))
```

Looking at the map, we see that the UK, Japan, Romania, Sweden, Taiwan, The Netherlands, Denmark and Singapore (if we squint!) are the best places to be for internet speed addicts. Until further investigation, we can safely discount the suspiciously high results for North Korea, since the number of observations is too low. In contrast, average download speeds in South Korea might be grossly underestimated when measured from foreign servers, as may be the case with NDT results, since most Koreans access locally hosted content. There are, of course, a number of caveats worth mentioning before drawing any conclusions regarding the causes of varying internet performance between countries. Confounding factors such as distance from the client to the test server, the client's operating system, and the proportion of fixed broadband to wireless connections will need to be controlled for. Despite these caveats, this tentative exploration already reveals interesting patterns in global internet performance that are worth a closer look.

[1] Thanks to Stephen McInerney and Chris Ritzo for code advice.

The forecast package for R, created and maintained by Professor Rob Hyndman of Monash University, is one of the more useful R packages available on CRAN. Statistical forecasting — the process of predicting the future value of a time series — is used in just about every realm of data analysis, whether it's trying to predict a future stock price or trying to anticipate changes in the weather. If you're looking to learn about forecasting, a great place to start is the online book Forecasting: Principles and Practice (by Hyndman and George Athanasopoulos), which walks you through the theory and practice with many examples in R based on the forecast package. Topics covered include multiple regression, time series decomposition, exponential smoothing, and ARIMA models.

The forecast package itself recently received a major update, to version 7. One major new capability is the ability to easily chart forecasts using the ggplot2 package with the new autoplot function. For example:

fc <- forecast(fdeaths)
autoplot(fc)

You can also add forecasts to any ggplot using the new geom_forecast geom provided by the forecast package:

autoplot(mdeaths) + geom_forecast(h=36, level=c(50,80,95))

There have been several updates to the forecasting functions as well. The function for fitting linear models to time series data, tslm, has been rewritten to be more compatible with the standard lm function. It's now possible to forecast means (as well as medians) when using Box-Cox transformations. And you can now apply neural networks to time series data by building a nonlinear autoregressive model with the new nnetar function.
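For example, a neural network autoregression can now be fit and forecast in a couple of lines. A sketch using a built-in series (assumes forecast version 7 or later):

```r
library(forecast)

fit <- nnetar(lynx)          # fit a nonlinear autoregressive neural net
fc <- forecast(fit, h = 10)  # forecast 10 steps ahead
autoplot(fc)                 # chart the forecasts with ggplot2
```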

Those are just some of the highlights of the updates to the forecast package in version 7. For complete details, follow the links to Rob Hyndman's blog, below.

Hyndsight: forecast v7 and ggplot2 graphics; Forecast v7 (part 2)

by Joseph Rickert

It is always a delight to discover a new and useful R package, and it is especially nice when the discovery comes with a context and a testimonial to its effectiveness. It is also satisfying to be able to check in once in a while and get an idea of what people think is hot, current, or trending in the R world. The schedule for the upcoming useR! conference at Stanford is a touchstone for both of these purposes. It lists 128 contributed talks over 3 days. Vetted for both content and quality by the program committee, these talks represent a snapshot of topics and research areas that are of current interest to R developers and researchers, and also catalog the packages and tools that support the research. As best as I can determine, the abstracts for the talks explicitly call out 154 unique packages.

Some of the talks are all about introducing new R packages that have been very recently released or are still under development on GitHub. Others mention the packages that represent the current suite of tools for a particular research area. Obviously, the abstracts do not list all of the packages employed by the researchers in their work, but presumably they do mention the packages that the speakers think are most important in describing their work and attracting an audience.

The following table maps packages to talks. There are multiple lines for packages when they map to more than one talk and vice versa. For me, browsing a list like this is akin to perusing a colleague's book shelves; noting old favorites and discovering the occasional new gem.

I am reluctant to draw any conclusions about the importance of the packages to the talks from the abstracts alone before the talks are given. However, it is interesting to note that all of the packages mentioned more than twice are from the RStudio suite. Moreover, shiny and rmarkdown are apparently still new and shiny: mentioning them in an abstract still conveys some cachet. I suspect that ggplot2 will figure in more analyses than both of these packages put together, but ggplot2 has become so much a part of the fabric of R that there is no premium in promoting it.

If you are looking forward to useR! 2016, whether to attend in person or to watch the videos afterwards, gaining some familiarity with the R packages highlighted in the contributed talks is sure to enhance your experience.

by Max Kuhn: Director, Nonclinical Statistics, Pfizer

Many predictive and machine learning models have structural or *tuning* parameters that cannot be directly estimated from the data. For example, when using a *K*-nearest neighbor model, there is no analytical estimator for *K* (the number of neighbors). Typically, resampling is used to get good performance estimates of the model for a given set of values for *K*, and the value associated with the best results is used. This is basically a grid search procedure. However, there are other approaches that can be used. I'll demonstrate how Bayesian optimization and Gaussian process models can be used as an alternative.
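For contrast with what follows, a conventional resampled grid search over *K* might look like this short caret sketch (the grid and data set here are illustrative, not from the analysis below):

```r
library(caret)

# Grid search: evaluate odd values of K via 10-fold cross-validation
set.seed(1)
knn_fit <- train(Species ~ ., data = iris, method = "knn",
                 tuneGrid = expand.grid(k = seq(1, 21, by = 2)),
                 trControl = trainControl(method = "cv", number = 10))
knn_fit$bestTune  # the K chosen by resampling
```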

To demonstrate, I'll use the regression simulation system of Sapp et al. (2014), where the predictors (i.e. the `x`'s) are independent Gaussian random variables with mean zero and a variance of 9. The prediction equation is:

```
x_1 + sin(x_2) + log(abs(x_3)) + x_4^2 + x_5*x_6 +
  I(x_7*x_8*x_9 < 0) + I(x_10 > 0) + x_11*I(x_11 > 0) +
  sqrt(abs(x_12)) + cos(x_13) + 2*x_14 + abs(x_15) +
  I(x_16 < -1) + x_17*I(x_17 < -1) - 2*x_18 - x_19*x_20
```

The random error here is also Gaussian with mean zero and a variance of 9. This simulation is available in the `caret` package via a function called `SLC14_1`. First, we'll simulate a training set of 250 data points and also a larger set that we will use to elucidate the true parameter surface:

```
> library(caret)
> set.seed(7210)
> train_dat <- SLC14_1(250)
> large_dat <- SLC14_1(10000)
```

We will use a radial basis function support vector machine to model these data. For a fixed epsilon, the model will be tuned over the cost value and the radial basis kernel parameter, commonly denoted as `sigma`. Since we are simulating the data, we can figure out a good approximation to the relationship between these parameters and the root mean squared error (RMSE) of the model. Given our specific training set and the larger simulated sample, here is the RMSE surface for a wide range of values:

There is a wide range of parameter values that are associated with very low RMSE values in the northwest.

A simple way to get an initial assessment is to use random search, where a set of random tuning parameter values is generated across a "wide range". For an RBF SVM, `caret`'s `train` function defines wide as cost values between `2^c(-5, 10)` and `sigma` values inside the range produced by the `sigest` function in the `kernlab` package. This code will fit 20 random sub-models in this range:

```
> rand_ctrl <- trainControl(method = "repeatedcv", repeats = 5,
+ search = "random")
>
> set.seed(308)
> rand_search <- train(y ~ ., data = train_dat,
+ method = "svmRadial",
+ ## Create 20 random parameter values
+ tuneLength = 20,
+ metric = "RMSE",
+ preProc = c("center", "scale"),
+ trControl = rand_ctrl)
```

`> rand_search`

```
Support Vector Machines with Radial Basis Function Kernel
250 samples
20 predictor
Pre-processing: centered (20), scaled (20)
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 226, 224, 224, 225, 226, 224, ...
Resampling results across tuning parameters:
sigma C RMSE Rsquared
0.01161955 42.75789360 10.50838 0.7299837
0.01357777 67.97672171 10.71276 0.7212605
0.01392676 828.08072944 10.75235 0.7195869
0.01394119 0.18386619 18.56921 0.2109284
0.01538656 0.05224914 19.33310 0.1890599
0.01711920 228.59215128 11.09522 0.7047713
0.01790202 0.78835920 16.78597 0.3217203
0.01936110 0.91401289 16.45485 0.3492278
0.02023763 0.07658831 19.03987 0.2081059
0.02690269 0.04128731 19.33974 0.2126950
0.02780880 0.64865483 16.52497 0.3545042
0.02920113 974.08943821 12.22906 0.6508754
0.02963586 1.19350198 15.46690 0.4407725
0.03370625 31.45179445 12.60653 0.6314384
0.03561750 0.04970422 19.23564 0.2306298
0.03752561 0.06592800 19.07130 0.2375616
0.03783570 398.44599747 12.92958 0.6143790
0.04534046 3.91017571 13.56612 0.5798001
0.05171719 296.65916049 13.88865 0.5622445
0.06482201 47.31716568 14.66904 0.5192667
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.01161955 and C
= 42.75789.
```

`> ggplot(rand_search) + scale_x_log10() + scale_y_log10()`

`> getTrainPerf(rand_search)`

```
TrainRMSE TrainRsquared method
1 10.50838 0.7299837 svmRadial
```

There are other approaches that we can take, including a more comprehensive grid search or using a nonlinear optimizer to find better values of cost and `sigma`. Another approach is to use Bayesian optimization to find good values for these parameters. This is an optimization scheme that uses Bayesian models based on Gaussian processes to predict good tuning parameters.

Gaussian process (GP) regression is used to facilitate the Bayesian analysis. It creates a regression model to formalize the relationship between the outcome (RMSE, in this application) and the SVM tuning parameters. The standard assumption regarding normality of the residuals is used and, this being a Bayesian model, the regression parameters also gain a prior distribution that is multivariate normal. The GP regression model uses a kernel basis expansion (much like the SVM model does) in order to allow the model to be nonlinear in the SVM tuning parameters. To do this, a radial basis function kernel is used for the covariance function of the multivariate normal prior, and maximum likelihood is used to estimate the kernel parameters of the GP.

In the end, the GP regression model can take the current set of resampled RMSE values and make predictions over the entire space of potential cost and `sigma` parameters. The Bayesian machinery allows this prediction to have a *distribution*; for a given set of tuning parameters, we can obtain the estimated mean RMSE as well as an estimate of the corresponding prediction variance. For example, if we were to use our data from the random search to build a GP model, the predicted mean RMSE would look like:

The darker regions indicate smaller RMSE values given the current resampling results. The predicted standard deviation of the RMSE is:

The prediction noise becomes larger (i.e., darker) as we move away from the current set of observed values.

(The `GPfit` package was used to create these models.)

To find good parameters to test, there are several approaches. This paper (pdf) outlines several, but we will use the *confidence bound* approach. For any combination of cost and `sigma`, we can compute the lower confidence bound of the predicted RMSE. Since this takes the uncertainty of prediction into account, it has the potential to produce better directions for the optimization. Here is a plot of the confidence bound using a single standard deviation of the predicted mean:
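To make the idea concrete, here is a rough sketch (not the code used for the figures) of fitting a GP to the random-search results with the `GPfit` package and computing the one-standard-deviation lower confidence bound over a grid of candidates. The unit-cube rescaling and the grid size are my own choices; `GP_fit` expects its inputs scaled to [0, 1].

```
library(GPfit)

## Assumes rand_search$results holds the sampled C, sigma and resampled RMSE.
## Work on the log scale and rescale the predictors to the unit cube.
obs <- rand_search$results
X   <- cbind(logC = log(obs$C), logSigma = log(obs$sigma))
X01 <- apply(X, 2, function(x) (x - min(x)) / diff(range(x)))
gp_mod <- GP_fit(X01, obs$RMSE)

## Predict mean and variance over a grid of candidate tuning parameters
grid01 <- as.matrix(expand.grid(logC = seq(0, 1, length = 50),
                                logSigma = seq(0, 1, length = 50)))
pred <- predict(gp_mod, grid01)

## One-std.-dev. lower confidence bound on the predicted RMSE; the
## candidate minimizing this bound is the next point to evaluate
lcb <- pred$Y_hat - sqrt(pred$MSE)
grid01[which.min(lcb), ]
```

Minimizing the lower bound, rather than the predicted mean alone, is what lets the optimizer trade off exploring uncertain regions against exploiting the current best estimate.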

Darker values indicate better conditions to explore. Since we know the true RMSE surface, we can see that the best region (the northwest) is estimated to be an interesting location to take the optimization. The optimizer would pick a good location based on this model and evaluate this as the next parameter value. This most recent configuration is added to the GP’s training set and the process continues for a pre-specified number of iterations.

Yachen Yan created an R package for Bayesian optimization. He also made a modification so that we can use our initial random search as the substrate to the first GP used. To search a much wider parameter space, our code looks like:

```
> ## Define the resampling method
> ctrl <- trainControl(method = "repeatedcv", repeats = 5)
>
> ## Use this function to optimize the model. The two parameters are
> ## evaluated on the log scale given their range and scope.
> svm_fit_bayes <- function(logC, logSigma) {
+ ## Use the same model code but for a single (C, sigma) pair.
+ txt <- capture.output(
+ mod <- train(y ~ ., data = train_dat,
+ method = "svmRadial",
+ preProc = c("center", "scale"),
+ metric = "RMSE",
+ trControl = ctrl,
+ tuneGrid = data.frame(C = exp(logC), sigma = exp(logSigma)))
+ )
+ ## The function wants to _maximize_ the outcome so we return
+ ## the negative of the resampled RMSE value. `Pred` can be used
+ ## to return predicted values but we'll avoid that and use zero
+ list(Score = -getTrainPerf(mod)[, "TrainRMSE"], Pred = 0)
+ }
>
> ## Define the bounds of the search.
> lower_bounds <- c(logC = -5, logSigma = -9)
> upper_bounds <- c(logC = 20, logSigma = -0.75)
> bounds <- list(logC = c(lower_bounds[1], upper_bounds[1]),
+ logSigma = c(lower_bounds[2], upper_bounds[2]))
>
> ## Create a grid of values as the input into the BO code
> initial_grid <- rand_search$results[, c("C", "sigma", "RMSE")]
> initial_grid$C <- log(initial_grid$C)
> initial_grid$sigma <- log(initial_grid$sigma)
> initial_grid$RMSE <- -initial_grid$RMSE
> names(initial_grid) <- c("logC", "logSigma", "Value")
>
> ## Run the optimization with the initial grid and do
> ## 30 iterations. We will choose new parameter values
> ## using the upper confidence bound using 1 std. dev.
>
> library(rBayesianOptimization)
>
> set.seed(8606)
> ba_search <- BayesianOptimization(svm_fit_bayes,
+ bounds = bounds,
+ init_grid_dt = initial_grid,
+ init_points = 0,
+ n_iter = 30,
+ acq = "ucb",
+ kappa = 1,
+ eps = 0.0,
+ verbose = TRUE)
```

```
20 points in hyperparameter space were pre-sampled
elapsed = 1.53 Round = 21 logC = 5.4014 logSigma = -5.8974 Value = -10.8148
elapsed = 1.54 Round = 22 logC = 4.9757 logSigma = -5.0449 Value = -9.7936
elapsed = 1.42 Round = 23 logC = 5.7551 logSigma = -5.0244 Value = -9.8128
elapsed = 1.30 Round = 24 logC = 5.2754 logSigma = -4.9678 Value = -9.7530
elapsed = 1.39 Round = 25 logC = 5.3009 logSigma = -5.0921 Value = -9.5516
elapsed = 1.48 Round = 26 logC = 5.3240 logSigma = -5.2313 Value = -9.6571
elapsed = 1.39 Round = 27 logC = 5.3750 logSigma = -5.1152 Value = -9.6619
elapsed = 1.44 Round = 28 logC = 5.2356 logSigma = -5.0969 Value = -9.4167
elapsed = 1.38 Round = 29 logC = 11.8347 logSigma = -5.1074 Value = -9.6351
elapsed = 1.42 Round = 30 logC = 15.7494 logSigma = -5.1232 Value = -9.4243
elapsed = 25.24 Round = 31 logC = 14.6657 logSigma = -7.9164 Value = -8.8410
elapsed = 32.60 Round = 32 logC = 18.3793 logSigma = -8.1083 Value = -8.7139
elapsed = 1.86 Round = 33 logC = 20.0000 logSigma = -5.6297 Value = -9.0580
elapsed = 0.97 Round = 34 logC = 20.0000 logSigma = -1.5768 Value = -19.2183
elapsed = 5.92 Round = 35 logC = 17.3827 logSigma = -6.6880 Value = -9.0224
elapsed = 18.01 Round = 36 logC = 20.0000 logSigma = -7.6071 Value = -8.5728
elapsed = 114.49 Round = 37 logC = 16.0079 logSigma = -9.0000 Value = -8.7058
elapsed = 89.31 Round = 38 logC = 12.8319 logSigma = -9.0000 Value = -8.6799
elapsed = 99.29 Round = 39 logC = 20.0000 logSigma = -9.0000 Value = -8.5596
elapsed = 106.88 Round = 40 logC = 14.1190 logSigma = -9.0000 Value = -8.5150
elapsed = 4.84 Round = 41 logC = 13.4694 logSigma = -6.5271 Value = -8.9728
elapsed = 108.37 Round = 42 logC = 19.0216 logSigma = -9.0000 Value = -8.7461
elapsed = 52.43 Round = 43 logC = 13.5273 logSigma = -8.5130 Value = -8.8728
elapsed = 39.69 Round = 44 logC = 20.0000 logSigma = -8.3288 Value = -8.4956
elapsed = 5.99 Round = 45 logC = 20.0000 logSigma = -6.7208 Value = -8.9455
elapsed = 113.01 Round = 46 logC = 14.9611 logSigma = -9.0000 Value = -8.7576
elapsed = 27.45 Round = 47 logC = 19.6181 logSigma = -7.9872 Value = -8.6186
elapsed = 116.00 Round = 48 logC = 17.3060 logSigma = -9.0000 Value = -8.6820
elapsed = 2.26 Round = 49 logC = 14.2698 logSigma = -5.8297 Value = -9.1837
elapsed = 64.50 Round = 50 logC = 20.0000 logSigma = -8.6438 Value = -8.6914
Best Parameters Found:
Round = 44 logC = 20.0000 logSigma = -8.3288 Value = -8.4956
```

Animate the search!

The final settings were found at iteration 44, with a cost setting of 485,165,195 and `sigma` = 0.0002043. I would never have thought to evaluate a cost parameter so large, and the algorithm wants to make it even larger. Does it really work?

We can fit a model based on the new configuration and compare it to random search in terms of the resampled RMSE and the RMSE on the test set:

```
> set.seed(308)
> final_search <- train(y ~ ., data = train_dat,
+ method = "svmRadial",
+ tuneGrid = data.frame(C = exp(ba_search$Best_Par["logC"]),
+ sigma = exp(ba_search$Best_Par["logSigma"])),
+ metric = "RMSE",
+ preProc = c("center", "scale"),
+ trControl = ctrl)
> compare_models(final_search, rand_search)
```

```
One Sample t-test
data: x
t = -9.0833, df = 49, p-value = 4.431e-12
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-2.34640 -1.49626
sample estimates:
mean of x
-1.92133
```

`> postResample(predict(rand_search, large_dat), large_dat$y)`

```
RMSE Rsquared
10.1112280 0.7648765
```

`> postResample(predict(final_search, large_dat), large_dat$y)`

```
RMSE Rsquared
8.2668843 0.8343405
```

Much Better!

Thanks to Yachen Yan for making the `rBayesianOptimization` package.

by Joseph Rickert

The model table on the caret package website lists more than 200 variations of predictive analytics models that are available within the caret framework. All of these models may be prepared, tuned, fit and evaluated with a common set of caret functions. All on its own, the table is an impressive testament to the utility and scope of the R language as a data science tool.

For the past year or so xgboost, the extreme gradient boosting algorithm, has been getting a lot of attention. The code below compares gbm with xgboost using the segmentationData set that comes with caret. The analysis presented here is far from the last word on comparing these models, but it does show how one might go about setting up a serious comparison: using caret's functions to sweep through parameter space with parallel processing, and then using synchronized bootstrap samples to make a detailed comparison.

After reading in the data and dividing it into training and test data sets, caret's trainControl() and expand.grid() functions are used to set up the training of the gbm model on all of the combinations of tuning parameters represented in the data frame built by expand.grid(). The train() function then does the actual training and fitting of the model. Notice that all of this happens in parallel: the Task Manager on my Windows 10 laptop shows all four cores maxed out at 100%.

After model fitting, predictions on the test data are computed and an ROC curve is drawn in the usual way. The AUC for gbm was computed to be 0.8731. Here is the ROC curve.

Next, a similar process for xgboost computes the AUC to be 0.8857, a fair improvement. The following plot shows how the ROC measure behaves with increasing tree depth for the two different values of the shrinkage parameter.

The final section of code shows how caret can be used to compare the two models using the bootstrap samples that were created in the process of constructing them. The boxplots show that xgboost has the edge, although gbm has a tighter distribution.

The next step, which I hope to take soon, is to rerun the analysis with more complete grids of tuning parameters. For a very accessible introduction to caret have a look at Max Kuhn's 2013 useR! tutorial.

```
# COMPARE XGBOOST with GBM

### Packages Required
library(caret)
library(corrplot)    # plot correlations
library(doParallel)  # parallel processing
library(dplyr)       # Used by caret
library(gbm)         # GBM Models
library(pROC)        # plot the ROC curve
library(xgboost)     # Extreme Gradient Boosting

### Get the Data
# Load the data and construct indices to divide it into training and test data sets.
data(segmentationData)  # Load the segmentation data set
dim(segmentationData)
head(segmentationData, 2)
#
trainIndex <- createDataPartition(segmentationData$Case, p = .5, list = FALSE)
trainData <- segmentationData[trainIndex, -c(1, 2)]
testData  <- segmentationData[-trainIndex, -c(1, 2)]
#
trainX <- trainData[, -1]  # Remove the dependent variable
testX  <- testData[, -1]
sapply(trainX, summary)    # Look at a summary of the training data

## GENERALIZED BOOSTED REGRESSION MODEL (GBM)
# Set up training control
ctrl <- trainControl(method = "repeatedcv",  # repeated cross-validation
                     number = 5,             # do 5 repetitions of cv
                     summaryFunction = twoClassSummary,  # Use AUC to pick the best model
                     classProbs = TRUE,
                     allowParallel = TRUE)

# Use expand.grid to specify the search space
# Note that the default search grid selects multiple values of each tuning parameter
grid <- expand.grid(interaction.depth = c(1, 2),  # Depth of variable interactions
                    n.trees = c(10, 20),          # Num trees to fit
                    shrinkage = c(0.01, 0.1),     # Try 2 values for learning rate
                    n.minobsinnode = 20)
#
set.seed(1951)  # set the seed

# Set up to do parallel processing
registerDoParallel(4)  # Register a parallel backend for train
getDoParWorkers()

gbm.tune <- train(x = trainX, y = trainData$Class,
                  method = "gbm",
                  metric = "ROC",
                  trControl = ctrl,
                  tuneGrid = grid,
                  verbose = FALSE)

# Look at the tuning results
# Note that ROC was the performance criterion used to select the optimal model.
gbm.tune$bestTune
plot(gbm.tune)  # Plot the performance of the training models
res <- gbm.tune$results
res

### GBM Model Predictions and Performance
# Make predictions using the test data set
gbm.pred <- predict(gbm.tune, testX)

# Look at the confusion matrix
confusionMatrix(gbm.pred, testData$Class)

# Draw the ROC curve
gbm.probs <- predict(gbm.tune, testX, type = "prob")
head(gbm.probs)

gbm.ROC <- roc(predictor = gbm.probs$PS,
               response = testData$Class,
               levels = rev(levels(testData$Class)))
gbm.ROC$auc  # Area under the curve: 0.8731
plot(gbm.ROC, main = "GBM ROC")

# Plot the probability of poor segmentation
histogram(~gbm.probs$PS | testData$Class, xlab = "Probability of Poor Segmentation")

##----------------------------------------------
## XGBOOST
# Some stackexchange guidance for xgboost:
# http://stats.stackexchange.com/questions/171043/how-to-tune-hyperparameters-of-xgboost-trees

# Set up for parallel processing
set.seed(1951)
registerDoParallel(4, cores = 4)
getDoParWorkers()

# Train xgboost
xgb.grid <- expand.grid(nrounds = 500,       # the maximum number of iterations
                        eta = c(0.01, 0.1),  # shrinkage
                        max_depth = c(2, 6, 10))

xgb.tune <- train(x = trainX, y = trainData$Class,
                  method = "xgbTree",
                  metric = "ROC",
                  trControl = ctrl,
                  tuneGrid = xgb.grid)

xgb.tune$bestTune
plot(xgb.tune)  # Plot the performance of the training models
res <- xgb.tune$results
res

### xgboost Model Predictions and Performance
# Make predictions using the test data set
xgb.pred <- predict(xgb.tune, testX)

# Look at the confusion matrix
confusionMatrix(xgb.pred, testData$Class)

# Draw the ROC curve
xgb.probs <- predict(xgb.tune, testX, type = "prob")
# head(xgb.probs)

xgb.ROC <- roc(predictor = xgb.probs$PS,
               response = testData$Class,
               levels = rev(levels(testData$Class)))
xgb.ROC$auc  # Area under the curve: 0.8857
plot(xgb.ROC, main = "xgboost ROC")

# Plot the probability of poor segmentation
histogram(~xgb.probs$PS | testData$Class, xlab = "Probability of Poor Segmentation")

# Comparing Multiple Models
# Having set the same seed before running gbm.tune and xgb.tune,
# we have generated paired samples and are in a position to compare models
# using a resampling technique.
# (See Hothorn et al., "The design and analysis of benchmark experiments",
#  Journal of Computational and Graphical Statistics (2005) vol. 14 (3), pp. 675-699)
rValues <- resamples(list(xgb = xgb.tune, gbm = gbm.tune))
rValues$values
summary(rValues)

bwplot(rValues, metric = "ROC", main = "GBM vs xgboost")   # boxplot
dotplot(rValues, metric = "ROC", main = "GBM vs xgboost")  # dotplot
# splom(rValues, metric = "ROC")
```

Microsoft Cognitive Services, the machine-intelligence service formerly known as Project Oxford, provides a suite of cloud-based APIs useful for any developer who wants to include vision, face recognition, speech recognition, or any of a raft of "AI"-like capabilities in their software. With just an API key (you can sign up for a free key to try out any of the services in a rate-limited mode), you can call any of the services via a simple HTTP POST call, passing in parameters (and receiving results) as JSON.

While this is easy enough to do from R using the POST function in the httr package, software developer Phil Ferriere has wrapped several of the Cognitive Services text APIs into simple R functions in his mscsweblm4r package. The functions provided include:
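For illustration, a raw call with httr might look like the sketch below. The endpoint URL and query parameters here are placeholders rather than the exact Web LM API contract (check the official API reference); the `Ocp-Apim-Subscription-Key` header is how Cognitive Services keys are passed, and the key is assumed to live in an environment variable.

```
library(httr)

## Hypothetical endpoint and parameters -- consult the API docs for the real contract
url <- paste0("https://api.projectoxford.ai/text/weblm/v1.0/breakIntoWords",
              "?model=body&text=testforwordbreak")

resp <- POST(url,
             add_headers(`Ocp-Apim-Subscription-Key` = Sys.getenv("MSCS_API_KEY")))

content(resp)  # parsed JSON with the candidate word breaks
```

Packages like mscsweblm4r save you from writing this plumbing yourself.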

- weblmBreakIntoWords: Breaks a string of concatenated words into individual words
- weblmGenerateNextWords: Predict the words most likely to follow a sequence of words
- weblmCalculateJointProbability: Estimate the probability a sequence of words will appear together
- weblmCalculateConditionalProbability: Estimate the probability that a sequence of words appears together, following a specified sequence of words

Each function can generate words or probabilities based on a "model" of the type of text you are working with: web page titles, webpage text, search query text, or webpage anchor text (all based on data analyzed from the Bing services).
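As a rough sketch of how these functions might be called (the argument names below are my guesses, so check the package vignette for the exact signatures, and I am assuming an init function along the lines of `weblmInit()` reads your API key from the package configuration):

```
library(mscsweblm4r)
weblmInit()  # assumed to load the API key from the package config

## Split a run-together string into words (argument names are assumptions)
weblmBreakIntoWords(
  textToBreak = "testforwordbreak",
  modelToUse  = "body",
  maxNumOfCandidatesReturned = 5
)

## Predict likely continuations of a phrase
weblmGenerateNextWords(
  precedingWords = "how are you",
  modelToUse     = "title",
  maxNumOfCandidatesReturned = 5
)
```

The `modelToUse` value selects which of the Bing-derived models (title, body, query, or anchor text) scores the input.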

You can try out these services without even needing to call the functions yourself, thanks to a Shiny app created and published by Phil. His MS Cognitive Services Shiny Test App will let you pass words into any of the functions and see the results. Click on "Web Language Model API" in the left column to see the functions, and try them out:

If you find the app doesn't give you any results, it's because too many people are using it at once and hitting the rate limits of the free service. You can always run Phil's source code for the app with your own key (or just wait a little while).

For more on the mscsweblm4r package, check out Phil's blog post linked below, or check out the package vignette.

Phil Ferriere's OSS Work: Using the Microsoft Cognitive Services Web Language Model REST API from R

by Joseph Rickert

R / Finance 2016 lived up to expectations and provided the quality networking and learning experience that longtime participants have come to value. Eight years is a long time for a conference to keep its sparkle and pizzazz. But, the conference organizers and the UIC have managed to create a vibe that keeps people coming back. The fact that invited keynote speakers (e.g. Bernhard Pfaff 2012, Sanjiv Das 2013, and Robert McDonald 2014) regularly submit papers in subsequent years is a testimony to the quality and networking importance of the event. My guess is that the single track format, quality presentations, intense compact schedule and pleasant venues comprise a winning formula.

Since I have recently written about the content of this year's conference in preparation for the event, and since most of the presentations are already online for you to examine directly I'll just present a few personal highlights here.

My favorite single visual from the conference is Bryan Lewis' depiction of corporate "Big Data" architectures as a manifestation of the impulse for completeness, control and dominance that once drove Soviet-style central planning. (If you don't read Russian, run Google Translate on the text in the first panel.)

In his presentation, R in practice, at scale, Bryan presents a lightweight, R-centric architecture built around Redis that is adequate for many "big data" tasks.

Matt Dziubinski's talk on Getting the Most out of Rcpp, High-Performance C++ in Practice, is probably not a talk I would have elected to attend at a multi-track conference, and I would have missed seeing a virtuoso performance. Matt got through over 120 of his 153 prepared slides in a single, lucid stream of clear (but loud) explanations in only 20 minutes. Never stopping to pause, he gave a mini-course in computer science performance evaluation (both hardware and software aspects) that addressed the Why, What and How of it all.

Ryan Hafen's presentation, Interactively Exploring Financial Trades in R, showed how to use a tool chain built around Tessera and the NxCore R package to perform exploratory data analysis on a large NxCore data set containing approximately 1.25 billion records of 47 variables without leaving the R environment. The following slide provides an example of the kinds of insights that are possible.

In his presentation, Quantitative Analysis of Dual Moving Average Indicators in Automated Trading, Douglas Service showed how to use stochastic differential equations and the Itô calculus to derive a closed form solution for expected Log returns under the Luxor trading strategy and a baseline set of simplifying assumptions. If you like seeing the Math you will be pleased to see that Doug provides all of the details.

Michael Kane (glmnetlib: A Low-Level Library for Regularized Regression) discussed the motivations for continuing to improve linear models and showed the progress he is making on re-implementing glmnet which, although very efficient, does not support arbitrary link-family combinations or out-of-memory calculations, and is written in the obscure Mortran flavor of Fortran. Kane's goal with his new package (renamed pirls: Penalized, Iteratively Reweighted Least Squares Regression) is to rectify these deficiencies while producing something fast enough to use.

In his presentation Community Finance Teaching Resources with R/Shiny, Matt Brigida showed off some open-source resources for teaching quantitative finance that are based on the new paradigm of GitHub as the place for tech-savvy people to hang out and Shiny as the teaching/presentation tool for making calculations come alive. Check out some of Matt's 5-minute mini lessons. Here is an example from his What is Risk module:

There is much more than I have presented here on the R / Finance conference site. If you are interested in deep Finance and not just the tools I have highlighted, be sure to check out the presentations by Sanjiv Das, Bernhard Pfaff, Matthew Dixon and others. There is plenty of useful R code to be mined in these presentations too.

I would be remiss without mentioning Patrick Burns' keynote presentation, which was highly entertaining, novel and thought provoking on many levels: everything a keynote should be. Pat launched his talk by referring to the Sapir-Whorf hypothesis, which posits that language controls how we think, and assigned a similar role to model building. He went on to describe his agent-inspired R simulation model and showed how he calibrated this model to provide a useful tool for investigating ideas such as risk parity, variance targeting and strategies for taxing market pollution. The code for Pat's model is available here, but since his slides are not up on the conference site, and I was apparently too mesmerized to take useful notes, we will have to wait for Pat to post more on his website. (Pat's slides should be available soon.)

Finally, I would like to note that Doug Service and Sanjiv Das won the best paper prizes. This is the second year in a row for Sanjiv to win an R / Finance award. Congratulations to both Doug and Sanjiv!