by Ramkumar Chandrasekeran, Software Engineer - Microsoft R Tiger Team

by Ramkumar Chandrasekeran, Software Engineer - Microsoft R Tiger Team

DeployR Enterprise is designed to deliver analytics solutions at scale to whomever needs it: inside or outside the enterprise. It also guarantees secure delivery of your analytics via DeployR web services. These secure web services integrate seamlessly with existing enterprise security solutions: Single Sign-On, LDAP, Active Directory, PAM, and Basic Authentication, can enforce access privileges already defined by your IT department for existing enterprise users and also have the capability to safely support anonymous users when needed.

**SSL/TLS**

In all production environments, it is strongly recommended that SSL/HTTPS be enabled so that client applications can make API calls that connect over HTTPS. Use the following guide to Enable TLS/SSL protocols in DeployR

**LDAP and LDAPS**

The standard protocol for reading data from and writing data to Active Directory (AD) domain controllers (DCs) is LDAP. AD LDAP traffic is unsecured by default, which makes it possible to use network-monitoring software to view the LDAP traffic between clients and DCs. You can make LDAP traffic confidential and secure by using Secure Sockets Layer (SSL) / Transport Layer Security (TLS) technology. This combination is referred to as LDAP over SSL — or LDAPS. To ensure that no one else can read the traffic, SSL/TLS establishes an encrypted tunnel between an LDAP client and a DC.

**Reasons for enabling LDAPS include:**

- Organizational security policies typically require that all client/server communication is encrypted.
- Applications use simple BIND to transport credentials and authenticate against a DC. As simple BIND exposes the users’ credentials in clear text, using SSL/TLS to encrypt the authentication session is strongly recommended.
- Use of proxy binding or password change over LDAP, which requires LDAPS. (Bind to an AD LDS Instance Through a Proxy Object)
- Applications that integrate with LDAP servers (such as Active Directory or Active Directory Domain Controllers) might require encrypted LDAP communications.

Posted by Joseph Rickert at 08:30 in advanced tips, Microsoft, R | Permalink | Comments (0)

Here's a little puzzle that might shed some light on some apparently confusing behaviour by missing values (NAs) in R:

What is NA^0 in R?

You can get the answer easily by typing at the R command line:

> NA^0

[1] 1

But the interesting question that arises is: why is it 1? Most people might expect that the answer would be NA, like most expressions that include NA. But here's the trick to understanding this outcome: think of NA not as a number, but as a **placeholder** for a number that exists, but whose value we don't know.

Now think of all of the numbers that could replace NA in the expression NA^0. Any positive number to the power zero is 1. Same goes for any negative number. Even zero to the power zero is defined by mathematicians to be 1 (for reasons I'm not going to go into here). So that means **whatever** number you substitute for NA in the expression NA^0, the answer will be 1. And so that's the answer R gives.

There are a few other instances where using the indeterminate NA in an expression can lead to a specific non-NA result. Consider this example:

> NA || TRUE

[1] TRUE

Here. the NA is holding the place of a logical value^{1}, so it could be representing only TRUE or FALSE. But whatever it represents, the answer will be the same:

> TRUE || TRUE

[1] TRUE

> FALSE || TRUE

[1] TRUE

By the same token, any(x) can return TRUE even if the logical vector includes NAs, as long as x includes at least one TRUE value. Similarly, NA && FALSE is always FALSE.

There are a few other examples as well (if you know some, share them in the comments). But always remember: if you're ever confused by the behaviour of NA in R, think about what values it might contain, and if changing them changes the outcome. That might explain what's going on. For more on how R handles NAs, see the R Language Definition.

^{1}**Footnote**: I'm deliberately ignoring the storage mode of NA, which can come in logical, integer, double and character flavours. In all the examples above, it gets coerced to the type of the other elements in the expression.

Posted by David Smith at 09:43 in advanced tips, R | Permalink | Comments (16)

by Joseph Rickert

Over the years I have seen several excellent tutorials at useR!conferences that were not only very satisfying "you had to be there" experiences but were also backed up with meticulously prepared materials of lasting value. This year, quite a few useR!20i6 tutorials measure up to this level of quality. My take on why things turned out this way is that GitHub, Markdown, and Jupyter notebooks have been universally adopted as workshop / tutorial creation tools, and that having the right tools encourages creativity and draws out one's best efforts.

Jenny Bryan's tutorial Using Git and GitHub with R, Rstudio, and R Markdown and the tutorial by Andrie de Vries and Micheleen Harris: Using R with Jupyter notebooks for reproducible research are two superb, Escheresque self-referencing examples of what I am talking about. Bryan's tutorial which uses GitHub and R Markdown to teach GitHub and R Markdown is an impressive introduction to these two essential resources. And, the tutorial by de Vries and Harris makes very effective use of GitHub and Jupyter Notebooks. Moreover, this tutorial sets the gold standard for how to set up a system for interactive user participation. Harris and de Vries staged their tutorial on Microsoft's Azure Data Science VM. The Linux version of this VM comes provisioned with JupyterHub, a set of processes that enables a multi-user Jupyter Notebook server. Once the VM is loaded with the training materials, its only a matter of giving students a username and password to grant them immediate access to the interactive workshop materials. Have a look at notebook 06 to see how to set all of this up.

After seeing this, and comparing it to other tutorials where instructors wasted the better part of an hour trying to get students up and running with local copies of their course materials I can't see why everyone wouldn't opt for a cloud solution to this problem. When word gets out, the Data Science VM is going to be the standard for delivering technical workshops.

Posted by Joseph Rickert at 08:31 in advanced tips, data science, events, Microsoft, R | Permalink | Comments (5)

Many newcomers to R got their start learning the language with Computerworld's Beginner's Guide to R, a 6-part introduction to the basics of the language. Now, budding R users who want to take their skills to the next level have a new guide to help them: Computerword's Advanced Beginner's Guide to R. Written by Sharon Machlis, author of the prior Beginner's guide and regular reporter of R news at Computerworld, this new 72-page guide dives into some trickier topics related to R: extracting data via API, data wrangling, and data visualization.

On the data wrangling front, the guide provides some recipes for handling messy data. You'll learn how to transform data and add the resulting data as a new column to a data frame. There's also an extended look at restructuring data: transforming "wide" data to "long" data, and vice versa.

For visualizing data, there's a basic intro to the ggplot2 package and its grammar of graphics. There's also an in-depth tutorial on creating choropleths: geographics maps with regions shaded by data values.

And as an in-dept example of importing data, you'll learn how to use the Google Analytics API to download and prepare data on traffic to your website.

Finally, don't miss the comprehensive index of R packages for data import, data wrangling, and visualization.

To download the 72-page PDF (free by providing your email address) visit Computerworld at the link below.

Computerworld Crash Course: Advanced Beginner's Guide to R

Posted by David Smith at 07:49 in advanced tips, R | Permalink | Comments (0)

by Joseph Rickert

Just about two and a half years ago I wrote about some resources for doing Bayesian statistics in R. Motivated by the tutorial Modern Bayesian Tools for Time Series Analysis by Harte and Weylandt that I attended at R/Finance last month, and the upcoming tutorial An Introduction to Bayesian Inference using R Interfaces to Stan that Ben Goodrich is going to give at useR! I thought I'd look into what's new. Well, Stan is what's new! Yes, Stan has been under development and available for some time. But somehow, while I wasn't paying close attention, two things happened: (1) the rstan package evolved to make the mechanics of doing Bayesian in R analysis really easy and (2) the Stan team produced and/or organized an amazing amount of documentation.

My impressions of doing Bayesian analysis in R were set in the WinBUGS era. The separate WinBUGs installation was always tricky, and then moving between the BRugs and R2WinBUGS packages presented some additional challenges. My recent Stan experience was nothing like this. I had everything up and running in just a few minutes. The directions for getting started with rstan are clear and explicit about making sure that you have the right tool chain in place for your platform. Since I am running R 3.3.0 on Windows 10 I installed Rtools34. This went quickly and as expected except that C:\Rtools\gcc-4.x-y\bin did not show up in my path variable. Not a big deal: I used the menus in the Windows System Properties box to edit the Path statement by hand. After this, rstan installed like any other R package and I was able to run the 8schools example from the package vignette. The following 10 minute video by Ehsan Karim takes you through the install process and the vignette example.

The Stan documentation includes four major components: (1) The Stan Language Manual, (2) Examples of fully worked out problems, (3) Contributed Case Studies and (4) both slides and video tutorials. This is an incredibly rich cache of resources that makes a very credible case for the ambitious project of teaching people with some R experience both Bayesian Statistics and Stan at the same time. The "trick" here is that the documentation operates at multiple levels of sophistication with entry points for students with different backgrounds. For example, a person with some R and the modest statistics background required for approaching Gelman and Hill's extraordinary text: Data Analysis Using Regression and Multilevel/Hierarchical Models can immediately beginning running rstan code for the book's examples. To run the rstan version of the example in section 5.1, Logistic Regression with One Predictor, with no changes a student only needs only to copy the R scripts and data into her local environment. In this case, she would need the R script: 5._LogisticRegressionWithOnePredictor. R, the data: nes1992_vote.data.R and the Stan code: nes_logit.stan**.** The Stan code for this simple model is about as straightforward as it gets: variable declarations, parameter identification and the model itself.

data { | |

int<lower=0> N; | |

vector[N] income; | |

int<lower=0,upper=1> vote[N]; | |

} | |

parameters { | |

vector[2] beta; | |

} | |

model { | |

vote ~ bernoulli_logit(beta[1] + beta[2] * income); | |

} |

Running the script will produce the iconic logistic regression plot:

I'll wind down by curbing my enthusiasm just a little by pointing out that Stan is not the only game in town. JAGS is a popular alternative, and there is plenty that can be done with unaugmented R code alone as the Bayesian Inference Task View makes abundantly clear.

If you are a book person and new to Bayesian statistics, I highly recommend Bayesian Essentials with R by Jean-Michel Marin and Christian Robert. The authors provide a compact introduction to Bayesian statistics that is backed up with numerous R examples. Also, the new book by Richard McElreath, Statistical Rethinking: A Bayesian Course with Examples in R and Stan looks like it is going to be an outstanding read. The online supplements to the book are certainly worth a look.

Finally, if you are a Bayesian or a thinking about becoming one and you are going to useR!, be sure to catch the following talks:

- Bayesian analysis of generalized linear mixed models with JAGS, by Martyn Plummer
- bamdit: An R Package for Bayesian meta-Analysis of diagnostic test data by Pablo Emilio Verde
- Fitting complex Bayesian models with R-INLA and MCMC by Virgilio Gómez-Rubio
- bayesboot: An R package for easy Bayesian bootstrapping by Rasmus Arnling Bååth
- An Introduction to Bayesian Inference using R Interfaces to Stan by Ben Goodrich
- DiLeMMa - Distributed Learning with Markov Chain Monte Carlo Algorithms Using the ROAR Package by Ali Zaidi

Posted by Joseph Rickert at 08:30 in advanced tips, events, packages, R, statistics | Permalink | Comments (1)

by Lourdes O. Montenegro

*Lourdes O. Montenegro is a PhD candidate at the Lee Kuan Yew School of Public Policy, National University of Singapore. Her research interests cover the intersection of applied data science, technology, economics and public policy.*

Many of us now find it hard to live without a good quality internet connection. As a result, there is growing interest in characterizing and comparing internet performance metrics. For example, when planning to switch internet service providers or considering a move to a new city or country, internet users may want to research in advance what to expect in terms of download speed or latency. Cloud companies may want to provision adequately for different markets with varying levels of internet quality. And governments may want to benchmark their communications infrastructure and invest accordingly. Whatever the purpose, a consortium of research, industry and public interest organizations called Measurement Lab has made available the largest open and verifiable internet performance dataset in the planet. With the help of a combination of packages, R users can easily query, explore and visualize this large dataset at no cost.

In the example that follows, we use the *bigrquery* package to query and download results from the Network Diagnostic Tool (NDT) used by the U.S. FCC. The *bigrquery* package provides an interface to Google BigQuery which hosts NDT results along with several other Measurement Lab (M-Lab) datasets. However, R users will need to first set-up a BigQuery account and join the M-Lab mailing list to authenticate. Detailed instructions are provided on the M-Lab website. Once done, SQL-like queries can be run from within R. The results are saved as a dataframe on which further analysis can be performed. Aside from the convenience of working within the R environment, the *bigrquery* package has another advantage: the only limitation to the size of the query results that can be saved for further exploration is the amount of available RAM. In contrast, the BigQuery web interface only allows export to .csv format of query results which are at or below 16,000 rows.

The following R script gives us the average download speed (in Mbps) per country in 2015.[1] The SQL-like query can be modified to return other internet performance metrics that may be of interest to the R user such as upload speed, round-trip time (latency) and packet re-transmission rates.

# Querying average download speed per country in 2015 require(bigrquery) downquery_template <- "SELECT connection_spec.client_geolocation.country_code AS country, AVG(8 * web100_log_entry.snap.HCThruOctetsAcked/ (web100_log_entry.snap.SndLimTimeRwin + web100_log_entry.snap.SndLimTimeCwnd + web100_log_entry.snap.SndLimTimeSnd)) AS downloadThroughput, COUNT(DISTINCT test_id) AS tests, FROM plx.google:m_lab.ndt.all WHERE IS_EXPLICITLY_DEFINED(web100_log_entry.connection_spec.remote_ip) AND IS_EXPLICITLY_DEFINED(web100_log_entry.connection_spec.local_ip) AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.HCThruOctetsAcked) AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeRwin) AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeCwnd) AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeSnd) AND project = 0 AND IS_EXPLICITLY_DEFINED(connection_spec.data_direction) AND connection_spec.data_direction = 1 AND web100_log_entry.snap.HCThruOctetsAcked >= 8192 AND (web100_log_entry.snap.SndLimTimeRwin + web100_log_entry.snap.SndLimTimeCwnd + web100_log_entry.snap.SndLimTimeSnd) >= 9000000 AND (web100_log_entry.snap.SndLimTimeRwin + web100_log_entry.snap.SndLimTimeCwnd + web100_log_entry.snap.SndLimTimeSnd) < 3600000000 AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.CongSignals) AND web100_log_entry.snap.CongSignals > 0 AND (web100_log_entry.snap.State == 1 OR (web100_log_entry.snap.State >= 5 AND web100_log_entry.snap.State <= 11)) AND web100_log_entry.log_time >= PARSE_UTC_USEC('2015-01-01 00:00:00') / POW(10, 6) AND web100_log_entry.log_time < PARSE_UTC_USEC('2016-01-01 00:00:00') / POW(10, 6) GROUP BY country ORDER BY country ASC;" downresult <- query_exec(downquery_template, project="measurement-lab", max_pages=Inf)

Once we have the query results in a dataframe, we can proceed to visualize and map average download speeds for each country. To do this, we can use the *rworldmap* package which offers a relatively simple way to map country level and gridded user datasets. Mapping is done mainly through two functions: (1) *joinCountryData2Map* joins the query results with shapefiles of country boundaries; (2) *mapCountryData* plots the chloropleth map. Note that the join is best effected using either two or three-letter ISO country codes, although the *rworldmap* package also allows join columns filled with country names.

In order to make the chloropleth map prettier and more comprehensible, we can augment with a combination of the *classInt* package to calculate natural breaks in the range of download speed results and the *RColorBrewer* package for a wider selection of color schemes. In the succeeding R script, we specify the Jenks method to cluster download speed results in such a way that minimizes deviation from the class mean within each class but maximizes deviation across class means. Compared to other methods for clustering download speed results, the Jenks method draws a sharper picture of countries clocking greater than 25 Mbps on average.

require(rworldmap) require(classInt) require(RColorBrewer) downloadmap <- joinCountryData2Map(downresult , joinCode='ISO2' , nameJoinColumn='country' , verbose='TRUE') par(mai=c(0,0,0.2,0),xaxs="i",yaxs="i") #getting class intervals using a 'jenks' classification in classInt package classInt <- classInt::classIntervals( downloadmap[["downloadThroughput"]], n=5, style="jenks") catMethod = classInt[["brks"]] #getting a colour scheme from the RColorBrewer package colourPalette <- RColorBrewer::brewer.pal(5,'RdPu') mapParams <- mapCountryData(downloadmap, nameColumnToPlot="downloadThroughput", mapTitle="Download Speed (mbps)", catMethod=catMethod, colourPalette=colourPalette, addLegend=FALSE) do.call( addMapLegend, c(mapParams, legendWidth=0.5, legendLabels="all", legendIntervals="data", legendMar = 2))

Looking at the map, we see that the UK, Japan, Romania, Sweden, Taiwan, The Netherlands, Denmark and Singapore (if we squint!) are the best places to be for internet speed addicts. Until further investigation, we can safely discount the suspiciously high results for North Korea since the number of observations are too low. In contrast, average download speeds in South Korea might be grossly underestimated when measured from foreign servers, as may be the case with NDT results, since most Koreans access locally hosted content. There are, of course, a number of caveats worth mentioning before drawing any conclusions regarding the causes of varying internet performance between countries. Confounding factors such as distance from the client to the test server, the client's operating system, and the proportion of fixed broadband to wireless connections will need to be controlled for. Despite these caveats, this tentative exploration already reveals interesting patterns in global internet performance that is worth a closer look.

[1] Thanks to Stephen McInerney and Chris Ritzo for code advice.

Posted by Joseph Rickert at 08:30 in advanced tips, big data, data science, packages, R | Permalink | Comments (1)

by Max Kuhn: Director, Nonclinical Statistics, Pfizer

Many predictive and machine learning models have structural or *tuning* parameters that cannot be directly estimated from the data. For example, when using *K*-nearest neighbor model, there is no analytical estimator for *K* (the number of neighbors). Typically, resampling is used to get good performance estimates of the model for a given set of values for *K* and the one associated with the best results is used. This is basically a grid search procedure. However, there are other approaches that can be used. I’ll demonstrate how Bayesian optimization and Gaussian process models can be used as an alternative.

To demonstrate, I’ll use the regression simulation system of Sapp et al. (2014) where the predictors (i.e. `x`

’s) are independent Gaussian random variables with mean zero and a variance of 9. The prediction equation is:

x_1 + sin(x_2) + log(abs(x_3)) + x_4^2 + x_5*x_6 + I(x_7*x_8*x_9 < 0) + I(x_10 > 0) + x_11*I(x_11 > 0) + sqrt(abs(x_12)) + cos(x_13) + 2*x_14 + abs(x_15) + I(x_16 < -1) + x_17*I(x_17 < -1) - 2 * x_18 - x_19*x_20

The random error here is also Gaussian with mean zero and a variance of 9. This simulation is available in the `caret`

package via a function called `SLC14_1`

. First, we’ll simulate a training set of 250 data points and also a larger set that we will use to elucidate the true parameter surface:

```
> library(caret)
> set.seed(7210)
> train_dat <- SLC14_1(250)
> large_dat <- SLC14_1(10000)
```

We will use a radial basis function support vector machine to model these data. For a fixed epsilon, the model will be tuned over the cost value and the radial basis kernel parameter, commonly denotes as `sigma`

. Since we are simulating the data, we can figure out a good approximation to the relationship between these parameters and the root mean squared error (RMSE) or the model. Given our specific training set and the larger simulated sample, here is the RMSE surface for a wide range of values:

There is a wide range of parameter values that are associated with very low RMSE values in the northwest.

A simple way to get an initial assessment is to use random search where a set of random tuning parameter values are generated across a “wide range”. For a RBF SVM, `caret`

’s `train`

function defines wide as cost values between `2^c(-5, 10)`

and `sigma`

values inside the range produced by the `sigest`

function in the `kernlab`

package. This code will do 20 random sub-models in this range:

```
> rand_ctrl <- trainControl(method = "repeatedcv", repeats = 5,
+ search = "random")
>
> set.seed(308)
> rand_search <- train(y ~ ., data = train_dat,
+ method = "svmRadial",
+ ## Create 20 random parameter values
+ tuneLength = 20,
+ metric = "RMSE",
+ preProc = c("center", "scale"),
+ trControl = rand_ctrl)
```

`> rand_search`

```
Support Vector Machines with Radial Basis Function Kernel
250 samples
20 predictor
Pre-processing: centered (20), scaled (20)
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 226, 224, 224, 225, 226, 224, ...
Resampling results across tuning parameters:
sigma C RMSE Rsquared
0.01161955 42.75789360 10.50838 0.7299837
0.01357777 67.97672171 10.71276 0.7212605
0.01392676 828.08072944 10.75235 0.7195869
0.01394119 0.18386619 18.56921 0.2109284
0.01538656 0.05224914 19.33310 0.1890599
0.01711920 228.59215128 11.09522 0.7047713
0.01790202 0.78835920 16.78597 0.3217203
0.01936110 0.91401289 16.45485 0.3492278
0.02023763 0.07658831 19.03987 0.2081059
0.02690269 0.04128731 19.33974 0.2126950
0.02780880 0.64865483 16.52497 0.3545042
0.02920113 974.08943821 12.22906 0.6508754
0.02963586 1.19350198 15.46690 0.4407725
0.03370625 31.45179445 12.60653 0.6314384
0.03561750 0.04970422 19.23564 0.2306298
0.03752561 0.06592800 19.07130 0.2375616
0.03783570 398.44599747 12.92958 0.6143790
0.04534046 3.91017571 13.56612 0.5798001
0.05171719 296.65916049 13.88865 0.5622445
0.06482201 47.31716568 14.66904 0.5192667
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.01161955 and C
= 42.75789.
```

`> ggplot(rand_search) + scale_x_log10() + scale_y_log10()`

> getTrainPerf(rand_search)

```
TrainRMSE TrainRsquared method
1 10.50838 0.7299837 svmRadial
```

There are other approaches that we can take, including a more comprehensive grid search or using a nonlinear optimizer to find better values of cost and `sigma`

. Another approach is to use Bayesian optimization to find good values for these parameters. This is an optimization scheme that uses Bayesian models based on Gaussian processes to predict good tuning parameters.

Gaussian Process (GP) regression is used to facilitate the Bayesian analysis. If creates a regression model to formalize the relationship between the outcome (RMSE, in this application) and the SVM tuning parameters. The standard assumption regarding normality of the residuals is used and, being a Bayesian model, the regression parameters also gain a prior distribution that is multivariate normal. The GP regression model uses a kernel basis expansion (much like the SVM model does) in order to allow the model to be nonlinear in the SVM tuning parameters. To do this, a radial basis function kernel is used for the covariance function of the multivariate normal prior and maximum likelihood is used to estimate the kernel parameters of the GP.

In the end, the GP regression model can take the current set of resampled RMSE values and make predictions over the entire space of potential cost and `sigma`

parameters. The Bayesian machinery allows of this prediction to have a *distribution*; for a given set of tuning parameters, we can obtain the estimated mean RMSE values as well as an estimate of the corresponding prediction variance. For example, if we were to use our data from the random search to build a GP model, the predicted mean RMSE would look like:

The darker regions indicate smaller RMSE values given the current resampling results. The predicted standard deviation of the RMSE is:

The prediction noise becomes larger (e.g. darker) as we move away from the current set of observed values.

(The `GPfit`

package was used to create these models.)

To find good parameters to test, there are several approaches. This paper (pdf) outlines several but we will use the *confidence bound* approach. For any combination of cost and `sigma`

, we can compute the lower confidence bound of the predicted RMSE. Since this takes the uncertainty of prediction into account it has the potential to produce better directions to take the optimization. Here is a plot of the confidence bound using a single standard deviation of the predicted mean:

Darker values indicate better conditions to explore. Since we know the true RMSE surface, we can see that the best region (the northwest) is estimated to be an interesting location to take the optimization. The optimizer would pick a good location based on this model and evaluate this as the next parameter value. This most recent configuration is added to the GP’s training set and the process continues for a pre-specified number of iterations.

Yachen Yan created an R package for Bayesian optimization. He also made a modification so that we can use our initial random search as the substrate to the first GP used. To search a much wider parameter space, our code looks like:

```
> ## Define the resampling method
> ctrl <- trainControl(method = "repeatedcv", repeats = 5)
>
> ## Use this function to optimize the model. The two parameters are
> ## evaluated on the log scale given their range and scope.
> svm_fit_bayes <- function(logC, logSigma) {
+ ## Use the same model code but for a single (C, sigma) pair.
+ txt <- capture.output(
+ mod <- train(y ~ ., data = train_dat,
+ method = "svmRadial",
+ preProc = c("center", "scale"),
+ metric = "RMSE",
+ trControl = ctrl,
+ tuneGrid = data.frame(C = exp(logC), sigma = exp(logSigma)))
+ )
+ ## The function wants to _maximize_ the outcome so we return
+ ## the negative of the resampled RMSE value. `Pred` can be used
+ ## to return predicted values but we'll avoid that and use zero
+ list(Score = -getTrainPerf(mod)[, "TrainRMSE"], Pred = 0)
+ }
>
> ## Define the bounds of the search.
> lower_bounds <- c(logC = -5, logSigma = -9)
> upper_bounds <- c(logC = 20, logSigma = -0.75)
> bounds <- list(logC = c(lower_bounds[1], upper_bounds[1]),
+ logSigma = c(lower_bounds[2], upper_bounds[2]))
>
> ## Create a grid of values as the input into the BO code
> initial_grid <- rand_search$results[, c("C", "sigma", "RMSE")]
> initial_grid$C <- log(initial_grid$C)
> initial_grid$sigma <- log(initial_grid$sigma)
> initial_grid$RMSE <- -initial_grid$RMSE
> names(initial_grid) <- c("logC", "logSigma", "Value")
>
> ## Run the optimization with the initial grid and do
> ## 30 iterations. We will choose new parameter values
> ## using the upper confidence bound using 1 std. dev.
>
> library(rBayesianOptimization)
>
> set.seed(8606)
> ba_search <- BayesianOptimization(svm_fit_bayes,
+ bounds = bounds,
+ init_grid_dt = initial_grid,
+ init_points = 0,
+ n_iter = 30,
+ acq = "ucb",
+ kappa = 1,
+ eps = 0.0,
+ verbose = TRUE)
```

```
20 points in hyperparameter space were pre-sampled
elapsed = 1.53 Round = 21 logC = 5.4014 logSigma = -5.8974 Value = -10.8148
elapsed = 1.54 Round = 22 logC = 4.9757 logSigma = -5.0449 Value = -9.7936
elapsed = 1.42 Round = 23 logC = 5.7551 logSigma = -5.0244 Value = -9.8128
elapsed = 1.30 Round = 24 logC = 5.2754 logSigma = -4.9678 Value = -9.7530
elapsed = 1.39 Round = 25 logC = 5.3009 logSigma = -5.0921 Value = -9.5516
elapsed = 1.48 Round = 26 logC = 5.3240 logSigma = -5.2313 Value = -9.6571
elapsed = 1.39 Round = 27 logC = 5.3750 logSigma = -5.1152 Value = -9.6619
elapsed = 1.44 Round = 28 logC = 5.2356 logSigma = -5.0969 Value = -9.4167
elapsed = 1.38 Round = 29 logC = 11.8347 logSigma = -5.1074 Value = -9.6351
elapsed = 1.42 Round = 30 logC = 15.7494 logSigma = -5.1232 Value = -9.4243
elapsed = 25.24 Round = 31 logC = 14.6657 logSigma = -7.9164 Value = -8.8410
elapsed = 32.60 Round = 32 logC = 18.3793 logSigma = -8.1083 Value = -8.7139
elapsed = 1.86 Round = 33 logC = 20.0000 logSigma = -5.6297 Value = -9.0580
elapsed = 0.97 Round = 34 logC = 20.0000 logSigma = -1.5768 Value = -19.2183
elapsed = 5.92 Round = 35 logC = 17.3827 logSigma = -6.6880 Value = -9.0224
elapsed = 18.01 Round = 36 logC = 20.0000 logSigma = -7.6071 Value = -8.5728
elapsed = 114.49 Round = 37 logC = 16.0079 logSigma = -9.0000 Value = -8.7058
elapsed = 89.31 Round = 38 logC = 12.8319 logSigma = -9.0000 Value = -8.6799
elapsed = 99.29 Round = 39 logC = 20.0000 logSigma = -9.0000 Value = -8.5596
elapsed = 106.88 Round = 40 logC = 14.1190 logSigma = -9.0000 Value = -8.5150
elapsed = 4.84 Round = 41 logC = 13.4694 logSigma = -6.5271 Value = -8.9728
elapsed = 108.37 Round = 42 logC = 19.0216 logSigma = -9.0000 Value = -8.7461
elapsed = 52.43 Round = 43 logC = 13.5273 logSigma = -8.5130 Value = -8.8728
elapsed = 39.69 Round = 44 logC = 20.0000 logSigma = -8.3288 Value = -8.4956
elapsed = 5.99 Round = 45 logC = 20.0000 logSigma = -6.7208 Value = -8.9455
elapsed = 113.01 Round = 46 logC = 14.9611 logSigma = -9.0000 Value = -8.7576
elapsed = 27.45 Round = 47 logC = 19.6181 logSigma = -7.9872 Value = -8.6186
elapsed = 116.00 Round = 48 logC = 17.3060 logSigma = -9.0000 Value = -8.6820
elapsed = 2.26 Round = 49 logC = 14.2698 logSigma = -5.8297 Value = -9.1837
elapsed = 64.50 Round = 50 logC = 20.0000 logSigma = -8.6438 Value = -8.6914
Best Parameters Found:
Round = 44 logC = 20.0000 logSigma = -8.3288 Value = -8.4956
```

Animate the search!

The final settings were found at iteration 44 with a cost setting of 485,165,195 and `sigma`

=0.0002043. I would have never thought to evaluate a cost parameter so large and the algorithm wants to make it even larger. Does it really work?

We can fit a model based on the new configuration and compare it to random search in terms of the resampled RMSE and the RMSE on the test set:

```
> set.seed(308)
> final_search <- train(y ~ ., data = train_dat,
+ method = "svmRadial",
+ tuneGrid = data.frame(C = exp(ba_search$Best_Par["logC"]),
+ sigma = exp(ba_search$Best_Par["logSigma"])),
+ metric = "RMSE",
+ preProc = c("center", "scale"),
+ trControl = ctrl)
> compare_models(final_search, rand_search)
```

```
One Sample t-test
data: x
t = -9.0833, df = 49, p-value = 4.431e-12
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-2.34640 -1.49626
sample estimates:
mean of x
-1.92133
```

`> postResample(predict(rand_search, large_dat), large_dat$y)`

```
RMSE Rsquared
10.1112280 0.7648765
```

`> postResample(predict(final_search, large_dat), large_dat$y)`

```
RMSE Rsquared
8.2668843 0.8343405
```

Much Better!

Thanks to Yachen Yan for making the `rBayesianOptimization`

package.

Posted by Joseph Rickert at 08:30 in advanced tips, data science, packages, R, statistics | Permalink | Comments (3)

by Joseph Rickert

The model table on the caret package website lists more that 200 variations of predictive analytics models that are available withing the caret framework. All of these models may be prepared, tuned, fit and evaluated with a common set of caret functions. All on its own, the table is an impressive testament to the utility and scope of the R language as data science tool.

For the past year or so xgboost, the extreme gradient boosting algorithm, has been getting a lot of attention. The code below compares gbm with xgboost using the segmentationData set that comes with caret. The analysis presented here is far from the last word on comparing these models, but it does show how one might go about setting up a serious comparison using caret's functions to sweep through parameter space using parallel programming, and then used synchronized bootstrap samples to make a detailed comparison.

After reading in the data and dividing it into training and test data sets, caret's trainControl() and expand.grid() functions are used to set up to train the gbm model on all of the combinations of represented in the data frame built by expand.grid(). Then train() function does the actual training and fitting of the model. Notice that all of this happens in parallel. The Task Manager on my Windows 10 laptop shows all four cores maxed out at 100%.

After model fitting, predictions on the test data are computed and an ROC curve is drawn in the usual way. The AUC for gbm was computed to be 0.8731. Here is the ROC curve.

Next, a similar process for xgboost computes the AUC to be 0.8857, a fair improvement. The following plot shows how the ROC measure behaves with increasing tree depth for the two different values of the shrinkage parameter.

The final section of code shows how to caret can be used to compare the two models using the bootstrap samples that were created in the process of constructing the two models. The boxplots show xgboost has the edge although the gbm has a tighter distribution.

The next step, which I hope to take soon, is to rerun the analysis with more complete grids of tuning parameters. For a very accessible introduction to caret have a look at Max Kuhn's 2013 useR! tutorial.

#COMPARE XGBOOST with GBM ### Packages Required library(caret) library(corrplot) # plot correlations library(doParallel) # parallel processing library(dplyr) # Used by caret library(gbm) # GBM Models library(pROC) # plot the ROC curve library(xgboost) # Extreme Gradient Boosting ### Get the Data # Load the data and construct indices to divied it into training and test data sets. data(segmentationData) # Load the segmentation data set dim(segmentationData) head(segmentationData,2) # trainIndex <- createDataPartition(segmentationData$Case,p=.5,list=FALSE) trainData <- segmentationData[trainIndex,-c(1,2)] testData <- segmentationData[-trainIndex,-c(1,2)] # trainX <-trainData[,-1] # Pull out the dependent variable testX <- testData[,-1] sapply(trainX,summary) # Look at a summary of the training data ## GENERALIZED BOOSTED RGRESSION MODEL (BGM) # Set up training control ctrl <- trainControl(method = "repeatedcv", # 10fold cross validation number = 5, # do 5 repititions of cv summaryFunction=twoClassSummary, # Use AUC to pick the best model classProbs=TRUE, allowParallel = TRUE) # Use the expand.grid to specify the search space # Note that the default search grid selects multiple values of each tuning parameter grid <- expand.grid(interaction.depth=c(1,2), # Depth of variable interactions n.trees=c(10,20), # Num trees to fit shrinkage=c(0.01,0.1), # Try 2 values for learning rate n.minobsinnode = 20) # set.seed(1951) # set the seed # Set up to do parallel processing registerDoParallel(4) # Registrer a parallel backend for train getDoParWorkers() gbm.tune <- train(x=trainX,y=trainData$Class, method = "gbm", metric = "ROC", trControl = ctrl, tuneGrid=grid, verbose=FALSE) # Look at the tuning results # Note that ROC was the performance criterion used to select the optimal model. gbm.tune$bestTune plot(gbm.tune) # Plot the performance of the training models res <- gbm.tune$results res ### GBM Model Predictions and Performance # Make predictions using the test data set gbm.pred <- predict(gbm.tune,testX) #Look at the confusion matrix confusionMatrix(gbm.pred,testData$Class) #Draw the ROC curve gbm.probs <- predict(gbm.tune,testX,type="prob") head(gbm.probs) gbm.ROC <- roc(predictor=gbm.probs$PS, response=testData$Class, levels=rev(levels(testData$Class))) gbm.ROC$auc #Area under the curve: 0.8731 plot(gbm.ROC,main="GBM ROC") # Plot the propability of poor segmentation histogram(~gbm.probs$PS|testData$Class,xlab="Probability of Poor Segmentation") ##---------------------------------------------- ## XGBOOST # Some stackexchange guidance for xgboost # http://stats.stackexchange.com/questions/171043/how-to-tune-hyperparameters-of-xgboost-trees # Set up for parallel procerssing set.seed(1951) registerDoParallel(4,cores=4) getDoParWorkers() # Train xgboost xgb.grid <- expand.grid(nrounds = 500, #the maximum number of iterations eta = c(0.01,0.1), # shrinkage max_depth = c(2,6,10)) xgb.tune <-train(x=trainX,y=trainData$Class, method="xgbTree", metric="ROC", trControl=ctrl, tuneGrid=xgb.grid) xgb.tune$bestTune plot(xgb.tune) # Plot the performance of the training models res <- xgb.tune$results res ### xgboostModel Predictions and Performance # Make predictions using the test data set xgb.pred <- predict(xgb.tune,testX) #Look at the confusion matrix confusionMatrix(xgb.pred,testData$Class) #Draw the ROC curve xgb.probs <- predict(xgb.tune,testX,type="prob") #head(xgb.probs) xgb.ROC <- roc(predictor=xgb.probs$PS, response=testData$Class, levels=rev(levels(testData$Class))) xgb.ROC$auc # Area under the curve: 0.8857 plot(xgb.ROC,main="xgboost ROC") # Plot the propability of poor segmentation histogram(~xgb.probs$PS|testData$Class,xlab="Probability of Poor Segmentation") # Comparing Multiple Models # Having set the same seed before running gbm.tune and xgb.tune # we have generated paired samples and are in a position to compare models # using a resampling technique. # (See Hothorn at al, "The design and analysis of benchmark experiments # -Journal of Computational and Graphical Statistics (2005) vol 14 (3) # pp 675-699) rValues <- resamples(list(xgb=xgb.tune,gbm=gbm.tune)) rValues$values summary(rValues) bwplot(rValues,metric="ROC",main="GBM vs xgboost") # boxplot dotplot(rValues,metric="ROC",main="GBM vs xgboost") # dotplot #splom(rValues,metric="ROC")

Posted by Joseph Rickert at 08:30 in advanced tips, applications, data science, packages, R | Permalink | Comments (1)

by John Mount Ph. D.

Data Scientist at Win-Vector LLC

In her series on principal components analysis for regression in R, Win-Vector LLC's Dr. Nina Zumel broke the demonstration down into the following pieces:

- Part 1: the proper preparation of data and use of principal components analysis (particularly for supervised learning or regression).
- Part 2: the introduction of
*y*-aware scaling to direct the principal components analysis to preserve variation correlated with the outcome we are trying to predict. - And now Part 3: how to pick the number of components to retain for analysis.

In the earlier parts Dr. Zumel demonstrates common poor practice versus best practice and quantifies the degree of available improvement. In part 3, she moves from the usual "pick the number of components by eyeballing it" non-advice and teaches decisive decision procedures. For picking the number of components to retain for analysis there are a number of standard techniques in the literature including:

- Pick 2, as that is all you can legibly graph.
- Pick enough to cover some fixed fraction of the variation (say 95%).
- (for variance scaled data only) Retain components with singular values at least 1.0.
- Look for a "knee in the curve" (the curve being the plot of the singular value magnitudes).
- Perform a statistical test to see which singular values are larger than we would expect from an appropriate null hypothesis or noise process.

Dr. Zumel shows that the last method (designing a formal statistical test) is particularly easy to encode as a permutation test in the *y*-aware setting (there is also an obvious similarly good bootstrap test). This is well-founded and pretty much state of the art. It is also a great example of why to use a scriptable analysis platform (such as R) as it is easy to wrap arbitrarily complex methods into functions and then directly perform empirical tests on these methods. The following "broken stick" type test yields the following graph which identifies five principal components as being significant:

However, Dr. Zumel goes on to show that in a supervised learning or regression setting we can further exploit the structure of the problem and replace the traditional component magnitude tests with simple model fit significance pruning. The significance method in this case gets the stronger result of finding the two principal components that encode the known even and odd loadings of the example problem:

In fact that is sort of her point: significance pruning either on the original variables or on the derived latent components is enough to give us the right answer. In general, we get much better results when (in a supervised learning or regression situation) we use knowledge of the dependent variable (the "*y*" or outcome) and do *all* of the following:

- Fit model and significance prune incoming variables.
- Convert incoming variables into consistent response units by
*y*-aware scaling. - Fit model and significance prune resulting latent components.

The above will become much clearer and much more specific if you click here to read part 3.

Posted by Joseph Rickert at 08:30 in advanced tips, applications, R, statistics | Permalink | Comments (1)

by Joseph Rickert

R / Finance 2016 lived up to expectations and provided the quality networking and learning experience that longtime participants have come to value. Eight years is a long time for a conference to keep its sparkle and pizzazz. But, the conference organizers and the UIC have managed to create a vibe that keeps people coming back. The fact that invited keynote speakers (e.g. Bernhard Pfaff 2012, Sanjiv Das 2013, and Robert McDonald 2014) regularly submit papers in subsequent years is a testimony to the quality and networking importance of the event. My guess is that the single track format, quality presentations, intense compact schedule and pleasant venues comprise a winning formula.

Since I have recently written about the content of this year's conference in preparation for the event, and since most of the presentations are already online for you to examine directly I'll just present a few personal highlights here.

My favorite single visual from the conference is Bryan Lewis' depiction of corporate "Big Data" architectures as a manifestation of the impulse for completeness, control and dominance that once drove Soviet style central planning. (If you don't read Russian, run google translate on the text in the first panel.)

In his presentation, R in practice, at scale, Bryan presents a lightweight, R-centric architecture built around Redis that is adequate fro many "big data" tasks.

Matt Dziubinski's talk on Getting the Most out of Rcpp, High-Performance C++ in Practice, is probably not a talk I would have elected to attend in a multi-track conference, and I would have missed seeing a virtuoso performance. Matt got through over 120 of his 153 prepared slides in a single, lucid stream of clear (but loud) explanations in only 20 minutes. Never stopping to pause, he gave a mini-course in computer science performance evaluation (both hardware and software aspects) that addressed the Why, What and How of it all.

Ryan Hafen's presentation, Interactively Exploring Financial Trades in R, showed how to use a tool chain built around Tessera and the NxCore R package to perform exploratory data analysis on a large NxCore data set containing approximately 1.25 billion records of 47 variables without leaving the R environment. The following slide provides an example of the kinds of insights that are possible.

In his presentation, Quantitative Analysis of Dual Moving Average Indicators in Automated Trading, Douglas Service showed how to use stochastic differential equations and the Itô calculus to derive a closed form solution for expected Log returns under the Luxor trading strategy and a baseline set of simplifying assumptions. If you like seeing the Math you will be pleased to see that Doug provides all of the details.

Michael Kane (glmnetlib: A Low-Level Library for Regularized Regression) discussed the motivations for continuing to improve linear models and showed the progress he is making on re-implementing glmnet which, although very efficient, does not support arbitrary link family combinations or out of memory calculations and is written in the obscure Mortran flavor of Fortran. Kane's goal with his new package (renamed pirls: Penalized, Iteratively Reweighted Least Squares Regression) is to rectify these deficiencies while producing something fast enough to use.

In his presentation Community Finance Teaching Resources with R/Shiny, Matt Brigida showed off some opensource resources for teaching quantitative finance that are based on the new paradigm of GitHub as the place for tech savvy people to hang out and Shiny as the teaching / presentation tool for making calculations come alive. Check out some of Matt's 5 minute mini lessons. Here is an example from his What is Risk module:

There is much more than I have presented here on the R / Finance conference site. If you are interested in deep Finance and not just the tools I have highlighted, be sure to check out the presentations by Sanjiv Das, Bernhard Pfaff, Matthew Dixon and others. There is plenty of useful R code to be mined in these presentations too.

I would be remiss without mentioning Patrick Burns' keynote presentation which was highly entertaining, novel and thought provoking on many levels: everything a keynote should be. Pat launched his talk by referring to the Sapir-Worf hypothesis which posits that language controls how we think and assigned a similar role to model building. He went on to describe his Agent inspired R simulation model and showed how he calibrated this model to provide a useful tool for investigating ideas such as risk parity, variance targeting and strategies for taxing market pollution. The code for Pat's model is available here, but since his slides are not up on the conference site, and I was apparently too mesmerized to take useful notes, we will have to wait for Pat to post more on his website. (Pat's slides should be available soon.)

Finally, I would like to note that Doug Service and Sanjiv Das won the best paper prizes. This is the second year in a row for Sanjiv to win an R / Finance award. Congratulations to both Doug and Sanjiv!

Posted by Joseph Rickert at 08:30 in advanced tips, events, finance, packages, R | Permalink | Comments (0)

Got comments or suggestions for the blog editor?

Email David Smith.

Email David Smith.

- academia
- advanced tips
- announcements
- applications
- beginner tips
- big data
- courses
- current events
- data science
- developer tips
- events
- finance
- government
- graphics
- high-performance computing
- life sciences
- Microsoft
- open source
- other industry
- packages
- popularity
- predictive analytics
- profiles
- R
- R is Hot
- random
- reviews
- Revolution
- Rmedia
- roundups
- sports
- statistics
- user groups

- Find R packages

CRAN package directory at MRAN - Download Microsoft R Open

Free, high-performance R - R Project site

Information about the R project

- FlowingData

Modern data visualization - One R Tip A Day

Code examples for graphics and analysis - Probability and statistics blog

Monte Carlo simulations in R - R Bloggers

Daily news and tutorials about R, contributed by R bloggers worldwide. - R Project group on analyticbridge.com

Community and discussion forum - Statistical Modeling, Causal Inference, and Social Science

Andrew Gelman's statistics blog