by Joseph Rickert

My guess is that a good many statistics students first encounter the bivariate Normal distribution as one or two hastily covered pages in an introductory textbook, and then don't think much about it again until someone asks them to generate two random variables with a given correlation structure. Fortunately for R users, a little searching on the internet will turn up several nice tutorials with R code explaining various aspects of the bivariate Normal. For this post, I have gathered together a few examples and tweaked the code a little to make comparisons easier.

Here are five different ways to simulate random samples from a bivariate Normal distribution with a given mean and covariance matrix.

To set up for the simulations, this first block of code defines N, the number of random samples to simulate; the means of the random variables; and the covariance matrix. It also provides a small function for drawing confidence ellipses on the simulated data.

library(mixtools) # for the ellipse function

N <- 200 # Number of random samples
set.seed(123)

# Target parameters for univariate normal distributions
rho <- -0.6
mu1 <- 1; s1 <- 2
mu2 <- 1; s2 <- 8

# Parameters for bivariate normal distribution
mu <- c(mu1, mu2) # Mean
sigma <- matrix(c(s1^2, s1*s2*rho, s1*s2*rho, s2^2), 2) # Covariance matrix

# Function to draw ellipse for bivariate normal data
ellipse_bvn <- function(bvn, alpha){
  Xbar <- apply(bvn, 2, mean)
  S <- cov(bvn)
  ellipse(Xbar, S, alpha = alpha, col = "red")
}

The first method, the way to go if you just want to get on with it, is to use the mvrnorm() function from the MASS package.

library(MASS)
bvn1 <- mvrnorm(N, mu = mu, Sigma = sigma) # from MASS package
colnames(bvn1) <- c("bvn1_X1","bvn1_X2")

It takes so little code to do the simulation that you could almost tweet it as a homework answer.

A look at the source code for mvrnorm() shows that it uses an eigendecomposition to generate the random samples. The documentation for the function states that this method was selected because it is more stable than the alternative of using a Cholesky decomposition, which might be faster.
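As a sketch of that idea (my simplified version, not mvrnorm()'s actual code, which includes additional numerical checks), factor sigma through its eigendecomposition and use the factor to transform standard Normal draws:

```r
# Simplified sketch of the eigendecomposition approach used inside mvrnorm():
# sigma = V diag(lambda) V', so A = V diag(sqrt(lambda)) satisfies
# A %*% t(A) == sigma, and A maps standard Normal draws to draws
# with covariance sigma.
set.seed(42)
n_eig <- 10000
mu0 <- c(1, 1)
sigma0 <- matrix(c(4, -9.6, -9.6, 64), 2)  # same target covariance as above

eig <- eigen(sigma0, symmetric = TRUE)
A <- eig$vectors %*% diag(sqrt(pmax(eig$values, 0)))
Z0 <- matrix(rnorm(2 * n_eig), 2, n_eig)
bvn_eig <- t(A %*% Z0) + matrix(rep(mu0, n_eig), byrow = TRUE, ncol = 2)
round(cov(bvn_eig), 1)  # close to sigma0
```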

For the second method, let's go ahead and directly generate bivariate Normal random variates with the Cholesky decomposition. Remember that the Cholesky decomposition of sigma (a positive definite matrix) yields a matrix M such that M times its transpose gives sigma back again. Multiplying M by a matrix of standard Normal random variates and adding the desired mean gives a matrix of the desired random samples. A lecture from Colin Rundel covers some of the theory.

M <- t(chol(sigma))
# M %*% t(M) # recovers sigma
Z <- matrix(rnorm(2*N), 2, N) # 2 rows, N columns
bvn2 <- t(M %*% Z) + matrix(rep(mu, N), byrow = TRUE, ncol = 2)
colnames(bvn2) <- c("bvn2_X1","bvn2_X2")

For the third method we make use of a special property of the bivariate Normal that is discussed in almost all of those elementary textbooks. If X_{1} and X_{2} are jointly bivariate Normal, then the conditional distribution of X_{2} given X_{1} is itself Normal with: mean = m_{2} + r(s_{2}/s_{1})(X_{1} - m_{1}) and variance = (1 - r^{2})s_{2}^{2}.

Hence, a sample from a bivariate Normal distribution can be simulated by first simulating a point from the marginal distribution of one of the random variables and then simulating from the second random variable conditioned on the first. A brief proof of the underlying theorem is available here.

rbvn <- function(n, m1, s1, m2, s2, rho){
  X1 <- rnorm(n, m1, s1)
  X2 <- rnorm(n, m2 + (s2/s1) * rho * (X1 - m1), sqrt((1 - rho^2) * s2^2))
  cbind(X1, X2)
}

bvn3 <- rbvn(N, mu1, s1, mu2, s2, rho)
colnames(bvn3) <- c("bvn3_X1","bvn3_X2")

The fourth method, my favorite, comes from Professor Darren Wilkinson's Gibbs sampler tutorial. This is a very nice idea: using the familiar bivariate Normal distribution to illustrate the basics of the Gibbs sampling algorithm. Note that this looks very much like the previous method, except that now we alternately sample from the two full conditional distributions.

gibbs <- function(n, mu1, s1, mu2, s2, rho){
  mat <- matrix(ncol = 2, nrow = n)
  x <- 0
  y <- 0
  mat[1, ] <- c(x, y)
  for (i in 2:n) {
    x <- rnorm(1, mu1 + (s1/s2) * rho * (y - mu2), sqrt((1 - rho^2) * s1^2))
    y <- rnorm(1, mu2 + (s2/s1) * rho * (x - mu1), sqrt((1 - rho^2) * s2^2))
    mat[i, ] <- c(x, y)
  }
  mat
}

bvn4 <- gibbs(N, mu1, s1, mu2, s2, rho)
colnames(bvn4) <- c("bvn4_X1","bvn4_X2")

The fifth and final way uses the rmvnorm() function from the mvtnorm package with the singular value decomposition method selected. The functions in this package are overkill for what we are doing here, but mvtnorm is probably the package you would want to use if you are calculating probabilities from high-dimensional multivariate distributions. It implements careful numerical methods for the high-dimensional integrals involved, based on papers by Professor Alan Genz dating from the early '90s. These methods are briefly explained in the package vignette.

library(mvtnorm)
bvn5 <- mvtnorm::rmvnorm(N, mu, sigma, method = "svd")
colnames(bvn5) <- c("bvn5_X1","bvn5_X2")

Note that I have used the :: operator here to make sure that R uses the rmvnorm() function from the mvtnorm package. There is also an rmvnorm() function in the mixtools package, which I loaded to get the ellipse function. Loading the packages in the wrong order could lead to the rookie mistake of having the function you want inadvertently masked.

Next, we plot the results of drawing the N = 200 random samples for each method. This allows us to see how the algorithms spread data over the sample space as they are just getting started.

bvn <- list(bvn1, bvn2, bvn3, bvn4, bvn5)

par(mfrow = c(3, 2))
plot(bvn1, xlab = "X1", ylab = "X2", main = "All Samples")
for(i in 2:5){
  points(bvn[[i]], col = i)
}
for(i in 1:5){
  item <- paste("bvn", i, sep = "")
  plot(bvn[[i]], xlab = "X1", ylab = "X2", main = item, col = i)
  ellipse_bvn(bvn[[i]], .5)
  ellipse_bvn(bvn[[i]], .05)
}
par(mfrow = c(1, 1))

The first plot shows all 1,000 random samples, color coded by the method with which they were generated. The remaining plots show the samples generated by each method. In each of these plots the ellipses mark the 0.5 and 0.95 probability regions, i.e. the areas within the ellipses should contain 50% and 95% of the points respectively. Note that bvn4, which uses the Gibbs sampling algorithm, looks like all of the rest. In most use cases it takes the Gibbs sampler some time to converge to the target distribution. In our case, we start out with a pretty good guess.
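The coverage claim can also be checked numerically (a base-R sketch of mine, not part of the original post): a point lies inside the alpha-level ellipse of a fitted bivariate Normal exactly when its squared Mahalanobis distance falls below the corresponding chi-squared quantile with 2 degrees of freedom.

```r
# Check ellipse coverage with Mahalanobis distances on a large sample
set.seed(1)
n_chk <- 10000
mu0 <- c(1, 1)
sigma0 <- matrix(c(4, -9.6, -9.6, 64), 2)
L0 <- t(chol(sigma0))
bvn_chk <- t(L0 %*% matrix(rnorm(2 * n_chk), 2, n_chk)) +
  matrix(rep(mu0, n_chk), byrow = TRUE, ncol = 2)

d2 <- mahalanobis(bvn_chk, colMeans(bvn_chk), cov(bvn_chk))
mean(d2 <= qchisq(0.50, df = 2))  # fraction inside the 50% ellipse, about 0.50
mean(d2 <= qchisq(0.95, df = 2))  # fraction inside the 95% ellipse, about 0.95
```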

Finally, a word about accuracy: nice coverage of the sample space is not sufficient to produce accurate results. A little experimentation will show that, for all of the methods outlined above, regularly achieving a sample covariance matrix that is close to the target, sigma, requires something on the order of 10,000 samples, as is illustrated below.

> sigma
     [,1] [,2]
[1,]  4.0 -9.6
[2,] -9.6 64.0

for(i in 1:5){
  print(round(cov(bvn[[i]]),1))
}

        bvn1_X1 bvn1_X2
bvn1_X1     4.0    -9.5
bvn1_X2    -9.5    63.8

        bvn2_X1 bvn2_X2
bvn2_X1     3.9    -9.5
bvn2_X2    -9.5    64.5

        bvn3_X1 bvn3_X2
bvn3_X1     4.1    -9.8
bvn3_X2    -9.8    63.7

        bvn4_X1 bvn4_X2
bvn4_X1     4.0    -9.7
bvn4_X2    -9.7    64.6

        bvn5_X1 bvn5_X2
bvn5_X1     4.0    -9.6
bvn5_X2    -9.6    65.3

Many people coming to R for the first time find it disconcerting to realize that there are several ways to do some fundamental calculation in R. My take is that, rather than being a point of frustration, having multiple options indicates the richness of the R language. A close look at the package documentation will often show that yet another method to do something is a response to some subtle need that was not previously addressed. Enjoy the diversity!

As anyone who has tried Pokémon Go recently is probably aware, Pokémon come in different types. A Pokémon's type affects where and when it appears, and the types of attacks it is vulnerable to. Some types, like Normal, Water and Grass are common; others, like Fairy and Dragon are rare. Many Pokémon have two or more types.

To get a sense of the distribution of Pokémon types, Joshua Kunst used R to download data from the Pokémon API and created a treemap of all the Pokémon types (and for those with more than 1 type, the secondary type). Joshua's original used the 800+ Pokémon from the modern universe, but I used his R code to recreate the map for the 151 original Pokémon used in Pokémon Go.

Pokémon have many other attributes as well, including: weight, height, attack and defense ratings, hit points, and speed. It's hard to visualize so many variables in a 2-dimensional chart, so Joshua used a technique called t-Distributed Stochastic Neighbor Embedding (implemented in the tsne package for R) to cluster similar Pokémon in a two-dimensional chart, and used R's image-handling capabilities to include avatars for each of the Pokémon.

This chart, which includes modern Pokémon along with the 151 originals in Pokémon Go, is colored according to each Pokémon's primary type. As you can see, the t-SNE algorithm is super effective at clustering Pokémon according to type.

For more details on Joshua's analysis, including interactive versions of these charts and the R code that created them, follow the link below.

Joshua Kunst: Pokémon: Visualize 'em all! (via Matthew Bashton)

by Joseph Rickert

Just about two and a half years ago I wrote about some resources for doing Bayesian statistics in R. Motivated by the tutorial Modern Bayesian Tools for Time Series Analysis by Harte and Weylandt that I attended at R/Finance last month, and the upcoming tutorial An Introduction to Bayesian Inference using R Interfaces to Stan that Ben Goodrich is going to give at useR!, I thought I'd look into what's new. Well, Stan is what's new! Yes, Stan has been under development and available for some time. But somehow, while I wasn't paying close attention, two things happened: (1) the rstan package evolved to make the mechanics of doing Bayesian analysis in R really easy, and (2) the Stan team produced and/or organized an amazing amount of documentation.

My impressions of doing Bayesian analysis in R were set in the WinBUGS era. The separate WinBUGS installation was always tricky, and then moving between the BRugs and R2WinBUGS packages presented some additional challenges. My recent Stan experience was nothing like this. I had everything up and running in just a few minutes. The directions for getting started with rstan are clear and explicit about making sure that you have the right tool chain in place for your platform. Since I am running R 3.3.0 on Windows 10 I installed Rtools34. This went quickly and as expected, except that C:\Rtools\gcc-4.x-y\bin did not show up in my path variable. Not a big deal: I used the menus in the Windows System Properties box to edit the Path statement by hand. After this, rstan installed like any other R package and I was able to run the 8schools example from the package vignette. The following 10 minute video by Ehsan Karim takes you through the install process and the vignette example.

The Stan documentation includes four major components: (1) the Stan Language Manual, (2) examples of fully worked out problems, (3) contributed case studies and (4) both slides and video tutorials. This is an incredibly rich cache of resources that makes a very credible case for the ambitious project of teaching people with some R experience both Bayesian statistics and Stan at the same time. The "trick" here is that the documentation operates at multiple levels of sophistication, with entry points for students with different backgrounds. For example, a person with some R and the modest statistics background required for approaching Gelman and Hill's extraordinary text, Data Analysis Using Regression and Multilevel/Hierarchical Models, can immediately begin running rstan code for the book's examples. To run the rstan version of the example in section 5.1, Logistic Regression with One Predictor, with no changes, a student needs only to copy the R scripts and data into her local environment. In this case, she would need the R script 5._LogisticRegressionWithOnePredictor.R, the data nes1992_vote.data.R and the Stan code nes_logit.stan. The Stan code for this simple model is about as straightforward as it gets: variable declarations, parameter identification and the model itself.

data {
  int<lower=0> N;
  vector[N] income;
  int<lower=0,upper=1> vote[N];
}
parameters {
  vector[2] beta;
}
model {
  vote ~ bernoulli_logit(beta[1] + beta[2] * income);
}

Running the script will produce the iconic logistic regression plot:

I'll wind down by curbing my enthusiasm just a little by pointing out that Stan is not the only game in town. JAGS is a popular alternative, and there is plenty that can be done with unaugmented R code alone as the Bayesian Inference Task View makes abundantly clear.

If you are a book person and new to Bayesian statistics, I highly recommend Bayesian Essentials with R by Jean-Michel Marin and Christian Robert. The authors provide a compact introduction to Bayesian statistics that is backed up with numerous R examples. Also, the new book by Richard McElreath, Statistical Rethinking: A Bayesian Course with Examples in R and Stan looks like it is going to be an outstanding read. The online supplements to the book are certainly worth a look.

Finally, if you are a Bayesian, or thinking about becoming one, and you are going to useR!, be sure to catch the following talks:

- Bayesian analysis of generalized linear mixed models with JAGS, by Martyn Plummer
- bamdit: An R Package for Bayesian meta-Analysis of diagnostic test data by Pablo Emilio Verde
- Fitting complex Bayesian models with R-INLA and MCMC by Virgilio Gómez-Rubio
- bayesboot: An R package for easy Bayesian bootstrapping by Rasmus Arnling Bååth
- An Introduction to Bayesian Inference using R Interfaces to Stan by Ben Goodrich
- DiLeMMa - Distributed Learning with Markov Chain Monte Carlo Algorithms Using the ROAR Package by Ali Zaidi

The forecast package for R, created and maintained by Professor Rob Hyndman of Monash University, is one of the more useful R packages available on CRAN. Statistical forecasting — the process of predicting the future value of a time series — is used in just about every realm of data analysis, whether it's trying to predict a future stock price or trying to anticipate changes in the weather. If you're looking to learn about forecasting, a great place to start is the online book Forecasting: Principles and Practice (by Hyndman and George Athanasopoulos) which walks you through the theory and practice, with many examples in R based on the forecast package. Topics covered include multiple regression, time series decomposition, exponential smoothing, and ARIMA models.

The forecast package itself recently received a major update, to version 7. One major new capability is the ability to easily chart forecasts using the ggplot2 package with the new autoplot function. For example:

fc <- forecast(fdeaths)
autoplot(fc)

You can also add forecasts to any ggplot using the new geom_forecast geom provided by the forecast package:

autoplot(mdeaths) + geom_forecast(h=36, level=c(50,80,95))

There have been several updates to the forecasting functions as well. The function for fitting linear models to time series data, tslm, has been rewritten to be more compatible with the standard lm function. It's now possible to forecast means (as well as medians) when using Box-Cox transformations. And you can now apply neural networks to time series data by building a nonlinear autoregressive model with the new nnetar function.

Those are just some of the highlights of the updates to the forecast package in version 7. For complete details, follow the links to Rob Hyndman's blog, below.

Hyndsight: forecast v7 and ggplot2 graphics; Forecast v7 (part 2) (via traims)

by Max Kuhn: Director, Nonclinical Statistics, Pfizer

Many predictive and machine learning models have structural or *tuning* parameters that cannot be directly estimated from the data. For example, when using a *K*-nearest neighbors model, there is no analytical estimator for *K* (the number of neighbors). Typically, resampling is used to get good performance estimates of the model for a given set of values of *K*, and the one associated with the best results is used. This is basically a grid search procedure. However, there are other approaches that can be used. I'll demonstrate how Bayesian optimization and Gaussian process models can be used as an alternative.

To demonstrate, I'll use the regression simulation system of Sapp et al. (2014) where the predictors (i.e. the `x`'s) are independent Gaussian random variables with mean zero and a variance of 9. The prediction equation is:

x_1 + sin(x_2) + log(abs(x_3)) + x_4^2 + x_5*x_6 + I(x_7*x_8*x_9 < 0) + I(x_10 > 0) + x_11*I(x_11 > 0) + sqrt(abs(x_12)) + cos(x_13) + 2*x_14 + abs(x_15) + I(x_16 < -1) + x_17*I(x_17 < -1) - 2 * x_18 - x_19*x_20

The random error here is also Gaussian with mean zero and a variance of 9. This simulation is available in the `caret` package via a function called `SLC14_1`. First, we'll simulate a training set of 250 data points and also a larger set that we will use to elucidate the true parameter surface:

```
> library(caret)
> set.seed(7210)
> train_dat <- SLC14_1(250)
> large_dat <- SLC14_1(10000)
```
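For readers who want to see the mechanics without caret installed, the prediction equation above can be written directly as a base-R function (my sketch; caret's `SLC14_1` handles the full simulation, including generating the predictors and the error term):

```r
# Sketch of the Sapp et al. (2014) prediction equation for a single
# vector x of 20 predictors (indicator terms written as as.numeric(...))
slc14_pred <- function(x) {
  x[1] + sin(x[2]) + log(abs(x[3])) + x[4]^2 + x[5]*x[6] +
    as.numeric(x[7]*x[8]*x[9] < 0) + as.numeric(x[10] > 0) +
    x[11]*as.numeric(x[11] > 0) + sqrt(abs(x[12])) + cos(x[13]) +
    2*x[14] + abs(x[15]) + as.numeric(x[16] < -1) +
    x[17]*as.numeric(x[17] < -1) - 2*x[18] - x[19]*x[20]
}

set.seed(1)
x_demo <- rnorm(20, mean = 0, sd = 3)          # predictors: mean 0, variance 9
y_demo <- slc14_pred(x_demo) + rnorm(1, sd = 3) # Gaussian error, variance 9
```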

We will use a radial basis function support vector machine to model these data. For a fixed epsilon, the model will be tuned over the cost value and the radial basis kernel parameter, commonly denoted as `sigma`. Since we are simulating the data, we can figure out a good approximation to the relationship between these parameters and the root mean squared error (RMSE) of the model. Given our specific training set and the larger simulated sample, here is the RMSE surface for a wide range of values:

There is a wide range of parameter values that are associated with very low RMSE values in the northwest.

A simple way to get an initial assessment is to use random search, where a set of random tuning parameter values is generated across a "wide range". For an RBF SVM, `caret`'s `train` function defines wide as cost values between `2^c(-5, 10)` and `sigma` values inside the range produced by the `sigest` function in the `kernlab` package. This code will fit 20 random sub-models in this range:

```
> rand_ctrl <- trainControl(method = "repeatedcv", repeats = 5,
+ search = "random")
>
> set.seed(308)
> rand_search <- train(y ~ ., data = train_dat,
+ method = "svmRadial",
+ ## Create 20 random parameter values
+ tuneLength = 20,
+ metric = "RMSE",
+ preProc = c("center", "scale"),
+ trControl = rand_ctrl)
```

`> rand_search`

```
Support Vector Machines with Radial Basis Function Kernel
250 samples
20 predictor
Pre-processing: centered (20), scaled (20)
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 226, 224, 224, 225, 226, 224, ...
Resampling results across tuning parameters:
sigma C RMSE Rsquared
0.01161955 42.75789360 10.50838 0.7299837
0.01357777 67.97672171 10.71276 0.7212605
0.01392676 828.08072944 10.75235 0.7195869
0.01394119 0.18386619 18.56921 0.2109284
0.01538656 0.05224914 19.33310 0.1890599
0.01711920 228.59215128 11.09522 0.7047713
0.01790202 0.78835920 16.78597 0.3217203
0.01936110 0.91401289 16.45485 0.3492278
0.02023763 0.07658831 19.03987 0.2081059
0.02690269 0.04128731 19.33974 0.2126950
0.02780880 0.64865483 16.52497 0.3545042
0.02920113 974.08943821 12.22906 0.6508754
0.02963586 1.19350198 15.46690 0.4407725
0.03370625 31.45179445 12.60653 0.6314384
0.03561750 0.04970422 19.23564 0.2306298
0.03752561 0.06592800 19.07130 0.2375616
0.03783570 398.44599747 12.92958 0.6143790
0.04534046 3.91017571 13.56612 0.5798001
0.05171719 296.65916049 13.88865 0.5622445
0.06482201 47.31716568 14.66904 0.5192667
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.01161955 and C
= 42.75789.
```

`> ggplot(rand_search) + scale_x_log10() + scale_y_log10()`

`> getTrainPerf(rand_search)`

```
TrainRMSE TrainRsquared method
1 10.50838 0.7299837 svmRadial
```

There are other approaches that we can take, including a more comprehensive grid search or using a nonlinear optimizer to find better values of cost and `sigma`. Another approach is to use Bayesian optimization to find good values for these parameters. This is an optimization scheme that uses Bayesian models based on Gaussian processes to predict good tuning parameters.

Gaussian process (GP) regression is used to facilitate the Bayesian analysis. It creates a regression model to formalize the relationship between the outcome (RMSE, in this application) and the SVM tuning parameters. The standard assumption regarding normality of the residuals is used and, being a Bayesian model, the regression parameters also gain a prior distribution that is multivariate normal. The GP regression model uses a kernel basis expansion (much like the SVM model does) in order to allow the model to be nonlinear in the SVM tuning parameters. To do this, a radial basis function kernel is used for the covariance function of the multivariate normal prior, and maximum likelihood is used to estimate the kernel parameters of the GP.

In the end, the GP regression model can take the current set of resampled RMSE values and make predictions over the entire space of potential cost and `sigma` parameters. The Bayesian machinery allows this prediction to have a *distribution*; for a given set of tuning parameters, we can obtain the estimated mean RMSE value as well as an estimate of the corresponding prediction variance. For example, if we were to use our data from the random search to build a GP model, the predicted mean RMSE would look like:

The darker regions indicate smaller RMSE values given the current resampling results. The predicted standard deviation of the RMSE is:

The prediction noise becomes larger (i.e., darker) as we move away from the current set of observed values.

(The `GPfit` package was used to create these models.)

To find good parameters to test, there are several approaches. This paper (pdf) outlines several, but we will use the *confidence bound* approach. For any combination of cost and `sigma`, we can compute the lower confidence bound of the predicted RMSE. Since this takes the uncertainty of prediction into account, it has the potential to produce better directions to take the optimization. Here is a plot of the confidence bound using a single standard deviation of the predicted mean:

Darker values indicate better conditions to explore. Since we know the true RMSE surface, we can see that the best region (the northwest) is estimated to be an interesting location to take the optimization. The optimizer would pick a good location based on this model and evaluate this as the next parameter value. This most recent configuration is added to the GP’s training set and the process continues for a pre-specified number of iterations.
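The acquisition rule itself is simple. Here is a toy sketch with hypothetical GP predictions (`mu_hat` and `sd_hat` stand in for the GP's predicted mean RMSE and its standard deviation at three candidate points; kappa = 1 corresponds to using a single standard deviation):

```r
# Lower-confidence-bound acquisition: pick the candidate with the smallest
# lower bound on predicted RMSE, trading off mean prediction and uncertainty
kappa  <- 1
mu_hat <- c(10.2, 9.1, 9.5)  # hypothetical predicted mean RMSE at 3 candidates
sd_hat <- c(0.3, 0.8, 0.2)   # hypothetical predictive standard deviations
lcb    <- mu_hat - kappa * sd_hat
which.min(lcb)  # candidate 2: a good mean plus high uncertainty wins
```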

Yachen Yan created an R package for Bayesian optimization. He also made a modification so that we can use our initial random search as the substrate for the first GP model. To search a much wider parameter space, our code looks like:

```
> ## Define the resampling method
> ctrl <- trainControl(method = "repeatedcv", repeats = 5)
>
> ## Use this function to optimize the model. The two parameters are
> ## evaluated on the log scale given their range and scope.
> svm_fit_bayes <- function(logC, logSigma) {
+ ## Use the same model code but for a single (C, sigma) pair.
+ txt <- capture.output(
+ mod <- train(y ~ ., data = train_dat,
+ method = "svmRadial",
+ preProc = c("center", "scale"),
+ metric = "RMSE",
+ trControl = ctrl,
+ tuneGrid = data.frame(C = exp(logC), sigma = exp(logSigma)))
+ )
+ ## The function wants to _maximize_ the outcome so we return
+ ## the negative of the resampled RMSE value. `Pred` can be used
+ ## to return predicted values but we'll avoid that and use zero
+ list(Score = -getTrainPerf(mod)[, "TrainRMSE"], Pred = 0)
+ }
>
> ## Define the bounds of the search.
> lower_bounds <- c(logC = -5, logSigma = -9)
> upper_bounds <- c(logC = 20, logSigma = -0.75)
> bounds <- list(logC = c(lower_bounds[1], upper_bounds[1]),
+ logSigma = c(lower_bounds[2], upper_bounds[2]))
>
> ## Create a grid of values as the input into the BO code
> initial_grid <- rand_search$results[, c("C", "sigma", "RMSE")]
> initial_grid$C <- log(initial_grid$C)
> initial_grid$sigma <- log(initial_grid$sigma)
> initial_grid$RMSE <- -initial_grid$RMSE
> names(initial_grid) <- c("logC", "logSigma", "Value")
>
> ## Run the optimization with the initial grid and do
> ## 30 iterations. We will choose new parameter values
> ## using the upper confidence bound using 1 std. dev.
>
> library(rBayesianOptimization)
>
> set.seed(8606)
> ba_search <- BayesianOptimization(svm_fit_bayes,
+ bounds = bounds,
+ init_grid_dt = initial_grid,
+ init_points = 0,
+ n_iter = 30,
+ acq = "ucb",
+ kappa = 1,
+ eps = 0.0,
+ verbose = TRUE)
```

```
20 points in hyperparameter space were pre-sampled
elapsed = 1.53 Round = 21 logC = 5.4014 logSigma = -5.8974 Value = -10.8148
elapsed = 1.54 Round = 22 logC = 4.9757 logSigma = -5.0449 Value = -9.7936
elapsed = 1.42 Round = 23 logC = 5.7551 logSigma = -5.0244 Value = -9.8128
elapsed = 1.30 Round = 24 logC = 5.2754 logSigma = -4.9678 Value = -9.7530
elapsed = 1.39 Round = 25 logC = 5.3009 logSigma = -5.0921 Value = -9.5516
elapsed = 1.48 Round = 26 logC = 5.3240 logSigma = -5.2313 Value = -9.6571
elapsed = 1.39 Round = 27 logC = 5.3750 logSigma = -5.1152 Value = -9.6619
elapsed = 1.44 Round = 28 logC = 5.2356 logSigma = -5.0969 Value = -9.4167
elapsed = 1.38 Round = 29 logC = 11.8347 logSigma = -5.1074 Value = -9.6351
elapsed = 1.42 Round = 30 logC = 15.7494 logSigma = -5.1232 Value = -9.4243
elapsed = 25.24 Round = 31 logC = 14.6657 logSigma = -7.9164 Value = -8.8410
elapsed = 32.60 Round = 32 logC = 18.3793 logSigma = -8.1083 Value = -8.7139
elapsed = 1.86 Round = 33 logC = 20.0000 logSigma = -5.6297 Value = -9.0580
elapsed = 0.97 Round = 34 logC = 20.0000 logSigma = -1.5768 Value = -19.2183
elapsed = 5.92 Round = 35 logC = 17.3827 logSigma = -6.6880 Value = -9.0224
elapsed = 18.01 Round = 36 logC = 20.0000 logSigma = -7.6071 Value = -8.5728
elapsed = 114.49 Round = 37 logC = 16.0079 logSigma = -9.0000 Value = -8.7058
elapsed = 89.31 Round = 38 logC = 12.8319 logSigma = -9.0000 Value = -8.6799
elapsed = 99.29 Round = 39 logC = 20.0000 logSigma = -9.0000 Value = -8.5596
elapsed = 106.88 Round = 40 logC = 14.1190 logSigma = -9.0000 Value = -8.5150
elapsed = 4.84 Round = 41 logC = 13.4694 logSigma = -6.5271 Value = -8.9728
elapsed = 108.37 Round = 42 logC = 19.0216 logSigma = -9.0000 Value = -8.7461
elapsed = 52.43 Round = 43 logC = 13.5273 logSigma = -8.5130 Value = -8.8728
elapsed = 39.69 Round = 44 logC = 20.0000 logSigma = -8.3288 Value = -8.4956
elapsed = 5.99 Round = 45 logC = 20.0000 logSigma = -6.7208 Value = -8.9455
elapsed = 113.01 Round = 46 logC = 14.9611 logSigma = -9.0000 Value = -8.7576
elapsed = 27.45 Round = 47 logC = 19.6181 logSigma = -7.9872 Value = -8.6186
elapsed = 116.00 Round = 48 logC = 17.3060 logSigma = -9.0000 Value = -8.6820
elapsed = 2.26 Round = 49 logC = 14.2698 logSigma = -5.8297 Value = -9.1837
elapsed = 64.50 Round = 50 logC = 20.0000 logSigma = -8.6438 Value = -8.6914
Best Parameters Found:
Round = 44 logC = 20.0000 logSigma = -8.3288 Value = -8.4956
```

Animate the search!

The final settings were found at iteration 44, with a cost setting of 485,165,195 and `sigma` = 0.0002043. I would never have thought to evaluate a cost parameter so large, and the algorithm wants to make it even larger. Does it really work?

We can fit a model based on the new configuration and compare it to random search in terms of the resampled RMSE and the RMSE on the test set:

```
> set.seed(308)
> final_search <- train(y ~ ., data = train_dat,
+ method = "svmRadial",
+ tuneGrid = data.frame(C = exp(ba_search$Best_Par["logC"]),
+ sigma = exp(ba_search$Best_Par["logSigma"])),
+ metric = "RMSE",
+ preProc = c("center", "scale"),
+ trControl = ctrl)
> compare_models(final_search, rand_search)
```

```
One Sample t-test
data: x
t = -9.0833, df = 49, p-value = 4.431e-12
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-2.34640 -1.49626
sample estimates:
mean of x
-1.92133
```

`> postResample(predict(rand_search, large_dat), large_dat$y)`

```
RMSE Rsquared
10.1112280 0.7648765
```

`> postResample(predict(final_search, large_dat), large_dat$y)`

```
RMSE Rsquared
8.2668843 0.8343405
```

Much better!

Thanks to Yachen Yan for making the `rBayesianOptimization` package.

by John Mount, Ph.D.

Data Scientist at Win-Vector LLC

In her series on principal components analysis for regression in R, Win-Vector LLC's Dr. Nina Zumel broke the demonstration down into the following pieces:

- Part 1: the proper preparation of data and use of principal components analysis (particularly for supervised learning or regression).
- Part 2: the introduction of *y*-aware scaling to direct the principal components analysis to preserve variation correlated with the outcome we are trying to predict.
- And now Part 3: how to pick the number of components to retain for analysis.

In the earlier parts Dr. Zumel demonstrates common poor practice versus best practice and quantifies the degree of available improvement. In part 3, she moves past the usual "pick the number of components by eyeballing it" non-advice and teaches decisive selection procedures. For picking the number of components to retain for analysis, there are a number of standard techniques in the literature, including:

- Pick 2, as that is all you can legibly graph.
- Pick enough to cover some fixed fraction of the variation (say 95%).
- (for variance scaled data only) Retain components with singular values at least 1.0.
- Look for a "knee in the curve" (the curve being the plot of the singular value magnitudes).
- Perform a statistical test to see which singular values are larger than we would expect from an appropriate null hypothesis or noise process.
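The last idea can be sketched in a few lines of base R (my toy version, not Dr. Zumel's code): compare the singular values of the scaled data matrix against those obtained after permuting each column independently, which destroys any cross-column correlation and so serves as the null process.

```r
# Permutation test for significant components: any singular value exceeding
# the 95th percentile of its permuted counterpart is deemed significant
set.seed(3)
n <- 100
X <- matrix(rnorm(n * 5), n, 5)
X[, 2] <- X[, 1] + rnorm(n, sd = 0.1)  # inject one real latent structure
sv_obs <- svd(scale(X))$d

# Null distribution: permute each column independently, recompute singular values
perm_sv <- replicate(200, svd(scale(apply(X, 2, sample)))$d)
thresh <- apply(perm_sv, 1, quantile, probs = 0.95)

sv_obs > thresh  # which components exceed the null's 95th percentile?
```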

Dr. Zumel shows that the last method (designing a formal statistical test) is particularly easy to encode as a permutation test in the *y*-aware setting (there is also an obvious, similarly good bootstrap test). This is well founded and pretty much state of the art. It is also a great example of why to use a scriptable analysis platform (such as R), as it is easy to wrap arbitrarily complex methods into functions and then directly perform empirical tests on them. Her "broken stick" type test yields the following graph, which identifies five principal components as significant:

However, Dr. Zumel goes on to show that in a supervised learning or regression setting we can further exploit the structure of the problem and replace the traditional component magnitude tests with simple model fit significance pruning. The significance method in this case gets the stronger result of finding the two principal components that encode the known even and odd loadings of the example problem:

In fact that is sort of her point: significance pruning either on the original variables or on the derived latent components is enough to give us the right answer. In general, we get much better results when (in a supervised learning or regression situation) we use knowledge of the dependent variable (the "*y*" or outcome) and do *all* of the following:

- Fit model and significance prune incoming variables.
- Convert incoming variables into consistent response units by *y*-aware scaling.
- Fit model and significance prune resulting latent components.

The above will become much clearer and much more specific if you click here to read part 3.

by John Mount, Ph.D.

Data Scientist at Win-Vector LLC

In part 2 of her series on Principal Components Regression, Dr. Nina Zumel illustrates so-called *y*-aware techniques. These often-neglected methods use the fact that for predictive modeling problems we know the dependent variable (the outcome, or *y*), so we can use it during data preparation *in addition to* using it during modeling. Dr. Zumel shows that incorporating *y*-aware preparation into Principal Components Analysis can capture more of the problem structure in fewer variables. Such methods include:

- Effects based variable pruning.
- Significance based variable pruning.
- Effects based variable scaling.
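Effects based (*y*-aware) scaling can be sketched as follows: center each predictor and rescale it by the slope of a single-variable regression of *y* on that predictor, so that a unit change in the scaled variable corresponds to a unit expected change in *y*. The helper name and toy data below are illustrative assumptions, not Dr. Zumel's code.

```r
# y-aware scaling sketch: rescale each column of x by the slope of a
# one-variable linear regression of y on that column.
yAwareScale <- function(x, y) {
  apply(x, 2, function(xi) {
    slope <- unname(coef(lm(y ~ xi))[2])
    slope * (xi - mean(xi))
  })
}

# Toy example: x1 has large raw units but a weak effect on y;
# x2 has small raw units but a strong effect.
set.seed(25)
n <- 50
x <- cbind(x1 = rnorm(n, sd = 10), x2 = rnorm(n))
y <- 0.1 * x[, "x1"] + 5 * x[, "x2"] + rnorm(n)

xs <- yAwareScale(x, y)
# After scaling, each column's spread reflects its estimated effect on y,
# not its raw measurement units -- x2 now dominates.
apply(xs, 2, sd)
```

The point of the rescaling is that a subsequent (unsupervised) PCA step then weights variables by their relevance to the outcome rather than by their arbitrary original units.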

This recovers more domain structure and leads to better models. Using the foundation set in the first article, Dr. Zumel quickly shows how to move from a traditional *x*-only analysis, which fails to preserve a domain-specific relation between two variables and the outcome, to a *y*-aware analysis that preserves the relation. In other words, she shows how to move away from a middling result where different values of *y* (rendered as three colors) are hopelessly intermingled when plotted against the first two latent variables found, as shown below.

Dr. Zumel shows how to perform a decisive analysis where *y* is somewhat sortable by each of the first two latent variables, *and* the first two latent variables capture complementary effects, making them good mutual candidates for further modeling (as shown below).

Click here (part 2 *y*-aware methods) for the discussion, examples, and references. Part 1 (*x* only methods) can be found here.

Apache Spark, the open-source cluster computing framework, will soon see a major update with the upcoming release of Spark 2.0. This update promises to be faster than Spark 1.6, thanks to a run-time compiler that generates optimized bytecode. It also promises to be easier for developers to use, with streamlined APIs and a more complete SQL implementation. (Here's a tutorial on using SQL with Spark.) Spark 2.0 will also include a new "structured streaming" API, which will allow developers to write algorithms for streaming data without having to worry about the fact that streaming data is always incomplete; algorithms written for complete DataFrame objects will work for streams as well.

This update also includes some news for R users. First, the DataFrame object continues to be the primary interface for R (and Python) users. Although the DataSets structure in Spark is more general, using the single-table DataFrame construct makes sense for R and Python, which have analogous native (or near-native, in Python's case) data structures. In addition, Spark 2.0 is set to add a few new distributed statistical modeling algorithms: generalized linear models (in addition to the Normal least-squares and logistic regression models in Spark 1.6); Naive Bayes; survival (censored) regression; and K-means clustering. The addition of survival regression is particularly interesting. It's a type of model used in situations where the outcome isn't always completely known: for example, some (but not all) patients may not yet have experienced remission in a cancer study. It's also used for reliability analysis and lifetime estimation in manufacturing, where some (but not all) parts may have failed by the end of the observation period. To my knowledge, this is the first distributed implementation of the survival regression algorithm.
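You don't need Spark to see what a survival (censored) regression does. Here is a small local analogue using R's survival package on its bundled `ovarian` dataset; Spark's distributed version fits an accelerated failure time (AFT) model, and `survreg()` below fits the same family of models on in-memory data (the choice of dataset and covariates here is purely illustrative).

```r
library(survival)

# ovarian: follow-up times for ovarian cancer patients.
# fustat == 1 means the event (death) was observed;
# fustat == 0 means the observation was censored (patient still alive
# at the end of follow-up), which ordinary regression cannot handle.
fit <- survreg(Surv(futime, fustat) ~ ecog.ps + rx,
               data = ovarian, dist = "weibull")

# Coefficients relate covariates to (log) survival time,
# correctly accounting for the censored observations.
summary(fit)
```

The `Surv(time, status)` response is what distinguishes this from an ordinary regression: it tells the fitter which outcomes are exact and which are only known to exceed the recorded time.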

For R users, these models can be applied to large data sets stored in Spark DataFrames, and then computed using the Spark distributed computing framework. Access to the algorithms is via the SparkR package, which hews closely to standard R interfaces for model training. You'll need to first create a SparkDataFrame object in R, as a reference to a Spark DataFrame. Then, to perform logistic regression (for example), you'll use R's standard glm function, using the SparkDataFrame object as the data argument. (This elegantly uses R's object-oriented dispatch architecture to call the Spark-specific GLM code.) The example below creates a Spark DataFrame, and then uses Spark to fit a logistic regression to it:

```
df <- createDataFrame(sqlContext, iris)
model <- glm(Species ~ Sepal_Length + Sepal_Width, data = df, family = "binomial")
```

The model object contains most (but not all) of the output created by R's traditional glm function, so most standard R functions that work with GLM objects work here as well.
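To make concrete what "standard R functions that work with GLM objects" means, here is the equivalent local (non-Spark) workflow; SparkR's model object supports `summary()` and `predict()` in the same spirit, though not every component of a local fit is available.

```r
# Local logistic regression with base R's glm(), using the built-in
# mtcars data: predict transmission type (am) from fuel economy (mpg).
fit <- glm(am ~ mpg, data = mtcars, family = "binomial")

summary(fit)                            # coefficient table with tests
p <- predict(fit, type = "response")    # fitted probabilities in [0, 1]
head(p)
```

SparkR's design goal is that code like the above transfers to Spark DataFrames with minimal changes, which is what the dispatch mechanism described earlier enables.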

For more on what's coming in Spark 2.0, check out the Databricks blog post below; the preview documentation for SparkR contains more info as well. Also, you might want to check out how you can use Spark on HDInsight with Microsoft R Server.

Databricks: Preview of Apache Spark 2.0 now on Databricks Community Edition

John Mount, Ph.D.

Data Scientist at Win-Vector LLC

Win-Vector LLC's Dr. Nina Zumel has just started a two-part series on Principal Components Regression that we think is well worth your time. You can read her article here. Principal Components Regression (PCR) is the use of Principal Components Analysis (PCA) as a dimension reduction step prior to linear regression. It is one of the best known dimensionality reduction techniques and a staple procedure in many scientific fields. PCA is used because:

- It can find important latent structure and relations.
- It can reduce over fit.
- It can ease the curse of dimensionality.
- It is used in a ritualistic manner in many scientific disciplines. In some fields it is considered ignorant and uncouth to regress using original variables.
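A minimal *x*-only PCA in base R, on synthetic data with known low-dimensional structure (an illustrative construction, not Dr. Zumel's example), shows the first of these points: the leading component recovers a latent factor shared by several observed variables.

```r
# Three observed variables: x1 and x2 are noisy copies of one latent
# factor (with opposite signs); x3 is pure noise.
set.seed(42)
n <- 100
latent <- rnorm(n)
d <- data.frame(x1 =  latent + 0.1 * rnorm(n),
                x2 = -latent + 0.1 * rnorm(n),
                x3 = rnorm(n))

# Standard x-only PCA: center and scale, then rotate to principal axes
pca <- prcomp(d, center = TRUE, scale. = TRUE)
summary(pca)   # proportion of variance explained per component

# Variance explained by each component
ve <- pca$sdev^2 / sum(pca$sdev^2)
print(ve)
```

Most of the variance concentrates in the first component, which aligns with the shared latent factor; the noise variable contributes little to it. Note that nothing in this procedure consults an outcome variable, which is exactly the limitation the *y*-aware follow-up article addresses.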

We often find ourselves having to remind readers that this last reason is not actually a positive. The standard derivation of PCA involves trotting out the math and showing the determination of eigenvector directions. It yields visually attractive diagrams such as the following.

Wikipedia: PCA

And this leads to a deficiency in much of the teaching of the method: glossing over the operational consequences and outcomes of applying the method. The mathematics is important to the extent it allows you to reason about the appropriateness of the method, the consequences of the transform, and the pitfalls of the technique. The mathematics is also critical to the correct implementation, but that is what one hopes is *already* supplied in a reliable analysis platform (such as R).

Dr. Zumel uses the expressive and graphical power of R to work through the *use* of Principal Components Regression in an operational series of examples. She works through how Principal Components Regression is typically mis-applied, and continues on to how to correctly apply it. Taking the extra time to work through the all too common errors allows her to demonstrate and quantify the benefits of correct technique. Dr. Zumel will follow part 1 with a shorter part 2 article demonstrating important "*y*-aware" techniques that squeeze much more modeling power out of your data in predictive analytic situations (which is what regression actually is). Some of the methods are already in the literature, but are still not used widely enough. We hope the demonstrated techniques and included references will give you a perspective to improve how you use or even teach Principal Components Regression. Please read on here.