*by Bob Horton, Microsoft Senior Data Scientist*

Receiver Operating Characteristic (ROC) curves are a popular way to visualize the tradeoffs between sensitivitiy and specificity in a binary classifier. In an earlier post, I described a simple “turtle’s eye view” of these plots: a classifier is used to sort cases in order from most to least likely to be positive, and a Logo-like turtle marches along this string of cases. The turtle considers all the cases it has passed as having tested positive. Depending on their actual class they are either false positives (FP) or true positives (TP); this is equivalent to adjusting a score threshold. When the turtle passes a TP it takes a step upward on the y-axis, and when it passes a FP it takes a step rightward on the x-axis. The step sizes are inversely proportional to the number of actual positives (in the y-direction) or negatives (in the x-direction), so the path always ends at coordinates (1, 1). The result is a plot of true positive rate (TPR, or specificity) against false positive rate (FPR, or 1 - sensitivity), which is all an ROC curve is.

Computing the area under the curve is one way to summarize it in a single value; this metric is so common that if data scientists say “area under the curve” or “AUC”, you can generally assume they mean an ROC curve unless otherwise specified.

Probably the most straightforward and intuitive metric for classifier performance is accuracy. Unfortunately, there are circumstances where simple accuracy does not work well. For example, with a disease that only affects 1 in a million people a completely bogus screening test that always reports “negative” will be 99.9999% accurate. Unlike accuracy, ROC curves are insensitive to class imbalance; the bogus screening test would have an AUC of 0.5, which is like not having a test at all.

In this post I’ll work through the geometry exercise of computing the area, and develop a concise vectorized function that uses this approach. Then we’ll look at another way of viewing AUC which leads to a probabilistic interpretation.

Let’s start with a simple artificial data set:

```
category <- c(1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0)
prediction <- rev(seq_along(category))
prediction[9:10] <- mean(prediction[9:10])
```

Here the vector `prediction`

holds ersatz scores; these normally would be assigned by a classifier, but here we’ve just assigned numbers so that the decreasing order of the scores matches the given order of the category labels. Scores 9 and 10, one representing a positive case and the other a negative case, are replaced by their average so that the data will contain ties without otherwise disturbing the order.

To plot an ROC curve, we’ll need to compute the true positive and false positive rates. In the earlier article we did this using cumulative sums of positives (or negatives) along the sorted binary labels. But here we’ll use the `pROC`

package to make it official:

```
library(pROC)
roc_obj <- roc(category, prediction)
auc(roc_obj)
```

`## Area under the curve: 0.825`

```
roc_df <- data.frame(
TPR=rev(roc_obj$sensitivities),
FPR=rev(1 - roc_obj$specificities),
labels=roc_obj$response,
scores=roc_obj$predictor)
```

The `roc`

function returns an object with plot methods and other conveniences, but for our purposes all we want from it is vectors of TPR and FPR values. TPR is the same as sensitivity, and FPR is 1 - specificity (see “confusion matrix” in Wikipedia). Unfortunately, the `roc`

function reports these values sorted in the order of ascending score; we want to start in the lower left hand corner, so I reverse the order. According to the `auc`

function from the pROC package, our simulated category and prediction data gives an AUC of 0.825; we’ll compare other attempts at computing AUC to this value.

If the ROC curve were a perfect step function, we could find the area under it by adding a set of vertical bars with widths equal to the spaces between points on the FPR axis, and heights equal to the step height on the TPR axis. Since actual ROC curves can also include portions representing sets of values with tied scores which are not square steps, we need to adjust the area for these segments. In the figure below we use green bars to represent the areas under the steps. Adjustments for sets of tied values will be shown as blue rectangles; half the area of each of these blue rectagles is below a sloped segment of the curve.

The function for drawing polygons in base R takes vectors of x and y values; we’ll start by defining a `rectangle`

function that uses a simpler and more specialized syntax; it takes x and y coordinates for the lower left corner of the rectangle, and a height and width. It sets some default display options, and passes along any other parameters we might specify (like color) to the `polygon`

function.

```
rectangle <- function(x, y, width, height, density=12, angle=-45, ...)
polygon(c(x,x,x+width,x+width), c(y,y+height,y+height,y),
density=density, angle=angle, ...)
```

The spaces between TPR (or FPR) values can be calculated by `diff`

. Since this results in a vector one position shorter than the original data, we pad each difference vector with a zero at the end:

```
roc_df <- transform(roc_df,
dFPR = c(diff(FPR), 0),
dTPR = c(diff(TPR), 0))
```

For this figure, we’ll draw the ROC curve last to place it on top of the other elements, so we start by drawing an empty graph (`type='n'`

) spanning from 0 to 1 on each axis. Since the data set has exactly ten positive and ten negative cases, the TPR and FPR values will all be multiples of 1/10, and the points of the ROC curve will all fall on a regularly spaced grid. We draw the grid using light blue horizontal and vertical lines spaced one tenth of a unit apart. Now we can pass the values we calculated above to the rectangle function, using `mapply`

(the multi-variate version of `sapply`

) to iterate over all the cases and draw all the green and blue rectangles. Finally we plot the ROC curve (that is, we plot TPR against FPR) on top of everything in red.

```
plot(0:10/10, 0:10/10, type='n', xlab="FPR", ylab="TPR")
abline(h=0:10/10, col="lightblue")
abline(v=0:10/10, col="lightblue")
with(roc_df, {
mapply(rectangle, x=FPR, y=0,
width=dFPR, height=TPR, col="green", lwd=2)
mapply(rectangle, x=FPR, y=TPR,
width=dFPR, height=dTPR, col="blue", lwd=2)
lines(FPR, TPR, type='b', lwd=3, col="red")
})
```

The area under the red curve is all of the green area plus half of the blue area. For adding areas we only care about the height and width of each rectangle, not its (x,y) position. The heights of the green rectangles, which all start from 0, are in the TPR column and widths are in the dFPR column, so the total area of all the green rectangles is the dot product of TPR and dFPR. Note that the vectored approach computes a rectangle for each data point, even when the height or width is zero (in which case it doesn’t hurt to add them). Similarly, the heights and widths of the blue rectangles (if there are any) are in columns dTPR and dFPR, so their total area is the dot product of these vectors. For regions of the graph that form square steps, one or the other of these values will be zero, so you only get blue rectangles (of non-zero area) if both TPR and FPR change in the same step. Only half the area of each blue rectangle is below its segment of the ROC curve (which is a diagonal of a blue rectangle). Remember the ‘real’ `auc`

function gave us an AUC of 0.825, so that is the answer we’re looking for.

```
simple_auc <- function(TPR, FPR){
# inputs already sorted, best scores first
dFPR <- c(diff(FPR), 0)
dTPR <- c(diff(TPR), 0)
sum(TPR * dFPR) + sum(dTPR * dFPR)/2
}
with(roc_df, simple_auc(TPR, FPR))
```

`## [1] 0.825`

Now let’s try a completely different approach. Here we generate a matrix representing all possible combinations of a positive case with a negative case. Each row represents a positive case, in order from the highest-scoring positive case at the bottom to the lowest-scoring positive case at the top. Similarly, the columns represent the negative cases, sorted with the highest scores at the left. Each cell represents a comparison between a particular positive case and a particular negative case, and we mark the cell by whether its positive case has a higher score (or higher overall rank) than its negative case. If your classifier is any good, most of the positive cases will outrank most of the negative cases, and any exceptions will be in the upper left corner, where low-ranking positives are being compared to high-ranking negatives.

```
rank_comparison_auc <- function(labels, scores, plot_image=TRUE, ...){
score_order <- order(scores, decreasing=TRUE)
labels <- as.logical(labels[score_order])
scores <- scores[score_order]
pos_scores <- scores[labels]
neg_scores <- scores[!labels]
n_pos <- sum(labels)
n_neg <- sum(!labels)
M <- outer(sum(labels):1, 1:sum(!labels),
function(i, j) (1 + sign(pos_scores[i] - neg_scores[j]))/2)
AUC <- mean (M)
if (plot_image){
image(t(M[nrow(M):1,]), ...)
library(pROC)
with( roc(labels, scores),
lines((1 + 1/n_neg)*((1 - specificities) - 0.5/n_neg),
(1 + 1/n_pos)*sensitivities - 0.5/n_pos,
col="blue", lwd=2, type='b'))
text(0.5, 0.5, sprintf("AUC = %0.4f", AUC))
}
return(AUC)
}
rank_comparison_auc(labels=as.logical(category), scores=prediction)
```

` `

## [1] 0.825

The blue line is an ROC curve computed in the conventional manner (slid and stretched a bit to get the coordinates to line up with the corners of the matrix cells). This makes it evident that the ROC curve marks the boundary of the area where the positive cases outrank the negative cases. The AUC can be computed by adjusting the values in the matrix so that cells where the positive case outranks the negative case receive a `1`

, cells where the negative case has higher rank receive a `0`

, and cells with ties get `0.5`

(since applying the `sign`

function to the difference in scores gives values of 1, -1, and 0 to these cases, we put them in the range we want by adding one and dividing by two.) We find the AUC by averaging these values.

The probabilistic interpretation is that if you randomly choose a positive case and a negative case, the probability that the positive case outranks the negative case according to the classifier is given by the AUC. This is evident from the figure, where the total area of the plot is normalized to one, the cells of the matrix enumerate all possible combinations of positive and negative cases, and the fraction under the curve comprises the cells where the positive case outranks the negative one.

We can use this observation to approximate AUC:

```
auc_probability <- function(labels, scores, N=1e7){
pos <- sample(scores[labels], N, replace=TRUE)
neg <- sample(scores[!labels], N, replace=TRUE)
# sum( (1 + sign(pos - neg))/2)/N # does the same thing
(sum(pos > neg) + sum(pos == neg)/2) / N # give partial credit for ties
}
auc_probability(as.logical(category), prediction)
```

`## [1] 0.8249989`

Now let’s try our new AUC functions on a bigger dataset. I’ll use the simulated dataset from the earlier blog post, where the labels are in the `bad_widget`

column of the test set dataframe, and the scores are in a vector called `glm_response_scores`

.

This data has no tied scores, so for testing let’s make a modified version that has ties. We’ll plot a black line representing the original data; since each point has a unique score, the ROC curve is a step function. Then we’ll generate tied scores by rounding the score values, and plot the rounded ROC in red. Note that we are using “response” scores from a `glm`

model, so they all fall in the range from 0 to 1. When we round these scores to one decimal place, there are 11 possible rounded scores, from 0.0 to 1.0. The AUC values calculated with the `pROC`

package are indicated on the figure.

```
roc_full_resolution <- roc(test_set$bad_widget, glm_response_scores)
rounded_scores <- round(glm_response_scores, digits=1)
roc_rounded <- roc(test_set$bad_widget, rounded_scores)
plot(roc_full_resolution, print.auc=TRUE)
```

```
##
## Call:
## roc.default(response = test_set$bad_widget, predictor = glm_response_scores)
##
## Data: glm_response_scores in 59 controls (test_set$bad_widget FALSE) < 66 cases (test_set$bad_widget TRUE).
## Area under the curve: 0.9037
```

```
lines(roc_rounded, col="red", type='b')
text(0.4, 0.43, labels=sprintf("AUC: %0.3f", auc(roc_rounded)), col="red")
```

Now we can try our AUC functions on both sets to check that they can handle both step functions and segments with intermediate slopes.

```
options(digits=22)
set.seed(1234)
results <- data.frame(
`Full Resolution` = c(
auc = as.numeric(auc(roc_full_resolution)),
simple_auc = simple_auc(rev(roc_full_resolution$sensitivities), rev(1 - roc_full_resolution$specificities)),
rank_comparison_auc = rank_comparison_auc(test_set$bad_widget, glm_response_scores,
main="Full-resolution scores (no ties)"),
auc_probability = auc_probability(test_set$bad_widget, glm_response_scores)
),
`Rounded Scores` = c(
auc = as.numeric(auc(roc_rounded)),
simple_auc = simple_auc(rev(roc_rounded$sensitivities), rev(1 - roc_rounded$specificities)),
rank_comparison_auc = rank_comparison_auc(test_set$bad_widget, rounded_scores,
main="Rounded scores (ties in all segments)"),
auc_probability = auc_probability(test_set$bad_widget, rounded_scores)
)
)
```

Full.Resolution | Rounded.Scores | |
---|---|---|

auc | 0.90369799691833586 | 0.89727786337955828 |

simple_auc | 0.90369799691833586 | 0.89727786337955828 |

rank_comparison_auc | 0.90369799691833586 | 0.89727786337955828 |

auc_probability | 0.90371970000000001 | 0.89716879999999999 |

So we have two new functions that give exactly the same results as the function from the `pROC`

package, and our probabilistic function is pretty close. Of course, these functions are intended as demonstrations; you should normally use standard packages such as `pROC`

or `ROCR`

for actual work.

Here we’ve focused on calculating AUC and understanding the probabilistic interpretation. The probability associated with AUC is somewhat arcane, and is not likely to be exactly what you are looking for in practice (unless you actually will be randomly selecting a positive and a negative case, and you really want to know the probability that the classifier will score the positive case higher.) While AUC gives a single-number summary of classifier performance that is suitable in some circumstances, other metrics are often more appropriate. In many applications, overall behavior of a classifier across all possible score thresholds is of less interest than the behavior in a specific range. For example, in marketing the goal is often to identify a highly enriched target group with a low false positive rate. In other applications it may be more important to clearly identify a group of cases likely to be negative. For example, when pre-screening for a disease or defect you may want to rule out as many cases as you can before you start running expensive confirmatory tests. More generally, evaluation metrics that take into account the actual costs of false positive and false negative errors may be much more appropriate than AUC. If you know these costs, you should probably use them. A good introduction relating ROC curves to economic utility functions, complete with story and characters, is given in the excellent blog post “ML Meets Economics.”

The glmnetUtils package provides a collection of tools to streamline the process of fitting elastic net models with glmnet. I wrote the package after a couple of projects where I found myself writing the same boilerplate code to convert a data frame into a predictor matrix and a response vector. In addition to providing a formula interface, it also has a function (`cvAlpha.glmnet`

) to do crossvalidation for both elastic net parameters α and λ, as well as some utility functions.

The interface that glmnetUtils provides is very much the same as for most modelling functions in R. To fit a model, you provide a formula and data frame. You can also provide any arguments that glmnet will accept. Here is a simple example:

mtcarsMod <- glmnet(mpg ~ cyl + disp + hp, data=mtcars) ## Call: ## glmnet.formula(formula = mpg ~ cyl + disp + hp, data = mtcars) ## ## Model fitting options: ## Sparse model matrix: FALSE ## Use model.frame: FALSE ## Alpha: 1 ## Lambda summary: ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.03326 0.11690 0.41000 1.02800 1.44100 5.05500

Under the hood, glmnetUtils creates a model matrix and response vector, and passes them to the glmnet package to do the actual model fitting. Prediction also works as you'd expect: just pass a data frame containing the new observations, along with any arguments that `predict.glmnet`

needs.

```
# least squares regression: get predictions for lambda=1
predict(mtcarsMod, newdata=mtcars, s=1)
```

You may have noticed the options "use model.frame" and "sparse model matrix" in the printed output above. glmnetUtils includes a couple of options to improve performance, especially on wide datasets and/or have many categorical (factor) variables.

The standard R method for creating a model matrix out of a data frame uses the `model.frame`

function, which has a major disadvantage when it comes to wide data. It generates a *terms* object, which specifies how the original columns of data relate to the columns in the model matrix. This involves creating and storing a (roughly) square matrix of size *p* × *p*, where *p* is the number of variables in the model. When *p* > 10000, which isn't uncommon these days, the terms object can exceed a gigabyte in size. Even if there is enough memory to store the object, processing it can be very slow.

Another issue with the standard approach is the treatment of factors. Normally, `model.matrix`

will turn an *N*-level factor into an indicator matrix with *N*−1 columns, with one column being dropped. This is necessary for unregularised models as fit with `lm`

and `glm`

, since the full set of *N* columns is linearly dependent. However, this may not be appropriate for a regularised model as fit with glmnet. The regularisation procedure shrinks the coefficients towards zero, which forces the estimated differences from the baseline to be smaller. But this only makes sense if the baseline level was chosen beforehand, or is otherwise meaningful as a default; otherwise it is effectively making the levels more similar to an arbitrarily chosen level.

To deal with these problems, glmnetUtils by default will avoid using `model.frame`

, instead building up the model matrix term-by-term. This avoids the memory cost of creating a terms object, and can be much faster than the standard approach. It will also include one column in the model matrix for all levels in a factor; that is, no baseline level is assumed. In this situation, the coefficients represent differences from the overall mean response, and shrinking them to zero *is* meaningful (usually). Machine learners may also recognise this as one-hot encoding.

glmnetUtils can also generate a *sparse* model matrix, using the `sparse.model.matrix`

function provided in the Matrix package. This works exactly the same as a regular model matrix, but takes up significantly less memory if many of its entries are zero. A scenario where this is the case would be where many of the predictors are factors, each with a large number of levels.

One piece missing from the standard glmnet package is a way of choosing α, the elastic net mixing parameter, similar to how `cv.glmnet`

chooses λ, the shrinkage parameter. To fix this, glmnetUtils provides the `cvAlpha.glmnet`

function, which uses crossvalidation to examine the impact on the model of changing α and λ. The interface is the same as for the other functions:

# Leukemia dataset from Trevor Hastie's website: # http://web.stanford.edu/~hastie/glmnet/glmnetData/Leukemia.RData load("~/Leukemia.rdata") leuk <- do.call(data.frame, Leukemia) cvAlpha.glmnet(y ~ ., data=leuk, family="binomial") ## Call: ## cvAlpha.glmnet.formula(formula = y ~ ., data = leuk, family = "binomial") ## ## Model fitting options: ## Sparse model matrix: FALSE ## Use model.frame: FALSE ## Alpha values: 0 0.001 0.008 0.027 0.064 0.125 0.216 0.343 0.512 0.729 1 ## Number of crossvalidation folds for lambda: 10

`cvAlpha.glmnet`

uses the algorithm described in the help for `cv.glmnet`

, which is to fix the distribution of observations across folds and then call `cv.glmnet`

in a loop with different values of α. Optionally, you can parallelise this outer loop, by setting the `outerParallel`

argument to a non-NULL value. Currently, glmnetUtils supports the following methods of parallelisation:

- Via
`parLapply`

in the parallel package. To use this, set`outerParallel`

to a valid cluster object created by`makeCluster`

. - Via
`rxExec`

as supplied by Microsoft R Server’s RevoScaleR package. To use this, set`outerParallel`

to a valid compute context created by`RxComputeContext`

, or a character string specifying such a context.

The glmnetUtils package is a way to improve quality of life for users of glmnet. As with many R packages, it’s always under development; you can get the latest version from my GitHub repo. The easiest way to install it is via devtools:

```
library(devtools)
install_github("hong-revo/glmnetUtils")
```

A more detailed version of this post can also be found at the package vignette. If you find a bug, or if you want to suggest improvements to the package, please feel free to contact me at hongooi@microsoft.com.

by Srini Kumar, Director of Data Science at Microsoft

We tend to think of R and other such ML tools only in the context of the workplace, to do “weighty” things aimed at saving millions. A little judicious use of R may help us hugely in our personal lives too. The ideas of regression, classification trees etc. can be powerful tools in valuation, as I found out.

Recently, I was in a five-car accident on the infamous 101 in the San Francisco bay area. Luckily, none of us required an ambulance and all of us walked away. However, my car was, in insurance parlance, a "total loss". I was left wondering what I should expect as a check from my insurance company. I found the data I needed on the web, and used R to very quickly come up with a model to value the car. While its being astonishingly accurate was probably an exception, its placing the value in the ballpark illustrates how easy it is to use R for a quick yet reasonably accurate analysis.

First off, we need to recognize that our expected value should not be the blue book value of the car. It should be the amount we have to pay (discounting the taxes and other non-discretionary expenses) to get a similar car from a used (or in the higher end, "pre-owned") car from a car dealer. Therefore, I searched for all the available cars of that model and year available in the United States, and got a list of 70 cars from all over the country.

The only part that involved some drudgery was copying the location, mileage and the asking price for each car from that PDF file and putting it in a spreadsheet. A reasonable guess is that the car's value depends mostly on the mileage on it, and a reasonable assumption (which turned out to be a good one) that it also depended on where it is available.

The rest of the analysis was quite easy. Having read in the tab-delimited format data, I checked the mileage and price for different states by way of initial exploration. As an aside, I tend to use and recommend the tab separated format over the comma separated format always. Text fields rarely contain tabs, but contain commas far more frequently.

As we can see, there is too wide a spread among the states, and too little data from mine. Anyway, the simple linear regression on mileage and states yielded a model, and its prediction was $23,122.47. I had to guess the mileage on my car and guessed it reasonably accurately to about 45,000. After doing this, especially since the data points from my state were too few, I tried a decision tree to check for the dependence of price on state, and got this:

The decision tree algorithm evidently did not sense the state to be a factor to determine the price.

Armed with this knowledge, I waited to talk with the insurance company. To their credit, what they offered me was less than 0.4 percent off the predicted value! To be sure, there were unknowns. I had guessed my mileage, my car had been loaded with options, and I did not scrape any options from the publicly available data. An additional regression I did completely ignoring the state and focusing only on the mileage would have given me an error of a little over 3 percent. However, it was interesting that the modeling could predict the price to within about 4 percent, setting very well the stage for negotiations if it came to it, particularly since the alternative is to make subjective guesses.

So, the next time you need to value something, provided some data on it is available, you can, in less than an hour, come up with objective and defensible estimates to help you negotiate. Here is the R code, if you would like to try it yourself. The only thing is that you need to be able to find data on that item or a comparable one, which is easy to do.

The results of many scientific papers are wrong. There are many reasons for this, including p-hacking, publication bias, and the general inability to replicate results. But there's another, more mundane cause: incorrect calculation of p-values in statistical tests. This could be caused by simple transcription errors when plugging numbers into a statistical tool, incorrect rounding, or misapplication of the test itself (say, applying a two-sided test when a 1-sided p-value is appropriate). Such errors should be picked up in the peer review process, but given that even expert statisicians sometimes struggle to explain p-values, it's not surprising that some errors get through.

That's why Michèle B. Nuijten, a PhD student at Tilburg University, created the R package statcheck. Given a paper to be published in a psychology journal, statcheck searches for statistical results from \(t\), \(F\), \(r\), \(\chi^2\), and \(Z\) tests, and compares the published p-value to a value calculated by R. This is possible only because the American Psychological Association Style Guide has a very specific format for reporting statistical results, listing the p-value next to the reported test statistic. Statcheck also attempts to detect if the surrounding language mentions a "one-sided" or "one-tailed" test and calculates the p-value in R accordingly (although this process isn't perfect). Anyone can use statcheck by uploading a PDF or HTML version of their paper to the statcheck web application, or by using the statcheck function within R directly.

Nuijten recounts the origins and development of statcheck in an interesting article in Retraction Watch. One major surprise: when they applied statcheck to p-values reported in eight major psychology journals from 1985 to 2013:

Half of the papers in psychology contain at least one statistical reporting inconsistency, and one in eight papers contain an inconsistency that might have affected the statistical conclusion.

Since then, they've further automated statcheck by automatically sharing the results of its analyses for 50,000 papers at PubPeer. Not everyone was pleased by the notifications (a former president of the Association for Psychological Science called it 'methodological terrorism'), but the process did reveal more inconsistencies in published papers.

For more on statcheck, check out its website at the link below.

Michèle B. Nuijten: R package “statcheck”: Extract statistics from articles and recompute p values (Epskamp & Nuijten, 2016)

by Joseph Rickert

My guess is that a good many statistics students first encounter the bivariate Normal distribution as one or two hastily covered pages in an introductory text book, and then don't think much about it again until someone asks them to generate two random variables with a given correlation structure. Fortunately for R users, a little searching on the internet will turn up several nice tutorials with R code explaining various aspects of the bivariate Normal. For this post, I have gathered together a few examples and tweaked the code a little to make comparisons easier.

Here are five different ways to simulate random samples bivariate Normal distribution with a given mean and covariance matrix.

To set up for the simulations this first block of code defines N, the number of random samples to simulate, the means of the random variables, and and the covariance matrix. It also provides a small function for drawing confidence ellipses on the simulated data.

library(mixtools) #for ellipse

N <- 200 # Number of random samples

set.seed(123)

# Target parameters for univariate normal distributions

rho <- -0.6

mu1 <- 1; s1 <- 2

mu2 <- 1; s2 <- 8

# Parameters for bivariate normal distribution

mu <- c(mu1,mu2) # Mean

sigma <- matrix(c(s1^2, s1*s2*rho, s1*s2*rho, s2^2),

2) # Covariance matrix

# Function to draw ellipse for bivariate normal data

ellipse_bvn <- function(bvn, alpha){

Xbar <- apply(bvn,2,mean)

S <- cov(bvn)

ellipse(Xbar, S, alpha = alpha, col="red")

}

The first method, the way to go if you just want to get on with it, is to use the mvrnorm() function from the MASS package.

library(MASS)

bvn1 <- mvrnorm(N, mu = mu, Sigma = sigma ) # from MASS package

colnames(bvn1) <- c("bvn1_X1","bvn1_X2")

It takes so little code to do the simulation it might be possible to tweet in a homework assignment.

A look at the source code for mvrnorm() shows that it uses eignevectors to generate the random samples. The documentation for the function states that this method was selected because it is stabler than the alternative of using a Cholesky decomposition which might be faster.

For the second method, let's go ahead and directly generate generate bivariate Normal random variates with the Cholesky decomposition. Remember that the Cholesky decomposition of sigma (a positive definite matrix) yields a matrix M such that M times its transpose gives sigma back again. Multiplying M by a matrix of standard random Normal variates and adding the desired mean gives a matrix of the desired random samples. A lecture from Colin Rundel covers some of the theory.

M <- t(chol(sigma))

# M %*% t(M)

Z <- matrix(rnorm(2*N),2,N) # 2 rows, N/2 columns

bvn2 <- t(M %*% Z) + matrix(rep(mu,N), byrow=TRUE,ncol=2)

colnames(bvn2) <- c("bvn2_X1","bvn2_X2")

For the third method we make use of a special property of the bivariate normal that is discussed in almost all of those elementary textbooks. If X_{1} and X_{2} are two jointly distributed random variables, then the conditional distribution of X_{2} given X_{1} is itself normal with: mean = m_{2} + r(s_{2}/s_{1})(X_{1} - m_{1}) and variance = (1 - r^{2})s^{2}X_{2}.

Hence, a sample from a bivariate Normal distribution can be simulated by first simulating a point from the marginal distribution of one of the random variables and then simulating from the second random variable conditioned on the first. A brief proof of the underlying theorem is available here.

rbvn<-function (n, m1, s1, m2, s2, rho)

{

X1 <- rnorm(n, mu1, s1)

X2 <- rnorm(n, mu2 + (s2/s1) * rho *

(X1 - mu1), sqrt((1 - rho^2)*s2^2))

cbind(X1, X2)

}

bvn3 <- rbvn(N,mu1,s1,mu2,s2,rho)

colnames(bvn3) <- c("bvn3_X1","bvn3_X2")

The fourth method, my favorite, comes from Professor Darren Wiliinson's Gibbs Sampler tutorial. This is a very nice idea; using the familiar bivariate Normal distribution to illustrate the basics of the Gibbs Sampling Algorithm. Note that this looks very much like the previous method, except that now we are alternately sampling from the full conditional distributions.

gibbs<-function (n, mu1, s1, mu2, s2, rho)

{

mat <- matrix(ncol = 2, nrow = n)

x <- 0

y <- 0

mat[1, ] <- c(x, y)

for (i in 2:n) {

x <- rnorm(1, mu1 +

(s1/s2) * rho * (y - mu2), sqrt((1 - rho^2)*s1^2))

y <- rnorm(1, mu2 +

(s2/s1) * rho * (x - mu1), sqrt((1 - rho^2)*s2^2))

mat[i, ] <- c(x, y)

}

mat

}

bvn4 <- gibbs(N,mu1,s1,mu2,s2,rho)

colnames(bvn4) <- c("bvn4_X1","bvn4_X2")

The fifth and final way uses the rmvnorm() function form the mvtnorm package with the singular value decomposition method selected. The functions in this package are overkill for what we are doing here, but mvtnorm is probably the package you would want to use if you are calculating probabilities from high dimensional multivariate distributions. It implements numerical methods for carefully calculating the high dimensional integrals involved that are based on some papers by Professor Alan Genz dating from the early '90s. These methods are briefly explained in the package vignette.

library (mvtnorm)

bvn5 <- mvtnorm::rmvnorm(N,mu,sigma, method="svd")

colnames(bvn5) <- c("bvn5_X1","bvn5_X2")

Note that I have used the :: operator here to make sure that R uses the rmvnorm() function from the mvtnorm package. There is also a rmvnorm() function in the mixtools package that I used to get the ellipse function. Loading the packages in the wrong order could lead to the rookie mistake of having the function you want inadvertently overwritten.

Next, we plot the results of drawing just 100 random samples for each method. This allows us to see how the algorithms spread data over the sample space as they are just getting started.

bvn <- list(bvn1,bvn2,bvn3,bvn4,bvn5)

par(mfrow=c(3,2))

plot(bvn1, xlab="X1",ylab="X2",main= "All Samples")

for(i in 2:5){

points(bvn[[i]],col=i)

}

for(i in 1:5){

item <- paste("bvn",i,sep="")

plot(bvn[[i]],xlab="X1",ylab="X2",main=item, col=i)

ellipse_bvn(bvn[[i]],.5)

ellipse_bvn(bvn[[i]],.05)

}

par(mfrow=c(1,1))

The first plot shows all 500 random samples color coded by the method with which they were generated. The remaining plots show the samples generated by each method. In each of these plots the ellipses mark the 0.5 and 0.95 probability regions, i.e. the area within the ellipses should contain 50% and 95% of the points respectively. Note that bvn4 which uses the Gibbs sampling algorithm looks like all of the rest. In most use cases for the Gibbs it takes the algorithm some time to converge to the target distribution. In our case, we start out with a pretty good guess.

Finally, a word about accuracy: nice coverage of the sample space is not sufficient to produce accurate results. A little experimentation will show that, for all of the methods outlined above, regularly achieving a sample covariance matrix that is close to the target, sigma, requires something on the order of 10,000 samples as is Illustrated below.

> sigma

[,1] [,2]

[1,] 4.0 -9.6

[2,] -9.6 64.0

for(i in 1:5){

print(round(cov(bvn[[i]]),1))

}

bvn1_X1 bvn1_X2

bvn1_X1 4.0 -9.5

bvn1_X2 -9.5 63.8

bvn2_X1 bvn2_X2

bvn2_X1 3.9 -9.5

bvn2_X2 -9.5 64.5

bvn3_X1 bvn3_X2

bvn3_X1 4.1 -9.8

bvn3_X2 -9.8 63.7

bvn4_X1 bvn4_X2

bvn4_X1 4.0 -9.7

bvn4_X2 -9.7 64.6

bvn5_X1 bvn5_X2

bvn5_X1 4.0 -9.6

bvn5_X2 -9.6 65.3

Many people coming to R for the first time find it disconcerting to realize that there are several ways to do some fundamental calculation in R. My take is that rather than being a point of frustration, having multiple options indicates that richness of the R language. A close look at the package documentation will often show that yet another method to do something is a response to some subtle need that was not previously addressed. Enjoy the diversity!

As anyone who has tried Pokémon Go recently is probably aware, Pokémon come in different types. A Pokémon's type affects where and when it appears, and the types of attacks it is vulnerable to. Some types, like Normal, Water and Grass are common; others, like Fairy and Dragon are rare. Many Pokémon have two or more types.

To get a sense of the distribution of Pokémon types, Joshua Kunst used R to download data from the Pokémon API and created a treemap of all the Pokémon types (and for those with more than 1 type, the secondary type). Johnathon's original used the 800+ Pokémon from the modern universe, but I used his R code to recreate the map for the 151 original Pokémon used in Pokémon Go.

Pokémon have many other attributes as well, including: weight, height, attack and defense ratings, hit points, and speed. It's hard to visualize so many variables in a 2-dimensional cart, so Joshua used a technique called t-Distributed Stochastic Neighbor Embedding (implemented in the tsne package for R) to cluster similar Pokémon in a two-dimensional chart, and used R's image-handling capbilities to include avatars for each of the Pokémon.

This chart, which includes modern Pokémon along with the 151 originals in Pokémon Go, is colored according to each Pokémon's primary type. As you can see, the TSNE algorithm is super effective at clustering Pokémon according to type.

For more details on Joshua's analysis, including interactive versions of these charts and the R code that created them, follow the link below.

Joshua Kunst: Pokémon: Visualize 'em all! (via Matthew Bashton)

by Joseph Rickert

Just about two and a half years ago I wrote about some resources for doing Bayesian statistics in R. Motivated by the tutorial Modern Bayesian Tools for Time Series Analysis by Harte and Weylandt that I attended at R/Finance last month, and the upcoming tutorial An Introduction to Bayesian Inference using R Interfaces to Stan that Ben Goodrich is going to give at useR! I thought I'd look into what's new. Well, Stan is what's new! Yes, Stan has been under development and available for some time. But somehow, while I wasn't paying close attention, two things happened: (1) the rstan package evolved to make the mechanics of doing Bayesian in R analysis really easy and (2) the Stan team produced and/or organized an amazing amount of documentation.

My impressions of doing Bayesian analysis in R were set in the WinBUGS era. The separate WinBUGs installation was always tricky, and then moving between the BRugs and R2WinBUGS packages presented some additional challenges. My recent Stan experience was nothing like this. I had everything up and running in just a few minutes. The directions for getting started with rstan are clear and explicit about making sure that you have the right tool chain in place for your platform. Since I am running R 3.3.0 on Windows 10 I installed Rtools34. This went quickly and as expected except that C:\Rtools\gcc-4.x-y\bin did not show up in my path variable. Not a big deal: I used the menus in the Windows System Properties box to edit the Path statement by hand. After this, rstan installed like any other R package and I was able to run the 8schools example from the package vignette. The following 10 minute video by Ehsan Karim takes you through the install process and the vignette example.

The Stan documentation includes four major components: (1) The Stan Language Manual, (2) Examples of fully worked out problems, (3) Contributed Case Studies and (4) both slides and video tutorials. This is an incredibly rich cache of resources that makes a very credible case for the ambitious project of teaching people with some R experience both Bayesian Statistics and Stan at the same time. The "trick" here is that the documentation operates at multiple levels of sophistication with entry points for students with different backgrounds. For example, a person with some R and the modest statistics background required for approaching Gelman and Hill's extraordinary text: Data Analysis Using Regression and Multilevel/Hierarchical Models can immediately beginning running rstan code for the book's examples. To run the rstan version of the example in section 5.1, Logistic Regression with One Predictor, with no changes a student only needs only to copy the R scripts and data into her local environment. In this case, she would need the R script: 5._LogisticRegressionWithOnePredictor. R, the data: nes1992_vote.data.R and the Stan code: nes_logit.stan**.** The Stan code for this simple model is about as straightforward as it gets: variable declarations, parameter identification and the model itself.

data { | |

int<lower=0> N; | |

vector[N] income; | |

int<lower=0,upper=1> vote[N]; | |

} | |

parameters { | |

vector[2] beta; | |

} | |

model { | |

vote ~ bernoulli_logit(beta[1] + beta[2] * income); | |

} |

Running the script will produce the iconic logistic regression plot:

I'll wind down by curbing my enthusiasm just a little by pointing out that Stan is not the only game in town. JAGS is a popular alternative, and there is plenty that can be done with unaugmented R code alone as the Bayesian Inference Task View makes abundantly clear.

If you are a book person and new to Bayesian statistics, I highly recommend Bayesian Essentials with R by Jean-Michel Marin and Christian Robert. The authors provide a compact introduction to Bayesian statistics that is backed up with numerous R examples. Also, the new book by Richard McElreath, Statistical Rethinking: A Bayesian Course with Examples in R and Stan looks like it is going to be an outstanding read. The online supplements to the book are certainly worth a look.

Finally, if you are a Bayesian or a thinking about becoming one and you are going to useR!, be sure to catch the following talks:

- Bayesian analysis of generalized linear mixed models with JAGS, by Martyn Plummer
- bamdit: An R Package for Bayesian meta-Analysis of diagnostic test data by Pablo Emilio Verde
- Fitting complex Bayesian models with R-INLA and MCMC by Virgilio Gómez-Rubio
- bayesboot: An R package for easy Bayesian bootstrapping by Rasmus Arnling Bååth
- An Introduction to Bayesian Inference using R Interfaces to Stan by Ben Goodrich
- DiLeMMa - Distributed Learning with Markov Chain Monte Carlo Algorithms Using the ROAR Package by Ali Zaidi

The forecast package for R, created and maintained by Professor Rob Hyndman of Monash University, is one of the more useful R packages available available on CRAN. Statistical forecasting — the process of predicting the future value of a time series — is used in just about every realm of data analysis, whether it's trying to predict a future stock price or trying to anticipate changes in the weather. If you're looking to learn about forecasting, a great place to start is the online book Forecasting: Principles and Practice (by Hyndman and George Athanasopoulos) which walks you through the theory and practice, with many examples in R based on the forecast package. Topics covered include multiple regression, Time series decomposition, exponential smoothing, and ARIMA models.

The forecast package itself recently received a major update, to version 7. One major new capability is the ability to easily chart forecasts using the ggplot2 package with the new autoplot function. For example:

fc <- forecast(fdeaths) autoplot(fc)

You can also add forecasts to any ggplot using the new geom_forecasts geom provided by the forecast package:

autoplot(mdeaths) + geom_forecast(h=36, level=c(50,80,95))

There have been several updates to the forecasting functions as well. The function for fitting linear models to time series data, tslm, has been rewritten to be more compatible with the standard lm function. It's now possible to forecast means (as well as medians) when using Box-Cox transformations. And you can now apply neural networks to time series data by building a nonlinear autoregressive model with the new nnetar function.

Those are just some of the highlights of the updates to the forecast package in version 7. For complete details, follow the links to Rob Hyndman's blog, below.

Hyndsight: forecast v7 and ggplot2 graphics ; Forecast v7 (part 2) (via traims)

by Max Kuhn: Director, Nonclinical Statistics, Pfizer

Many predictive and machine learning models have structural or *tuning* parameters that cannot be directly estimated from the data. For example, when using *K*-nearest neighbor model, there is no analytical estimator for *K* (the number of neighbors). Typically, resampling is used to get good performance estimates of the model for a given set of values for *K* and the one associated with the best results is used. This is basically a grid search procedure. However, there are other approaches that can be used. I’ll demonstrate how Bayesian optimization and Gaussian process models can be used as an alternative.

To demonstrate, I’ll use the regression simulation system of Sapp et al. (2014) where the predictors (i.e. `x`

’s) are independent Gaussian random variables with mean zero and a variance of 9. The prediction equation is:

x_1 + sin(x_2) + log(abs(x_3)) + x_4^2 + x_5*x_6 + I(x_7*x_8*x_9 < 0) + I(x_10 > 0) + x_11*I(x_11 > 0) + sqrt(abs(x_12)) + cos(x_13) + 2*x_14 + abs(x_15) + I(x_16 < -1) + x_17*I(x_17 < -1) - 2 * x_18 - x_19*x_20

The random error here is also Gaussian with mean zero and a variance of 9. This simulation is available in the `caret`

package via a function called `SLC14_1`

. First, we’ll simulate a training set of 250 data points and also a larger set that we will use to elucidate the true parameter surface:

```
> library(caret)
> set.seed(7210)
> train_dat <- SLC14_1(250)
> large_dat <- SLC14_1(10000)
```

We will use a radial basis function support vector machine to model these data. For a fixed epsilon, the model will be tuned over the cost value and the radial basis kernel parameter, commonly denotes as `sigma`

. Since we are simulating the data, we can figure out a good approximation to the relationship between these parameters and the root mean squared error (RMSE) or the model. Given our specific training set and the larger simulated sample, here is the RMSE surface for a wide range of values:

There is a wide range of parameter values that are associated with very low RMSE values in the northwest.

A simple way to get an initial assessment is to use random search where a set of random tuning parameter values are generated across a “wide range”. For a RBF SVM, `caret`

’s `train`

function defines wide as cost values between `2^c(-5, 10)`

and `sigma`

values inside the range produced by the `sigest`

function in the `kernlab`

package. This code will do 20 random sub-models in this range:

```
> rand_ctrl <- trainControl(method = "repeatedcv", repeats = 5,
+ search = "random")
>
> set.seed(308)
> rand_search <- train(y ~ ., data = train_dat,
+ method = "svmRadial",
+ ## Create 20 random parameter values
+ tuneLength = 20,
+ metric = "RMSE",
+ preProc = c("center", "scale"),
+ trControl = rand_ctrl)
```

`> rand_search`

```
Support Vector Machines with Radial Basis Function Kernel
250 samples
20 predictor
Pre-processing: centered (20), scaled (20)
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 226, 224, 224, 225, 226, 224, ...
Resampling results across tuning parameters:
sigma C RMSE Rsquared
0.01161955 42.75789360 10.50838 0.7299837
0.01357777 67.97672171 10.71276 0.7212605
0.01392676 828.08072944 10.75235 0.7195869
0.01394119 0.18386619 18.56921 0.2109284
0.01538656 0.05224914 19.33310 0.1890599
0.01711920 228.59215128 11.09522 0.7047713
0.01790202 0.78835920 16.78597 0.3217203
0.01936110 0.91401289 16.45485 0.3492278
0.02023763 0.07658831 19.03987 0.2081059
0.02690269 0.04128731 19.33974 0.2126950
0.02780880 0.64865483 16.52497 0.3545042
0.02920113 974.08943821 12.22906 0.6508754
0.02963586 1.19350198 15.46690 0.4407725
0.03370625 31.45179445 12.60653 0.6314384
0.03561750 0.04970422 19.23564 0.2306298
0.03752561 0.06592800 19.07130 0.2375616
0.03783570 398.44599747 12.92958 0.6143790
0.04534046 3.91017571 13.56612 0.5798001
0.05171719 296.65916049 13.88865 0.5622445
0.06482201 47.31716568 14.66904 0.5192667
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.01161955 and C
= 42.75789.
```

`> ggplot(rand_search) + scale_x_log10() + scale_y_log10()`

> getTrainPerf(rand_search)

```
TrainRMSE TrainRsquared method
1 10.50838 0.7299837 svmRadial
```

There are other approaches that we can take, including a more comprehensive grid search or using a nonlinear optimizer to find better values of cost and `sigma`

. Another approach is to use Bayesian optimization to find good values for these parameters. This is an optimization scheme that uses Bayesian models based on Gaussian processes to predict good tuning parameters.

Gaussian Process (GP) regression is used to facilitate the Bayesian analysis. If creates a regression model to formalize the relationship between the outcome (RMSE, in this application) and the SVM tuning parameters. The standard assumption regarding normality of the residuals is used and, being a Bayesian model, the regression parameters also gain a prior distribution that is multivariate normal. The GP regression model uses a kernel basis expansion (much like the SVM model does) in order to allow the model to be nonlinear in the SVM tuning parameters. To do this, a radial basis function kernel is used for the covariance function of the multivariate normal prior and maximum likelihood is used to estimate the kernel parameters of the GP.

In the end, the GP regression model can take the current set of resampled RMSE values and make predictions over the entire space of potential cost and `sigma`

parameters. The Bayesian machinery allows of this prediction to have a *distribution*; for a given set of tuning parameters, we can obtain the estimated mean RMSE values as well as an estimate of the corresponding prediction variance. For example, if we were to use our data from the random search to build a GP model, the predicted mean RMSE would look like:

The darker regions indicate smaller RMSE values given the current resampling results. The predicted standard deviation of the RMSE is:

The prediction noise becomes larger (e.g. darker) as we move away from the current set of observed values.

(The `GPfit`

package was used to create these models.)

To find good parameters to test, there are several approaches. This paper (pdf) outlines several but we will use the *confidence bound* approach. For any combination of cost and `sigma`

, we can compute the lower confidence bound of the predicted RMSE. Since this takes the uncertainty of prediction into account it has the potential to produce better directions to take the optimization. Here is a plot of the confidence bound using a single standard deviation of the predicted mean:

Darker values indicate better conditions to explore. Since we know the true RMSE surface, we can see that the best region (the northwest) is estimated to be an interesting location to take the optimization. The optimizer would pick a good location based on this model and evaluate this as the next parameter value. This most recent configuration is added to the GP’s training set and the process continues for a pre-specified number of iterations.

Yachen Yan created an R package for Bayesian optimization. He also made a modification so that we can use our initial random search as the substrate to the first GP used. To search a much wider parameter space, our code looks like:

```
> ## Define the resampling method
> ctrl <- trainControl(method = "repeatedcv", repeats = 5)
>
> ## Use this function to optimize the model. The two parameters are
> ## evaluated on the log scale given their range and scope.
> svm_fit_bayes <- function(logC, logSigma) {
+ ## Use the same model code but for a single (C, sigma) pair.
+ txt <- capture.output(
+ mod <- train(y ~ ., data = train_dat,
+ method = "svmRadial",
+ preProc = c("center", "scale"),
+ metric = "RMSE",
+ trControl = ctrl,
+ tuneGrid = data.frame(C = exp(logC), sigma = exp(logSigma)))
+ )
+ ## The function wants to _maximize_ the outcome so we return
+ ## the negative of the resampled RMSE value. `Pred` can be used
+ ## to return predicted values but we'll avoid that and use zero
+ list(Score = -getTrainPerf(mod)[, "TrainRMSE"], Pred = 0)
+ }
>
> ## Define the bounds of the search.
> lower_bounds <- c(logC = -5, logSigma = -9)
> upper_bounds <- c(logC = 20, logSigma = -0.75)
> bounds <- list(logC = c(lower_bounds[1], upper_bounds[1]),
+ logSigma = c(lower_bounds[2], upper_bounds[2]))
>
> ## Create a grid of values as the input into the BO code
> initial_grid <- rand_search$results[, c("C", "sigma", "RMSE")]
> initial_grid$C <- log(initial_grid$C)
> initial_grid$sigma <- log(initial_grid$sigma)
> initial_grid$RMSE <- -initial_grid$RMSE
> names(initial_grid) <- c("logC", "logSigma", "Value")
>
> ## Run the optimization with the initial grid and do
> ## 30 iterations. We will choose new parameter values
> ## using the upper confidence bound using 1 std. dev.
>
> library(rBayesianOptimization)
>
> set.seed(8606)
> ba_search <- BayesianOptimization(svm_fit_bayes,
+ bounds = bounds,
+ init_grid_dt = initial_grid,
+ init_points = 0,
+ n_iter = 30,
+ acq = "ucb",
+ kappa = 1,
+ eps = 0.0,
+ verbose = TRUE)
```

```
20 points in hyperparameter space were pre-sampled
elapsed = 1.53 Round = 21 logC = 5.4014 logSigma = -5.8974 Value = -10.8148
elapsed = 1.54 Round = 22 logC = 4.9757 logSigma = -5.0449 Value = -9.7936
elapsed = 1.42 Round = 23 logC = 5.7551 logSigma = -5.0244 Value = -9.8128
elapsed = 1.30 Round = 24 logC = 5.2754 logSigma = -4.9678 Value = -9.7530
elapsed = 1.39 Round = 25 logC = 5.3009 logSigma = -5.0921 Value = -9.5516
elapsed = 1.48 Round = 26 logC = 5.3240 logSigma = -5.2313 Value = -9.6571
elapsed = 1.39 Round = 27 logC = 5.3750 logSigma = -5.1152 Value = -9.6619
elapsed = 1.44 Round = 28 logC = 5.2356 logSigma = -5.0969 Value = -9.4167
elapsed = 1.38 Round = 29 logC = 11.8347 logSigma = -5.1074 Value = -9.6351
elapsed = 1.42 Round = 30 logC = 15.7494 logSigma = -5.1232 Value = -9.4243
elapsed = 25.24 Round = 31 logC = 14.6657 logSigma = -7.9164 Value = -8.8410
elapsed = 32.60 Round = 32 logC = 18.3793 logSigma = -8.1083 Value = -8.7139
elapsed = 1.86 Round = 33 logC = 20.0000 logSigma = -5.6297 Value = -9.0580
elapsed = 0.97 Round = 34 logC = 20.0000 logSigma = -1.5768 Value = -19.2183
elapsed = 5.92 Round = 35 logC = 17.3827 logSigma = -6.6880 Value = -9.0224
elapsed = 18.01 Round = 36 logC = 20.0000 logSigma = -7.6071 Value = -8.5728
elapsed = 114.49 Round = 37 logC = 16.0079 logSigma = -9.0000 Value = -8.7058
elapsed = 89.31 Round = 38 logC = 12.8319 logSigma = -9.0000 Value = -8.6799
elapsed = 99.29 Round = 39 logC = 20.0000 logSigma = -9.0000 Value = -8.5596
elapsed = 106.88 Round = 40 logC = 14.1190 logSigma = -9.0000 Value = -8.5150
elapsed = 4.84 Round = 41 logC = 13.4694 logSigma = -6.5271 Value = -8.9728
elapsed = 108.37 Round = 42 logC = 19.0216 logSigma = -9.0000 Value = -8.7461
elapsed = 52.43 Round = 43 logC = 13.5273 logSigma = -8.5130 Value = -8.8728
elapsed = 39.69 Round = 44 logC = 20.0000 logSigma = -8.3288 Value = -8.4956
elapsed = 5.99 Round = 45 logC = 20.0000 logSigma = -6.7208 Value = -8.9455
elapsed = 113.01 Round = 46 logC = 14.9611 logSigma = -9.0000 Value = -8.7576
elapsed = 27.45 Round = 47 logC = 19.6181 logSigma = -7.9872 Value = -8.6186
elapsed = 116.00 Round = 48 logC = 17.3060 logSigma = -9.0000 Value = -8.6820
elapsed = 2.26 Round = 49 logC = 14.2698 logSigma = -5.8297 Value = -9.1837
elapsed = 64.50 Round = 50 logC = 20.0000 logSigma = -8.6438 Value = -8.6914
Best Parameters Found:
Round = 44 logC = 20.0000 logSigma = -8.3288 Value = -8.4956
```

Animate the search!

The final settings were found at iteration 44 with a cost setting of 485,165,195 and `sigma`

=0.0002043. I would have never thought to evaluate a cost parameter so large and the algorithm wants to make it even larger. Does it really work?

We can fit a model based on the new configuration and compare it to random search in terms of the resampled RMSE and the RMSE on the test set:

```
> set.seed(308)
> final_search <- train(y ~ ., data = train_dat,
+ method = "svmRadial",
+ tuneGrid = data.frame(C = exp(ba_search$Best_Par["logC"]),
+ sigma = exp(ba_search$Best_Par["logSigma"])),
+ metric = "RMSE",
+ preProc = c("center", "scale"),
+ trControl = ctrl)
> compare_models(final_search, rand_search)
```

```
One Sample t-test
data: x
t = -9.0833, df = 49, p-value = 4.431e-12
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-2.34640 -1.49626
sample estimates:
mean of x
-1.92133
```

`> postResample(predict(rand_search, large_dat), large_dat$y)`

```
RMSE Rsquared
10.1112280 0.7648765
```

`> postResample(predict(final_search, large_dat), large_dat$y)`

```
RMSE Rsquared
8.2668843 0.8343405
```

Much Better!

Thanks to Yachen Yan for making the `rBayesianOptimization`

package.