Bayesian Inference is a way of combining information from data with things we think we already know. For example, if we wanted to get an estimate of the mean height of people, we could use our prior knowledge that people are generally between 5 and 6 feet tall to inform the results from the data we collect. If our prior is informative and we don't have much data, this will help us to get a better estimate. If we have a lot of data, even if the prior is wrong (say, our population is NBA players), the prior won't change the estimate much. You might say that including such "subjective" information in a statistical model isn't right, but there's subjectivity in the selection of any statistical model. Bayesian Inference makes that subjectivity explicit.
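To make the height example concrete, here is a minimal sketch in R. It assumes a conjugate Normal model with a known observation standard deviation, and every number in it (the prior, the simulated heights, the 6.6-foot population mean) is invented for illustration. The function `posterior_mean` is a hypothetical helper, not part of any package.

```r
# Conjugate Normal model: Normal prior on the mean, known observation sd (an assumption)
posterior_mean <- function(x, prior_mean, prior_sd, obs_sd) {
  n <- length(x)
  prior_precision <- 1 / prior_sd^2
  data_precision <- n / obs_sd^2
  # Precision-weighted average of the prior mean and the sample mean
  (prior_precision * prior_mean + data_precision * mean(x)) /
    (prior_precision + data_precision)
}

set.seed(42)
# Hypothetical population: basketball players averaging 6.6 feet
heights <- rnorm(1000, mean = 6.6, sd = 0.3)

# The same (wrong) prior, applied to 3 observations and then to all 1000
small <- posterior_mean(heights[1:3], prior_mean = 5.5, prior_sd = 0.25, obs_sd = 0.3)
large <- posterior_mean(heights, prior_mean = 5.5, prior_sd = 0.25, obs_sd = 0.3)
c(sample_mean = mean(heights), posterior_n3 = small, posterior_n1000 = large)
```

With three observations the prior pulls the estimate noticeably toward 5.5 feet; with a thousand, the posterior mean is almost indistinguishable from the sample mean.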

Bayesian Inference can seem complicated, but as Brandon Rohrer explains, it's based on straightforward principles of conditional probability. Watch his video below for an elegant explanation of the basics.

If you'd like to try out some Bayesian statistics yourself, R has many packages for Bayesian Inference.

Data Science and Robots Blog: How Bayesian inference works

As Google learned, predicting the spread of influenza, even with mountains of data, is notoriously difficult. Nonetheless, bioinformatician and R user Shirin Glander has created a two-part tutorial about predicting flu deaths with R (part 2 here). The analysis is based on just 136 cases of influenza A H7N9 in China in 2013 (data provided in the outbreaks package), so the intent was not to create a generally predictive model. But by providing all of the R code and graphics, Shirin has created a useful example of real-world predictive modeling with R.

The tutorial covers loading and cleaning the data (including a nice example of using the mice package to impute missing values) and begins with some exploratory data visualizations. I was particularly impressed by the use of density charts (using ggplot2's stat_density2d) to highlight differences in the scatterplots of flu cases ending in death and recovery.

For the statistical analysis, Shirin applies several different kinds of predictive models, including:

- Decision trees (implemented using rpart and visualized using fancyRpartPlot from the rattle package)
- Random Forests (using caret's "rf" training method)
- Elastic-Net Regularized Generalized Linear Models (using caret's "glmnet" training method)
- K-nearest neighbors classification (using caret's "kknn" training method)
- Penalized Discriminant Analysis (using caret's "pda" training method)
- and in Part 2, Extreme gradient boosting using the xgboost package and various preprocessing techniques from the caret package

Due to the limited data size, there's not too much difference between the models: in each case, 13-15 of the 23 cases were classified correctly. Nonetheless, the post provides a useful template for applying several different model types to the same data set, and using the power of the caret package to normalize the data and optimize the models.

(By the way, if you like the style of Shirin's blog, she's also created a useful guide to creating an R blog using Github, JekyllBootstrap, and RMarkdown.)

For Shirin's complete analyses of the flu data, including the R code, follow the links below.

Shirin's playgRound: Can we predict flu deaths with Machine Learning and R? (part 2) (via Thomas Dinsmore's ML/DL blog)

*by Bob Horton, Microsoft Senior Data Scientist*

Receiver Operating Characteristic (ROC) curves are a popular way to visualize the tradeoffs between sensitivity and specificity in a binary classifier. In an earlier post, I described a simple “turtle’s eye view” of these plots: a classifier is used to sort cases in order from most to least likely to be positive, and a Logo-like turtle marches along this string of cases. The turtle considers all the cases it has passed as having tested positive. Depending on their actual class they are either false positives (FP) or true positives (TP); this is equivalent to adjusting a score threshold. When the turtle passes a TP it takes a step upward on the y-axis, and when it passes a FP it takes a step rightward on the x-axis. The step sizes are inversely proportional to the number of actual positives (in the y-direction) or negatives (in the x-direction), so the path always ends at coordinates (1, 1). The result is a plot of true positive rate (TPR, or sensitivity) against false positive rate (FPR, or 1 - specificity), which is all an ROC curve is.
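The turtle's walk can be sketched in a few lines of base R. The `labels` vector here is a made-up example, already sorted from highest to lowest score and with no tied scores, so each case is a single step up or a single step to the right.

```r
# Hypothetical labels, sorted from the highest-scoring case to the lowest (no ties)
labels <- c(1, 1, 0, 1, 0, 1, 0, 0)

# Each positive moves the turtle up by 1/n_pos; each negative moves it right by 1/n_neg.
# The leading zero is the starting point in the lower left corner.
TPR <- c(0, cumsum(labels) / sum(labels))
FPR <- c(0, cumsum(1 - labels) / sum(1 - labels))

plot(FPR, TPR, type = "l", main = "Turtle's eye view of an ROC curve")
data.frame(FPR, TPR)
```

Because the step sizes are 1/n_pos and 1/n_neg, the last row of the data frame is always (1, 1), whatever the labels are.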

Computing the area under the curve is one way to summarize it in a single value; this metric is so common that if data scientists say “area under the curve” or “AUC”, you can generally assume they mean an ROC curve unless otherwise specified.

Probably the most straightforward and intuitive metric for classifier performance is accuracy. Unfortunately, there are circumstances where simple accuracy does not work well. For example, with a disease that only affects 1 in a million people, a completely bogus screening test that always reports “negative” will be 99.9999% accurate. Unlike accuracy, ROC curves are insensitive to class imbalance; the bogus screening test would have an AUC of 0.5, which is like not having a test at all.
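The arithmetic behind the bogus screening test is easy to check. This sketch builds a population of one million with a single affected person and scores everyone identically. The AUC is computed by comparing positive and negative scores directly and giving half credit for ties, which is one standard way to define it; the variable names are just for illustration.

```r
n <- 1e6
actual <- c(TRUE, rep(FALSE, n - 1))   # one affected person in a million
predicted <- rep(FALSE, n)             # bogus test: always reports "negative"

accuracy <- mean(predicted == actual)

# Everyone gets the same score, so every positive/negative pair is a tie
# and earns half credit
scores <- rep(0, n)
pos <- scores[actual]
neg <- scores[!actual]
auc_value <- mean(pos > neg) + mean(pos == neg) / 2

c(accuracy = accuracy, auc = auc_value)
```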

In this post I’ll work through the geometry exercise of computing the area, and develop a concise vectorized function that uses this approach. Then we’ll look at another way of viewing AUC which leads to a probabilistic interpretation.

Let’s start with a simple artificial data set:

```
category <- c(1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0)
prediction <- rev(seq_along(category))
prediction[9:10] <- mean(prediction[9:10])
```

Here the vector `prediction` holds ersatz scores; these normally would be assigned by a classifier, but here we’ve just assigned numbers so that the decreasing order of the scores matches the given order of the category labels. Scores 9 and 10, one representing a positive case and the other a negative case, are replaced by their average so that the data will contain ties without otherwise disturbing the order.

To plot an ROC curve, we’ll need to compute the true positive and false positive rates. In the earlier article we did this using cumulative sums of positives (or negatives) along the sorted binary labels. But here we’ll use the `pROC` package to make it official:

```
library(pROC)
roc_obj <- roc(category, prediction)
auc(roc_obj)
```

`## Area under the curve: 0.825`

```
roc_df <- data.frame(
  TPR=rev(roc_obj$sensitivities),
  FPR=rev(1 - roc_obj$specificities),
  labels=roc_obj$response,
  scores=roc_obj$predictor)
```

The `roc` function returns an object with plot methods and other conveniences, but for our purposes all we want from it is vectors of TPR and FPR values. TPR is the same as sensitivity, and FPR is 1 - specificity (see “confusion matrix” in Wikipedia). Unfortunately, the `roc` function reports these values sorted in the order of ascending score; we want to start in the lower left hand corner, so I reverse the order. According to the `auc` function from the pROC package, our simulated category and prediction data gives an AUC of 0.825; we’ll compare other attempts at computing AUC to this value.

If the ROC curve were a perfect step function, we could find the area under it by adding a set of vertical bars with widths equal to the spaces between points on the FPR axis, and heights equal to the step height on the TPR axis. Since actual ROC curves can also include portions representing sets of values with tied scores which are not square steps, we need to adjust the area for these segments. In the figure below we use green bars to represent the areas under the steps. Adjustments for sets of tied values will be shown as blue rectangles; half the area of each of these blue rectangles is below a sloped segment of the curve.

The function for drawing polygons in base R takes vectors of x and y values; we’ll start by defining a `rectangle` function that uses a simpler and more specialized syntax; it takes x and y coordinates for the lower left corner of the rectangle, and a height and width. It sets some default display options, and passes along any other parameters we might specify (like color) to the `polygon` function.

```
rectangle <- function(x, y, width, height, density=12, angle=-45, ...)
  polygon(c(x,x,x+width,x+width), c(y,y+height,y+height,y),
          density=density, angle=angle, ...)
```

The spaces between TPR (or FPR) values can be calculated by `diff`. Since this results in a vector one position shorter than the original data, we pad each difference vector with a zero at the end:

```
roc_df <- transform(roc_df,
  dFPR = c(diff(FPR), 0),
  dTPR = c(diff(TPR), 0))
```

For this figure, we’ll draw the ROC curve last to place it on top of the other elements, so we start by drawing an empty graph (`type='n'`) spanning from 0 to 1 on each axis. Since the data set has exactly ten positive and ten negative cases, the TPR and FPR values will all be multiples of 1/10, and the points of the ROC curve will all fall on a regularly spaced grid. We draw the grid using light blue horizontal and vertical lines spaced one tenth of a unit apart. Now we can pass the values we calculated above to the rectangle function, using `mapply` (the multivariate version of `sapply`) to iterate over all the cases and draw all the green and blue rectangles. Finally we plot the ROC curve (that is, we plot TPR against FPR) on top of everything in red.

```
plot(0:10/10, 0:10/10, type='n', xlab="FPR", ylab="TPR")
abline(h=0:10/10, col="lightblue")
abline(v=0:10/10, col="lightblue")
with(roc_df, {
  mapply(rectangle, x=FPR, y=0,
         width=dFPR, height=TPR, col="green", lwd=2)
  mapply(rectangle, x=FPR, y=TPR,
         width=dFPR, height=dTPR, col="blue", lwd=2)
  lines(FPR, TPR, type='b', lwd=3, col="red")
})
```

The area under the red curve is all of the green area plus half of the blue area. For adding areas we only care about the height and width of each rectangle, not its (x,y) position. The heights of the green rectangles, which all start from 0, are in the TPR column and widths are in the dFPR column, so the total area of all the green rectangles is the dot product of TPR and dFPR. Note that the vectorized approach computes a rectangle for each data point, even when the height or width is zero (in which case it doesn’t hurt to add them). Similarly, the heights and widths of the blue rectangles (if there are any) are in columns dTPR and dFPR, so their total area is the dot product of these vectors. For regions of the graph that form square steps, one or the other of these values will be zero, so you only get blue rectangles (of non-zero area) if both TPR and FPR change in the same step. Only half the area of each blue rectangle is below its segment of the ROC curve (which is a diagonal of a blue rectangle). Remember the ‘real’ `auc` function gave us an AUC of 0.825, so that is the answer we’re looking for.

```
simple_auc <- function(TPR, FPR){
  # inputs already sorted, best scores first
  dFPR <- c(diff(FPR), 0)
  dTPR <- c(diff(TPR), 0)
  sum(TPR * dFPR) + sum(dTPR * dFPR)/2
}
with(roc_df, simple_auc(TPR, FPR))
```

`## [1] 0.825`
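Adding a green rectangle and half of the blue rectangle above it gives a width times the average of two neighbouring heights, which is the trapezoid rule. As a cross-check that does not depend on `pROC`, the sketch below rebuilds the ROC points directly from the toy data (one point per distinct score, so the tied pair moves together) and applies the trapezoid rule. The variable names are mine, chosen for illustration.

```r
category <- c(1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0)
prediction <- rev(seq_along(category))
prediction[9:10] <- mean(prediction[9:10])

# One ROC point per distinct threshold, highest first; the leading 0 is the origin
thresholds <- sort(unique(prediction), decreasing = TRUE)
TPR <- c(0, sapply(thresholds, function(th) mean(prediction[category == 1] >= th)))
FPR <- c(0, sapply(thresholds, function(th) mean(prediction[category == 0] >= th)))

# Trapezoid rule: width times the average of the two neighbouring heights
trapezoid_auc <- sum(diff(FPR) * (head(TPR, -1) + tail(TPR, -1)) / 2)
trapezoid_auc
```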

Now let’s try a completely different approach. Here we generate a matrix representing all possible combinations of a positive case with a negative case. Each row represents a positive case, in order from the highest-scoring positive case at the bottom to the lowest-scoring positive case at the top. Similarly, the columns represent the negative cases, sorted with the highest scores at the left. Each cell represents a comparison between a particular positive case and a particular negative case, and we mark the cell by whether its positive case has a higher score (or higher overall rank) than its negative case. If your classifier is any good, most of the positive cases will outrank most of the negative cases, and any exceptions will be in the upper left corner, where low-ranking positives are being compared to high-ranking negatives.

```
rank_comparison_auc <- function(labels, scores, plot_image=TRUE, ...){
  score_order <- order(scores, decreasing=TRUE)
  labels <- as.logical(labels[score_order])
  scores <- scores[score_order]
  pos_scores <- scores[labels]
  neg_scores <- scores[!labels]
  n_pos <- sum(labels)
  n_neg <- sum(!labels)
  M <- outer(sum(labels):1, 1:sum(!labels),
             function(i, j) (1 + sign(pos_scores[i] - neg_scores[j]))/2)
  AUC <- mean(M)
  if (plot_image){
    image(t(M[nrow(M):1,]), ...)
    library(pROC)
    with( roc(labels, scores),
          lines((1 + 1/n_neg)*((1 - specificities) - 0.5/n_neg),
                (1 + 1/n_pos)*sensitivities - 0.5/n_pos,
                col="blue", lwd=2, type='b'))
    text(0.5, 0.5, sprintf("AUC = %0.4f", AUC))
  }
  return(AUC)
}
rank_comparison_auc(labels=as.logical(category), scores=prediction)
```

`## [1] 0.825`

The blue line is an ROC curve computed in the conventional manner (slid and stretched a bit to get the coordinates to line up with the corners of the matrix cells). This makes it evident that the ROC curve marks the boundary of the area where the positive cases outrank the negative cases. The AUC can be computed by adjusting the values in the matrix so that cells where the positive case outranks the negative case receive a `1`, cells where the negative case has higher rank receive a `0`, and cells with ties get `0.5`. (Since applying the `sign` function to the difference in scores gives values of 1, -1, and 0 to these cases, we put them in the range we want by adding one and dividing by two.) We find the AUC by averaging these values.

The probabilistic interpretation is that if you randomly choose a positive case and a negative case, the probability that the positive case outranks the negative case according to the classifier is given by the AUC. This is evident from the figure, where the total area of the plot is normalized to one, the cells of the matrix enumerate all possible combinations of positive and negative cases, and the fraction under the curve comprises the cells where the positive case outranks the negative one.

We can use this observation to approximate AUC:

```
auc_probability <- function(labels, scores, N=1e7){
  pos <- sample(scores[labels], N, replace=TRUE)
  neg <- sample(scores[!labels], N, replace=TRUE)
  # sum( (1 + sign(pos - neg))/2)/N # does the same thing
  (sum(pos > neg) + sum(pos == neg)/2) / N # give partial credit for ties
}
auc_probability(as.logical(category), prediction)
```

`## [1] 0.8249989`
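The same pairwise idea also has an exact closed form that needs neither sampling nor the full comparison matrix: the Mann-Whitney U statistic, computed from ranks, divided by the number of positive/negative pairs. The sketch below is a base R illustration on the toy data; `rank_auc` is a name introduced here, not a function from the post or from `pROC`.

```r
category <- c(1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0)
prediction <- rev(seq_along(category))
prediction[9:10] <- mean(prediction[9:10])

labels <- as.logical(category)
n_pos <- sum(labels)
n_neg <- sum(!labels)

# rank() assigns tied scores their average rank, which is what gives ties half credit
r <- rank(prediction)

# Mann-Whitney U: rank sum of the positives minus its smallest possible value
U <- sum(r[labels]) - n_pos * (n_pos + 1) / 2
rank_auc <- U / (n_pos * n_neg)
rank_auc
```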

Now let’s try our new AUC functions on a bigger dataset. I’ll use the simulated dataset from the earlier blog post, where the labels are in the `bad_widget` column of the test set dataframe, and the scores are in a vector called `glm_response_scores`.

This data has no tied scores, so for testing let’s make a modified version that has ties. We’ll plot a black line representing the original data; since each point has a unique score, the ROC curve is a step function. Then we’ll generate tied scores by rounding the score values, and plot the rounded ROC in red. Note that we are using “response” scores from a `glm` model, so they all fall in the range from 0 to 1. When we round these scores to one decimal place, there are 11 possible rounded scores, from 0.0 to 1.0. The AUC values calculated with the `pROC` package are indicated on the figure.

```
roc_full_resolution <- roc(test_set$bad_widget, glm_response_scores)
rounded_scores <- round(glm_response_scores, digits=1)
roc_rounded <- roc(test_set$bad_widget, rounded_scores)
plot(roc_full_resolution, print.auc=TRUE)
```

```
##
## Call:
## roc.default(response = test_set$bad_widget, predictor = glm_response_scores)
##
## Data: glm_response_scores in 59 controls (test_set$bad_widget FALSE) < 66 cases (test_set$bad_widget TRUE).
## Area under the curve: 0.9037
```

```
lines(roc_rounded, col="red", type='b')
text(0.4, 0.43, labels=sprintf("AUC: %0.3f", auc(roc_rounded)), col="red")
```

Now we can try our AUC functions on both sets to check that they can handle both step functions and segments with intermediate slopes.

```
options(digits=22)
set.seed(1234)
results <- data.frame(
  `Full Resolution` = c(
    auc = as.numeric(auc(roc_full_resolution)),
    simple_auc = simple_auc(rev(roc_full_resolution$sensitivities), rev(1 - roc_full_resolution$specificities)),
    rank_comparison_auc = rank_comparison_auc(test_set$bad_widget, glm_response_scores,
                                              main="Full-resolution scores (no ties)"),
    auc_probability = auc_probability(test_set$bad_widget, glm_response_scores)
  ),
  `Rounded Scores` = c(
    auc = as.numeric(auc(roc_rounded)),
    simple_auc = simple_auc(rev(roc_rounded$sensitivities), rev(1 - roc_rounded$specificities)),
    rank_comparison_auc = rank_comparison_auc(test_set$bad_widget, rounded_scores,
                                              main="Rounded scores (ties in all segments)"),
    auc_probability = auc_probability(test_set$bad_widget, rounded_scores)
  )
)
```

| | Full.Resolution | Rounded.Scores |
|---|---|---|
| auc | 0.90369799691833586 | 0.89727786337955828 |
| simple_auc | 0.90369799691833586 | 0.89727786337955828 |
| rank_comparison_auc | 0.90369799691833586 | 0.89727786337955828 |
| auc_probability | 0.90371970000000001 | 0.89716879999999999 |

So we have two new functions that give exactly the same results as the function from the `pROC` package, and our probabilistic function is pretty close. Of course, these functions are intended as demonstrations; you should normally use standard packages such as `pROC` or `ROCR` for actual work.

Here we’ve focused on calculating AUC and understanding the probabilistic interpretation. The probability associated with AUC is somewhat arcane, and is not likely to be exactly what you are looking for in practice (unless you actually will be randomly selecting a positive and a negative case, and you really want to know the probability that the classifier will score the positive case higher.) While AUC gives a single-number summary of classifier performance that is suitable in some circumstances, other metrics are often more appropriate. In many applications, overall behavior of a classifier across all possible score thresholds is of less interest than the behavior in a specific range. For example, in marketing the goal is often to identify a highly enriched target group with a low false positive rate. In other applications it may be more important to clearly identify a group of cases likely to be negative. For example, when pre-screening for a disease or defect you may want to rule out as many cases as you can before you start running expensive confirmatory tests. More generally, evaluation metrics that take into account the actual costs of false positive and false negative errors may be much more appropriate than AUC. If you know these costs, you should probably use them. A good introduction relating ROC curves to economic utility functions, complete with story and characters, is given in the excellent blog post “ML Meets Economics.”
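As a small illustration of cost-based evaluation, the sketch below scans every threshold in the toy data and picks the one with the lowest total cost. The two cost values are invented for the example; in practice they would come from the application.

```r
category <- c(1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0)
prediction <- rev(seq_along(category))
prediction[9:10] <- mean(prediction[9:10])
labels <- as.logical(category)

# Hypothetical costs: a missed positive is five times as costly as a false alarm
cost_fp <- 1
cost_fn <- 5

# Flag every case scoring at or above the threshold, then price the two kinds of error
thresholds <- sort(unique(prediction))
total_cost <- sapply(thresholds, function(th) {
  flagged <- prediction >= th
  cost_fp * sum(flagged & !labels) + cost_fn * sum(!flagged & labels)
})

cost_table <- data.frame(threshold = thresholds, total_cost = total_cost)
cost_table[which.min(cost_table$total_cost), ]
```

Changing the ratio of `cost_fn` to `cost_fp` moves the best threshold along the ROC curve, which is something a single AUC value cannot tell you.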

The glmnetUtils package provides a collection of tools to streamline the process of fitting elastic net models with glmnet. I wrote the package after a couple of projects where I found myself writing the same boilerplate code to convert a data frame into a predictor matrix and a response vector. In addition to providing a formula interface, it also has a function (`cvAlpha.glmnet`) to do crossvalidation for both elastic net parameters α and λ, as well as some utility functions.

The interface that glmnetUtils provides is very much the same as for most modelling functions in R. To fit a model, you provide a formula and data frame. You can also provide any arguments that glmnet will accept. Here is a simple example:

```
mtcarsMod <- glmnet(mpg ~ cyl + disp + hp, data=mtcars)

## Call:
## glmnet.formula(formula = mpg ~ cyl + disp + hp, data = mtcars)
##
## Model fitting options:
##     Sparse model matrix: FALSE
##     Use model.frame: FALSE
##     Alpha: 1
##     Lambda summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 0.03326 0.11690 0.41000 1.02800 1.44100 5.05500
```

Under the hood, glmnetUtils creates a model matrix and response vector, and passes them to the glmnet package to do the actual model fitting. Prediction also works as you'd expect: just pass a data frame containing the new observations, along with any arguments that `predict.glmnet` needs.

```
# least squares regression: get predictions for lambda=1
predict(mtcarsMod, newdata=mtcars, s=1)
```

You may have noticed the options "use model.frame" and "sparse model matrix" in the printed output above. glmnetUtils includes a couple of options to improve performance, especially on datasets that are wide and/or have many categorical (factor) variables.

The standard R method for creating a model matrix out of a data frame uses the `model.frame` function, which has a major disadvantage when it comes to wide data. It generates a *terms* object, which specifies how the original columns of data relate to the columns in the model matrix. This involves creating and storing a (roughly) square matrix of size *p* × *p*, where *p* is the number of variables in the model. When *p* > 10000, which isn't uncommon these days, the terms object can exceed a gigabyte in size. Even if there is enough memory to store the object, processing it can be very slow.
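You can see the roughly square matrix for yourself in base R. The sketch below builds a made-up data frame with 200 predictors and inspects the `"factors"` attribute of its terms object; the data and sizes are arbitrary and only meant to show how the attribute scales with *p*.

```r
set.seed(1)
p <- 200
# Hypothetical wide data: 10 rows, p numeric predictors plus a response y
wide <- as.data.frame(matrix(rnorm(10 * p), nrow = 10))
wide$y <- rnorm(10)

tt <- terms(y ~ ., data = wide)

# Rows are variables (including the response), columns are model terms
dim(attr(tt, "factors"))

# Size of the whole terms object for this modest p
format(object.size(tt), units = "Kb")
```

Doubling `p` roughly quadruples the number of cells in that matrix, which is why the cost becomes a problem for very wide data.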

Another issue with the standard approach is the treatment of factors. Normally, `model.matrix` will turn an *N*-level factor into an indicator matrix with *N*−1 columns, with one column being dropped. This is necessary for unregularised models as fit with `lm` and `glm`, since the full set of *N* columns is linearly dependent. However, this may not be appropriate for a regularised model as fit with glmnet. The regularisation procedure shrinks the coefficients towards zero, which forces the estimated differences from the baseline to be smaller. But this only makes sense if the baseline level was chosen beforehand, or is otherwise meaningful as a default; otherwise it is effectively making the levels more similar to an arbitrarily chosen level.

To deal with these problems, glmnetUtils by default will avoid using `model.frame`, instead building up the model matrix term-by-term. This avoids the memory cost of creating a terms object, and can be much faster than the standard approach. It will also include one column in the model matrix for all levels in a factor; that is, no baseline level is assumed. In this situation, the coefficients represent differences from the overall mean response, and shrinking them to zero *is* meaningful (usually). Machine learners may also recognise this as one-hot encoding.
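The difference between the two encodings is easy to see with base R's `model.matrix`. This is only an illustration of the encodings themselves, using a tiny made-up factor; it is not the code glmnetUtils uses internally.

```r
d <- data.frame(f = factor(c("a", "b", "c", "a", "b", "c")))

# Default treatment contrasts: intercept plus N - 1 indicator columns ("a" is the baseline)
mm_default <- model.matrix(~ f, data = d)

# Full indicator (one-hot) coding: ask for a contrast matrix that keeps every level
mm_onehot <- model.matrix(~ f, data = d,
                          contrasts.arg = list(f = contrasts(d$f, contrasts = FALSE)))

colnames(mm_default)
colnames(mm_onehot)
```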

glmnetUtils can also generate a *sparse* model matrix, using the `sparse.model.matrix` function provided in the Matrix package. This works exactly the same as a regular model matrix, but takes up significantly less memory if many of its entries are zero. A scenario where this is the case would be where many of the predictors are factors, each with a large number of levels.
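To get a feel for the savings, here is a rough comparison that calls `model.matrix` and `sparse.model.matrix` directly on a single high-cardinality factor. The data is simulated purely for illustration, and the exact byte counts will vary with your R and Matrix versions.

```r
library(Matrix)

set.seed(1)
# Hypothetical data: one factor with 200 levels observed 5000 times
d <- data.frame(f = factor(sample(sprintf("level%03d", 1:200), 5000, replace = TRUE)))

dense  <- model.matrix(~ f - 1, data = d)          # ordinary dense matrix
sparse <- sparse.model.matrix(~ f - 1, data = d)   # stores only the non-zero entries

c(dense_bytes = as.numeric(object.size(dense)),
  sparse_bytes = as.numeric(object.size(sparse)))
```

Each row has exactly one non-zero entry, so the dense version spends almost all of its memory on zeros.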

One piece missing from the standard glmnet package is a way of choosing α, the elastic net mixing parameter, similar to how `cv.glmnet` chooses λ, the shrinkage parameter. To fix this, glmnetUtils provides the `cvAlpha.glmnet` function, which uses crossvalidation to examine the impact on the model of changing α and λ. The interface is the same as for the other functions:

```
# Leukemia dataset from Trevor Hastie's website:
# http://web.stanford.edu/~hastie/glmnet/glmnetData/Leukemia.RData
load("~/Leukemia.rdata")
leuk <- do.call(data.frame, Leukemia)
cvAlpha.glmnet(y ~ ., data=leuk, family="binomial")

## Call:
## cvAlpha.glmnet.formula(formula = y ~ ., data = leuk, family = "binomial")
##
## Model fitting options:
##     Sparse model matrix: FALSE
##     Use model.frame: FALSE
##     Alpha values: 0 0.001 0.008 0.027 0.064 0.125 0.216 0.343 0.512 0.729 1
##     Number of crossvalidation folds for lambda: 10
```

`cvAlpha.glmnet` uses the algorithm described in the help for `cv.glmnet`, which is to fix the distribution of observations across folds and then call `cv.glmnet` in a loop with different values of α. Optionally, you can parallelise this outer loop by setting the `outerParallel` argument to a non-NULL value. Currently, glmnetUtils supports the following methods of parallelisation:

- Via `parLapply` in the parallel package. To use this, set `outerParallel` to a valid cluster object created by `makeCluster`.
- Via `rxExec` as supplied by Microsoft R Server’s RevoScaleR package. To use this, set `outerParallel` to a valid compute context created by `RxComputeContext`, or a character string specifying such a context.
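The serial version of that outer loop can be sketched with plain glmnet, for readers who want to see what is going on. This is a minimal illustration on simulated data, not the actual implementation of `cvAlpha.glmnet`; the three α values and the data-generating model are arbitrary choices made for the example.

```r
library(glmnet)

set.seed(123)
# Simulated data (an assumption for illustration): 100 observations, 20 predictors
x <- matrix(rnorm(100 * 20), nrow = 100)
y <- x[, 1] - 2 * x[, 2] + rnorm(100)

# Fix the fold assignment once so that every alpha is scored on the same splits
foldid <- sample(rep(1:10, length.out = nrow(x)))
alphas <- c(0, 0.5, 1)

cv_fits <- lapply(alphas, function(a) cv.glmnet(x, y, alpha = a, foldid = foldid))

# Best lambda and smallest cross-validated error found for each alpha
data.frame(alpha = alphas,
           lambda_min = sapply(cv_fits, function(fit) fit$lambda.min),
           cv_error = sapply(cv_fits, function(fit) min(fit$cvm)))
```

Reusing one `foldid` vector is what makes the errors comparable across α; with fresh random folds for each fit, some of the differences would just be noise from the split.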

The glmnetUtils package is a way to improve quality of life for users of glmnet. As with many R packages, it’s always under development; you can get the latest version from my GitHub repo. The easiest way to install it is via devtools:

```
library(devtools)
install_github("hong-revo/glmnetUtils")
```

A more detailed version of this post can also be found at the package vignette. If you find a bug, or if you want to suggest improvements to the package, please feel free to contact me at hongooi@microsoft.com.

by Srini Kumar, Director of Data Science at Microsoft

We tend to think of R and other such ML tools only in the context of the workplace, to do “weighty” things aimed at saving millions. A little judicious use of R may help us hugely in our personal lives too. The ideas of regression, classification trees etc. can be powerful tools in valuation, as I found out.

Recently, I was in a five-car accident on the infamous 101 in the San Francisco bay area. Luckily, none of us required an ambulance and all of us walked away. However, my car was, in insurance parlance, a "total loss". I was left wondering what I should expect as a check from my insurance company. I found the data I needed on the web, and used R to very quickly come up with a model to value the car. While its being astonishingly accurate was probably an exception, its placing the value in the ballpark illustrates how easy it is to use R for a quick yet reasonably accurate analysis.

First off, we need to recognize that our expected value should not be the blue book value of the car. It should be the amount we would have to pay (discounting the taxes and other non-discretionary expenses) to get a similar used (or, at the higher end, "pre-owned") car from a car dealer. Therefore, I searched for all the cars of that model and year available in the United States, and got a list of 70 cars from all over the country.

The only part that involved some drudgery was copying the location, mileage and the asking price for each car from that PDF file and putting it in a spreadsheet. A reasonable guess is that the car's value depends mostly on its mileage, and a reasonable assumption (which turned out to be a good one) is that it also depends on where the car is available.

The rest of the analysis was quite easy. Having read in the tab-delimited format data, I checked the mileage and price for different states by way of initial exploration. As an aside, I tend to use and recommend the tab separated format over the comma separated format always. Text fields rarely contain tabs, but contain commas far more frequently.

As we can see, there is too wide a spread among the states, and too little data from mine. Anyway, the simple linear regression on mileage and states yielded a model, and its prediction was $23,122.47. I had to guess the mileage on my car and guessed it reasonably accurately to about 45,000. After doing this, especially since the data points from my state were too few, I tried a decision tree to check for the dependence of price on state, and got this:

The decision tree algorithm evidently did not sense the state to be a factor to determine the price.

Armed with this knowledge, I waited to talk with the insurance company. To their credit, what they offered me was less than 0.4 percent off the predicted value! To be sure, there were unknowns. I had guessed my mileage, my car had been loaded with options, and I did not scrape any options from the publicly available data. An additional regression I did completely ignoring the state and focusing only on the mileage would have given me an error of a little over 3 percent. However, it was interesting that the modeling could predict the price to within about 4 percent, setting very well the stage for negotiations if it came to it, particularly since the alternative is to make subjective guesses.
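For readers who want to see the shape of such an analysis, here is a minimal sketch of the two regressions. The listings are simulated, so every number, state code and column name below is made up and stands in for the data I collected; only the modelling steps mirror what is described above.

```r
set.seed(2016)
# Simulated listings standing in for the collected data (all values are made up)
n <- 70
cars <- data.frame(
  state = factor(sample(c("CA", "TX", "FL", "NY"), n, replace = TRUE)),
  mileage = round(runif(n, 20000, 80000))
)
cars$price <- 30000 - 0.15 * cars$mileage + rnorm(n, sd = 1000)

# Regression on mileage and state, and a second one ignoring the state
fit_full <- lm(price ~ mileage + state, data = cars)
fit_mileage <- lm(price ~ mileage, data = cars)

# Predicted value for a car with a guessed mileage of 45,000
my_car <- data.frame(mileage = 45000,
                     state = factor("CA", levels = levels(cars$state)))
c(with_state = unname(predict(fit_full, newdata = my_car)),
  mileage_only = unname(predict(fit_mileage, newdata = my_car)))
```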

So, the next time you need to value something, provided some data on it is available, you can, in less than an hour, come up with objective and defensible estimates to help you negotiate. Here is the R code, if you would like to try it yourself. The only thing is that you need to be able to find data on that item or a comparable one, which is easy to do.

The results of many scientific papers are wrong. There are many reasons for this, including p-hacking, publication bias, and the general inability to replicate results. But there's another, more mundane cause: incorrect calculation of p-values in statistical tests. This could be caused by simple transcription errors when plugging numbers into a statistical tool, incorrect rounding, or misapplication of the test itself (say, applying a two-sided test when a one-sided p-value is appropriate). Such errors should be picked up in the peer review process, but given that even expert statisticians sometimes struggle to explain p-values, it's not surprising that some errors get through.

That's why Michèle B. Nuijten, a PhD student at Tilburg University, created the R package statcheck. Given a paper to be published in a psychology journal, statcheck searches for statistical results from \(t\), \(F\), \(r\), \(\chi^2\), and \(Z\) tests, and compares the published p-value to a value calculated by R. This is possible only because the American Psychological Association Style Guide has a very specific format for reporting statistical results, listing the p-value next to the reported test statistic. Statcheck also attempts to detect if the surrounding language mentions a "one-sided" or "one-tailed" test and calculates the p-value in R accordingly (although this process isn't perfect). Anyone can use statcheck by uploading a PDF or HTML version of their paper to the statcheck web application, or by using the statcheck function within R directly.
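The core of the consistency check can be sketched in base R. The reported result below is hypothetical, and this is a simplified illustration of the idea rather than statcheck's own code, which also handles extraction from text, other test types, and rounding of the test statistic.

```r
# Hypothetical reported result in APA style: "t(28) = 2.20, p = .04"
t_value <- 2.20
df <- 28
reported_p <- 0.04

# Two-sided p-value implied by the reported test statistic and degrees of freedom
recomputed_p <- 2 * pt(-abs(t_value), df)

# Treat the result as consistent if the recomputed value rounds to the reported one
consistent <- isTRUE(all.equal(round(recomputed_p, 2), reported_p))

recomputed_p
consistent
```

For a result flagged as one-sided in the surrounding text, the factor of 2 would be dropped.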

Nuijten recounts the origins and development of statcheck in an interesting article in Retraction Watch. One major surprise: when they applied statcheck to p-values reported in eight major psychology journals from 1985 to 2013:

Half of the papers in psychology contain at least one statistical reporting inconsistency, and one in eight papers contain an inconsistency that might have affected the statistical conclusion.

Since then, they've further automated statcheck by automatically sharing the results of its analyses for 50,000 papers at PubPeer. Not everyone was pleased by the notifications (a former president of the Association for Psychological Science called it 'methodological terrorism'), but the process did reveal more inconsistencies in published papers.

For more on statcheck, check out its website at the link below.

Michèle B. Nuijten: R package “statcheck”: Extract statistics from articles and recompute p values (Epskamp & Nuijten, 2016)

by Joseph Rickert

My guess is that a good many statistics students first encounter the bivariate Normal distribution as one or two hastily covered pages in an introductory text book, and then don't think much about it again until someone asks them to generate two random variables with a given correlation structure. Fortunately for R users, a little searching on the internet will turn up several nice tutorials with R code explaining various aspects of the bivariate Normal. For this post, I have gathered together a few examples and tweaked the code a little to make comparisons easier.

Here are five different ways to simulate random samples from a bivariate Normal distribution with a given mean and covariance matrix.

To set up for the simulations this first block of code defines N, the number of random samples to simulate, the means of the random variables, and and the covariance matrix. It also provides a small function for drawing confidence ellipses on the simulated data.

library(mixtools) #for ellipse

N <- 200 # Number of random samples

set.seed(123)

# Target parameters for univariate normal distributions

rho <- -0.6

mu1 <- 1; s1 <- 2

mu2 <- 1; s2 <- 8

# Parameters for bivariate normal distribution

mu <- c(mu1,mu2) # Mean

sigma <- matrix(c(s1^2, s1*s2*rho, s1*s2*rho, s2^2),

2) # Covariance matrix

# Function to draw ellipse for bivariate normal data

ellipse_bvn <- function(bvn, alpha){

Xbar <- apply(bvn,2,mean)

S <- cov(bvn)

ellipse(Xbar, S, alpha = alpha, col="red")

}

The first method, the way to go if you just want to get on with it, is to use the mvrnorm() function from the MASS package.

library(MASS)

bvn1 <- mvrnorm(N, mu = mu, Sigma = sigma ) # from MASS package

colnames(bvn1) <- c("bvn1_X1","bvn1_X2")

It takes so little code to do the simulation it might be possible to tweet in a homework assignment.

A look at the source code for mvrnorm() shows that it uses eigenvectors to generate the random samples. The documentation for the function states that this method was selected because it is stabler than the alternative of using a Cholesky decomposition, which might be faster.

For the second method, let's go ahead and directly generate bivariate Normal random variates with the Cholesky decomposition. Remember that the Cholesky decomposition of sigma (a positive definite matrix) yields a matrix M such that M times its transpose gives sigma back again. Multiplying M by a matrix of standard random Normal variates and adding the desired mean gives a matrix of the desired random samples. A lecture from Colin Rundel covers some of the theory.
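The eigenvector approach can be sketched in a few lines of base R. This is an illustration of the general technique, not the MASS source code: it builds a matrix A from the eigendecomposition of the covariance matrix so that A times its transpose equals sigma, then transforms standard Normal draws. The names `A`, `Z`, `n` and `bvn_eig` are made up for this example, and the target parameters are repeated so that the block runs on its own.

```r
set.seed(1)
mu    <- c(1, 1)
sigma <- matrix(c(4, -9.6, -9.6, 64), 2)   # same target covariance as above
n     <- 1000

e <- eigen(sigma, symmetric = TRUE)
# A = V diag(sqrt(lambda)), so that A %*% t(A) reproduces sigma
A <- e$vectors %*% diag(sqrt(e$values))

Z <- matrix(rnorm(2 * n), nrow = 2)        # 2 rows, n columns of N(0,1) draws
bvn_eig <- t(mu + A %*% Z)                 # n rows, 2 columns

round(A %*% t(A), 6)                       # compare with sigma
```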

M <- t(chol(sigma))

# M %*% t(M)

Z <- matrix(rnorm(2*N),2,N) # 2 rows, N columns

bvn2 <- t(M %*% Z) + matrix(rep(mu,N), byrow=TRUE,ncol=2)

colnames(bvn2) <- c("bvn2_X1","bvn2_X2")
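For the bivariate case the Cholesky factor has a simple closed form, which the following sketch checks numerically. The closed-form matrix `M_closed` is written out here for illustration; the parameter values are the ones defined in the setup code, repeated so that the block is self-contained.

```r
s1 <- 2; s2 <- 8; rho <- -0.6
sigma <- matrix(c(s1^2, s1*s2*rho, s1*s2*rho, s2^2), 2)

# chol() returns the upper-triangular factor, so transpose to get lower-triangular M
M <- t(chol(sigma))

# Closed form of the lower-triangular factor for a 2 x 2 covariance matrix
M_closed <- matrix(c(s1,       0,
                     rho * s2, s2 * sqrt(1 - rho^2)),
                   nrow = 2, byrow = TRUE)

all.equal(M, M_closed)          # the two factors agree
all.equal(M %*% t(M), sigma)    # and M times its transpose gives sigma back
```

The second row of M contains rho * s2 and s2 * sqrt(1 - rho^2), which are the same coefficients that appear in the conditional-distribution method described next.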

For the third method we make use of a special property of the bivariate Normal that is discussed in almost all of those elementary textbooks. If \(X_1\) and \(X_2\) are jointly Normal random variables with means \(m_1, m_2\), standard deviations \(s_1, s_2\), and correlation \(r\), then the conditional distribution of \(X_2\) given \(X_1\) is itself Normal with mean \(m_2 + r(s_2/s_1)(X_1 - m_1)\) and variance \((1 - r^2)s_2^2\).

Hence, a sample from a bivariate Normal distribution can be simulated by first simulating a point from the marginal distribution of one of the random variables and then simulating from the second random variable conditioned on the first. A brief proof of the underlying theorem is available here.

rbvn <- function(n, mu1, s1, mu2, s2, rho) # argument names now match the names used in the body

{

X1 <- rnorm(n, mu1, s1)

X2 <- rnorm(n, mu2 + (s2/s1) * rho *

(X1 - mu1), sqrt((1 - rho^2)*s2^2))

cbind(X1, X2)

}

bvn3 <- rbvn(N,mu1,s1,mu2,s2,rho)

colnames(bvn3) <- c("bvn3_X1","bvn3_X2")
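As a quick check that this two-step construction gives the right joint distribution, the moments can be worked out directly. The derivation below uses the textbook symbols \(m\), \(s\) and \(r\) for what the code calls mu, s and rho, and introduces \(\varepsilon\) as a standard Normal variable independent of \(X_1\); it is a sketch of the standard argument, not a full proof.

```latex
\[
\begin{aligned}
X_2 &= m_2 + r\,\frac{s_2}{s_1}\,(X_1 - m_1) + s_2\sqrt{1 - r^2}\;\varepsilon,
  \qquad \varepsilon \sim N(0,1) \text{ independent of } X_1,\\
E[X_2] &= m_2,\\
\operatorname{Var}(X_2) &= r^2\,\frac{s_2^2}{s_1^2}\,s_1^2 + (1 - r^2)\,s_2^2 = s_2^2,\\
\operatorname{Cov}(X_1, X_2) &= r\,\frac{s_2}{s_1}\,\operatorname{Var}(X_1) = r\,s_1 s_2 .
\end{aligned}
\]
```

So the simulated pair has the target means, variances and covariance, and because \(X_2\) is a linear combination of independent Normal variables, the pair is jointly Normal.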

The fourth method, my favorite, comes from Professor Darren Wilkinson's Gibbs Sampler tutorial. This is a very nice idea: using the familiar bivariate Normal distribution to illustrate the basics of the Gibbs Sampling Algorithm. Note that this looks very much like the previous method, except that now we are alternately sampling from the full conditional distributions.

gibbs<-function (n, mu1, s1, mu2, s2, rho)

{

mat <- matrix(ncol = 2, nrow = n)

x <- 0

y <- 0

mat[1, ] <- c(x, y)

for (i in 2:n) {

x <- rnorm(1, mu1 +

(s1/s2) * rho * (y - mu2), sqrt((1 - rho^2)*s1^2))

y <- rnorm(1, mu2 +

(s2/s1) * rho * (x - mu1), sqrt((1 - rho^2)*s2^2))

mat[i, ] <- c(x, y)

}

mat

}

bvn4 <- gibbs(N,mu1,s1,mu2,s2,rho)

colnames(bvn4) <- c("bvn4_X1","bvn4_X2")
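Unlike the other methods, successive Gibbs draws are not independent. For this particular sampler, substituting one conditional update into the other shows that each new x equals rho^2 times the previous x (centered at its mean) plus fresh noise, so the lag-1 autocorrelation of the x chain should be about rho^2. The sketch below checks this empirically; `gibbs_chain()` is a restatement of the sampler above under a different name so that the block runs on its own, and the chain length of 20,000 is an arbitrary choice.

```r
# Same Gibbs updates as in the post, restated so this block is self-contained
gibbs_chain <- function(n, mu1, s1, mu2, s2, rho) {
  mat <- matrix(NA_real_, nrow = n, ncol = 2)
  x <- 0; y <- 0
  mat[1, ] <- c(x, y)
  for (i in 2:n) {
    x <- rnorm(1, mu1 + (s1 / s2) * rho * (y - mu2), sqrt((1 - rho^2) * s1^2))
    y <- rnorm(1, mu2 + (s2 / s1) * rho * (x - mu1), sqrt((1 - rho^2) * s2^2))
    mat[i, ] <- c(x, y)
  }
  mat
}

set.seed(7)
rho   <- -0.6
chain <- gibbs_chain(20000, mu1 = 1, s1 = 2, mu2 = 1, s2 = 8, rho = rho)

# Lag-1 autocorrelation of the x chain versus the theoretical value rho^2
lag1_x <- acf(chain[, 1], lag.max = 1, plot = FALSE)$acf[2]
c(observed = lag1_x, theory = rho^2)
```

With rho = -0.6 the dependence between successive draws is mild; as the absolute value of rho approaches 1, the chain mixes much more slowly.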

The fifth and final way uses the rmvnorm() function from the mvtnorm package with the singular value decomposition method selected. The functions in this package are overkill for what we are doing here, but mvtnorm is probably the package you would want to use if you are calculating probabilities from high dimensional multivariate distributions. It implements numerical methods for carefully calculating the high dimensional integrals involved that are based on some papers by Professor Alan Genz dating from the early '90s. These methods are briefly explained in the package vignette.

library (mvtnorm)

bvn5 <- mvtnorm::rmvnorm(N,mu,sigma, method="svd")

colnames(bvn5) <- c("bvn5_X1","bvn5_X2")

Note that I have used the :: operator here to make sure that R uses the rmvnorm() function from the mvtnorm package. There is also a rmvnorm() function in the mixtools package that I used to get the ellipse function. Loading the packages in the wrong order could lead to the rookie mistake of having the function you want inadvertently masked by the other package's version.

Next, we plot the results of drawing just N random samples (200 here) for each method. This allows us to see how the algorithms spread data over the sample space as they are just getting started.
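The masking behavior is easy to reproduce without installing anything. In this small sketch a user-defined function deliberately takes the name of a function from the stats package; the function body and the variable names are made up for the demonstration.

```r
# A global definition named sd() masks stats::sd() on the search path
sd <- function(x) "masked version"

masked_result  <- sd(c(1, 2, 3))          # calls the global definition
package_result <- stats::sd(c(1, 2, 3))   # :: reaches the package's function

rm(sd)                                    # remove the mask again
```

The same thing happens between packages: the most recently attached package wins for an unqualified name, while `pkg::fun()` always refers to the intended function. Calling `conflicts(detail = TRUE)` lists the names that are currently defined in more than one place on the search path.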

bvn <- list(bvn1,bvn2,bvn3,bvn4,bvn5)

par(mfrow=c(3,2))

plot(bvn1, xlab="X1",ylab="X2",main= "All Samples")

for(i in 2:5){

points(bvn[[i]],col=i)

}

for(i in 1:5){

item <- paste("bvn",i,sep="")

plot(bvn[[i]],xlab="X1",ylab="X2",main=item, col=i)

ellipse_bvn(bvn[[i]],.5)

ellipse_bvn(bvn[[i]],.05)

}

par(mfrow=c(1,1))

The first plot shows all of the random samples from the five methods, color coded by the method with which they were generated. The remaining plots show the samples generated by each method. In each of these plots the ellipses mark the 0.5 and 0.95 probability regions, i.e. the area within the ellipses should contain 50% and 95% of the points respectively. Note that bvn4, which uses the Gibbs sampling algorithm, looks like all of the rest. In most use cases for the Gibbs sampler it takes the algorithm some time to converge to the target distribution. In our case, we start out with a pretty good guess.

Finally, a word about accuracy: nice coverage of the sample space is not sufficient to produce accurate results. A little experimentation will show that, for all of the methods outlined above, regularly achieving a sample covariance matrix that is close to the target, sigma, requires something on the order of 10,000 samples, as is illustrated below.
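The coverage claim can also be checked numerically rather than by eye. For bivariate Normal data the squared Mahalanobis distance from the mean follows a chi-squared distribution with 2 degrees of freedom, so the fraction of points inside each probability region is easy to compute. The sketch below uses the true mean and covariance (the ellipses in the plots use the sample estimates instead), draws its own sample with the Cholesky method, and uses an arbitrary sample size of 5,000.

```r
set.seed(42)
mu    <- c(1, 1)
sigma <- matrix(c(4, -9.6, -9.6, 64), 2)
n     <- 5000

# Draw n bivariate Normal points with the Cholesky method
x <- t(mu + t(chol(sigma)) %*% matrix(rnorm(2 * n), nrow = 2))

# Squared Mahalanobis distances are chi-squared with 2 degrees of freedom
d2 <- mahalanobis(x, center = mu, cov = sigma)

coverage <- c(inside_50 = mean(d2 <= qchisq(0.50, df = 2)),
              inside_95 = mean(d2 <= qchisq(0.95, df = 2)))
coverage
```

The two fractions should come out close to 0.50 and 0.95, up to sampling noise.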

> sigma

[,1] [,2]

[1,] 4.0 -9.6

[2,] -9.6 64.0

for(i in 1:5){

print(round(cov(bvn[[i]]),1))

}

bvn1_X1 bvn1_X2

bvn1_X1 4.0 -9.5

bvn1_X2 -9.5 63.8

bvn2_X1 bvn2_X2

bvn2_X1 3.9 -9.5

bvn2_X2 -9.5 64.5

bvn3_X1 bvn3_X2

bvn3_X1 4.1 -9.8

bvn3_X2 -9.8 63.7

bvn4_X1 bvn4_X2

bvn4_X1 4.0 -9.7

bvn4_X2 -9.7 64.6

bvn5_X1 bvn5_X2

bvn5_X1 4.0 -9.6

bvn5_X2 -9.6 65.3
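The "little experimentation" mentioned above might look something like the following sketch, which measures how far the sample covariance matrix is from sigma as the sample size grows. It uses the Cholesky method only; the helper `cov_error()`, the choice of sample sizes and the 20 replications per size are all assumptions made for this illustration.

```r
set.seed(2016)
sigma <- matrix(c(4, -9.6, -9.6, 64), 2)
M     <- t(chol(sigma))

# Largest absolute entry-wise error of the sample covariance for one draw of size n
cov_error <- function(n) {
  x <- t(M %*% matrix(rnorm(2 * n), nrow = 2))
  max(abs(cov(x) - sigma))
}

sizes  <- c(100, 1000, 10000, 100000)
# Average the error over 20 independent draws at each sample size
errors <- sapply(sizes, function(n) mean(replicate(20, cov_error(n))))

data.frame(n = sizes, mean_max_abs_error = round(errors, 2))
```

The error shrinks roughly in proportion to one over the square root of the sample size, which is why a few hundred points can cover the sample space nicely and still give a noticeably off estimate of the largest variance.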

Many people coming to R for the first time find it disconcerting to realize that there are several ways to do some fundamental calculation in R. My take is that rather than being a point of frustration, having multiple options indicates the richness of the R language. A close look at the package documentation will often show that yet another method to do something is a response to some subtle need that was not previously addressed. Enjoy the diversity!

As anyone who has tried Pokémon Go recently is probably aware, Pokémon come in different types. A Pokémon's type affects where and when it appears, and the types of attacks it is vulnerable to. Some types, like Normal, Water and Grass, are common; others, like Fairy and Dragon, are rare. Many Pokémon have two types.

To get a sense of the distribution of Pokémon types, Joshua Kunst used R to download data from the Pokémon API and created a treemap of all the Pokémon types (and for those with more than 1 type, the secondary type). Joshua's original used the 800+ Pokémon from the modern universe, but I used his R code to recreate the map for the 151 original Pokémon used in Pokémon Go.

Pokémon have many other attributes as well, including weight, height, attack and defense ratings, hit points, and speed. It's hard to visualize so many variables in a 2-dimensional chart, so Joshua used a technique called t-Distributed Stochastic Neighbor Embedding (implemented in the tsne package for R) to cluster similar Pokémon in a two-dimensional chart, and used R's image-handling capabilities to include avatars for each of the Pokémon.

This chart, which includes modern Pokémon along with the 151 originals in Pokémon Go, is colored according to each Pokémon's primary type. As you can see, the t-SNE algorithm is super effective at clustering Pokémon according to type.

For more details on Joshua's analysis, including interactive versions of these charts and the R code that created them, follow the link below.

Joshua Kunst: Pokémon: Visualize 'em all! (via Matthew Bashton)

by Joseph Rickert

Just about two and a half years ago I wrote about some resources for doing Bayesian statistics in R. Motivated by the tutorial Modern Bayesian Tools for Time Series Analysis by Harte and Weylandt that I attended at R/Finance last month, and the upcoming tutorial An Introduction to Bayesian Inference using R Interfaces to Stan that Ben Goodrich is going to give at useR!, I thought I'd look into what's new. Well, Stan is what's new! Yes, Stan has been under development and available for some time. But somehow, while I wasn't paying close attention, two things happened: (1) the rstan package evolved to make the mechanics of doing Bayesian analysis in R really easy and (2) the Stan team produced and/or organized an amazing amount of documentation.

My impressions of doing Bayesian analysis in R were set in the WinBUGS era. The separate WinBUGS installation was always tricky, and then moving between the BRugs and R2WinBUGS packages presented some additional challenges. My recent Stan experience was nothing like this. I had everything up and running in just a few minutes. The directions for getting started with rstan are clear and explicit about making sure that you have the right tool chain in place for your platform. Since I am running R 3.3.0 on Windows 10 I installed Rtools34. This went quickly and as expected except that C:\Rtools\gcc-4.x-y\bin did not show up in my path variable. Not a big deal: I used the menus in the Windows System Properties box to edit the Path statement by hand. After this, rstan installed like any other R package and I was able to run the 8schools example from the package vignette. The following 10 minute video by Ehsan Karim takes you through the install process and the vignette example.

The Stan documentation includes four major components: (1) The Stan Language Manual, (2) Examples of fully worked out problems, (3) Contributed Case Studies and (4) both slides and video tutorials. This is an incredibly rich cache of resources that makes a very credible case for the ambitious project of teaching people with some R experience both Bayesian Statistics and Stan at the same time. The "trick" here is that the documentation operates at multiple levels of sophistication with entry points for students with different backgrounds. For example, a person with some R and the modest statistics background required for approaching Gelman and Hill's extraordinary text, Data Analysis Using Regression and Multilevel/Hierarchical Models, can immediately begin running rstan code for the book's examples. To run the rstan version of the example in section 5.1, Logistic Regression with One Predictor, with no changes, a student needs only to copy the R scripts and data into her local environment. In this case, she would need the R script 5._LogisticRegressionWithOnePredictor.R, the data nes1992_vote.data.R, and the Stan code nes_logit.stan. The Stan code for this simple model is about as straightforward as it gets: variable declarations, parameter identification and the model itself.

data {
  int<lower=0> N;
  vector[N] income;
  int<lower=0,upper=1> vote[N];
}
parameters {
  vector[2] beta;
}
model {
  vote ~ bernoulli_logit(beta[1] + beta[2] * income);
}
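Because the Stan program above declares no priors for beta, the coefficients get flat priors, and the posterior mode coincides with the maximum likelihood estimate that base R's glm() computes. The sketch below illustrates that correspondence without requiring Stan. It does not use the NES data from the book: the income categories, the "true" coefficients and the sample size are simulated stand-ins invented for this example.

```r
set.seed(1992)

# Simulated stand-in data (not the nes1992_vote data used in the book)
N         <- 500
income    <- sample(1:5, N, replace = TRUE)   # made-up income categories
beta_true <- c(-1.4, 0.33)                    # made-up intercept and slope
vote      <- rbinom(N, 1, plogis(beta_true[1] + beta_true[2] * income))

# Maximum likelihood fit of the same logistic regression
fit <- glm(vote ~ income, family = binomial(link = "logit"))
coef(fit)

# The same inputs, arranged as the named list that the Stan data block expects
stan_data <- list(N = N, income = income, vote = vote)
```

With rstan installed, a list like `stan_data` is what would be passed through the `data` argument of `rstan::stan()`, together with `file = "nes_logit.stan"`; the posterior means of beta should then land close to `coef(fit)`.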

Running the script will produce the iconic logistic regression plot:

I'll wind down by curbing my enthusiasm just a little by pointing out that Stan is not the only game in town. JAGS is a popular alternative, and there is plenty that can be done with unaugmented R code alone as the Bayesian Inference Task View makes abundantly clear.

If you are a book person and new to Bayesian statistics, I highly recommend Bayesian Essentials with R by Jean-Michel Marin and Christian Robert. The authors provide a compact introduction to Bayesian statistics that is backed up with numerous R examples. Also, the new book by Richard McElreath, Statistical Rethinking: A Bayesian Course with Examples in R and Stan looks like it is going to be an outstanding read. The online supplements to the book are certainly worth a look.

Finally, if you are a Bayesian or are thinking about becoming one and you are going to useR!, be sure to catch the following talks:

- Bayesian analysis of generalized linear mixed models with JAGS by Martyn Plummer
- bamdit: An R Package for Bayesian meta-Analysis of diagnostic test data by Pablo Emilio Verde
- Fitting complex Bayesian models with R-INLA and MCMC by Virgilio Gómez-Rubio
- bayesboot: An R package for easy Bayesian bootstrapping by Rasmus Arnling Bååth
- An Introduction to Bayesian Inference using R Interfaces to Stan by Ben Goodrich
- DiLeMMa - Distributed Learning with Markov Chain Monte Carlo Algorithms Using the ROAR Package by Ali Zaidi