Bob Horton

Sr Data Scientist, Microsoft

Wikipedia describes Simpson’s paradox as “a trend that appears in different groups of data but disappears or reverses when these groups are combined.” Here is the figure from the top of that article (you can click on the image in Wikipedia then follow the “more details” link to find the R code used to generate it. There is a lot of R in Wikipedia).

I rearranged it a bit to put the values in a dataframe, to make it a bit easier to think of the “color” column as a confounding variable:

x | y | color |
---|---|---|
1 | 6 | 1 |
2 | 7 | 1 |
3 | 8 | 1 |
4 | 9 | 1 |
8 | 1 | 2 |
9 | 2 | 2 |
10 | 3 | 2 |
11 | 4 | 2 |
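The model calls that follow assume a data frame named `simpson_data`. The original construction is not shown here, so this is just a minimal sketch that types the table in by hand:

```r
# Rebuild the table above as a data frame; the name simpson_data
# matches the lm() calls used in this post.
simpson_data <- data.frame(
  x     = c(1, 2, 3, 4, 8, 9, 10, 11),
  y     = c(6, 7, 8, 9, 1, 2, 3, 4),
  color = c(1, 1, 1, 1, 2, 2, 2, 2)
)
str(simpson_data)
```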

If we do not consider this confounder, we find that the coefficient of x is negative (the dashed line in the figure above):

`coefficients(lm(y ~ x, data=simpson_data))`

```
## (Intercept) x
## 8.3333333 -0.5555556
```

If we do take the confounder into account, we see the coefficient of x is positive:

`coefficients(lm(y ~ x + color, data=simpson_data))`

```
## (Intercept) x color
## 17 1 -12
```

In his book *Causality*, Judea Pearl makes a more sweeping statement regarding Simpson’s paradox: “Any statistical relationship between two variables may be reversed by including additional factors in the analysis.” [Pearl2009]

That sounds fun; let’s try it.

First we’ll make variables `x` and `y` with a simple linear relationship. I’ll use the same slopes and intercepts as in the Wikipedia figure, both to show the parallel and to demonstrate the incredible cosmic power I have to bend coefficients to my will.

```
set.seed(1)
N <- 3000
x <- rnorm(N)
m <- -0.5555556
b <- 8.3333333
y <- m * x + b + rnorm(length(x))
plot(x, y, col="gray", pch=20, asp=1)
fit <- lm(y ~ x)
abline(fit, lty=2, lwd=2)
```

When we look at the slope of the regression line determined by fitting the model, it is almost exactly equal to the constant `m` that we used to determine `y`.

`coefficients(fit)`

```
## (Intercept) x
## 8.3284021 -0.5358175
```

We get out what we put in; the coefficient of x is essentially the slope we originally gave `y` when we generated it (-0.5555556). This is the ‘effect’ of `x`, in that a one unit increase in `x` apparently changes `y` by this amount (here, a decrease of a bit more than half a unit).

Now think about how to concoct a confounding variable to reverse the coefficient of `x`. This figure shows one way to approach the problem – group the points into a set of parallel stripes, with the stripes sloping in a different direction from the overall dataset:

```
m_new <- 1 # the new coefficient we want x to have
cdf <- confounded_data_frame(x, y, m_new, num_grp=10) # see function below
striped_scatterplot(y ~ x, cdf) # also see below
```

The stripes were made by specifying a reference line with a slope equal to the x-coefficient we want to achieve, and calculating the distance to that line for each point. Putting these distances into categories (by rounding off some multiple of the distance) then groups the points into stripes (shown as colors in the figure). A regression line was then fitted separately to the set of points within each stripe. The regression lines for the stripes on the very ends can be a bit wild, since these groups are very small and scattered, but the ones near the center, representing the majority of the data points, have a quite consistent slope.

The equation for determining the distance from a point to a line is (of course) right there in Wikipedia.

With a little rearranging to express the line in terms of y-intercept (`b`) and slope (`m`), and leaving off the absolute value so that points below the line have negative distances (and thus end up in a different group from the stripe with a positive distance of the same magnitude), we get this function:

```
point_line_distance <- function(b, m, x, y)
  (y - (m*x + b))/sqrt(m^2 + 1)
```
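As a quick sanity check of the sign convention, here is a small sketch that evaluates the function for three hand-picked points relative to the line y = x. The points are arbitrary choices for illustration:

```r
point_line_distance <- function(b, m, x, y)
  (y - (m * x + b)) / sqrt(m^2 + 1)

# Reference line y = x, i.e. b = 0 and m = 1
above <- point_line_distance(0, 1, x = 0, y = 1)  # positive:  1/sqrt(2)
below <- point_line_distance(0, 1, x = 1, y = 0)  # negative: -1/sqrt(2)
on    <- point_line_distance(0, 1, x = 2, y = 2)  # exactly 0
c(above = above, below = below, on = on)
```

Points on opposite sides of the line get distances of opposite sign, which is what lets the rounding step put them in different stripes.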

Here are functions for putting the points into stripewise groups, determining the regression coefficients for each group, and putting it all together into a figure:

```
confounded_data_frame <- function(x, y, m, num_grp){
  b <- 0 # intercept doesn't matter
  d <- point_line_distance(b, m, x, y)
  d_scaled <- 0.0005 + 0.999 * (d - min(d))/(max(d) - min(d)) # avoid 0 and 1
  data.frame(x=x, y=y,
    group=as.factor(sprintf("grp%02d", ceiling(num_grp*(d_scaled)))))
}

find_group_coefficients <- function(data){
  # fit y ~ x separately within each stripe; one row of coefficients per group
  coef <- t(sapply(levels(data$group),
    function(grp) coefficients(lm(y ~ x, data=data[data$group==grp,]))))
  coef[!is.na(coef[,1]) & ! is.na(coef[,2]),]
}

striped_scatterplot <- function(formula, grouped_data){
  # blue on top and red on bottom, to match the Wikipedia figure
  colors <- rev(rainbow(length(levels(grouped_data$group)), end=2/3))
  plot(formula, grouped_data, bg=colors[grouped_data$group], pch=21, asp=1)
  grp_coef <- find_group_coefficients(grouped_data)
  # if some coefficients get dropped, colors won't match exactly
  for (r in 1:nrow(grp_coef))
    abline(grp_coef[r,1], grp_coef[r,2], col=colors[r], lwd=2)
}
```

Note that the regression lines for each group are not exactly parallel to the stripes. This is because linear regression is about minimizing the squared error on the y-axis, not the distance of points from the line. However, the thinner the stripes are, the closer the group regression lines are to our target slope. If we make a large number of thin stripes, the coefficient of `x` when the groups are taken into account is essentially the same as the slope of the reference line we used to orient the stripes:

```
cdf100 <- confounded_data_frame(x, y, m_new, num_grp=100)
# without confounder
coefficients(lm(y ~ x, cdf100))['x']
```

```
## x
## -0.5358175
```

```
# with confounder
coefficients(lm(y ~ x + group, cdf100))['x']
```

```
## x
## 0.9961566
```

This approach gives us the power to synthesize simulated confounders that can change the coefficient of `x` to pretty much any value we choose when a model is fitted with the confounder taken into account. Plus, it makes pretty rainbows.
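To try the "any value we choose" claim, here is a self-contained sketch that repeats the striping trick for a few target slopes. The helper `stripe_groups()` and the target values are my own hypothetical choices; the helper just inlines the distance and grouping steps described above.

```r
set.seed(1)
N <- 2000
x <- rnorm(N)
y <- -0.5555556 * x + 8.3333333 + rnorm(N)

# Hypothetical helper: assign each point to a stripe parallel to y = m * x
stripe_groups <- function(x, y, m, num_grp) {
  d <- (y - m * x) / sqrt(m^2 + 1)                 # signed distance to the line
  d_scaled <- 0.0005 + 0.999 * (d - min(d)) / (max(d) - min(d))
  factor(ceiling(num_grp * d_scaled))
}

targets <- c(-3, 0.5, 2)   # arbitrary target slopes, chosen for illustration
achieved <- sapply(targets, function(m_target) {
  group <- stripe_groups(x, y, m_target, num_grp = 100)
  unname(coefficients(lm(y ~ x + group))["x"])
})
data.frame(target = targets, achieved = achieved)
```

With 100 thin stripes, each achieved coefficient should land close to its target, even though the unadjusted slope stays near -0.56 throughout.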

While Simpson’s Paradox is typically described in terms of categorical confounders, the same reversal principle applies to continuous confounders. But that’s a topic for another post.

[Pearl2009]: Pearl, J. Causality: Models, Reasoning and Inference (2nd ed.). Cambridge University Press, New York, 2009.

by Andrie de Vries

Back in 2011, I asked a question on StackOverflow: "How to make a great R reproducible example?".

This question attracted some great answers, including answers by Hadley Wickham and Joris Meys (co-author of R for Dummies).

In June of this year Tyler Rinker added a new answer. Tyler published the wakefield package. In his own words:

I am developing the wakefield package to address this need to quickly share reproducible data, sometimes dput() works fine for smaller data sets but many of the problems we deal with are much larger, sharing such a large data set via dput() is impractical.

I think it is a brilliant idea to create a package that allows you to easily create data with a specified structure.

The package has some very clever ideas. It contains functions that "know" about certain data types, e.g. age() generates age ranges and coin() generates a Bernoulli sample, to name just a few. You can also specify correlation between variables - a helpful feature if you want to demonstrate a specific statistical model.

The package is not yet on CRAN, but is extensively documented on GitHub.

wakefield is designed to quickly generate random data sets. The user passes `n` (number of rows) and predefined vectors to the `r_data_frame` function to produce a `dplyr::tbl_df` object.

Here is an example from the documentation (modified only very slightly):

This produces the following plot. Notice the correlation in the data - people with high initial grades tend to maintain high grades over time, and vice versa.

To install the package, uncomment the first two lines of code and try the examples:

by Nina Zumel

Principal Consultant Win-Vector LLC

We've just finished off a series of articles on some recent research results applying differential privacy to improve machine learning. Some of these results are pretty technical, so we thought it was worth working through concrete examples. And some of the original results are locked behind academic journal paywalls, so we've tried to touch on the highlights of the papers, and to play around with variations of our own.

- **A Simpler Explanation of Differential Privacy**: Quick explanation of epsilon-differential privacy, and an introduction to an algorithm for safely reusing holdout data, recently published in *Science* (Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth, “The reusable holdout: Preserving validity in adaptive data analysis”, *Science*, vol 349, no. 6248, pp. 636-638, August 2015). Note that Cynthia Dwork, one of the inventors of differential privacy, originally used it in the analysis of sensitive information.
- **Using differential privacy to reuse training data**: Specifically, how differential privacy helps you build efficient encodings of categorical variables with many levels from your training data without introducing undue bias into downstream modeling.
- **A simple differentially private procedure**: The bootstrap as an alternative to Laplace noise to introduce differential privacy.

Our R code and experiments are available on GitHub here, so you can try some experiments and variations yourself.

*Editor's Note: The R code includes an example of using vtreat, a package for preparing and cleaning data frames based on level-based feature pruning.*

by Joseph Rickert

We all "know" that correlation does not imply causation, that unmeasured and unknown factors can confound a seemingly obvious inference. But, who has not been tempted by the seductive quality of strong correlations?

Fortunately, it is also well known that a well-done randomized experiment can account for the unknown confounders and permit valid causal inferences. But what can you do when it is impractical, impossible or unethical to conduct a randomized experiment? (For example, we wouldn't want to ask a randomly assigned cohort of people to go through life with less education to prove that education matters.) One way of coping with confounders when randomization is infeasible is to introduce what economists call instrumental variables. This is a devilishly clever and apparently fragile notion that takes some effort to wrap one's head around.

On Tuesday October 20th, we at the Bay Area useR Group (BARUG) had the good fortune to have Hyunseung Kang describe the work that he and his colleagues at the Wharton School have been doing to extend the usefulness of instrumental variables. Hyunseung's talk started with elementary notions, such as explaining the effectiveness of randomized experiments, then described the essential notion of instrumental variables and developed the background necessary for understanding the new results in this area. The slides from Hyunseung's talk are available for download in two parts from the BARUG website. As with most presentations, these slides are little more than the mute residue of the talk itself. Nevertheless, Hyunseung makes such imaginative use of animation and build slides that the deck is worth working through.

The following slide from Hyunseung's presentation captures the essence of the instrumental approach.

The general idea is that one or more variables, the instruments, are added to the model for the purpose of inducing randomness into the exposure. This has to be done in a way that conforms with the three assumptions mentioned in the figure. The first assumption, A1, is that the instrument variables are relevant to the process, that is, related to the exposure. The second assumption, A2, states that randomness is only induced into the exposure variables and not also directly into the outcome. The third assumption, A3, is a strong one: the instruments are unrelated to any unmeasured confounders. The claim is that if these three assumptions are met then causal effects can be estimated with coefficients for the exposure variables that are consistent and asymptotically unbiased.
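To make the idea concrete, here is a minimal simulation sketch in base R. It is not taken from Hyunseung's slides and does not use the ivmodel package; all variable names and effect sizes are assumptions chosen for illustration. The simulated instrument `z` shifts the exposure `d`, has no direct path to the outcome `y`, and is generated independently of the unmeasured confounder `u`. The estimate is computed by hand with two-stage least squares, which for a single instrument equals the ratio cov(z, y) / cov(z, d).

```r
set.seed(42)
n <- 20000
u <- rnorm(n)                      # unmeasured confounder (never used in a model)
z <- rbinom(n, 1, 0.5)             # instrument, independent of u
d <- 2 * z + u + rnorm(n)          # exposure depends on instrument and confounder
beta_true <- 2                     # assumed causal effect of d on y
y <- beta_true * d + 2 * u + rnorm(n)

# Naive regression: biased, because u drives both d and y
ols <- unname(coef(lm(y ~ d))["d"])

# Two-stage least squares by hand
d_hat <- fitted(lm(d ~ z))                       # stage 1: exposure explained by z
tsls  <- unname(coef(lm(y ~ d_hat))["d_hat"])    # stage 2: outcome on fitted exposure

# Equivalent ratio form for a single instrument
wald <- cov(z, y) / cov(z, d)

c(true = beta_true, ols = ols, tsls = tsls, wald = wald)
```

The naive estimate overshoots the assumed effect, while the instrumented estimate should sit close to it. Note that the standard errors reported by the second-stage `lm()` are not valid for inference; a dedicated package handles that properly.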

In the education example developed by Hyunseung, the instrumental variables are the subject's proximity to 2 year and 4 year colleges. Here is where the "rubber meets the road" so to speak. Assessing the relevancy of the instrumental variables and interpreting their effects are subject to the kinds of difficulties described by Andrew Gelman in his post of a few years back.

In the second part of his presentation Hyunseung presents new work: (1) two methods that provide robust confidence intervals when assumption A1 is violated, (2) a method for implementing a sensitivity analysis to assess the sensitivity of an instrumental variable model to violations of assumptions A2 and A3, and (3) the R package ivmodel that ties it all together.

To delve even deeper into this topic have a look at the paper: Instrumental Variables Estimation With Some Invalid Instruments and its Application to Mendelian Randomization.

by Joseph Rickert

In a recent post, I wrote about support vector machines, the representative master algorithm of the 5th tribe of machine learning practitioners described by Pedro Domingos in his book, The Master Algorithm. Here we look into algorithms favored by the first tribe, the symbolists, who see learning as the process of inverse deduction. Pedro writes:

Another limitation of inverse deduction is that it's very computational intensive, which makes it hard to scale to massive data sets. For these, the symbolist algorithm of choice is decision tree induction. Decision trees can be viewed as an answer to the question of what to do if rules of more than one concept match an instance. (p85)

The de facto standard for decision trees, or “recursive partitioning” trees as they are known in the literature, is the CART algorithm by Breiman et al. (1984), implemented in R's rpart package. Stripped down to its essential structure, CART is a two-stage algorithm. In the first stage, the algorithm conducts an exhaustive search over each variable to find the best split by maximizing an information criterion that will result in cells that are as pure as possible for one or the other of the class variables. In the second stage, a constant model is fit to each cell of the resulting partition. The algorithm then proceeds in a recursive “greedy” fashion, making splits and not looking back to see how things might have been before making the next split. Although hugely successful in practice, the algorithm has two vexing problems: (1) overfitting and (2) selection bias, in which the algorithm favors features with many possible splits [1]. Overfitting occurs because the algorithm has “no concept of statistical significance” [2]. While overfitting is usually handled with cross validation and pruning, there doesn’t seem to be an easy way to deal with selection bias in the CART / rpart framework.
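The first stage can be sketched in a few lines of base R. This is an illustration of the exhaustive-search idea only, not rpart's implementation: it scores a split by the reduction in squared error around each cell's mean (the constant model of the second stage), and the function names and toy data are hypothetical.

```r
# Squared error of a constant (mean) model fitted to one cell
sse <- function(v) sum((v - mean(v))^2)

# Try every cut point on one numeric feature and keep the best one
best_split <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between adjacent values
  gain <- sapply(cuts, function(cut) {
    sse(y) - (sse(y[x < cut]) + sse(y[x >= cut]))
  })
  c(cut = cuts[which.max(gain)], gain = max(gain))
}

# Search every feature and return the single best (feature, cut) pair
best_feature_split <- function(data, response) {
  features <- setdiff(names(data), response)
  res <- t(sapply(features, function(f) best_split(data[[f]], data[[response]])))
  res[which.max(res[, "gain"]), , drop = FALSE]
}

set.seed(7)
toy <- data.frame(a = runif(200), b = runif(200))
toy$y <- as.numeric(toy$a > 0.6) + rnorm(200, sd = 0.1)   # only 'a' matters
chosen <- best_feature_split(toy, "y")
chosen
```

On this toy data the search picks feature `a` with a cut near 0.6. A full tree would apply the same search recursively to each resulting cell, which is also why a feature offering many candidate cuts gets more chances to win by luck.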

To address these issues, Hothorn, Hornik and Zeileis introduced the party package into R about ten years ago; it provides an implementation of conditional inference trees (see Unbiased Recursive Partitioning: A Conditional Inference Framework). Party’s ctree() function separates the selection of variables for splitting and the splitting process itself into two different steps, and explicitly addresses selection bias by implementing statistical testing and a stopping procedure in the first step. Very roughly, the algorithm proceeds as follows:

1. Each node of the tree is represented by a set of weights. Then, for each covariate vector X, the algorithm tests the null hypothesis that the dependent variable Y is independent of X. If the hypothesis cannot be rejected then the algorithm stops. Otherwise, the covariate with the strongest association with Y is selected for splitting.
2. The algorithm performs a split and updates the weights describing the tree.
3. Steps 1 and 2 are repeated recursively with the new parameter settings.
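The variable-selection step can be imitated with ordinary tests in base R. The sketch below is a stand-in, not party's code: ctree() works within a permutation-test framework, whereas here I swap in `cor.test()` p-values with a Bonferroni adjustment. The function name, the `alpha` threshold, and the toy data are assumptions for illustration.

```r
# Return the covariate most strongly associated with the response,
# or NULL when independence cannot be rejected for any covariate (stop).
select_split_variable <- function(data, response, alpha = 0.05) {
  covariates <- setdiff(names(data), response)
  p <- sapply(covariates,
              function(v) cor.test(data[[v]], data[[response]])$p.value)
  p_adj <- p.adjust(p, method = "bonferroni")   # account for multiple tests
  if (min(p_adj) > alpha) return(NULL)
  names(which.min(p_adj))
}

set.seed(11)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 2 * toy$x2 + rnorm(n)                  # only x2 is related to y
selected <- select_split_variable(toy, "y")
selected

# With pure noise the function will typically return NULL (no split)
noise <- data.frame(x1 = rnorm(n), x2 = rnorm(n), y = rnorm(n))
select_split_variable(noise, "y")
```

Because the choice of variable rests on a p-value rather than on how many cut points a variable offers, a feature with many possible splits gains no automatic advantage, and the same test doubles as the stopping rule.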

The details, along with enough theory to use the ctree algorithm with some confidence, are presented in this accessible vignette: “party: A Laboratory for Recursive Partitioning”. The following example contrasts the ctree() and rpart() algorithms.

We begin by dividing the segmentationData data set that comes with the caret package into training and test sets and fitting a ctree() model to it using the default parameters. No attempt is made to optimize the model. Next, we use the model to predict values of the Class variable on the test data set and calculate the area under the ROC curve to be 0.8326.

```
# Script to compare ctree with rpart
library(party)
library(rpart)
library(caret)
library(pROC)

### Get the Data
# Load the data and construct indices to divide it into training and test data sets.
data(segmentationData)   # Load the segmentation data set
data <- segmentationData[,3:61]
data$Class <- ifelse(data$Class=="PS",1,0)
#
trainIndex <- createDataPartition(data$Class,p=.7,list=FALSE)
trainData <- data[trainIndex,]
testData <- data[-trainIndex,]
#------------------------
set.seed(23)
# Fit Conditional Tree Model
ctree.fit <- ctree(Class ~ ., data=trainData)
ctree.fit
plot(ctree.fit,main="ctree Model")

#Make predictions using the test data set
ctree.pred <- predict(ctree.fit,testData)

#Draw the ROC curve
ctree.ROC <- roc(predictor=as.numeric(ctree.pred),
                 response=testData$Class)
ctree.ROC$auc
#Area under the curve: 0.8326
plot(ctree.ROC,main="ctree ROC")
```

Here are the text and graphical descriptions of the resulting tree.

```
1) FiberWidthCh1 <= 9.887543; criterion = 1, statistic = 383.388
  2) TotalIntenCh2 <= 42511; criterion = 1, statistic = 115.137
    3) TotalIntenCh1 <= 39428; criterion = 1, statistic = 20.295
      4)* weights = 504
    3) TotalIntenCh1 > 39428
      5)* weights = 9
  2) TotalIntenCh2 > 42511
    6) AvgIntenCh1 <= 199.2768; criterion = 1, statistic = 28.037
      7) IntenCoocASMCh3 <= 0.5188792; criterion = 0.99, statistic = 14.022
        8)* weights = 188
      7) IntenCoocASMCh3 > 0.5188792
        9)* weights = 7
    6) AvgIntenCh1 > 199.2768
      10)* weights = 36
1) FiberWidthCh1 > 9.887543
  11) ShapeP2ACh1 <= 1.227156; criterion = 1, statistic = 48.226
    12)* weights = 169
  11) ShapeP2ACh1 > 1.227156
    13) IntenCoocContrastCh3 <= 12.32349; criterion = 1, statistic = 22.349
      14) SkewIntenCh4 <= 1.148388; criterion = 0.998, statistic = 16.78
        15)* weights = 317
      14) SkewIntenCh4 > 1.148388
        16)* weights = 109
    13) IntenCoocContrastCh3 > 12.32349
      17) AvgIntenCh2 <= 244.9512; criterion = 0.999, statistic = 19.382
        18)* weights = 53
      17) AvgIntenCh2 > 244.9512
        19)* weights = 22
```

Next, we fit an rpart() model to the training data using the default parameter settings and calculate the AUC to be 0.8536 on the test data.

```
# Fit CART Model with the default settings (cp = 0.01);
# adding cp=0 to the call grows the full, unpruned tree instead
rpart.fit <- rpart(Class ~ ., data=trainData)
rpart.fit
# as.party() is provided by the partykit package
plot(partykit::as.party(rpart.fit),main="rpart Model")

#Make predictions using the test data set
rpart.pred <- predict(rpart.fit,testData)

#Draw the ROC curve
rpart.ROC <- roc(predictor=as.numeric(rpart.pred),
                 response=testData$Class)
rpart.ROC$auc
#Area under the curve: 0.8536
plot(rpart.ROC)
```

The resulting pruned tree does better than ctree(), but at the expense of building a slightly deeper tree.

```
1) root 1414 325.211500 0.64144270
  2) TotalIntenCh2>=42606.5 792 191.635100 0.41035350
    4) FiberWidthCh1>=11.19756 447 85.897090 0.25950780
      8) ShapeP2ACh1< 1.225676 155 13.548390 0.09677419 *
      9) ShapeP2ACh1>=1.225676 292 66.065070 0.34589040
        18) SkewIntenCh4< 1.41772 254 53.259840 0.29921260
          36) TotalIntenCh4< 127285.5 214 40.373830 0.25233640
            72) EqEllipseOblateVolCh1>=383.1453 142 19.943660 0.16901410 *
            73) EqEllipseOblateVolCh1< 383.1453 72 17.500000 0.41666670
              146) AvgIntenCh1>=110.2253 40 6.400000 0.20000000 *
              147) AvgIntenCh1< 110.2253 32 6.875000 0.68750000 *
          37) TotalIntenCh4>=127285.5 40 9.900000 0.55000000 *
        19) SkewIntenCh4>=1.41772 38 8.552632 0.65789470 *
    5) FiberWidthCh1< 11.19756 345 82.388410 0.60579710
      10) KurtIntenCh1< -0.3447192 121 28.000000 0.36363640
        20) TotalIntenCh1>=13594 98 19.561220 0.27551020 *
        21) TotalIntenCh1< 13594 23 4.434783 0.73913040 *
      11) KurtIntenCh1>=-0.3447192 224 43.459820 0.73660710
        22) AvgIntenCh1>=454.3329 7 0.000000 0.00000000 *
        23) AvgIntenCh1< 454.3329 217 39.539170 0.76036870
          46) VarIntenCh4< 130.9745 141 31.333330 0.66666670
            92) NeighborAvgDistCh1>=256.5239 30 6.300000 0.30000000 *
            93) NeighborAvgDistCh1< 256.5239 111 19.909910 0.76576580 *
          47) VarIntenCh4>=130.9745 76 4.671053 0.93421050 *
  3) TotalIntenCh2< 42606.5 622 37.427650 0.93569130
    6) ShapeP2ACh1< 1.236261 11 2.545455 0.36363640 *
    7) ShapeP2ACh1>=1.236261 611 31.217680 0.94599020 *
```

Note, however, that when the complexity parameter for rpart(), cp, is set to zero, rpart() builds a massive tree, a portion of which is shown below, and overfits the data, yielding an AUC of 0.806.

```
1) root 1414 325.2115000 0.64144270
  2) TotalIntenCh2>=42606.5 792 191.6351000 0.41035350
    4) FiberWidthCh1>=11.19756 447 85.8970900 0.25950780
      8) ShapeP2ACh1< 1.225676 155 13.5483900 0.09677419
        16) EntropyIntenCh1>=6.672119 133 7.5187970 0.06015038
          32) AngleCh1< 108.6438 82 0.0000000 0.00000000 *
          33) AngleCh1>=108.6438 51 6.7450980 0.15686270
            66) EqEllipseLWRCh1>=1.184478 26 0.9615385 0.03846154
              132) DiffIntenDensityCh3>=26.47004 19 0.0000000 0.00000000 *
              133) DiffIntenDensityCh3< 26.47004 7 0.8571429 0.14285710 *
            67) EqEllipseLWRCh1< 1.184478 25 5.0400000 0.28000000
              134) IntenCoocContrastCh3>=9.637027 9 0.0000000 0.00000000 *
              135) IntenCoocContrastCh3< 9.637027 16 3.9375000 0.43750000 *
        17) EntropyIntenCh1< 6.672119 22 4.7727270 0.31818180
          34) ShapeBFRCh1>=0.6778205 13 0.0000000 0.00000000 *
          35) ShapeBFRCh1< 0.6778205 9 1.5555560 0.77777780 *
      9) ShapeP2ACh1>=1.225676 292 66.0650700 0.34589040
        18) SkewIntenCh4< 1.41772 254 53.2598400 0.29921260
          36) TotalIntenCh4< 127285.5 214 40.3738300 0.25233640
            72) EqEllipseOblateVolCh1>=383.1453 142 19.9436600 0.16901410
              144) IntenCoocEntropyCh3< 7.059374 133 16.2857100 0.14285710
                288) NeighborMinDistCh1>=21.91001 116 11.5431000 0.11206900
                  576) NeighborAvgDistCh1>=170.2248 108 8.2500000 0.08333333
                    1152) FiberAlign2Ch4< 1.481728 68 0.9852941 0.01470588
                      2304) XCentroid>=100.5 61 0.0000000 0.00000000 *
                      2305) XCentroid< 100.5 7 0.8571429 0.14285710 *
                    1153) FiberAlign2Ch4>=1.481728 40 6.4000000 0.20000000
                      2306) SkewIntenCh1< 0.9963465 27 1.8518520 0.07407407
```

In practice, rpart()'s complexity parameter (default value cp = .01) is effective in controlling tree growth and overfitting. It does, however, have an "ad hoc" feel to it. In contrast, the ctree() algorithm implements tests for statistical significance within the process of growing a decision tree. It automatically curtails excessive growth, inherently addresses both overfitting and bias and offers the promise of achieving good models with less computation.

Finally, note that rpart() and ctree() construct different trees that offer about the same performance. Some practitioners who value decision trees for their interpretability find this disconcerting. End users of machine learning models often want a story that tells them something true about their customers' behavior, buying preferences and so on. But the likelihood of there being multiple satisfactory answers to a complex problem is inherent to the process of inverse deduction. As Hothorn et al. comment:

Since a key reason for the popularity of tree based methods stems from their ability to represent the estimated regression relationship in an intuitive way, interpretations drawn from regression trees must be taken with a grain of salt.

1. Hothorn et al. (2006): Unbiased Recursive Partitioning: A Conditional Inference Framework. J COMPUT GRAPH STAT, Vol. 15, No. 3, Sept 2006.

2. Mingers (1987): Expert Systems-Rule Induction with Statistical Data.

*by Bob Horton, Microsoft Senior Data Scientist*

Learning curves are an elaboration of the idea of validating a model on a test set, and have been widely popularized by Andrew Ng’s Machine Learning course on Coursera. Here I present a simple simulation that illustrates this idea.

Imagine you use a sample of your data to train a model, then use the model to predict the outcomes on data where you know what the real outcome is. Since you know the “real” answer, you can calculate the overall error in your predictions. The error on the same data set used to train the model is called the *training error*, and the error on an independent sample is called the *validation error*.

A model will commonly perform better (that is, have lower error) on the data it was trained on than on an independent sample. The difference between the training error and the validation error reflects *overfitting* of the model. Overfitting is like memorizing the answers for a test instead of learning the principles (to borrow a metaphor from the Wikipedia article). Memorizing works fine if the test is exactly like the study guide, but it doesn’t work very well if the test questions are different; that is, it doesn’t generalize. In fact, the more a model is overfitted, the higher its validation error is likely to be. This is because the spurious correlations the overfitted model memorized from the training set most likely don’t apply in the validation set.

Overfitting is usually more extreme with small training sets. In large training sets the random noise tends to average out, so that the underlying patterns are more clear. But in small training sets, there is less opportunity for averaging out the noise, and accidental correlations consequently have more influence on the model. Learning curves let us visualize this relationship between training set size and the degree of overfitting.

We start with a function to generate simulated data:

```
sim_data <- function(N, noise_level=1){
  X1 <- sample(LETTERS[1:10], N, replace=TRUE)
  X2 <- sample(LETTERS[1:10], N, replace=TRUE)
  X3 <- sample(LETTERS[1:10], N, replace=TRUE)
  y <- 100 + ifelse(X1 == X2, 10, 0) + rnorm(N, sd=noise_level)
  data.frame(X1, X2, X3, y)
}
```

The input columns X1, X2, and X3 are categorical variables which each have 10 possible values, represented by capital letters `A` through `J`. The outcome is cleverly named `y`; it has a base level of 100, but if the values in the first two `X` variables are equal, this is increased by 10. On top of this we add some normally distributed noise. Any other pattern that might appear in the data is accidental.

Now we can use this function to generate a simulated data set for experiments.

```
set.seed(123)
data <- sim_data(25000, noise_level=10)
```

There are many possible error functions, but I prefer the root mean squared error:

`rmse <- function(actual, predicted) sqrt( mean( (actual - predicted)^2 ))`

To generate a learning curve, we fit models at a series of different training set sizes, and calculate the training error and validation error for each model. Then we will plot these errors against the training set size. Here the parameters are a model formula, the data frame of simulated data, the validation set size (vss), the number of different training set sizes we want to plot, and the smallest training set size to start with. The largest training set will be all the rows of the dataset that are not used for validation.

```
run_learning_curve <- function(model_formula, data, vss=5000, num_tss=30, min_tss=1000){
  library(data.table)
  max_tss <- nrow(data) - vss
  tss_vector <- seq(min_tss, max_tss, length.out=num_tss)
  data.table::rbindlist( lapply (tss_vector, function(tss){
    # draw a validation set, then a training set from the remaining rows
    vs_idx <- sample(1:nrow(data), vss)
    vs <- data[vs_idx,]
    ts_eligible <- setdiff(1:nrow(data), vs_idx)
    ts <- data[sample(ts_eligible, tss),]
    fit <- lm( model_formula, ts)
    training_error <- rmse(ts$y, predict(fit, ts))
    validation_error <- rmse(vs$y, predict(fit, vs))
    data.frame(tss=tss,
      error_type = factor(c("training", "validation"),
                          levels=c("validation", "training")),
      error=c(training_error, validation_error))
  }) )
}
```

We’ll use a formula that considers all combinations of the input columns. Since these are categorical inputs, they will be represented by dummy variables in the model, with each combination of variable values getting its own coefficient.

`learning_curve <- run_learning_curve(y ~ X1*X2*X3, data)`

With this example, you get a series of warnings:

```
## Warning in predict.lm(fit, vs): prediction from a rank-deficient fit may be
## misleading
```

This is R trying to tell you that you don’t have enough rows to reliably fit all those coefficients. In this simulation, training set sizes above about 7500 don’t trigger the warning, though as we’ll see the curve still shows some evidence of overfitting.
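A rough way to see where the warning comes from is to count the coefficients the formula asks for. The sketch below builds one row per combination of the three inputs and counts the columns of the model matrix. The link to the "about 7500" figure is my own back-of-the-envelope reading, using the classic coupon-collector expectation for how many random rows it takes to see every combination at least once.

```r
# One row for every combination of three 10-level categorical inputs
grid <- expand.grid(X1 = LETTERS[1:10], X2 = LETTERS[1:10], X3 = LETTERS[1:10],
                    stringsAsFactors = TRUE)

# Number of coefficients in the fully interacted model
n_coef <- ncol(model.matrix(~ X1 * X2 * X3, data = grid))
n_coef

# Expected number of uniformly random rows needed to observe all combinations
expected_rows <- n_coef * sum(1 / seq_len(n_coef))
expected_rows
```

The model has 1000 coefficients, one per combination, so any training sample that misses even one combination leaves a coefficient that cannot be estimated. The expected number of rows needed to cover all combinations comes out a little under 7500, in line with what the simulation shows.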

```
library(ggplot2)
ggplot(learning_curve, aes(x=tss, y=error, linetype=error_type)) +
  geom_line(size=1, col="blue") + xlab("training set size") +
  geom_hline(yintercept=10, linetype=3)  # dotted line at the noise level
```

In this figure, the X-axis represents different training set sizes and the Y-axis represents error. Validation error is shown in the solid blue line on the top part of the figure, and training error is shown by the dashed blue line in the bottom part. As the training set sizes get larger, these curves converge toward a level representing the amount of irreducible error in the data. This plot was generated using a simulated dataset where we know exactly what the irreducible error is; in this case it is the standard deviation of the Gaussian noise we added to the output in the simulation (10; the root mean squared error is essentially the same as standard deviation for reasonably large sample sizes). We don’t expect any model to reliably fit this error since we know it was completely random.
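The claim that the floor equals the noise standard deviation is easy to check directly. In this small sketch, a hypothetical "perfect" model predicts the noise-free value exactly, so its errors are just the noise itself; the sample size and seed are arbitrary.

```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

set.seed(123)
noise <- rnorm(1e5, sd = 10)     # same noise level as the simulated data

floor_rmse <- rmse(noise, 0)     # error of a model that predicts the true signal
c(rmse = floor_rmse, sd = sd(noise))
```

Both numbers come out very close to 10, which is the level the dotted reference line marks in the plot.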

One interesting thing about this simulation is that the underlying system is very simple, yet it can take many thousands of training examples before the validation error of this model gets very close to optimum. In real life, you can easily encounter systems with many more variables, much higher cardinality, far more complex patterns, and of course lots and lots of those unpredictable variations we call “noise”. You can easily encounter situations where truly enormous numbers of samples are needed to train your model without excessive overfitting. On the other hand, if your training and validation error curves have already converged, more data may be superfluous. Learning curves can help you see if you are in a situation where more data is likely to be of benefit for training your model better.

by John Mount (more articles) and Nina Zumel (more articles).

In this article we conclude our four part series on basic model testing. When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it's better than the models that you rejected? In this concluding Part 4 of our four part mini-series "How do you know if your model is going to work?" we demonstrate cross-validation techniques. Previously we worked on:

Cross-validation techniques attempt to improve statistical efficiency by repeatedly splitting data into train and test sets and re-performing model fit and model evaluation. For example, the variation called k-fold cross-validation splits the original data into k roughly equal-sized sets. To score each set we build a model on all data not in the set and then apply the model to our set. This means we build k different models (none of which is our final model, which is traditionally trained on all of the data).
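As a minimal sketch of the k-fold procedure just described (not the code from the Win-Vector article), here is a base R version for a linear model. The function name `kfold_rmse` and the toy data are assumptions for illustration.

```r
# Score every row with a model that never saw it, then summarize the error
kfold_rmse <- function(formula, data, k = 5) {
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))  # random fold labels
  response <- all.vars(formula)[1]
  pred <- numeric(nrow(data))
  for (i in seq_len(k)) {
    held_out <- fold == i
    fit <- lm(formula, data = data[!held_out, ])            # train on the other folds
    pred[held_out] <- predict(fit, newdata = data[held_out, ])
  }
  sqrt(mean((data[[response]] - pred)^2))
}

set.seed(2015)
toy <- data.frame(x = rnorm(200))
toy$y <- 3 * toy$x + rnorm(200)     # noise has standard deviation 1
cv_error <- kfold_rmse(y ~ x, toy, k = 20)
cv_error
```

The k models built inside the loop are discarded; the returned number describes how well the fitting procedure tends to do, and the model you would actually deliver is fit separately on all of the data.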

This is statistically efficient, as each model is trained on a 1-1/k fraction of the data, so for k=20 we are using 95% of the data for training. Another variation called "leave one out" (which is essentially Jackknife resampling) is very statistically efficient, as each datum is scored on a unique model built using all other data. It is, however, very computationally inefficient, as you construct a very large number of models (except in special cases such as the PRESS statistic for linear regression).
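The PRESS special case is worth seeing once. For ordinary least squares, the leave-one-out residual for row i equals the ordinary residual divided by one minus that row's leverage, so a single fit suffices. The sketch below checks this identity against the brute-force approach on toy data (the data are an assumption for illustration).

```r
set.seed(2015)
toy <- data.frame(x = rnorm(50))
toy$y <- 3 * toy$x + rnorm(50)

# Shortcut: one fit, then rescale the residuals by the leverages
fit <- lm(y ~ x, data = toy)
press_resid <- residuals(fit) / (1 - hatvalues(fit))

# Brute force: refit the model once per held-out row
loo_resid <- sapply(seq_len(nrow(toy)), function(i) {
  fit_i <- lm(y ~ x, data = toy[-i, ])
  toy$y[i] - predict(fit_i, newdata = toy[i, ])
})

all.equal(unname(press_resid), unname(loo_resid))
press <- sum(press_resid^2)   # the PRESS statistic
press
```

The two sets of residuals agree to numerical precision, so the leave-one-out error of a linear regression costs one model fit rather than one fit per row.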

Statisticians tend to prefer cross-validation techniques to test/train split, as cross-validation techniques are more statistically efficient and can give sampling distribution style distributional estimates (instead of mere point estimates). However, remember cross-validation techniques are measuring facts *about the fitting procedure* and *not about the actual model in hand* (so they are answering a different question than test/train split). There is some attraction to actually scoring the model you are going to turn in (as is done with in-sample methods, and test/train split, but not with cross-validation). The way to remember this is: bosses are essentially frequentist (they want to know their team and procedure tend to produce good models) and employees are essentially Bayesian (they want to know the actual model they are turning in is likely good; see here for how the nature of the question you are trying to answer controls whether you are in a Bayesian or frequentist situation).

To read more: Win-Vector - How do you know if your model is going to work? Part 4: Cross-validation techniques

BlueSky Statistics is a new GUI-driven statistical data analysis tool for Windows. It provides a series of dialogs to import and manipulate data, and to perform statistical analysis and visualization tasks. (Think: more like SPSS than RStudio.) The underlying operations are implemented using R code, which you can inspect and reuse. This video gives you a more detailed introduction:

The basic version is open-source (here's the GitHub project), and you can download for free here. (There is also a paid Commercial Edition that adds technical support, some advanced statistics and machine learning dialogs, and the ability to extend the system with your own dialogs.) After you download and install, you'll also need to provide your own installation of R (Revolution R Open works too), and install the various packages that BlueSky needs to operate. (Packages for R/RRO 3.2.1 are provided with the download, and you can install them from a menu item.)

After you've installed BlueSky (look in your Documents folder for BlueSky.exe), the first step is to import some data using the File menu. In just a couple of minutes I was able to open a comma-separated file (airquality.csv in the Sample Datasets folder) and use the "Scatterplot" icon to create this chart:

I haven't had much of a chance to dive into the capabilities of BlueSky yet, so I'll leave a full review for a later date. If you've tried it yourself, let us know what you think in the comments.

BlueSky Statistics: Download

*by John Mount (more articles) and Nina Zumel (more articles) of Win-Vector LLC*

"Essentially, all models are wrong, but some are useful." George Box

Here's a caricature of a data science project: your company or client needs information (usually to make a decision). Your job is to build a model to predict that information. You fit a model, perhaps several, to available data and evaluate them to find the best. Then you cross your fingers that your chosen model doesn't crash and burn in the real world. We've discussed detecting if your data has a signal. Now: how do you know that your model is good? And how sure are you that it's better than the models that you rejected?

Notice the Sun in the 4th revolution about the earth. A very pretty, but not entirely reliable model.

In this latest "Statistics as it should be" series, we will systematically look at what to worry about and what to check. This is standard material, but presented in a "data science" oriented manner. Meaning we are going to consider scoring system utility in terms of service to a *negotiable* business goal (one of the many ways data science differs from pure machine learning). To organize the ideas into digestible chunks, we are presenting this article as a four part series. This part (part 1) sets up the specific problem.

Win-Vector blog: HOW DO YOU KNOW IF YOUR MODEL IS GOING TO WORK? PART1: THE PROBLEM