by John Mount Ph. D.

Data Scientist at Win-Vector LLC

In part 2 of her series on Principal Components Regression Dr. Nina Zumel illustrates so-called *y*-aware techniques. These often neglected methods use the fact that for predictive modeling problems we know the dependent variable, outcome or *y*, so we can use this during data preparation *in addition to* using it during modeling. Dr. Zumel shows the incorporation of *y*-aware preparation into Principal Components Analyses can capture more of the problem structure in fewer variables. Such methods include:

- Effects based variable pruning
- Significance based variable pruning
- Effects based variable scaling.

This recovers more domain structure and leads to better models. Using the foundation set in the first article Dr. Zumel quickly shows how to move from a traditional *x*-only analysis that fails to preserve a domain-specific relation of two variables to outcome to a *y*-aware analysis that preserves the relation. Or in other words how to move away from a middling result where different values of y (rendered as three colors) are hopelessly intermingled when plotted against the first two found latent variables as shown below.

Dr. Zumel shows how to perform a decisive analysis where *y* is somewhat sortable by the each of the first two latent variable *and* the first two latent variables capture complementary effects, making them good mutual candidates for further modeling (as shown below).

Click here (part 2 *y*-aware methods) for the discussion, examples, and references. Part 1 (*x* only methods) can be found here.

by Yuzhou Song, Microsoft Data Scientist

R is an open source, statistical programming language with millions of users in its community. However, a well-known weakness of R is that it is both single threaded and memory bound, which limits its ability to process big data. With Microsoft R Server (MRS), the enterprise grade distribution of R for advanced analytics, users can continue to work in their preferred R environment with following benefits: the ability to scale to data of any size, potential speed increases of up to one hundred times faster than open source R.

In this article, we give a walk-through on how to build a gradient boosted tree using MRS. We use a simple fraud data data set having approximately 1 million records and 9 columns. The last column “fraudRisk” is the tag: 0 stands for non-fraud and 1 stands for fraud. The following is a snapshot of the data.

**Step 1: Import the Data**

At the very beginning, load “RevoScaleR” package and specify directory and name of data file. Note that this demo is done in local R, thus, I need load “RevoScaleR” package.

library(RevoScaleR)

data.path <- "./"

file.name <- "ccFraud.csv"

fraud_csv_path <- file.path(data.path,file.name)

Next, we make a data source using RxTextData function. The output of RxTextData function is a data source declaring type of each field, name of each field and location of the data.

colClasses <- c("integer","factor","factor","integer",

"numeric","integer","integer","numeric","factor")

names(colClasses)<-c("custID","gender","state","cardholder",

"balance","numTrans","numIntlTrans",

"creditLine","fraudRisk")

fraud_csv_source <- RxTextData(file=fraud_csv_path,colClasses=colClasses)

With the data source (“fraud_csv_source” above), we are able to take it as input to other functions, just like inputting a data frame into regular R functions. For example, we can put it into rxGetInfo function to check the information of the data:

rxGetInfo(fraud_csv_source, getVarInfo = TRUE, numRows = 5)

and you will get the following:

**Step 2: Process Data**

Next, we demonstrate how to use an important function, rxDataStep, to process data before training. Generally, rxDataStep function transforms data from an input data set to an output data set. Basically there are three main arguments of that function: inData, outFile and transformFunc. inData can take either a data source (made by RxTextData shown in step 1) or a data frame. outFile takes data source to specify output file name, schema and location. If outFile is empty, rxDataStep function will return a data frame. transformFunc takes a function as input which will be used to do transformation by rxDatastep function. If the function has arguments more than input data source/frame, you may specify them in the transformObjects argument.

Here, we make a training flag using rxDataStep function as an example. The data source fraud_csv_source made in step 1 will be used for inData. We create an output data source specifying output file name “ccFraudFlag.csv”:

fraud_flag_csv_source <- RxTextData(file=file.path(data.path,

"ccFraudFlag.csv"))

Also, we create a simple transformation function called “make_train_flag”. It creates training flag which will be used to split data into training and testing set:

make_train_flag <- function(data){

data <- as.data.frame(data)

set.seed(34)

data$trainFlag <- sample(c(0,1),size=nrow(data),

replace=TRUE,prob=c(0.3,0.7))

return(data)

}

Then, use rxDateStep to complete the transformation:

rxDataStep(inData=fraud_csv_source,

outFile=fraud_flag_csv_source,

transformFunc=make_train_flag,overwrite = TRUE)

again, we can check the output file information by using rxGetInfo function:

rxGetInfo(fraud_flag_csv_source, getVarInfo = TRUE, numRows = 5)

we can find the trainFlag column has been appended to the last:

Based on the trainFlag, we split the data into training and testing set. Thus, we need specify the data source for output:

train_csv_source <- RxTextData(

file=file.path(data.path,"train.csv"))

test_csv_source <- RxTextData(

file=file.path(data.path,"test.csv"))

Instead of creating a transformation function, we can simply specify the rowSelection argument in rxDataStep function to select rows satisfying certain conditions:

rxDataStep(inData=fraud_flag_csv_source,

outFile=train_csv_source,

reportProgress = 0,

rowSelection = (trainFlag == 1),

overwrite = TRUE)

rxDataStep(inData=fraud_flag_csv_source,

outFile=test_csv_source,

reportProgress = 0,

rowSelection = (trainFlag == 0),

overwrite = TRUE)

A well-known problem for fraud data is the extremely skewed distribution of labels, i.e., most transactions are legitimate while only a very small proportion are fraud transactions. In the original data, good/bad ratio is about 15:1. Directly using original data to train a model will result in a poor performance, since the model is unable to find the proper boundary between “good” and “bad”. A simple but effective solution is to randomly down sample the majority. The following is the down_sample transformation function down sampling majority to a good/bad ratio 4:1. The down sample ratio is pre-selected based on prior knowledge but can be finely tuned based on cross validation as well.

down_sample <- function(data){

data <- as.data.frame(data)

data_bad <- subset(data,fraudRisk == 1)

data_good <- subset(data,fraudRisk == 0)

# good to bad ratio 4:1

rate <- nrow(data_bad)*4/nrow(data_good)

set.seed(34)

data_good$keepTag <- sample(c(0,1),replace=TRUE,

size=nrow(data_good),prob=c(1-rate,rate))

data_good_down <- subset(data_good,keepTag == 1)

data_good_down$keepTag <- NULL

data_down <- rbind(data_bad,data_good_down)

data_down$trainFlag <- NULL

return(data_down)

}

Then, we specify the down sampled training data source and use rxDataStep function again to complete the down sampling process:

train_downsample_csv_source <- RxTextData(

file=file.path(data.path,

"train_downsample.csv"),

colClasses = colClasses)

rxDataStep(inData=train_csv_source,

outFile=train_downsample_csv_source,

transformFunc = down_sample,

reportProgress = 0,

overwrite = TRUE)

**Step 3: Training**

In this step, we take the down sampled data to train a gradient boosted tree. We first use rxGetVarNames function to get all variable names in training set. The input is still the data source of down sampled training data. Then we use it to create a formula which will be used later:

training_vars <- rxGetVarNames(train_downsample_csv_source)

training_vars <- training_vars[!(training_vars %in%

c("fraudRisk","custID"))]

formula <- as.formula(paste("fraudRisk~",

paste(training_vars, collapse = "+")))

The rxBTrees function is used for building gradient boosted tree model. formula argument is used to specify label column and predictor columns. data argument takes a data source as input for training. lossFunction argument specifies the distribution of label column, i.e., “bernoulli” for numerical 0/1 regression, “gaussian” for numerical regression, and “multinomial” for two or more class classification. Here we choose “multinomial” as 0/1 classification problem. Other parameters are pre-selected, not finely tuned:

boosted_fit <- rxBTrees(formula = formula,

data = train_downsample_csv_source,

learningRate = 0.2,

minSplit = 10,

minBucket = 10,

# small number of tree for testing purpose

nTree = 20,

seed = 5,

lossFunction ="multinomial",

reportProgress = 0)

**Step 4: Prediction and Evaluation**

We use rxPredict function to predict on testing data set, but first use rxImport function to import testing data set:

test_data <- rxImport(test_csv_source)

Then, we take the imported testing set and fitted model object as input for rxPredict function:

predictions <- rxPredict(modelObject = boosted_fit,

data = test_data,

type = "response",

overwrite = TRUE,

reportProgress = 0)

In rxPrediction function, type=”response” will output predicted probabilities. Finally, we pick 0.5 as the threshold and evaluate the performance:

threshold <- 0.5

predictions <- data.frame(predictions$X1_prob)

names(predictions) <- c("Boosted_Probability")

predictions$Boosted_Prediction <- ifelse(

predictions$Boosted_Probability > threshold, 1, 0)

predictions$Boosted_Prediction <- factor(

predictions$Boosted_Prediction,

levels = c(1, 0))

scored_test_data <- cbind(test_data, predictions)

evaluate_model <- function(data, observed, predicted) {

confusion <- table(data[[observed]],

data[[predicted]])

print(confusion)

tp <- confusion[1, 1]

fn <- confusion[1, 2]

fp <- confusion[2, 1]

tn <- confusion[2, 2]

accuracy <- (tp+tn)/(tp+fn+fp tn)

precision <- tp / (tp + fp)

recall <- tp / (tp + fn)

fscore <- 2*(precision*recall)/(precision+recall)

metrics <- c("Accuracy" = accuracy,

"Precision" = precision,

"Recall" = recall,

"F-Score" = fscore)

return(metrics)

}

* *

roc_curve <- function(data, observed, predicted) {

data <- data[, c(observed, predicted)]

data[[observed]] <- as.numeric(

as.character(data[[observed]]))

rxRocCurve(actualVarName = observed,

predVarNames = predicted,

data = data)

}

* *

boosted_metrics <- evaluate_model(data = scored_test_data,

observed = "fraudRisk",

predicted = "Boosted_Prediction")

roc_curve(data = scored_test_data,

observed = "fraudRisk",

predicted = "Boosted_Probability")

The confusion Matrix:

0 1

0 2752288 67659

1 77117 101796

ROC curve is (AUC=0.95):

**Summary:**

In this article, we demonstrate how to use MRS in a fraud data. It includes how to create a data source by RxTextData function, how to make transformation using rxDataStep function, how to import data using rxImport function, how to train a gradient boosted tree model using rxBTrees function, and how to predict using rxPredict function.

by Joseph Rickert

When I first went to grad school, the mathematicians advised me cultivate the habit of reading with a pencil. This turned into a lifelong habit and useful skill for reading all sorts of things: literature, reports and newspapers for example; not just technical papers. However, reading statistics and data science papers, or really anything that includes some data, considerably "ups the ante". For this sort of exercise, I need a tool to calculate, to try some variations that test my intuition and see how well I'm following the arguments. The idea here is not so much to replicate the paper but to accept the author's invitation to engage with the data and work through the analysis. Ideally, I'd want something not much more burdensome than than a pencil (maybe a tablet based implementation of R), but standard R on my notebook comes pretty close to the perfect tool.

Recently, I sat down with Bradley Efron's 1987 paper "Logistic Regression, Survival Analysis, and the Kaplan-Meier Curve", the paper where he elaborates on the idea of using conditional logistic regression to estimate hazard rates and survival curves. This paper is classic Efron: drawing you in with a great story well before you realize how much work it's going to be to follow it to the end. Efron writes with a fairly informal style that encourages the reader to continue. Struggling to keep up with some of his arguments I nevertheless get the feeling that Efron is doing his best help me follow along, dropping hints every now and then about where to look if I lose the trail.

The basic idea of conditional logistic regression is to group the data into discrete time intervals with n_{i} patients at risk in each interval, i, and then assume that the intervals really are independent and that the s_{i} events (deaths or some other measure of "success") in each interval, follow a binomial distribution with parameters n_{i} and h_{i} where:

h_{i} = Prob(patient i dies during the i^{th} interval | patient i survives until the beginning of the i^{th} interval).

The modest goal of this post was to see if I could reproduce Efron's Figure 3 which shows survival curves for three different models for A arm of a clinical trial examining treatments for head and neck cancer. I figured that getting to Figure 3 represents the minimum amount of comprehension required to begin experimenting with conditional logistic regression.

I entered the data buried in the caption to Efron's Table 1 and was delighted when R's Survival package replicated survival times also in the caption.

# Data for Efron's Table 1 # Enter raw data for arm A Adays <- c(7, 34, 42, 63, 64, 74, 83, 84, 91, 108, 112, 129, 133, 133, 139, 140, 140, 146, 149, 154, 157, 160, 160, 165, 173, 176, 185, 218, 225, 241, 248, 273, 277, 279, 297, 319, 405, 417, 420, 440, 523, 523, 583, 594, 1101, 1116, 1146, 1226, 1349, 1412, 1417) Astatus <- rep(1,51) Astatus[c(6,27,34,36,42,46,48,49,50)] <-0 Aobj <- Surv(time = Adays, Astatus==1) Aobj # [1] 7 34 42 63 64 74+ 83 84 91 108 112 129 133 133 # [15] 139 140 140 146 149 154 157 160 160 165 173 176 185+ 218 # [29] 225 241 248 273 277 279+ 297 319+ 405 417 420 440 523 523+ # [43] 583 594 1101 1116+ 1146 1226+ 1349+ 1412+ 1417

Doing the same with the data from arm B of the trial led to a set of Kaplan-Meier curves that pretty much match the curves in Efron's Figure 1.

All of this was straightforward but I was puzzled that the summary of the Kaplan-Meier curve for arm A (See KM_A in the code below) doesn't match the values for month, n_{i} and s_{i} in Efron's Table 1, until I realized these values were for the beginning of the month. To match the table I compute n_{i} by putting 51 in front of the vector KM_A$n.risk, and add a 1 to the end of the vector KM_A$n.event to get s_{i}. (See the "set up variables for models section in the code below.)

After this, one more "trick" was required to get to Figure 3. Most of the time, I suppose those of us working with logistic regression to construct machine learning models are accustomed to specifying the outcome as a binary variable of "ones" and "zeros", or as a two level factor. But, how exactly does one specify the parameters n_{i} and s_{i} for the binomial models that comprise each outcome? After a close reading of the documentation (See the first paragraph under Details for ?glm) I was very pleased to see that glm() permits the dependent variable to be a matrix.

The formula for Efron's cubic model looks like this:

Y <- matrix(c(si, failure),n,2) # Response matrix

form <- formula(Y ~ t + t2 + t3)

The rest of the code is straight forward and leads to a pretty good reproduction of Figure 3.

In this figure, the black line represents the survival curve for a Life Table model where the hazard probabilities are estimated by h_{i} = s_{i} / n_{i}. The blue triangles map the survival curve for the cubic model given above, and the red curve with crosses plots a cubic spline model where h_{i} = t_{i} + (t_{i} - 11)^{2} + (t_{i} - 11)^{3 }.

What a delightful little diagram! In addition to illustrating the technique of using a very basic statistical tool to model time-to-event data, the process leading to Figure 3 reveals something about the intuition and care a professional statistician puts into the exploratory modeling process.

There is much more in Efron's paper. What I have shown here is just the "trailer". Efron presents a careful analysis of the data for both arms of the clinical trial data, goes on to study maximum likelihood estimates for conditional logistic regression models and their standard errors, and proves a result about average ratio of the asymptotic variance between parametric and non-parametric hazard rate estimates.

Enjoy this classic paper and write some "am I reading this right" code of your own!

Here is my code for the models and plots.

by Lixun Zhang, Data Scientist at Microsoft

As a data scientist, I have experience with R. Naturally, when I was first exposed to Microsoft R Open (MRO, formerly Revolution R Open) and Microsoft R Server (MRS, formerly Revolution R Enterprise), I wanted to know the answers for 3 questions:

- What do R, MRO, and MRS have in common?
- What’s new in MRO and MRS compared with R?
- Why should I use MRO or MRS instead of R?

The publicly available information on MRS either describes it at a high level or explains the specific functions and the underlying algorithms. When they compare R, MRO, and MRS, the materials tend to be high level without many details at the functions and packages level, with which data scientists are most familiar. And they don’t answer the above questions in a comprehensive way. So I designed my own tests (and the code behind the tests is available on GitHub). Below are my answers to the three questions above. MRO has an optional MKL library and unless noted otherwise the observations hold true, whether MKL is installed on MRO or not.

After installing R, MRO, and MRS, you'll notice that everything you can do in R can be done in MRO or MRS. For example, you can use *glm()* to fit a logistic regression and *kmeans()* to carry out cluster analysis. As another example, you can install packages from CRAN. In fact, a package installed in R can be used in MRO or MRS and vice versa if the package is installed in a library tree that's shared among them. You can use the command *.libPaths()* to set and get library trees for R, MRO and MRS. Finally, you can use your favorite IDEs such as RStudio and Visual Studio with RTVS for R, MRO or MRS. In other words, MRO and MRS are 100% compatible with R in terms of functions, packages, and IDEs.

While everything you do in R can done in MRO and MRS, the reverse is not true, due to the additional components in MRO and MRS. MRO allows users to install an optional math library MKL for multithreaded performance. This library shows up as a package named *"RevoUtilsMath"* in MRO.

MRS comes with more packages and functions than R. From the package perspective, most of the additional ones are not on CRAN and are available only after installing MRS. One such example is the RevoScaleR package. MRS also installs the MKL library by default. As for functions, MRS has High Performance Analysis (HPA) version of many base R functions, which are included in the RevoScaleR package. For example, the HPA version of *glm()* is *rxGlm()* and for *kmeans()* it is *rxKmeans()*. These HPA functions can be used in the same way as their base R counterparts with additional options. In addition, these functions can work with a special data format (XDF) that's customized for MRS.

In a nutshell, MRS solves two problems associated with using R: capacity (handling the size of datasets and models) and speed. And MRO solves the problem associated with speed.

The following table summarizes the performance comparisons for R, MRO, and MRS. In terms of capacity, using HPA in MRS increases the size of data that can be analyzed. From the speed perspective, certain matrix related base R functions can perform better in MRO and MRS than base R due to MKL. The HPA functions in MRS perform better than their base R counterparts for large datasets. More details on this comparison can be found in the notebook on GitHub.

It should be noted that while there are packages such as *“bigmemory”* and *“ff”* that help address some of the big data problems, they were not included in the benchmark tests.

For data scientists trying to determine which of these platforms should be used under different scenarios, the following table can be used as a reference. Depending on the amount of data and the availability of MRS's HPA functions, the table summarizes scenarios where R, MRO, and MRS can be used. It can be observed that whenever R can be used, MRO can be used with the additional benefit of multi-thread computation for certain matrix related computations. And MRS can be used whenever R or MRO can be used and it allows the possibility of using HPA functions that provide better performance in terms of both speed and capacity.

Follow the link below for my in-depth comparison of R, MRO and MRS.

Lixun Zhang: Introduction to Microsoft R Open and Microsoft R Server

*by Bob Horton, Senior Data Scientist, Microsoft*

This is a follow-up to my earlier post on learning curves. A learning curve is a plot of predictive error for training and validation sets over a range of training set sizes. Here we’re using simulated data to explore some fundamental relationships between training set size, model complexity, and prediction error.

Start by simulating a dataset:

```
sim_data <- function(N, num_inputs=8, input_cardinality=10){
inputs <- rep(input_cardinality, num_inputs)
names(inputs) <- paste0("X", seq_along(inputs))
as.data.frame(lapply (inputs, function(cardinality)
sample(LETTERS[1:cardinality], N, replace=TRUE)))
}
```

The input columns are named X1, X2, etc.; these are all categorical variables with single capital letters representing the different categories. Cardinality is the number of possible values in the column; our default cardinality of 10 means we sample from the capital letters `A`

through `J`

.

Next we’ll add an outcome variable (`y`

); it has a base level of 100, but if the values in the first two `X`

variables are equal, this is increased by 10. On top of this we add some normally distributed noise.

```
set.seed(123)
data <- sim_data(3e4, input_cardinality=10)
noise <- 2
data <- transform(data, y = ifelse(X1 == X2, 110, 100) +
rnorm(nrow(data), sd=noise))
```

With linear models, we handle an interaction between two categorical variables by adding an interaction term; the number of possibilities in this interaction term is basically the product of the cardinalities. In this simulated data set, only the first two columns affect the outcome, and the other input columns don’t contain any useful information. We’ll use it to demonstrate how adding non-informative variables affects overfitting and training set size requirements.

As in the earlier post, I’ll use the root mean squared error of the predictions as the error function because RMSE is essentially the same as standard deviation. No model should be able to make predictions with a root mean squared error less than the standard deviation of the random noise we added.

`rmse <- function(actual, predicted) sqrt( mean( (actual - predicted)^2 ))`

The cross-validation function trains a model using the supplied formula and modeling function, then tests its performance on a held-out test set. The training set will be sampled from the data available for training; to use approximately a 10% sample of the training data, set `prob_train`

to `0.1`

.

```
cross_validate <- function(model_formula, fit_function, error_function,
validation_set, training_data, prob_train=1){
training_set <- training_data[runif(nrow(training_data)) < prob_train,]
tss <- nrow(training_set)
outcome_var <- as.character(model_formula[[2]])
fit <- fit_function( model_formula, training_set)
training_error <- error_function(training_set[[outcome_var]],
predict(fit, training_set))
validation_error <- error_function(validation_set[[outcome_var]],
predict(fit, validation_set))
data.frame(tss=tss,
formula=deparse(model_formula),
training=training_error,
validation=validation_error,
stringsAsFactors=FALSE)
}
```

Construct a family of formulas, then use `expand_grid`

to make a dataframe with all the combinations of formulas and sampling probabilities:

```
generate_formula <- function(num_inputs, degree=2, outcome="y"){
inputs <- paste0("X", 1:num_inputs)
rhs <- paste0("(", paste(inputs, collapse=" + "), ") ^ ", degree)
paste(outcome, rhs, sep=" ~ ")
}
formulae <- lapply(2:(ncol(data) - 1), generate_formula)
prob <- 2^(seq(0, -6, by=-0.5))
parameter_table <- expand.grid(formula=formulae,
sampling_probability=prob,
stringsAsFactors=FALSE)
```

Separate the training and validation data:

```
validation_fraction <- 0.25
in_validation_set <- runif(nrow(data)) < validation_fraction
vset <- data[in_validation_set,]
tdata <- data[!in_validation_set,]
run_param_row <- function(i){
param <- parameter_table[i,]
cross_validate(formula(param$formula[[1]]), lm, rmse,
vset, tdata, param$sampling_probability[[1]])
}
```

Now call the cross-validate function on each row of the parameter table. The `foreach`

package makes it easy to process these jobs in parallel:

```
library(foreach)
library(doParallel)
```

```
registerDoParallel() # automatically manages cluster
learning_curve_results <- foreach(i=1:nrow(parameter_table)) %dopar% run_param_row(i)
learning_curve_table <- data.table::rbindlist(learning_curve_results)
```

The `rbindlist()`

function from the `data.table`

package puts the results together into a single data frame; this is both cleaner and dramatically faster than the old `do.call("rbind", ...)`

approach (though we’re just combining a small number of rows, so speed is not an issue here).

Now plot the results. Since we’ll do another plot later, I’ll wrap the plotting code in a function to make it more reusable.

```
plot_learning_curve <- function(lct, title, base_error, plot_training_error=TRUE, ...){
library(dplyr)
library(tidyr)
library(ggplot2)
lct_long <- lct %>% gather(type, error, -tss, -formula)
lct_long$type <- relevel(lct_long$type, "validation")
plot_me <- if (plot_training_error) lct_long else lct_long[lct_long$type=="validation",]
ggplot(plot_me, aes(x=log10(tss), y=error, col=formula, linetype=type)) +
ggtitle(title) + geom_hline(yintercept=base_error, linetype=2) +
geom_line(size=1) + xlab("log10(training set size)") + coord_cartesian(...)
}
```

```
plot_learning_curve(learning_curve_table, title="Extraneous variables are distracting",
base_error=noise, ylim=c(0,4))
```

This illustrates the phenomenon that adding more inputs to a model increases the requirements for training data. This is true even if the extra inputs do not contain any information. The cases where the training error is zero are actually rank-deficient (like having fewer equations than unknowns), and if you try this at home you will get warnings to that effect; this is an extreme kind of overfitting. Other learning algorithms might handle this better than `lm`

, but the general idea is that those extra columns are distracting, and it takes more examples to falsify all the spurious correlations that get dredged up from those distractors.

But what if the additional columns considered by the more complex formulas actually did contain predictive information? Keeping the same `X`

-values, we can modify `y`

so that these other columns matter:

```
data <- transform(data, y = 100 + (X1==X2) * 10 +
(X2==X3) * 3 +
(X3==X4) * 3 +
(X4==X5) * 3 +
(X5==X6) * 3 +
(X6==X7) * 3 +
(X7==X8) * 3 +
rnorm(nrow(data), sd=noise))
validation_fraction <- 0.25
in_validation_set <- runif(nrow(data)) < validation_fraction
vset <- data[in_validation_set,]
tdata <- data[!in_validation_set,]
run_param_row <- function(i){
param <- parameter_table[i,]
formula_string <- param$formula[[1]]
prob <- param$sampling_probability[[1]]
cross_validate(formula(formula_string), lm, rmse, vset, tdata, prob)
}
learning_curve_results <-
foreach (i=1:nrow(parameter_table)) %dopar% run_param_row(i)
lct <- data.table::rbindlist(learning_curve_results)
```

This time we’ll leave the training errors off the plot to focus on the validation error; this is what really matters when you are trying to generalize predictions.

```
plot_learning_curve(lct, title="Crossover Point", base_error=noise,
plot_training_error=FALSE, ylim=c(1.5, 5))
```

Now we see another important phenomenon: The simple models that work best with small training sets are out-preformed by more complex models on larger training sets. But these complex models are only usable if they are given sufficient data; plotting a learning curve makes it clear whether you have used sufficient data or not.

Learning curves give valuable insights into the model training process. In some cases this can help you decide to expend effort or expense on gathering more data. In other cases you may discover that your models have learned all they can from just a fraction of the data that is already available. This might encourage you to investigate more complex models that may be capable of learning the finer details of the dataset, possibly leading to better predictions.

These curves can are computationally intensive, as is fitting even a single model on a large dataset in R. Parallelization helped here, but in a future post I’ll show similar patterns in learning curves for much bigger data sets (using real data, rather than synthetic) by taking advantage of the scalable tools of Microsoft R Server.

by Joseph Rickert

Random Forests, the "go to" classifier for many data scientists, is a fairly complex algorithm with many moving parts that introduces randomness at different levels. Understanding exactly how the algorithm operates requires some work, and assessing how good a Random Forests model fits the data is a serious challenge. In the pragmatic world of machine learning and data science, assessing model performance often comes down to calculating the area under the ROC curve (or some other convenient measure) on a hold out set of test data. If the ROC looks good then the model is good to go.

Fortunately, however, goodness of fit issues have a kind of nagging persistence that just won't leave statisticians alone. In a gem of a paper (and here) that sparkles with insight, the authors (Wagner, Hastie and Efron) take considerable care to make things clear to the reader while showing how to calculate confidence intervals for Random Forests models.

Using the high ground approach favored by theorists, Wagner et al. achieve the result about Random Forests by solving a more general problem first: they derive estimates of the variance of bagged predictors that can be computed from the same bootstrap replicates that give the predictors. After pointing out that these estimators suffer from two distinct sources of noise:

- Sampling noise - noise resulting from the randomness of data collection
- Monte Carlo noise - noise that results from using a finite number of bootstrap replicates

they produce bias corrected versions of jackknife and infinitesimal jackknife estimators. A very nice feature of the paper is the way the authors' illustrate the theory with simulation experiments and then describe the simulations in enough detail in an appendix for readers to replicate the results. I generated the following code and figure to replicate Figure 1 of their the first experiment using the authors' GitHub based package, randomForestCI.

Here, I fit a randomForest model to eight features from the UCI MPG data set and use the randomForestInfJack() function to calculate the infinitesimal Jackknife estimator. (The authors use seven features, but the overall shape of the result is the same.)

# Random Forest Confidence Intervals

install.packages("devtools") library(devtools) install_github("swager/randomForestCI")

library(randomForestCI)

library(dplyr) # For data manipulation

library(randomForest) # For random forest ensemble models

library(ggplot2)

# Fetch data from the UCI MAchine Learning Repository

url <-"https://archive.ics.uci.edu/ml/machine-learning-databases/

auto-mpg/auto-mpg.data"

mpg <- read.table(url,stringsAsFactors = FALSE,na.strings="?")

# https://archive.ics.uci.edu/ml/machine-learning-databases/

auto-mpg/auto-mpg.names

names(mpg) <- c("mpg","cyl","disp","hp","weight","accel","year","origin","name")

head(mpg)

# Look at the data and reset some of the data types

dim(mpg); Summary(mpg)

sapply(mpg,class)

mpg <- mutate(mpg, hp = as.numeric(hp),

year = as.factor(year),

origin = as.factor(origin))

head(mpg,2)

#

# Function to divide data into training, and test sets

index <- function(data=data,pctTrain=0.7)

{

# fcn to create indices to divide data into random

# training, validation and testing data sets

N <- nrow(data)

train <- sample(N, pctTrain*N)

test <- setdiff(seq_len(N),train)

Ind <- list(train=train,test=test)

return(Ind)

}

#

set.seed(123)

ind <- index(mpg,0.8)

length(ind$train); length(ind$test)

form <- formula("mpg ~ cyl + disp + hp + weight +

accel + year + origin")

rf_fit <- randomForest(formula=form,data=na.omit(mpg[ind$train,]),

keep.inbag=TRUE) # Build the model

# Plot the error as the number of trees increases

plot(rf_fit)

# Plot the important variables

varImpPlot(rf_fit,col="blue",pch= 2)

# Calculate the Variance

X <- na.omit(mpg[ind$test,-1])

var_hat <- randomForestInfJack(rf_fit, X, calibrate = TRUE)

#Have a look at the variance

head(var_hat); dim(var_hat); plot(var_hat)

# Plot the fit

df <- data.frame(y = mpg[ind$test,]$mpg, var_hat)

df <- mutate(df, se = sqrt(var.hat))

head(df)

p1 <- ggplot(df, aes(x = y, y = y.hat))

p1 + geom_errorbar(aes(ymin=y.hat-se, ymax=y.hat+se), width=.1) +

geom_point() +

geom_abline(intercept=0, slope=1, linetype=2) +

xlab("Reported MPG") +

ylab("Predicted MPG") +

ggtitle("Error Bars for Random Forests")

An interesting feature of the plot is that Random Forests doesn't appear to have the same confidence in all of its estimates, sometimes being less confident about estimates closer to the diagonal than those further away.

Don't forget to include confidence intervals with your next Random Forests model.

*by Verena Haunschmid*

Since I have a cat tracker, I wanted to do some analysis of the behavior of my cats. I have shown how to do some of these things here.

**Data Collection**

The data was collected using the Tractive GPS Pet Tracker over a period of about one year from January 2014 to November 2014 (with breaks). From March to November I additionally took notes in an Excel sheet which cat was carrying the tracker.

**Libraries you need**

If you want to reproduce my example you need the following libraries:

- XML
- plyr
- xlsx
- devtools
- leaflet (installed via devtools::install_github("rstudio/leaflet”))

**Loading the data into R**

There are some methods to read .gpx files in R, but since I just wanted to use it for this one specific file I created my own method:

readTrackingFile<-function(filename) {

library(XML)

library(plyr)

xmlfile <- xmlParse(filename)

xmltop <- xmlRoot(xmlfile)

tracking <- ldply(xmlToList(xmltop[['trk']][['trkseg']]),

function(x) {data.frame(x)

})

tracking <- data.frame("ele"=tracking$ele[seq(1, nrow(tracking), 2)],

"time" = as.character(tracking$time

[seq(1, nrow(tracking), 2)]),

"lat" = tracking$.attrs[seq(1, nrow(tracking), 2)],

"lon" = tracking$.attrs[seq(2, nrow(tracking), 2)])

tracking$ele <- as.numeric(levels(tracking$ele))[tracking$ele]

tracking$lat <- as.numeric(levels(tracking$lat))[tracking$lat]

tracking$lon <- as.numeric(levels(tracking$lon))[tracking$lon]

time_pattern <- "%Y-%m-%dT%H:%M:%SZ"

tracking$time <- strptime(as.character(tracking$time), time_pattern)

tracking$min <- 60*tracking$time$hour + tracking$time$min

message(paste("read", nrow(tracking), "tracking points"))

return(tracking)

}

Then I used this method to read the tracks:

track <- readTrackingFile("../../data/LJWSIZUT.gpx")

And made a rudimentary plot to see where the data was:

# showed that some were far off

plot(track$lon, track$lat, pch=19, cex=0.5)

track <- track[track$lat > 30,]

Since we also used our tracker to track our vacation, I had to filter that out:

time_pattern <- "%Y-%m-%dT%H:%M:%SZ"

vacation_start <- strptime("2015-07-23T04:00:00Z", time_pattern)

vacation_end <- strptime("2015-08-04T22:00:00Z", time_pattern)

track_cat <- track[track$time<vacation_start | track$time>vacation_end,]

To be able to distinguish the tracks I loaded the Excel file I use to take notes. I matched the dates in the Excel file with the ones in my data.frame.

cats <- read.xlsx("../tractive/data/Katzen1.xlsx", sheetIndex=1,

header=FALSE, stringsAsFactors = FALSE)

names(cats) <- c("TrackingDate", "Cat")

cats <- cats[!is.na(cats$Cat),]

time_pattern <- "%d. %B %Y"

cats$TrackingDate <- strptime(paste(cats$TrackingDate, "2015"),

format = time_pattern)

# add cat name

track_cat$cat <- "Unknown"

for (i in 1:nrow(track_cat)) {

cat_idx <- which((cats$TrackingDate$mday ==

track_cat[i,"time"]$mday)

& (cats$TrackingDate$mon+1 ==

track_cat[i,"time"]$mon+1)

& (cats$TrackingDate$year+1900 ==

track_cat[i,"time"]$year+1900))

if (length(cat_idx) == 1) {

track_cat[i,"cat"]<-cats[cat_idx, "Cat"]

}

}

After the vacation I did not take notes anymore because Teddy was the only one who use the tracker:

track_cat[track_cat$time > vacation_end,"cat"] <- "Teddy"

Since there were some points far off and I noticed those also had very wrong elevation, I decided it would make sense to remove all that deviated to far from the elevation. I guess there are other more scientific ways to do this, but it did the trick :)

track_cat_teddy <- track_cat[abs(track_cat$ele-423) < 30 & track_cat$cat=="Teddy",]

**Create a map**

To create the map I used the leaflet package. Since it did not work with the rainbow() function (I guess it can only take color as hex numbers) I defined a vector with colors. I Also defined a vector with the months I had data for. This of course be done in a nicer way (without hard coding the colors and the months), but since I currently only have this data set and don’t need to create anything dynamically, I’ll stick with the easiest way.

monthCol <- c("blue", "green", "yellow", "purple", "brown", "orange")

month < -c("March", "May", "June", "July", "August", "November")

I use the two methods leaflet() and addTiles() without any parameters which creates a default world map with tiles.

If you are wondering about %>%, this is the piping operator from the package magrittr.

catMap <- leaflet() %>% addTiles()

Then I loop over the unique months.

Using <- assigns the variable catMap a new map with a new polyline. The last line, catMap renders the map in the View tab of RStudio.

for (i in 1:length(unique(track_cat_teddy$time$mon))) {

m<-unique(track_cat_teddy$time$mon)[i]

catMap <- catMap %>%

addPolylines(track_cat_teddy[track_cat_teddy$time$mon

==m,"lon"],

track_cat_teddy[track_cat_teddy$time$mon

==m,"lat"],

col=monthCol[i], group=m)

}

catMap

You might note that I added a parameter group to the method and assigned the value of the month.

This value is used to define a control to the map where you can select/unselect certain months. Unfortunately the legend does not display the color of the line.

catMap %>%

addLayersControl(

overlayGroups = month,

options = layersControlOptions(collapsed = FALSE))

You can now use these checkboxes to select/deselect different polylines.

**Where to use this map**

You can do some useful things with this created map.

- You can use R shiny to create nice app that includes your map.
- You can export the HTML from the Viewer window clicking at “Export” > “Save as Web Page...”

Code and further links:

- The code is available in my github repo.
- I recently published my collected data on my own blog.
- The other data is also available in a github repo.

By Joseph Rickert

The ability to generate synthetic data with a specified correlation structure is essential to modeling work. As you might expect, R’s toolbox of packages and functions for generating and visualizing data from multivariate distributions is impressive. The basic function for generating multivariate normal data is mvrnorm() from the MASS package included in base R, although the mvtnorm package also provides functions for simulating both multivariate normal and t distributions. (For tutorial on how to use R to simulate from multivariate normal distributions from first principles using some linear algebra and the Cholesky decomposition see the astrostatistics tutorial on Multivariate Computations.)

The following block of code generates 5,000 draws from a bivariate normal distribution with mean (0,0) and covariance matrix Sigma printed in code. The function kde2d(), also from the Mass package generates a two-dimensional kernel density estimation of the distribution's probability density function.

# SIMULATING MULTIVARIATE DATA # https://stat.ethz.ch/pipermail/r-help/2003-September/038314.html # lets first simulate a bivariate normal sample library(MASS) # Simulate bivariate normal data mu <- c(0,0) # Mean Sigma <- matrix(c(1, .5, .5, 1), 2) # Covariance matrix # > Sigma # [,1] [,2] # [1,] 1.0 0.1 # [2,] 0.1 1.0 # Generate sample from N(mu, Sigma) bivn <- mvrnorm(5000, mu = mu, Sigma = Sigma ) # from Mass package head(bivn) # Calculate kernel density estimate bivn.kde <- kde2d(bivn[,1], bivn[,2], n = 50) # from MASS package

R offers several ways of visualizing the distribution. These next two lines of code overlay a contour plot on a "heat Map" that maps the density of points to a gradient of colors.

This plots the irregular contours of the simulated data. The code below which uses the ellipse() function from the ellipse package generates the classical bivariate normal distribution plot that graces many a textbook.

# Classic Bivariate Normal Diagram library(ellipse) rho <- cor(bivn) y_on_x <- lm(bivn[,2] ~ bivn[,1]) # Regressiion Y ~ X x_on_y <- lm(bivn[,1] ~ bivn[,2]) # Regression X ~ Y plot_legend <- c("99% CI green", "95% CI red","90% CI blue", "Y on X black", "X on Y brown") plot(bivn, xlab = "X", ylab = "Y", col = "dark blue", main = "Bivariate Normal with Confidence Intervals") lines(ellipse(rho), col="red") # ellipse() from ellipse package lines(ellipse(rho, level = .99), col="green") lines(ellipse(rho, level = .90), col="blue") abline(y_on_x) abline(x_on_y, col="brown") legend(3,1,legend=plot_legend,cex = .5, bty = "n")

The next bit of code generates a couple of three dimensional surface plots. The second of which is an rgl plot that you will be able to rotate and view from different perspectives on your screen.

Next, we have some code to unpack the grid coordinates produced by the kernel density estimator and get x y, and z values to plot the surface using the new scatterplot3js() function from the htmlwidgets, javascript threejs package. This visualization does not render the surface with the same level of detail as the rgl plot. Nevertheless, it does show some of the salient features of the pdf and has the distinct advantage of being easily embedded in web pages. I expect that html widget plots will keep getting better and easier to use.

# threejs Javascript plot library(threejs) # Unpack data from kde grid format x <- bivn.kde$x; y <- bivn.kde$y; z <- bivn.kde$z # Construct x,y,z coordinates xx <- rep(x,times=length(y)) yy <- rep(y,each=length(x)) zz <- z; dim(zz) <- NULL # Set up color range ra <- ceiling(16 * zz/max(zz)) col <- rainbow(16, 2/3) # 3D interactive scatter plot scatterplot3js(x=xx,y=yy,z=zz,size=0.4,color = col[ra],bg="black")<

The code that follows uses the rtmvt() function from the tmvtnorm package to generate bivariate t distribution. The rgl plot renders the surface kernel density estimate of the surface in impressive detail.

# Draw from multi-t distribution without truncation library (tmvtnorm) Sigma <- matrix(c(1, .1, .1, 1), 2) # Covariance matrix X1 <- rtmvt(n=1000, mean=rep(0, 2), sigma = Sigma, df=2) # from tmvtnorm package t.kde <- kde2d(X1[,1], X1[,2], n = 50) # from MASS package col2 <- heat.colors(length(bivn.kde$z))[rank(bivn.kde$z)] persp3d(x=t.kde, col = col2)

The real value of the multivariate distribution functions from the data science perspective is to simulate data sets with many more than two variables. The functions we have been considering are up to the task, but there are some technical considerations and, of course, we don't have the same options for visualization. The following code snippet generates 10 variables from a multivariate normal distribution with a specified covariance matrix. Note that I've used the genPositiveDefmat() function from the clusterGeneration package to generate the covariance matrix. This is because mvrnorm() will throw an error, as theory says it should, if the covariance matrix is not positive definite, and guessing a combination of matrix elements to make a high dimensional matrix positive definite would require quite a bit of luck along with some serious computation time.

After generating the matrix, I use the corrplot() function from the corrplot package to produce an attractive pairwise correlation plot that is coded both by shape and color. corrplot() scales pretty well with the number of variables and will give a decent chart with 40 to 50 variables. (Note that now ggcorrplot will do this for ggplot2 plots.) Other plotting options would be to generate pairwise scatter plots and R offers many alternatives for these.

Finally, what about going beyond the multivariate normal and t distributions? R does have a few functions like rlnorm() from the compositions package which generates random variates from the multivariate lognormal distribution that are as easy to use as mvrorm(), but you will have to hunt for them. I think a more fruitful approach if you are serious about probability distributions is to get familiar with the copula package.

R user and developer Lionel Henry proposes a number of changes to R syntax:

**Use square brackets to create lists**. You could use [1, 2:5, "hello"] to create a list of three elements. Nested lists would be possible as well, with syntax like or [ [1, 2], [2, 3] ] (much easier than list(list(1,2),list(2,3))).

**Define lambda functions with square brackets**. You could create an unnamed function like [x] -> log(abs(x)) to transform x to the log of its positive part. This is a little easier than using calling function(x) log(abs(x)) .

**Allow labeled blocks**. This change would make it easier to pass blocks of code into functions. You could write:

test_that("my code works") {

check_equal(A,B)

check_identical(C, D)

}

instead of

test_that("my code works", {

check_equal(A,B)

check_identical(C, D)

})

It would also allow you to create your own functions that have similar syntax (like a custom if statement).

**A native piping operator**. Magrittr's %>% pipe operator would get a new look, so that mtcars |> unlist would be a standard part of the language. This would also make it easier to debug code that used the pipe.

R is a very flexible language that provides many native facilities to extend the language itself. The %>% operator is defined as a "special operator" in the magrittr package, and you can even define Haskell-like list comprehensions for R by redefining the [ operator yourself. That approach won't work Lionel's changes above, though: because they modify R syntax they need to be made at the R source code level. (You can find Lionel's changes in this fork of the R source tree on GitHub.) That means it's fairly unlikely they'll be incorporated by R Core (not least because of the potential for backwards compatibility issues) but you never know.

You can read Lionel's full description of his proposed changes at the blog post linked below. His post doesn't have a comments section, so let us know what you think of the changes in the comments here.

(routines ...): The future of R syntax?

by Joseph Rickert

In a previous post, I showed some elementary properties of discrete time Markov Chains could be calculated, mostly with functions from the markovchain package. In this post, I would like to show a little bit more of the functionality available in that package by fitting a Markov Chain to some data. In this first block of code, I load the gold data set from the forecast package which contains daily morning gold prices in US dollars from January 1, 1985 through March 31, 1989. Next, since there are few missing values in the sequence, I impute them with a simple "ad hoc" process by substituting the previous day's price for one that is missing. There are two statements in the loop because there are a number of instances where there are two missing values in a row. Note that some kind of imputation is necessary because I will want to compute the autocorrelation of the series, and like many R functions acf() does not like NAs. (it doesn't make sense to compute with NAs.)

library(forecast)

library(markovchain)

data(gold) # Load gold time series

# Impute missing values

gold1 <- gold

for(i in 1:length(gold)){

gold1[i] <- ifelse(is.na(gold[i]),gold[i-1],gold[i])

gold1[i] <- ifelse(is.na(gold1[i]),gold1[i-1],gold1[i])

}

plot(gold1, xlab = "Days", ylab = "US$", main = "Gold prices 1/1/85 - 1/31/89")

This is an interesting series with over 1,000 points, but definitely not stationary; so it is not a good candidate for trying to model as a Markov Chain. The series produced by taking first differences is more reasonable. The series flat, oscillating about a mean (0.07) slightly above zero and the autocorrelation trails off as one might expect for a stationary series.

# Take first differences to try and get stationary series

goldDiff <- diff(gold1)

par(mfrow = c(1,2))

plot(goldDiff,ylab="",main="1st differences of gold")

acf(goldDiff)

Next, we set up for modeling by constructing a series of labels. In this analysis, I settle for constructing a two state chain that reflects whether the series of differences assumes positive or non-positive values. Note that I have introduced a few zero values into the series because of my crude imputation process. Here, i just lump them in with the negatives.

# construct a series of labels

goldSign <- vector(length=length(goldDiff))

for (i in 1:length(goldDiff)){

goldSign[i] <- ifelse(goldDiff[i] > 0, "POS","NEG")

}

Next, we can use some of the statistical tests built into the markovchain package to assess our assumptions so far. The function verifyMarkovProperty() attempts to verify that a sequence satisfies the Markov property by performing Chi squared tests on a series of contingency tables where the columns are sequences of past to present to future transitions and the rows are sequences of state transitions. Large p values indicate that one should not reject the null hypothesis that the Markov property holds for a specific transition. The output of verifyMarkovProperty() is a list with an entry for each possible state transition. The package vignette An introduction to the markovchain package presents the details of how this works. The following shows the output for the NEG to POS transitions.

# Verify the Markov property

vmp <- verifyMarkovProperty(goldSign)

vmp[2]

# $NEGPOS

# $NEGPOS$statistic

# X-squared

# 0

#

# $NEGPOS$parameter

# df

# 1

#

# $NEGPOS$p.value

# [1] 1

#

# $NEGPOS$method

# [1] "Pearson's Chi-squared test with Yates' continuity correction"

#

# $NEGPOS$data.name

# [1] "table"

#

# $NEGPOS$observed

# SSO TSO-SSO

# NEG 138 116

# POS 164 139

#

# $NEGPOS$expected

# SSO TSO-SSO

# NEG 137.7163 116.2837

# POS 164.2837 138.7163

#

# $NEGPOS$residuals

# SSO TSO-SSO

# NEG 0.02417181 -0.02630526

# POS -0.02213119 0.02408452

#

# $NEGPOS$stdres

# SSO TSO-SSO

# NEG 0.04843652 -0.04843652

# POS -0.04843652 0.04843652

#

# $NEGPOS$table

# SSO TSO

# NEG 138 254

# POS 164 303

The assessOrder() function uses a Chi squared test to test that hypothesis that the sequence is consistent with a first order Markov Chain

assessOrder(goldSign)

# The assessOrder test statistic is: 0.4142521

# the Chi-Square d.f. are: 2 the p-value is: 0.8129172

# $statistic

# [1] 0.4142521

#

# $p.value

# [1] 0.8129172

There is an additional function assessStationarity() to test for the stationarity of the sequence. However in this case, the chisq.test() function at the heart of things reports that the p-values are unreliable.

Next, we use the markovchainFit() function to fit a Markov Chain to the data using the maximum likelihood estimator explained in the vignette mentioned above. The output from the function includes the estimated transition matrix, the estimated error and lower and upper endpoint transition matrices which provide a confidence interval for the transition matrix.

goldMC <- markovchainFit(data = goldSign, method="mle", name = "gold mle")

goldMC$estimate

# gold mle

# A 2 - dimensional discrete Markov Chain with following states

# NEG POS

# The transition matrix (by rows) is defined as follows

# NEG POS

# NEG 0.4560144 0.5439856

# POS 0.5519126 0.4480874

goldMC$standardError

# NEG POS

# NEG 0.02861289 0.03125116

# POS 0.03170655 0.02856901

goldMC$confidenceInterval

# [1] 0.95

#

# $lowerEndpointMatrix

# NEG POS

# NEG 0.4089504 0.4925821

# POS 0.4997599 0.4010956

#

# $upperEndpointMatrix

# NEG POS

# NEG 0.5030784 0.5953892

# POS 0.6040652 0.4950793

The transition matrix does show some interesting structure with it being more likely for the chain to go from a negative to a positive value than to stay negative. And, once it is positive, the chain is more likely to stay positive than go negative.

Finally, we use the predict() function to produce a three day, look ahead forecast for the situation where the series has been negative for the last two days.

predict(object = goldMC$estimate, newdata = c("POS","POS"),n.ahead=3)

#"NEG" "POS" "NEG"

I still have not exhausted what is in the markovchain package. Perhaps, in an other post I will look at the functions for continuous time chains. What is presented here though should be enough to have some fun hunting for Markov Chains in all kinds of data.