by John Mount Ph. D.

Data Scientist at Win-Vector LLC

In part 2 of her series on Principal Components Regression, Dr. Nina Zumel illustrates so-called *y*-aware techniques. These often-neglected methods use the fact that for predictive modeling problems we know the dependent variable (the outcome, or *y*), so we can use it during data preparation *in addition to* using it during modeling. Dr. Zumel shows that incorporating *y*-aware preparation into Principal Components Analysis can capture more of the problem structure in fewer variables. Such methods include:

- Effects-based variable pruning
- Significance-based variable pruning
- Effects-based variable scaling

This recovers more domain structure and leads to better models. Using the foundation set in the first article, Dr. Zumel quickly shows how to move from a traditional *x*-only analysis, which fails to preserve a domain-specific relationship between two variables and the outcome, to a *y*-aware analysis that preserves the relationship. In other words, she shows how to move away from a middling result where different values of *y* (rendered as three colors) are hopelessly intermingled when plotted against the first two found latent variables, as shown below.

Dr. Zumel shows how to perform a decisive analysis where *y* is somewhat sortable by each of the first two latent variables *and* the first two latent variables capture complementary effects, making them good mutual candidates for further modeling (as shown below).

Click here (part 2 *y*-aware methods) for the discussion, examples, and references. Part 1 (*x* only methods) can be found here.

Apache Spark, the open-source cluster computing framework, will soon see a major update with the upcoming release of Spark 2.0. This update promises to be faster than Spark 1.6, thanks to a run-time compiler that generates optimized bytecode. It also promises to be easier for developers to use, with streamlined APIs and a more complete SQL implementation. (Here's a tutorial on using SQL with Spark.) Spark 2.0 will also include a new "structured streaming" API, which will allow developers to write algorithms for streaming data without having to worry about the fact that streaming data is always incomplete; algorithms written for complete DataFrame objects will work for streams as well.

This update also includes some news for R users. First, the DataFrame object continues to be the primary interface for R (and Python) users. Although the Dataset structure in Spark is more general, using the single-table DataFrame construct makes sense for R and Python, which have analogous native (or near-native, in Python's case) data structures. In addition, Spark 2.0 is set to add a few new distributed statistical modeling algorithms: generalized linear models (in addition to the Normal least-squares and logistic regression models in Spark 1.6); Naive Bayes; survival (censored) regression; and K-means clustering. The addition of survival regression is particularly interesting. It's a type of model used in situations where the outcome isn't always completely known: for example, some (but not all) patients may not yet have experienced remission in a cancer study. It's also used for reliability analysis and lifetime estimation in manufacturing, where some (but not all) parts may have failed by the end of the observation period. To my knowledge, this is the first distributed implementation of the survival regression algorithm.

For R users, these models can be applied to large data sets stored in Spark DataFrames, and then computed using the Spark distributed computing framework. Access to the algorithms is via the SparkR package, which hews closely to standard R interfaces for model training. You'll first need to create a SparkDataFrame object in R as a reference to a Spark DataFrame. Then, to perform logistic regression (for example), you'll use R's standard glm function with the SparkDataFrame object as the data argument. (This elegantly uses R's object-oriented dispatch architecture to call the Spark-specific GLM code.) The example below creates a Spark DataFrame, and then uses Spark to fit a logistic regression to it:

```
df <- createDataFrame(sqlContext, iris)
model <- glm(Species ~ Sepal_Length + Sepal_Width, data = df, family = "binomial")
```

The `model` object contains most (but not all) of the output created by R's traditional glm function, so most standard R functions that work with GLM objects work here as well.
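For example, the fitted object can be inspected and used for scoring much like a local model. A minimal sketch, assuming the `df` SparkDataFrame created above (exact output fields vary by Spark version):

```r
# Inspect the fitted coefficients, as with a local glm fit.
summary(model)

# Score a SparkDataFrame with the distributed model; head() brings
# the first few predictions back as a local data frame.
preds <- predict(model, newData = df)
head(preds)
```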

For more on what's coming in Spark 2.0, check out the Databricks blog post below; the preview documentation for SparkR contains more info as well. Also, you might want to check out how you can use Spark on HDInsight with Microsoft R Server.

Databricks: Preview of Apache Spark 2.0 now on Databricks Community Edition

by John Mount Ph. D.

Data Scientist at Win-Vector LLC

Win-Vector LLC's Dr. Nina Zumel has just started a two part series on Principal Components Regression that we think is well worth your time. You can read her article here. Principal Components Regression (PCR) is the use of Principal Components Analysis (PCA) as a dimension reduction step prior to linear regression. It is one of the best known dimensionality reduction techniques and a staple procedure in many scientific fields. PCA is used because:

- It can find important latent structure and relations.
- It can reduce over fit.
- It can ease the curse of dimensionality.
- It is used in a ritualistic manner in many scientific disciplines. In some fields it is considered ignorant and uncouth to regress using original variables.

We often find ourselves having to remind readers that this last reason is not actually a positive. The standard derivation of PCA involves trotting out the math and showing the determination of eigenvector directions. It yields visually attractive diagrams such as the following.

Wikipedia: PCA

And this leads to a deficiency in much of the teaching of the method: glossing over the operational consequences and outcomes of applying the method. The mathematics is important to the extent it allows you to reason about the appropriateness of the method, the consequences of the transform, and the pitfalls of the technique. The mathematics is also critical to the correct implementation, but that is what one hopes is *already* supplied in a reliable analysis platform (such as R).
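In R, that reliable implementation requires almost no ceremony. A minimal *x*-only PCA on toy data (illustrative values only):

```r
# Minimal x-only PCA on a small synthetic data frame.
set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
pc <- prcomp(d, center = TRUE, scale. = TRUE)
summary(pc)  # proportion of variance explained by each component
head(pc$x)   # the data projected onto the principal components
```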

Dr. Zumel uses the expressive and graphical power of R to work through the *use* of Principal Components Regression in an operational series of examples. She works through how Principal Components Regression is typically mis-applied and continues on to how to correctly apply it. Taking the extra time to work through the all too common errors allows her to demonstrate and quantify the benefits of correct technique. Dr. Zumel will soon follow part 1 with a shorter part 2 article demonstrating important "*y*-aware" techniques that squeeze much more modeling power out of your data in predictive analytic situations (which is what regression actually is). Some of the methods are already in the literature, but are still not used widely enough. We hope the demonstrated techniques and included references will give you a perspective to improve how you use or even teach Principal Components Regression. Please read on here.

*by Katherine Zhao, Hong Lu, Zhongmou Li, Data Scientists at Microsoft*

Bicycle rental has become popular as a convenient and environmentally friendly transportation option. Accurate estimation of bike demand at different locations and different times would help bicycle-sharing systems better meet rental demand and allocate bikes to locations.

In this blog post, we walk through how to use Microsoft R Server (MRS) to build a regression model to predict bike rental demand. In the example below, we demonstrate an end-to-end machine learning solution development process in MRS, including data importing, data cleaning, feature engineering, parameter sweeping, and model training and evaluation.

The Bike Rental UCI dataset is used as the input raw data for this sample. This dataset is based on real-world data from the Capital Bikeshare company, which operates a bike rental network in Washington DC in the United States.

The dataset contains 17,379 rows and 17 columns, with each row representing the number of bike rentals within a specific hour of a day in the years 2011 or 2012. Weather conditions (such as temperature, humidity, and wind speed) are included in the raw data, and the dates are categorized as holiday vs. weekday, etc.

The field to predict is **cnt**, which contains a count value ranging from 1 to 977, representing the number of bike rentals within a specific hour.

In this example, we use historical bike rental counts as well as the weather condition data to predict the number of bike rentals within a specific hour in the future. We approach this problem as a regression problem, since the label column (number of rentals) contains continuous real numbers.

Along this line, we split the raw data into two parts: data records in year 2011 to learn the regression model, and data records in year 2012 to score and evaluate the model. Specifically, we employ the Decision Forest Regression algorithm as the regression model and build two models on different feature sets. Finally, we evaluate their prediction performance. We will elaborate the details in the following sections.

We build the models using the `RevoScaleR` library in MRS. The `RevoScaleR` library provides extremely fast statistical analysis on terabyte-class datasets without needing specialized hardware. `RevoScaleR`'s distributed computing capabilities let you run the same `RevoScaleR` commands in a different (possibly remote) compute context while managing and analyzing local data in the same way. A wide range of `rx`-prefixed functions provide functionality for:

- Accessing external data sets (SAS, SPSS, ODBC, Teradata, and delimited and fixed format text) for analysis in R.
- Efficiently storing and retrieving data in a high-performance data file.
- Cleaning, exploring, and manipulating data.
- Fast, basic statistical analysis.
- Training and scoring advanced machine learning models.

Overall, there are five major steps of building this example using Microsoft R Server:

- Step 1: Import and Clean Data
- Step 2: Perform Feature Engineering
- Step 3: Prepare Training, Test and Score Datasets
- Step 4: Sweep Parameters and Train Regression Models
- Step 5: Test, Evaluate, and Compare Models

First, we import the Bike Rental UCI dataset. Since a small portion of the records in the dataset are missing, we use `rxDataStep()` to replace the missing records with the latest non-missing observations. `rxDataStep()` is a commonly used function for data manipulation: it transforms the input dataset chunk by chunk and saves the results to the output dataset.

```
# Define the transformation function for rxDataStep.
xform <- function(dataList) {
    # Identify the features with missing values.
    featureNames <- c("weathersit", "temp", "atemp", "hum", "windspeed", "cnt")
    # Use the "na.locf" function to carry forward the last observation.
    dataList[featureNames] <- lapply(dataList[featureNames], zoo::na.locf)
    # Return the data list.
    return(dataList)
}

# Use rxDataStep to replace missings with the latest non-missing observations.
cleanXdf <- rxDataStep(inData = mergeXdf, outFile = outFileClean, overwrite = TRUE,
                       # Apply the "last observation carried forward" operation.
                       transformFunc = xform,
                       # Identify the features to apply the transformation to.
                       transformVars = c("weathersit", "temp", "atemp",
                                         "hum", "windspeed", "cnt"),
                       # Drop the "dteday" feature.
                       varsToDrop = "dteday")
```

**Step 2: Perform Feature Engineering**
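For reference, `zoo::na.locf()` ("last observation carried forward") fills each missing value with the most recent non-missing one. On a plain vector:

```r
library(zoo)

x <- c(1, NA, NA, 4, NA)
na.locf(x)  # 1 1 1 4 4
```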

In addition to the original features in the raw data, we add the number of bikes rented in each of the previous 12 hours as features, to provide better predictive power. We create a `computeLagFeatures()` helper function to compute the 12 lag features and use it as the transformation function in `rxDataStep()`.

Note that `rxDataStep()` processes data chunk by chunk, and lag feature computation requires data from previous rows. In `computeLagFeatures()`, we use the internal function `.rxSet()` to save the last *n* rows of a chunk to a variable **lagData**. When processing the next chunk, we use another internal function, `.rxGet()`, to retrieve the values stored in **lagData**.

```
# Add the number of bikes that were rented in each of
# the previous 12 hours as 12 lag features.
computeLagFeatures <- function(dataList) {
    # Total number of lags that need to be added.
    numLags <- length(nLagsVector)
    # Lag feature names as lagN.
    varLagNameVector <- paste("cnt_", nLagsVector, "hour", sep = "")

    # Set the value of an object "storeLagData" in the transform environment.
    if (!exists("storeLagData")) {
        lagData <- mapply(rep, dataList[[varName]][1], times = nLagsVector)
        names(lagData) <- varLagNameVector
        .rxSet("storeLagData", lagData)
    }

    if (!.rxIsTestChunk) {
        for (iL in 1:numLags) {
            # Number of rows in the current chunk.
            numRowsInChunk <- length(dataList[[varName]])
            nlags <- nLagsVector[iL]
            varLagName <- paste("cnt_", nlags, "hour", sep = "")
            # Retrieve lag data from the previous chunk.
            lagData <- .rxGet("storeLagData")
            # Concatenate lagData and the "cnt" feature.
            allData <- c(lagData[[varLagName]], dataList[[varName]])
            # Take the first N rows of allData, where N is
            # the total number of rows in the original dataList.
            dataList[[varLagName]] <- allData[1:numRowsInChunk]
            # Save the last nlags rows as the new lagData to be used
            # when processing the next chunk.
            lagData[[varLagName]] <- tail(allData, nlags)
            .rxSet("storeLagData", lagData)
        }
    }
    return(dataList)
}

# Apply "computeLagFeatures" to the bike data.
lagXdf <- rxDataStep(inData = cleanXdf, outFile = outFileLag,
                     transformFunc = computeLagFeatures,
                     transformObjects = list(varName = "cnt",
                                             nLagsVector = seq(12)),
                     transformVars = "cnt", overwrite = TRUE)
```

Before training the regression model, we split data into two parts: data records in year 2011 to learn the regression model, and data records in year 2012 to score and evaluate the model. In order to obtain the best combination of parameters for regression models, we further divide year 2011 data into training and test datasets: 80% of the records are randomly selected to train regression models with various combinations of parameters, and the remaining 20% are used to evaluate the models obtained and determine the optimal combination.

```
# Split data by "yr" so that the training and test data contain records
# for the year 2011 and the score data contains records for 2012.
rxSplit(inData = lagXdf, outFilesBase = paste0(td, "/modelData"),
        splitByFactor = "yr", overwrite = TRUE,
        reportProgress = 0, verbose = 0)

# Point to the .xdf files for the training & test and score sets.
trainTest <- RxXdfData(paste0(td, "/modelData.yr.0.xdf"))
score <- RxXdfData(paste0(td, "/modelData.yr.1.xdf"))

# Randomly split records for the year 2011 into training and test sets
# for sweeping parameters: 80% of the data as training and 20% as test.
rxSplit(inData = trainTest, outFilesBase = paste0(td, "/sweepData"),
        outFileSuffixes = c("Train", "Test"), splitByFactor = "splitVar",
        overwrite = TRUE,
        transforms = list(splitVar = factor(
            sample(c("Train", "Test"), size = .rxNumRows,
                   replace = TRUE, prob = c(.80, .20)),
            levels = c("Train", "Test"))),
        rngSeed = 17, consoleOutput = TRUE)

# Point to the .xdf files for the training and test sets.
train <- RxXdfData(paste0(td, "/sweepData.splitVar.Train.xdf"))
test <- RxXdfData(paste0(td, "/sweepData.splitVar.Test.xdf"))
```

In this step, we construct two training datasets based on the same raw input data, but with different sets of features:

- Set A = weather + holiday + weekday + weekend features for the predicted day
- Set B = Set A + number of bikes rented in each of the previous 12 hours, which captures very recent demand for the bikes

In order to perform parameter sweeping, we create a helper function to evaluate the performance of a model trained with a given combination of number of trees and maximum depth. We use *Root Mean Squared Error (RMSE)* as the evaluation metric.

```
# Define a function to train and test models with given parameters
# and then return Root Mean Squared Error (RMSE) as the performance metric.
TrainTestDForestfunction <- function(trainData, testData, form, numTrees, maxD) {
    # Build a decision forest regression model with the given parameters.
    dForest <- rxDForest(form, data = trainData, method = "anova",
                         maxDepth = maxD, nTree = numTrees, seed = 123)
    # Predict the number of bike rentals on the test data.
    rxPredict(dForest, data = testData, predVarNames = "cnt_Pred",
              residVarNames = "cnt_Resid", overwrite = TRUE,
              computeResiduals = TRUE)
    # Calculate the RMSE.
    result <- rxSummary(~ cnt_Resid, data = testData, summaryStats = "Mean",
                        transforms = list(cnt_Resid = cnt_Resid^2))$sDataFrame
    # Return the number of trees, maximum depth, and RMSE.
    return(c(numTrees, maxD, sqrt(result[1, 2])))
}
```

The following is another helper function to sweep and select the optimal parameter combination. Under the local parallel compute context (`rxSetComputeContext(RxLocalParallel())`), `rxExec()` executes multiple runs of model training and evaluation with different parameters in parallel, which significantly speeds up parameter sweeping. When used in a compute context with multiple nodes, e.g. high-performance computing clusters or Hadoop, `rxExec()` can be used to distribute a large number of tasks to the nodes and run the tasks in parallel.

```
# Define a function to sweep and select the optimal parameter combination.
findOptimal <- function(DFfunction, train, test, form, nTreeArg, maxDepthArg) {
    # Sweep different combinations of parameters.
    sweepResults <- rxExec(DFfunction, train, test, form,
                           rxElemArg(nTreeArg), rxElemArg(maxDepthArg))
    # Sort the nested list by the third element (RMSE) in ascending order.
    sortResults <- sweepResults[order(unlist(lapply(sweepResults, `[[`, 3)))]
    # Select the optimal parameter combination.
    nTreeOptimal <- sortResults[[1]][1]
    maxDepthOptimal <- sortResults[[1]][2]
    # Return the optimal values.
    return(c(nTreeOptimal, maxDepthOptimal))
}
```

A large number of parameter combinations is usually swept through in the modeling process. For demonstration purposes, we use 9 combinations of parameters in this example.
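The sweep vectors `numTreesToSweep` and `maxDepthToSweep` used in the next step are not shown in this excerpt; one way to lay out 9 combinations is a 3 × 3 grid (the specific values here are our own assumption, not the original settings):

```r
# Hypothetical sweep values: a 3 x 3 grid gives 9 combinations.
grid <- expand.grid(nTree = c(20, 40, 60), maxDepth = c(6, 8, 10))
numTreesToSweep <- grid$nTree    # 20 40 60 20 40 60 20 40 60
maxDepthToSweep <- grid$maxDepth # 6 6 6 8 8 8 10 10 10
```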

Next, we find the best parameter combination and get the optimal regression model for each training dataset. For simplicity, we only present the process for Set A.

```
# Set A = weather + holiday + weekday + weekend features for the predicted day.
# Build a formula for the regression model, removing "yr"
# (which is used to split the training and test data).
newHourFeatures <- paste("cnt_", seq(12), "hour", sep = "")  # the hourly lags
formA <- formula(train, depVars = "cnt",
                 varsToDrop = c("splitVar", newHourFeatures))

# Find the optimal parameters for Set A.
optimalResultsA <- findOptimal(TrainTestDForestfunction,
                               train, test, formA,
                               numTreesToSweep, maxDepthToSweep)

# Use the optimal parameters to fit a model for feature Set A.
nTreeOptimalA <- optimalResultsA[[1]]
maxDepthOptimalA <- optimalResultsA[[2]]
dForestA <- rxDForest(formA, data = trainTest, method = "anova",
                      maxDepth = maxDepthOptimalA, nTree = nTreeOptimalA,
                      importance = TRUE, seed = 123)
```

Finally, we plot the dot charts of the variable importance and the out-of-bag error rates for the two optimal decision forest models.

In this step, we use the `rxPredict()` function to predict the bike rental demand on the score dataset, and compare the two regression models over three performance metrics: *Mean Absolute Error (MAE)*, *Root Mean Squared Error (RMSE)*, and *Relative Absolute Error (RAE)*.
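For reference, the three metrics as computed in this example reduce to simple functions of the residuals (actual minus predicted) and the actual counts. A plain-R sketch:

```r
# Plain-R sketches of the three metrics, given a vector of
# residuals (actual - predicted) and of actual counts.
mae  <- function(resid) mean(abs(resid))
rmse <- function(resid) sqrt(mean(resid^2))
# In this example RAE is computed as the mean of |residual| / actual count.
rae  <- function(resid, actual) mean(abs(resid) / actual)
```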

```
# Set A: predict the number of bike rentals on the score dataset.
rxPredict(dForestA, data = score, predVarNames = "cnt_Pred_A",
          residVarNames = "cnt_Resid_A", overwrite = TRUE,
          computeResiduals = TRUE)

# Set B: predict the number of bike rentals on the score dataset.
rxPredict(dForestB, data = score, predVarNames = "cnt_Pred_B",
          residVarNames = "cnt_Resid_B", overwrite = TRUE,
          computeResiduals = TRUE)

# Calculate three statistical metrics:
# Mean Absolute Error (MAE),
# Root Mean Squared Error (RMSE), and
# Relative Absolute Error (RAE).
sumResults <- rxSummary(~ cnt_Resid_A_abs + cnt_Resid_A_2 + cnt_rel_A +
                          cnt_Resid_B_abs + cnt_Resid_B_2 + cnt_rel_B,
                        data = score, summaryStats = "Mean",
                        transforms = list(cnt_Resid_A_abs = abs(cnt_Resid_A),
                                          cnt_Resid_A_2 = cnt_Resid_A^2,
                                          cnt_rel_A = abs(cnt_Resid_A)/cnt,
                                          cnt_Resid_B_abs = abs(cnt_Resid_B),
                                          cnt_Resid_B_2 = cnt_Resid_B^2,
                                          cnt_rel_B = abs(cnt_Resid_B)/cnt)
                        )$sDataFrame

# Add row names.
features <- c("baseline: weather + holiday + weekday + weekend features for the predicted day",
              "baseline + previous 12 hours demand")

# List all metrics in a data frame.
metrics <- data.frame(Features = features,
                      MAE = c(sumResults[1, 2], sumResults[4, 2]),
                      RMSE = c(sqrt(sumResults[2, 2]), sqrt(sumResults[5, 2])),
                      RAE = c(sumResults[3, 2], sumResults[6, 2]))
```

Based on all three metrics listed below, the regression model built on feature set B outperforms the one built on feature set A. This result is not surprising: as the variable importance chart shows, the lag features play a critical part in the regression model, and adding this set of features leads to better performance.

Feature Set | MAE | RMSE | RAE
---|---|---|---
A | 101.34848 | 146.9973 | 0.9454142
B | 62.48245 | 105.6198 | 0.3737669

Follow this link for source code and datasets: Bike Rental Demand Estimation with Microsoft R Server

by Yuzhou Song, Microsoft Data Scientist

R is an open source, statistical programming language with millions of users in its community. However, a well-known weakness of R is that it is both single-threaded and memory-bound, which limits its ability to process big data. With Microsoft R Server (MRS), the enterprise-grade distribution of R for advanced analytics, users can continue to work in their preferred R environment with the following benefits: the ability to scale to data of any size, and potential speed increases of up to one hundred times over open source R.

In this article, we give a walk-through of how to build a gradient boosted tree using MRS. We use a simple fraud data set with approximately 1 million records and 9 columns. The last column, "fraudRisk", is the tag: 0 stands for non-fraud and 1 stands for fraud. The following is a snapshot of the data.

**Step 1: Import the Data**

To begin, load the "RevoScaleR" package and specify the directory and name of the data file. Note that this demo runs in local R, so we need to load the "RevoScaleR" package explicitly.

```
library(RevoScaleR)

data.path <- "./"
file.name <- "ccFraud.csv"
fraud_csv_path <- file.path(data.path, file.name)
```

Next, we make a data source using the RxTextData function. Its output is a data source declaring the type of each field, the name of each field, and the location of the data.

```
colClasses <- c("integer", "factor", "factor", "integer",
                "numeric", "integer", "integer", "numeric", "factor")
names(colClasses) <- c("custID", "gender", "state", "cardholder",
                       "balance", "numTrans", "numIntlTrans",
                       "creditLine", "fraudRisk")
fraud_csv_source <- RxTextData(file = fraud_csv_path, colClasses = colClasses)
```

With this data source ("fraud_csv_source" above), we can pass it to other functions, just like passing a data frame into regular R functions. For example, we can pass it to the rxGetInfo function to check the information of the data:

```
rxGetInfo(fraud_csv_source, getVarInfo = TRUE, numRows = 5)
```

and you will get the following:

**Step 2: Process Data**

Next, we demonstrate how to use an important function, rxDataStep, to process data before training. Generally, the rxDataStep function transforms data from an input data set to an output data set. There are three main arguments: inData, outFile and transformFunc. inData takes either a data source (made by RxTextData as shown in step 1) or a data frame. outFile takes a data source specifying the output file name, schema and location; if outFile is empty, rxDataStep returns a data frame. transformFunc takes the function that rxDataStep will use to perform the transformation. If that function needs arguments beyond the input data source/frame, you can supply them through the transformObjects argument.

Here, as an example, we make a training flag using the rxDataStep function. The data source fraud_csv_source made in step 1 is used for inData. We create an output data source specifying the output file name "ccFraudFlag.csv":

```
fraud_flag_csv_source <- RxTextData(file = file.path(data.path, "ccFraudFlag.csv"))
```

Also, we create a simple transformation function called "make_train_flag". It creates a training flag that will be used to split the data into training and testing sets:

```
make_train_flag <- function(data){
  data <- as.data.frame(data)
  set.seed(34)
  data$trainFlag <- sample(c(0, 1), size = nrow(data),
                           replace = TRUE, prob = c(0.3, 0.7))
  return(data)
}
```

Then, use rxDataStep to complete the transformation:

```
rxDataStep(inData = fraud_csv_source,
           outFile = fraud_flag_csv_source,
           transformFunc = make_train_flag, overwrite = TRUE)
```

Again, we can check the output file information by using the rxGetInfo function:

```
rxGetInfo(fraud_flag_csv_source, getVarInfo = TRUE, numRows = 5)
```

We can see that the trainFlag column has been appended as the last column:

Based on the trainFlag, we split the data into training and testing sets. Thus, we need to specify the data sources for the output:

```
train_csv_source <- RxTextData(file = file.path(data.path, "train.csv"))
test_csv_source <- RxTextData(file = file.path(data.path, "test.csv"))
```

Instead of creating a transformation function, we can simply specify the rowSelection argument in rxDataStep function to select rows satisfying certain conditions:

```
rxDataStep(inData = fraud_flag_csv_source,
           outFile = train_csv_source,
           reportProgress = 0,
           rowSelection = (trainFlag == 1),
           overwrite = TRUE)

rxDataStep(inData = fraud_flag_csv_source,
           outFile = test_csv_source,
           reportProgress = 0,
           rowSelection = (trainFlag == 0),
           overwrite = TRUE)
```

A well-known problem with fraud data is the extremely skewed distribution of labels: most transactions are legitimate, while only a very small proportion are fraudulent. In the original data, the good/bad ratio is about 15:1. Directly using the original data to train a model results in poor performance, since the model is unable to find the proper boundary between "good" and "bad". A simple but effective solution is to randomly down-sample the majority class. The following down_sample transformation function down-samples the majority class to a good/bad ratio of 4:1. The down-sample ratio is pre-selected based on prior knowledge, but could also be tuned via cross validation.

```
down_sample <- function(data){
  data <- as.data.frame(data)
  data_bad <- subset(data, fraudRisk == 1)
  data_good <- subset(data, fraudRisk == 0)
  # good to bad ratio 4:1
  rate <- nrow(data_bad) * 4 / nrow(data_good)
  set.seed(34)
  data_good$keepTag <- sample(c(0, 1), replace = TRUE,
                              size = nrow(data_good), prob = c(1 - rate, rate))
  data_good_down <- subset(data_good, keepTag == 1)
  data_good_down$keepTag <- NULL
  data_down <- rbind(data_bad, data_good_down)
  data_down$trainFlag <- NULL
  return(data_down)
}
```

Then, we specify the down-sampled training data source and use the rxDataStep function again to complete the down-sampling process:

```
train_downsample_csv_source <- RxTextData(
  file = file.path(data.path, "train_downsample.csv"),
  colClasses = colClasses)

rxDataStep(inData = train_csv_source,
           outFile = train_downsample_csv_source,
           transformFunc = down_sample,
           reportProgress = 0,
           overwrite = TRUE)
```

**Step 3: Training**

In this step, we take the down-sampled data to train a gradient boosted tree. We first use the rxGetVarNames function to get all variable names in the training set; the input is still the data source of the down-sampled training data. Then we use them to create a formula that will be used later:

```
training_vars <- rxGetVarNames(train_downsample_csv_source)
training_vars <- training_vars[!(training_vars %in% c("fraudRisk", "custID"))]
formula <- as.formula(paste("fraudRisk~", paste(training_vars, collapse = "+")))
```
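With the column names declared in step 1 (and trainFlag already dropped by down_sample), the pasted formula works out as a quick plain-R illustration:

```r
# The predictor columns remaining after dropping fraudRisk and custID.
vars <- c("gender", "state", "cardholder", "balance",
          "numTrans", "numIntlTrans", "creditLine")
as.formula(paste("fraudRisk~", paste(vars, collapse = "+")))
# fraudRisk ~ gender + state + cardholder + balance + numTrans + numIntlTrans + creditLine
```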

The rxBTrees function is used to build the gradient boosted tree model. The formula argument specifies the label column and predictor columns. The data argument takes a data source as input for training. The lossFunction argument specifies the distribution of the label column: "bernoulli" for numeric 0/1 regression, "gaussian" for numeric regression, and "multinomial" for classification with two or more classes. Here we choose "multinomial", since this is a 0/1 classification problem. The other parameters are pre-selected, not finely tuned:

```
boosted_fit <- rxBTrees(formula = formula,
                        data = train_downsample_csv_source,
                        learningRate = 0.2,
                        minSplit = 10,
                        minBucket = 10,
                        # small number of trees for testing purposes
                        nTree = 20,
                        seed = 5,
                        lossFunction = "multinomial",
                        reportProgress = 0)
```

**Step 4: Prediction and Evaluation**

We use the rxPredict function to predict on the testing data set, but first use the rxImport function to import the testing data set:

```
test_data <- rxImport(test_csv_source)
```

Then, we take the imported testing set and fitted model object as input for rxPredict function:

```
predictions <- rxPredict(modelObject = boosted_fit,
                         data = test_data,
                         type = "response",
                         overwrite = TRUE,
                         reportProgress = 0)
```

In the rxPredict function, type = "response" outputs predicted probabilities. Finally, we pick 0.5 as the threshold and evaluate the performance:

```
threshold <- 0.5
predictions <- data.frame(predictions$X1_prob)
names(predictions) <- c("Boosted_Probability")
predictions$Boosted_Prediction <- ifelse(
  predictions$Boosted_Probability > threshold, 1, 0)
predictions$Boosted_Prediction <- factor(
  predictions$Boosted_Prediction,
  levels = c(1, 0))
scored_test_data <- cbind(test_data, predictions)
```

```
evaluate_model <- function(data, observed, predicted) {
  confusion <- table(data[[observed]], data[[predicted]])
  print(confusion)
  tp <- confusion[1, 1]
  fn <- confusion[1, 2]
  fp <- confusion[2, 1]
  tn <- confusion[2, 2]
  accuracy <- (tp + tn) / (tp + fn + fp + tn)
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  fscore <- 2 * (precision * recall) / (precision + recall)
  metrics <- c("Accuracy" = accuracy,
               "Precision" = precision,
               "Recall" = recall,
               "F-Score" = fscore)
  return(metrics)
}
```


```
roc_curve <- function(data, observed, predicted) {
  data <- data[, c(observed, predicted)]
  data[[observed]] <- as.numeric(as.character(data[[observed]]))
  rxRocCurve(actualVarName = observed,
             predVarNames = predicted,
             data = data)
}
```


```
boosted_metrics <- evaluate_model(data = scored_test_data,
                                  observed = "fraudRisk",
                                  predicted = "Boosted_Prediction")

roc_curve(data = scored_test_data,
          observed = "fraudRisk",
          predicted = "Boosted_Probability")
```

The confusion matrix:

```
          0        1
0   2752288    67659
1     77117   101796
```
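As a quick sanity check on the printed counts, overall accuracy is the diagonal of the confusion matrix over the total:

```r
# Counts copied from the printed confusion matrix.
confusion <- matrix(c(2752288, 67659,
                      77117, 101796), nrow = 2, byrow = TRUE)
sum(diag(confusion)) / sum(confusion)  # about 0.95
```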

The ROC curve (AUC = 0.95):

**Summary:**

In this article, we demonstrated how to use MRS on a fraud data set: how to create a data source with the RxTextData function, make transformations with the rxDataStep function, import data with the rxImport function, train a gradient boosted tree model with the rxBTrees function, and predict with the rxPredict function.

by Joseph Rickert

When I first went to grad school, the mathematicians advised me to cultivate the habit of reading with a pencil. This turned into a lifelong habit and a useful skill for reading all sorts of things: literature, reports and newspapers, for example; not just technical papers. However, reading statistics and data science papers, or really anything that includes some data, considerably "ups the ante". For this sort of exercise, I need a tool to calculate, to try some variations that test my intuition and see how well I'm following the arguments. The idea here is not so much to replicate the paper but to accept the author's invitation to engage with the data and work through the analysis. Ideally, I'd want something not much more burdensome than a pencil (maybe a tablet-based implementation of R), but standard R on my notebook comes pretty close to the perfect tool.

Recently, I sat down with Bradley Efron's 1988 paper "Logistic Regression, Survival Analysis, and the Kaplan-Meier Curve", the paper where he elaborates on the idea of using conditional logistic regression to estimate hazard rates and survival curves. This paper is classic Efron: drawing you in with a great story well before you realize how much work it's going to be to follow it to the end. Efron writes in a fairly informal style that encourages the reader to continue. Struggling to keep up with some of his arguments, I nevertheless get the feeling that Efron is doing his best to help me follow along, dropping hints every now and then about where to look if I lose the trail.

The basic idea of conditional logistic regression is to group the data into discrete time intervals with n_{i} patients at risk in each interval i, then assume that the intervals really are independent and that the s_{i} events (deaths, or some other measure of "success") in each interval follow a binomial distribution with parameters n_{i} and h_{i}, where:

h_{i} = Prob(a patient dies during the i^{th} interval | the patient survives to the beginning of the i^{th} interval).

The modest goal of this post was to see if I could reproduce Efron's Figure 3, which shows survival curves for three different models for arm A of a clinical trial examining treatments for head and neck cancer. I figured that getting to Figure 3 represents the minimum amount of comprehension required to begin experimenting with conditional logistic regression.

I entered the data buried in the caption to Efron's Table 1 and was delighted when R's survival package replicated the survival times, also given in the caption.

```r
# Data for Efron's Table 1
library(survival)

# Enter raw data for arm A
Adays <- c(7, 34, 42, 63, 64, 74, 83, 84, 91, 108, 112, 129, 133, 133,
           139, 140, 140, 146, 149, 154, 157, 160, 160, 165, 173, 176,
           185, 218, 225, 241, 248, 273, 277, 279, 297, 319, 405, 417,
           420, 440, 523, 523, 583, 594, 1101, 1116, 1146, 1226, 1349,
           1412, 1417)
Astatus <- rep(1, 51)
Astatus[c(6, 27, 34, 36, 42, 46, 48, 49, 50)] <- 0  # censored observations
Aobj <- Surv(time = Adays, Astatus == 1)
Aobj
#  [1]    7    34    42    63    64    74+   83    84    91   108   112   129   133   133
# [15]  139   140   140   146   149   154   157   160   160   165   173   176   185+  218
# [29]  225   241   248   273   277   279+  297   319+  405   417   420   440   523   523+
# [43]  583   594  1101  1116+ 1146  1226+ 1349+ 1412+ 1417
```

Doing the same with the data from arm B of the trial led to a set of Kaplan-Meier curves that pretty much match the curves in Efron's Figure 1.

All of this was straightforward, but I was puzzled that the summary of the Kaplan-Meier curve for arm A (see KM_A in the code below) doesn't match the values for month, n_{i} and s_{i} in Efron's Table 1, until I realized these values were for the beginning of the month. To match the table I compute n_{i} by putting 51 in front of the vector KM_A$n.risk, and add a 1 to the end of the vector KM_A$n.event to get s_{i}. (See the "set up variables for models" section in the code below.)

After this, one more "trick" was required to get to Figure 3. Most of the time, I suppose, those of us working with logistic regression to construct machine learning models are accustomed to specifying the outcome as a binary variable of "ones" and "zeros", or as a two-level factor. But how exactly does one specify the parameters n_{i} and s_{i} for the binomial models described above? After a close reading of the documentation (see the first paragraph under Details for ?glm) I was very pleased to see that glm() permits the dependent variable to be a matrix: a two-column matrix whose columns give the numbers of successes and failures.
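As a minimal illustration of the two-column matrix response (toy counts of my own, not Efron's data), grouped binomial models of this kind can be fit like this:

```r
# Toy grouped survival data: n_i at risk and s_i events per interval
n_i <- c(50, 45, 40, 36)                 # patients at risk (assumed values)
s_i <- c(5, 5, 4, 3)                     # deaths in each interval
t   <- seq_along(n_i)                    # interval index

# glm() accepts cbind(successes, failures) as the response
fit   <- glm(cbind(s_i, n_i - s_i) ~ t, family = binomial)
h_hat <- predict(fit, type = "response") # estimated hazard h_i per interval
surv  <- cumprod(1 - h_hat)              # survival curve built from the hazards
```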

The formula for Efron's cubic model looks like this:

```r
Y <- matrix(c(si, failure), n, 2)  # Response matrix: events and non-events
form <- formula(Y ~ t + t2 + t3)
```

The rest of the code is straightforward and leads to a pretty good reproduction of Figure 3.

In this figure, the black line represents the survival curve for a life table model where the hazard probabilities are estimated by h_{i} = s_{i} / n_{i}. The blue triangles map the survival curve for the cubic model given above, and the red curve with crosses plots a cubic spline model built from the terms t_{i}, (t_{i} - 11)^{2} and (t_{i} - 11)^{3}.

What a delightful little diagram! In addition to illustrating the technique of using a very basic statistical tool to model time-to-event data, the process leading to Figure 3 reveals something about the intuition and care a professional statistician puts into the exploratory modeling process.

There is much more in Efron's paper; what I have shown here is just the "trailer". Efron presents a careful analysis of the data for both arms of the clinical trial, goes on to study maximum likelihood estimates for conditional logistic regression models and their standard errors, and proves a result about the average ratio of the asymptotic variances of parametric and non-parametric hazard rate estimates.

Enjoy this classic paper and write some "am I reading this right" code of your own!

Here is my code for the models and plots.

*by Shaheen Gauher, PhD, Data Scientist at Microsoft*

At the heart of a classification model is the ability to assign a class to an object based on its description or features. When we build a classification model, we often have to show that it is significantly better than random guessing. How do we know if our machine learning model performs better than a classifier built by assigning labels or classes arbitrarily (through a random guess, a weighted guess, etc.)? I will call the latter *non-machine-learning classifiers*, as these do not learn from the data. A machine learning classifier should be smarter and should not be making just lucky guesses! At the least, it should do a better job of detecting the difference between classes and should have better accuracy than the latter. In the sections below, I will show three different ways to build a non-machine-learning classifier and compute their accuracy. The purpose is to establish some baseline metrics against which we can evaluate our classification model.

In the examples below, I will assume we are working with data with population size \(n\). The data is divided into two groups, with \(x\%\) of the rows or instances belonging to one class (labeled positive or \(P\)) and \((1-x)\%\) belonging to another class (labeled negative or \(N\)). We will also assume that the majority of the data is labeled \(N\). (This is easily extended to data with more than two classes, as I show in the paper here). This is the ground truth.

Population Size \(=n\)

Fraction of instances labelled positive \(=x\)

Fraction of instances labelled negative \(=(1-x)\)

Number of instances labelled positive \((P)\) \(=xn\)

Number of instances labelled negative \((N)\) \(=(1-x)n\)

The confusion matrix with rows reflecting the ground truth and columns reflecting the machine learning classifier classifications looks like:

We can define some simple non-machine learning classifiers that assign labels based simply on the proportions found in the training data:

- **Random Guess Classifier**: randomly assign half of the labels to \(P\) and the other half to \(N\).
- **Weighted Guess Classifier**: randomly assign \(x\%\) of the labels to \(P\), and the remaining \((1-x)\%\) to \(N\).
- **Majority Class Classifier**: assign all of the labels to \(N\) (the majority class in the data).

The confusion matrices for these trivial classifiers would look like:

The standard performance metrics for evaluating classifiers are **accuracy**, **recall** and **precision**. (In a previous post, we included definitions of these metrics and how to compute them in R.) In this paper I algebraically derive the performance metrics for these non-machine-learning classifiers. They are shown in the table below, and provide baseline metrics for comparing the performance of machine learning classifiers:

| Classifier     | Accuracy          | Recall  | Precision |
| -------------- | ----------------- | ------- | --------- |
| Random Guess   | \(0.5\)           | \(0.5\) | \(x\)     |
| Weighted Guess | \(x^2 + (1-x)^2\) | \(x\)   | \(x\)     |
| Majority Class | \(1-x\)           | \(0\)   | \(0\)     |

In this experiment at the Cortana Analytics Gallery you can follow a binary classification model using the Census Income data set to see how the confusion matrices for the models compare with each other. An accuracy of 76% can be achieved simply by assigning all instances to the majority class.
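As a quick numerical check, here is a short R sketch computing the three baseline accuracies for an assumed positive-class fraction of x = 0.24 (roughly the Census Income situation, where the majority-class baseline is 76%):

```r
x <- 0.24                          # assumed fraction of positive instances

random_acc   <- 0.5                # random guess: half the labels are right on average
weighted_acc <- x^2 + (1 - x)^2    # weighted guess: P(true positive) + P(true negative)
majority_acc <- 1 - x              # majority class: all instances labeled N

round(c(random = random_acc, weighted = weighted_acc, majority = majority_acc), 4)
```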

Fig. Showing confusion matrices for a binary classification model using Boosted Decision Tree for Census Income data and models based on random guess, weighted guess and all instances assigned to majority class. The experiment can be found here.

For **multiclass classification** with \(k\) classes, where \(x_i\) is the fraction of instances belonging to class \(i\) (\(i\) = 1 to \(k\)), it can similarly be shown that for:

**Random Guess**: Accuracy would be \(\frac{1}{k}\). The precision for a class \(i\) would be \(x_i\), the fraction of instances in the data with class \(i\). Recall for each class will be equal to \(\frac{1}{k}\). In the language of probability, the accuracy is simply the probability of selecting the correct class, which for two classes (binary classification) is 0.5 and for \(k\) classes is \(\frac{1}{k}\).

**Weighted Guess**: Accuracy is equal to \(\sum_{i=1}^k x_{i}^2 \). Precision and recall for a class are both equal to \(x_i\), the fraction of instances in the data with class \(i\). In the language of probability, \(\frac{xn}{n}\), or \(x\), is the probability of a label being positive in the data (for a negative label, the probability is \(\frac{(1-x)n}{n}\), or \((1-x)\)). If there are more negative instances in the data, the model has a higher probability of assigning an instance as negative. The probability of the model producing a true positive will be \(x \cdot x\) (for a true negative it is \((1-x)^2\)). The accuracy is simply the sum of these two probabilities.

**Majority Class**: Accuracy will be equal to the fraction of instances belonging to the majority class (\((1-x)\) in our binary example, where the negative label is the majority). Recall will be 1 for the majority class and 0 for all other classes. Precision will be equal to the fraction of instances belonging to the majority class for the majority class, and 0 for all other classes. In the language of probability, the probability of the model assigning a positive label to an instance is zero and the probability of assigning a negative label is 1.
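The same arithmetic extends directly to any number of classes; a small R sketch with assumed class fractions:

```r
# Assumed class fractions for a 3-class problem (must sum to 1)
x <- c(0.5, 0.3, 0.2)
k <- length(x)

random_acc   <- 1 / k      # random guess: pick one of k classes uniformly
weighted_acc <- sum(x^2)   # weighted guess: sum of per-class match probabilities
majority_acc <- max(x)     # always predict the largest class
```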

*We can use these simple, non-machine learning classifiers as benchmarks with which to compare the performance of machine learning models.*

**MRS code** to compute the baseline metrics for a classification model (Decision Forest) using the Census Income data set can be found below. The corresponding **R code** can be found here.

Acknowledgement: Special thanks to Danielle Dean, Said Bleik, Dmitry Pechyoni and George Iordanescu for their input.

by Joseph Rickert

Random Forests, the "go to" classifier for many data scientists, is a fairly complex algorithm with many moving parts that introduces randomness at different levels. Understanding exactly how the algorithm operates requires some work, and assessing how well a Random Forests model fits the data is a serious challenge. In the pragmatic world of machine learning and data science, assessing model performance often comes down to calculating the area under the ROC curve (or some other convenient measure) on a hold-out set of test data. If the ROC looks good, then the model is good to go.

Fortunately, however, goodness-of-fit issues have a kind of nagging persistence that just won't leave statisticians alone. In a gem of a paper (and here) that sparkles with insight, the authors (Wager, Hastie and Efron) take considerable care to make things clear to the reader while showing how to calculate confidence intervals for Random Forests models.

Using the high-ground approach favored by theorists, Wager et al. achieve the result about Random Forests by solving a more general problem first: they derive estimates of the variance of bagged predictors that can be computed from the same bootstrap replicates that give the predictors. After pointing out that these estimators suffer from two distinct sources of noise:

- Sampling noise - noise resulting from the randomness of data collection
- Monte Carlo noise - noise that results from using a finite number of bootstrap replicates

they produce bias-corrected versions of the jackknife and infinitesimal jackknife estimators. A very nice feature of the paper is the way the authors illustrate the theory with simulation experiments and then describe the simulations in enough detail in an appendix for readers to replicate the results. I generated the following code and figure to replicate Figure 1 of their first experiment using the authors' GitHub-based package, randomForestCI.

Here, I fit a randomForest model to eight features from the UCI MPG data set and use the randomForestInfJack() function to calculate the infinitesimal Jackknife estimator. (The authors use seven features, but the overall shape of the result is the same.)

```r
# Random Forest Confidence Intervals
install.packages("devtools")
library(devtools)
install_github("swager/randomForestCI")
library(randomForestCI)
library(dplyr)          # For data manipulation
library(randomForest)   # For random forest ensemble models
library(ggplot2)

# Fetch data from the UCI Machine Learning Repository
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
mpg <- read.table(url, stringsAsFactors = FALSE, na.strings = "?")
# Variable descriptions:
# https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.names
names(mpg) <- c("mpg","cyl","disp","hp","weight","accel","year","origin","name")
head(mpg)

# Look at the data and reset some of the data types
dim(mpg); summary(mpg)
sapply(mpg, class)
mpg <- mutate(mpg, hp = as.numeric(hp),
                   year = as.factor(year),
                   origin = as.factor(origin))
head(mpg, 2)

# Function to divide data into training and test sets
index <- function(data = data, pctTrain = 0.7)
{
  # fcn to create indices to divide data into random
  # training and testing data sets
  N <- nrow(data)
  train <- sample(N, pctTrain * N)
  test <- setdiff(seq_len(N), train)
  Ind <- list(train = train, test = test)
  return(Ind)
}

set.seed(123)
ind <- index(mpg, 0.8)
length(ind$train); length(ind$test)

form <- formula("mpg ~ cyl + disp + hp + weight + accel + year + origin")
rf_fit <- randomForest(formula = form, data = na.omit(mpg[ind$train, ]),
                       keep.inbag = TRUE)   # Build the model

# Plot the error as the number of trees increases
plot(rf_fit)

# Plot the important variables
varImpPlot(rf_fit, col = "blue", pch = 2)

# Calculate the variance
X <- na.omit(mpg[ind$test, -1])
var_hat <- randomForestInfJack(rf_fit, X, calibrate = TRUE)

# Have a look at the variance
head(var_hat); dim(var_hat); plot(var_hat)

# Plot the fit
df <- data.frame(y = mpg[ind$test, ]$mpg, var_hat)
df <- mutate(df, se = sqrt(var.hat))
head(df)

p1 <- ggplot(df, aes(x = y, y = y.hat))
p1 + geom_errorbar(aes(ymin = y.hat - se, ymax = y.hat + se), width = .1) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = 2) +
  xlab("Reported MPG") +
  ylab("Predicted MPG") +
  ggtitle("Error Bars for Random Forests")
```

An interesting feature of the plot is that Random Forests doesn't appear to have the same confidence in all of its estimates, sometimes being less confident about estimates closer to the diagonal than those further away.

Don't forget to include confidence intervals with your next Random Forests model.

By Joseph Rickert

The ability to generate synthetic data with a specified correlation structure is essential to modeling work. As you might expect, R's toolbox of packages and functions for generating and visualizing data from multivariate distributions is impressive. The basic function for generating multivariate normal data is mvrnorm() from the MASS package (shipped with the standard R distribution), although the mvtnorm package also provides functions for simulating both multivariate normal and t distributions. (For a tutorial on how to use R to simulate from multivariate normal distributions from first principles, using some linear algebra and the Cholesky decomposition, see the astrostatistics tutorial on Multivariate Computations.)

The following block of code generates 5,000 draws from a bivariate normal distribution with mean (0,0) and the covariance matrix Sigma printed in the code. The function kde2d(), also from the MASS package, generates a two-dimensional kernel density estimate of the distribution's probability density function.

```r
# SIMULATING MULTIVARIATE DATA
# https://stat.ethz.ch/pipermail/r-help/2003-September/038314.html
# Let's first simulate a bivariate normal sample
library(MASS)

# Simulate bivariate normal data
mu <- c(0, 0)                          # Mean
Sigma <- matrix(c(1, .5, .5, 1), 2)    # Covariance matrix
# > Sigma
#      [,1] [,2]
# [1,]  1.0  0.5
# [2,]  0.5  1.0

# Generate sample from N(mu, Sigma)
bivn <- mvrnorm(5000, mu = mu, Sigma = Sigma)    # from MASS package
head(bivn)

# Calculate kernel density estimate
bivn.kde <- kde2d(bivn[, 1], bivn[, 2], n = 50)  # from MASS package
```

R offers several ways of visualizing the distribution. The next two lines of code overlay a contour plot on a "heat map" that maps the density of points to a gradient of colors.
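The original two lines did not survive the transfer here; a plausible base-graphics sketch (with the simulation setup repeated so it runs on its own) is:

```r
library(MASS)
set.seed(42)
bivn <- mvrnorm(5000, mu = c(0, 0), Sigma = matrix(c(1, .5, .5, 1), 2))
bivn.kde <- kde2d(bivn[, 1], bivn[, 2], n = 50)

# Heat map of the kernel density estimate with contour lines overlaid
image(bivn.kde)                 # color gradient encodes density
contour(bivn.kde, add = TRUE)   # contour lines drawn on top
```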

This plots the irregular contours of the simulated data. The code below, which uses the ellipse() function from the ellipse package, generates the classical bivariate normal distribution plot that graces many a textbook.

```r
# Classic Bivariate Normal Diagram
library(ellipse)
rho <- cor(bivn)
y_on_x <- lm(bivn[, 2] ~ bivn[, 1])   # Regression Y ~ X
x_on_y <- lm(bivn[, 1] ~ bivn[, 2])   # Regression X ~ Y
plot_legend <- c("99% CI green", "95% CI red", "90% CI blue",
                 "Y on X black", "X on Y brown")

plot(bivn, xlab = "X", ylab = "Y", col = "dark blue",
     main = "Bivariate Normal with Confidence Intervals")
lines(ellipse(rho), col = "red")                 # ellipse() from ellipse package
lines(ellipse(rho, level = .99), col = "green")
lines(ellipse(rho, level = .90), col = "blue")
abline(y_on_x)
abline(x_on_y, col = "brown")
legend(3, 1, legend = plot_legend, cex = .5, bty = "n")
```

The next bit of code generates a couple of three-dimensional surface plots, the second of which is an rgl plot that you will be able to rotate and view from different perspectives on your screen.
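These lines were also lost in transfer; a sketch of what they plausibly looked like (setup repeated for self-containment; the rgl call is commented out since it needs an interactive display):

```r
library(MASS)
set.seed(42)
bivn <- mvrnorm(5000, mu = c(0, 0), Sigma = matrix(c(1, .5, .5, 1), 2))
bivn.kde <- kde2d(bivn[, 1], bivn[, 2], n = 50)

# Static 3D surface of the estimated density (base graphics;
# persp() accepts the kde2d list directly, as in ?kde2d)
persp(bivn.kde, phi = 30, theta = 30, shade = 0.2, border = NA)

# Interactive, rotatable version (requires the rgl package and a display):
# library(rgl)
# persp3d(x = bivn.kde, col = "lightblue")
```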

Next, we have some code to unpack the grid coordinates produced by the kernel density estimator and get x, y, and z values to plot the surface using the scatterplot3js() function from the threejs package, an htmlwidgets-based JavaScript library. This visualization does not render the surface with the same level of detail as the rgl plot. Nevertheless, it does show some of the salient features of the pdf and has the distinct advantage of being easily embedded in web pages. I expect that html widget plots will keep getting better and easier to use.

```r
# threejs JavaScript plot
library(threejs)

# Unpack data from kde grid format
x <- bivn.kde$x; y <- bivn.kde$y; z <- bivn.kde$z

# Construct x, y, z coordinates
xx <- rep(x, times = length(y))
yy <- rep(y, each = length(x))
zz <- z; dim(zz) <- NULL

# Set up color range
ra <- ceiling(16 * zz / max(zz))
col <- rainbow(16, 2/3)

# 3D interactive scatter plot
scatterplot3js(x = xx, y = yy, z = zz, size = 0.4, color = col[ra], bg = "black")
```

The code that follows uses the rtmvt() function from the tmvtnorm package to generate a bivariate t distribution. The rgl plot renders the kernel density estimate of the surface in impressive detail.

```r
# Draw from a multivariate t distribution without truncation
library(tmvtnorm)
library(rgl)   # for persp3d()

Sigma <- matrix(c(1, .1, .1, 1), 2)   # Covariance matrix
X1 <- rtmvt(n = 1000, mean = rep(0, 2), sigma = Sigma, df = 2)  # from tmvtnorm package

t.kde <- kde2d(X1[, 1], X1[, 2], n = 50)   # from MASS package
col2 <- heat.colors(length(t.kde$z))[rank(t.kde$z)]
persp3d(x = t.kde, col = col2)
```

The real value of the multivariate distribution functions from the data science perspective is to simulate data sets with many more than two variables. The functions we have been considering are up to the task, but there are some technical considerations and, of course, we don't have the same options for visualization. The following code snippet generates 10 variables from a multivariate normal distribution with a specified covariance matrix. Note that I've used the genPositiveDefMat() function from the clusterGeneration package to generate the covariance matrix. This is because mvrnorm() will throw an error, as theory says it should, if the covariance matrix is not positive definite, and guessing a combination of matrix elements to make a high-dimensional matrix positive definite would require quite a bit of luck along with some serious computation time.

After generating the matrix, I use the corrplot() function from the corrplot package to produce an attractive pairwise correlation plot that is coded both by shape and color. corrplot() scales pretty well with the number of variables and will give a decent chart with 40 to 50 variables. (Note that ggcorrplot now does this for ggplot2 plots.) Another plotting option would be to generate pairwise scatter plots, and R offers many alternatives for these.
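The snippet itself did not make it into this excerpt; a sketch of the approach described (my parameter choices, such as the 1,000 draws and the "eigen" method, are assumptions):

```r
library(MASS)
library(clusterGeneration)   # genPositiveDefMat()
library(corrplot)

set.seed(123)
d <- 10
# Random positive-definite covariance matrix for d variables
cov_mat <- genPositiveDefMat(d, covMethod = "eigen")$Sigma

# 1,000 draws from the corresponding multivariate normal
sim <- mvrnorm(1000, mu = rep(0, d), Sigma = cov_mat)

# Pairwise correlation plot coded by shape and color
corrplot(cor(sim), method = "ellipse")
```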

Finally, what about going beyond the multivariate normal and t distributions? R does have a few functions, like rlnorm.rplus() from the compositions package, which generates random variates from the multivariate lognormal distribution and is as easy to use as mvrnorm(), but you will have to hunt for them. I think a more fruitful approach, if you are serious about probability distributions, is to get familiar with the copula package.