There's more to Iowa than just today's presidential primary. Last month, the Central Iowa R User Group hosted Dr. Max Kuhn, Director of Non-Clinical Statistics at Pfizer Global R&D, via video-chat to present on Applied Predictive Modeling with R. Max is the co-author of the excellent book Applied Predictive Modeling (read our review here), and in the presentation he covers many of the topics from the book in a brisk 75 minutes.

Unlike most statistics courses, which focus on the inferential side of models, Max's talk focuses on creating statistical models where the goal is prediction. Given his background at Pfizer, the talk includes analyses of some interesting datasets, including one used to predict the performance of an algorithm that identifies cell components (the cell wall, nucleus, etc.) in a microscopy slide. (The data is public, so you can recreate the analyses using this R code and the caret package.)

In addition to an overview of predictive modeling in general, you'll learn how to use resampling techniques in R to select tuning parameters in a model (for example, the number of trees to use in a random forest), and how to evaluate classification performance using the confusion matrix and ROC curves. Watch the presentation below:

If you find the slides a little small to read, you can download them here and follow along.

Applied Predictive Modeling: Central Iowa R User Group Talk

You may have heard that R and the big-data RevoScaleR package have been integrated with SQL Server 2016 as SQL Server R Services. If you've been wanting to try out R with SQL Server but haven't been sure where to start, a new MSDN tutorial will take you through all the steps of creating a predictive model: from obtaining data for analysis, to building a statistical model, to creating a stored procedure to make predictions from the model. To work through the tutorial, you'll need a suitable Windows server on which to install the SQL Server 2016 Community Technology Preview, with SQL Server R Services installed. You'll also need a separate Windows machine (say a desktop or laptop) where you'll install Revolution R Open and Revolution R Enterprise. Most of the computations will be happening in SQL Server, though, so this "data science client machine" doesn't need to be as powerful.

The tutorial is made up of five lessons, which together should take you about 90 minutes to run through. If you run into problems, each lesson includes troubleshooting tips at the end.

Lesson 1 begins with downloading the New York City taxi data set (which was also used to create these beautiful data visualizations) and loading it into SQL Server. You'll also set up R to include some useful packages such as ggmap and RODBC.

Lesson 2 starts by having you verify the data using SQL queries. Don't miss the "Next Steps" links near the end, where you'll summarize the data using the RevoScaleR package on the data science client machine, and then visualize the data as a map with the ggmap package (as shown below).

Lesson 3 focuses on using R to augment the data with new features, such as calculating the distance between pickup and dropoff points using a custom R function or using T-SQL.
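To make the feature-engineering step concrete, here is a minimal base R sketch of a distance feature. It is not the tutorial's own function: `haversine_km()`, the column names, and the coordinates are all assumptions chosen for illustration, and the haversine formula is only one reasonable way to measure the straight-line distance between two points.

```r
# Hypothetical helper (not taken from the tutorial): great-circle distance in
# kilometres between longitude/latitude pairs, using the haversine formula.
haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Made-up pickup and dropoff coordinates; the column names are assumptions
trips <- data.frame(
  pickup_longitude  = c(-73.9857, -73.9680),
  pickup_latitude   = c(40.7484, 40.7851),
  dropoff_longitude = c(-73.9772, -73.9680),
  dropoff_latitude  = c(40.7527, 40.7851)
)

# Add the new feature as a column; the function is vectorized over rows
trips$direct_distance_km <- with(trips, haversine_km(
  pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude))
print(trips$direct_distance_km)
```

Because the function is vectorized, the same call works on a handful of rows or on a full data frame sampled from the database.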

Lesson 4 is where you'll use the rxLogit function to train a logistic regression model to predict the probability of a driver receiving a tip for a ride, evaluate the model using ROC curves, and then deploy the prediction into SQL Server as a T-SQL stored procedure.
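If you don't have SQL Server R Services at hand, the modeling idea can be tried in plain R. The sketch below uses base R's `glm()` as a stand-in for rxLogit and a hand-rolled rank-based AUC in place of a plotted ROC curve. The data are simulated, and the column names (`trip_distance`, `passenger_count`, `tipped`) and the `auc()` helper are assumptions for illustration only.

```r
set.seed(42)
n <- 2000

# Simulated trips: the columns and the data-generating process are made up
trips <- data.frame(
  passenger_count = sample(1:4, n, replace = TRUE),
  trip_distance   = rexp(n, rate = 1 / 3)
)
# Assumption: the log-odds of a tip rise with trip distance
trips$tipped <- rbinom(n, 1, plogis(-1 + 0.4 * trips$trip_distance))

train <- trips[1:1500, ]
test  <- trips[1501:n, ]

# Logistic regression with base R's glm(), standing in for rxLogit
fit <- glm(tipped ~ passenger_count + trip_distance,
           data = train, family = binomial())
test$p_tip <- predict(fit, newdata = test, type = "response")

# Hypothetical helper: area under the ROC curve via the rank (Mann-Whitney) formula
auc <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
print(auc(test$p_tip, test$tipped))
```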

Lesson 5 wraps things up by showing how to use the deployed model in a production environment, both by calculating predictions from a stored dataset in batch mode, and by performing transactional predictions one trip at a time.

To save on cutting-and-pasting, you can find all of the code used in the tutorial on Github. Give it a go, and before long you'll have your own R models running live in SQL Server.

MSDN: End-to-End Data Science Walkthrough: Overview (SQL Server R Services)

by Joseph Rickert

If there is anything that experienced machine learning practitioners are likely to agree on, it would be the importance of careful and thoughtful feature engineering. The judicious selection of which predictor variables to include in a model often has a more beneficial effect on overall classifier performance than the choice of the classification algorithm itself. This is one reason why classification algorithms that automatically include feature selection such as glmnet, gbm or random forests top the list of “go to” algorithms for many practitioners.

There are occasions, however, when you find yourself for one reason or another committed to a classifier that doesn’t automatically narrow down the list of predictor variables, and some sort of automated feature selection might seem like a good idea. If you are an R user, then the caret package offers a whole lot of machinery that might be helpful. Caret offers both filter methods and wrapper methods that include recursive feature elimination, genetic algorithms (GAs) and simulated annealing. In this post, we will have a look at a small experiment with caret’s GA option. But first, a little background.

Performing feature selection with GAs requires conceptualizing the process of feature selection as an optimization problem and then mapping it to the genetic framework of random variation and natural selection. Individuals from a given generation of a population mate to produce offspring who inherit genes (chromosomes) from both parents. Random mutation alters a small part of a child’s genetic material. The children of this new generation who are genetically most fit produce the next generation. In the feature selection context, individuals become solutions to a prediction problem. Chromosomes (sequences of genes) are modeled as vectors of 1’s and 0’s with a 1 indicating the presence of a feature and a 0 its absence. The simulated genetic algorithm then does the following: it selects two individuals, randomly chooses a split point for their chromosomes, maps the front of one chromosome to the back of the other (and vice versa) and then randomly mutates the resulting chromosomes according to some predetermined probability.
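The crossover and mutation steps are easy to sketch in base R. This is an illustration of the idea only, not caret's internal implementation; the function names `crossover()` and `mutate()`, the 10-feature chromosomes, and the mutation probability of 0.1 are assumptions.

```r
set.seed(1)
n_features <- 10

# Each chromosome is a 0/1 vector: 1 means the feature is included
parent_a <- rbinom(n_features, 1, 0.5)
parent_b <- rbinom(n_features, 1, 0.5)

# Single-point crossover: swap the tails of two parents at a random locus
crossover <- function(a, b) {
  locus <- sample(seq_len(length(a) - 1), 1)
  head_idx <- seq_len(locus)
  list(
    child1 = c(a[head_idx], b[-head_idx]),
    child2 = c(b[head_idx], a[-head_idx])
  )
}

# Mutation: flip each gene independently with probability p_m
mutate <- function(chromosome, p_m = 0.1) {
  flip <- runif(length(chromosome)) < p_m
  chromosome[flip] <- 1 - chromosome[flip]
  chromosome
}

children <- lapply(crossover(parent_a, parent_b), mutate)
print(children)
```

In a full GA, each child's fitness would then be scored by training a model on the features its chromosome switches on.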

In their book Applied Predictive Modeling, Kuhn and Johnson provide the following pseudo code for caret's GA:

1. Define stopping criteria, population size, P, for each generation, and mutation probability, p_m
2. Randomly generate an initial population of chromosomes
3. repeat:
4. | for each chromosome do
5. | Tune and train a model and compute each chromosome's fitness
6. | end
7. | for each reproduction 1 ... P/2 do
8. | Select 2 chromosomes based on fitness
9. | Crossover: randomly select a locus and exchange genes on either side of locus
10. | (head of one chromosome applied to tail of the other and vice versa)
11. | to produce 2 child chromosomes with mixed genes
12. | Mutate the child chromosomes with probability p_m
13. | end
14. until stopping criteria are met

If 10 fold cross validation is selected in the GA control procedure, then the entire genetic algorithm (steps 2 through 13) is run 10 times.

Now, just for fun, I'll conduct the following experiment to see if GA feature selection will improve on the performance of the support vector machine model featured in a previous post. As before, I use the segmentationData data set that is included in the caret package and described in the paper by Hill et al. This data set has 2,019 rows and 58 possible feature variables. The structure of the experiment is as follows:

- Divide the data into training and test data sets.
- Run the GA feature selection algorithm on the training data set to produce a subset of the training set with the selected features.
- Train the SVM algorithm on this subset.
- Assess the performance of the SVM model using the subset of the test data that contains the selected features.
- Train the SVM model on the entire training data set.
- Assess the performance of this second SVM model using the test data set.
- Compare the performance of the two SVM models.

(Note that I have cut corners here by training the SVM on the same data that was used in the GA. A methodological improvement would be to divide the original data into three sets.)

The first block of code sets up the experiment and divides the data into training and test data sets.

    library(caret)
    library(doParallel) # parallel processing
    library(dplyr) # Used by caret
    library(pROC) # plot the ROC curve

    ### Use the segmentationData from caret
    # Load the data and construct indices to divide it into training and test data sets.
    set.seed(10)
    data(segmentationData) # Load the segmentation data set
    dim(segmentationData)
    head(segmentationData,2)
    #
    trainIndex <- createDataPartition(segmentationData$Case,p=.5,list=FALSE)
    trainData <- segmentationData[trainIndex,-c(1,2)]
    testData <- segmentationData[-trainIndex,-c(1,2)]
    #
    trainX <-trainData[,-1] # Create training feature data frame
    testX <- testData[,-1] # Create test feature data frame
    y=trainData$Class # Target variable for training

Next, run the GA. Note that the first line of code registers the parallel workers so that caret can run the GA in parallel using the 4 cores on my Windows laptop.

    registerDoParallel(4) # Register a parallel backend for train
    getDoParWorkers() # check that there are 4 workers

    ga_ctrl <- gafsControl(functions = rfGA, # Assess fitness with RF
                           method = "cv", # 10 fold cross validation
                           genParallel=TRUE, # Use parallel programming
                           allowParallel = TRUE)
    ##
    set.seed(10)
    lev <- c("PS","WS") # Set the levels

    system.time(rf_ga3 <- gafs(x = trainX, y = y,
                               iters = 100, # 100 generations of algorithm
                               popSize = 20, # population size for each generation
                               levels = lev,
                               gafsControl = ga_ctrl))

The parameter settings of the gafsControl() function indicate that the internally implemented random forests model and 10 fold cross validation are to be used to assess performance of the "chromosomes" in each generation. The parameters for the gafs() function itself specify 100 generations of populations consisting of 20 individuals. Ideally, it would be nice to let the algorithm run for more iterations with larger populations and perhaps repeated 10 fold CV. However, even these modest parameter settings generate a tremendous number of calculations. (The algorithm took 4 hours to complete on a small Azure VM.)
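A rough back-of-the-envelope count shows where the time goes. The sketch below assumes that every chromosome's fitness is evaluated once per generation, and that the GA is run once for each of the 10 cross-validation resamples plus once more on the entire training set. Both are simplifying assumptions, so treat the total as an order-of-magnitude estimate rather than an exact figure.

```r
iters    <- 100  # generations per GA run
pop_size <- 20   # chromosomes per generation
folds    <- 10   # external cross-validation resamples

fits_per_ga_run <- iters * pop_size   # fitness evaluations in one GA run
ga_runs         <- folds + 1          # one run per resample, plus the final search
total_fits      <- fits_per_ga_run * ga_runs

print(total_fits)  # 22000 random forest fits under these assumptions
```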

The results below, and the plot that follows them, show that the best performance was achieved at iteration 9 with a subset of 44 variables.

    rf_ga3

    1010 samples
    58 predictors
    2 classes: 'PS', 'WS'

    Maximum generations: 100
    Population per generation: 20
    Crossover probability: 0.8
    Mutation probability: 0.1
    Elitism: 0

    Internal performance values: Accuracy, Kappa
    Subset selection driven to maximize internal Accuracy

    External performance values: Accuracy, Kappa
    Best iteration chose by maximizing external Accuracy
    External resampling method: Cross-Validated (10 fold)

    During resampling:
      * the top 5 selected variables (out of a possible 58):
        DiffIntenDensityCh4 (100%), EntropyIntenCh1 (100%), EqEllipseProlateVolCh1 (100%), EqSphereAreaCh1 (100%), FiberWidthCh1 (100%)
      * on average, 39.3 variables were selected (min = 25, max = 52)

    In the final search using the entire training set:
      * 44 features selected at iteration 9 including:
        AvgIntenCh1, AvgIntenCh4, ConvexHullAreaRatioCh1, ConvexHullPerimRatioCh1, EntropyIntenCh3 ...
      * external performance at this iteration is

        Accuracy    Kappa
          0.8406   0.6513

    plot(rf_ga3) # Plot mean fitness (accuracy) by generation

The plot also shows the average internal accuracy estimates as well as the average external estimates calculated from the 10 out of sample predictions.

This next section of code trains an SVM using only the selected features setting up a grid to search the parameter space; again, using the parallel computing feature built into caret.

    final <- rf_ga3$ga$final # Get features selected by GA
    trainX2 <- trainX[,final] # training data: selected features
    testX2 <- testX[,final] # test data: selected features

    ## SUPPORT VECTOR MACHINE MODEL
    #Note the default method of picking the best model is accuracy and Cohen's Kappa
    # Set up training control
    ctrl <- trainControl(method="repeatedcv", # 10fold cross validation
                         repeats=5, # do 5 repetitions of cv
                         summaryFunction=twoClassSummary, # Use AUC to pick the best model
                         classProbs=TRUE)

    #Use the expand.grid to specify the search space
    #Note that the default search grid selects 3 values of each tuning parameter
    #(This grid uses gbm tuning parameters; the SVM fit below does not use it)
    grid <- expand.grid(interaction.depth = seq(1,4,by=2), #tree depths from 1 to 4
                        n.trees=seq(10,100,by=10), # let iterations go from 10 to 100
                        shrinkage=c(0.01,0.1), # Try 2 values for learning rate
                        n.minobsinnode = 20)
    #
    # Set up for parallel processing
    set.seed(1951)
    registerDoParallel(4,cores=4)

    #Train and Tune the SVM
    svm.tune <- train(x=trainX2,
                      y= trainData$Class,
                      method = "svmRadial",
                      tuneLength = 9, # 9 values of the cost function
                      preProc = c("center","scale"),
                      metric="ROC",
                      trControl=ctrl) # same as for gbm above

Finally, assess the performance of the model using the test data set.

    #Make predictions on the test data with the SVM Model
    svm.pred <- predict(svm.tune,testX2)
    confusionMatrix(svm.pred,testData$Class)

    Confusion Matrix and Statistics

              Reference
    Prediction  PS  WS
            PS 560 106
            WS  84 259

                   Accuracy : 0.8117
                     95% CI : (0.7862, 0.8354)
        No Information Rate : 0.6383
        P-Value [Acc > NIR] : <2e-16

                      Kappa : 0.5868
     Mcnemar's Test P-Value : 0.1276

                Sensitivity : 0.8696
                Specificity : 0.7096
             Pos Pred Value : 0.8408
             Neg Pred Value : 0.7551
                 Prevalence : 0.6383
             Detection Rate : 0.5550
       Detection Prevalence : 0.6601
          Balanced Accuracy : 0.7896
    #
           'Positive' Class : PS

    svm.probs <- predict(svm.tune,testX2,type="prob") # Gen probs for ROC
    svm.ROC <- roc(predictor=svm.probs$PS,
                   response=testData$Class,
                   levels=rev(levels(testData$Class)))
    svm.ROC
    #Area under the curve: 0.8881
    plot(svm.ROC,main="ROC for SVM built with GA selected features")
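The headline statistics in that output follow directly from the four counts in the confusion matrix. As a quick check, this base R sketch recomputes them by hand; the variable names are mine, and the Kappa calculation is the usual Cohen's Kappa formula based on observed versus chance agreement.

```r
# Counts copied from the confusion matrix above (rows: prediction, columns: reference)
cm <- matrix(c(560, 84, 106, 259), nrow = 2,
             dimnames = list(Prediction = c("PS", "WS"),
                             Reference  = c("PS", "WS")))
n <- sum(cm)

accuracy    <- sum(diag(cm)) / n
sensitivity <- cm["PS", "PS"] / sum(cm[, "PS"])  # positive class is PS
specificity <- cm["WS", "WS"] / sum(cm[, "WS"])

# Cohen's Kappa: agreement beyond what the row and column totals give by chance
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, kappa = cohen_kappa), 4))
```

The rounded values agree with the Accuracy, Sensitivity, Specificity and Kappa lines reported by confusionMatrix().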

Voilà - a decent model, but at significant computational cost and, as it turns out, with no performance improvement! If the GA feature selection procedure is just omitted, and the SVM is fit using the training and tuning procedure described above with the same seed settings it will produce a model with an AUC of 0.8888.

In the "glass half empty" interpretation of the experiment, I did a lot of work for nothing. On the other hand, looking at a "glass half full", using automatic GA feature selection, I was able to build a model that achieved the same performance as the full model but with 24% fewer features (44 instead of 58). My guess though is that I just got lucky this time. I think the takeaway is that this kind of automated feature selection might be worthwhile if there is a compelling reason to reduce the number of features, perhaps at the expense of a decrease in performance. However, unless you know that you are dealing with a classifier that is easily confused by irrelevant predictor variables, there is no reason to expect that GA, or any other "wrapper style" feature selection method, will improve performance.

by Nina Zumel

Principal Consultant Win-Vector LLC

We've just finished off a series of articles on some recent research results applying differential privacy to improve machine learning. Some of these results are pretty technical, so we thought it was worth working through concrete examples. And some of the original results are locked behind academic journal paywalls, so we've tried to touch on the highlights of the papers, and to play around with variations of our own.

- **A Simpler Explanation of Differential Privacy**: Quick explanation of epsilon-differential privacy, and an introduction to an algorithm for safely reusing holdout data, recently published in *Science* (Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth, “The reusable holdout: Preserving validity in adaptive data analysis”, *Science*, vol 349, no. 6248, pp. 636-638, August 2015). Note that Cynthia Dwork, one of the inventors of differential privacy, originally used it in the analysis of sensitive information.
- **Using differential privacy to reuse training data**: Specifically, how differential privacy helps you build efficient encodings of categorical variables with many levels from your training data without introducing undue bias into downstream modeling.
- **A simple differentially private procedure**: The bootstrap as an alternative to Laplace noise to introduce differential privacy.
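For readers who have not met Laplace noise before, here is a minimal base R sketch of the classic Laplace mechanism for releasing a mean. It is not the code from the articles: `rlaplace()` and `private_mean()` are hypothetical helpers, and the sketch assumes the values are known to lie in a fixed range so that the effect of changing one row (the sensitivity) can be bounded.

```r
# Hypothetical helper: draw Laplace(0, scale) noise by inverse-CDF sampling
rlaplace <- function(n, scale) {
  u <- runif(n, min = -0.5, max = 0.5)
  -scale * sign(u) * log(1 - 2 * abs(u))
}

# Hypothetical helper: release the mean of values assumed to lie in [lower, upper]
private_mean <- function(x, epsilon, lower = 0, upper = 1) {
  x <- pmin(pmax(x, lower), upper)            # clamp to the assumed range
  sensitivity <- (upper - lower) / length(x)  # largest change from altering one row
  mean(x) + rlaplace(1, scale = sensitivity / epsilon)
}

set.seed(2015)
x <- runif(1000)
print(c(true = mean(x), private = private_mean(x, epsilon = 0.5)))
```

Smaller values of epsilon mean more noise and stronger privacy; larger values return answers closer to the true mean.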

Our R code and experiments are available on Github here, so you can try some experiments and variations yourself.

*Editor's Note:* *The R code includes an example of using vtreat, a package for preparing and cleaning data frames based on level-based feature pruning.*

by Joseph Rickert

In a recent post, I wrote about support vector machines, the representative master algorithm of the 5th tribe of machine learning practitioners described by Pedro Domingos in his book, The Master Algorithm. Here we look into algorithms favored by the first tribe, the symbolists, who see learning as the process of inverse deduction. Pedro writes:

Another limitation of inverse deduction is that it's very computationally intensive, which makes it hard to scale to massive data sets. For these, the symbolist algorithm of choice is decision tree induction. Decision trees can be viewed as an answer to the question of what to do if rules of more than one concept match an instance. (p85)

The de facto standard for decision trees, or “recursive partitioning” trees as they are known in the literature, is the CART algorithm by Breiman et al. (1984), implemented in R's rpart package. Stripped down to its essential structure, CART is a two-stage algorithm. In the first stage, the algorithm conducts an exhaustive search over each variable to find the best split by maximizing an information criterion that will result in cells that are as pure as possible for one or the other of the class variables. In the second stage, a constant model is fit to each cell of the resulting partition. The algorithm then proceeds in a recursive “greedy” fashion, making splits and not looking back to see how things might have been before making the next split. Although hugely successful in practice, the algorithm has two vexing problems: (1) overfitting and (2) selection bias: the algorithm favors features with many possible splits^{1}. Overfitting occurs because the algorithm has “no concept of statistical significance” ^{2}. While overfitting is usually handled with cross validation and pruning, there doesn’t seem to be an easy way to deal with selection bias in the CART / rpart framework.
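The first stage, the exhaustive split search, can be sketched in a few lines of base R. This is an illustration rather than rpart's implementation: it scores candidate splits by weighted Gini impurity (one common purity measure, chosen here as an assumption), and the helper names `gini()`, `best_split()` and `find_split()` as well as the toy data are made up.

```r
# Gini impurity of a set of class labels (0 means perfectly pure)
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Try every midpoint between distinct values of one numeric predictor
best_split <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  impurity <- sapply(cuts, function(threshold) {
    left  <- y[x < threshold]
    right <- y[x >= threshold]
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  list(cut = cuts[which.min(impurity)], impurity = min(impurity))
}

# Exhaustive search over all predictors: keep the purest split
find_split <- function(data, target) {
  predictors <- setdiff(names(data), target)
  splits <- lapply(predictors, function(v) best_split(data[[v]], data[[target]]))
  best <- which.min(sapply(splits, `[[`, "impurity"))
  c(variable = predictors[best], splits[[best]])
}

toy <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6, 7, 8),
  x2 = c(5, 1, 4, 2, 8, 3, 7, 6),
  class = c("PS", "PS", "PS", "PS", "WS", "WS", "WS", "WS")
)
res <- find_split(toy, "class")
print(res)
```

A predictor with many distinct values offers many candidate cuts, which hints at why a search like this tends to favor such predictors.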

To address these issues, Hothorn, Hornik and Zeileis introduced the party package into R about ten years ago; it provides an implementation of conditional inference trees (Unbiased Recursive Partitioning: A Conditional Inference Framework). Party’s ctree() function separates the selection of variables for splitting and the splitting process itself into two different steps, and explicitly addresses selection bias by implementing statistical testing and a stopping procedure in the first step. Very roughly, the algorithm proceeds as follows:

1. Each node of the tree is represented by a set of weights. Then, for each covariate vector X, the algorithm tests the null hypothesis that the dependent variable Y is independent of X. If the hypothesis cannot be rejected then the algorithm stops. Otherwise, the covariate with the strongest association with Y is selected for splitting.
2. The algorithm performs a split and updates the weights describing the tree.
3. Steps 1 and 2 are repeated recursively with the new parameter settings.
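The variable selection step can be imitated in base R to get a feel for how a significance test doubles as a stopping rule. This sketch is only an analogy: it uses Pearson correlation tests with a Bonferroni adjustment in place of the permutation tests that ctree() actually relies on, and `select_split_variable()` and the toy data are hypothetical.

```r
# Hypothetical helper: pick the covariate most strongly associated with the
# target, or return NULL when independence cannot be rejected for any of them.
select_split_variable <- function(data, target, alpha = 0.05) {
  predictors <- setdiff(names(data), target)
  p_values <- sapply(predictors, function(v) {
    cor.test(data[[v]], data[[target]])$p.value
  })
  p_adjusted <- p.adjust(p_values, method = "bonferroni")
  if (min(p_adjusted) > alpha) return(NULL)  # stop: no significant association
  names(which.min(p_adjusted))
}

set.seed(7)
n <- 200
toy <- data.frame(signal = rnorm(n), noise1 = rnorm(n), noise2 = rnorm(n))
toy$y <- 2 * toy$signal + rnorm(n)

print(select_split_variable(toy, "y"))
```

Because selection is based on p-values rather than on the best achievable split, a covariate does not gain an advantage simply by offering more candidate split points.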

The details, along with enough theory to use the ctree algorithm with some confidence, are presented in this accessible vignette: “party: A Laboratory for Recursive Partitioning”. The following example contrasts the ctree() and rpart() algorithms.

We begin by dividing the segmentationData data set that comes with the caret package into training and test sets and fitting a ctree() model to it using the default parameters. No attempt is made to optimize the model. Next, we use the model to predict values of the Class variable on the test data set and calculate the area under the ROC curve to be 0.8326.

    # Script to compare ctree with rpart
    library(party)
    library(rpart)
    library(caret)
    library(pROC)

    ### Get the Data
    # Load the data and construct indices to divide it into training and test data sets.
    data(segmentationData) # Load the segmentation data set
    data <- segmentationData[,3:61]
    data$Class <- ifelse(data$Class=="PS",1,0)
    #
    trainIndex <- createDataPartition(data$Class,p=.7,list=FALSE)
    trainData <- data[trainIndex,]
    testData <- data[-trainIndex,]
    #------------------------
    set.seed(23)
    # Fit Conditional Tree Model
    ctree.fit <- ctree(Class ~ ., data=trainData)
    ctree.fit
    plot(ctree.fit,main="ctree Model")

    #Make predictions using the test data set
    ctree.pred <- predict(ctree.fit,testData)

    #Draw the ROC curve
    ctree.ROC <- roc(predictor=as.numeric(ctree.pred),
                     response=testData$Class)
    ctree.ROC$auc
    #Area under the curve: 0.8326
    plot(ctree.ROC,main="ctree ROC")

Here are the text and graphical descriptions of the resulting tree.

    1) FiberWidthCh1 <= 9.887543; criterion = 1, statistic = 383.388
      2) TotalIntenCh2 <= 42511; criterion = 1, statistic = 115.137
        3) TotalIntenCh1 <= 39428; criterion = 1, statistic = 20.295
          4)* weights = 504
        3) TotalIntenCh1 > 39428
          5)* weights = 9
      2) TotalIntenCh2 > 42511
        6) AvgIntenCh1 <= 199.2768; criterion = 1, statistic = 28.037
          7) IntenCoocASMCh3 <= 0.5188792; criterion = 0.99, statistic = 14.022
            8)* weights = 188
          7) IntenCoocASMCh3 > 0.5188792
            9)* weights = 7
        6) AvgIntenCh1 > 199.2768
          10)* weights = 36
    1) FiberWidthCh1 > 9.887543
      11) ShapeP2ACh1 <= 1.227156; criterion = 1, statistic = 48.226
        12)* weights = 169
      11) ShapeP2ACh1 > 1.227156
        13) IntenCoocContrastCh3 <= 12.32349; criterion = 1, statistic = 22.349
          14) SkewIntenCh4 <= 1.148388; criterion = 0.998, statistic = 16.78
            15)* weights = 317
          14) SkewIntenCh4 > 1.148388
            16)* weights = 109
        13) IntenCoocContrastCh3 > 12.32349
          17) AvgIntenCh2 <= 244.9512; criterion = 0.999, statistic = 19.382
            18)* weights = 53
          17) AvgIntenCh2 > 244.9512
            19)* weights = 22

Next, we fit an rpart() model to the training data using the default parameter settings and calculate the AUC to be 0.8536 on the test data.

    # Fit CART Model
    rpart.fit <- rpart(Class ~ ., data=trainData) # default settings (cp = 0.01)
    rpart.fit
    plot(partykit::as.party(rpart.fit),main="rpart Model") # as.party() comes from the partykit package

    #Make predictions using the test data set
    rpart.pred <- predict(rpart.fit,testData)

    #Draw the ROC curve
    rpart.ROC <- roc(predictor=as.numeric(rpart.pred),
                     response=testData$Class)
    rpart.ROC$auc
    #Area under the curve: 0.8536
    plot(rpart.ROC)

The resulting pruned tree does better than ctree(), but at the expense of building a slightly deeper tree.

    1) root 1414 325.211500 0.64144270
      2) TotalIntenCh2>=42606.5 792 191.635100 0.41035350
        4) FiberWidthCh1>=11.19756 447 85.897090 0.25950780
          8) ShapeP2ACh1< 1.225676 155 13.548390 0.09677419 *
          9) ShapeP2ACh1>=1.225676 292 66.065070 0.34589040
            18) SkewIntenCh4< 1.41772 254 53.259840 0.29921260
              36) TotalIntenCh4< 127285.5 214 40.373830 0.25233640
                72) EqEllipseOblateVolCh1>=383.1453 142 19.943660 0.16901410 *
                73) EqEllipseOblateVolCh1< 383.1453 72 17.500000 0.41666670
                  146) AvgIntenCh1>=110.2253 40 6.400000 0.20000000 *
                  147) AvgIntenCh1< 110.2253 32 6.875000 0.68750000 *
              37) TotalIntenCh4>=127285.5 40 9.900000 0.55000000 *
            19) SkewIntenCh4>=1.41772 38 8.552632 0.65789470 *
        5) FiberWidthCh1< 11.19756 345 82.388410 0.60579710
          10) KurtIntenCh1< -0.3447192 121 28.000000 0.36363640
            20) TotalIntenCh1>=13594 98 19.561220 0.27551020 *
            21) TotalIntenCh1< 13594 23 4.434783 0.73913040 *
          11) KurtIntenCh1>=-0.3447192 224 43.459820 0.73660710
            22) AvgIntenCh1>=454.3329 7 0.000000 0.00000000 *
            23) AvgIntenCh1< 454.3329 217 39.539170 0.76036870
              46) VarIntenCh4< 130.9745 141 31.333330 0.66666670
                92) NeighborAvgDistCh1>=256.5239 30 6.300000 0.30000000 *
                93) NeighborAvgDistCh1< 256.5239 111 19.909910 0.76576580 *
              47) VarIntenCh4>=130.9745 76 4.671053 0.93421050 *
      3) TotalIntenCh2< 42606.5 622 37.427650 0.93569130
        6) ShapeP2ACh1< 1.236261 11 2.545455 0.36363640 *
        7) ShapeP2ACh1>=1.236261 611 31.217680 0.94599020 *

Note, however, that if the complexity parameter for rpart(), cp, is set to zero, rpart() builds a massive tree, a portion of which is shown below, and overfits the data, yielding an AUC of 0.806.

    1) root 1414 325.2115000 0.64144270
      2) TotalIntenCh2>=42606.5 792 191.6351000 0.41035350
        4) FiberWidthCh1>=11.19756 447 85.8970900 0.25950780
          8) ShapeP2ACh1< 1.225676 155 13.5483900 0.09677419
            16) EntropyIntenCh1>=6.672119 133 7.5187970 0.06015038
              32) AngleCh1< 108.6438 82 0.0000000 0.00000000 *
              33) AngleCh1>=108.6438 51 6.7450980 0.15686270
                66) EqEllipseLWRCh1>=1.184478 26 0.9615385 0.03846154
                  132) DiffIntenDensityCh3>=26.47004 19 0.0000000 0.00000000 *
                  133) DiffIntenDensityCh3< 26.47004 7 0.8571429 0.14285710 *
                67) EqEllipseLWRCh1< 1.184478 25 5.0400000 0.28000000
                  134) IntenCoocContrastCh3>=9.637027 9 0.0000000 0.00000000 *
                  135) IntenCoocContrastCh3< 9.637027 16 3.9375000 0.43750000 *
            17) EntropyIntenCh1< 6.672119 22 4.7727270 0.31818180
              34) ShapeBFRCh1>=0.6778205 13 0.0000000 0.00000000 *
              35) ShapeBFRCh1< 0.6778205 9 1.5555560 0.77777780 *
          9) ShapeP2ACh1>=1.225676 292 66.0650700 0.34589040
            18) SkewIntenCh4< 1.41772 254 53.2598400 0.29921260
              36) TotalIntenCh4< 127285.5 214 40.3738300 0.25233640
                72) EqEllipseOblateVolCh1>=383.1453 142 19.9436600 0.16901410
                  144) IntenCoocEntropyCh3< 7.059374 133 16.2857100 0.14285710
                    288) NeighborMinDistCh1>=21.91001 116 11.5431000 0.11206900
                      576) NeighborAvgDistCh1>=170.2248 108 8.2500000 0.08333333
                        1152) FiberAlign2Ch4< 1.481728 68 0.9852941 0.01470588
                          2304) XCentroid>=100.5 61 0.0000000 0.00000000 *
                          2305) XCentroid< 100.5 7 0.8571429 0.14285710 *
                        1153) FiberAlign2Ch4>=1.481728 40 6.4000000 0.20000000
                          2306) SkewIntenCh1< 0.9963465 27 1.8518520 0.07407407

In practice, rpart()'s complexity parameter (default value cp = .01) is effective in controlling tree growth and overfitting. It does, however, have an "ad hoc" feel to it. In contrast, the ctree() algorithm implements tests for statistical significance within the process of growing a decision tree. It automatically curtails excessive growth, inherently addresses both overfitting and bias and offers the promise of achieving good models with less computation.

Finally, note that rpart() and ctree() construct different trees that offer about the same performance. Some practitioners who value decision trees for their interpretability find this disconcerting. End users of machine learning models often want a story that tells them something true about their customers' behavior, buying preferences and so on. But the likelihood of there being multiple satisfactory answers to a complex problem is inherent to the process of inverse deduction. As Hothorn et al. comment:

Since a key reason for the popularity of tree based methods stems from their ability to represent the estimated regression relationship in an intuitive way, interpretations drawn from regression trees must be taken with a grain of salt.

1. Hothorn et al. (2006): Unbiased Recursive Partitioning: A Conditional Inference Framework. J COMPUT GRAPH STAT, Vol. 15, No. 3, Sept 2006

2. Mingers (1987): Expert Systems-Rule Induction with Statistical Data

Cortana Analytics Suite is Microsoft's cloud-based big data and advanced analytics suite. It includes a complete set of the services needed to build advanced analytics applications: data ingestion and management, data warehousing, advanced analytics, data visualization and solution frameworks. You can use Cortana Analytics to build applications using R, by incorporating services including Data Factory, HDInsight Hadoop, and ML Studio.

If you'd like to spend some quality time with other developers and the Microsoft development team, there will be a first-ever Cortana Analytics Workshop at Microsoft HQ in Redmond, September 10-11. Or if you just want to check it out online, you can watch this introductory webinar, or join this upcoming application-focused webinar on September 1:

Leveraging Predictive Analytics for Sales and Marketing: In this session, learn how the Microsoft Global Marketing team built predictive analytics solutions to meet the marketing and sales business needs of Microsoft subsidiaries and business groups. The session will describe how Cortana Analytics components (Azure Machine Learning, Azure Data Factory) are used to build an end-to-end solution in the cloud.

Sign up for this free webinar here, or follow the link below to explore Cortana Analytics in depth.

Microsoft: Cortana Analytics Suite

The KDD Cup is an annual competition to build the best predictive model from a large data set. This year's contest tasked entrants with predicting the likelihood of a student dropping out from one of XuetangX's massive open online courses, based on the student's prior activities. The competition closed on July 12, and yesterday, the winning teams were announced. The winner was team "Intercontinental Ensemble" and the runner-up was "FEG&NSSOL@DataVeraci".

I couldn't find any details on what techniques were used; more will be revealed, I expect, at the KDD Conference in Sydney. But if you want to get a sense of what it's like to work with these data, take a look at this Data Until I Die blog post from a competitor who got close to the top of the leaderboard. He or she used a Gradient Boosting Model from the H2O R package, and found (amongst other things) that students who had completed prior courses were more likely to complete the next one.

If you'd like to play around with the data yourself, it's no longer available at the KDD Cup site, but it is available in an experiment in Azure ML Studio. If you haven't used Azure ML Studio before, it's free to get started and all you need is a modern web browser (I used Chrome on a Mac). The screenshot below just shows the data munging steps, but later on in the flow a Python node is used to fit a predictive model. (This step-by-step tutorial on analyzing the KDD 2015 data walks you through the steps.) It's easy to add an R node as well, which gives you an R instance with 50 GB of RAM and 8 cores to analyze the data.

For more details on using Azure ML Studio to analyze the KDD Cup data, check out the blog post below.

by Bill Jacobs, Director Technical Sales, Microsoft Advanced Analytics

In the course of working with our Hadoop users, we are often asked, what's the best way to integrate R with Hadoop?

The answer, in nearly all cases, is: it depends.

Alternatives range from open source R on workstations to parallelized commercial products like Revolution R Enterprise. Between these extremes lies a range of options, each with its own ability to scale data size, performance, capability and ease of use.

And so, the right choice or choices depend on your data size, budget, skill, patience and governance limitations.

In this post, I’ll summarize the alternatives using pure open source R and some of their advantages. In a subsequent post, I’ll describe the options for achieving even greater scale, speed, stability and ease of development by combining open source and commercial technologies.

These two posts are written to help current R users who are novices at Hadoop understand and select solutions to evaluate.

As with most things open source, the first consideration is of course monetary. Isn’t it always? The good news is that there are multiple alternatives that are free, and additional capabilities under development in various open source projects.

We generally see four options for building R to Hadoop integration using entirely open source stacks.

This baseline approach’s greatest advantage is simplicity and cost. It’s free. End to end free. What else in life is?

Through packages Revolution contributed to open source including rhdfs and rhbase, R users can directly ingest data from both the hdfs file system and the hbase database subsystems in Hadoop. Both connectors are part of the RHadoop package created and maintained by Revolution and are a go-to choice.

Additional options exist as well. The RHive package executes Hive’s HQL SQL-like query language directly from R, and provides functions for retrieving metadata from Hive such as database names, table names, column names, etc.

The rhive package, in particular, has the advantage that its data operations allow some work to be pushed down into Hadoop, avoiding data movement and parallelizing operations for big speed increases. Similar “push-down” can be achieved with rhbase as well. However, neither is a particularly rich environment, and invariably, complex analytical problems will reveal some gaps in capability.

Beyond the somewhat limited push-down capabilities, R’s best at working on modest data sampled from hdfs, hbase or hive, and in this way, current R users can get going with Hadoop quickly.

Once you tire of R’s memory barriers on your laptop, the obvious next path is a shared server. With today’s technologies, you can equip a powerful server for only a few thousand dollars, and easily share it between a few users. Using Windows or Linux with 256GB or 512GB of RAM, R can be used to analyze files into the hundreds of gigabytes, albeit not as fast as perhaps you’d like.

Like option 1, R on a shared server can also leverage push-down capabilities of the rhbase and rhive packages to achieve parallelism and avoid data movement. However, as with workstations, the pushdown capabilities of rhive and rhbase are limited.

And of course, while lots of RAM keeps the dreaded out-of-memory exhaustion at bay, it does little for compute performance, and depends on sharing skills learned [or perhaps not learned] in kindergarten. For these reasons, consider a shared server to be a great add-on to R on workstations but not a complete substitute.

Replacing the CRAN download of R with the Revolution R Open (RRO) distribution enhances performance further. RRO is, like R itself, open source, 100% R and free for the download. It accelerates math computations using the Intel Math Kernel Library (MKL) and is 100% compatible with the algorithms in CRAN and other repositories like Bioconductor. No changes are required to R scripts, and the acceleration MKL offers varies from negligible to an order of magnitude for scripts making intensive use of certain math and linear algebra primitives. You can anticipate that RRO can double your average performance if you’re doing math operations in the language.

As with options 1 and 2, Revolution R Open can be used with connectors like rhdfs, and can connect and push work down into Hadoop through rhbase and rhive.

Once you find that your problem set is too big, or your patience is being taxed on a workstation or server and the limitations of rhbase and rhive push down are impeding progress, you’re ready for running R inside of Hadoop.

The open source RHadoop project that includes rhdfs, rhbase and plyrmr also includes a package rmr2 that enables R users to build Hadoop map and reduce operations using R functions. Using mappers, R functions are applied to all of the data blocks that compose an hdfs file, an hbase table or other data sets, and the results can be sent to a reducer, also an R function, for aggregation or analysis. All work is conducted inside of Hadoop but is built in R.

Let’s be clear. Applying R functions on each hdfs file segment is a great way to accelerate computation. But for most, it is the avoidance of moving data that really improves performance. To do this, rmr2 applies R functions to the data residing on Hadoop nodes rather than moving the data to where R resides.
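To make the mapper and reducer idea concrete, here is a minimal sketch in plain R. It does not use rmr2 or Hadoop at all: the "blocks" are just pieces of a made-up data frame held in memory, and the names `blocks`, `mapper` and `reducer` are hypothetical. The point is only the shape of the computation, where each block is summarized independently and the small partial results are then combined.

```r
# Toy stand-in for a distributed file: 1000 rows split into 4 "blocks".
# (Assumption: in a real cluster these blocks would live on different nodes.)
set.seed(42)
values <- data.frame(key = sample(c("a", "b", "c"), 1000, replace = TRUE),
                     x   = runif(1000))
blocks <- split(values, rep(1:4, each = 250))

# Mapper: runs on one block and emits per-key partial sums and counts.
mapper <- function(block) {
  sums   <- tapply(block$x, block$key, sum)
  counts <- tapply(block$x, block$key, length)
  data.frame(key = names(sums), sum = as.vector(sums), n = as.vector(counts))
}

# Reducer: combines the partial results from all blocks into per-key means.
reducer <- function(partials) {
  combined  <- do.call(rbind, partials)
  total_sum <- tapply(combined$sum, combined$key, sum)
  total_n   <- tapply(combined$n, combined$key, sum)
  data.frame(key = names(total_sum), mean = as.vector(total_sum / total_n))
}

partials <- lapply(blocks, mapper)   # the "map" phase, one call per block
result   <- reducer(partials)        # the "reduce" phase
print(result)
```

Only the small per-block summaries travel to the reducer, never the raw rows, which is the data-movement saving described above. Note that a mean has to be carried as a sum and a count to combine correctly across blocks; this kind of bookkeeping is what you take on when you write the parallel logic yourself.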

While rmr2 gives essentially unlimited capabilities, as a data scientist or statistician, your thoughts will soon turn to computing entire algorithms in R on large data sets. Using rmr2 in this way complicates development for the R programmer, because he or she must write the entire logic of the desired algorithm or adapt existing CRAN algorithms. She or he must then validate that the algorithm is accurate and reflects the expected mathematical result, and write code for the myriad corner cases such as missing data.

rmr2 requires coding on your part to manage parallelization. This may be trivial for data transformation operations, aggregates, etc., or quite tedious if you’re trying to train predictive models or build classifiers on large data.

While rmr2 can be more tedious than other approaches, it is not untenable, and most R programmers will find rmr2 much easier than resorting to Java-based development of Hadoop mappers and reducers. While somewhat tedious, rmr2 a) is fully open source, b) helps to parallelize computation to address larger data sets, c) skips painful data movement, d) is broadly used so you’ll find help available, and e) is free. Not bad.

rmr2 is not the only option in this category – a similar package called rhipe is also available and provides similar capabilities. rhipe is described here and here and is downloadable from GitHub.

The range of open source-based options for using R with Hadoop is expanding. The Apache Spark community, for example, is rapidly improving R integration via the predictably named SparkR. Today, SparkR provides access to Spark from R much as rmr2 and rhipe do for Hadoop MapReduce.

We expect that, in the future, the SparkR team will add support for Spark’s MLlib machine learning algorithm library, providing execution directly from R. Availability dates haven’t been widely published.

Perhaps the most exciting observation is that R has become “table stakes” for platform vendors. Our partners at Cloudera, Hortonworks, MapR and others, along with database vendors and others, are all keenly aware of the dominance of R among the large and growing data science community, and R’s importance as a means to extract insights and value from the burgeoning data repositories built atop Hadoop.

In a subsequent post, I’ll review the options for creating even greater performance, simplicity, portability and scale available to R users by expanding the scope from open source only solutions to those like Revolution R Enterprise for Hadoop.

by Joseph Rickert

Because of its simplicity and good performance over a wide spectrum of classification problems, the Naïve Bayes classifier ought to be on everyone's short list of machine learning algorithms. Now, with version 7.4 we have a high performance Naïve Bayes classifier in Revolution R Enterprise too. Like all Parallel External Memory Algorithms (PEMAs) in the RevoScaleR package, rxNaiveBayes is an inherently parallel algorithm that may be distributed across Microsoft HPC, Linux and Hadoop clusters and may be run on data in Teradata databases.

The following example shows how to get started with rxNaiveBayes() on a moderately sized data set in your local environment. It uses the Mortgage data set, which may be downloaded from the Revolution Analytics data set repository. The first block of code imports the .csv files for the years 2000 through 2008 and concatenates them into a single training file in the .XDF binary format. Then, the data for the year 2009 is imported to a test file that will be used for making predictions.

#-----------------------------------------------
# Set up the data location information
bigDataDir <- "C:/Data/Mortgage"
mortCsvDataName <- file.path(bigDataDir,"mortDefault")
trainingDataFileName <- "mortDefaultTraining"
mortCsv2009 <- paste(mortCsvDataName, "2009.csv", sep = "")
targetDataFileName <- "mortDefault2009.xdf"
#---------------------------------------
# Import the data from multiple .csv files into 2 .XDF files
# One file, the training file containing data from the years
# 2000 through 2008.
# The other file, the test file, containing data from the year 2009.
defaultLevels <- as.character(c(0,1))
ageLevels <- as.character(c(0:40))
yearLevels <- as.character(c(2000:2009))
colInfo <- list(list(name = "default", type = "factor", levels = defaultLevels),
                list(name = "houseAge", type = "factor", levels = ageLevels),
                list(name = "year", type = "factor", levels = yearLevels))
append= FALSE
for (i in 2000:2008) {
  importFile <- paste(mortCsvDataName, i, ".csv", sep = "")
  rxImport(inData = importFile, outFile = trainingDataFileName,
           colInfo = colInfo, append = append, overwrite=TRUE)
  append = TRUE
}

The rxGetInfo() command shows that the training file has 9 million observations with 6 variables and the test file contains 1 million observations. The binary factor variable, default, which indicates whether or not an individual defaulted on the mortgage, will be the target variable in the classification exercise.

rxGetInfo(trainingDataFileName, getVarInfo=TRUE)
#File name: C:\Users\jrickert\Documents\Revolution\NaiveBayes\mortDefaultTraining.xdf
#Number of observations: 9e+06
#Number of variables: 6
#Number of blocks: 18
#Compression type: zlib
#Variable information:
#Var 1: creditScore, Type: integer, Low/High: (432, 955)
#Var 2: houseAge
#       41 factor levels: 0 1 2 3 4 ... 36 37 38 39 40
#Var 3: yearsEmploy, Type: integer, Low/High: (0, 15)
#Var 4: ccDebt, Type: integer, Low/High: (0, 15566)
#Var 5: year
#       10 factor levels: 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
#Var 6: default
#       2 factor levels: 0 1
rxImport(inData = mortCsv2009, outFile = targetDataFileName, colInfo = colInfo)
rxGetInfo(targetDataFileName)
#> rxGetInfo(targetDataFileName)
#File name: C:\Users\jrickert\Documents\Revolution\NaiveBayes\mortDefault2009.xdf
#Number of observations: 1e+06
#Number of variables: 6
#Number of blocks: 2
#Compression type: zlib

Next, the rxNaiveBayes() function is used to fit a classification model with default as the target variable and year, credit score, years employed and credit card debt as predictors. Note that the smoothingFactor parameter instructs the classifier to perform Laplace smoothing. (Since the conditional probabilities are being multiplied in the model, adding a small number to 0 probabilities precludes missing categories from wiping out the calculation.) Also note that it took about 1.9 seconds to fit the model on my modest Lenovo ThinkPad, which is powered by an Intel i7 5600U processor and equipped with 8GB of RAM.

# Build the classifier on the training data
mortNB <- rxNaiveBayes(default ~ year + creditScore + yearsEmploy + ccDebt,
                       data = trainingDataFileName, smoothingFactor = 1)
#Rows Read: 500000, Total Rows Processed: 8500000, Total Chunk Time: 0.110 seconds
#Rows Read: 500000, Total Rows Processed: 9000000, Total Chunk Time: 0.125 seconds
#Computation time: 1.875 seconds.
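To see why the smoothing factor matters, here is a small base R sketch of additive (Laplace) smoothing on made-up counts. This is the textbook formula; I am assuming, not asserting, that rxNaiveBayes() does something equivalent internally. The matrix `counts` and the function `cond_prob` are hypothetical names for illustration. The situation is the same as in the mortgage example, where the year level 2009 is declared in the factor but has no rows in the training file.

```r
# Hypothetical counts of a three-level factor within each class.
# Level "b" never occurs with class "0"; level "c" never occurs with class "1".
counts <- matrix(c(30,  0, 20,
                    5, 10,  0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(class = c("0", "1"),
                                 level = c("a", "b", "c")))

# P(level | class) with additive smoothing: add `smoothing` to every cell,
# and add smoothing * (number of levels) to each class total.
cond_prob <- function(counts, smoothing = 0) {
  (counts + smoothing) / (rowSums(counts) + smoothing * ncol(counts))
}

unsmoothed <- cond_prob(counts)                 # contains exact zeros
smoothed   <- cond_prob(counts, smoothing = 1)  # zeros become small positives

print(unsmoothed)
print(smoothed)
```

With no smoothing, a single zero entry forces the whole product of conditional probabilities for that class to zero, no matter what the other predictors say. After smoothing, every entry is strictly positive and each row still sums to 1.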

Looking at the model object we see that conditional probabilities are calculated for all of the factor (categorical) variables and means and standard deviations are calculated for numeric variables. rxNaiveBayes() follows the standard practice of assuming that these variables follow Gaussian distributions.

#> mortNB
#
#Naive Bayes Classifier
#
#Call:
#rxNaiveBayes(formula = default ~ year + creditScore + yearsEmploy +
#    ccDebt, data = trainingDataFileName, smoothingFactor = 1)
#
#A priori probabilities:
#default
#          0           1
#0.997242889 0.002757111
#
#Predictor types:
#     Variable    Type
#1        year  factor
#2 creditScore numeric
#3 yearsEmploy numeric
#4      ccDebt numeric
#
#Conditional probabilities:
#$year
#       year
#default         2000         2001         2002         2003         2004
#      0 1.113034e-01 1.110692e-01 1.112866e-01 1.113183e-01 1.113589e-01
#      1 4.157267e-02 1.262488e-01 4.765549e-02 3.617467e-02 2.151144e-02
#       year
#default         2005         2006         2007         2008         2009
#      0 1.113663e-01 1.113403e-01 1.111888e-01 1.097681e-01 1.114182e-07
#      1 1.885272e-02 2.823880e-02 8.302449e-02 5.966806e-01 4.028360e-05
#
#$creditScore
#     Means   StdDev
#0 700.0839 50.00289
#1 686.5243 49.71074
#
#$yearsEmploy
#     Means   StdDev
#0 5.006873 2.009446
#1 4.133030 1.969213
#
#$ccDebt
#     Means   StdDev
#0 4991.582 1976.716
#1 9349.423 1459.797
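As a rough illustration of how these numbers turn into a prediction, the sketch below plugs the printed priors, means and standard deviations into the usual Gaussian Naïve Bayes formula in base R. It is a simplification under stated assumptions: it ignores the year factor entirely, the two applicants are invented, and the helper `posterior` is a hypothetical name, not part of RevoScaleR.

```r
# Priors and per-class Gaussian parameters copied from the model printout.
# Element order in every vector is class "0" (no default), then class "1".
prior <- c("0" = 0.997242889, "1" = 0.002757111)
params <- list(
  creditScore = list(mean = c(700.0839, 686.5243), sd = c(50.00289, 49.71074)),
  yearsEmploy = list(mean = c(5.006873, 4.133030), sd = c(2.009446, 1.969213)),
  ccDebt      = list(mean = c(4991.582, 9349.423), sd = c(1976.716, 1459.797))
)

# Posterior class probabilities for one observation (year deliberately omitted).
posterior <- function(obs) {
  # Sum log densities instead of multiplying densities to avoid underflow.
  log_score <- log(prior)
  for (v in names(params)) {
    log_score <- log_score +
      dnorm(obs[[v]], mean = params[[v]]$mean, sd = params[[v]]$sd, log = TRUE)
  }
  score <- exp(log_score - max(log_score))
  score / sum(score)
}

# Two made-up applicants who differ only in credit card debt.
low_debt  <- posterior(list(creditScore = 700, yearsEmploy = 5, ccDebt = 3000))
high_debt <- posterior(list(creditScore = 700, yearsEmploy = 5, ccDebt = 12000))
print(rbind(low_debt, high_debt))
```

Because the class means for ccDebt are so far apart, moving that one predictor is enough to shift the posterior noticeably toward default, even against a prior of well under 1%.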

Next, we use the rxPredict() function to predict default values for the test data set. Setting the type = "prob" parameter produced the table of probabilities below. Using the default for type would have produced only the default_Pred column of forecasts. In a multi-value forecast, the probability table would contain entries for all possible values.

# use the model to predict whether a loan will default on the test data
mortNBPred <- rxPredict(mortNB, data = targetDataFileName, type="prob")
#Rows Read: 500000, Total Rows Processed: 500000, Total Chunk Time: 3.876 seconds
#Rows Read: 500000, Total Rows Processed: 1000000, Total Chunk Time: 2.280 seconds

names(mortNBPred) <- c("prob_0","prob_1")
mortNBPred$default_Pred <- as.factor(round(mortNBPred$prob_1))
#head(mortNBPred)
#     prob_0      prob_1 default_Pred
#1 0.9968860 0.003114038            0
#2 0.9569425 0.043057472            0
#3 0.5725627 0.427437291            0
#4 0.9989603 0.001039729            0
#5 0.7372746 0.262725382            0
#6 0.4142266 0.585773432            1

In this next step, we tabulate the actual vs. predicted values for the test data set to produce the "confusion matrix" and an estimate of the misclassification rate.

# Tabulate the actual and predicted values
actual_value <- rxDataStep(targetDataFileName,maxRowsByCols=6000000)[["default"]]
predicted_value <- mortNBPred[["default_Pred"]]
results <- table(predicted_value,actual_value)
#> results
#               actual_value
#predicted_value      0      1
#              0 877272   3792
#              1  97987  20949
# Elements 2 and 3 (column-major order) are the two off-diagonal cells
pctMisclassified <- sum(results[2:3])/sum(results)*100
pctMisclassified
#[1] 10.1779

Since the results object produced above is an ordinary table, we can use the confusionMatrix() function from the caret package to produce additional performance measures.

# Use confusionMatrix from the caret package to look at the results
library(caret)
library(e1071)
confusionMatrix(results,positive="1")
#Confusion Matrix and Statistics
#
#               actual_value
#predicted_value      0      1
#              0 877272   3792
#              1  97987  20949
#
#               Accuracy : 0.8982
#                 95% CI : (0.8976, 0.8988)
#    No Information Rate : 0.9753
#    P-Value [Acc > NIR] : 1
#
#                  Kappa : NA
# Mcnemar's Test P-Value : <2e-16
#
#            Sensitivity : 0.84673
#            Specificity : 0.89953
#         Pos Pred Value : 0.17614
#         Neg Pred Value : 0.99570
#             Prevalence : 0.02474
#         Detection Rate : 0.02095
#   Detection Prevalence : 0.11894
#      Balanced Accuracy : 0.87313
#
#       'Positive' Class : 1
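If you want to check where a few of these figures come from, the sketch below rebuilds the 2x2 table from the counts printed above and computes the headline measures by hand in base R, using the standard definitions with "1" as the positive class. Nothing here calls caret; the variable names are just for illustration.

```r
# Rebuild the confusion matrix from the printed counts
# (rows = predicted, columns = actual; matrix() fills column by column).
results <- matrix(c(877272, 97987, 3792, 20949), nrow = 2,
                  dimnames = list(predicted_value = c("0", "1"),
                                  actual_value    = c("0", "1")))

tp <- results["1", "1"]  # predicted default, actually defaulted
tn <- results["0", "0"]  # predicted no default, actually did not default
fp <- results["1", "0"]  # predicted default, actually did not default
fn <- results["0", "1"]  # predicted no default, actually defaulted

accuracy       <- (tp + tn) / sum(results)
sensitivity    <- tp / (tp + fn)   # share of actual defaults that were caught
specificity    <- tn / (tn + fp)   # share of non-defaults correctly cleared
pos_pred_value <- tp / (tp + fp)   # share of predicted defaults that were real
balanced_acc   <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, pos_pred_value = pos_pred_value,
        balanced_acc = balanced_acc), 5)
```

The low positive predictive value next to the high sensitivity is what you would expect with such a rare positive class: the model catches most defaults, but most of the loans it flags do not in fact default.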

Finally, we use the hist() function to look at a histogram (not shown) of the actual values to get a feel for how unbalanced the data set is, and then use the rxRocCurve() function to produce the ROC Curve.

roc_data <- data.frame(mortNBPred$prob_1,as.integer(actual_value)-1)
names(roc_data) <- c("predicted_value","actual_value")
head(roc_data)
hist(roc_data$actual_value)
rxRocCurve("actual_value","predicted_value",roc_data,title="ROC Curve for Naive Bayes Mortgage Defaults Model")

Here we have a "picture-perfect" representation of how one hopes a classifier will perform.
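For readers without RevoScaleR, an ROC curve is easy enough to sketch in base R. The code below is not what rxRocCurve() does internally (I have not looked); it is the standard construction, run on simulated scores rather than the mortgage data. The helpers `roc_points` and `auc` are hypothetical names, and tied scores are not handled specially.

```r
# Compute ROC points by sweeping the threshold from the highest score down.
# score: predicted probability of the positive class; label: 0/1 actual values.
roc_points <- function(score, label) {
  label <- label[order(score, decreasing = TRUE)]
  data.frame(fpr = c(0, cumsum(label == 0) / sum(label == 0)),
             tpr = c(0, cumsum(label == 1) / sum(label == 1)))
}

# Area under the curve by the trapezoidal rule.
auc <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

# Simulated, unbalanced data: about 20% positives with higher scores on average.
set.seed(1)
label <- rbinom(500, 1, 0.2)
score <- plogis(rnorm(500, mean = ifelse(label == 1, 1.5, -1.5)))

roc <- roc_points(score, label)
plot(roc$fpr, roc$tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)  # reference line for a classifier that guesses at random
auc(roc)
```

A curve that hugs the top-left corner, like the one above, corresponds to an area close to 1, while the dashed diagonal corresponds to an area of 0.5.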

For more on the Naïve Bayes classification algorithm have a look at these two papers referenced in the Wikipedia link above.

The first is a prescient 1961 paper by Marvin Minsky that explicitly calls attention to the naïve independence assumption. The second paper provides some theoretical arguments for why the overall excellent performance of the Naïve Bayes Classifier is not accidental.

by Joseph Rickert

Last year, in a post on interesting R topics presented at the JSM, I described how data scientists in Google's human resources department were using R and predictive analytics to better understand the characteristics of its workforce. Google may very well have done the pioneering work, but predictive analytics for HR applications is going mainstream. In the still below from a Predictive Analytics Times video on *Data Science for Work Force Optimization*, Pasha Roberts, Chief Scientist at Talent Analytics, describes using survival analysis for modeling employee retention.

The video begins with a discussion of data analytics in industry, spends some time on three important curves for workforce analysis, presents some tips for talent modeling and ends with a case study on call center attrition. During the course of his presentation Pasha walks through all of the stages of a project from formulating a hypothesis, through model building and testing to model deployment.

But Pasha covers more ground than model building alone. It appears that leading edge HR departments are moving towards predicting individual employee performance. The discussion of Aptitude Metrics about 45 minutes into the talk should be of interest to anyone working, or looking for work, at a technology company. Quantitative evaluation is likely to be a big part of our future. This video is well worth watching.