by Joseph Rickert

The second annual H2O World conference finished up yesterday. More than 700 people from all over the US attended the three-day event, which was held at the Computer History Museum in Mountain View, California, a venue that sits well within the blast radius of ground zero for Data Science in Silicon Valley. This was definitely a conference for practitioners, and I recognized quite a few accomplished data scientists in the crowd. Unlike many other single-vendor productions, this was a genuine Data Science event and not merely a vendor showcase. H2O is a relatively small company, but they took a big league approach to the conference, with an emphasis on cultivating the community of data scientists and delivering presentations and panel discussions that focused on programming, algorithms and good Data Science practice.

The R-based sessions I attended on the tutorial day were all very well done. Each was designed around a carefully crafted R script performing a non-trivial model building exercise and showcasing one or more of the various algorithms in the H2O repertoire, including GLMs, Gradient Boosting Machines, Random Forests and Deep Learning Neural Nets. The presentations were targeted to a sophisticated audience, with considerable discussion of pros and cons. Deep Learning is probably H2O's signature algorithm, but despite its extremely impressive performance in many applications, nobody here was selling it as the answer to everything.

The following code fragment from a script (Download Deeplearning) that uses deep learning to identify a spiral pattern in a data set illustrates the current look and feel of H2O's R interface. Any function that begins with h2o. runs in the JVM, not in the R environment. (Also note that if you want to run the code you must first install Java on your machine; the Java Runtime Environment will do. Then, download the H2O R package, version 3.6.0.3, from the company's website. The scripts will not run with the older version of the package on CRAN.)
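Once the package and Java are in place, getting a session going is brief. A minimal setup sketch (my own illustration; h2o.init() launches or connects to a local JVM from within R):

```r
# Minimal H2O session setup. Assumes the h2o R package from the company's
# website and a Java runtime are already installed.
library(h2o)
h2o.init(nthreads = -1)  # launch/connect to a local H2O JVM, using all cores
h2o.clusterInfo()        # confirm the cluster is up before running the script
```

Every h2o.* call in the script below then operates on data held in that JVM rather than in R's own memory.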

### Cover Type Dataset
# We import the full cover type dataset (581k rows, 13 columns, 10 numerical, 3 categorical).
# We also split the data 3 ways: 60% for training, 20% for validation (hyper parameter tuning)
# and 20% for final testing.
df <- h2o.importFile(path = normalizePath("../data/covtype.full.csv"))
dim(df)
df
splits <- h2o.splitFrame(df, c(0.6,0.2), seed=1234)
train <- h2o.assign(splits[[1]], "train.hex") # 60%
valid <- h2o.assign(splits[[2]], "valid.hex") # 20%
test  <- h2o.assign(splits[[3]], "test.hex")  # 20%

# Here's a scalable way to do scatter plots via binning (works for categorical
# and numeric columns) to get more familiar with the dataset.
#dev.new(noRStudioGD=FALSE)  # direct plotting output to a new window
par(mfrow=c(1,1))  # reset canvas
plot(h2o.tabulate(df, "Elevation",                       "Cover_Type"))
plot(h2o.tabulate(df, "Horizontal_Distance_To_Roadways", "Cover_Type"))
plot(h2o.tabulate(df, "Soil_Type",                       "Cover_Type"))
plot(h2o.tabulate(df, "Horizontal_Distance_To_Roadways", "Elevation"))

#### First Run of H2O Deep Learning
# Let's run our first Deep Learning model on the covtype dataset.
# We want to predict the `Cover_Type` column, a categorical feature with 7 levels, and the
# Deep Learning model will be tasked to perform (multi-class) classification. It uses the
# other 12 predictors of the dataset, of which 10 are numerical, and 2 are categorical with
# a total of 44 levels. We can expect the Deep Learning model to have 56 input neurons
# (after automatic one-hot encoding).
response <- "Cover_Type"
predictors <- setdiff(names(df), response)
predictors

# To keep it fast, we only run for one epoch (one pass over the training data).
m1 <- h2o.deeplearning(
  model_id="dl_model_first",
  training_frame=train,
  validation_frame=valid,   ## validation dataset: used for scoring and early stopping
  x=predictors,
  y=response,
  #activation="Rectifier",  ## default
  #hidden=c(200,200),       ## default: 2 hidden layers with 200 neurons each
  epochs=1,
  variable_importances=T    ## not enabled by default
)
summary(m1)

# Inspect the model in [Flow](http://localhost:54321/) for more information about model
# building etc. by issuing a cell with the content `getModel "dl_model_first"`, and
# pressing Ctrl-Enter.

#### Variable Importances
# Variable importances for Neural Network models are notoriously difficult to compute, and
# there are many [pitfalls](ftp://ftp.sas.com/pub/neural/importance.html). H2O Deep Learning
# has implemented the method of [Gedeon](http://cs.anu.edu.au/~./Tom.Gedeon/pdfs/ContribDataMinv2.pdf),
# and returns relative variable importances in descending order of importance.
head(as.data.frame(h2o.varimp(m1)))

#### Early Stopping
# Now we run another, smaller network, and we let it stop automatically once the
# misclassification rate converges (specifically, if the moving average of length 2 does
# not improve by at least 1% for 2 consecutive scoring events). We also sample the
# validation set to 10,000 rows for faster scoring.
m2 <- h2o.deeplearning(
  model_id="dl_model_faster",
  training_frame=train,
  validation_frame=valid,
  x=predictors,
  y=response,
  hidden=c(32,32,32),                  ## small network, runs faster
  epochs=1000000,                      ## hopefully converges earlier...
  score_validation_samples=10000,      ## sample the validation dataset (faster)
  stopping_rounds=2,
  stopping_metric="misclassification", ## could be "MSE","logloss","r2"
  stopping_tolerance=0.01
)
summary(m2)
plot(m2)

First notice that it all looks pretty much like R code. The script mixes standard R functions and H2O functions in a natural way. For example, h2o.tabulate() produces an object of class "list", and h2o.deeplearning() yields a model object that plot() can deal with. This is just baseline stuff that has to happen to make H2O coding feel like R. But note that the H2O code goes beyond this baseline requirement. The functions h2o.splitFrame() and h2o.assign() manipulate data residing in the JVM in a way that will probably seem natural to most R users, and the function signatures also seem "R like" enough to go unnoticed. All of this reflects the conscious intent of the H2O designers not only to provide tools to facilitate the manipulation of H2O data from the R environment, but also to replicate the R experience.

An innovative new feature of the h2o.deeplearning() function itself is the ability to specify a stopping metric. The parameter settings stopping_metric="misclassification", stopping_rounds=2 and stopping_tolerance=0.01 in the specification of model m2 mean that the neural net will stop training once the misclassification rate on the validation set fails to improve by at least 1% over 2 consecutive scoring events. In most cases, this will produce a useful model in much less time than it would take to let the learner run to completion. The following plot, generated in the script referenced above, shows the kind of problem for which the Deep Learning algorithm excels.

Highlights of the conference for me included the presentations listed below. The videos and slides (when available) from all of these presentations will be posted on the H2O conference website. Some have been posted already and the rest should follow soon. (I have listed the dates and presentation times to help you locate the slides when they become available.)

Madeleine Udell (11-11: 10:30AM) presented the mathematics underlying the new algorithm, Generalized Low Rank Models (GLRM), that she developed as part of her PhD work under Stephen Boyd, professor at Stanford University and adviser to H2O. This algorithm, which generalizes PCA to deal with heterogeneous data types, shows great promise for a variety of data science applications. Among other things, it offers a scalable way to impute missing data. This was possibly the best presentation of the conference. Madeleine is an astonishingly good speaker; she makes the math exciting.

Anqi Fu (11-9: 3PM) presented her H2O implementation of the GLRM. Anqi not only does a great job of presenting the algorithm, she also offers some real insight into the challenges of turning the mathematics into production level code. You can download one of Anqi's demo R scripts here: Download Glrm.census.labor.violations. To my knowledge, Anqi's code is the only scalable implementation of the GLRM. (Madeleine wrote the prototype code in Julia.)

Matt Dowle (11-10), of data.table fame, demonstrated his port of data.table's lightning fast radix sorting algorithm to H2O. Matt showed a 1B row x 1B row table join that runs in about 1.45 minutes on a 4 node, 128 core H2O cluster. This is a very impressive result, but Matt says he can already do 10B x 10B row joins, and is shooting for 100B x 100B rows.

Professor Rob Tibshirani (11-11: 11AM) presented work he is doing that may lead to lasso based models capable of detecting the presence of cancer in tissue extracted from patients while they are on the operating table! He described "Customized Learning", a method of building individual models for each patient. The basic technique is to pool the data from all of the patients and run a clustering algorithm. Then, for each patient fit a model using only the data in the patient's cluster. This is exciting work with the real potential to save lives.
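The pool-cluster-fit recipe is easy to sketch in R. The following toy code (my own illustration on simulated data, not Professor Tibshirani's actual method or code) clusters the pooled observations and then fits a separate linear model within each cluster:

```r
# Toy sketch of "customized learning": pool the data, cluster it, then fit a
# model for each case using only the data in that case's cluster.
set.seed(42)
pooled <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
pooled$y <- 3 * pooled$x1 + rnorm(300)

clusters <- kmeans(pooled[, c("x1", "x2")], centers = 3)  # step 1: cluster pooled data
models <- lapply(split(pooled, clusters$cluster),         # step 2: one model per cluster
                 function(d) lm(y ~ x1 + x2, data = d))
length(models)  # one fitted model for each of the 3 clusters
```

A new patient would then be scored with the model belonging to his or her nearest cluster.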

Professor Stephen Boyd (11-10: 11AM) delivered a tutorial on optimization starting with basic convex optimization problems and then went on to describe Consensus Optimization, an algorithm for building machine learning models from data stored at different locations without sharing the data among the locations. Professor Boyd is a lucid and entertaining speaker, the kind of professor you will wish you had had.

Arno Candel (11-9: 1:30PM) presented the Deep Learning model which he developed at H2O. Arno is an accomplished speaker who presents the details with great clarity and balance. Be sure to have a look at his slide showing the strengths and weaknesses of Deep Learning.

Erin LeDell (11-9: 3PM) de-mystified ensembles and described how to build an ensemble learner from scratch. Anyone who wants to compete in a Kaggle competition should find this talk to be of value.

Szilard Pafka (11-11: 3PM), in a devastatingly effective, low-key presentation, described his efforts to benchmark the open source machine learning platforms R, Python scikit-learn, Vowpal Wabbit, H2O, xgboost and Spark MLlib. Szilard downplayed his results, pointing out that they are in no way meant to be either complete or conclusive. Nevertheless, Szilard put considerable effort into the benchmarks. (He worked directly with the development teams for all of the platforms.) Szilard did not offer any conclusions, but things are not looking all that good for Spark. The following slide plots AUC vs file size up to 10M rows.

Szilard's presentation should be available on the H2O site soon, but it is also available here.

I also found the Wednesday morning panel discussion on the "Culture of Data Driven Decision Making" and the Wednesday afternoon panel on "Algorithms - Design and Application" to be informative and well worth watching. Both panels included a great group of articulate and knowledgeable people.

If you have not checked in with H2O since the post I wrote last year, here, on one slide, is some of what they have been up to since then.

Congratulations to H2O for putting on a top notch event!

by Joseph Rickert

In a recent post, I wrote about support vector machines, the representative master algorithm of the 5th tribe of machine learning practitioners described by Pedro Domingos in his book, The Master Algorithm. Here we look into algorithms favored by the first tribe, the symbolists, who see learning as the process of inverse deduction. Pedro writes:

Another limitation of inverse deduction is that it's very computationally intensive, which makes it hard to scale to massive data sets. For these, the symbolist algorithm of choice is decision tree induction. Decision trees can be viewed as an answer to the question of what to do if rules of more than one concept match an instance. (p85)

The de facto standard for decision trees, or "recursive partitioning" trees as they are known in the literature, is the CART algorithm of Breiman et al. (1984), implemented in R's rpart package. Stripped down to its essential structure, CART is a two-stage algorithm. In the first stage, the algorithm conducts an exhaustive search over each variable to find the best split by maximizing an information criterion that will result in cells that are as pure as possible for one or the other of the class variables. In the second stage, a constant model is fit to each cell of the resulting partition. The algorithm proceeds in a recursive "greedy" fashion, making splits and not looking back to see how things might have been before making the next split. Although hugely successful in practice, the algorithm has two vexing problems: (1) overfitting and (2) selection bias, i.e. the algorithm favors features with many possible splits^{1}. Overfitting occurs because the algorithm has "no concept of statistical significance"^{2}. While overfitting is usually handled with cross validation and pruning, there doesn't seem to be an easy way to deal with selection bias in the CART / rpart framework.
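To make the first stage concrete, here is a toy version of the exhaustive split search on a single numeric variable, using Gini impurity as the information criterion. (This is my own illustration, not rpart's actual implementation, which handles multiple variables, surrogate splits and more.)

```r
# Toy illustration of CART's stage-one split search: try every candidate
# cut point and keep the one minimizing weighted child-node Gini impurity.
gini <- function(y) { p <- mean(y); 2 * p * (1 - p) }  # binary 0/1 response
best_split <- function(x, y) {
  cuts <- sort(unique(x))[-1]                # candidate cut points
  impurity <- sapply(cuts, function(cut) {
    left <- y[x < cut]; right <- y[x >= cut]
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  cuts[which.min(impurity)]                  # the best (purest) split
}

set.seed(7)
x <- runif(200)
y <- as.numeric(x > 0.6)  # true class boundary at 0.6
best_split(x, y)          # recovers a cut point very close to 0.6
```

Running the search over every variable at every node, recursively, is exactly the "greedy" behavior described above.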

To address these issues, Hothorn, Hornik and Zeileis introduced the party package into R about ten years ago, which provides an implementation of conditional inference trees (Unbiased Recursive Partitioning: A Conditional Inference Framework). party's ctree() function separates the selection of variables for splitting and the splitting process itself into two different steps, and explicitly addresses selection bias by implementing statistical testing and a stopping procedure in the first step. Very roughly, the algorithm proceeds as follows:

1. Each node of the tree is represented by a set of weights. Then, for each covariate vector X, the algorithm tests the null hypothesis that the dependent variable Y is independent of X. If the hypothesis cannot be rejected, the algorithm stops. Otherwise, the covariate with the strongest association with Y is selected for splitting.
2. The algorithm performs a split and updates the weights describing the tree.
3. Steps 1 and 2 are repeated recursively with the new parameter settings.
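A heavily simplified sketch of the variable-selection step might look like the following. (This is illustrative only: party's actual machinery uses a conditional inference / permutation test framework that handles arbitrary variable types, not cor.test().)

```r
# Toy sketch of ctree-style variable selection: test each covariate's
# association with y, adjust for multiplicity, stop if nothing is
# significant, otherwise pick the strongest association for splitting.
select_split_variable <- function(X, y, alpha = 0.05) {
  pvals <- sapply(X, function(x) cor.test(x, y)$p.value)
  pvals <- pmin(pvals * length(pvals), 1)  # Bonferroni adjustment
  if (min(pvals) > alpha) return(NULL)     # cannot reject independence: stop
  names(which.min(pvals))                  # covariate most associated with y
}

set.seed(1)
X <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
y <- 2 * X$x1 + rnorm(100)  # y depends on x1 only
select_split_variable(X, y) # selects "x1"
```

The statistical test doubles as the stopping rule: when no covariate shows a significant association, the node is not split, which is what curtails tree growth without pruning.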

The details, along with enough theory to use the ctree algorithm with some confidence, are presented in the accessible vignette "party: A Laboratory for Recursive Partitioning". The following example contrasts the ctree() and rpart() algorithms.

We begin by dividing the segmentationData data set that comes with the caret package into training and test sets and fitting a ctree() model to it using the default parameters. No attempt is made to optimize the model. Next, we use the model to predict values of the Class variable on the test data set and calculate the area under the ROC curve to be 0.8326.

# Script to compare ctree with rpart
library(party)
library(rpart)
library(caret)
library(pROC)

### Get the Data
# Load the data and construct indices to divide it into training and test data sets.
data(segmentationData)  # Load the segmentation data set
data <- segmentationData[,3:61]
data$Class <- ifelse(data$Class=="PS",1,0)

trainIndex <- createDataPartition(data$Class,p=.7,list=FALSE)
trainData <- data[trainIndex,]
testData  <- data[-trainIndex,]
#------------------------
set.seed(23)
# Fit Conditional Tree Model
ctree.fit <- ctree(Class ~ ., data=trainData)
ctree.fit
plot(ctree.fit,main="ctree Model")

# Make predictions using the test data set
ctree.pred <- predict(ctree.fit,testData)

# Draw the ROC curve
ctree.ROC <- roc(predictor=as.numeric(ctree.pred),
                 response=testData$Class)
ctree.ROC$auc  # Area under the curve: 0.8326
plot(ctree.ROC,main="ctree ROC")

Here are the text and graphical descriptions of the resulting tree.

1) FiberWidthCh1 <= 9.887543; criterion = 1, statistic = 383.388
  2) TotalIntenCh2 <= 42511; criterion = 1, statistic = 115.137
    3) TotalIntenCh1 <= 39428; criterion = 1, statistic = 20.295
      4)* weights = 504
    3) TotalIntenCh1 > 39428
      5)* weights = 9
  2) TotalIntenCh2 > 42511
    6) AvgIntenCh1 <= 199.2768; criterion = 1, statistic = 28.037
      7) IntenCoocASMCh3 <= 0.5188792; criterion = 0.99, statistic = 14.022
        8)* weights = 188
      7) IntenCoocASMCh3 > 0.5188792
        9)* weights = 7
    6) AvgIntenCh1 > 199.2768
      10)* weights = 36
1) FiberWidthCh1 > 9.887543
  11) ShapeP2ACh1 <= 1.227156; criterion = 1, statistic = 48.226
    12)* weights = 169
  11) ShapeP2ACh1 > 1.227156
    13) IntenCoocContrastCh3 <= 12.32349; criterion = 1, statistic = 22.349
      14) SkewIntenCh4 <= 1.148388; criterion = 0.998, statistic = 16.78
        15)* weights = 317
      14) SkewIntenCh4 > 1.148388
        16)* weights = 109
    13) IntenCoocContrastCh3 > 12.32349
      17) AvgIntenCh2 <= 244.9512; criterion = 0.999, statistic = 19.382
        18)* weights = 53
      17) AvgIntenCh2 > 244.9512
        19)* weights = 22

Next, we fit an rpart() model to the training data using the default parameter settings and calculate the AUC to be 0.8536 on the test data.

# Fit CART Model (default parameters)
rpart.fit <- rpart(Class ~ ., data=trainData)
rpart.fit
plot(as.party(rpart.fit),main="rpart Model")

# Make predictions using the test data set
rpart.pred <- predict(rpart.fit,testData)

# Draw the ROC curve
rpart.ROC <- roc(predictor=as.numeric(rpart.pred),
                 response=testData$Class)
rpart.ROC$auc  # Area under the curve: 0.8536
plot(rpart.ROC)

The resulting pruned tree does better than ctree(), but at the expense of building a slightly deeper tree.

1) root 1414 325.211500 0.64144270
  2) TotalIntenCh2>=42606.5 792 191.635100 0.41035350
    4) FiberWidthCh1>=11.19756 447 85.897090 0.25950780
      8) ShapeP2ACh1< 1.225676 155 13.548390 0.09677419 *
      9) ShapeP2ACh1>=1.225676 292 66.065070 0.34589040
        18) SkewIntenCh4< 1.41772 254 53.259840 0.29921260
          36) TotalIntenCh4< 127285.5 214 40.373830 0.25233640
            72) EqEllipseOblateVolCh1>=383.1453 142 19.943660 0.16901410 *
            73) EqEllipseOblateVolCh1< 383.1453 72 17.500000 0.41666670
              146) AvgIntenCh1>=110.2253 40 6.400000 0.20000000 *
              147) AvgIntenCh1< 110.2253 32 6.875000 0.68750000 *
          37) TotalIntenCh4>=127285.5 40 9.900000 0.55000000 *
        19) SkewIntenCh4>=1.41772 38 8.552632 0.65789470 *
    5) FiberWidthCh1< 11.19756 345 82.388410 0.60579710
      10) KurtIntenCh1< -0.3447192 121 28.000000 0.36363640
        20) TotalIntenCh1>=13594 98 19.561220 0.27551020 *
        21) TotalIntenCh1< 13594 23 4.434783 0.73913040 *
      11) KurtIntenCh1>=-0.3447192 224 43.459820 0.73660710
        22) AvgIntenCh1>=454.3329 7 0.000000 0.00000000 *
        23) AvgIntenCh1< 454.3329 217 39.539170 0.76036870
          46) VarIntenCh4< 130.9745 141 31.333330 0.66666670
            92) NeighborAvgDistCh1>=256.5239 30 6.300000 0.30000000 *
            93) NeighborAvgDistCh1< 256.5239 111 19.909910 0.76576580 *
          47) VarIntenCh4>=130.9745 76 4.671053 0.93421050 *
  3) TotalIntenCh2< 42606.5 622 37.427650 0.93569130
    6) ShapeP2ACh1< 1.236261 11 2.545455 0.36363640 *
    7) ShapeP2ACh1>=1.236261 611 31.217680 0.94599020 *

Note, however, that if rpart()'s complexity parameter, cp, is set to zero, rpart() builds a massive tree, a portion of which is shown below, and overfits the data, yielding an AUC of 0.806.

1) root 1414 325.2115000 0.64144270
  2) TotalIntenCh2>=42606.5 792 191.6351000 0.41035350
    4) FiberWidthCh1>=11.19756 447 85.8970900 0.25950780
      8) ShapeP2ACh1< 1.225676 155 13.5483900 0.09677419
        16) EntropyIntenCh1>=6.672119 133 7.5187970 0.06015038
          32) AngleCh1< 108.6438 82 0.0000000 0.00000000 *
          33) AngleCh1>=108.6438 51 6.7450980 0.15686270
            66) EqEllipseLWRCh1>=1.184478 26 0.9615385 0.03846154
              132) DiffIntenDensityCh3>=26.47004 19 0.0000000 0.00000000 *
              133) DiffIntenDensityCh3< 26.47004 7 0.8571429 0.14285710 *
            67) EqEllipseLWRCh1< 1.184478 25 5.0400000 0.28000000
              134) IntenCoocContrastCh3>=9.637027 9 0.0000000 0.00000000 *
              135) IntenCoocContrastCh3< 9.637027 16 3.9375000 0.43750000 *
        17) EntropyIntenCh1< 6.672119 22 4.7727270 0.31818180
          34) ShapeBFRCh1>=0.6778205 13 0.0000000 0.00000000 *
          35) ShapeBFRCh1< 0.6778205 9 1.5555560 0.77777780 *
      9) ShapeP2ACh1>=1.225676 292 66.0650700 0.34589040
        18) SkewIntenCh4< 1.41772 254 53.2598400 0.29921260
          36) TotalIntenCh4< 127285.5 214 40.3738300 0.25233640
            72) EqEllipseOblateVolCh1>=383.1453 142 19.9436600 0.16901410
              144) IntenCoocEntropyCh3< 7.059374 133 16.2857100 0.14285710
                288) NeighborMinDistCh1>=21.91001 116 11.5431000 0.11206900
                  576) NeighborAvgDistCh1>=170.2248 108 8.2500000 0.08333333
                    1152) FiberAlign2Ch4< 1.481728 68 0.9852941 0.01470588
                      2304) XCentroid>=100.5 61 0.0000000 0.00000000 *
                      2305) XCentroid< 100.5 7 0.8571429 0.14285710 *
                    1153) FiberAlign2Ch4>=1.481728 40 6.4000000 0.20000000
                      2306) SkewIntenCh1< 0.9963465 27 1.8518520 0.07407407

In practice, rpart()'s complexity parameter (default value cp = .01) is effective in controlling tree growth and overfitting. It does, however, have an "ad hoc" feel to it. In contrast, the ctree() algorithm implements tests for statistical significance within the process of growing a decision tree. It automatically curtails excessive growth, inherently addresses both overfitting and selection bias, and offers the promise of achieving good models with less computation.
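The standard cp-based workflow is to grow a deliberately large tree and prune it back at the cp value that minimizes the cross-validated error in the complexity parameter table. A minimal sketch, using the kyphosis data set that ships with rpart:

```r
library(rpart)
# Grow an unpruned tree (cp = 0), then prune at the cp value minimizing the
# cross-validated error (xerror) in the complexity parameter table.
set.seed(1)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, cp = 0)
printcp(fit)  # cross-validated error for each candidate cp value
best.cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best.cp)
```

This is exactly the cross-validation-and-pruning remedy for overfitting mentioned earlier, and it is what ctree()'s built-in significance testing renders unnecessary.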

Finally, note that rpart() and ctree() construct different trees that offer about the same performance. Some practitioners who value decision trees for their interpretability find this disconcerting. End users of machine learning models often want a story that tells them something true about their customers' behavior or buying preferences. But the likelihood of there being multiple satisfactory answers to a complex problem is inherent to the process of inverse deduction. As Hothorn et al. comment:

Since a key reason for the popularity of tree based methods stems from their ability to represent the estimated regression relationship in an intuitive way, interpretations drawn from regression trees must be taken with a grain of salt.

1. Hothorn, Hornik and Zeileis (2006): Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, Vol. 15, No. 3, September 2006.

2. Mingers (1987): Expert Systems - Rule Induction with Statistical Data.

by Joseph Rickert

In his new book, The Master Algorithm, Pedro Domingos takes on the heroic task of explaining machine learning to a wide audience and classifies machine learning practitioners into 5 tribes*, each with its own fundamental approach to learning problems. To the 5th tribe, the analogizers, Pedro assigns the Support Vector Machine (SVM) as its master algorithm. Although the SVM has been a competitive and popular algorithm since its discovery in the 1990s, this might be the breakout moment for SVMs into pop culture. (What algorithm has a cooler name?) More people than ever will want to give them a try and face the challenge of tuning them. Fortunately, there is plenty of help out there and some good tools for the task.

In A Practical Guide to Support Vector Classification, which is aimed at beginners, Hsu et al. suggest the following methodology:

- Transform data to the format of an SVM package
- Conduct simple scaling on the data
- Consider the RBF kernel K(x, y) = exp(−γ||x − y||^{2})
- Use cross-validation to find the best parameters C and γ
- Use the best parameters C and γ to train on the whole training set
- Test
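The RBF kernel in the third step is simple enough to compute directly, which helps demystify what the "radial kernel" below actually does. A quick sketch (my own illustration, with an arbitrary γ):

```r
# The RBF (Gaussian) kernel from the methodology above:
# K(x, y) = exp(-gamma * ||x - y||^2)
# Small gamma -> smooth, wide influence; large gamma -> spiky, local influence.
rbf_kernel <- function(x, y, gamma = 0.5) {
  exp(-gamma * sum((x - y)^2))
}

rbf_kernel(c(1, 2), c(2, 4))  # exp(-0.5 * 5) = exp(-2.5), about 0.082
```

Note that γ here plays the same role as kernlab's sigma parameter tuned below: it controls how quickly a support vector's influence decays with distance.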

In this post, we present a variation of the methodology using R and the caret package. First, we set up for the analysis, loading the segmentation data set from the caret package and using caret's createDataPartition() function to produce training and test data sets.

# Training SVM Models
library(caret)
library(dplyr)    # Used by caret
library(kernlab)  # support vector machine
library(pROC)     # plot the ROC curves

### Get the Data
# Load the data and construct indices to divide it into training and test data sets.
data(segmentationData)  # Load the segmentation data set
trainIndex <- createDataPartition(segmentationData$Case,p=.5,list=FALSE)
trainData <- segmentationData[trainIndex,]
testData  <- segmentationData[-trainIndex,]
trainX <- trainData[,4:61]  # Pull out the variables for training
sapply(trainX,summary)      # Look at a summary of the training data

Next, we carry out a two-pass training and tuning process. In the first pass, shown in the code block below, we arbitrarily pick some tuning parameters and use the default caret settings for others. In the trainControl() function we specify 5 repetitions of 10-fold cross validation. In the train() function, which actually does the work, we specify the radial kernel using the method parameter and ROC as the metric for assessing performance. The tuneLength parameter is set to pick 9 arbitrary values for C, the "cost" parameter of the radial kernel, which controls the complexity of the boundary between support vectors. The radial kernel also requires setting a smoothing parameter, sigma. In this first pass, we let train() use its default method of calculating an analytically derived estimate for sigma. Also note that, with the preProc parameter, we instruct train() to center and scale the data before running the analysis.

## SUPPORT VECTOR MACHINE MODEL
# First pass
set.seed(1492)
# Setup for cross validation
ctrl <- trainControl(method="repeatedcv",   # 10-fold cross validation
                     repeats=5,             # do 5 repetitions of cv
                     summaryFunction=twoClassSummary, # Use AUC to pick the best model
                     classProbs=TRUE)

# Train and Tune the SVM
svm.tune <- train(x=trainX,
                  y=trainData$Class,
                  method = "svmRadial",           # Radial kernel
                  tuneLength = 9,                 # 9 values of the cost function
                  preProc = c("center","scale"),  # Center and scale data
                  metric="ROC",
                  trControl=ctrl)

svm.tune
# Support Vector Machines with Radial Basis Function Kernel
#
# 1010 samples
#   58 predictor
#    2 classes: 'PS', 'WS'
#
# Pre-processing: centered, scaled
# Resampling: Cross-Validated (10 fold, repeated 5 times)
# Summary of sample sizes: 908, 909, 909, 909, 909, 909, ...
# Resampling results across tuning parameters:
#
#   C      ROC        Sens       Spec       ROC SD      Sens SD     Spec SD
#    0.25  0.8695054  0.8540559  0.6690476  0.03720951  0.04389913  0.07282584
#    0.50  0.8724147  0.8592774  0.6912857  0.03618794  0.04234003  0.07830249
#    1.00  0.8746137  0.8718648  0.6968254  0.03418700  0.04204607  0.06918850
#    2.00  0.8709825  0.8755478  0.6969048  0.03345607  0.03927223  0.06714838
#    4.00  0.8609396  0.8795478  0.6702063  0.03437846  0.04189803  0.06597494
#    8.00  0.8456799  0.8703357  0.6310635  0.03610988  0.04105803  0.07540066
#   16.00  0.8293339  0.8666667  0.5943492  0.03717344  0.04773906  0.08006023
#   32.00  0.8220839  0.8636131  0.5759365  0.03622665  0.04531028  0.07587914
#   64.00  0.8123889  0.8605315  0.5541746  0.03795353  0.04494173  0.07140262
#
# Tuning parameter 'sigma' was held constant at a value of 0.01521335
# ROC was used to select the optimal model using the largest value.
# The final values used for the model were sigma = 0.01521335 and C = 1.

The results show that the best model resulted from setting sigma = 0.01521335 and C = 1.

In the second pass, having seen the parameter values selected in the first pass, we use train()'s tuneGrid parameter to do some sensitivity analysis around the values C = 1 and sigma = 0.015 that produced the model with the best ROC value. Note that R's expand.grid() function is used to build a data frame containing all the combinations of C and sigma we want to look at.

# Second pass
# Look at the results of svm.tune and refine the parameter space
set.seed(1492)
# Use expand.grid to specify the search space
grid <- expand.grid(sigma = c(.01, .015, 0.2),
                    C = c(0.75, 0.9, 1, 1.1, 1.25))

# Train and Tune the SVM
svm.tune <- train(x=trainX,
                  y=trainData$Class,
                  method = "svmRadial",
                  preProc = c("center","scale"),
                  metric="ROC",
                  tuneGrid = grid,
                  trControl=ctrl)

svm.tune
# Support Vector Machines with Radial Basis Function Kernel
#
# 1010 samples
#   58 predictor
#    2 classes: 'PS', 'WS'
#
# Pre-processing: centered, scaled
# Resampling: Cross-Validated (10 fold, repeated 5 times)
# Summary of sample sizes: 909, 909, 908, 910, 909, 909, ...
# Resampling results across tuning parameters:
#
#   sigma  C     ROC        Sens       Spec       ROC SD      Sens SD     Spec SD
#   0.010  0.75  0.8727381  0.8614685  0.6817619  0.04096223  0.04183900  0.07664910
#   0.010  0.90  0.8742107  0.8633193  0.6878889  0.04066995  0.04037202  0.07817537
#   0.010  1.00  0.8748389  0.8630023  0.6873016  0.04079094  0.04061032  0.08189960
#   0.010  1.10  0.8747998  0.8642378  0.6884444  0.04076756  0.04004827  0.07892234
#   0.010  1.25  0.8749384  0.8657762  0.6923492  0.04083294  0.03911751  0.08070616
#   0.015  0.75  0.8726557  0.8660793  0.6923333  0.04171842  0.04324822  0.08203598
#   0.015  0.90  0.8727037  0.8688531  0.6945714  0.04164810  0.04082448  0.08379649
#   0.015  1.00  0.8727079  0.8713147  0.6906667  0.04184851  0.04273855  0.08174494
#   0.015  1.10  0.8724013  0.8719301  0.6895556  0.04197524  0.04108930  0.08377854
#   0.015  1.25  0.8721560  0.8722331  0.6900952  0.04207802  0.04096501  0.08639355
#   0.200  0.75  0.8193497  0.8863263  0.4478413  0.04695531  0.04072159  0.08061632
#   0.200  0.90  0.8195377  0.8903263  0.4405397  0.04688797  0.04091728  0.07844983
#   0.200  1.00  0.8193307  0.8915478  0.4361111  0.04719399  0.04004779  0.07815045
#   0.200  1.10  0.8195696  0.8958508  0.4333651  0.04694670  0.04026003  0.08252021
#   0.200  1.25  0.8198250  0.8983077  0.4271905  0.04705685  0.03900879  0.07945602
#
# ROC was used to select the optimal model using the largest value.
# The final values used for the model were sigma = 0.01 and C = 1.25.

This was quite a bit of calculation for an improvement of 0.0003247 in the ROC score, but it shows off some of what caret can do.

To finish up, we build a model with a different kernel. The linear kernel is the simplest way to go. There is only the C parameter to set for this kernel, and train() hardwires in a value of C = 1. The resulting ROC value of 0.87 is not too shabby.

# Linear Kernel
set.seed(1492)

# Train and Tune the SVM
svm.tune2 <- train(x=trainX,
                   y=trainData$Class,
                   method = "svmLinear",
                   preProc = c("center","scale"),
                   metric="ROC",
                   trControl=ctrl)

svm.tune2
# Support Vector Machines with Linear Kernel
#
# 1010 samples
#   58 predictor
#    2 classes: 'PS', 'WS'
#
# Pre-processing: centered, scaled
# Resampling: Cross-Validated (10 fold, repeated 5 times)
# Summary of sample sizes: 909, 909, 908, 910, 909, 909, ...
# Resampling results
#
#   ROC        Sens       Spec       ROC SD      Sens SD     Spec SD
#   0.8654818  0.8774312  0.6327302  0.03975929  0.04210438  0.0824907
#
# Tuning parameter 'C' was held constant at a value of 1

Because I took the trouble to set the seed for the pseudorandom number generator to the same value before each of the resampling operations above, we can use caret's resamples() function to compare the results generated by the radial kernel and linear kernel models. The following block of code and results shows just the first five lines of the comparison table but includes the summary of the comparison.

rValues <- resamples(list(svm=svm.tune, svm.tune2))
rValues$values
#      Resample   svm~ROC  svm~Sens  svm~Spec Model2~ROC Model2~Sens Model2~Spec
# 1 Fold01.Rep1 0.8995726 0.8153846 0.7777778  0.8914530   0.8461538   0.6666667
# 2 Fold01.Rep2 0.9452991 0.9076923 0.7500000  0.9346154   0.9384615   0.6666667
# 3 Fold01.Rep3 0.8153846 0.8153846 0.6111111  0.8115385   0.8153846   0.5555556
# 4 Fold01.Rep4 0.9115385 0.9076923 0.6944444  0.8918803   0.9076923   0.4722222
# 5 Fold01.Rep5 0.9017094 0.8615385 0.6944444  0.8645299   0.8923077   0.6666667

summary(rValues)
# Call:
# summary.resamples(object = rValues)
#
# Models: svm, Model2
# Number of resamples: 50
#
# ROC
#          Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
# svm    0.7910  0.8503 0.8785 0.8749  0.9013 0.9620    0
# Model2 0.7789  0.8399 0.8609 0.8655  0.8918 0.9457    0
#
# Sens
#          Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
# svm    0.7538  0.8346 0.8769 0.8658  0.8935 0.9385    0
# Model2 0.7692  0.8462 0.8923 0.8774  0.9077 0.9538    0
#
# Spec
#          Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
# svm    0.4286  0.6312 0.6944 0.6923  0.7500 0.8333    0
# Model2 0.4722  0.5556 0.6389 0.6327  0.6667 0.8333    0

bwplot(rValues, metric="ROC", ylab=c("linear kernel", "radial kernel"))  # boxplot

The boxplot below would appear to give the radial kernel the edge on being the better model, but of course, we are far from done. We will leave the last step of Hsu et al.'s methodology, testing the models on held-out testing data, for another day.

To go further, start with the resources on the caret website. Have a look at the list of 206 models that can be used with the train() function. (This is an astounding number of models to put into a single operational framework.) Then, for some very readable background material on SVMs, I recommend section 13.4 of Applied Predictive Modeling and sections 9.3 and 9.4 of Practical Data Science with R by Nina Zumel and John Mount. You will be hard pressed to find an introduction to kernel methods and SVMs that is as clear and useful as this last reference. Finally, I'm not quite finished with it, but Pedro Domingos' book, The Master Algorithm, is a pretty good read.

* The 5 tribes and their master algorithms:

- symbolists - inverse deduction
- connectionists - backpropagation
- evolutionists - genetic programming
- Bayesians - Bayesian inference
- analogizers - the support vector machine

by Andrie de Vries

The second week of SQLRelay (#SQLRelay) kicked off in London earlier this week. SQLRelay is a series of conferences spanning 10 cities in the United Kingdom over two weeks. The London agenda included 4 different streams, with tracks for DBA, BI and analytics users, as well as a workshop track with two separate tutorials.

My speaking slot was in the afternoon, with the title "In-database analytics using Revolution R and SQL".

In my talk I covered:

- A high level overview of R
- Data science in the cloud
- Connecting R to SQL
- Scalable R
- R in SQL Server
- Moving your workflow to the cloud

Although the ability to run R directly inside SQL Server will only arrive with SQL Server 2016, Microsoft announced earlier this year that SQL Server 2016 will incorporate Revolution Analytics technology. I expect that more information will be released during the PASS 2015 Summit in Seattle at the end of this month.

In my talk I included 5 simple demonstrations. The first 3 demonstrations appeal to the data scientist coding in R:

- Connecting R to SQL Server using an RODBC connector
- Using Revolution R Enterprise (RRE) in a local parallel compute context, reading data from a local file
- Changing the compute context to SQL Server, and running the R code directly inside the SQL Server machine
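The first of these demos can be sketched in a few lines of R using the RODBC package; the DSN name, table and query here are hypothetical placeholders, not the ones used in the talk:

```r
library(RODBC)

# Open a connection to SQL Server; "MySqlServerDSN" is a placeholder for
# whatever ODBC data source name is configured on your machine.
ch <- odbcConnect("MySqlServerDSN")

# Pull the result of an ordinary SQL query into an R data frame
dat <- sqlQuery(ch, "SELECT TOP 100 * FROM dbo.Flights")
head(dat)

# Always close the channel when done
odbcClose(ch)
```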

The last two demonstrations show how to run R code embedded in a SQL stored procedure:

- Creating a very simple script that calls out to R
- Using R to generate some data, in this case simply bringing some data in from the famous iris data set that is built into R.

The presentation is available on SlideShare:

Here are the code samples I used in the demonstration:

*by Bob Horton, Microsoft Senior Data Scientist*

Learning curves are an elaboration of the idea of validating a model on a test set, and have been widely popularized by Andrew Ng’s Machine Learning course on Coursera. Here I present a simple simulation that illustrates this idea.

Imagine you use a sample of your data to train a model, then use the model to predict the outcomes on data where you know what the real outcome is. Since you know the “real” answer, you can calculate the overall error in your predictions. The error on the same data set used to train the model is called the *training error*, and the error on an independent sample is called the *validation error*.

A model will commonly perform better (that is, have lower error) on the data it was trained on than on an independent sample. The difference between the training error and the validation error reflects *overfitting* of the model. Overfitting is like memorizing the answers for a test instead of learning the principles (to borrow a metaphor from the Wikipedia article). Memorizing works fine if the test is exactly like the study guide, but it doesn’t work very well if the test questions are different; that is, it doesn’t generalize. In fact, the more a model is overfitted, the higher its validation error is likely to be. This is because the spurious correlations the overfitted model memorized from the training set most likely don’t apply in the validation set.

Overfitting is usually more extreme with small training sets. In large training sets the random noise tends to average out, so that the underlying patterns are more clear. But in small training sets, there is less opportunity for averaging out the noise, and accidental correlations consequently have more influence on the model. Learning curves let us visualize this relationship between training set size and the degree of overfitting.

We start with a function to generate simulated data:

```
sim_data <- function(N, noise_level = 1){
  X1 <- sample(LETTERS[1:10], N, replace = TRUE)
  X2 <- sample(LETTERS[1:10], N, replace = TRUE)
  X3 <- sample(LETTERS[1:10], N, replace = TRUE)
  y <- 100 + ifelse(X1 == X2, 10, 0) + rnorm(N, sd = noise_level)
  data.frame(X1, X2, X3, y)
}
```

The input columns X1, X2, and X3 are categorical variables which each have 10 possible values, represented by the capital letters `A` through `J`. The outcome is cleverly named `y`; it has a base level of 100, but if the values in the first two `X` variables are equal, this is increased by 10. On top of this we add some normally distributed noise. Any other pattern that might appear in the data is accidental.

Now we can use this function to generate a simulated data set for experiments.

```
set.seed(123)
data <- sim_data(25000, noise_level=10)
```

There are many possible error functions, but I prefer the root mean squared error:

`rmse <- function(actual, predicted) sqrt( mean( (actual - predicted)^2 ))`
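As a quick sanity check of this one-liner on toy vectors (restated here so the snippet is self-contained):

```r
rmse <- function(actual, predicted) sqrt( mean( (actual - predicted)^2 ))

# A perfect prediction has zero error; being off by 2 on one of
# four points gives sqrt(mean(c(0, 0, 0, 4))) = 1.
rmse(c(1, 2, 3, 4), c(1, 2, 3, 4))  # 0
rmse(c(1, 2, 3, 4), c(1, 2, 3, 6))  # 1
```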

To generate a learning curve, we fit models at a series of different training set sizes, and calculate the training error and validation error for each model. Then we will plot these errors against the training set size. Here the parameters are a model formula, the data frame of simulated data, the validation set size (vss), the number of different training set sizes we want to plot, and the smallest training set size to start with. The largest training set will be all the rows of the dataset that are not used for validation.

```
run_learning_curve <- function(model_formula, data, vss=5000, num_tss=30, min_tss=1000){
  library(data.table)
  max_tss <- nrow(data) - vss
  tss_vector <- seq(min_tss, max_tss, length=num_tss)
  data.table::rbindlist( lapply(tss_vector, function(tss){
    vs_idx <- sample(1:nrow(data), vss)
    vs <- data[vs_idx,]
    ts_eligible <- setdiff(1:nrow(data), vs_idx)
    ts <- data[sample(ts_eligible, tss),]
    fit <- lm(model_formula, ts)
    training_error <- rmse(ts$y, predict(fit, ts))
    validation_error <- rmse(vs$y, predict(fit, vs))
    data.frame(tss=tss,
               error_type = factor(c("training", "validation"),
                                   levels=c("validation", "training")),
               error=c(training_error, validation_error))
  }) )
}
```

We’ll use a formula that considers all combinations of the input columns. Since these are categorical inputs, they will be represented by dummy variables in the model, with each combination of variable values getting its own coefficient.

`learning_curve <- run_learning_curve(y ~ X1*X2*X3, data)`

With this example, you get a series of warnings:

```
## Warning in predict.lm(fit, vs): prediction from a rank-deficient fit may be
## misleading
```

This is R trying to tell you that you don’t have enough rows to reliably fit all those coefficients. In this simulation, training set sizes above about 7500 don’t trigger the warning, though as we’ll see the curve still shows some evidence of overfitting.

```
library(ggplot2)
ggplot(learning_curve, aes(x=tss, y=error, linetype=error_type)) +
  geom_line(size=1, col="blue") + xlab("training set size") +
  geom_hline(yintercept=10, linetype=3)
```

In this figure, the X-axis represents different training set sizes and the Y-axis represents error. Validation error is shown in the solid blue line on the top part of the figure, and training error is shown by the dashed blue line in the bottom part. As the training set sizes get larger, these curves converge toward a level representing the amount of irreducible error in the data. This plot was generated using a simulated dataset where we know exactly what the irreducible error is; in this case it is the standard deviation of the Gaussian noise we added to the output in the simulation (10; the root mean squared error is essentially the same as standard deviation for reasonably large sample sizes). We don’t expect any model to reliably fit this error since we know it was completely random.
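As a quick check of that last claim, for a large sample of pure Gaussian noise the RMSE around the true mean is essentially the standard deviation of the noise:

```r
set.seed(42)
noise <- rnorm(100000, mean = 0, sd = 10)

# RMSE of the trivial model that always predicts the true mean (0);
# for large samples this converges to the noise standard deviation, ~10.
sqrt(mean((noise - 0)^2))
```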

One interesting thing about this simulation is that the underlying system is very simple, yet it can take many thousands of training examples before the validation error of this model gets very close to optimum. In real life, you can easily encounter systems with many more variables, much higher cardinality, far more complex patterns, and of course lots and lots of those unpredictable variations we call “noise”. You can easily encounter situations where truly enormous numbers of samples are needed to train your model without excessive overfitting. On the other hand, if your training and validation error curves have already converged, more data may be superfluous. Learning curves can help you see if you are in a situation where more data is likely to be of benefit for training your model better.

by John Mount (more articles) and Nina Zumel (more articles).

In this article we conclude our four part series on basic model testing. When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it's better than the models that you rejected? In this concluding Part 4 of our four part mini-series "How do you know if your model is going to work?" we demonstrate cross-validation techniques. Previously we worked on:

Cross-validation techniques attempt to improve statistical efficiency by repeatedly splitting the data into training and test sets and re-performing model fitting and model evaluation. For example, the variation called k-fold cross-validation splits the original data into k roughly equal-sized sets. To score each set we build a model on all data not in the set and then apply the model to our set. This means we build k different models (none of which is our final model, which is traditionally trained on all of the data).
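A minimal base-R sketch of the k-fold procedure just described, using lm() on the built-in mtcars data (the choice of model and data is ours, purely for illustration):

```r
k <- 5
set.seed(2015)

# Assign each row of mtcars to one of k roughly equal-sized folds
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# For each fold: fit on the other k-1 folds, score on the held-out fold
fold_rmse <- sapply(1:k, function(i){
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit <- lm(mpg ~ wt + hp, data = train)
  sqrt(mean((test$mpg - predict(fit, test))^2))
})

# The cross-validated error estimate averages the k held-out scores
mean(fold_rmse)
```

Note that five different models are fit here, and none of them is the model you would ship; the average held-out RMSE estimates how well the *fitting procedure* performs.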

This is statistically efficient, as each model is trained on a (1 - 1/k) fraction of the data; for k=20 we are using 95% of the data for training. Another variation, called "leave-one-out" (essentially jackknife resampling), is even more statistically efficient, as each datum is scored by a unique model built from all of the other data. However, it is very computationally inefficient, since you must construct a very large number of models (except in special cases, such as the PRESS statistic for linear regression).

Statisticians tend to prefer cross-validation techniques to a test/train split, as cross-validation is more statistically efficient and can give sampling-distribution-style distributional estimates (instead of mere point estimates). However, remember that cross-validation techniques measure facts *about the fitting procedure* and *not about the actual model in hand* (so they answer a different question than a test/train split). There is some attraction to actually scoring the model you are going to turn in (as is done with in-sample methods and test/train split, but not with cross-validation). The way to remember this is: bosses are essentially frequentist (they want to know that their team and procedures tend to produce good models) and employees are essentially Bayesian (they want to know that the actual model they are turning in is likely good; see here for how the nature of the question you are trying to answer determines whether you are in a Bayesian or frequentist situation).

To read more: Win-Vector - How do you know if your model is going to work? Part 4: Cross-validation techniques

by Joseph Rickert

We are very pleased to announce that Microsoft will not only continue Revolution Analytics' tradition of supporting R user groups worldwide, but is expanding the scope of the user group program. The new 2016 Microsoft Data Science User Group Sponsorship Program is open to all user groups that are passionate about open-source data science technologies. If your group is focused on R, Python, Apache Hadoop or some other vital data science technology, you may qualify for the Microsoft program. The major criteria for participation are that you have a public web presence, that you hold meetings on a regular basis, and that you can demonstrate your commitment to furthering and contributing to your particular corner of the open-source data science community.

**Benefits of participation**

Among the benefits of participating in the Microsoft Data Science User Group Sponsorship Program are:

- A shipment of goodies that you can give to your members or use as door prizes, or however you believe they will best support the success of your group.
- Promotion of your group’s activities to help get the word out to other like-minded data scientists in your area.
- Access to Microsoft subject matter and technology experts to speak at your meetups.
- The opportunity to hold user group meetings at Microsoft facilities around the world.

You will also benefit by being part of a network of like-minded data scientists that span the globe.

**Requirements for Participation**

In return for your group’s participation in the program we ask:

- That you list Microsoft as a sponsor on your group’s website and that you acknowledge our sponsorship at your meetings.
- That, once a month, Microsoft be able to deliver an email message to your members.

**Apply for the Program**

When you are ready to apply it is only a matter of filling out the application form.

We are looking forward to working with you and helping you to make a difference in the open-source data science world. Please help us to make this program a success, and let us know how you think we can work together to make things happen.