The field of neuroscience -- the study of brains and the nervous system -- has taken some major leaps in recent years. Scientists can now gather real-time electrical activity from the brain during actions and thoughts, which is helping to pinpoint the exact location of brain lesions caused by strokes, and is leading to promising treatments for epilepsy and even profound paralysis. Joseph Sirosh describes these advances in a keynote presented at Strata Hadoop World last week:

In the video, Dr. Kai Miller, Neurosurgery Resident at Stanford University, described an ingenious experiment designed to link brain activity to perception. In the experiment, several epilepsy patients were shown a series of images, each of which was either a house or a face. Simultaneously, electrical activity on the brain surface was measured by 64 separate brain sensors. The goal is to create a model from the brain sensor data that can accurately predict what each patient is seeing: a face or a house.

You can try creating such a model yourself in the Azure Machine Learning competition, Decoding Brain Signals. To enter the competition, you'll need to train a model on the competition data, and have it accurately predict the images seen by other patients in the study (whose data remains hidden from all participants). You can use the built-in Azure ML Studio machine learning modules, or you can build your model entirely using R and R packages (this Tutorial using R explains the process).

Your submission will be ranked against the other participants according to prediction accuracy. As of this writing, the best model has a 73.75% accuracy rate. If your model can do better than that, and remains the best model when the competition closes on July 1, you could win $3,000 in prize money. (Second place gets $1,500 and third place gets $500.) Note that you'll need a free Microsoft Azure account to participate, and there is no charge for training, validating or submitting your competition models. For more information on the competition and how to submit a model, follow the link below.

Cortana Intelligence Gallery: Competition: Decoding Brain Signals

If you're new to the concept of predictive models, or just want to review the background on how data scientists learn from past data to predict the future, you may be interested in my talk from the Data Insights Summit, Introduction to Real-Time Predictive Modeling.

In the talk above I gave a brief introduction to the R language and mentioned several applications using R. If you'd like to get started with R, you might like to follow along with my co-blogger Joseph Rickert's beginners workshop, Supercharge Your Data Analysis With R:

You can follow along with Joe's session by downloading Microsoft R Open and using the scripts in this GitHub repository.

March Madness is upon us here in the US. This annual college basketball competition pits 64 teams in a single-elimination tournament, and the team that goes undefeated for all 6 rounds will be named NCAA Champion.

Predicting the winners of the competition, and in particular completing a "bracket" of the teams you predict to make it to the final 32 or 16 and eventually win, is a popular pastime (and foundation for many wagers). Some use their knowledge of the teams or the betting markets to select their bracket. And some, like 47-year-old English data scientist and top-ranked Kaggler Amanda Schierz, use data, models, and R. Watch her story in this (sadly, unembeddable) video from ESPN and FiveThirtyEight.

If you'd like to try your hand at your own predictions based on machine learning, Azure ML (part of the Cortana Analytics suite) provides all the data, algorithms, and R and Python support you need. Here at Microsoft we've run internal March Madness competitions every year, and in the video below last year's winner Damon Hachmeister shares his secrets.

There's even a March Madness Prediction Service published on the Cortana Analytics Gallery which, given the current state of the competition as a Web Service input, will provide predictions for the remaining games as an output.

Got more tips for predicting a bracket with Machine Learning? Share them in the comments below.

by Joseph Rickert

In a post late last year, my colleague and fellow blogger Andrie de Vries described enhancements to the AzureML R package that make it easy to publish R functions that consume data frames as Azure Web Services. A very nice consequence is that it is now feasible to develop predictive models in R and enable the Excel-powered business analysts in your organization to use your model to generate predictions with new data. This is made possible by an Azure feature that integrates a published web service into an Excel workbook. Once you publish your R model as a web service and set up the Excel workbook, anybody you give the workbook to will be able to score new data copied into it.

Now, I'll walk through the steps required assuming you have already set up an Azure ML account. The AzureML package vignette gives a detailed example of publishing a new model. For convenience, I reproduce the necessary code from the vignette here:
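Since the code block itself didn't survive the transfer to this page, the sketch below reconstructs the essential steps along the lines of the AzureML package vignette. The exact gbm settings and the service name are illustrative, and you must supply your own workspace ID and authorization token:

```r
library(AzureML)
library(MASS)   # for the Boston housing data
library(gbm)

# Fit a gbm model predicting median home value (medv) from the other features
set.seed(123)
gbm1 <- gbm(medv ~ ., data = Boston,
            distribution = "gaussian",
            n.trees = 5000,
            interaction.depth = 8,
            shrinkage = 0.01,
            cv.folds = 5)
best.iter <- gbm.perf(gbm1, method = "cv")   # best number of trees by CV

# Prediction function that consumes a data frame of new observations.
# require(gbm) ensures the gbm package is loaded in the Azure environment.
mypredict <- function(newdata) {
  require(gbm)
  predict(gbm1, newdata, best.iter)
}

# "Test" the function on the first five rows of the Boston data
test <- Boston[1:5, ]
mypredict(test)

# Publish as an Azure web service (requires your own workspace credentials)
ws <- workspace(id = "your workspace id", auth = "your auth token")
ep <- publishWebService(ws, fun = mypredict,
                        name = "AzureML-vignette-gbm",
                        inputSchema = test)
```

The service name you choose in publishWebService() is the one that will appear on the web services page in the next step.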

The first part of the code fits a generalized boosted regression model (gbm) to the Boston data set from the MASS package, which contains features characterizing housing values for suburban Boston. The prediction function, mypredict(), is set up to take in a data frame containing new data and use the gbm model to predict the median value of owner-occupied homes. Notice that the function includes the statement require(gbm). This ensures that the Azure environment will have access to the gbm package when making predictions.

The rest of the code "tests" the prediction using a data frame containing the first five lines of the Boston data set and then publishes the prediction function as a web service using the function publishWebService().

Once you have gotten this far there are only a few more steps to set up the Excel workbook. Log in to your Azure Machine Learning account and go to the web services page. You should see something like this:

Notice that the name used in the publishWebService() call appears on the list of available services. Clicking on this will bring you to a page like the next one. Go ahead and select Excel 2013 or later workbook in the REQUEST/RESPONSE row.

This should bring you to the Sample Data screen below that pretty much explains what you are about to do.

Now we're almost there. A couple more clicks will bring you to an empty workbook like the one below. To get this screen I manually pasted in test data from the Boston file.

Then I used the input boxes on the right to set the range for the input data and to select the range of cells where the output will be placed.

Once the file is built you can distribute it to your colleagues so they can begin making predictions. You can download my Excel file here: Download AzureML-vignette-gbm-3_15_2016 12_13_33 AM.

*by Daniel Moore, Director of Applied Statistics Engineering, Console Development, Microsoft*

In Xbox Hardware, we are interested in the various ways that our hardware is used, and we are especially interested in how that usage changes over time. We employ several time series analysis techniques that are helpful in getting a holistic view of usage of the Xbox console. The actual data we look at could be usage of an individual game or app, or usage of specific features of the console. For all of these, there are a few parts of the time series that we are interested in. The first is the trend of usage over time. Many games are very popular when they are first released and then lose popularity as they age. Some games remain steady in their popularity. The consoles themselves may show increased usage during holiday periods. All of these would be reflected in the overall trend of the data. We are also interested in the weekly cycle of usage. As you can imagine, use of a gaming console goes up on the weekend and down during the week. This is probably even more so among children than adults, as they have more time (and permission!) to play on non-school days than they do on school days.

We use R extensively to perform time series analysis. In this post we'll explore the initial analysis and decomposition of the time series into its component parts. Though some packages offer more complete time series analysis options, the base version of R has some good built-in features for this initial analysis. The data for this example is the usage of a single game over more than a year on the Xbox One. The analysis is done with the code below:

data <- read.csv("dataset.csv")       # load the data set
plot(data, type="l")                  # plot the data first to get a look at it: a line plot showing weekly periodicity over a trend
tsdata <- ts(data, start=1, freq=7)   # define a time series from the data, with the first observation set at 1 and a weekly frequency
decomposeddata <- stl(tsdata, s.window=7)
plot(decomposeddata)                  # four panes of line charts: the data, the sinusoidal seasonal fluctuations, the trend, and the remainder

The object “decomposeddata” is of class “stl” and has several components useful for time series analysis. stl() uses loess to decompose the time series, and many smoothing and other settings are available, depending on specific needs and the analysis being performed.
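For reference, the decomposed components can be pulled straight out of the stl object's time.series matrix. A small self-contained sketch with simulated data (the real post uses Xbox usage data that isn't public):

```r
# Simulate 20 weeks of daily usage: an upward trend plus a weekend bump
set.seed(42)
days  <- 1:140
usage <- 100 + 0.5 * days +                 # trend
         15 * (((days - 1) %% 7) >= 5) +    # weekend bump
         rnorm(140, sd = 3)                 # noise

tsdata         <- ts(usage, start = 1, frequency = 7)
decomposeddata <- stl(tsdata, s.window = 7)

# The components live in the time.series matrix
head(decomposeddata$time.series)
seasonal  <- decomposeddata$time.series[, "seasonal"]
trend     <- decomposeddata$time.series[, "trend"]
remainder <- decomposeddata$time.series[, "remainder"]

# By construction, the three components sum back to the original series
all.equal(as.numeric(seasonal + trend + remainder), usage)
```

This additivity is what makes it easy to inspect, say, just the seasonal component, as in the chart discussed below.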

To highlight something that can be illuminated with this, look at the seasonal component of the chart below.

You’ll notice a period (highlighted) where the seasonal fluctuations dropped significantly. This is the summer time when school is out and weekdays/weekends blend together for kids. It’s always fun to find those artifacts in the data!

There's more to Iowa than just today's presidential caucuses. Last month, the Central Iowa R User Group hosted Dr. Max Kuhn, Director of Non-Clinical Statistics at Pfizer Global R&D, via video-chat to present on Applied Predictive Modeling with R. Max is the co-author of the excellent book Applied Predictive Modeling (read our review here), and in the presentation he covers many of the topics from the book in a brisk 75 minutes.

Unlike most statistics courses, which focus on the inferential side of models, Max's talk focuses on creating statistical models where the goal is prediction. Given his background at Pfizer, the talk includes analysis of some interesting datasets, including one to predict the performance of an algorithm used to identify cell components (the cell wall, nucleus, etc.) in a microscopy slide. (The data is public, so you can recreate the analyses using this R code and the caret package.)

In addition to an overview of predictive modeling in general, you'll learn how to use resampling techniques in R to select tuning parameters in a model (for example, the number of trees to use in a random forest), and how to evaluate classification performance using the confusion matrix and ROC curves. Watch the presentation below:

If you find the slides a little small to read, you can download them here and follow along.
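The workflow Max describes, resampling to choose tuning parameters and a confusion matrix to evaluate the result, maps directly onto the caret package. A minimal sketch on the built-in iris data (my own toy example, assuming the caret and randomForest packages are installed):

```r
library(caret)

set.seed(100)
# Resampling-based tuning: 5-fold cross validation over a few values of the
# random forest mtry parameter; caret keeps the best-performing one
fit <- train(Species ~ ., data = iris,
             method = "rf",
             trControl = trainControl(method = "cv", number = 5),
             tuneLength = 3)
fit$bestTune   # the mtry value chosen by cross validation

# Evaluate classification performance with a confusion matrix
pred <- predict(fit, iris)
confusionMatrix(pred, iris$Species)
```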

Applied Predictive Modeling: Central Iowa R User Group Talk

You may have heard that R and the big-data RevoScaleR package have been integrated with SQL Server 2016 as SQL Server R Services. If you've been wanting to try out R with SQL Server but haven't been sure where to start, a new MSDN tutorial will take you through all the steps of creating a predictive model: from obtaining data for analysis, to building a statistical model, to creating a stored procedure to make predictions from the model. To work through the tutorial, you'll need a suitable Windows server on which to install the SQL Server 2016 Community Technology Preview, and make sure you have SQL Server R Services installed. You'll also need a separate Windows machine (say a desktop or laptop) where you'll install Revolution R Open and Revolution R Enterprise. Most of the computations happen in SQL Server, though, so this "data science client machine" doesn't need to be as powerful.

The tutorial is made up of five lessons, which together should take you about 90 minutes to run through. If you run into problems, each lesson includes troubleshooting tips at the end.

Lesson 1 begins with downloading the New York City taxi data set (which was also used to create these beautiful data visualizations) and loading it into SQL Server. You'll also set up R to include some useful packages such as ggmap and RODBC.

Lesson 2 starts by having you verify the data using SQL queries. Don't miss the "Next Steps" links near the end, where you'll summarize the data using the RevoScaleR package on the data science client machine, and then visualize the data as a map with the ggmap package (as shown below).

Lesson 3 focuses on using R to augment the data with new features, such as calculating the distance between pickup and dropoff points using a custom R function or using T-SQL.
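The tutorial uses its own distance calculation, but a great-circle ("as the crow flies") distance between pickup and dropoff points is easy to express as a custom R function. A haversine sketch (the function name here is my own, not the tutorial's):

```r
# Haversine great-circle distance in kilometers between two lat/lon points
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))   # pmin guards against rounding past 1
}

# Example: Times Square to JFK airport, roughly 22 km
haversine_km(40.7580, -73.9855, 40.6413, -73.7781)
```

Because the body is vectorized, the same function can compute the distance column for an entire data frame of trips in one call.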

Lesson 4 is where you'll use the rxLogit function to train a logistic regression model to predict the probability of a driver receiving a tip for a ride, evaluate the model using ROC curves, and then deploy the prediction into SQL Server as a T-SQL stored procedure.
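If you'd like to prototype the shape of Lesson 4's model outside SQL Server, rxLogit uses the same formula interface as base R's glm. A toy sketch on simulated data (the column names here are made up for illustration, not the tutorial's taxi schema):

```r
set.seed(1)
n <- 1000
trips <- data.frame(
  passenger_count = sample(1:4, n, replace = TRUE),
  trip_distance   = rexp(n, rate = 0.3)
)

# Simulate a tipped/not-tipped outcome that depends on trip distance
p <- plogis(-1 + 0.4 * trips$trip_distance)
trips$tipped <- rbinom(n, 1, p)

# Logistic regression; rxLogit(tipped ~ passenger_count + trip_distance, data = ...)
# is the RevoScaleR analogue of this glm call
fit <- glm(tipped ~ passenger_count + trip_distance,
           data = trips, family = binomial)

# Predicted tip probability for a new trip
prob <- predict(fit, newdata = data.frame(passenger_count = 2, trip_distance = 5),
                type = "response")
prob
```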

Lesson 5 wraps things up by showing how to use the deployed model in a production environment, both by calculating predictions from a stored dataset in batch mode, and by performing transactional predictions one trip at a time.

To save on cutting-and-pasting, you can find all of the code used in the tutorial on Github. Give it a go, and before long you'll have your own R models running live in SQL Server.

MSDN: End-to-End Data Science Walkthrough: Overview (SQL Server R Services)

by Joseph Rickert

If there is anything that experienced machine learning practitioners are likely to agree on, it would be the importance of careful and thoughtful feature engineering. The judicious selection of which predictor variables to include in a model often has a more beneficial effect on overall classifier performance than the choice of the classification algorithm itself. This is one reason why classification algorithms that automatically include feature selection such as glmnet, gbm or random forests top the list of “go to” algorithms for many practitioners.

There are occasions, however, when you find yourself for one reason or another committed to a classifier that doesn't automatically narrow down the list of predictor variables, and some sort of automated feature selection might seem like a good idea. If you are an R user, then the caret package offers a whole lot of machinery that might be helpful. Caret offers both filter methods and wrapper methods, the latter including recursive feature elimination, genetic algorithms (GAs) and simulated annealing. In this post, we will have a look at a small experiment with caret's GA option. But first, a little background.

Performing feature selection with GAs requires conceptualizing the process of feature selection as an optimization problem and then mapping it to the genetic framework of random variation and natural selection. Individuals from a given generation of a population mate to produce offspring who inherit genes (chromosomes) from both parents. Random mutation alters a small part of a child's genetic material. The children of this new generation who are genetically most fit produce the next generation. In the feature selection context, individuals become solutions to a prediction problem. Chromosomes (sequences of genes) are modeled as vectors of 1's and 0's, with a 1 indicating the presence of a feature and a 0 its absence. The simulated genetic algorithm then does the following: it selects two individuals, randomly chooses a split point for their chromosomes, maps the front of one chromosome to the back of the other (and vice versa), and then randomly mutates the resulting chromosomes according to some predetermined probability.

In their book Applied Predictive Modeling, Kuhn and Johnson provide the following pseudo code for caret's GA:

- Define stopping criteria, population size, P, for each generation, and mutation probability, p_m
- Randomly generate an initial population of chromosomes
- repeat:
- | for each chromosome do
- |   Tune and train a model and compute each chromosome's fitness
- | end
- | for each reproduction 1 ... P/2 do
- |   Select 2 chromosomes based on fitness
- |   Crossover: randomly select a locus and exchange genes on either side of the locus
- |     (head of one chromosome applied to tail of the other and vice versa)
- |     to produce 2 child chromosomes with mixed genes
- |   Mutate the child chromosomes with probability p_m
- | end
- until the stopping criteria are met

If 10 fold cross validation is selected in the GA control procedure, then the entire genetic algorithm (steps 2 through 13) is run 10 times.
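The crossover and mutation steps in the pseudo code can be illustrated with a few lines of R on binary chromosome vectors. This is a toy sketch of the idea, not caret's internal implementation:

```r
set.seed(7)
p   <- 10                    # number of candidate features
mom <- rbinom(p, 1, 0.5)     # parent chromosomes: 1 = feature in, 0 = feature out
dad <- rbinom(p, 1, 0.5)

# Crossover: pick a random locus and swap the tails of the two chromosomes
locus  <- sample(1:(p - 1), 1)
child1 <- c(mom[1:locus], dad[(locus + 1):p])
child2 <- c(dad[1:locus], mom[(locus + 1):p])

# Mutation: flip each gene independently with small probability p_m
p_m  <- 0.1
flip <- rbinom(p, 1, p_m) == 1
child1[flip] <- 1 - child1[flip]

child1   # still a length-10 vector of 0s and 1s encoding a feature subset
```

In caret's gafs(), the "fitness" of each such chromosome is the resampled performance of a model trained on the features it encodes.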

Now, just for fun, I'll conduct the following experiment to see if GA feature selection will improve on the performance of the support vector machine model featured in a previous post. As before, I use the segmentationData data set that is included in the caret package and described in the paper by Hill et al. This data set has 2,019 rows and 58 possible feature variables. The structure of the experiment is as follows:

- Divide the data into a training and test data sets.
- Run the GA feature selection algorithm on the training data set to produce a subset of the training set with the selected features.
- Train the SVM algorithm on this subset.
- Assess the performance of the SVM model using the subset of the test data that contains the selected features.
- Train the SVM model on the entire training data set.
- Assess the performance of this second SVM model using the test data set.
- Compare the performance of the two SVM models.

(Note that I have cut corners here by training the SVM on the same data that was used in the GA. A methodological improvement would be to divide the original data into three sets.)

This first block of code sets up the experiment and divides the data into training and test data sets.

library(caret)
library(doParallel)   # parallel processing
library(dplyr)        # used by caret
library(pROC)         # plot the ROC curve

### Use the segmentationData from caret
# Load the data and construct indices to divide it into training and test data sets.
set.seed(10)
data(segmentationData)   # load the segmentation data set
dim(segmentationData)
head(segmentationData, 2)

trainIndex <- createDataPartition(segmentationData$Case, p=.5, list=FALSE)
trainData <- segmentationData[trainIndex, -c(1,2)]
testData  <- segmentationData[-trainIndex, -c(1,2)]

trainX <- trainData[,-1]   # create training feature data frame
testX  <- testData[,-1]    # create test feature data frame
y <- trainData$Class       # target variable for training

Next, run the GA. Note that the first line of code registers the parallel workers so that caret can run the GA in parallel using the 4 cores on my Windows laptop.

registerDoParallel(4)   # register a parallel backend for train
getDoParWorkers()       # check that there are 4 workers

ga_ctrl <- gafsControl(functions = rfGA,     # assess fitness with RF
                       method = "cv",        # 10 fold cross validation
                       genParallel = TRUE,   # use parallel programming
                       allowParallel = TRUE)

set.seed(10)
lev <- c("PS","WS")   # set the levels

system.time(
  rf_ga3 <- gafs(x = trainX, y = y,
                 iters = 100,    # 100 generations of the algorithm
                 popSize = 20,   # population size for each generation
                 levels = lev,
                 gafsControl = ga_ctrl)
)

The parameter settings of the gafsControl() function indicate that the internally implemented random forests model and 10 fold cross validation are to be used to assess performance of the "chromosomes" in each generation. The parameters for the gafs() function itself specify 100 generations of populations consisting of 20 individuals. Ideally, it would be nice to let the algorithm run for more iterations with larger populations and perhaps repeated 10 fold CV. However, even these modest parameter settings generate a tremendous number of calculations. (The algorithm took 4 hours to complete on a small Azure VM.)

The results below, and the plot that follows them, show that the best performance was achieved at iteration 9 with a subset of 44 variables.

rf_ga3

1010 samples
58 predictors
2 classes: 'PS', 'WS'

Maximum generations: 100
Population per generation: 20
Crossover probability: 0.8
Mutation probability: 0.1
Elitism: 0

Internal performance values: Accuracy, Kappa
Subset selection driven to maximize internal Accuracy

External performance values: Accuracy, Kappa
Best iteration chose by maximizing external Accuracy
External resampling method: Cross-Validated (10 fold)

During resampling:
  * the top 5 selected variables (out of a possible 58):
    DiffIntenDensityCh4 (100%), EntropyIntenCh1 (100%), EqEllipseProlateVolCh1 (100%),
    EqSphereAreaCh1 (100%), FiberWidthCh1 (100%)
  * on average, 39.3 variables were selected (min = 25, max = 52)

In the final search using the entire training set:
  * 44 features selected at iteration 9 including:
    AvgIntenCh1, AvgIntenCh4, ConvexHullAreaRatioCh1, ConvexHullPerimRatioCh1, EntropyIntenCh3 ...
  * external performance at this iteration is

  Accuracy  Kappa
    0.8406 0.6513

plot(rf_ga3)   # plot mean fitness (AUC) by generation

The plot also shows the average internal accuracy estimates as well as the average external estimates calculated from the 10 out of sample predictions.

This next section of code sets up the training control and trains an SVM using only the selected features; again, using the parallel computing feature built into caret.

final   <- rf_ga3$ga$final   # get features selected by GA
trainX2 <- trainX[, final]   # training data: selected features
testX2  <- testX[, final]    # test data: selected features

## SUPPORT VECTOR MACHINE MODEL
# Note: the default method of picking the best model is accuracy and Cohen's Kappa

# Set up training control
ctrl <- trainControl(method = "repeatedcv",   # 10 fold cross validation
                     repeats = 5,             # do 5 repetitions of cv
                     summaryFunction = twoClassSummary,   # use AUC to pick the best model
                     classProbs = TRUE)

# Use expand.grid to specify a search space
# (Note: this grid of gbm tuning parameters is left over from an earlier model and is
#  not passed to train() below; the svmRadial fit tunes via tuneLength instead.)
grid <- expand.grid(interaction.depth = seq(1,4,by=2),   # tree depths from 1 to 4
                    n.trees = seq(10,100,by=10),         # let iterations go from 10 to 100
                    shrinkage = c(0.01,0.1),             # try 2 values for the learning rate
                    n.minobsinnode = 20)

# Set up for parallel processing
set.seed(1951)
registerDoParallel(4, cores=4)

# Train and tune the SVM
svm.tune <- train(x = trainX2,
                  y = trainData$Class,
                  method = "svmRadial",
                  tuneLength = 9,   # 9 values of the cost parameter
                  preProc = c("center","scale"),
                  metric = "ROC",
                  trControl = ctrl)

Finally, assess the performance of the model using the test data set.

# Make predictions on the test data with the SVM model
svm.pred <- predict(svm.tune, testX2)
confusionMatrix(svm.pred, testData$Class)

Confusion Matrix and Statistics

          Reference
Prediction  PS  WS
        PS 560 106
        WS  84 259

               Accuracy : 0.8117
                 95% CI : (0.7862, 0.8354)
    No Information Rate : 0.6383
    P-Value [Acc > NIR] : <2e-16

                  Kappa : 0.5868
 Mcnemar's Test P-Value : 0.1276

            Sensitivity : 0.8696
            Specificity : 0.7096
         Pos Pred Value : 0.8408
         Neg Pred Value : 0.7551
             Prevalence : 0.6383
         Detection Rate : 0.5550
   Detection Prevalence : 0.6601
      Balanced Accuracy : 0.7896

       'Positive' Class : PS

svm.probs <- predict(svm.tune, testX2, type="prob")   # generate probabilities for the ROC curve
svm.ROC <- roc(predictor = svm.probs$PS,
               response = testData$Class,
               levels = rev(levels(testData$Class)))
svm.ROC   # Area under the curve: 0.8881
plot(svm.ROC, main="ROC for SVM built with GA selected features")

Voilà - a decent model, but at significant computational cost and, as it turns out, with no performance improvement! If the GA feature selection procedure is just omitted, and the SVM is fit using the training and tuning procedure described above with the same seed settings it will produce a model with an AUC of 0.8888.

In the "glass half empty" interpretation of the experiment, I did a lot of work for nothing. On the other hand, looking at the "glass half full": using automatic GA feature selection, I was able to build a model that achieved the same performance as the full model but with 24% fewer features. My guess, though, is that I just got lucky this time. I think the takeaway is that this kind of automated feature selection might be worthwhile if there is a compelling reason to reduce the number of features, perhaps at the expense of a decrease in performance. However, unless you know that you are dealing with a classifier that is easily confused by irrelevant predictor variables, there is no reason to expect that a GA, or any other "wrapper style" feature selection method, will improve performance.

by Nina Zumel

Principal Consultant, Win-Vector LLC

We've just finished off a series of articles on some recent research results applying differential privacy to improve machine learning. Some of these results are pretty technical, so we thought it was worth working through concrete examples. And some of the original results are locked behind academic journal paywalls, so we've tried to touch on the highlights of the papers, and to play around with variations of our own.

- **A Simpler Explanation of Differential Privacy**: A quick explanation of epsilon-differential privacy, and an introduction to an algorithm for safely reusing holdout data, recently published in *Science* (Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth, "The reusable holdout: Preserving validity in adaptive data analysis", *Science*, vol. 349, no. 6248, pp. 636-638, August 2015). Note that Cynthia Dwork, one of the inventors of differential privacy, originally used it in the analysis of sensitive information.
- **Using differential privacy to reuse training data**: Specifically, how differential privacy helps you build efficient encodings of categorical variables with many levels from your training data without introducing undue bias into downstream modeling.
- **A simple differentially private procedure**: The bootstrap as an alternative to Laplace noise for introducing differential privacy.
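As a toy illustration of the Laplace noise mechanism that the bootstrap procedure is compared against (my own sketch, not the Win-Vector code): a count query has sensitivity 1 (one person changes the count by at most 1), so adding Laplace noise with scale 1/epsilon to the true count yields an epsilon-differentially private release.

```r
# Laplace mechanism for releasing a count with epsilon-differential privacy
rlaplace <- function(n, scale) {
  # the difference of two independent exponentials is Laplace-distributed
  rexp(n, rate = 1 / scale) - rexp(n, rate = 1 / scale)
}

private_count <- function(true_count, epsilon) {
  # sensitivity of a count query is 1, so the noise scale is 1/epsilon
  true_count + rlaplace(1, scale = 1 / epsilon)
}

set.seed(2016)
private_count(true_count = 100, epsilon = 0.5)   # a noisy answer near 100
```

Smaller epsilon means stronger privacy but noisier answers; the noisy count is unbiased, so averages over many releases stay close to the truth.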

Our R code and experiments are available on Github here, so you can try some experiments and variations yourself.

*Editor's Note: The R code includes an example of using vtreat, a package for preparing and cleaning data frames based on level-based feature pruning.*

by Joseph Rickert

In a recent post, I wrote about support vector machines, the representative master algorithm of the 5th tribe of machine learning practitioners described by Pedro Domingos in his book, The Master Algorithm. Here we look into algorithms favored by the first tribe, the symbolists, who see learning as the process of inverse deduction. Pedro writes:

Another limitation of inverse deduction is that it's very computationally intensive, which makes it hard to scale to massive data sets. For these, the symbolist algorithm of choice is decision tree induction. Decision trees can be viewed as an answer to the question of what to do if rules of more than one concept match an instance. (p85)

The de facto standard for decision trees, or “recursive partitioning” trees as they are known in the literature, is the CART algorithm by Breiman et al. (1984), implemented in R's rpart package. Stripped down to its essential structure, CART is a two stage algorithm. In the first stage, the algorithm conducts an exhaustive search over each variable to find the best split by maximizing an information criterion that will result in cells that are as pure as possible for one or the other of the class variables. In the second stage, a constant model is fit to each cell of the resulting partition. The algorithm then proceeds in a recursive “greedy” fashion, making splits and not looking back to see how things might have been before making the next split. Although hugely successful in practice, the algorithm has two vexing problems: (1) overfitting and (2) selection bias: the algorithm favors features with many possible splits^{1}. Overfitting occurs because the algorithm has “no concept of statistical significance”^{2}. While overfitting is usually handled with cross validation and pruning, there doesn't seem to be an easy way to deal with selection bias in the CART / rpart framework.
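To make the cross-validation-and-pruning remedy concrete, here is a minimal rpart sketch (my own toy example on the iris data, not from the post): grow a deliberately overgrown tree with cp = 0, then prune back using the complexity parameter with the lowest cross-validated error.

```r
library(rpart)

set.seed(1)
# Grow a deliberately overgrown tree (no complexity penalty, tiny minsplit)
full_tree <- rpart(Species ~ ., data = iris,
                   control = rpart.control(cp = 0, minsplit = 2))

# Pick the cp value with the lowest cross-validated error (xerror) and prune back
cp_best <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(full_tree, cp = cp_best)

# The pruned tree has no more splits than the overgrown one
nrow(full_tree$frame)
nrow(pruned$frame)
```

Pruning addresses overfitting, but note it does nothing about the selection bias problem, which is what motivates the conditional inference approach described next.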

To address these issues, Hothorn, Hornik and Zeileis introduced the party package into R about ten years ago, which provides an implementation of conditional inference trees (Unbiased Recursive Partitioning: A Conditional Inference Framework). Party's ctree() function separates the selection of variables for splitting and the splitting process itself into two different steps, and explicitly addresses selection bias by implementing statistical testing and a stopping procedure in the first step. Very roughly, the algorithm proceeds as follows:

- Each node of the tree is represented by a set of weights. Then, for each covariate vector X, the algorithm tests the null hypothesis that the dependent variable Y is independent of X. If the hypothesis cannot be rejected then the algorithm stops. Otherwise, the covariate with the strongest association with Y is selected for splitting.
- The algorithm performs a split and updates the weights describing the tree.
- Steps 1 and 2 are repeated recursively with the new parameter settings.

The details, along with enough theory to use the ctree algorithm with some confidence, are presented in the accessible vignette “party: A Laboratory for Recursive Partitioning”. The following example contrasts the ctree() and rpart() algorithms.

We begin by dividing the segmentationData data set that comes with the caret package into training and test sets and fitting a ctree() model to it using the default parameters. No attempt is made to optimize the model. Next, we use the model to predict values of the Class variable on the test data set and calculate the area under the ROC curve to be 0.8326.

# Script to compare ctree with rpart
library(party)
library(rpart)
library(caret)
library(pROC)

### Get the Data
# Load the data and construct indices to divide it into training and test data sets.
data(segmentationData)   # load the segmentation data set
data <- segmentationData[, 3:61]
data$Class <- ifelse(data$Class == "PS", 1, 0)

trainIndex <- createDataPartition(data$Class, p=.7, list=FALSE)
trainData <- data[trainIndex, ]
testData  <- data[-trainIndex, ]
#------------------------
set.seed(23)

# Fit Conditional Tree Model
ctree.fit <- ctree(Class ~ ., data=trainData)
ctree.fit
plot(ctree.fit, main="ctree Model")

# Make predictions using the test data set
ctree.pred <- predict(ctree.fit, testData)

# Draw the ROC curve
ctree.ROC <- roc(predictor = as.numeric(ctree.pred),
                 response = testData$Class)
ctree.ROC$auc   # Area under the curve: 0.8326
plot(ctree.ROC, main="ctree ROC")

Here are the text and graphical descriptions of the resulting tree.

1) FiberWidthCh1 <= 9.887543; criterion = 1, statistic = 383.388
  2) TotalIntenCh2 <= 42511; criterion = 1, statistic = 115.137
    3) TotalIntenCh1 <= 39428; criterion = 1, statistic = 20.295
      4)* weights = 504
    3) TotalIntenCh1 > 39428
      5)* weights = 9
  2) TotalIntenCh2 > 42511
    6) AvgIntenCh1 <= 199.2768; criterion = 1, statistic = 28.037
      7) IntenCoocASMCh3 <= 0.5188792; criterion = 0.99, statistic = 14.022
        8)* weights = 188
      7) IntenCoocASMCh3 > 0.5188792
        9)* weights = 7
    6) AvgIntenCh1 > 199.2768
      10)* weights = 36
1) FiberWidthCh1 > 9.887543
  11) ShapeP2ACh1 <= 1.227156; criterion = 1, statistic = 48.226
    12)* weights = 169
  11) ShapeP2ACh1 > 1.227156
    13) IntenCoocContrastCh3 <= 12.32349; criterion = 1, statistic = 22.349
      14) SkewIntenCh4 <= 1.148388; criterion = 0.998, statistic = 16.78
        15)* weights = 317
      14) SkewIntenCh4 > 1.148388
        16)* weights = 109
    13) IntenCoocContrastCh3 > 12.32349
      17) AvgIntenCh2 <= 244.9512; criterion = 0.999, statistic = 19.382
        18)* weights = 53
      17) AvgIntenCh2 > 244.9512
        19)* weights = 22

Next, we fit an rpart() model to the training data using the default parameter settings and calculate the AUC to be 0.8536 on the test data.

# Fit CART Model (default parameters, including cp = .01)
library(partykit)   # provides as.party() for plotting rpart trees
rpart.fit <- rpart(Class ~ ., data=trainData)
rpart.fit
plot(as.party(rpart.fit),main="rpart Model")

# Make predictions using the test data set
rpart.pred <- predict(rpart.fit,testData)

# Draw the ROC curve
rpart.ROC <- roc(predictor=as.numeric(rpart.pred),
                 response=testData$Class)
rpart.ROC$auc
# Area under the curve: 0.8536
plot(rpart.ROC)

The resulting pruned tree does better than ctree(), but at the expense of building a slightly deeper tree.

1) root 1414 325.211500 0.64144270
  2) TotalIntenCh2>=42606.5 792 191.635100 0.41035350
    4) FiberWidthCh1>=11.19756 447 85.897090 0.25950780
      8) ShapeP2ACh1< 1.225676 155 13.548390 0.09677419 *
      9) ShapeP2ACh1>=1.225676 292 66.065070 0.34589040
        18) SkewIntenCh4< 1.41772 254 53.259840 0.29921260
          36) TotalIntenCh4< 127285.5 214 40.373830 0.25233640
            72) EqEllipseOblateVolCh1>=383.1453 142 19.943660 0.16901410 *
            73) EqEllipseOblateVolCh1< 383.1453 72 17.500000 0.41666670
              146) AvgIntenCh1>=110.2253 40 6.400000 0.20000000 *
              147) AvgIntenCh1< 110.2253 32 6.875000 0.68750000 *
          37) TotalIntenCh4>=127285.5 40 9.900000 0.55000000 *
        19) SkewIntenCh4>=1.41772 38 8.552632 0.65789470 *
    5) FiberWidthCh1< 11.19756 345 82.388410 0.60579710
      10) KurtIntenCh1< -0.3447192 121 28.000000 0.36363640
        20) TotalIntenCh1>=13594 98 19.561220 0.27551020 *
        21) TotalIntenCh1< 13594 23 4.434783 0.73913040 *
      11) KurtIntenCh1>=-0.3447192 224 43.459820 0.73660710
        22) AvgIntenCh1>=454.3329 7 0.000000 0.00000000 *
        23) AvgIntenCh1< 454.3329 217 39.539170 0.76036870
          46) VarIntenCh4< 130.9745 141 31.333330 0.66666670
            92) NeighborAvgDistCh1>=256.5239 30 6.300000 0.30000000 *
            93) NeighborAvgDistCh1< 256.5239 111 19.909910 0.76576580 *
          47) VarIntenCh4>=130.9745 76 4.671053 0.93421050 *
  3) TotalIntenCh2< 42606.5 622 37.427650 0.93569130
    6) ShapeP2ACh1< 1.236261 11 2.545455 0.36363640 *
    7) ShapeP2ACh1>=1.236261 611 31.217680 0.94599020 *
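One way to make the size comparison concrete is to count the terminal nodes of each fitted tree. A minimal sketch, assuming the ctree.fit and rpart.fit objects created by the scripts above:

```r
# Terminal nodes in the rpart tree: leaves are the rows of the
# frame component whose splitting variable is "<leaf>"
rpart.leaves <- sum(rpart.fit$frame$var == "<leaf>")

# Terminal nodes in the ctree tree: where() returns the terminal
# node ID for each training observation
ctree.leaves <- length(unique(where(ctree.fit)))

rpart.leaves
ctree.leaves
```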

Note, however, that when the complexity parameter for rpart(), cp, is set to zero, rpart() builds a massive tree, a portion of which is shown below, and overfits the data, yielding an AUC of 0.806.

1) root 1414 325.2115000 0.64144270
  2) TotalIntenCh2>=42606.5 792 191.6351000 0.41035350
    4) FiberWidthCh1>=11.19756 447 85.8970900 0.25950780
      8) ShapeP2ACh1< 1.225676 155 13.5483900 0.09677419
        16) EntropyIntenCh1>=6.672119 133 7.5187970 0.06015038
          32) AngleCh1< 108.6438 82 0.0000000 0.00000000 *
          33) AngleCh1>=108.6438 51 6.7450980 0.15686270
            66) EqEllipseLWRCh1>=1.184478 26 0.9615385 0.03846154
              132) DiffIntenDensityCh3>=26.47004 19 0.0000000 0.00000000 *
              133) DiffIntenDensityCh3< 26.47004 7 0.8571429 0.14285710 *
            67) EqEllipseLWRCh1< 1.184478 25 5.0400000 0.28000000
              134) IntenCoocContrastCh3>=9.637027 9 0.0000000 0.00000000 *
              135) IntenCoocContrastCh3< 9.637027 16 3.9375000 0.43750000 *
        17) EntropyIntenCh1< 6.672119 22 4.7727270 0.31818180
          34) ShapeBFRCh1>=0.6778205 13 0.0000000 0.00000000 *
          35) ShapeBFRCh1< 0.6778205 9 1.5555560 0.77777780 *
      9) ShapeP2ACh1>=1.225676 292 66.0650700 0.34589040
        18) SkewIntenCh4< 1.41772 254 53.2598400 0.29921260
          36) TotalIntenCh4< 127285.5 214 40.3738300 0.25233640
            72) EqEllipseOblateVolCh1>=383.1453 142 19.9436600 0.16901410
              144) IntenCoocEntropyCh3< 7.059374 133 16.2857100 0.14285710
                288) NeighborMinDistCh1>=21.91001 116 11.5431000 0.11206900
                  576) NeighborAvgDistCh1>=170.2248 108 8.2500000 0.08333333
                    1152) FiberAlign2Ch4< 1.481728 68 0.9852941 0.01470588
                      2304) XCentroid>=100.5 61 0.0000000 0.00000000 *
                      2305) XCentroid< 100.5 7 0.8571429 0.14285710 *
                    1153) FiberAlign2Ch4>=1.481728 40 6.4000000 0.20000000
                      2306) SkewIntenCh1< 0.9963465 27 1.8518520 0.07407407
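The overgrown tree above comes from refitting with the complexity parameter turned off. A sketch of that fit, assuming the trainData and testData sets constructed earlier:

```r
# Refit with cp = 0 so rpart() never stops a split for lack of
# improvement in the fit; the result is a very deep tree
rpart0.fit <- rpart(Class ~ ., data=trainData, cp=0)
rpart0.fit   # prints the full tree, a portion of which is shown above

# The unpruned tree fits the training data closely but generalizes worse
rpart0.pred <- predict(rpart0.fit, testData)
rpart0.ROC <- roc(predictor=as.numeric(rpart0.pred),
                  response=testData$Class)
rpart0.ROC$auc
# Area under the curve: 0.806
```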

In practice, rpart()'s complexity parameter (default value cp = .01) is effective in controlling tree growth and overfitting. It does, however, have an "ad hoc" feel to it. In contrast, the ctree() algorithm implements tests for statistical significance within the process of growing a decision tree. It automatically curtails excessive growth, inherently addresses both overfitting and bias, and offers the promise of achieving good models with less computation.
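In party's ctree(), that stopping rule is exposed through the mincriterion argument of ctree_control(): a node is split only when 1 minus the p-value of the association test exceeds the threshold (0.95 by default). A sketch of tightening and loosening it, assuming the trainData set from above:

```r
# Require 99% confidence for each split: a smaller, more conservative tree
ctree99.fit <- ctree(Class ~ ., data=trainData,
                     controls = ctree_control(mincriterion = 0.99))

# Relax the threshold to 90%: more splits are accepted, growing a larger tree
ctree90.fit <- ctree(Class ~ ., data=trainData,
                     controls = ctree_control(mincriterion = 0.90))
```

Because growth is governed by a significance test rather than a resubstitution error threshold, there is no analogue of rpart()'s prune-after-the-fact step.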

Finally, note that rpart() and ctree() construct different trees that offer about the same performance. Some practitioners who value decision trees for their interpretability find this disconcerting. End users of machine learning models often want a story that tells them something true about their customers' behavior, buying preferences, etc. But the likelihood of there being multiple satisfactory answers to a complex problem is inherent to the process of inverse deduction. As Hothorn et al. comment:

Since a key reason for the popularity of tree based methods stems from their ability to represent the estimated regression relationship in an intuitive way, interpretations drawn from regression trees must be taken with a grain of salt.

1. Hothorn, T., Hornik, K., and Zeileis, A. (2006). "Unbiased Recursive Partitioning: A Conditional Inference Framework." Journal of Computational and Graphical Statistics, 15(3), September 2006.

2. Mingers, J. (1987). "Expert Systems - Rule Induction with Statistical Data." Journal of the Operational Research Society, 38.