by Joseph Rickert

In his new book, The Master Algorithm, Pedro Domingos takes on the heroic task of explaining machine learning to a wide audience and classifies machine learning practitioners into 5 tribes*, each with its own fundamental approach to learning problems. To the 5th tribe, the analogizers, Pedro ascribes the Support Vector Machine (SVM) as it's master algorithm. Although the SVM has been a competitive and popular algorithm since its discovery in the 1990's this might be the breakout moment for SVMs into pop culture. (What algorithm has a cooler name?) More people than every will want to give them a try and face the challenge of tuning them. Fortunately, there is plenty of help out there and some good tools for the task.

In A Practical Guide to Support Vector Classification, which is aimed at beginners Hsu et al. suggest the following methodology:

- Transform data to the format of an SVM package
- Conduct simple scaling on the data
- Consider the RBF kernel K(x, y) = exp (−γ||x−y||
^{2}) - Use cross-validation to find the best parameter C and γ
- Use the best parameter C and γ to train the whole training set5
- Test

In this post, we present a variation of the methodology using R and the caret package. First, we set up for an analysis, loading the segmentation data set from the caret package and using the caret's createDataPartition() function to produce training and test data sets.

# Training SVM Models library(caret) library(dplyr) # Used by caret library(kernlab) # support vector machine library(pROC) # plot the ROC curves ### Get the Data # Load the data and construct indices to divide it into training and test data sets. data(segmentationData) # Load the segmentation data set trainIndex <- createDataPartition(segmentationData$Case,p=.5,list=FALSE) trainData <- segmentationData[trainIndex,] testData <- segmentationData[-trainIndex,] trainX <-trainData[,4:61] # Pull out the variables for training sapply(trainX,summary) # Look at a summary of the training data

Next, we carry out a two pass training and tuning process. In the first pass, shown in the code block below, we arbitrarily pick some tuning parameters and use the default caret settings for others. In the trainControl() function we specify 5 repetitions of 10 fold cross validation. in the train() function which actually does the work, we specify the radial kernel using the method parameter and the ROC as the metric for assessing performance. The tuneLength parameter is set to pick 9 arbitrary values for the C, the "cost" of the radial kernel. This parameter controls the complexity of the boundary between support vectors. The radial kernel also requires setting a smoothing parameter, sigma. In this first, pass we let train() use its default method of calculating an analytically derived estimate for sigma. Also note that we instruct train() to center and scale the data before running the analysis with the preProc parameter.

## SUPPORT VECTOR MACHINE MODEL # First pass set.seed(1492) # Setup for cross validation ctrl <- trainControl(method="repeatedcv", # 10fold cross validation repeats=5, # do 5 repititions of cv summaryFunction=twoClassSummary, # Use AUC to pick the best model classProbs=TRUE) #Train and Tune the SVM svm.tune <- train(x=trainX, y= trainData$Class, method = "svmRadial", # Radial kernel tuneLength = 9, # 9 values of the cost function preProc = c("center","scale"), # Center and scale data metric="ROC", trControl=ctrl) svm.tune # Support Vector Machines with Radial Basis Function Kernel # # 1010 samples # 58 predictor # 2 classes: 'PS', 'WS' # # Pre-processing: centered, scaled # Resampling: Cross-Validated (10 fold, repeated 5 times) # Summary of sample sizes: 908, 909, 909, 909, 909, 909, ... # Resampling results across tuning parameters: # # C ROC Sens Spec ROC SD Sens SD Spec SD # 0.25 0.8695054 0.8540559 0.6690476 0.03720951 0.04389913 0.07282584 # 0.50 0.8724147 0.8592774 0.6912857 0.03618794 0.04234003 0.07830249 # 1.00 0.8746137 0.8718648 0.6968254 0.03418700 0.04204607 0.06918850 # 2.00 0.8709825 0.8755478 0.6969048 0.03345607 0.03927223 0.06714838 # 4.00 0.8609396 0.8795478 0.6702063 0.03437846 0.04189803 0.06597494 # 8.00 0.8456799 0.8703357 0.6310635 0.03610988 0.04105803 0.07540066 # 16.00 0.8293339 0.8666667 0.5943492 0.03717344 0.04773906 0.08006023 # 32.00 0.8220839 0.8636131 0.5759365 0.03622665 0.04531028 0.07587914 # 64.00 0.8123889 0.8605315 0.5541746 0.03795353 0.04494173 0.07140262 # # Tuning parameter 'sigma' was held constant at a value of 0.01521335 # ROC was used to select the optimal model using the largest value. # The final values used for the model were sigma = 0.01521335 and C = 1.

The results show that the best model resulted from setting .

In the second pass, having seen the parameter values selected in the first pass, we use train()'s tuneGrid parameter to do some sensitivity analysis around the values C = 1 and sigma = 0.015 that produced the model with the best ROC value. Note that R's expand.grid() function is used to build a dataframe contain all the combinations of C and sigma we want to look at.

# Second pass # Look at the results of svm.tune and refine the parameter space set.seed(1492) # Use the expand.grid to specify the search space grid <- expand.grid(sigma = c(.01, .015, 0.2), C = c(0.75, 0.9, 1, 1.1, 1.25) ) #Train and Tune the SVM svm.tune <- train(x=trainX, y= trainData$Class, method = "svmRadial", preProc = c("center","scale"), metric="ROC", tuneGrid = grid, trControl=ctrl) svm.tune # Support Vector Machines with Radial Basis Function Kernel # # 1010 samples # 58 predictor # 2 classes: 'PS', 'WS' # # Pre-processing: centered, scaled # Resampling: Cross-Validated (10 fold, repeated 5 times) # Summary of sample sizes: 909, 909, 908, 910, 909, 909, ... # Resampling results across tuning parameters: # # sigma C ROC Sens Spec ROC SD Sens SD Spec SD # 0.010 0.75 0.8727381 0.8614685 0.6817619 0.04096223 0.04183900 0.07664910 # 0.010 0.90 0.8742107 0.8633193 0.6878889 0.04066995 0.04037202 0.07817537 # 0.010 1.00 0.8748389 0.8630023 0.6873016 0.04079094 0.04061032 0.08189960 # 0.010 1.10 0.8747998 0.8642378 0.6884444 0.04076756 0.04004827 0.07892234 # 0.010 1.25 0.8749384 0.8657762 0.6923492 0.04083294 0.03911751 0.08070616 # 0.015 0.75 0.8726557 0.8660793 0.6923333 0.04171842 0.04324822 0.08203598 # 0.015 0.90 0.8727037 0.8688531 0.6945714 0.04164810 0.04082448 0.08379649 # 0.015 1.00 0.8727079 0.8713147 0.6906667 0.04184851 0.04273855 0.08174494 # 0.015 1.10 0.8724013 0.8719301 0.6895556 0.04197524 0.04108930 0.08377854 # 0.015 1.25 0.8721560 0.8722331 0.6900952 0.04207802 0.04096501 0.08639355 # 0.200 0.75 0.8193497 0.8863263 0.4478413 0.04695531 0.04072159 0.08061632 # 0.200 0.90 0.8195377 0.8903263 0.4405397 0.04688797 0.04091728 0.07844983 # 0.200 1.00 0.8193307 0.8915478 0.4361111 0.04719399 0.04004779 0.07815045 # 0.200 1.10 0.8195696 0.8958508 0.4333651 0.04694670 0.04026003 0.08252021 # 0.200 1.25 0.8198250 0.8983077 0.4271905 0.04705685 0.03900879 0.07945602 # # ROC was used to select the optimal model using the largest value. # The final values used for the model were sigma = 0.01 and C = 1.25.

This was quite a bit of calculation for an improvement of 0.0003247 in the ROC score, but it shows off some of what caret can do.

To finish up, we have build a model with a different kernel. The linear kernel is the simplest way to go. There is only the C parameter to set for this kernel and train() hardwires in a value of C = 1. The resulting ROC value of 0.87 is not too shabby.

#Linear Kernel set.seed(1492) #Train and Tune the SVM svm.tune2 <- train(x=trainX, y= trainData$Class, method = "svmLinear", preProc = c("center","scale"), metric="ROC", trControl=ctrl) svm.tune2 # > svm.tune2 # Support Vector Machines with Linear Kernel # # 1010 samples # 58 predictor # 2 classes: 'PS', 'WS' # # Pre-processing: centered, scaled # Resampling: Cross-Validated (10 fold, repeated 5 times) # Summary of sample sizes: 909, 909, 908, 910, 909, 909, ... # Resampling results # # ROC Sens Spec ROC SD Sens SD Spec SD # 0.8654818 0.8774312 0.6327302 0.03975929 0.04210438 0.0824907 # # Tuning parameter 'C' was held constant at a value of 1

Because, I took the trouble to set the seed for the pseudorandom number generator to the same value before each of the resampling operations above we can use caret's resample() function to compare the results generated by the radial kernel and linear kernel models. The following block of code and results shows just thee first five lines of the comparison table but includes the summary of the comparison.

rValues <- resamples(list(svm=svm.tune,svm.tune2)) rValues$values # Resample svm~ROC svm~Sens svm~Spec Model2~ROC Model2~Sens Model2~Spec # 1 Fold01.Rep1 0.8995726 0.8153846 0.7777778 0.8914530 0.8461538 0.6666667 # 2 Fold01.Rep2 0.9452991 0.9076923 0.7500000 0.9346154 0.9384615 0.6666667 # 3 Fold01.Rep3 0.8153846 0.8153846 0.6111111 0.8115385 0.8153846 0.5555556 # 4 Fold01.Rep4 0.9115385 0.9076923 0.6944444 0.8918803 0.9076923 0.4722222 # 5 Fold01.Rep5 0.9017094 0.8615385 0.6944444 0.8645299 0.8923077 0.6666667 summary(rValues) # Call: # summary.resamples(object = rValues) # # Models: svm, Model2 # Number of resamples: 50 # # ROC # Min. 1st Qu. Median Mean 3rd Qu. Max. NA's # svm 0.7910 0.8503 0.8785 0.8749 0.9013 0.9620 0 # Model2 0.7789 0.8399 0.8609 0.8655 0.8918 0.9457 0 # # Sens # Min. 1st Qu. Median Mean 3rd Qu. Max. NA's # svm 0.7538 0.8346 0.8769 0.8658 0.8935 0.9385 0 # Model2 0.7692 0.8462 0.8923 0.8774 0.9077 0.9538 0 # # Spec # Min. 1st Qu. Median Mean 3rd Qu. Max. NA's # svm 0.4286 0.6312 0.6944 0.6923 0.7500 0.8333 0 # Model2 0.4722 0.5556 0.6389 0.6327 0.6667 0.8333 0 bwplot(rValues,metric="ROC",ylab =c("linear kernel", "radial kernel")) # boxplot

The boxplot below would appear to give the radial kernel the edge on being the better model, but of course, we are far from done. We will leave the last step of Hsu et al.'s methodology, testing the models on held-out testing data, for another day.

To go further, start with the resources on the caret website. Have a look at the list of 206 models that can be used with the train() function. (This is an astounding number of models to be put into a single operational framework.) Then, for some very readable background material on SVMs I recommend section 13.4 of Applied Predictive modeling and sections 9.3 and 9.4 of Practical Data Science with R by Nina Zumel and John Mount. You will be hard pressed to find an introduction to kernel methods and SVMs that is as clear and useful as this last reference. Finally, I'm not quite finished but Pedro's book is a pretty good read.

* The 5 tribes and their master algorithms:

- symbolists - inverse deduction
- connectionists - backpropagation
- evolutionists - genetic programming
- Bayesians - Bayesian inference
- analogizers - the support vector machine

Nice article. Were you planning on writing about all of the tribes? Feel free to ignore the request for more work from you.

Posted by: Tom Haas | October 17, 2015 at 10:46

I believe there is a method with a cooler name than SVM:

"Dynamic Time Warping" ;-)

https://en.wikipedia.org/wiki/Dynamic_time_warping

Cheers,

Andrej

Posted by: anspiess | October 18, 2015 at 23:26

Thanks for putting this book on my radar Joseph,

I was inspired by your review to order it.

Keep up the good blogging!

"~/Justin"

CRUG member

Posted by: Justin | November 04, 2015 at 10:22