by Joseph Rickert

Because of its simplicity and good performance over a wide spectrum of classification problems, the Naïve Bayes classifier ought to be on everyone's short list of machine learning algorithms. Now, with version 7.4, we have a high-performance Naïve Bayes classifier in Revolution R Enterprise too. Like all Parallel External Memory Algorithms (PEMAs) in the RevoScaleR package, rxNaiveBayes is an inherently parallel algorithm that may be distributed across Microsoft HPC, Linux and Hadoop clusters and may be run on data in Teradata databases.
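To make the "external memory" idea concrete, here is a minimal base R sketch of how the per-class means and standard deviations that Naïve Bayes needs can be accumulated one chunk at a time. This is an illustration under assumptions, not the RevoScaleR implementation: the helper names update_stats() and finalize() are hypothetical, and the data are simulated.

```r
# Accumulate per-class sufficient statistics (count, sum, sum of squares)
# one chunk at a time, so the full data never has to sit in memory.
update_stats <- function(stats, x, class) {
  for (k in unique(class)) {
    xk  <- x[class == k]
    old <- if (is.null(stats[[k]])) c(n = 0, s = 0, ss = 0) else stats[[k]]
    stats[[k]] <- old + c(n = length(xk), s = sum(xk), ss = sum(xk^2))
  }
  stats
}

# Turn the accumulated statistics into a mean and a sample standard deviation
finalize <- function(st) {
  m <- st[["s"]] / st[["n"]]
  c(mean = m, sd = sqrt((st[["ss"]] - st[["n"]] * m^2) / (st[["n"]] - 1)))
}

# Simulated data standing in for a numeric predictor and a binary target
set.seed(1)
x     <- rnorm(1000, mean = 700, sd = 50)
class <- sample(c("0", "1"), 1000, replace = TRUE)

# Process the data in four chunks of 250 rows each
chunks <- split(seq_along(x), ceiling(seq_along(x) / 250))
stats  <- list()
for (idx in chunks) {
  stats <- update_stats(stats, x[idx], class[idx])
}

chunked <- finalize(stats[["0"]])
direct  <- c(mean = mean(x[class == "0"]), sd = sd(x[class == "0"]))
```

Because the statistics for separate chunks simply add, chunks could in principle be processed on different cores or nodes and combined at the end.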

The following example shows how to get started with rxNaiveBayes() on a moderately sized data set in your local environment. It uses the Mortgage data set, which may be downloaded from the Revolution Analytics data set repository. The first block of code imports the .csv files for the years 2000 through 2008 and concatenates them into a single training file in the .XDF binary format. Then, the data for the year 2009 is imported to a test file that will be used for making predictions.

#-----------------------------------------------
# Set up the data location information
bigDataDir <- "C:/Data/Mortgage"
mortCsvDataName <- file.path(bigDataDir,"mortDefault")
trainingDataFileName <- "mortDefaultTraining"
mortCsv2009 <- paste(mortCsvDataName, "2009.csv", sep = "")
targetDataFileName <- "mortDefault2009.xdf"
#---------------------------------------
# Import the data from multiple .csv files into 2 .XDF files
# One file, the training file containing data from the years
# 2000 through 2008.
# The other file, the test file, containing data from the year 2009.
defaultLevels <- as.character(c(0,1))
ageLevels <- as.character(c(0:40))
yearLevels <- as.character(c(2000:2009))
colInfo <- list(list(name = "default", type = "factor", levels = defaultLevels),
                list(name = "houseAge", type = "factor", levels = ageLevels),
                list(name = "year", type = "factor", levels = yearLevels))
append= FALSE
for (i in 2000:2008) {
  importFile <- paste(mortCsvDataName, i, ".csv", sep = "")
  rxImport(inData = importFile, outFile = trainingDataFileName,
           colInfo = colInfo, append = append, overwrite=TRUE)
  append = TRUE
}

The rxGetInfo() command shows that the training file has 9 million observations with 6 variables and the test file contains 1 million observations. The binary factor variable, default, which indicates whether or not an individual defaulted on the mortgage, will be the target variable in the classification exercise.

rxGetInfo(trainingDataFileName, getVarInfo=TRUE)
#File name: C:\Users\jrickert\Documents\Revolution\NaiveBayes\mortDefaultTraining.xdf
#Number of observations: 9e+06
#Number of variables: 6
#Number of blocks: 18
#Compression type: zlib
#Variable information:
#Var 1: creditScore, Type: integer, Low/High: (432, 955)
#Var 2: houseAge
#       41 factor levels: 0 1 2 3 4 ... 36 37 38 39 40
#Var 3: yearsEmploy, Type: integer, Low/High: (0, 15)
#Var 4: ccDebt, Type: integer, Low/High: (0, 15566)
#Var 5: year
#       10 factor levels: 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
#Var 6: default
#       2 factor levels: 0 1

rxImport(inData = mortCsv2009, outFile = targetDataFileName, colInfo = colInfo)
rxGetInfo(targetDataFileName)
#> rxGetInfo(targetDataFileName)
#File name: C:\Users\jrickert\Documents\Revolution\NaiveBayes\mortDefault2009.xdf
#Number of observations: 1e+06
#Number of variables: 6
#Number of blocks: 2
#Compression type: zlib

Next, the rxNaiveBayes() function is used to fit a classification model with default as the target variable and year, credit score, years employed and credit card debt as predictors. Note that the smoothingFactor parameter instructs the classifier to perform Laplace smoothing. (Since the conditional probabilities are multiplied together in the model, adding a small number to zero probabilities prevents missing categories from wiping out the calculation.) Also note that it took about 1.9 seconds to fit the model on my modest Lenovo Thinkpad, which is powered by an Intel i7 5600U processor and equipped with 8GB of RAM.
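The effect of smoothing can be seen in a few lines of base R. The sketch below uses the textbook additive (Laplace) formula, (count + smoothing) / (total + smoothing × number of levels); that rxNaiveBayes() uses exactly this formula is an assumption, and the counts, the helper smoothed_probs() and the value of other_likelihood are made up for illustration.

```r
# Hypothetical counts of a factor predictor within one class;
# level "c" never occurs in the training data
counts <- c(a = 30, b = 70, c = 0)

# Conditional probabilities with additive (Laplace) smoothing
smoothed_probs <- function(counts, smoothing = 1) {
  (counts + smoothing) / (sum(counts) + smoothing * length(counts))
}

p_raw    <- smoothed_probs(counts, smoothing = 0)  # unsmoothed: contains a zero
p_smooth <- smoothed_probs(counts, smoothing = 1)  # every level gets some mass

# Multiply in the likelihood contributed by some other predictor
other_likelihood <- 0.4
joint_raw    <- p_raw[["c"]] * other_likelihood     # zero, whatever the other evidence says
joint_smooth <- p_smooth[["c"]] * other_likelihood  # small but positive
```

With smoothing, an unseen level lowers the class score sharply without forcing it to exactly zero, so the remaining predictors still have a say.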

# Build the classifier on the training data
mortNB <- rxNaiveBayes(default ~ year + creditScore + yearsEmploy + ccDebt,
                       data = trainingDataFileName, smoothingFactor = 1)
#Rows Read: 500000, Total Rows Processed: 8500000, Total Chunk Time: 0.110 seconds
#Rows Read: 500000, Total Rows Processed: 9000000, Total Chunk Time: 0.125 seconds
#Computation time: 1.875 seconds.

Looking at the model object, we see that conditional probabilities are calculated for each factor (categorical) variable, while means and standard deviations are calculated for the numeric variables. rxNaiveBayes() follows the standard practice of assuming that numeric variables follow Gaussian distributions within each class.

#> mortNB
#
#Naive Bayes Classifier
#
#Call:
#rxNaiveBayes(formula = default ~ year + creditScore + yearsEmploy +
#    ccDebt, data = trainingDataFileName, smoothingFactor = 1)
#
#A priori probabilities:
#default
#          0           1
#0.997242889 0.002757111
#
#Predictor types:
#     Variable    Type
#1        year  factor
#2 creditScore numeric
#3 yearsEmploy numeric
#4      ccDebt numeric
#
#Conditional probabilities:
#$year
#       year
#default         2000         2001         2002         2003         2004
#      0 1.113034e-01 1.110692e-01 1.112866e-01 1.113183e-01 1.113589e-01
#      1 4.157267e-02 1.262488e-01 4.765549e-02 3.617467e-02 2.151144e-02
#       year
#default         2005         2006         2007         2008         2009
#      0 1.113663e-01 1.113403e-01 1.111888e-01 1.097681e-01 1.114182e-07
#      1 1.885272e-02 2.823880e-02 8.302449e-02 5.966806e-01 4.028360e-05
#
#$creditScore
#     Means   StdDev
#0 700.0839 50.00289
#1 686.5243 49.71074
#
#$yearsEmploy
#     Means   StdDev
#0 5.006873 2.009446
#1 4.133030 1.969213
#
#$ccDebt
#     Means   StdDev
#0 4991.582 1976.716
#1 9349.423 1459.797
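To see how these numbers combine into a prediction, the following base R sketch applies the standard Gaussian Naïve Bayes formula to the parameters printed above: prior × P(year | class) × the product of the normal densities. It is a hand calculation under the assumption that rxPredict() does something equivalent, not a description of its internals. The function nb_posterior() and the two applicants are hypothetical. Only the 2009 column of the year table is used, since every row of the test set is from 2009; because that year never appears in the training data, its conditional probabilities come entirely from the smoothing term.

```r
# Parameters copied from the printed model (class "0" first, then class "1")
prior      <- c("0" = 0.997242889, "1" = 0.002757111)
p_year2009 <- c("0" = 1.114182e-07, "1" = 4.028360e-05)
gauss <- list(
  creditScore = list(mean = c(700.0839, 686.5243), sd = c(50.00289, 49.71074)),
  yearsEmploy = list(mean = c(5.006873, 4.133030), sd = c(2.009446, 1.969213)),
  ccDebt      = list(mean = c(4991.582, 9349.423), sd = c(1976.716, 1459.797))
)

# Posterior class probabilities for one 2009 applicant (hypothetical helper)
nb_posterior <- function(creditScore, yearsEmploy, ccDebt) {
  x <- list(creditScore = creditScore, yearsEmploy = yearsEmploy, ccDebt = ccDebt)
  # Work on the log scale so that tiny probabilities do not underflow
  log_post <- log(prior) + log(p_year2009)
  for (v in names(gauss)) {
    log_post <- log_post + dnorm(x[[v]], gauss[[v]]$mean, gauss[[v]]$sd, log = TRUE)
  }
  post <- exp(log_post - max(log_post))
  post / sum(post)
}

typical <- nb_posterior(creditScore = 700, yearsEmploy = 5, ccDebt = 5000)
risky   <- nb_posterior(creditScore = 650, yearsEmploy = 2, ccDebt = 11000)
```

The first applicant sits near the class-0 means on every predictor, while the second has the high credit card debt that the model associates with default, so the posterior probability of default is much larger for the second.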

Next, we use the rxPredict() function to predict default values for the test data set. Setting the type = "prob" parameter produced the table of probabilities below. Using the default for type would have produced only the default_Pred column of forecasts. In a multi-value forecast, the probability table would contain entries for all possible values.

# Use the model to predict whether a loan will default on the test data
mortNBPred <- rxPredict(mortNB, data = targetDataFileName, type="prob")
#Rows Read: 500000, Total Rows Processed: 500000, Total Chunk Time: 3.876 seconds
#Rows Read: 500000, Total Rows Processed: 1000000, Total Chunk Time: 2.280 seconds

names(mortNBPred) <- c("prob_0","prob_1")
mortNBPred$default_Pred <- as.factor(round(mortNBPred$prob_1))
#head(mortNBPred)
#     prob_0      prob_1 default_Pred
#1 0.9968860 0.003114038            0
#2 0.9569425 0.043057472            0
#3 0.5725627 0.427437291            0
#4 0.9989603 0.001039729            0
#5 0.7372746 0.262725382            0
#6 0.4142266 0.585773432            1
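Rounding prob_1 amounts to a 0.5 cutoff, which only works for a two-class target. For a multi-value target, one common approach is to pick the class whose column holds the largest probability. The sketch below shows both in base R; the probability table and the class names A, B and C are made up for illustration.

```r
# Binary case: rounding is the same as thresholding at 0.5
prob_1     <- c(0.003114038, 0.427437291, 0.585773432)
by_round   <- round(prob_1)
by_cutoff  <- as.numeric(prob_1 > 0.5)

# Multi-class case: hypothetical probability table, one column per class
probs <- data.frame(
  prob_A = c(0.70, 0.20, 0.10),
  prob_B = c(0.20, 0.50, 0.30),
  prob_C = c(0.10, 0.30, 0.60)
)

# Recover the class labels from the column names and take the row-wise maximum
class_labels <- sub("^prob_", "", names(probs))
best_column  <- max.col(as.matrix(probs), ties.method = "first")
pred         <- factor(class_labels[best_column], levels = class_labels)
```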

In this next step, we tabulate the actual vs. predicted values for the test data set to produce the "confusion matrix" and an estimate of the misclassification rate.

# Tabulate the actual and predicted values
actual_value <- rxDataStep(targetDataFileName,maxRowsByCols=6000000)[["default"]]
predicted_value <- mortNBPred[["default_Pred"]]
results <- table(predicted_value,actual_value)
#> results
#               actual_value
#predicted_value      0      1
#              0 877272   3792
#              1  97987  20949

# Elements 2 and 3 of the table (column-major order) are the off-diagonal,
# i.e. misclassified, cells
pctMisclassified <- sum(results[2:3])/sum(results)*100
pctMisclassified
#[1] 10.1779

Since the results object produced above is an ordinary table, we can use the confusionMatrix() function from the caret package to produce additional performance measures.

# Use confusionMatrix from the caret package to look at the results
library(caret)
library(e1071)
confusionMatrix(results,positive="1")
#Confusion Matrix and Statistics
#
#               actual_value
#predicted_value      0      1
#              0 877272   3792
#              1  97987  20949
#
#               Accuracy : 0.8982
#                 95% CI : (0.8976, 0.8988)
#    No Information Rate : 0.9753
#    P-Value [Acc > NIR] : 1
#
#                  Kappa : NA
# Mcnemar's Test P-Value : <2e-16
#
#            Sensitivity : 0.84673
#            Specificity : 0.89953
#         Pos Pred Value : 0.17614
#         Neg Pred Value : 0.99570
#             Prevalence : 0.02474
#         Detection Rate : 0.02095
#   Detection Prevalence : 0.11894
#      Balanced Accuracy : 0.87313
#
#       'Positive' Class : 1
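The headline measures can be checked by hand from the four cells of the table. The following base R sketch rebuilds the confusion table from the counts printed above and applies the usual definitions, with "1" (default) as the positive class; the variable names are my own.

```r
# Confusion table copied from the output (rows = predicted, columns = actual)
results <- matrix(c(877272, 97987, 3792, 20949), nrow = 2,
                  dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

tp <- results["1", "1"]  # predicted default, actually defaulted
fn <- results["0", "1"]  # predicted no default, actually defaulted
fp <- results["1", "0"]  # predicted default, did not default
tn <- results["0", "0"]  # predicted no default, did not default

accuracy       <- (tp + tn) / sum(results)
sensitivity    <- tp / (tp + fn)   # share of actual defaults that were caught
specificity    <- tn / (tn + fp)   # share of non-defaults correctly passed
pos_pred_value <- tp / (tp + fp)   # share of predicted defaults that were real
balanced_acc   <- (sensitivity + specificity) / 2
```

These reproduce the values reported by confusionMatrix() and make the trade-off visible: the classifier catches about 85% of the defaults, but only about 18% of the loans it flags actually default.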

Finally, we use the hist() function to look at a histogram (not shown) of the actual values to get a feel for how unbalanced the data set is, and then use the rxRocCurve() function to produce the ROC curve.

roc_data <- data.frame(mortNBPred$prob_1,as.integer(actual_value)-1)
names(roc_data) <- c("predicted_value","actual_value")
head(roc_data)
hist(roc_data$actual_value)
rxRocCurve("actual_value","predicted_value",roc_data,
           title="ROC Curve for Naive Bayes Mortgage Defaults Model")
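For readers without RevoScaleR at hand, the idea behind an ROC curve can be sketched in base R: sweep a cutoff over the predicted probabilities and record the true and false positive rates at each step. The helpers roc_points() and auc_trapezoid() are hypothetical, and the scores are simulated rather than taken from the mortgage model.

```r
# True and false positive rates over a grid of probability cutoffs
roc_points <- function(score, actual, thresholds = rev(seq(0, 1, length.out = 101))) {
  tpr <- sapply(thresholds, function(t) mean(score[actual == 1] >= t))
  fpr <- sapply(thresholds, function(t) mean(score[actual == 0] >= t))
  data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
}

# Area under the curve by the trapezoid rule
auc_trapezoid <- function(roc) {
  o <- order(roc$fpr, roc$tpr)
  x <- roc$fpr[o]
  y <- roc$tpr[o]
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

# Simulated unbalanced data: roughly 10% positives, with scores that tend
# to be higher for the positive class
set.seed(42)
actual <- rbinom(1000, 1, 0.1)
score  <- plogis(rnorm(1000, mean = ifelse(actual == 1, 1, -1)))

roc <- roc_points(score, actual)
auc <- auc_trapezoid(roc)
# plot(roc$fpr, roc$tpr, type = "l", xlab = "False positive rate",
#      ylab = "True positive rate")
```

A curve that hugs the upper-left corner, with an area close to 1, indicates that some cutoff separates the two classes well.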

Here we have a "picture-perfect" representation of how one hopes a classifier will perform.

For more on the Naïve Bayes classification algorithm, have a look at these two papers referenced in the Wikipedia link above.

The first is a prescient 1961 paper by Marvin Minsky that explicitly calls attention to the naïve independence assumption. The second paper provides some theoretical arguments for why the overall excellent performance of the Naïve Bayes classifier is not accidental.