Revolution R Open, the enhanced open source R distribution from Revolution Analytics and Microsoft, is now available for download. This update brings multi-threaded performance to the latest version of the R engine from the R Core Group, which includes several improvements and bug fixes. Significant amongst these is default support for HTTPS connections, making it easy to follow the best practices for R security recommended by the R Consortium.

Revolution R Open uses a CRAN snapshot taken on August 27, 2015; you can review highlights of new and updated R packages introduced since RRO 3.2.1 at our Package Spotlight page. Don't forget you can always use packages released after this date using the built-in checkpoint package or by using any CRAN mirror. (By the way, the checkpoint package also now downloads packages using HTTPS.)
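Beyond the checkpoint package, the fixed snapshot can also be used directly as a package repository. A minimal sketch, assuming MRAN's dated-snapshot URL convention (adjust the date to pin a different day):

```r
# Sketch: point install.packages() at the dated MRAN snapshot used by
# this release, via MRAN's snapshot URL convention. The date shown is
# this release's snapshot; substitute another date to pin a different day.
snapshot <- "2015-08-27"
options(repos = c(CRAN = paste0("https://mran.revolutionanalytics.com/snapshot/", snapshot)))
getOption("repos")[["CRAN"]]
# "https://mran.revolutionanalytics.com/snapshot/2015-08-27"
```

With the repository pinned this way, `install.packages()` calls resolve against the frozen snapshot rather than a moving CRAN mirror.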

RRO 3.2.2 is available for download now at the link below for Windows, Mac and Linux systems, and is 100% compatible with R 3.2.2. We welcome comments, suggestions and other discussion on the RRO Forums. If you're new to Revolution R Open, here are some tips to get started, and there are many data sources you can explore with RRO. Thanks go as always to the contributors to the R Project upon which RRO is built.

Everyone needs to be vigilant about security on the Web today. One particular threat — the man-in-the-middle attack — is a risk anytime you are communicating over the Internet, and an attacker has access to the network between the two endpoints. This is a possibility whenever you are using the Web over an unencrypted channel, or when using an unsecured Wi-Fi access point (to name just two examples). The attacker could eavesdrop on your communications, or even alter or substitute your data.

This is a possible vector for inserting malware on your machine: if you download a program to your computer over an unsecured channel, an attacker could substitute that program with one that includes a malicious payload. When downloading software over the web, it's always a good idea to make sure you're using an encrypted connection, from a website URL beginning with https:// (and not just http).

This applies to all software you download over the internet, and R is no exception. R packages should also be treated in the same manner, since packages also include executable code. To this end, the R Consortium has published a useful guide regarding best practices for using R securely. In short, you should always download R from a secure server, verify the MD5 checksums, and download R packages from a secure server.
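Checksum verification needs nothing beyond base R. A self-contained sketch using `tools::md5sum()`; the demo file and its checksum here are stand-ins, so in practice you would pass the installer you downloaded and the checksum published alongside it:

```r
# Sketch: check a downloaded file against a published MD5 checksum using
# base R's tools::md5sum(). The demo file below is a stand-in for a real
# download; use the installer path and the checksum from the download page.
verify_md5 <- function(path, published) {
  identical(unname(tools::md5sum(path)), tolower(published))
}

demo_file <- tempfile()
writeBin(charToRaw("hello"), demo_file)   # stand-in for a downloaded installer
verify_md5(demo_file, "5d41402abc4b2a76b9719d911017c592")  # TRUE
```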

Fortunately, many CRAN mirrors (including the master CRAN mirror, the Revolution Analytics mirror, and the RStudio mirror) support HTTPS today, and have defaulted to HTTPS downloads since before the release of R 3.2.2. Furthermore, R 3.2.2 supports package downloads using HTTPS, so if you downloaded R 3.2.2 (or later) from a secure mirror and are using that secure mirror as your default CRAN repository for packages, you're already protecting yourself from a man-in-the-middle attack. If you're using an earlier version of R, it's easy to configure it for HTTPS by using the steps in the R Consortium guide.
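The gist of those steps can be sketched in two lines of base R: select a download method that supports TLS, and point the CRAN repository at an HTTPS mirror (method availability varies by platform and how R was built):

```r
# Sketch of the configuration for older R installations: use a TLS-capable
# download method where available, and an HTTPS CRAN mirror as the default
# repository. (Exact method names vary by platform and R build.)
if (capabilities("libcurl")) {
  options(download.file.method = "libcurl")
}
options(repos = c(CRAN = "https://cran.rstudio.com/"))
grepl("^https://", getOption("repos")[["CRAN"]])  # TRUE
```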

Revolution R Open 3.2.1 is also secure by default: MRAN defaults to HTTPS, and the default CRAN snapshot is also a secure (HTTPS-enabled) repository on MRAN. If you're using an earlier version of Revolution R Open, you should similarly follow the steps in the R Consortium guide for the corresponding version of R. And if you're using Revolution R Enterprise, we've provided simple steps to configure Revolution R Enterprise 7.4.1 for secure package downloads.

These are simple steps everyone should take. And remember: anytime you're downloading software from the Internet, make sure it's via https://.

R Consortium: Best Practices for Using R Securely

*by Carl Nan, DeployR PM*

A new version of DeployR, the server-based framework that provides simple and secure R integration for application developers, is now available. (If you're new to DeployR, take a look at the DeployR Overview or download the white paper, Using DeployR to Solve the R Integration Problem.)

The following list highlights the changes and improvements to DeployR 7.4.1.

- Binary downloads are available for Windows, Mac and Linux systems, including Windows 10.
- DeployR Enterprise now supports RRE 7.4.1.
- DeployR Open now supports R 3.2.x and RRO 3.2.x.
- DeployR now supports Java 8.
- New and improved documentation, including the Getting Started guides referenced below and the new Writing Portable R Code guide for data scientists.
- A new R package, deployrUtils, whose goal is to solve several R portability issues that arise when data scientists develop R analytics for use both in their local R environment and in the DeployR server environment.

DeployR comes in two editions: DeployR Open is available free for Linux, Windows and Mac OS X, while DeployR Enterprise adds production-grade features like enterprise authentication and grid scaling with a paid subscription from Revolution Analytics.

See the official What's New document on the DeployR website.

*by Richard Kittler, Revolution R Enterprise PM, Microsoft Advanced Analytics*

In its latest release, Revolution has expanded the platform support of Revolution R Enterprise (RRE) version 7.4. Released August 14, version 7.4.1 extends RRE 7.4's capabilities to the Teradata database, Microsoft HPC Server cluster, and Windows 10 platforms.

With RRE for Teradata, customers enjoy the advantage of bringing the analytics to the data: RRE's high-performance, parallelized ScaleR algorithms run in-database on Teradata 14.10 and 15.00, rather than incurring the overhead of the traditional extract-and-analyze paradigm.

With RRE for Microsoft HPC Pack 2008 or 2012, these same high-performance ScaleR algorithms can be run in a distributed fashion across HPC-based Windows clusters.

Developers running RRE on laptops and desktops are now able to leverage Microsoft’s latest update to Windows with full RRE support for Windows 10.

Other new features in version 7.4.1 include:

- Default installation on top of Revolution R Open, Revolution’s enhanced distribution of open source R.
- Enhanced support for importing from and exporting to composite CSV and XDF files (experimental).

For RRE users integrating R with enterprise apps, RRE's included DeployR integration server now supports Java 8. For more detailed coverage of what's new in DeployR 7.4.1, please see this companion blog post or the link below.

Details of the new features in RRE 7.4.1 can be found in the release notes here.

Details of improvements to DeployR integration server in RRE 7.4.1 can be found here.

The latest update to Revolution R Open, RRO 3.2.1, is now available for download from MRAN. This release upgrades to the latest R engine (3.2.1), enables package downloads via HTTPS by default, and adds new supported Linux platforms.

Revolution R Open 3.2.1 includes:

- The latest R engine, R 3.2.1. Improvements in this release include more flexible character string handling, reduced memory usage, and some minor bug fixes.
- Multi-threaded math processing, reducing the time for some numerical operations on multi-core systems.
- A focus on reproducibility, with access to a fixed CRAN snapshot taken on July 1, 2015. Many new and updated packages are available since the previous release of RRO — see the latest Package Spotlight for details. CRAN packages released since July 1 can be easily (and reproducibly!) accessed with the checkpoint function.
- Binary downloads for Windows, Mac and Linux systems, including new support for SUSE Linux Enterprise Server 10 and 11, and openSUSE 13.1.
- 100% compatibility with R 3.2.1, RStudio and all other R-based applications.
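The multi-threaded math in the second bullet is easy to observe for yourself: dense linear algebra such as a matrix cross-product goes through the BLAS, which on RRO is the multi-threaded Intel MKL. A small illustrative sketch (the size and any timing you see are illustrative only):

```r
# A quick way to exercise the BLAS: time a matrix cross-product.
# On RRO with MKL this runs multi-threaded; on stock R it is typically
# single-threaded. Increase n to make the difference visible.
set.seed(42)
n <- 500
m <- matrix(rnorm(n * n), n, n)
elapsed <- system.time(xtx <- crossprod(m))["elapsed"]
dim(xtx)            # 500 x 500
isSymmetric(xtx)    # TRUE: t(m) %*% m is symmetric
```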

You can download Revolution R Open now from the link below, and we welcome comments, suggestions and other discussion on the RRO Google Group. If you're new to Revolution R Open, here are some tips to get started, and there are many data sources you can explore with RRO. Thanks go as always to the contributors to the R Project upon which RRO is built.

by Joseph Rickert

June was a hot month for extreme statistics and R. Not only did we close out the month with useR! 2015, but two small conferences in the middle of the month brought experts together from all over the world to discuss two very difficult areas of statistics that generate quite a bit of R code.

The Extreme Value Analysis conference is a prestigious event that is held every two years in different parts of the world. This year, over 230 participants from 26 countries met from June 15th through 19th at the University of Michigan, Ann Arbor for EVA 2015. The program included theoretical advances as well as novel applications of Extreme Value Theory in fields including finance, economics, insurance, hydrology, traffic safety, terrorism risk, climate and environmental extremes. You can get a good idea of the topics discussed at EVA from the book of abstracts, which includes an author index as well as a keyword index. The conference organizers are in the process of obtaining permission to post the slides from the talks; these should be available soon.

In the meantime, have a look at the slides from two excellent presentations from the Workshop on Statistical Computing which was held the day before the main conference. Eric Gilleland's Introduction to Extreme Value Analysis provides a gentle introduction for anyone willing to look at some math. Eric begins with some motivating examples, develops some key concepts and illustrates them with R and even provides some history along the way. This quote from Emil Gumbel, a founding giant in the field, should be every modeler's mantra: “Il est impossible que l’improbable n’arrive jamais”. ("It's impossible for the improbable to never occur" -- *ed*)

In Modeling spatial extremes with the SpatialExtremes package, Mathieu Ribatet works through a complete example in R by fitting and evaluating a model and running simulations. This motivating slide from the presentation describes the kind of problems he is considering.

In our world of climate extremes and financial black swans there are probably few topics of more immediate concern to statisticians than EVA, but the vexing problem of dealing with missing values might be one of them. So, it was not surprising that at nearly the same time (June 18th and 19th) 150 people or so gathered on the other side of the world in Rennes, France for missData 2015.

Over the years, R developers have expended considerable energy creating routines to handle missing values. The transcan function in the Hmisc package "automatically transforms continuous and categorical variables to have maximum correlation with the best linear combination of the other variables". mice provides functions for Fully Conditional Specification using the MICE algorithm. (See the slides from Stef van Buuren's presentation Fully Conditional Specification: Past, present and beyond for a perspective on FCS.) mi provides functions for missing-value imputation in a Bayesian framework, as does the BaBooN package, and VIM provides tools for visualizing the structure of missing values. Slides for almost all of the talks are available online at the conference program page, and videos will be available soon. Have a look at the slides from the lightning talk by Matthias Templ and Alexander Kowarik to see what the VIM package can do.
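As a baseline for what these packages improve upon, here is single mean imputation in a few lines of base R. This is a deliberately naive sketch, not code from mice, mi or Hmisc: single imputation understates uncertainty, which is precisely what multiple-imputation methods address.

```r
# Naive mean imputation in base R -- the simplistic baseline that the
# packages above are designed to improve on. Every NA is replaced by the
# mean of the observed values, which shrinks variance artificially.
x <- c(2.1, NA, 3.5, 4.0, NA, 2.8)
x_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)
sum(is.na(x_imputed))  # 0: every gap filled with the observed mean, 3.1
```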

Revolution Analytics was very pleased to have been able to sponsor both of these conferences. For the next EVA mark your calendars to visit Delft, the Netherlands in 2017.

by Joseph Rickert

The installr package has some really nice functions for working with the daily package download logs for the RStudio CRAN mirror, which RStudio graciously makes available at http://cran-logs.rstudio.com/. The following code uses the download_RStudio_CRAN_data() function to download a month's worth of .gz compressed daily log files into the test3 directory, and then uses the function read_RStudio_CRAN_data() to read all of these files into a data frame. (The portion of the status output provided shows the files being read in one at a time.) Next, the function most_downloaded_packages() calculates that the top six downloads for the month were: Rcpp, stringr, ggplot2, stringi, magrittr and plyr.

```r
# CODE TO DOWNLOAD LOG FILES FROM RSTUDIO CRAN MIRROR,
# FIND MOST DOWNLOADED PACKAGE AND PLOT DOWNLOADS
# FOR SELECTED PACKAGES
# -----------------------------------------------------------------
library(installr)
library(ggplot2)
library(data.table)   # for downloading
# -----------------------------------------------------------------
# Read data from RStudio site
RStudio_CRAN_dir <- download_RStudio_CRAN_data(START = '2015-05-15',
                                               END = '2015-06-15',
                                               log_folder = "C:/DATA/test3")
# read .gz compressed files from local directory
RStudio_CRAN_data <- read_RStudio_CRAN_data(RStudio_CRAN_dir)
#Reading C:/DATA/test3/2015-05-15.csv.gz ...
#Reading C:/DATA/test3/2015-05-16.csv.gz ...
#Reading C:/DATA/test3/2015-05-17.csv.gz ...
#Reading C:/DATA/test3/2015-05-18.csv.gz ...
#Reading C:/DATA/test3/2015-05-19.csv.gz ...
#Reading C:/DATA/test3/2015-05-20.csv.gz ...
#Reading C:/DATA/test3/2015-05-21.csv.gz ...
#Reading C:/DATA/test3/2015-05-22.csv.gz ...

dim(RStudio_CRAN_data)
# [1] 8055660      10

# Find the most downloaded packages
pkg_list <- most_downloaded_packages(RStudio_CRAN_data)
pkg_list
#    Rcpp  stringr  ggplot2  stringi magrittr     plyr
#  125529   115282   103921   103727   102083    97183

lineplot_package_downloads(names(pkg_list), RStudio_CRAN_data)

# Look at plots for some packages
barplot_package_users_per_day("checkpoint", RStudio_CRAN_data)
#$total_installations
#[1] 359
barplot_package_users_per_day("Rcpp", RStudio_CRAN_data)
#$total_installations
#[1] 23832
```

The function lineplot_package_downloads() produces a multiple time series plot for the top five packages:

and the barplot_package_users_per_day() function provides download plots. Here we contrast downloads for the Revolution Analytics' checkpoint package and Rcpp.

Downloads for the checkpoint package look pretty uniform over the month. checkpoint is a relatively new, specialized package for dealing with reproducibility issues, and the download pattern probably represents users discovering it. Rcpp, on the other hand, is essential to an incredible number of other R packages. The right-skewed plot most likely represents the tail end of the download cycle that started after Rcpp was upgraded on 5/1/15.

All of this works well for small amounts of data. However, the fact that read_RStudio_CRAN_data() puts everything in a data frame presents a bit of a problem when working with longer time periods on the 6GB of RAM in my laptop. So, after downloading the files representing the period 5/28/14 to 5/28/15 to my laptop,

```r
# Convert .gz compressed files to .csv files
in_names <- list.files("C:/DATA/RStudio_logs_1yr_gz",
                       pattern = "*.csv.gz", full.names = TRUE)
out_names <- sapply(strsplit(in_names, ".g", fixed = TRUE), "[[", 1)
length(in_names)
for (i in 1:length(in_names)) {
  df <- read.csv(in_names[i])
  write.csv(df, out_names[i], row.names = FALSE)
}
```

I used the external memory algorithms in Revolution R Enterprise to work with the data on disk. First, rxImport() brings all of the .csv files into a single .xdf file and stores it on my laptop. (Note that the rxGetInfo() function indicates that the file has over 90 million rows.) Then, the super-efficient rxCube() function is used to tabulate the package counts.

```r
# REVOSCALE R CODE TO IMPORT A YEAR'S WORTH OF DATA
data_dir <- "C:/DATA/RStudio_logs_1yr"
in_names <- list.files(data_dir, pattern = "*.csv.gz", full.names = TRUE)
out_names <- sapply(strsplit(in_names, ".g", fixed = TRUE), "[[", 1)
#----------------------------------------------------
# Import to .xdf file
# Establish the column classes for the variables
colInfo <- list(
  list(name = "date",      type = "character"),
  list(name = "time",      type = "character"),
  list(name = "size",      type = "integer"),
  list(name = "r_version", type = "factor"),
  list(name = "r_arch",    type = "factor"),
  list(name = "r_os",      type = "factor"),
  list(name = "package",   type = "factor"),
  list(name = "version",   type = "factor"),
  list(name = "country",   type = "factor"),
  list(name = "1p_1d",     type = "integer"))

num_files <- length(out_names)
out_file <- file.path(data_dir, "RStudio_logs_1yr")
append <- FALSE
for (i in 1:num_files) {
  rxImport(inData = out_names[i], outFile = out_file,
           colInfo = colInfo, append = append, overwrite = TRUE)
  append <- TRUE
}
# Look at a summary of the imported data
rxGetInfo(out_file)
```

```r
#File name: C:\DATA\RStudio_logs_1yr\RStudio_logs_1yr.xdf
#Number of observations: 90200221
#Number of variables: 10

# Long-form tabulation
cube1 <- rxCube(~ package, data = out_file)
# Computation time: 5.907 seconds.
cube1 <- as.data.frame(cube1)
sort1 <- rxSort(cube1, decreasing = TRUE, sortByVars = "Counts")
#Time to sort data file: 0.078 seconds
write.csv(head(sort1, 100), "Top_100_Packages.csv")
```

Here are the download counts for the top 100 packages for the period 5/28/14 to 5/28/15.

You can download this data here: Download Top_100_Packages

by Bill Jacobs, Director Technical Sales, Microsoft Advanced Analytics

In the course of working with our Hadoop users, we are often asked: what's the best way to integrate R with Hadoop?

The answer, in nearly all cases, is: it depends.

Alternatives range from open source R on workstations to parallelized commercial products like Revolution R Enterprise, with many steps in between. Between these extremes lies a range of options with differing abilities to scale data size, performance, capability and ease of use.

And so, the right choice or choices depends on your data size, budget, skill, patience and governance limitations.

In this post, I’ll summarize the alternatives using pure open source R and some of their advantages. In a subsequent post, I’ll describe the options for achieving even greater scale, speed, stability and ease of development by combining open source and commercial technologies.

These two posts are written to help current R users who are novices at Hadoop understand and select solutions to evaluate.

As with most things open source, the first consideration is of course monetary. Isn't it always? The good news is that there are multiple alternatives that are free, with additional capabilities under development in various open source projects.

We generally see four options for building R-to-Hadoop integration using entirely open source stacks.

This baseline approach’s greatest advantage is simplicity and cost. It’s free. End to end free. What else in life is?

Through packages Revolution contributed to open source, including rhdfs and rhbase, R users can directly ingest data from both the HDFS file system and the HBase database subsystems in Hadoop. Both connectors are part of the RHadoop project created and maintained by Revolution and are a go-to choice.

Additional options exist as well. The RHive package executes Hive's SQL-like HQL query language directly from R, and provides functions for retrieving metadata from Hive such as database names, table names, column names, etc.

The RHive package, in particular, has the advantage that it allows some data operations to be pushed down into Hadoop, avoiding data movement and parallelizing operations for big speed increases. Similar "push-down" can be achieved with rhbase as well. However, neither is a particularly rich environment, and invariably, complex analytical problems will reveal some gaps in capability.

Beyond the somewhat limited push-down capabilities, R is best at working on modest data sampled from HDFS, HBase or Hive, and in this way current R users can get going with Hadoop quickly.

Once you tire of R's memory barriers on your laptop, the obvious next path is a shared server. With today's technologies, you can equip a powerful server for only a few thousand dollars and easily share it among a few users. Using Windows or Linux with 256GB or 512GB of RAM, R can be used to analyze files into the hundreds of gigabytes, albeit not as fast as perhaps you'd like.
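A rough back-of-envelope helps when sizing such a server: an all-numeric data set of doubles costs about rows × columns × 8 bytes in memory, before counting the working copies R makes during analysis (often a 2–3x multiplier). A sketch with illustrative numbers:

```r
# Back-of-envelope RAM arithmetic for an all-numeric data set: doubles are
# 8 bytes each, so rows * cols * 8 bytes, ignoring object overhead and the
# working copies R makes during analysis (often 2-3x more in practice).
rows <- 500e6   # half a billion rows -- illustrative only
cols <- 20
gb   <- rows * cols * 8 / 1024^3
gb   # roughly 75 GB: feasible on a 256GB server, hopeless on a laptop
```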

Like option 1, R on a shared server can also leverage the push-down capabilities of the rhbase and RHive packages to achieve parallelism and avoid data movement. However, as with workstations, those push-down capabilities are limited.

And of course, while lots of RAM keeps the dread of memory exhaustion at bay, it does little for compute performance, and depends on sharing skills learned [or perhaps not learned] in kindergarten. For these reasons, consider a shared server a great add-on to R on workstations, but not a complete substitute.

Replacing the CRAN download of R with Revolution R Open (RRO), Revolution's enhanced R distribution, improves performance further. RRO is, like R itself, open source, 100% R, and free to download. It accelerates math computations using the Intel Math Kernel Library (MKL) and is 100% compatible with the algorithms in CRAN and other repositories like Bioconductor. No changes are required to R scripts, and the acceleration the MKL offers varies from negligible to an order of magnitude for scripts making intensive use of certain math and linear algebra primitives. If you're doing substantial math operations, you can anticipate that RRO will roughly double average performance.

As with options 1 and 2, Revolution R Open can be used with connectors like rhdfs, and can connect and push work down into Hadoop through rhbase and rhive.

Once you find that your problem set is too big or your patience is being taxed on a workstation or server, and the limitations of rhbase and RHive push-down are impeding progress, you're ready to run R inside of Hadoop.

The open source RHadoop project, which includes rhdfs, rhbase and plyrmr, also includes the rmr2 package, which enables R users to build Hadoop map and reduce operations using R functions. Using mappers, R functions are applied to all of the data blocks that compose an HDFS file, an HBase table or other data set, and the results can be sent to a reducer, also an R function, for aggregation or analysis. All work is conducted inside of Hadoop but is built in R.

Let’s be clear. Applying R functions on each hdfs file segment is a great way to accelerate computation. But for most, it is the avoidance of moving data that really accentuates performance. To do this, rmr2 applies R functions to the data residing on Hadoop nodes rather than moving the data to where R resides.
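The mapper/reducer shape rmr2 asks for can be previewed with base R's split-apply-combine idiom: a map step emits (key, value) pairs, and a reduce step aggregates each key's values. This is an analogy only, not rmr2 code; rmr2's mapreduce() runs the same pattern per HDFS block on the Hadoop nodes:

```r
# The map/reduce shape in plain R, as an analogy for what rmr2 executes
# per HDFS block. A map step emits (key, value) pairs; a reduce step
# aggregates each key's values. (Analogy only -- not rmr2 code.)
records <- c("r", "hadoop", "r", "spark", "r", "hadoop")

mapped  <- lapply(records, function(rec) list(key = rec, value = 1L))  # "map"
keys    <- vapply(mapped, `[[`, character(1), "key")
vals    <- vapply(mapped, `[[`, integer(1),   "value")
reduced <- tapply(vals, keys, sum)                    # "reduce": sum per key
reduced[["r"]]   # 3
```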

While rmr2 gives essentially unlimited capabilities, as a data scientist or statistician your thoughts will soon turn to computing entire algorithms in R on large data sets. Using rmr2 in this way complicates development, because the R programmer must write the entire logic of the desired algorithm or adapt existing CRAN algorithms, then validate that the algorithm is accurate and reflects the expected mathematical result, and write code for the myriad corner cases such as missing data.

rmr2 requires coding on your part to manage parallelization. This may be trivial for data transformation operations, aggregates, etc., or quite tedious if you’re trying to train predictive models or build classifiers on large data.

While rmr2 can be more tedious than other approaches, it is not untenable, and most R programmers will find rmr2 much easier than resorting to Java-based development of Hadoop mappers and reducers. While somewhat tedious, it is a) fully open source, b) helps to parallelize computation to address larger data sets, c) skips painful data movement, d) is broadly used so you’ll find help available, and e), is free. Not bad.

rmr2 is not the only option in this category: a similar package called rhipe is also available and provides similar capabilities. rhipe is described here and here and is downloadable from GitHub.

The range of open source-based options for using R with Hadoop is expanding. The Apache Spark community, for example, is rapidly improving R integration via the predictably named SparkR. Today, SparkR provides access to Spark from R, much as rmr2 and rhipe do for Hadoop MapReduce.

We expect that, in the future, the SparkR team will add support for Spark's MLlib machine learning library, providing execution directly from R. Availability dates haven't been widely published.

Perhaps the most exciting observation is that R has become “table stakes” for platform vendors. Our partners at Cloudera, Hortonworks, MapR and others, along with database vendors and others, are all keenly aware of the dominance of R among the large and growing data science community, and R’s importance as a means to extract insights and value from the burgeoning data repositories built atop Hadoop.

In a subsequent post, I’ll review the options for creating even greater performance, simplicity, portability and scale available to R users by expanding the scope from open source only solutions to those like Revolution R Enterprise for Hadoop.

by Joseph Rickert

Because of its simplicity and good performance over a wide spectrum of classification problems, the Naïve Bayes classifier ought to be on everyone's short list of machine learning algorithms. Now, with version 7.4, we have a high-performance Naïve Bayes classifier in Revolution R Enterprise too. Like all Parallel External Memory Algorithms (PEMAs) in the RevoScaleR package, rxNaiveBayes() is an inherently parallel algorithm that may be distributed across Microsoft HPC, Linux and Hadoop clusters and may be run on data in Teradata databases.

The following example shows how to get started with rxNaiveBayes() on moderately sized data in your local environment. It uses the Mortgage data set, which may be downloaded from the Revolution Analytics data set repository. The first block of code imports the .csv files for the years 2000 through 2008 and concatenates them into a single training file in the .XDF binary format. Then, the data for the year 2009 is imported to a test file that will be used for making predictions.

```r
#-----------------------------------------------
# Set up the data location information
bigDataDir <- "C:/Data/Mortgage"
mortCsvDataName <- file.path(bigDataDir, "mortDefault")
trainingDataFileName <- "mortDefaultTraining"
mortCsv2009 <- paste(mortCsvDataName, "2009.csv", sep = "")
targetDataFileName <- "mortDefault2009.xdf"
#---------------------------------------
# Import the data from multiple .csv files into 2 .XDF files.
# One file, the training file, contains data from the years
# 2000 through 2008. The other file, the test file, contains
# data from the year 2009.
defaultLevels <- as.character(c(0, 1))
ageLevels <- as.character(c(0:40))
yearLevels <- as.character(c(2000:2009))
colInfo <- list(list(name = "default",  type = "factor", levels = defaultLevels),
                list(name = "houseAge", type = "factor", levels = ageLevels),
                list(name = "year",     type = "factor", levels = yearLevels))
append <- FALSE
for (i in 2000:2008) {
  importFile <- paste(mortCsvDataName, i, ".csv", sep = "")
  rxImport(inData = importFile, outFile = trainingDataFileName,
           colInfo = colInfo, append = append, overwrite = TRUE)
  append <- TRUE
}
```

The rxGetInfo() command shows that the training file has 9 million observations with 6 variables and the test file contains 1 million observations. The binary factor variable default, which indicates whether or not an individual defaulted on their mortgage, will be the target variable in the classification exercise.

```r
rxGetInfo(trainingDataFileName, getVarInfo = TRUE)
#File name: C:\Users\jrickert\Documents\Revolution\NaiveBayes\mortDefaultTraining.xdf
#Number of observations: 9e+06
#Number of variables: 6
#Number of blocks: 18
#Compression type: zlib
#Variable information:
#Var 1: creditScore, Type: integer, Low/High: (432, 955)
#Var 2: houseAge
#       41 factor levels: 0 1 2 3 4 ... 36 37 38 39 40
#Var 3: yearsEmploy, Type: integer, Low/High: (0, 15)
#Var 4: ccDebt, Type: integer, Low/High: (0, 15566)
#Var 5: year
#       10 factor levels: 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
#Var 6: default
#       2 factor levels: 0 1

rxImport(inData = mortCsv2009, outFile = targetDataFileName, colInfo = colInfo)
rxGetInfo(targetDataFileName)
#File name: C:\Users\jrickert\Documents\Revolution\NaiveBayes\mortDefault2009.xdf
#Number of observations: 1e+06
#Number of variables: 6
#Number of blocks: 2
#Compression type: zlib
```

Next, the rxNaiveBayes() function is used to fit a classification model with default as the target variable and year, credit score, years employed and credit card debt as predictors. Note that the smoothingFactor parameter instructs the classifier to perform Laplace smoothing. (Since the conditional probabilities are being multiplied in the model, adding a small number to zero probabilities precludes missing categories from wiping out the calculation.) Also note that it took about 1.9 seconds to fit the model on my modest Lenovo ThinkPad, which is powered by an Intel i7-5600U processor and equipped with 8GB of RAM.

```r
# Build the classifier on the training data
mortNB <- rxNaiveBayes(default ~ year + creditScore + yearsEmploy + ccDebt,
                       data = trainingDataFileName, smoothingFactor = 1)
#Rows Read: 500000, Total Rows Processed: 8500000, Total Chunk Time: 0.110 seconds
#Rows Read: 500000, Total Rows Processed: 9000000, Total Chunk Time: 0.125 seconds
#Computation time: 1.875 seconds.
```
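The effect of the smoothing factor described above is easy to see with a little arithmetic: with count c for a level, n observations in the class, k levels and smoothing s, the estimated conditional probability is (c + s) / (n + s·k), so an unseen level retains a small positive probability instead of zeroing out the product. This is a sketch of the standard Laplace-smoothing formula, not RevoScaleR internals; the counts are toy numbers:

```r
# Laplace (add-one) smoothing by hand: an unseen factor level keeps a small
# positive probability rather than annihilating the product of conditional
# probabilities. (Standard formula; toy counts, not RevoScaleR internals.)
counts <- c(`2007` = 83, `2008` = 597, `2009` = 0)  # toy per-level counts
s <- 1                                              # smoothingFactor
k <- length(counts)
smoothed   <- (counts + s) / (sum(counts) + s * k)
unsmoothed <- counts / sum(counts)
unsmoothed[["2009"]]  # 0: would wipe out the whole product
smoothed[["2009"]]    # small but positive
```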

Looking at the model object we see that conditional probabilities are calculated for all of the factor (categorical) variables and means and standard deviations are calculated for numeric variables. rxNaiveBayes() follows the standard practice of assuming that these variables follow Gaussian distributions.

```r
#> mortNB
#
#Naive Bayes Classifier
#
#Call:
#rxNaiveBayes(formula = default ~ year + creditScore + yearsEmploy +
#    ccDebt, data = trainingDataFileName, smoothingFactor = 1)
#
#A priori probabilities:
#default
#          0           1
#0.997242889 0.002757111
#
#Predictor types:
#     Variable    Type
#1        year  factor
#2 creditScore numeric
#3 yearsEmploy numeric
#4      ccDebt numeric
#
#Conditional probabilities:
#$year
#       year
#default         2000         2001         2002         2003         2004
#      0 1.113034e-01 1.110692e-01 1.112866e-01 1.113183e-01 1.113589e-01
#      1 4.157267e-02 1.262488e-01 4.765549e-02 3.617467e-02 2.151144e-02
#       year
#default         2005         2006         2007         2008         2009
#      0 1.113663e-01 1.113403e-01 1.111888e-01 1.097681e-01 1.114182e-07
#      1 1.885272e-02 2.823880e-02 8.302449e-02 5.966806e-01 4.028360e-05
#
#$creditScore
#     Means   StdDev
#0 700.0839 50.00289
#1 686.5243 49.71074
#
#$yearsEmploy
#     Means   StdDev
#0 5.006873 2.009446
#1 4.133030 1.969213
#
#$ccDebt
#     Means   StdDev
#0 4991.582 1976.716
#1 9349.423 1459.797
```

Next, we use the rxPredict() function to predict default values for the test data set. Setting the type = "prob" parameter produced the table of probabilities below. Using the default for type would have produced only the default_Pred column of forecasts. In a multi-value forecast, the probability table would contain entries for all possible values.

```r
# Use the model to predict whether a loan will default on the test data
mortNBPred <- rxPredict(mortNB, data = targetDataFileName, type = "prob")
#Rows Read: 500000, Total Rows Processed: 500000, Total Chunk Time: 3.876 seconds
#Rows Read: 500000, Total Rows Processed: 1000000, Total Chunk Time: 2.280 seconds
```

```r
names(mortNBPred) <- c("prob_0", "prob_1")
mortNBPred$default_Pred <- as.factor(round(mortNBPred$prob_1))

head(mortNBPred)
#     prob_0      prob_1 default_Pred
#1 0.9968860 0.003114038            0
#2 0.9569425 0.043057472            0
#3 0.5725627 0.427437291            0
#4 0.9989603 0.001039729            0
#5 0.7372746 0.262725382            0
#6 0.4142266 0.585773432            1
```

In this next step, we tabulate the actual vs. predicted values for the test data set to produce the "confusion matrix" and an estimate of the misclassification rate.

```r
# Tabulate the actual and predicted values
actual_value <- rxDataStep(targetDataFileName, maxRowsByCols = 6000000)[["default"]]
predicted_value <- mortNBPred[["default_Pred"]]
results <- table(predicted_value, actual_value)
results
#               actual_value
#predicted_value      0      1
#              0 877272   3792
#              1  97987  20949

# Misclassification rate: off-diagonal counts over the total
pctMisclassified <- (results[1, 2] + results[2, 1]) / sum(results) * 100
pctMisclassified
#[1] 10.1779
```

Since the results object produced above is an ordinary table, we can use the confusionMatrix() function from the caret package to produce additional performance measures.

```r
# Use confusionMatrix from the caret package to look at the results
library(caret)
library(e1071)
confusionMatrix(results, positive = "1")
#Confusion Matrix and Statistics
#
#               actual_value
#predicted_value      0      1
#              0 877272   3792
#              1  97987  20949
#
#               Accuracy : 0.8982
#                 95% CI : (0.8976, 0.8988)
#    No Information Rate : 0.9753
#    P-Value [Acc > NIR] : 1
#
#                  Kappa : NA
# Mcnemar's Test P-Value : <2e-16
#
#            Sensitivity : 0.84673
#            Specificity : 0.89953
#         Pos Pred Value : 0.17614
#         Neg Pred Value : 0.99570
#             Prevalence : 0.02474
#         Detection Rate : 0.02095
#   Detection Prevalence : 0.11894
#      Balanced Accuracy : 0.87313
#
#       'Positive' Class : 1
```

Finally, we use the hist() function to look at a histogram (not shown) of the actual values to get a feel for how unbalanced the data set is, and then use the rxRocCurve() function to produce the ROC curve.

```r
roc_data <- data.frame(mortNBPred$prob_1, as.integer(actual_value) - 1)
names(roc_data) <- c("predicted_value", "actual_value")
head(roc_data)
hist(roc_data$actual_value)
rxRocCurve("actual_value", "predicted_value", roc_data,
           title = "ROC Curve for Naive Bayes Mortgage Defaults Model")
```

Here we have a "picture-perfect" representation of how one hopes a classifier will perform.

For more on the Naïve Bayes classification algorithm have a look at these two papers referenced in the Wikipedia link above.

The first is a prescient 1961 paper by Marvin Minsky that explicitly calls attention to the naïve independence assumption. The second paper provides some theoretical arguments for why the overall excellent performance of the Naïve Bayes classifier is not accidental.