by Joseph Rickert

Because of its simplicity and good performance over a wide spectrum of classification problems, the Naïve Bayes classifier ought to be on everyone's short list of machine learning algorithms. Now, with version 7.4, we have a high-performance Naïve Bayes classifier in Revolution R Enterprise too. Like all Parallel External Memory Algorithms (PEMAs) in the RevoScaleR package, rxNaiveBayes() is an inherently parallel algorithm that may be distributed across Microsoft HPC, Linux and Hadoop clusters and may be run on data in Teradata databases.

The following example shows how to get started with rxNaiveBayes() on moderately sized data in your local environment. It uses the Mortgage data set, which may be downloaded from the Revolution Analytics data set repository. The first block of code imports the .csv files for the years 2000 through 2008 and concatenates them into a single training file in the .XDF binary format. Then, the data for the year 2009 is imported to a test file that will be used for making predictions.

#-----------------------------------------------
# Set up the data location information
bigDataDir <- "C:/Data/Mortgage"
mortCsvDataName <- file.path(bigDataDir, "mortDefault")
trainingDataFileName <- "mortDefaultTraining"
mortCsv2009 <- paste(mortCsvDataName, "2009.csv", sep = "")
targetDataFileName <- "mortDefault2009.xdf"
#---------------------------------------
# Import the data from multiple .csv files into 2 .XDF files.
# One file, the training file, contains data from the years
# 2000 through 2008.
# The other file, the test file, contains data from the year 2009.
defaultLevels <- as.character(c(0, 1))
ageLevels <- as.character(c(0:40))
yearLevels <- as.character(c(2000:2009))
colInfo <- list(list(name = "default", type = "factor", levels = defaultLevels),
                list(name = "houseAge", type = "factor", levels = ageLevels),
                list(name = "year", type = "factor", levels = yearLevels))
append <- FALSE
for (i in 2000:2008) {
    importFile <- paste(mortCsvDataName, i, ".csv", sep = "")
    rxImport(inData = importFile, outFile = trainingDataFileName,
             colInfo = colInfo, append = append, overwrite = TRUE)
    append <- TRUE
}

The rxGetInfo() command shows that the training file has 9 million observations with 6 variables and that the test file contains 1 million observations. The binary factor variable, default, which indicates whether or not an individual defaulted on a mortgage, will be the target variable in the classification exercise.

rxGetInfo(trainingDataFileName, getVarInfo = TRUE)
#File name: C:\Users\jrickert\Documents\Revolution\NaiveBayes\mortDefaultTraining.xdf
#Number of observations: 9e+06
#Number of variables: 6
#Number of blocks: 18
#Compression type: zlib
#Variable information:
#Var 1: creditScore, Type: integer, Low/High: (432, 955)
#Var 2: houseAge
#       41 factor levels: 0 1 2 3 4 ... 36 37 38 39 40
#Var 3: yearsEmploy, Type: integer, Low/High: (0, 15)
#Var 4: ccDebt, Type: integer, Low/High: (0, 15566)
#Var 5: year
#       10 factor levels: 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
#Var 6: default
#       2 factor levels: 0 1

rxImport(inData = mortCsv2009, outFile = targetDataFileName, colInfo = colInfo)
rxGetInfo(targetDataFileName)
#File name: C:\Users\jrickert\Documents\Revolution\NaiveBayes\mortDefault2009.xdf
#Number of observations: 1e+06
#Number of variables: 6
#Number of blocks: 2
#Compression type: zlib

Next, the rxNaiveBayes() function is used to fit a classification model with default as the target variable and year, credit score, years employed and credit card debt as predictors. Note that the smoothingFactor parameter instructs the classifier to perform Laplace smoothing. (Since the conditional probabilities are multiplied together in the model, adding a small number to zero probabilities prevents missing categories from wiping out the calculation.) Also note that it took about 1.9 seconds to fit the model on my modest Lenovo ThinkPad, which is powered by an Intel i7 5600U processor and equipped with 8GB of RAM.
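To see concretely why smoothing matters, here is a toy illustration in base R with made-up counts (just the arithmetic, not the rxNaiveBayes() internals):

```r
# Hypothetical counts of the year factor among 100 defaulters
counts <- c("2007" = 8, "2008" = 92, "2009" = 0)

# Unsmoothed conditional probabilities P(year | default = 1)
p_raw <- counts / sum(counts)
p_raw[["2009"]]     # 0: any 2009 record would zero out the whole product

# Laplace smoothing with smoothingFactor = 1 adds 1 to every cell
p_smooth <- (counts + 1) / (sum(counts) + length(counts))
p_smooth[["2009"]]  # small but nonzero, so the product survives
```

The unseen category gets probability 1/103 rather than 0, which is exactly the protection described above.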

# Build the classifier on the training data
mortNB <- rxNaiveBayes(default ~ year + creditScore + yearsEmploy + ccDebt,
                       data = trainingDataFileName, smoothingFactor = 1)
#Rows Read: 500000, Total Rows Processed: 8500000, Total Chunk Time: 0.110 seconds
#Rows Read: 500000, Total Rows Processed: 9000000, Total Chunk Time: 0.125 seconds
#Computation time: 1.875 seconds.

Looking at the model object, we see that conditional probabilities are calculated for all of the factor (categorical) variables, and that means and standard deviations are calculated for numeric variables. rxNaiveBayes() follows the standard practice of assuming that these variables follow Gaussian distributions.

#> mortNB
#
#Naive Bayes Classifier
#
#Call:
#rxNaiveBayes(formula = default ~ year + creditScore + yearsEmploy +
#    ccDebt, data = trainingDataFileName, smoothingFactor = 1)
#
#A priori probabilities:
#default
#          0           1
#0.997242889 0.002757111
#
#Predictor types:
#     Variable    Type
#1        year  factor
#2 creditScore numeric
#3 yearsEmploy numeric
#4      ccDebt numeric
#
#Conditional probabilities:
#$year
#       year
#default         2000         2001         2002         2003         2004
#      0 1.113034e-01 1.110692e-01 1.112866e-01 1.113183e-01 1.113589e-01
#      1 4.157267e-02 1.262488e-01 4.765549e-02 3.617467e-02 2.151144e-02
#       year
#default         2005         2006         2007         2008         2009
#      0 1.113663e-01 1.113403e-01 1.111888e-01 1.097681e-01 1.114182e-07
#      1 1.885272e-02 2.823880e-02 8.302449e-02 5.966806e-01 4.028360e-05
#
#$creditScore
#     Means   StdDev
#0 700.0839 50.00289
#1 686.5243 49.71074
#
#$yearsEmploy
#     Means   StdDev
#0 5.006873 2.009446
#1 4.133030 1.969213
#
#$ccDebt
#     Means   StdDev
#0 4991.582 1976.716
#1 9349.423 1459.797

Next, we use the rxPredict() function to predict default values for the test data set. Setting the type = "prob" parameter produces the table of probabilities below. Using the default value for type would have produced only the default_Pred column of forecasts. In a multi-valued forecast, the probability table would contain entries for all possible values.

# Use the model to predict whether a loan will default on the test data
mortNBPred <- rxPredict(mortNB, data = targetDataFileName, type = "prob")
#Rows Read: 500000, Total Rows Processed: 500000, Total Chunk Time: 3.876 seconds
#Rows Read: 500000, Total Rows Processed: 1000000, Total Chunk Time: 2.280 seconds

names(mortNBPred) <- c("prob_0", "prob_1")
mortNBPred$default_Pred <- as.factor(round(mortNBPred$prob_1))
head(mortNBPred)
#     prob_0      prob_1 default_Pred
#1 0.9968860 0.003114038            0
#2 0.9569425 0.043057472            0
#3 0.5725627 0.427437291            0
#4 0.9989603 0.001039729            0
#5 0.7372746 0.262725382            0
#6 0.4142266 0.585773432            1

In this next step, we tabulate the actual vs. predicted values for the test data set to produce the "confusion matrix" and an estimate of the misclassification rate.

# Tabulate the actual and predicted values
actual_value <- rxDataStep(targetDataFileName, maxRowsByCols = 6000000)[["default"]]
predicted_value <- mortNBPred[["default_Pred"]]
results <- table(predicted_value, actual_value)
results
#               actual_value
#predicted_value      0      1
#              0 877272   3792
#              1  97987  20949
pctMisclassified <- sum(results[1:2, 2]) / sum(results[1:2, 1]) * 100
pctMisclassified
#[1] 2.536865

Since the results object produced above is an ordinary table, we can use the confusionMatrix() function from the caret package to produce additional performance measures.

# Use confusionMatrix from the caret package to look at the results
library(caret)
library(e1071)
confusionMatrix(results, positive = "1")
#Confusion Matrix and Statistics
#
#               actual_value
#predicted_value      0      1
#              0 877272   3792
#              1  97987  20949
#
#               Accuracy : 0.8982
#                 95% CI : (0.8976, 0.8988)
#    No Information Rate : 0.9753
#    P-Value [Acc > NIR] : 1
#
#                  Kappa : NA
# Mcnemar's Test P-Value : <2e-16
#
#            Sensitivity : 0.84673
#            Specificity : 0.89953
#         Pos Pred Value : 0.17614
#         Neg Pred Value : 0.99570
#             Prevalence : 0.02474
#         Detection Rate : 0.02095
#   Detection Prevalence : 0.11894
#      Balanced Accuracy : 0.87313
#
#       'Positive' Class : 1

Finally, we use the hist() function to look at a histogram (not shown) of the actual values to get a feel for how unbalanced the data set is, and then use the rxRocCurve() function to produce the ROC curve.

roc_data <- data.frame(mortNBPred$prob_1, as.integer(actual_value) - 1)
names(roc_data) <- c("predicted_value", "actual_value")
head(roc_data)
hist(roc_data$actual_value)
rxRocCurve("actual_value", "predicted_value", roc_data,
           title = "ROC Curve for Naive Bayes Mortgage Defaults Model")

Here we have a "picture-perfect" representation of how one hopes a classifier will perform.

For more on the Naïve Bayes classification algorithm have a look at these two papers referenced in the Wikipedia link above.

The first is a prescient 1961 paper by Marvin Minsky that explicitly calls attention to the naïve independence assumption. The second paper provides some theoretical arguments for why the overall excellent performance of the Naïve Bayes classifier is not accidental.

by Joseph Rickert

The 8th XLDB (Extremely Large Databases) Conference opened at Stanford on Tuesday with an outstanding program. This conference has been providing leadership in the "Big Data" world since its first workshop, held in 2007. For example, the summary report for that year notes: "Both communities (industry and science) are moving towards parallel ... architectures on large clusters of commodity hardware, with the map/reduce paradigm as the leading processing model." but also observes that: "The map/reduce paradigm ... will likely not be the final answer" — prescience and a sober assessment with none of the hype that was to follow.

The extraordinary feature of the first day of this year's conference was the prominence of R. Several talks were either directly about R, or discussed R in conjunction with a significant subtopic. John Chambers spoke on "R in the World: Interfaces between Languages". Karim Chine presented ElasticR. Hannes Mühleisen elaborated on some innovative ideas in his talk "R as a Query Language", describing a system for using R to write effective queries based on Renjin (R on the JVM). Jeff Lefevre discussed HP's Distributed R in his talk on "Extending Vertica with External Analytics". Rene Brun described the Root-R package and Rcpp in his talk about "ROOT: a Data Storage and Analysis Framework" used at CERN, and Nachum Shacham mentioned both R and the R/H2O/Hadoop interface in his opening talk: "On the Practice of Predictive Modeling with Big Data".

Even Stephen Wolfram obliquely referred to R! He began his special keynote talk, and very impressive impromptu demo of the Wolfram Language, with a statement that went something like this: "Unlike other languages that have a very small core and add features through packages, we decided to build as much as possible into the language". The exact quote will have to wait until the video is available, but it very much seemed to me that, at least with respect to design, he was positioning the Wolfram Language (the combination of Mathematica and Wolfram Alpha) as a kind of anti-R!

The slides from all of the talks will be available on the conference program page in a couple of days, and the conference videos will follow in June. In the meantime, through the kindness of Hannes Mühleisen and the conference organizers, we have Hannes' slides and those of Rene Brun and John Chambers available for download.

The following slide from Hannes' presentation indicates how R might be made more efficient through certain SQL sensibilities, and seems to share the spirit of data.table.

Rene's presentation contains several informative slides. Be sure to check out slide 11, which shows when C/C++ overtook Fortran; slide 29, which gives an overview of the core ROOT Math/Stat libraries; and slide 40, which shows how R, Rcpp and RInside fit in.

John Chambers' presentation begins with a reminder that the original S language was initially conceived as an interface to the Fortran libraries, the outstanding computational resource of the day, and then stresses that R's interfaces to other languages and resources, such as databases, are among its greatest strengths.

He then elaborates on his three principles for understanding R and describes the motivations, architecture and design of the new group of "XR" packages he is working on. When complete, these will provide a uniform interface to languages as diverse as Python and Julia and provide proxies to objects, functions and classes that will benefit both end-user programmers and developers.

If XLDB 8 turns out to be as prescient as its predecessors at pointing to the direction in which big databases will go, then the future will bring some pretty exciting developments to R.

Downloads:

Download Tues_Hannes_Muehleisen

Download 4_Tues_ReneBrun_XLDB

Download 9_Tues_Chambers - XLDB Conference

Earlier this month TechCrunch published an article of mine, "The Business Economics And Opportunity Of Open-Source Data Science". With this article I wanted to share how open-source software has disrupted the economics of doing business, now that data is a fundamental component of every business's operations. Open source projects like Hadoop and R, coupled with commodity hardware, have fundamentally changed the equation when it comes to the scale and scope of the problems that can feasibly be tackled.

If you'd like to read more on this topic, one other article I particularly recommend is by RedMonk's Stephen O'Grady: *Open Source and the Rise of as-a-Service Businesses*. Here's the key quote (as cited in the TechCrunch article):

“Unlike prior eras in which industry players lacking technical competencies effectively outsourced the job of software creation to third party commercial software organizations, companies like Amazon, Facebook and Google looked around and quickly determined that help was not coming from that direction – and even if it did, the economics of traditional software licensing would be a non-starter in scale-out environments.”

TechCrunch: The Business Economics And Opportunity Of Open-Source Data Science

*by Bill Jacobs, Director Technical Sales, Microsoft Advanced Analytics*

Without missing a beat, the engineers at Revolution Analytics have brought another strong release to users of Revolution R Enterprise (RRE). Just a few weeks after the acquisition of Revolution Analytics by Microsoft, RRE 7.4 was released to customers on May 15, adding new capabilities, enhanced performance and security, and faster and simpler Hadoop editions.

New features in version 7.4 include:

- Addition of Naïve Bayes Classifiers to the ScaleR library of algorithms
- Optional coefficient tracking for stepwise regressions. Coefficient tracking makes stepwise less of a “black box” by illustrating how model features are selected as the algorithm iterates toward a final reduced model.
- Faster import of wide data sets, faster computation of data summaries, and faster fitting of tree-based models and predictions of decision forests and gradient boosted tree algorithms.
- Support for HDFS file caching on Hadoop that speeds analysis of most files, especially when applying multi-step and iterative algorithms.
- Improved tools for distributing R packages across Cloudera Hadoop clusters.
- An updated edition of R, version 3.1.3.
- Certification of RRE 7.4 on Cloudera CDH 5.2 and 5.3, Hortonworks HDP 2.2 and MapR 4.0.2, along with certification of the much requested CentOS as a supported Linux platform.

For RRE users integrating R with enterprise apps, RRE's included DeployR integration server now includes:

- New R Session Process Controls that provide fine-grained file access controls on Linux.
- Support for external file repositories including git and svn for managing scripts and metadata used by DeployR.
- Strengthened password handling to resist recent attack vectors.
- Updates to the Java, JavaScript and .NET broker frameworks and corresponding client libraries.
- A new DeployR command line tool that will grow in capability with subsequent releases.

More broadly, the release of version 7.4 so shortly after the acquisition of Revolution by Microsoft underscores our commitment to delivering an expanding array of enterprise-capable R platforms. It also demonstrates Microsoft's growing commitment to the growth of advanced analytics facilities that leverage and extend upon open source technologies such as the R language.

Details of the new features in RRE 7.4 can be found in the release notes here.

Details of improvements to DeployR integration server in RRE 7.4 can be found here.

Arthur Charpentier was trying to solve an interesting problem with R: given this data set of random walks in the 2-D plane, what is the likely *origin* of a pathway that ends in the black circle below?

It's pretty easy to generate random data like this with a few lines of code in R. And with 2 million trajectories of 80 points each, you have some moderately-sized data to analyze: about 4Gb.
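A minimal sketch of how data like this might be generated (my own toy version with made-up sizes and names, not Arthur's code):

```r
# Simulate n random walks of len steps each in the 2-D plane
set.seed(42)
n   <- 1000   # Arthur used 2 million trajectories; 1000 keeps the sketch fast
len <- 80     # 80 points per trajectory

walks <- do.call(rbind, lapply(seq_len(n), function(id) {
  data.frame(id = id,
             step = seq_len(len),
             x = cumsum(rnorm(len)),   # each coordinate is a cumulative
             y = cumsum(rnorm(len)))   # sum of Gaussian steps
}))

# Trajectories whose final point lands inside a circle of radius r at (0, 0)
finals <- walks[walks$step == len, ]
r <- 5
hits <- finals$id[finals$x^2 + finals$y^2 < r^2]
length(hits)  # how many of the n walks end in the target region
```

Scaled up to 2 million trajectories, a table built this way is exactly the kind of data the comparison below operates on.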

There are several ways to tackle data of this size with R: you can use an ordinary data.frame object (provided you have sufficient RAM to hold it in memory) and standard R functions to select the corresponding records; you can use functions in the dplyr package to filter the data; or you can use the data.table package and its operations to select the appropriate data. Arthur tried all three methods, with the following results:

- Using ordinary data.frame operations, it took about a minute to extract the necessary data. Even then, Arthur had some challenges with out of memory errors when trying to create temporary columns in the data (which swelled its size to over 6 Gb).
- Using the dplyr package, Arthur read in the data as a data_frame object and filtered the data using dplyr's group_by, summarise, and left_join operations. This process took about two minutes.
- Using the data.table package and using its built-in selection syntax and merge operator, the process took around 10 seconds.

Note that all of these techniques are in-memory operations. Arthur doesn't note the size of the system he was using, but it must have had at least 8Gb of RAM to accommodate the data. While dplyr's syntax is (for me) somewhat simpler to use, here data.table wins out on performance, thanks to its optimized operations and its ability to create new variables on the fly (without requiring additional RAM) with its := syntax. You can see the complete code used for the various methods at the link below.
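For readers unfamiliar with the data.table idioms mentioned above, here is a minimal sketch on toy data of my own (not Arthur's code) showing selection, in-place column creation with :=, and a keyed join:

```r
library(data.table)

# Toy version of the trajectory data: 3 walks of 4 steps each
dt <- data.table(id = rep(1:3, each = 4),
                 step = rep(1:4, times = 3),
                 x = rnorm(12), y = rnorm(12))

# data.table selection syntax: the final point of each trajectory
last_points <- dt[step == 4]

# := creates a column in place -- no copy of the table is made
dt[, dist := sqrt(x^2 + y^2)]

# A keyed join: all rows of dt whose id appears in last_points
setkey(dt, id)
merged <- dt[last_points[, .(id)], on = "id"]
nrow(last_points)  # 3
```

The in-place := assignment is the feature that avoided the extra 6 Gb of temporary columns Arthur ran into with data.frame.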

Freakonometrics: Working with "large" datasets, with dplyr and data.table

by Joseph Rickert

The gradient boosting machine, as developed by Friedman, Hastie, Tibshirani and others, has become an extremely successful algorithm for dealing with both classification and regression problems and is now an essential feature of any machine learning toolbox. R's gbm() function (gbm package) is a particularly well-crafted implementation of the gradient boosting machine that served as a model for the rxBTrees() function, which was released last year as a feature of Revolution R Enterprise 7.3. You can think of rxBTrees() as a scaled-up version of gbm() that is designed to work with massive data sets in distributed environments.

The basic idea underlying the gbm is that weak learners may be combined in an additive, iterative fashion to produce highly accurate classifiers that are resistant to overfitting. While in theory any weak learner can be incorporated into a gbm, classification and regression trees, or decision trees as they are called in RRE, are particularly convenient. As Kuhn and Johnson (p 205) point out:

- They have the flexibility to be weak learners by simply restricting their depth
- Separate trees can be easily added together
- Trees can be generated very quickly
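The additive, iterative idea is simple enough to sketch in a few lines of base R. The toy booster below repeatedly fits a regression stump to the current residuals and adds a shrunken copy of it to the ensemble; it illustrates the principle only, and is not how gbm() or rxBTrees() is actually implemented:

```r
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)

# A stump: the best single split on x, predicting the residual mean on each side
fit_stump <- function(x, r) {
  cuts <- quantile(x, probs = seq(0.05, 0.95, by = 0.05))
  sse  <- sapply(cuts, function(ct)
    sum((r - ifelse(x < ct, mean(r[x < ct]), mean(r[x >= ct])))^2))
  ct <- cuts[[which.min(sse)]]
  list(cut = ct, left = mean(r[x < ct]), right = mean(r[x >= ct]))
}
predict_stump <- function(s, x) ifelse(x < s$cut, s$left, s$right)

# Boosting: fit a stump to the residuals, add a shrunken copy to the ensemble
boost <- function(x, y, nTree = 100, learningRate = 0.1) {
  pred <- rep(mean(y), length(y))
  for (i in seq_len(nTree)) {
    s <- fit_stump(x, y - pred)              # weak learner on residuals
    pred <- pred + learningRate * predict_stump(s, x)
  }
  pred
}

pred <- boost(x, y)
c(var_y = var(y), mse = mean((y - pred)^2))  # the ensemble fits far better
```

The nTree and learningRate arguments mirror the parameters of the same names in the rxBTrees() call below.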

rxBTrees() scales gbm() by building the boosting algorithm around RRE's rxDTree() decision tree implementation. As I described in a previous post, this algorithm scales to work with large data sets by building trees on histograms, a summary of the data, rather than on the data itself. rxBTrees(), like rxDTree(), is also a proper parallel external memory algorithm (PEMA). This means two things:

- It uses a “chunking” strategy to work with one block of data at a time, read from disk as needed. Since there is no need to keep all of the data in memory at one time, rxBTrees() can build a gbm on arbitrarily large files.
- PEMA’s are parallel algorithms built on top of RRE’s distributed computing infrastructure. This allows them to take full advantage of the opportunities for parallelism implicit in the underlying hardware by setting up a “compute context”. In the example below, the compute context is set to local parallel to take advantage of the cores on my PC. RRE also allows establishing compute contexts that enable PEMAs to run in parallel on clusters.

The following function call fits an rxBTrees() model to a training subset of a mortgage data set. rxBTrees() contains many input parameters. Some are common to all RevoScaleR PEMAs, others are inherited from rxDTree(), and still others, such as nTree and learningRate, have direct counterparts in R's gbm() function. In this function call, I have explicitly called out parameters that pertain to fitting a gbm model.

# Fit a Boosted Trees Model on the training data set
form <- formula(default ~ houseAge + creditScore + yearsEmploy + ccDebt)
BT_model <- rxBTrees(formula = form,
                     data = "mdTrain",
                     lossFunction = "bernoulli",
                     minSplit = NULL,       # Min num of obs at node before a split is attempted
                     minBucket = NULL,      # Min num of obs at a terminal node
                     maxDepth = 1,          # 2^maxDepth = number of leaves on tree
                     cp = 0,                # Complexity parameter
                     maxSurrogate = 0,      # Max surrogate splits retained, 0 improves perf
                     useSurrogate = 2,      # Determines how surrogates are used
                     surrogateStyle = 0,    # Sets surrogate algorithm
                     nTree = 100,           # Num of trees = num of boosting iterations
                     mTry = NULL,           # Num vars to sample as split candidates
                     sampRate = NULL,       # % of observations to sample for each tree
                                            # default = 0.632
                     importance = TRUE,     # Importance of predictors to be assessed
                     learningRate = 0.1,    # Also known as shrinkage
                     maxNumBins = NULL,     # Max num bins in histograms
                                            # default: min(1001, max(sqrt(num of obs)))
                     verbose = 1,           # Amount of status sent to console
                     computeContext = "RxLocalParallel",  # Use all cores available on the PC
                     overwrite = TRUE)      # Allow xdf output file to be overwritten

Some especially noteworthy parameters are maxDepth, maxNumBins and computeContext. maxDepth accomplishes the same purpose as interaction.depth in the gbm() function. However, they are not the same parameter. Setting interaction.depth = 1 produces stump trees with two leaves each, because the number of leaves = interaction.depth + 1. In rxBTrees(), stumps are produced by setting maxDepth = 1, because 2^maxDepth gives the number of leaves.
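A quick sanity check of the two parameterizations:

```r
# gbm(): number of leaves = interaction.depth + 1
# rxBTrees(): number of leaves = 2^maxDepth
interaction.depth <- 1
maxDepth <- 1
c(gbm = interaction.depth + 1, rxBTrees = 2^maxDepth)  # both 2: a stump

# The parameterizations diverge for deeper trees
depth <- 1:4
rbind(gbm_leaves = depth + 1, rxBTrees_leaves = 2^depth)
```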

maxNumBins determines the number of bins used in making histograms of the data. If anyone ever wanted to compare gbm() directly with rxBTrees() on a small data set, for example, setting maxNumBins to the number of points in the data set would mimic gbm(). Note, however, that this would saddle rxBTrees() with all of the overhead of computing histograms while receiving none of the benefits.
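To get a feel for the histogram idea, the base R sketch below compresses a numeric predictor into a fixed number of bins, the kind of compact summary that a split search can run over instead of the raw data (an analogy only, not the rxDTree() code; the variable names are made up):

```r
set.seed(7)
creditScore <- rnorm(1e5, mean = 700, sd = 50)  # 100,000 simulated scores

maxNumBins <- 32
breaks <- seq(min(creditScore), max(creditScore), length.out = maxNumBins + 1)
bins <- cut(creditScore, breaks = breaks, include.lowest = TRUE)

# One count per bin instead of 100,000 raw values
histSummary <- table(bins)
length(histSummary)   # 32 candidate split regions, however large the data
```

However many observations there are, the split search only ever sees maxNumBins summaries, which is what makes the approach scale.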

The computeContext parameter is the mechanism that connects a PEMA to the underlying hardware and enables functions such as rxBTrees() to run in parallel on various clusters using distributed data. Some of the details of how this happens have been described in a previous post, but I hope to follow up this post with an example of rxBTrees() running on Hadoop in the not too distant future.

The data set used to build the model above is available on Revolution Analytics' data set web page, both as an .xdf file and as multiple .csv files. The following command shows what the data looks like.

> rxGetInfo(mortData,getVarInfo=TRUE,numRows=3)

File name: C:\DATA\Mortgage Data\mortDefault\mortDefault2.xdf
Number of observations: 1e+07
Number of variables: 6
Number of blocks: 20
Compression type: none
Variable information:
Var 1: creditScore, Type: integer, Low/High: (432, 955)
Var 2: houseAge, Type: integer, Low/High: (0, 40)
Var 3: yearsEmploy, Type: integer, Low/High: (0, 15)
Var 4: ccDebt, Type: integer, Low/High: (0, 15566)
Var 5: year, Type: integer, Low/High: (2000, 2009)
Var 6: default, Type: integer, Low/High: (0, 1)
Data (3 rows starting with row 1):
  creditScore houseAge yearsEmploy ccDebt year default
1         615       10           5   2818 2000       0
2         780       34           5   3575 2000       0
3         735       12           1   3184 2000       0

In building the model, I divided the mortgage default data randomly into training and test data sets. Then I fit an rxBTrees() model on the training data and used the test data to compute the ROC curve shown below.
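RevoScaleR has its own tools for splitting .xdf files, but the idea is the familiar random partition, which on an in-memory data frame might be sketched in base R as follows (toy data; the mdTrain name is only illustrative):

```r
set.seed(123)
n <- 1000
mortData <- data.frame(creditScore = rnorm(n, 700, 50),
                       default = rbinom(n, 1, 0.05))

# Randomly mark ~75% of rows for training, the rest for testing
inTrain <- runif(n) < 0.75
mdTrain <- mortData[inTrain, ]
mdTest  <- mortData[!inTrain, ]
c(train = nrow(mdTrain), test = nrow(mdTest))
```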

Along the way, the model also produced the out of bag error versus number of trees plot which shows that for this data set one could probably get by with only building 70 or 80 trees.

The code I used to build the model is available here: Download Post_model. If you do not have Revolution R Enterprise available to you at your company, you can try out this model for yourself in the AWS cloud by taking advantage of Revolution Analytics' 14 day free trial.

Take the first step towards building a gbm on big data.

by Joseph Rickert

Distcomp, a new R package available on GitHub from a group of Stanford researchers, has the potential to significantly advance the practice of collaborative computing with large data sets distributed over separate sites that may be unwilling to explicitly share data. The fundamental idea is to be able to rapidly set up a web service, based on Shiny and OpenCPU technology, that manages and performs a series of master/slave computations which require sharing only intermediate results. The particular target application for distcomp is any group of medical researchers who would like to fit a statistical model using the data from several data sets, but face daunting difficulties with data aggregation or are constrained by privacy concerns. Distcomp and its methodology, however, ought to be of interest to any organization with data spread across multiple heterogeneous database environments.

Setting up the distcomp environment requires some preliminary work and out-of-band communication among the collaborators. In the first step, the lead investigator uses a distcomp function to invoke a browser-based Shiny application to describe the location of her data set, the variables to be used in the computation, the model formula and other metadata necessary to describe the computation.

Next, the investigator invokes another distcomp function to move the metadata and a copy of the local data set to a computation server with a unique identifier. Once the master server is in place, collaborating investigators at remote locations perform a similar process to set up slave computation servers at their sites. When the lead investigator receives the URLs pointing to the slave servers, she is ready to kick off the computation.

All of the details of this setup process are described in this paper by Narasimhan et al. The paper also describes two non-trivial computations: a distributed rank-k singular value decomposition and a distributed, stratified Cox model, both of interest in their own right. The algorithm and code for the stratified Cox model ought to be useful to data scientists in a number of fields working on time-to-event models. A really nice feature of the algorithm is that it only requires each site to independently optimize the partial likelihood function using its local data. The master process uses the partial likelihood information from all of the sites to compute a final estimate of the coefficients and their variances.
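To make the flavor of site-local fitting concrete, here is a toy sketch using the survival package: each "site" fits a Cox model on its local data only, and a master step combines the local estimates. The inverse-variance weighted average below is a simple meta-analysis stand-in of my own, not the distcomp algorithm, which instead shares partial likelihood information:

```r
library(survival)

# Split the built-in lung data into two artificial "sites"
set.seed(11)
site <- sample(1:2, nrow(lung), replace = TRUE)

# Each site independently fits a Cox model using only its local data
fits <- lapply(1:2, function(s)
  coxph(Surv(time, status) ~ age, data = lung[site == s, ]))

# Master step: combine the local coefficient estimates
betas <- sapply(fits, coef)
vars  <- sapply(fits, function(f) vcov(f)[1, 1])
beta_combined <- sum(betas / vars) / sum(1 / vars)
beta_combined
```

The key privacy property is visible even in this sketch: only the fitted coefficients and variances leave each site, never the raw data.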

There are several nice aspects to this work:

- It builds on the cumulative work of the R community to provide a big league, big data application around open source R.
- It provides a flexible paradigm for implementing distributed / parallel applications that leverages existing R algorithms (e.g. the Cox model makes use of code in the survival package)
- It illustrates the ease with which R projects can be deployed in web services applications with Shiny and other R centric software such as DeployR
- It provides an alternative to building out infrastructure and aggregating data before realizing the benefits of a big data computation. (Prototyping calculations with distcomp might also serve to justify the expense and effort of developing centralized infrastructure.)
- It recognizes that privacy and other social concerns are important in big data applications and provides a model for respecting some of the social requirements for dealing with sensitive data.

Distcomp is new work and the developers acknowledge several limitations. (So far, they have only built out two algorithms and they don’t have a way to easily deal with factor data across the distributed data sets.) Nevertheless, the project appears to show great promise.

I spent last week at the Strata 2015 Conference in San José, California. As always, Strata made for a wonderful conference to catch up on the latest developments on big data and data science, and to connect with colleagues and friends old and new. Having been to every Strata conference since the first in XXXX, it's been interesting to see the focus change over the years. While past conferences have focused on big data and data science software (and to be sure, Hadoop, Spark, Python and R all got plenty of mentions this year), the focus has shifted to more of the applications and impacts of data science.

If you couldn't attend yourself, many of the keynote presentations are now available online. Follow the links below to watch a few of my favourites:

President Barack Obama introduced DJ Patil, the US Government's new Chief Data Scientist (and even cracked a half-decent stats joke). DJ reviewed the advances in data science over the past four years, with a focus on the rise of open data and current and future open government initiatives.

Solomon Hsiang gave an inspirational presentation on using statistical analysis to quantify the influence of climate change on conflict. This research was also the topic of a recent New York Times op-ed. The meta-analysis was conducted with R, and you can find the replication data and scripts here.

Eden Medina shared some lessons learned from a fascinating episode in computer history, when the Chilean government created Project Cybersyn in 1971 to create what we'd today call an economic dashboard, using only an obsolete mainframe and a "network" of Telex machines.

Joseph Sirosh described an interesting (and surprising) data science application in dairy farming: using pedometers on cows to detect when they are in heat, and even to influence the sex of their offspring.

Jeffrey Heer showed some examples of good and not-so-good data visualizations, and how he applied recent research in visual perception to the visualization tools in Trifacta.

Alistair Croll made some thought-provoking predictions about our future technological lives, including that digital agents may one day become the start of a new species.

That's just a sampling of the many keynotes from the conference. You can watch many of the others at the link below.

Strata + Hadoop World: Feb 17-20 2015, San Jose CA

by Joseph Rickert

In the last week of January, HP Labs in Palo Alto hosted a workshop on distributed computing in R that was organized by Indrajit Roy (Principal Researcher, HP) and Michael Lawrence (Genentech and R-core member). The goal was to bring together a small group of R developers with significant experience in parallel and distributed computing to discuss the feasibility of designing a standardized R API for distributed computing. The workshop (which would have been noteworthy for no other reason than that five R-core members attended) took the necessary, level-setting first step towards progress by getting this influential group to review most of the R distributed computing schemes that are currently in play.

The technical presentations were organized in three sections: (1) Experience with MPI-like backends, (2) Beyond embarrassingly parallel computations, and (3) Embrace disks, thread parallelism, and more. They spanned an impressive range of topics, from speeding up computations with parallel C++ code on garden-variety hardware to writing parallel R code that will run on different supercomputer architectures.

Luke Tierney, who has probably been doing R parallel computing longer than anyone, led off the technical presentations by providing some historical context and describing a number of the trade-offs between “Explicit” parallelization (like snow and parallel) and the “Implicit” parallelization of vectorized operations (like pnmath). The last slide in Luke’s presentation is a short list of what might be useful additions to R.

The next three presentations in the first section described the use of MPI and snow-like parallelism in high performance computing applications. Martin Morgan’s presentation explains and critiques the use of the BioConductor BiocParallel package in processing high throughput genomic data. Junji Nakano and George Ostrouchov both describe supercomputer applications. Junji’s talk shows examples of using the functions in the Rhpc package and presents some NUMA distance measurements. George's talk, an overview of high performance computing with R and the pbdR package, also provides historical context before working through some examples. The following slide depicts his super-condensed take on parallel computing.

All of these first session talks deal with some pretty low level details of parallel computing. Dirk Eddelbuettel’s talk from the third session continues in this vein, discussing thread management issues in the context of C++ and Rcpp. During the course of his talk Dirk compares thread management using “raw” threads, OpenMP and Intel’s TBB library.

The remaining talks delivered at the workshop dealt mostly with different approaches to implementing parallel external memory algorithms. These algorithms apply to applications where datasets are too large to be processed in memory all at one time. Parallelism comes into play either to improve performance or because the algorithms must be implemented across distributed systems such as Hadoop.

Michael Kane’s talk provides an overview of bigmemory, one of the first “chunk” computing schemes to be implemented in R. He describes the role played by mmap, shows how later packages have built on it and offers some insight into its design and features that continue to make it useful.
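To make the "chunk" computing idea concrete, here is a minimal sketch of how bigmemory's file-backed matrices are typically used. The package must be installed, and the file names and chunk size are illustrative, not taken from Michael's talk.

```r
# Illustrative sketch of bigmemory's file-backed "chunk" computing.
# Assumes the bigmemory package is installed; file names are hypothetical.
library(bigmemory)

# Create a file-backed big.matrix: the data live on disk via mmap,
# so the object can exceed available RAM and be shared across processes.
x <- filebacked.big.matrix(nrow = 1e6, ncol = 3,
                           backingfile = "demo.bin",
                           descriptorfile = "demo.desc")

# Fill the matrix one chunk of rows at a time, never materializing
# the whole object in memory.
chunkSize <- 1e5
for (start in seq(1, nrow(x), by = chunkSize)) {
  idx <- start:min(start + chunkSize - 1, nrow(x))
  x[idx, ] <- rnorm(length(idx) * ncol(x))
}

# Another R process could attach the same data via the descriptor file:
# y <- attach.big.matrix("demo.desc")
colMeans(x[1:1000, ])   # extracted chunks behave like ordinary R matrices
```

The mmap-based descriptor file is what lets several R processes work on disjoint chunks of the same matrix concurrently, which is the design insight later packages built on.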

RHIPE was probably the first practical system for using R with Hadoop. Saptarshi Guha’s talk presents several programming paradigms popular with RHIPE users. Ryan Hafen’s talk describes how MapReduce supports the divide and recombine paradigm underlying the Tessera project and shows how all of this builds on the RHIPE infrastructure. He outlines the architecture of distributed data objects and distributed data frames and provides examples of functions from the datadr package.

Mario Inchiosa’s presentation describes RevoPemaR, an experimental R package for writing parallel external memory algorithms based on R’s system of Reference Classes. The class methods initialize(), processData(), updateResults() and processResults() provide a standardized template for developing external memory algorithms. Although RevoPemaR has been released under the Apache 2.0 license, it is currently only available with Revolution R Enterprise and depends on the RevoScaleR package for chunk processing and distributed computing.
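The four-method template can be illustrated with plain base R Reference Classes. This is a schematic sketch of the pattern, not the actual RevoPemaR API: a chunk-wise mean, where each worker accumulates intermediate results that can later be combined.

```r
# Schematic sketch (NOT the RevoPemaR API) of the four-method template,
# using base R Reference Classes to compute a mean over chunks of data.
ChunkMean <- setRefClass("ChunkMean",
  fields = list(total = "numeric", n = "numeric"),
  methods = list(
    initialize = function(...) {
      # Set intermediate results to a clean starting state.
      total <<- 0; n <<- 0
      callSuper(...)
    },
    processData = function(chunk) {
      # Called once per chunk; only accumulates intermediate results.
      total <<- total + sum(chunk)
      n <<- n + length(chunk)
    },
    updateResults = function(other) {
      # Merge intermediate results from another (possibly remote) worker.
      total <<- total + other$total
      n <<- n + other$n
    },
    processResults = function() {
      # Turn the accumulated intermediates into the final answer.
      total / n
    }
  )
)

# Simulate two workers, each seeing different chunks, then combine.
w1 <- ChunkMean$new(); w1$processData(1:5)
w2 <- ChunkMean$new(); w2$processData(6:10)
w1$updateResults(w2)
w1$processResults()   # mean of 1:10, i.e. 5.5
```

Because processData() and updateResults() only touch small intermediate state, the same algorithm runs unchanged whether the chunks come from one local file or from nodes of a cluster.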

Indrajit Roy’s presentation describes how the distributed data structures in the distributedR package extend existing R data structures and enable R programmers to write code capable of running on any number of underlying parallel and distributed hardware platforms. They can do this without having to worry about the location of the data or about performing task and data scheduling operations.

Simon Urbanek’s presentation covers two relatively new projects that grew out of a need to work with very large data sets: iotools, a collection of highly efficient, chunk-wise functions for processing I/O streams, and ROctopus, a way to use R containers as in-memory compute elements. iotools achieves parallelism through a split, compute and combine strategy, with all jobs going through at least these three stages. One notable advantage is that iotools functions achieve this efficiency using native R syntax. Here is the sample code for aggregating point locations by ZIP code:
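The original code appeared in the slides as an image; the sketch below is an illustrative reconstruction in the same split, compute and combine spirit, using iotools' chunk-wise functions. The input file name and column layout (ZIP code in column 1) are assumptions, not Simon's actual example.

```r
# Illustrative iotools sketch: count point locations per ZIP code,
# processing the file in large raw chunks. File name and column
# layout are hypothetical.
library(iotools)

perChunk <- chunk.apply("points.csv",
  function(chunk) {
    m <- mstrsplit(chunk, sep = ",")        # parse raw chunk into a matrix
    tab <- table(m[, 1])                    # points per ZIP in this chunk
    data.frame(zip = names(tab), n = as.integer(tab))
  },
  CH.MERGE = rbind)                         # combine: stack per-chunk results

# Final combine step: sum the per-chunk counts for each ZIP code.
aggregate(n ~ zip, data = perChunk, sum)
```

Each chunk is handed to the worker function as raw bytes, so the split stage never has to parse the whole file at once; only the small per-chunk tables are kept and merged.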

ROctopus is still at the experimental stage. When completed, it will allow running sophisticated algorithms such as GLMs and the LASSO on R objects containing massive data sets, with no movement of data and no need for data conversion after loading.

Simon finishes up with lessons learned from both projects, which should prove influential in guiding future work.

Michael Sannella takes a slightly different approach from the others in his presentation. After an exceptionally quick introduction to Hadoop and Spark, he examines some features of the SparkR interface with regard to the impact they would have on any distributed R system, and makes some concrete suggestions. The issues Michael identifies include:

- Hiding vs exposing distributed operations
- Sending auxiliary data to workers
- Loading packages on workers
- Developing/testing code on distributed R processes

The various presentations make it clear that there are at least three kinds of parallel / distributed computing problems that need to be addressed for R:

- Massive, high-performance computing that may involve little or no data
- Single machine or small cluster parallelism where the major problem is that the data are too large to fit into the memory of a single machine. Here, parallel code is a kind of bonus: as long as the data are going to be broken up into chunks, it is natural to think about processing the chunks in parallel. Chunk-wise computing, however, can work perfectly well without parallelism on relatively small data.
- Distributed computing on massive data sets that are distributed across a particular underlying architecture, such as Hadoop. Here parallelism is inherent in the structure of the platform, and the kinds of parallel computing that can be done may be constrained by the architecture.

The next step in this ambitious project is to undertake the difficult work of evaluation, selection and synthesis. It is hoped that, at the very least, the work will lead to a consensus on the requirements for distributed computing. The working group expects to produce a white paper in six months or so that will be suitable for circulation. Stay tuned, and please feel free to provide constructive feedback.

by Joseph Rickert

Apache Spark, the open-source cluster computing framework originally developed in the AMPLab at UC Berkeley and now championed by Databricks, is rapidly moving from the bleeding edge of data science to the mainstream. Interest in Spark, demand for training and overall hype are on a trajectory to match the frenzy surrounding Hadoop in recent years. Next month's Strata + Hadoop World conference, for example, will offer three serious Spark training sessions: Apache Spark Advanced Training, SparkCamp and Spark developer certification, with additional Spark-related talks on the schedule. It is only a matter of time before Spark becomes a big deal in the R world as well.

If you don't know much about Spark but want to learn more, a good place to start is the recently posted video of Reza Zadeh's keynote talk at the ACM Data Science Camp held last October at eBay in San Jose.

Reza is a gifted speaker, an expert on the subject matter, and adept at selecting and articulating the key points that can carry an audience towards comprehension. He starts slowly, beginning with the block diagram of the Spark architecture, and spends some time emphasizing RDDs (Resilient Distributed Datasets) as the key feature that enables Spark's impressive performance and defines and circumscribes its capabilities.

After the preliminaries, Reza takes the audience on a deep dive into three algorithms in Spark's machine learning library MLlib: logistic regression via gradient descent, PageRank and singular value decomposition. He then moves on to discuss some of the new features in Spark release 1.2.0, including All Pairs Similarity.

Reza's discussion of Spark's SVD implementation is a gem of a tutorial on computational linear algebra. The SVD algorithm considers two cases: the "Tall and Skinny" situation, where there are fewer than about 1,000 columns, and the "roughly square" case, where the numbers of rows and columns are about the same. I found it comforting to learn that the code for the latter case is based on highly reliable and "immensely optimized" Fortran77 code. (Some computational problems get solved and stay solved.)

Reza's discussion of the All Pairs Similarity, based on the DIMSUM (Dimension Independent Matrix Square Using MapReduce) algorithm and a non-intuitive sampling procedure where frequently occurring pairs are sampled less often, is also illuminating.

To get some hands-on experience with Spark, your next step might be to watch the three-hour Databricks video: Intro to Apache Spark Training - Part 1.

From here, the next obvious question is: "How do I use Spark with R?" Spark itself is written in Scala and has bindings for Java, Python and R. Searching for a Spark demo online, however, will most likely turn up either a Scala or a Python example. sparkR, the open-source project to produce an R binding, is not as far along as the bindings for the other languages. Indeed, a Cloudera web page refers to SparkR as "promising work". The SparkR GitHub page shows it to be a moderately active project with 410 commits to date from 15 contributors.

In SparkR Enabling Interactive Data Science at Scale, Zongheng Yang (only a 3rd year Berkeley undergraduate when he delivered this talk last July) lucidly works through a word count demo and a live presentation using sparkR with RStudio and a number of R packages and functions. Here is the code for his word count example.

**SparkR Word count Example**
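The code appeared in the original post as an image; below is a sketch of the classic word count in the early (AMPLab) SparkR API, reconstructed from that era's documentation. The input path is a placeholder assumption.

```r
# Classic word count in the early (AMPLab) SparkR API; the input
# path is hypothetical. Requires a Spark installation and the
# SparkR package.
library(SparkR)

sc <- sparkR.init(master = "local")            # connect to a local Spark instance

lines <- textFile(sc, "hdfs://my-text-file.txt")  # RDD with one element per line

# Split lines into words, emit (word, 1) pairs, and sum counts by key.
words <- flatMap(lines, function(line) strsplit(line, " ")[[1]])
wordCount <- lapply(words, function(word) list(word, 1L))
counts <- reduceByKey(wordCount, "+", 2L)

# Bring the (word, count) pairs back to the driver as a local list.
output <- collect(counts)
```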

Note the sparkR lapply() function, which is an alias for the Spark map and mapPartitions functions.

These are still early times for Spark and R. We would very much like to hear about your experiences with sparkR or any other effort to run R over Spark.