Cortana Analytics Suite is Microsoft's cloud-based big data and advanced analytics suite. It includes a complete set of all the services need to build advanced analytics applications: from data ingestion and management, data warehousing, advanced analytics, data visualization and solution frameworks. You can use Cortana Analytics to build applications using R, by incorporating services including Data Factory, HDInsights Hadoop, and ML Studio.

If you'd like to spend some quality time with other developers and the Microsoft development team, there will be a first-ever Cortana Analytics Workshop at Microsoft HQ in Redmond, September 10-11. Or if you just want to check it out online, you can watch this introductory webinar, or join this upcoming application-focused webinar on September 1:

Leveraging Predictive Analytics for Sales and MarketingIn this session, learn how the Microsoft Global Marketing team built predictive analytics solutions to meet the marketing and sales business need for Microsoft subsidiaries and business groups. The session will describe how Cortana Analytics components (Azure Machine Learning, Azure Data Factory) are used to build e2e solution in Cloud.

Sign up for this free webinar here, or follow the link below to explore Cortana Analytics in depth.

Microsoft: Cortana Analytics Suite

The KDD Cup is an annual competition to build the best predictive model from a large data set. This years' contest tasked entrants to predict the likelihood of a student dropping out from one of XuetangX's massively-online open courses, based on the student's prior activities. The competition closed on July 12, and yesterday, the winning teams were announced. The winner was team "Intercontinental Ensemble" and the runner-up was "FEG&NSSOL@DataVeraci".

I couldn't find any details on what techniques were used — more will be revealed, I expect, at the KDD Conference in Sydney. But if you want to get a sense of what it's like to work with these data, take a look at this Data Until I Die blog post from a competitor who got close to the top of the leaderboard. He or she used a Gradient Boosting Model from the H20 R package, and found (amongst other things) that students who had completed prior courses were more likely to complete the next one.

If you'd like to play around with the data yourself, it's no longer available at the KDD Cup site, but it is available in an experiment in Azure ML Studio. If you haven't used Azure ML Studio before, it's free to get started and all you need is a modern web broswer (I used Chrome on a Mac). The screenshot below just shows the data munging steps, but later on in the flow a Python node is used to fit a predictive model. (This step-by-step tutorial on analyzing the KDD 2015 data walks you through the steps.) It's easy to add an R node as well, which gives you an R instance with 50 Gb of RAM and 8 cores to analyze the data.

For more details on using Azure ML Studio to analyze the KDD Cup data, check out the blog post below.

by Bill Jacobs, Director Technical Sales, Microsoft Advanced Analytics

In the course of working with our Hadoop users, we are often asked, what's the best way to integrate R with Hadoop?

The answer, in nearly all cases is, It depends.

Alternatives ranging from open source R on workstations, to parallelized commercial products like Revolution R Enterprise and many steps in between present themselves. Between these extremes, lie a range of options with unique abilities scale data, performance, capability and ease of use.

And so, the right choice or choices depends on your data size, budget, skill, patience and governance limitations.

In this post, I’ll summarize the alternatives using pure open source R and some of their advantages. In a subsequent post, I’ll describe the options for achieving even greater scale, speed, stability and ease of development by combining open source and commercial technologies.

These two posts are written to help current R users who are novices at Hadoop understand and select solutions to evaluate.

As with most thing open source, the first consideration is of course monetary. Isn’t it always? The good news is that there are multiple alternatives that are free, and additional capabilities under development in various open source projects.

We see generally 4 options for building R to Hadoop integration using entirely open source stacks.

This baseline approach’s greatest advantage is simplicity and cost. It’s free. End to end free. What else in life is?

Through packages Revolution contributed to open source including rhdfs and rhbase, R users can directly ingest data from both the hdfs file system and the hbase database subsystems in Hadoop. Both connectors are part of the RHadoop package created and maintained by Revolution and are a go-to choice.

Additional options exist as well. The RHive package executes Hive’s HQL SQL-like query language directly from R, and provides functions for retrieving metadata from Hive such as database names, table names, column names, etc.

The rhive package, in particular, has the advantage that its data operations some work to be pushed down into Hadoop, avoiding data movement and parallelizing operations for big speed increases. Similar “push-down” can be achieved with rhbase as well. However, neither are particularly rich environments, and invariably, complex analytical problems will reveal some gaps in capability.

Beyond the somewhat limited push-down capabilities, R’s best at working on modest data sampled from hdfs, hbase or hive, and in this way, current R users can get going with Hadoop quickly.

Once you tire of R’s memory barriers on your laptop the obvious next path is a shared server. With today’s technologies, you can equip a powerful server for only a few thousand dollars, and easily share it between a few users. Using Windows or Linux with 256GB, 512GB of RAM, R can be used to analyze files in to the hundreds of gigabytes, albeit not as fast as perhaps you’d like.

Like option 1, R on a shared server can also leverage push-down capabilities of the rhbase and rhive packages to achieve parallelism and avoid data movement. However, as with workstations, the pushdown capabilities of rhive and rhbase are limited.

And of course, while lots of RAM keeps the dread out of memory exhustion at bay, it does little for compute performance, and depends on sharing skills learned [or perhaps not learned] in kindergarten. For these reasons, consider a shared server to be a great add-on to R on workstations but not a complete substitute.

Replacing the CRAN download of R with the R distribution: Revolution R Open (RRO) enhances performance further. RRO is, like R itself, open source and 100% R and free for the download. It accelerates math computations using the Intel Math Kernel Libraries and is 100% compatible with the algorithms in CRAN and other repositories like BioConductor. No changes are required to R scripts, and the acceleration the MKL libraries offer varies from negligible to an order of magnitude for scripts making intensive use of certain math and linear algebra primitives. You can anticipate that RRO can double your average performance if you’re doing math operations in the language.

As with options 1 and 2, Revolution R Open can be used with connectors like rhdfs, and can connect and push work down into Hadoop through rhbase and rhive.

Once you find that your problem set is too big, or your patience is being taxed on a workstation or server and the limitations of rhbase and rhive push down are impeding progress, you’re ready for running R inside of Hadoop.

The open source RHadoop project that includes rhdfs, rhbase and plyrmr also includes a package rmr2 that enables R users to build Hadoop map and reduce operations using R functions. Using mappers, R functions are applied to all of the data blocks that compose an hdfs file, an hbase table or other data sets, and the results can be sent to a reducer, also an R function, for aggregation or analysis. All work is conducted inside of Hadoop but is built in R.

Let’s be clear. Applying R functions on each hdfs file segment is a great way to accelerate computation. But for most, it is the avoidance of moving data that really accentuates performance. To do this, rmr2 applies R functions to the data residing on Hadoop nodes rather than moving the data to where R resides.

While rmr2 gives essentially unlimited capabilities, as a data scientist or statistician, your thoughts will soon turn to computing entire algorithms in R on large data sets. To use rmr2 in this way complicates development, for the R programmer because he or she must write the entire logic of the desired algorithm or adapt existing CRAN algorithms. She or he must then validate that the algorithm is accurate and reflects the expected mathematical result, and write code for the myriad corner cases such as missing data.

rmr2 requires coding on your part to manage parallelization. This may be trivial for data transformation operations, aggregates, etc., or quite tedious if you’re trying to train predictive models or build classifiers on large data.

While rmr2 can be more tedious than other approaches, it is not untenable, and most R programmers will find rmr2 much easier than resorting to Java-based development of Hadoop mappers and reducers. While somewhat tedious, it is a) fully open source, b) helps to parallelize computation to address larger data sets, c) skips painful data movement, d) is broadly used so you’ll find help available, and e), is free. Not bad.

rmr2 is not the only option in this category – a similar package called rhipe is also and provides similar capabilities. rhipe is described here and here and is downloadable from GitHub.

The range of open source-based options for using R with Hadoop is expanding. The Apache Spark community, for example is rapidly improving R integration via the predictably named SparkR. Today, SparkR provides access to Spark from R much as rmr2 and rhipe do for Hadoop MapReduce do today.

We expect that, in the future, the SparkR team will add support for Spark’s MLLIB machine learning algorithm library, providing execution directly from R. Availability dates haven’t been widely published.

Perhaps the most exciting observation is that R has become “table stakes” for platform vendors. Our partners at Cloudera, Hortonworks, MapR and others, along with database vendors and others, are all keenly aware of the dominance of R among the large and growing data science community, and R’s importance as a means to extract insights and value from the burgeoning data repositories built atop Hadoop.

In a subsequent post, I’ll review the options for creating even greater performance, simplicity, portability and scale available to R users by expanding the scope from open source only solutions to those like Revolution R Enterprise for Hadoop.

by Joseph Rickert,

Because of its simplicity and good performance over a wide spectrum of classification problems the Naïve Bayes classifier ought to be on everyone's short list of machine learning algorithms. Now, with version 7.4 we have a high performance Naïve Bayes classifier in Revolution R Enterprise too. Like all Parallel External Memory Algorithms (PEMAs) in the RevoScaleR package, rxNaiveBayes is an inherently parallel algorithm that may be distributed across Microsoft HPC, Linux and Hadoop clusters and may be run on data in Teradata databases.

The following example shows how to get started with rxNaiveBayes() on a moderately sized data in your local environment. It uses the Mortgage data set which may be downloaded for the Revolution Analytics data set repository. The first block of code imports the .csv files for the years 2000 through 2008 and concatenates them into a single training file in the .XDF binary format. Then, the data for the year 2009 is imported to a test file that will be used for making predictions

#----------------------------------------------- # Set up the data location information bigDataDir <- "C:/Data/Mortgage" mortCsvDataName <- file.path(bigDataDir,"mortDefault") trainingDataFileName <- "mortDefaultTraining" mortCsv2009 <- paste(mortCsvDataName, "2009.csv", sep = "") targetDataFileName <- "mortDefault2009.xdf" #--------------------------------------- # Import the data from multiple .csv files into2 .XDF files # One file, the training file containing data from the years # 2000 through 2008. # The other file, the test file, containing data from the year 2009. defaultLevels <- as.character(c(0,1)) ageLevels <- as.character(c(0:40)) yearLevels <- as.character(c(2000:2009)) colInfo <- list(list(name = "default", type = "factor", levels = defaultLevels), list(name = "houseAge", type = "factor", levels = ageLevels), list(name = "year", type = "factor", levels = yearLevels)) append= FALSE for (i in 2000:2008) { importFile <- paste(mortCsvDataName, i, ".csv", sep = "") rxImport(inData = importFile, outFile = trainingDataFileName, colInfo = colInfo, append = append, overwrite=TRUE) append = TRUE }

The rxGetInfo() command shows that the training file has 9 million observation with 6 variables and the test file contains 1 million observations. The binary factor variable, default, which indicates whether or not an individual defaulted on the mortgage will be the target variable in the classification exercise.

rxGetInfo(trainingDataFileName, getVarInfo=TRUE) #File name: C:\Users\jrickert\Documents\Revolution\NaiveBayes\mortDefaultTraining.xdf #Number of observations: 9e+06 #Number of variables: 6 #Number of blocks: 18 #Compression type: zlib #Variable information: #Var 1: creditScore, Type: integer, Low/High: (432, 955) #Var 2: houseAge #41 factor levels: 0 1 2 3 4 ... 36 37 38 39 40 #Var 3: yearsEmploy, Type: integer, Low/High: (0, 15) #Var 4: ccDebt, Type: integer, Low/High: (0, 15566) #Var 5: year #10 factor levels: 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 #Var 6: default #2 factor levels: 0 1 rxImport(inData = mortCsv2009, outFile = targetDataFileName, colInfo = colInfo) rxGetInfo(targetDataFileName) #> rxGetInfo(targetDataFileName) #File name: C:\Users\jrickert\Documents\Revolution\NaiveBayes\mortDefault2009.xdf #Number of observations: 1e+06 #Number of variables: 6 #Number of blocks: 2 #Compression type: zlib

Next, the rxNaiveBayes() function is used to fit a classification model with default as the target variable and year, credit score, years employed and credit card debt as predictors. Note that the smoothingFactor parameter instructs the classifier to perform Laplace smoothing. (Since the conditional probabilities are being multiplied in the model, adding a small number to 0 probabilities, precludes missing categories from wiping out the calculation.) Also note that it took about 1.9 seconds to fit the model on my modest Lenovo Thinkpad which is powered by an Intel i7 5600U processor and equipped with 8GB of RAM.

# Build the classifier on the training data mortNB <- rxNaiveBayes(default ~ year + creditScore + yearsEmploy + ccDebt, data = trainingDataFileName, smoothingFactor = 1) #Rows Read: 500000, Total Rows Processed: 8500000, Total Chunk Time: 0.110 seconds #Rows Read: 500000, Total Rows Processed: 9000000, Total Chunk Time: 0.125 seconds #Computation time: 1.875 seconds.

Looking at the model object we see that conditional probabilities are calculated for all of the factor (categorical) variables and means and standard deviations are calculated for numeric variables. rxNaiveBayes() follows the standard practice of assuming that these variables follow Gaussian distributions.

#> mortNB # #Naive Bayes Classifier # #Call: #rxNaiveBayes(formula = default ~ year + creditScore + yearsEmploy + #ccDebt, data = trainingDataFileName, smoothingFactor = 1) # #A priori probabilities: #default #0 1 #0.997242889 0.002757111 # #Predictor types: #Variable Type #1 year factor #2 creditScore numeric #3 yearsEmploy numeric #4 ccDebt numeric # #Conditional probabilities: #$year #year #default 2000 2001 2002 2003 2004 #0 1.113034e-01 1.110692e-01 1.112866e-01 1.113183e-01 1.113589e-01 #1 4.157267e-02 1.262488e-01 4.765549e-02 3.617467e-02 2.151144e-02 #year #default 2005 2006 2007 2008 2009 #0 1.113663e-01 1.113403e-01 1.111888e-01 1.097681e-01 1.114182e-07 #1 1.885272e-02 2.823880e-02 8.302449e-02 5.966806e-01 4.028360e-05 # #$creditScore #Means StdDev #0 700.0839 50.00289 #1 686.5243 49.71074 # #$yearsEmploy #Means StdDev #0 5.006873 2.009446 #1 4.133030 1.969213 # #$ccDebt #Means StdDev #0 4991.582 1976.716 #1 9349.423 1459.797

Next, we use the rxPredict() function to predict default values for the test data set. Setting the type = "prob" parameter produced the table of probabilities below. Using the default for type would have produced only the default_Pred column of forecasts. In a multi-value forecast, the probability table would contain entries for all possible values.

# use the model to predict wheter a loan will default on the test data mortNBPred <- rxPredict(mortNB, data = targetDataFileName, type="prob") #Rows Read: 500000, Total Rows Processed: 500000, Total Chunk Time: 3.876 # secondsRows Read: 500000, Total Rows Processed: 1000000, Total Chunk Time: 2.280 seconds

names(mortNBPred) <- c("prob_0","prob_1") mortNBPred$default_Pred <- as.factor(round(mortNBPred$prob_1)) #head(mortNBPred) #prob_0 prob_1 default_Pred #1 0.9968860 0.003114038 0 #2 0.9569425 0.043057472 0 #3 0.5725627 0.427437291 0 #4 0.9989603 0.001039729 0 #5 0.7372746 0.262725382 0 #6 0.4142266 0.585773432 1

In this next step, we tabulate the actual vs. predicted values for the test data set to produce the "confusion matrix" and an estimate of the misclassification rate.

# Tabulate the actual and predicted values actual_value <- rxDataStep(targetDataFileName,maxRowsByCols=6000000)[["default"]] predicted_value <- mortNBPred[["default_Pred"]] results <- table(predicted_value,actual_value) #> results #actual_value #predicted_value 0 1 #0 877272 3792 #1 97987 20949 pctMisclassified <- sum(results[2,3])/sum(results)*100 pctMisclassified #[1] 10.1779

Since the results object produced above is an ordinary table we can use the confusionMatrix() from the caret package to produce additional performance measures.

# Use confusionMatrix from the caret package to look at the results library(caret) library(e1071) confusionMatrix(results,positive="1") #Confusion Matrix and Statistics # #actual_value #predicted_value 0 1 #0 877272 3792 #1 97987 20949 # #Accuracy : 0.8982 #95% CI : (0.8976, 0.8988) #No Information Rate : 0.9753 #P-Value [Acc > NIR] : 1 # #Kappa : NA #Mcnemar's Test P-Value : <2e-16 # #Sensitivity : 0.84673 #Specificity : 0.89953 #Pos Pred Value : 0.17614 #Neg Pred Value : 0.99570 #Prevalence : 0.02474 #Detection Rate : 0.02095 #Detection Prevalence : 0.11894 #Balanced Accuracy : 0.87313 # #'Positive' Class : 1

Finally, we use the rxhist() function to look at a histogram (not shown) of the actual values to get a feel for how unbalanced the data set is, and then use the rxRocCurve() function to produce the ROC Curve.

roc_data <- data.frame(mortNBPred$prob_1,as.integer(actual_value)-1) names(roc_data) <- c("predicted_value","actual_value") head(roc_data) hist(roc_data$actual_value) rxRocCurve("actual_value","predicted_value",roc_data,title="ROC Curve for Naive Bayes Mortgage Defaults Model")

Here we have a "picture-perfect" representation of how one hopes a classifier will perform.

For more on the Naïve Bayes classification algorithm have a look at these two papers referenced in the Wikipedia link above.

The first is a prescient, 1961 paper by Marvin Minskey that explicitly calls attention to the naïve, independence assumption. The second paper provides some theoretical arguments for why the overall excellent performance of the Naïve Bayes Classifier is not accidental.

by Joseph Rickert

Last year in a post on interesting R topics presented at the JSM I described how data scientists in Google's human resources department were using R and predictive analytics to better understand the characteristics of its workforce. Google may very well have done the pioneering work, but predictive analytics for HR applications is going mainstream. In the still below from a Predictive Analytics Times video on *Data Science for Work Force Optimization* Pasha Roberts, Chief Scientists at Talent Analytics, describes using survival analysis for modeling employee retention.

The video begins with a discussion of data analytics in industry, spends some time on three important curves for workforce analysis, presents some tips for talent modeling and ends with a case study on call center attrition. During the course of his presentation Pasha walks through all of the stages of a project from formulating a hypothesis, through model building and testing to model deployment.

But Pasha covers more ground than model building alone. It appears that leading edge HR departments are moving towards predicting individual employee performance. The discussion of Aptitude Metrics about 45 minutes into the talk should be of interest to anyone working, or looking for work at a technology company. Quantitative evaluation is likely to be a big part of our future. This video is well worth watching.

If you visit how-old.net and upload a photo of yourself, a maching learning algorithm (the 'How Old Robot') will indentify your gender and tell you how old you look. Here's how it did on a photo of me:

That's actually a pretty good guess for my age ... although this particular picture was taken a little over 3 years ago. I tried several different pictures from different time periods and the "How Old" estimates varied by plus or minus five years or so. On average it wasn't too bad, though with perhaps a bit of a statistical bias towards older estimates.

I actually got a preview of this app a little while ago, when it was part of an internal test at Microsoft. Despite only being announced internally, it was extremely popular. So when it was unveiled to the public as part of Joseph Sirosh's keynote at Build, it went viral almost immediately. It was soon the #1 trending topic on Twitter, tweeted by Ellen deGeneres, and people had a lot of fun trying it out on celebrities and the cast of Game of Thrones.The app itself uses the Face detection API from the Azure Machine Learning Gallery, which runs in the Azure cloud service. (There's code in that link if you want to try it out yourself.) Fortunately the Azure cloud was more than up to the task of handling all the traffic generated by all this social activity!

That's all for this week. Have a great weekend, and we'll see you back here on Monday!

by Sherri Rose

Assistant Professor of Health Care Policy

Harvard Medical School

Targeted learning methods build machine-learning-based estimators of parameters defined as features of the probability distribution of the data, while also providing influence-curve or bootstrap-based confidence internals. The theory offers a general template for creating targeted maximum likelihood estimators for a data structure, nonparametric or semiparametric statistical model, and parameter mapping. These estimators of causal inference parameters are double robust and have a variety of other desirable statistical properties.

Targeted maximum likelihood estimation built on the loss-based “super learning” system such that lower-dimensional parameters could be targeted (e.g., a marginal causal effect); the remaining bias for the (low-dimensional) target feature of the probability distribution was removed. Targeted learning for effect estimation and causal inference allows for the complete integration of machine learning advances in prediction while providing statistical inference for the target parameter(s) of interest. Further details about these methods can be found in the many targeted learning papers as well as the 2011 targeted learning book.

Practical tools for the implementation of targeted learning methods for effect estimation and causal inference have developed alongside the theoretical and methodological advances. While some work has been done to develop computational tools for targeted learning in proprietary programming languages, such as SAS, the majority of the code has been built in R.

Of key importance are the two R packages SuperLearner and tmle. Ensembling with SuperLearner allows us to use many algorithms to generate an ideal prediction function that is a weighted average of all the algorithms considered. The SuperLearner package, authored by Eric Polley (NCI), is flexible, allowing for the integration of dozens of prespecified potential algorithms found in other packages as well as a system of wrappers that provide the user with the ability to design their own algorithms, or include newer algorithms not yet added to the package. The package returns multiple useful objects, including the cross-validated predicted values, final predicted values, vector of weights, and fitted objects for each of the included algorithms, among others.

Below is sample code with the ensembling prediction package SuperLearner using a small simulated data set.

library(SuperLearner) ##Generate simulated data## set.seed(27) n<-500 data <- data.frame(W1=runif(n, min = .5, max = 1), W2=runif(n, min = 0, max = 1), W3=runif(n, min = .25, max = .75), W4=runif(n, min = 0, max = 1)) data <- transform(data, #add W5 dependent on W2, W3 W5=rbinom(n, 1, 1/(1+exp(1.5*W2-W3)))) data <- transform(data, #add Y dependent on W1, W2, W4, W5 Y=rbinom(n, 1,1/(1+exp(-(-.2*W5-2*W1+4*W5*W1-1.5*W2+sin(W4)))))) summary(data) ##Specify a library of algorithms## SL.library <- c("SL.nnet", "SL.glm", "SL.randomForest") ##Run the super learner to obtain predicted values for the super learner as well as CV risk for algorithms in the library## fit.data.SL<-SuperLearner(Y=data[,6],X=data[,1:5],SL.library=SL.library, family=binomial(),method="method.NNLS", verbose=TRUE) ##Run the cross-validated super learner to obtain its CV risk## fitSL.data.CV <- CV.SuperLearner(Y=data[,6],X=data[,1:5], V=10, SL.library=SL.library,verbose = TRUE, method = "method.NNLS", family = binomial()) ##Cross validated risks## mean((data[,6]-fitSL.data.CV$SL.predict)^2) #CV risk for super learner fit.data.SL #CV risks for algorithms in the library

The final lines of code return the cross-validated risks for the super learner as well as each algorithm considered within the super learner. While a trivial example with a small data set and few covariates, these results demonstrate that the super learner, which takes a weighted average of the algorithms in the library, has the smallest cross-validated risk and outperforms each individual algorithm.

The tmle package, authored by Susan Gruber (Reagan-Udall Foundation), allows for the estimation of both average treatment effects and parameters defined by a marginal structural model in cross-sectional data with a binary intervention. This package also includes the ability to incorporate missingness in the outcome and the intervention, use SuperLearner to estimate the relevant components of the likelihood, and use data with a mediating variable. Additionally, TMLE and collaborative TMLE R code specifically tailored to answer quantitative trait loci mapping questions, such as those discussed in Wang et al 2011, is available in the supplementary material of that paper.

The multiPIM package, authored by Stephan Ritter (Omicia, Inc.), is designed specifically for variable importance analysis, and estimates an attributable-risk-type parameter using TMLE. This package also allows the use of SuperLearner to estimate nuisance parameters and produces additional estimates using estimating-equation-based estimators and g-computation. The package includes its own internal bootstrapping function to calculate standard errors if this is preferred over the use of influence curves, or influence curves are not valid for the chosen estimator.

Four additional prediction-focused packages are casecontrolSL, cvAUC, subsemble, and h2oEnsemble, all primarily authored by Erin LeDell (Berkeley). The casecontrolSL package relies on SuperLearner and performs subsampling in a case-control design with inverse-probability-of-censoring-weighting, which may be particularly useful in settings with rare outcomes. The cvAUC package is a tool kit to evaluate area under the ROC curve estimators when using cross-validation. The subsemble package was developed based on a new approach to ensembling that fits each algorithm on a subset of the data and combines these fits using cross-validation. This technique can be used in data sets of all size, but has been demonstrated to be particularly useful in smaller data sets. A new implementation of super learner can be found in the Java-based h2oEnsemble package, which was designed for big data. The package uses the H2O R interface to run super learning in R with a selection of prespecified algorithms.

Another TMLE package is ltmle, primarily authored by Joshua Schwab (Berkeley). This package mainly focuses on parameters in longitudinal data structures, including the treatment-specific mean outcome and parameters defined by a marginal structural model. The package returns estimates for TMLE, g-computation, and estimating-equation-based estimators.

*The text above is a modified excerpt from the chapter "Targeted Learning for Variable Importance" by Sherri Rose in the forthcoming Handbook of Big Data (2015) edited by Peter Buhlmann, Petros Drineas, Michael John Kane, and Mark Van Der Laan to be published by CRC Press.*

*by Bill Jacobs*

Revolution R Enterprise is the industry's first R-based analytics platform that supports a variety of parallel, grid and clustered systems such as Hadoop, Teradata database and Platform LSF Linux grids.

Last year, we enhanced Revolution R Enterprise (RRE) to support big data systems, with support for Hadoop. We continued expansion of RRE in 2014, adding support for Teradata EDWs in March, for Kerberos in May, and for MapR Hadoop in June.

We're continuing our commitment to big data analytics in R by releasing RRE Version 7.3. Available immediately, RRE V7.3 adds a number of new capabilities:

Algorithms:

- A new Stochastic Gradient Boosting algorithm called rxBTrees, provides a machine learning algorithm that creates boosted classification and regression trees. Like our Decision Forests algorithm (equivalent to Random Forest) trees are fitted to subsamples and boosted, but added sequentially. At each iteration, the new regression trees are fitted to the current pseudo-residuals, further improving prediction accuracy.
- PMML export for Decision Forests including our new Stochastic Gradient Boosting algorithm.
- A production-tested API for writing custom parallelized algorithms in R with expanded support for EDWs and clustered systems like Hadoop.
- Performance and memory utilization improvements for our Decision Forest algorithm.

Updated and Improved Platform Support

- Simplified and automated Hadoop installation processes including improved Cloudera Manager Parcels for CDH4 and CDH5.
- Improvements in performance and memory utilization for Teradata EDWs.
- Support for YARN on MapR Hadoop with support for MapR version 4.0.1.
- Certification of the new Teradata 15.0 and Cloudera Hadoop CDH 5.1 platforms.

Deployment and Integration

- New RBroker Framework added to DeployR speeds integration to provide on-demand R analytics to JavaScript, Java, or .NET applications.
- Enriched output from data scoring by appending additional variables that simplify use of scored data by other applications.

Open Source:

- RRE now includes R version 3.1.1
- Revolution Analytics has released a version of DeployR as open source.

RRE V7.3 is available now. Existing users have been notified of the availability of the download and it’s recommended for all RRE users. If you’ve not received your letter, contact support or your sales team to get the information.

While it’s a “minor” revision, it’s an important one, especially for our big data users. For more information on RRE version 7.3, browse:

http://packages.revolutionanalytics.com/doc/7.3.0/README_RevoEnt_Windows_7.3.0.pdf

For more information on Revolution’s DeployR integration and deployment platform browse:

http://deployr.revolutionanalytics.com/

Have a look and tell us what you think in the comments.

by Joseph Rickert

Recently, I had the opportunity to present a webinar on R and Data Science. The challenge with attempting this sort of thing is to say something interesting that does justice to the subject while being suitable for an audience that may include both experienced R users and curious beginners. The approach I settled on had three parts. I decided to:

- show a few slides that indicate the status of R among data scientists
- offer some thoughts as to why R is such a popular and effective tool
- work through some code.

The "why" slides attempt to convey the great number of machine learning and statistical algorithms available in R, the visualization capabilities, the richness of the R programming language and its many tools for data manipulation. I tried to emphasize the great amount of effort that the R community continues to make in order to integrate R with other languages and computing platforms, and to scale R to handle massive data sets on Hadoop and other big data platforms.

The code examples presented in the webinar emphasize the machine learning algorithms oganized in the caret package and the many tools available for working through the predictive modeling process such as functions for searching through the parameter space of a model, performing cross validation, comparing models etc. The code for the caret examples is available here.

Towards the end of the webinar I show the code for running a large Tweedie model with Revolution Analytics rxGlm() function and I also show what it looks like to run an rxLogit() model directly on Hadoop.

Click on video to view the webinar, or go to the Revolution Analytics' website to download the webinar and a pdf of the slides. All of the code is available on my GitHub repository.

by Joseph Rickert

While preparing for the DataWeek R Bootcamp that I conducted this week I came across the following gem. This code, based directly on a Max Kuhn presentation of a couple years back, compares the efficacy of two machine learning models on a training data set.

#----------------------------------------- # SET UP THE PARAMETER SPACE SEARCH GRID ctrl <- trainControl(method="repeatedcv", # use repeated 10fold cross validation repeats=5, # do 5 repititions of 10-fold cv summaryFunction=twoClassSummary, # Use AUC to pick the best model classProbs=TRUE) # Note that the default search grid selects 3 values of each tuning parameter # grid <- expand.grid(.interaction.depth = seq(1,7,by=2), # look at tree depths from 1 to 7 .n.trees=seq(10,100,by=5), # let iterations go from 10 to 100 .shrinkage=c(0.01,0.1)) # Try 2 values of the learning rate parameter # BOOSTED TREE MODEL set.seed(1) names(trainData) trainX <-trainData[,4:61] registerDoParallel(4) # Registrer a parallel backend for train getDoParWorkers() system.time(gbm.tune <- train(x=trainX,y=trainData$Class, method = "gbm", metric = "ROC", trControl = ctrl, tuneGrid=grid, verbose=FALSE)) #--------------------------------- # SUPPORT VECTOR MACHINE MODEL # set.seed(1) registerDoParallel(4,cores=4) getDoParWorkers() system.time( svm.tune <- train(x=trainX, y= trainData$Class, method = "svmRadial", tuneLength = 9, # 9 values of the cost function preProc = c("center","scale"), metric="ROC", trControl=ctrl) # same as for gbm above ) #----------------------------------- # COMPARE MODELS USING RESAPMLING # Having set the seed to 1 before running gbm.tune and svm.tune we have generated paired samplesfor comparing models using resampling. # # The resamples function in caret collates the resampling results from the two models rValues <- resamples(list(svm=svm.tune,gbm=gbm.tune)) rValues$values #--------------------------------------------- # BOXPLOTS COMPARING RESULTS bwplot(rValues,metric="ROC") # boxplot

After setting up a grid to search the parameter space of a model, the train() function from the caret package is used used to train a generalized boosted regression model (gbm) and a support vector machine (svm). Setting the seed produces paired samples and enables the two models to be compared using the resampling technique described in Hothorn at al, *"The design and analysis of benchmark experiments"*, Journal of Computational and Graphical Statistics (2005) vol 14 (3) pp 675-699

The performance metric for the comparison is the ROC curve. From examing the boxplots of the sampling distributions for the two models it is apparent that, in this case, the gbm has the advantage.

Also, notice that the call to registerDoParallel() permits parallel execution of the training algorithms. (The taksbar showed all foru cores of my laptop maxed out at 100% utilization.)

I chose this example because I wanted to show programmers coming to R for the first time that the power and productivity of R comes not only from the large number of machine learning models implemented, but also from the tremendous amount of infrastructure that package authors have built up, making it relatively easy to do fairly sophisticated tasks in just a few lines of code.

All of the code for this example along with the rest of the my code from the Datweek R Bootcamp is available on GitHub.