by Joseph Rickert

Last year, in a post on interesting R topics presented at the JSM, I described how data scientists in Google's human resources department were using R and predictive analytics to better understand the characteristics of its workforce. Google may very well have done the pioneering work, but predictive analytics for HR applications is going mainstream. In the still below from a Predictive Analytics Times video on *Data Science for Work Force Optimization*, Pasha Roberts, Chief Scientist at Talent Analytics, describes using survival analysis for modeling employee retention.

The video begins with a discussion of data analytics in industry, spends some time on three important curves for workforce analysis, presents some tips for talent modeling and ends with a case study on call center attrition. During the course of his presentation Pasha walks through all of the stages of a project from formulating a hypothesis, through model building and testing to model deployment.
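For readers who have not seen survival analysis applied to retention, here is a minimal base-R sketch of a Kaplan-Meier retention curve. It is not taken from the video: the tenure data are made up, and the function name `km_curve()` is my own. Employees who are still with the company at the end of the observation window are treated as censored.

```r
# Kaplan-Meier estimate of the probability of still being employed at time t.
# time:  observed tenure (months); event: 1 = employee left, 0 = still employed (censored)
km_curve <- function(time, event) {
  times <- sort(unique(time[event == 1]))       # distinct departure times
  surv <- cumprod(sapply(times, function(t) {
    at_risk <- sum(time >= t)                   # employees still observed just before t
    left    <- sum(time == t & event == 1)      # departures exactly at t
    1 - left / at_risk
  }))
  data.frame(time = times, survival = surv)
}

# Hypothetical tenure data for ten employees
tenure <- c(3, 5, 5, 8, 12, 12, 15, 20, 24, 24)
left   <- c(1, 1, 0, 1, 1,  0,  1,  0,  1,  0)

km <- km_curve(tenure, left)
print(km)
```

In practice the survival package that ships with R does this (and much more) through `survfit()`; the hand-rolled version is only meant to show what the curve measures.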

But Pasha covers more ground than model building alone. It appears that leading edge HR departments are moving towards predicting individual employee performance. The discussion of Aptitude Metrics about 45 minutes into the talk should be of interest to anyone working, or looking for work, at a technology company. Quantitative evaluation is likely to be a big part of our future. This video is well worth watching.

If you visit how-old.net and upload a photo of yourself, a machine learning algorithm (the 'How Old Robot') will identify your gender and tell you how old you look. Here's how it did on a photo of me:

That's actually a pretty good guess for my age ... although this particular picture was taken a little over 3 years ago. I tried several different pictures from different time periods and the "How Old" estimates varied by plus or minus five years or so. On average it wasn't too bad, though with perhaps a bit of a statistical bias towards older estimates.

I actually got a preview of this app a little while ago, when it was part of an internal test at Microsoft. Despite only being announced internally, it was extremely popular. So when it was unveiled to the public as part of Joseph Sirosh's keynote at Build, it went viral almost immediately. It was soon the #1 trending topic on Twitter, tweeted by Ellen DeGeneres, and people had a lot of fun trying it out on celebrities and the cast of Game of Thrones. The app itself uses the Face detection API from the Azure Machine Learning Gallery, which runs in the Azure cloud service. (There's code in that link if you want to try it out yourself.) Fortunately the Azure cloud was more than up to the task of handling all the traffic generated by all this social activity!

That's all for this week. Have a great weekend, and we'll see you back here on Monday!

by Sherri Rose

Assistant Professor of Health Care Policy

Harvard Medical School

Targeted learning methods build machine-learning-based estimators of parameters defined as features of the probability distribution of the data, while also providing influence-curve or bootstrap-based confidence intervals. The theory offers a general template for creating targeted maximum likelihood estimators for a data structure, nonparametric or semiparametric statistical model, and parameter mapping. These estimators of causal inference parameters are double robust and have a variety of other desirable statistical properties.

Targeted maximum likelihood estimation built on the loss-based “super learning” system such that lower-dimensional parameters could be targeted (e.g., a marginal causal effect); the remaining bias for the (low-dimensional) target feature of the probability distribution was removed. Targeted learning for effect estimation and causal inference allows for the complete integration of machine learning advances in prediction while providing statistical inference for the target parameter(s) of interest. Further details about these methods can be found in the many targeted learning papers as well as the 2011 targeted learning book.
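To make the targeting step concrete, here is a minimal base-R sketch of a targeted maximum likelihood estimator for a marginal causal effect (the average treatment effect) with a binary treatment and binary outcome. It is an illustration under strong simplifying assumptions, not the implementation in any package: the data are simulated, there is a single confounder `W`, and plain logistic regressions stand in for super learner fits of the outcome regression and the treatment mechanism.

```r
set.seed(1)
n <- 2000
W <- rnorm(n)                                   # a single confounder
A <- rbinom(n, 1, plogis(0.4 * W))              # binary treatment
Y <- rbinom(n, 1, plogis(-0.5 + A + 0.6 * W))   # binary outcome
dat <- data.frame(W = W, A = A, Y = Y)

# Step 1: initial estimate of the outcome regression Q(A, W) = E[Y | A, W]
Qfit <- glm(Y ~ A + W, family = binomial(), data = dat)
dat1 <- dat; dat1$A <- 1
dat0 <- dat; dat0$A <- 0
QA <- predict(Qfit, type = "response")
Q1 <- predict(Qfit, newdata = dat1, type = "response")
Q0 <- predict(Qfit, newdata = dat0, type = "response")

# Step 2: estimate of the treatment mechanism g(W) = P(A = 1 | W)
gfit <- glm(A ~ W, family = binomial(), data = dat)
g1 <- predict(gfit, type = "response")

# Step 3: targeting step, a one-parameter logistic fluctuation of the
# initial fit along the "clever covariate" H
dat$H  <- dat$A / g1 - (1 - dat$A) / (1 - g1)
dat$QA <- QA
fluct <- glm(Y ~ -1 + H + offset(qlogis(QA)), family = binomial(), data = dat)
eps <- unname(coef(fluct))

# Step 4: updated predictions under A = 1 and A = 0, then plug in
Q1s <- plogis(qlogis(Q1) + eps / g1)
Q0s <- plogis(qlogis(Q0) - eps / (1 - g1))
ate <- mean(Q1s - Q0s)

# Influence-curve-based standard error and 95% confidence interval
QAs <- ifelse(dat$A == 1, Q1s, Q0s)
ic  <- dat$H * (dat$Y - QAs) + (Q1s - Q0s) - ate
se  <- sd(ic) / sqrt(n)
ci  <- ate + c(-1.96, 1.96) * se

round(c(estimate = ate, lower = ci[1], upper = ci[2]), 3)
```

The fluctuation in step 3 is what removes the remaining bias for the target parameter; steps 1 and 2 are where machine learning predictions would be swapped in.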

Practical tools for the implementation of targeted learning methods for effect estimation and causal inference have developed alongside the theoretical and methodological advances. While some work has been done to develop computational tools for targeted learning in proprietary programming languages, such as SAS, the majority of the code has been built in R.

Of key importance are the two R packages SuperLearner and tmle. Ensembling with SuperLearner allows us to use many algorithms to generate an ideal prediction function that is a weighted average of all the algorithms considered. The SuperLearner package, authored by Eric Polley (NCI), is flexible, allowing for the integration of dozens of prespecified potential algorithms found in other packages as well as a system of wrappers that provide the user with the ability to design their own algorithms, or include newer algorithms not yet added to the package. The package returns multiple useful objects, including the cross-validated predicted values, final predicted values, vector of weights, and fitted objects for each of the included algorithms, among others.

Below is sample code with the ensembling prediction package SuperLearner using a small simulated data set.

    library(SuperLearner)

    ##Generate simulated data##
    set.seed(27)
    n <- 500
    data <- data.frame(W1 = runif(n, min = .5, max = 1),
                       W2 = runif(n, min = 0, max = 1),
                       W3 = runif(n, min = .25, max = .75),
                       W4 = runif(n, min = 0, max = 1))
    data <- transform(data, #add W5 dependent on W2, W3
                      W5 = rbinom(n, 1, 1/(1 + exp(1.5*W2 - W3))))
    data <- transform(data, #add Y dependent on W1, W2, W4, W5
                      Y = rbinom(n, 1, 1/(1 + exp(-(-.2*W5 - 2*W1 + 4*W5*W1 - 1.5*W2 + sin(W4))))))
    summary(data)

    ##Specify a library of algorithms##
    SL.library <- c("SL.nnet", "SL.glm", "SL.randomForest")

    ##Run the super learner to obtain predicted values for the super learner
    ##as well as CV risk for algorithms in the library##
    fit.data.SL <- SuperLearner(Y = data[,6], X = data[,1:5],
                                SL.library = SL.library,
                                family = binomial(), method = "method.NNLS",
                                verbose = TRUE)

    ##Run the cross-validated super learner to obtain its CV risk##
    fitSL.data.CV <- CV.SuperLearner(Y = data[,6], X = data[,1:5], V = 10,
                                     SL.library = SL.library, verbose = TRUE,
                                     method = "method.NNLS", family = binomial())

    ##Cross validated risks##
    mean((data[,6] - fitSL.data.CV$SL.predict)^2) #CV risk for super learner
    fit.data.SL #CV risks for algorithms in the library

The final lines of code return the cross-validated risks for the super learner as well as each algorithm considered within the super learner. While a trivial example with a small data set and few covariates, these results demonstrate that the super learner, which takes a weighted average of the algorithms in the library, has the smallest cross-validated risk and outperforms each individual algorithm.
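The weighted-average idea can also be sketched without the package. The base-R example below is a simplified stand-in, not what SuperLearner does internally: it uses only two hypothetical candidate learners (two logistic regressions), and it picks a single convex weight by grid search rather than by non-negative least squares. The data and all object names are assumptions made for illustration.

```r
set.seed(27)
n <- 500
d <- data.frame(W1 = runif(n), W2 = runif(n))
d$Y <- rbinom(n, 1, plogis(-1 + 2 * d$W1 - 1.5 * d$W2))

# Two deliberately simple candidate learners (stand-ins for a real library)
learners <- list(both = Y ~ W1 + W2, w1_only = Y ~ W1)

# V-fold cross-validated predictions for each learner
V <- 10
folds <- sample(rep(seq_len(V), length.out = n))
cv_pred <- sapply(learners, function(f) {
  p <- numeric(n)
  for (v in seq_len(V)) {
    fit <- glm(f, family = binomial(), data = d[folds != v, ])
    p[folds == v] <- predict(fit, newdata = d[folds == v, ], type = "response")
  }
  p
})

# Cross-validated risk (mean squared error) of each learner
cv_risk <- colMeans((d$Y - cv_pred)^2)

# Choose the weight alpha in [0, 1] that minimizes the risk of the
# weighted average alpha * learner 1 + (1 - alpha) * learner 2
alphas <- seq(0, 1, by = 0.01)
ens_risk <- sapply(alphas, function(a) {
  mean((d$Y - (a * cv_pred[, 1] + (1 - a) * cv_pred[, 2]))^2)
})
alpha <- alphas[which.min(ens_risk)]

round(c(cv_risk, ensemble = min(ens_risk), weight_on_first = alpha), 4)
```

Because weights of 0 and 1 are among the candidates, the weighted average can never do worse on these held-out predictions than the better single learner. That risk is computed on the same folds used to pick the weight, which is why the package example adds an outer layer of cross-validation with CV.SuperLearner.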

The tmle package, authored by Susan Gruber (Reagan-Udall Foundation), allows for the estimation of both average treatment effects and parameters defined by a marginal structural model in cross-sectional data with a binary intervention. This package also includes the ability to incorporate missingness in the outcome and the intervention, use SuperLearner to estimate the relevant components of the likelihood, and use data with a mediating variable. Additionally, TMLE and collaborative TMLE R code specifically tailored to answer quantitative trait loci mapping questions, such as those discussed in Wang et al 2011, is available in the supplementary material of that paper.

The multiPIM package, authored by Stephan Ritter (Omicia, Inc.), is designed specifically for variable importance analysis, and estimates an attributable-risk-type parameter using TMLE. This package also allows the use of SuperLearner to estimate nuisance parameters and produces additional estimates using estimating-equation-based estimators and g-computation. The package includes its own internal bootstrapping function to calculate standard errors if this is preferred over the use of influence curves, or influence curves are not valid for the chosen estimator.

Four additional prediction-focused packages are casecontrolSL, cvAUC, subsemble, and h2oEnsemble, all primarily authored by Erin LeDell (Berkeley). The casecontrolSL package relies on SuperLearner and performs subsampling in a case-control design with inverse-probability-of-censoring-weighting, which may be particularly useful in settings with rare outcomes. The cvAUC package is a tool kit to evaluate area under the ROC curve estimators when using cross-validation. The subsemble package was developed based on a new approach to ensembling that fits each algorithm on a subset of the data and combines these fits using cross-validation. This technique can be used in data sets of all sizes, but has been demonstrated to be particularly useful in smaller data sets. A new implementation of super learner can be found in the Java-based h2oEnsemble package, which was designed for big data. The package uses the H2O R interface to run super learning in R with a selection of prespecified algorithms.

Another TMLE package is ltmle, primarily authored by Joshua Schwab (Berkeley). This package mainly focuses on parameters in longitudinal data structures, including the treatment-specific mean outcome and parameters defined by a marginal structural model. The package returns estimates for TMLE, g-computation, and estimating-equation-based estimators.

*The text above is a modified excerpt from the chapter "Targeted Learning for Variable Importance" by Sherri Rose in the forthcoming Handbook of Big Data (2015) edited by Peter Buhlmann, Petros Drineas, Michael John Kane, and Mark Van Der Laan to be published by CRC Press.*

*by Bill Jacobs*

Revolution R Enterprise is the industry's first R-based analytics platform that supports a variety of parallel, grid and clustered systems such as Hadoop, Teradata database and Platform LSF Linux grids.

Last year, we enhanced Revolution R Enterprise (RRE) to support big data systems, with support for Hadoop. We continued expansion of RRE in 2014, adding support for Teradata EDWs in March, for Kerberos in May, and for MapR Hadoop in June.

We're continuing our commitment to big data analytics in R by releasing RRE Version 7.3. Available immediately, RRE V7.3 adds a number of new capabilities:

Algorithms:

- A new Stochastic Gradient Boosting algorithm, rxBTrees, creates boosted classification and regression trees. As with our Decision Forests algorithm (equivalent to Random Forest), trees are fitted to subsamples, but here they are boosted: the trees are added sequentially, and at each iteration the new regression tree is fitted to the current pseudo-residuals, further improving prediction accuracy.
- PMML export for Decision Forests including our new Stochastic Gradient Boosting algorithm.
- A production-tested API for writing custom parallelized algorithms in R with expanded support for EDWs and clustered systems like Hadoop.
- Performance and memory utilization improvements for our Decision Forest algorithm.
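To illustrate the boosting idea described in the first bullet, here is a minimal base-R sketch of stochastic gradient boosting with one-split regression trees (stumps) and squared-error loss. It is not rxBTrees and says nothing about how rxBTrees is implemented; the simulated data, the helper names `fit_stump()` and `predict_stump()`, and the tuning values are all assumptions chosen for illustration.

```r
# Fit a one-split regression tree (stump) to residuals r on a single predictor x
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2          # candidate split points
  sse <- sapply(cuts, function(cut) {
    left <- r[x <= cut]; right <- r[x > cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  best <- cuts[which.min(sse)]
  list(cut = best, left = mean(r[x <= best]), right = mean(r[x > best]))
}
predict_stump <- function(s, x) ifelse(x <= s$cut, s$left, s$right)

set.seed(42)
n <- 200
x <- runif(n, 0, 2 * pi)
y <- sin(x) + rnorm(n, sd = 0.2)

n_trees   <- 100    # number of boosting iterations
shrinkage <- 0.1    # learning rate
subsample <- 0.5    # fraction of rows used to fit each tree

pred <- rep(mean(y), n)          # start from the constant model
mse  <- numeric(n_trees)
for (m in seq_len(n_trees)) {
  resid <- y - pred                                  # pseudo-residuals for squared-error loss
  idx   <- sample(n, floor(subsample * n))           # the "stochastic" part: a random subsample
  stump <- fit_stump(x[idx], resid[idx])             # new tree fitted to current pseudo-residuals
  pred  <- pred + shrinkage * predict_stump(stump, x)  # add it to the ensemble
  mse[m] <- mean((y - pred)^2)
}

round(c(start = mean((y - mean(y))^2), after_10 = mse[10], after_100 = mse[n_trees]), 3)
```

Each tree corrects a little of what the trees before it got wrong, which is the sense in which the trees are "added sequentially".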

Updated and Improved Platform Support

- Simplified and automated Hadoop installation processes including improved Cloudera Manager Parcels for CDH4 and CDH5.
- Improvements in performance and memory utilization for Teradata EDWs.
- Support for YARN on MapR Hadoop with support for MapR version 4.0.1.
- Certification of the new Teradata 15.0 and Cloudera Hadoop CDH 5.1 platforms.

Deployment and Integration

- New RBroker Framework added to DeployR speeds integration to provide on-demand R analytics to JavaScript, Java, or .NET applications.
- Enriched output from data scoring by appending additional variables that simplify use of scored data by other applications.

Open Source:

- RRE now includes R version 3.1.1
- Revolution Analytics has released a version of DeployR as open source.

RRE V7.3 is available now. Existing users have been notified of the availability of the download and it’s recommended for all RRE users. If you’ve not received your letter, contact support or your sales team to get the information.

While it’s a “minor” revision, it’s an important one, especially for our big data users. For more information on RRE version 7.3, browse:

http://packages.revolutionanalytics.com/doc/7.3.0/README_RevoEnt_Windows_7.3.0.pdf

For more information on Revolution’s DeployR integration and deployment platform browse:

http://deployr.revolutionanalytics.com/

Have a look and tell us what you think in the comments.

by Joseph Rickert

Recently, I had the opportunity to present a webinar on R and Data Science. The challenge with attempting this sort of thing is to say something interesting that does justice to the subject while being suitable for an audience that may include both experienced R users and curious beginners. The approach I settled on had three parts. I decided to:

- show a few slides that indicate the status of R among data scientists
- offer some thoughts as to why R is such a popular and effective tool
- work through some code.

The "why" slides attempt to convey the great number of machine learning and statistical algorithms available in R, the visualization capabilities, the richness of the R programming language and its many tools for data manipulation. I tried to emphasize the great amount of effort that the R community continues to make in order to integrate R with other languages and computing platforms, and to scale R to handle massive data sets on Hadoop and other big data platforms.

The code examples presented in the webinar emphasize the machine learning algorithms organized in the caret package and the many tools available for working through the predictive modeling process such as functions for searching through the parameter space of a model, performing cross validation, comparing models etc. The code for the caret examples is available here.

Towards the end of the webinar I show the code for running a large Tweedie model with Revolution Analytics' rxGlm() function and I also show what it looks like to run an rxLogit() model directly on Hadoop.

Click on video to view the webinar, or go to the Revolution Analytics' website to download the webinar and a pdf of the slides. All of the code is available on my GitHub repository.

by Joseph Rickert

While preparing for the DataWeek R Bootcamp that I conducted this week I came across the following gem. This code, based directly on a Max Kuhn presentation of a couple years back, compares the efficacy of two machine learning models on a training data set.

    library(caret)        # train(), trainControl(), resamples(), twoClassSummary()
    library(doParallel)   # registerDoParallel(), getDoParWorkers()
    # trainData is assumed to be loaded already, with the outcome in trainData$Class

    #-----------------------------------------
    # SET UP THE PARAMETER SPACE SEARCH GRID
    ctrl <- trainControl(method = "repeatedcv",              # use repeated 10-fold cross validation
                         repeats = 5,                        # do 5 repetitions of 10-fold cv
                         summaryFunction = twoClassSummary,  # use AUC to pick the best model
                         classProbs = TRUE)

    # Note that the default search grid selects 3 values of each tuning parameter
    # (newer caret releases may expect these column names without the leading dot
    #  and an additional n.minobsinnode column)
    grid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),  # look at tree depths from 1 to 7
                        .n.trees = seq(10, 100, by = 5),         # let iterations go from 10 to 100
                        .shrinkage = c(0.01, 0.1))               # try 2 values of the learning rate parameter

    # BOOSTED TREE MODEL
    set.seed(1)
    names(trainData)
    trainX <- trainData[, 4:61]
    registerDoParallel(4)   # register a parallel backend for train
    getDoParWorkers()
    system.time(gbm.tune <- train(x = trainX, y = trainData$Class,
                                  method = "gbm",
                                  metric = "ROC",
                                  trControl = ctrl,
                                  tuneGrid = grid,
                                  verbose = FALSE))

    #---------------------------------
    # SUPPORT VECTOR MACHINE MODEL
    #
    set.seed(1)
    registerDoParallel(4, cores = 4)
    getDoParWorkers()
    system.time(
      svm.tune <- train(x = trainX,
                        y = trainData$Class,
                        method = "svmRadial",
                        tuneLength = 9,                   # 9 values of the cost function
                        preProc = c("center", "scale"),
                        metric = "ROC",
                        trControl = ctrl)                 # same as for gbm above
    )

    #-----------------------------------
    # COMPARE MODELS USING RESAMPLING
    # Having set the seed to 1 before running gbm.tune and svm.tune
    # we have generated paired samples for comparing models using resampling.
    #
    # The resamples function in caret collates the resampling results from the two models
    rValues <- resamples(list(svm = svm.tune, gbm = gbm.tune))
    rValues$values

    #---------------------------------------------
    # BOXPLOTS COMPARING RESULTS
    bwplot(rValues, metric = "ROC")   # boxplot

After setting up a grid to search the parameter space of a model, the train() function from the caret package is used to train a generalized boosted regression model (gbm) and a support vector machine (svm). Setting the seed produces paired samples and enables the two models to be compared using the resampling technique described in Hothorn et al., *"The design and analysis of benchmark experiments"*, Journal of Computational and Graphical Statistics (2005) vol 14 (3) pp 675-699.

The performance metric for the comparison is the area under the ROC curve. From examining the boxplots of the sampling distributions for the two models it is apparent that, in this case, the gbm has the advantage.
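To see why setting the same seed gives paired samples, here is a small base-R sketch that does not use caret at all. Two logistic regressions stand in for the gbm and the svm, the data are simulated, and the helper names `make_folds()`, `auc()` and `cv_auc()` are my own. Because both models are scored on identical folds, their per-fold AUC values can be compared as paired differences.

```r
set.seed(1)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(1.5 * d$x1 - d$x2))

# Assign rows to k folds; the same seed always yields the same assignment
make_folds <- function(n, k, seed) {
  set.seed(seed)
  sample(rep(seq_len(k), length.out = n))
}

# Area under the ROC curve via the rank (Mann-Whitney) formula
auc <- function(p, y) {
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(rank(p)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Held-out AUC for each fold
cv_auc <- function(formula, data, folds) {
  sapply(sort(unique(folds)), function(f) {
    fit <- glm(formula, family = binomial(), data = data[folds != f, ])
    p <- predict(fit, newdata = data[folds == f, ], type = "response")
    auc(p, data$y[folds == f])
  })
}

folds_a <- make_folds(n, 10, seed = 1)   # folds used for the first model
folds_b <- make_folds(n, 10, seed = 1)   # same seed, so the same folds
identical(folds_a, folds_b)

auc_full  <- cv_auc(y ~ x1 + x2, d, folds_a)
auc_small <- cv_auc(y ~ x2, d, folds_b)

diffs <- auc_full - auc_small            # paired, fold-by-fold differences
t.test(diffs)
```

The `resamples()` function in caret collates the per-resample results for you; the sketch only shows what "paired" means here.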

Also, notice that the call to registerDoParallel() permits parallel execution of the training algorithms. (The taskbar showed all four cores of my laptop maxed out at 100% utilization.)

I chose this example because I wanted to show programmers coming to R for the first time that the power and productivity of R comes not only from the large number of machine learning models implemented, but also from the tremendous amount of infrastructure that package authors have built up, making it relatively easy to do fairly sophisticated tasks in just a few lines of code.

All of the code for this example along with the rest of my code from the DataWeek R Bootcamp is available on GitHub.

by Joseph Rickert

The days are getting shorter here in California and the summer R conferences UseR!2014 and JSM are behind us, but there are still some very fine conferences for R users to look forward to before the year ends.

DataWeek starts in San Francisco on September 15th. I will be conducting a bootcamp for new R users, and on Wednesday the 17th Skylar Lyon and David Smith will talk about R in production during an R Case Studies Session.

The same week, halfway around the world, the EARL (Effective Applications of the R Language) Conference starts in London. Ben Goldacre, author of the two best sellers Bad Science and Bad Pharma, will be the keynote speaker. The technical program will include sessions from such R luminaries as Hadley Wickham, Patrick Burns, Matt Dowle, Andrie de Vries, Romain Francois, Tal Galili and more.

On October 5th through 9th, Predictive Analytics World - Healthcare will be held in Boston. Max Kuhn will be conducting a hands-on workshop on R for predictive modeling in Business and Healthcare.

One week later, October 15th, Strata and Hadoop World will kick off in New York City. The RStudio team will start the conference with an R Day. Hadley Wickham, Winston Chang, Garrett Grolemund, JJ Allaire and Yihui Xie will all be giving presentations. On the 16th, Amanda Cox, the force behind the Times' superb R-based graphics, will be giving a keynote address. Other R-related sessions include a presentation by Sunil Venkayala and Indrajit Roy on HP's Distributed R platform. Revolution Analytics will, once again, be sponsoring this conference. If you go, please drop by the Revolution Analytics booth.

That same week, PAZUR'14, the Polish Academic R User's Meeting, will be taking place. Workshops will be held on data visualization, the analysis of surveys, the exploration of geospatial data and much more. Revolution Analytics is pleased to be a sponsor here too.

On October 25th, the ACM will once again hold its very popular Data Science Bootcamp at eBay in San Jose. And, once again Revolution Analytics is proud to be a sponsor. I will be attending this event, so please look me up if you want to chat about R.

There is sure to be some R content at the ICSM, International Conference on Statistics and Mathematics and the Workshop on Bayesian Modeling to be held on November 24th through 26th in Surabaya, Indonesia.

Also, sometime in November, rOpenSci will be "bringing together a mix of developers and academics to hack on/create tools in the open data space using and for R". A date hasn't been set yet, but we will let you know when things are finalized. Here is the link to the Spring hackathon that took place in San Francisco, which was pretty impressive.

If I have missed anything, please let us know!

by Don Boyd, Senior Fellow, Rockefeller Institute of Government

The Rockefeller Institute of Government is excited to be developing models to simulate the finances of public pension funds, using R.

Public pension funds invest contributions from governments and public sector workers in an effort to ensure that they can pay all promised benefits when due. State and local government pension funds in the United States currently have more than $3 trillion invested, more than $2 trillion of which is in equity-like investments. For example, NYC has over $158 billion invested. Governments usually act as a backstop: if pension fund investment returns do better than expected, governments will be able to contribute less, but if investment returns fall short they will have to contribute more. When that happens, politicians must raise taxes or cut spending programs. These risks often are not well understood or widely discussed. (For a discussion of many of the most significant issues, see *Strengthening the Security of Public Sector Defined Benefit Plans*.)

We are building stochastic simulation models in R to help quantify the investment risks and their potential consequences. We are modeling the finances of specific pension plans, taking into account all of the main flows such as current and expected benefit payouts to workers, contributions from governments and from workers, and investment returns, and how they affect liabilities and investible assets. The models will take into account the changing demographics of the workforce and retiree populations. We are modeling investment returns stochastically, examining different return scenarios and different economic environments, as well as different governmental contribution policies. We will use these models to evaluate the risks currently being taken and to help provide policy advice to governments, pension funds, and others. (For a full description of our approach, see *Modeling and Disclosing Public Pension Fund Risk, and Consequences for Pension Funding Security*.)
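As a toy illustration of what a stochastic simulation of this kind involves, here is a minimal base-R sketch. It is not the Institute's model: the function `simulate_fund()` and every number in it (starting assets, fixed annual contributions and benefits, normally distributed returns) are hypothetical, and it ignores demographics, liabilities, and contribution policy entirely.

```r
# Simulate asset paths for a stylized pension fund.
# All parameter values are hypothetical and chosen only for illustration.
simulate_fund <- function(n_sims = 1000, years = 30, assets0 = 100,
                          contrib = 4, benefits = 7,
                          mean_return = 0.075, sd_return = 0.12, seed = 123) {
  set.seed(seed)
  # one row per simulated scenario, one column per year
  returns <- matrix(rnorm(n_sims * years, mean_return, sd_return), nrow = n_sims)
  assets <- matrix(NA_real_, nrow = n_sims, ncol = years + 1)
  assets[, 1] <- assets0
  for (t in seq_len(years)) {
    # cash flows at the start of the year, then the balance earns that year's return;
    # assets cannot fall below zero
    balance <- pmax(0, assets[, t] + contrib - benefits)
    assets[, t + 1] <- balance * pmax(0, 1 + returns[, t])
  }
  assets
}

paths <- simulate_fund()
final <- paths[, ncol(paths)]

quantile(final, c(0.05, 0.5, 0.95))   # spread of outcomes after 30 years
mean(final == 0)                      # share of scenarios in which the fund runs dry
```

Working with one matrix row per scenario keeps the year loop vectorized across all simulations, which is the kind of matrix manipulation that makes R convenient for this work.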

We have chosen R because:

- It is extremely flexible, allowing us to do data collection, data management, exploratory data analysis, and other essential non-modeling tasks.
- Manipulating matrices is easy.
- It has sophisticated tools for modeling investment returns and for analyzing and presenting results of simulations. And it has great tools for visualizing results.
- The work can be completely open and reproducible, which is essential to the success of this project.

All programming languages have weaknesses. R’s great flexibility means that it is easy to write ill-organized programs that are hard to understand and debug. And poorly written programs that do not take advantage of R’s strengths can be extremely slow. We believe we can compensate for these weaknesses by making our programs modular, using a consistent programming style with appropriate documentation, and by using R features smartly and speed-testing where appropriate.
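A small example of the kind of speed-testing we have in mind: the two functions below compute the same cumulative growth path, one by growing a vector inside a loop and one with the vectorized `cumprod()`. The function names and the simulated returns are made up for illustration.

```r
set.seed(1)
returns <- rnorm(20000, mean = 0.0003, sd = 0.01)   # hypothetical daily returns

# Slow: grows the result one element at a time inside a loop
grow_loop <- function(r) {
  value <- c()
  current <- 1
  for (x in r) {
    current <- current * (1 + x)
    value <- c(value, current)      # copies the whole vector on every iteration
  }
  value
}

# Fast: the same calculation as a single vectorized call
grow_vec <- function(r) cumprod(1 + r)

t_loop <- system.time(v1 <- grow_loop(returns))
t_vec  <- system.time(v2 <- grow_vec(returns))

all.equal(v1, v2)                   # both versions give the same path
rbind(loop = t_loop, vectorized = t_vec)[, "elapsed"]
```

The exact timings depend on the machine, but the habit of checking that two versions agree and then timing them is what we mean by speed-testing where appropriate.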

R analysts and programmers interested in learning about the opportunity to work on this project should examine the programmer/analyst position description and related materials at the Rockefeller Institute’s web site.

by Joseph Rickert

If I had to pick just one application to be the “killer app” for the digital computer I would probably choose Agent Based Modeling (ABM). Imagine creating a world populated with hundreds, or even thousands of agents, interacting with each other and with the environment according to their own simple rules. What kinds of patterns and behaviors would emerge if you just let the simulation run? Could you guess a set of rules that would mimic some part of the real world? This dream is probably much older than the digital computer, but according to Jan Thiele’s brief account of the history of ABMs that begins his recent paper, *R Marries NetLogo: Introduction to the RNetLogo Package* in the *Journal of Statistical Software,* academic work with ABMs didn’t really take off until the late 1990s.

Now, people are using ABMs for serious studies in economics, sociology, ecology, socio-psychology, anthropology, marketing and many other fields. No less of a complexity scientist than Doyne Farmer (of Dynamic Systems and Prediction Company fame) has argued in *Nature* for using ABMs to model the complexity of the US economy, and has published on using ABMs to drive investment models. In the following clip of a 2006 interview, Doyne talks about building ABMs to explain the role of subprime mortgages on the Housing Crisis. (Note that when asked about how one would calibrate such a model Doyne explains the need to collect massive amounts of data on individuals.)

Fortunately, the tools for building ABMs seem to be keeping pace with the ambition of the modelers. There are now dozens of platforms for building ABMs, and it is somewhat surprising that NetLogo, a tool with some whimsical terminology (e.g. agents are called turtles) that was designed for teaching children, has apparently become a de facto standard. NetLogo is Java based, has an intuitive GUI, ships with dozens of useful sample models, is easy to program, and is available under the GPL 2 license.

As you might expect, R is a perfect complement for NetLogo. Doing serious simulation work requires a considerable amount of statistics for calibrating models, designing experiments, performing sensitivity analyses, reducing data, exploring the results of simulation runs and much more. The recent *JASSS* paper *Facilitating Parameter Estimation and Sensitivity Analysis of Agent-Based Models: a Cookbook Using NetLogo and R* by Thiele and his collaborators describes the R / NetLogo relationship in great detail and points to a decade’s worth of reading. But the real fun is that Thiele’s RNetLogo package lets you jump in and start analyzing NetLogo models in a matter of minutes.

Here is part of an extended example from Thiele's *JSS* paper that shows R interacting with the Fire model that ships with NetLogo. Using some very simple logic, Fire models the progress of a forest fire.

Snippet of NetLogo Code that drives the Fire model

    to go
      if not any? turtles  ;; either fires or embers
        [ stop ]
      ask fires
        [ ask neighbors4 with [pcolor = green]
            [ ignite ]
          set breed embers ]
      fade-embers
      tick
    end

    ;; creates the fire turtles
    to ignite  ;; patch procedure
      sprout-fires 1
        [ set color red ]
      set pcolor black
      set burned-trees burned-trees + 1
    end

The general idea is that turtles, representing the frontier of the fire, run through a grid of randomly placed trees. Not shown in the above snippet is the logic by which a single parameter, the density of the trees, controls the entire model.

This next bit of R code shows how to launch the Fire model from R, set the density parameter, and run the model.

    # Launch RNetLogo and control an initial run of the
    # NetLogo Fire Model
    library(RNetLogo)
    nlDir <- "C:/Program Files (x86)/NetLogo 5.0.5"
    setwd(nlDir)
    nl.path <- getwd()
    NLStart(nl.path)
    model.path <- file.path("models", "Sample Models", "Earth Science", "Fire.nlogo")
    NLLoadModel(file.path(nl.path, model.path))
    NLCommand("set density 70")   # set density value
    NLCommand("setup")            # call the setup routine
    NLCommand("go")               # launch the model from R

Here we see the Fire model running in the NetLogo GUI after it was launched from RStudio.

This next bit of code tracks the progression of the fire as a function of time (model "ticks"), returns results to R and plots them. The plot shows the non-linear behavior of the system.

    # Investigate percentage of forest burned as simulation proceeds and plot
    library(ggplot2)
    NLCommand("set density 60")
    NLCommand("setup")
    burned <- NLDoReportWhile("any? turtles", "go",
                              c("ticks", "(burned-trees / initial-trees) * 100"),
                              as.data.frame = TRUE,
                              df.col.names = c("tick", "percent.burned"))
    # Plot with ggplot2
    p <- ggplot(burned, aes(x = tick, y = percent.burned))
    p + geom_line() + ggtitle("Non-linear forest fire progression with density = 60")

As with many dynamical systems, the Fire model displays a phase transition. Setting the density lower than 55 will not result in the complete destruction of the forest, while setting density above 75 will very likely result in complete destruction. The following plot shows this behavior.

RNetLogo makes it very easy to programmatically run multiple simulations and capture the results for analysis in R. The following two lines of code run the Fire model twenty times for each value of density between 55 and 65, the region surrounding the phase transition.

    d <- seq(55, 65, 1)      # vector of densities to examine
    res <- rep.sim(d, 20)    # Run the simulation
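The two lines above rely on a `rep.sim()` helper that is not defined in the snippet. As a rough illustration of what such a helper does, here is a self-contained sketch that swaps NetLogo for a simplified pure-R version of the Fire rules: the fire starts along the left edge and spreads to green four-neighbors, with no wrap-around. The function names `simulate_fire()` and `rep_sim()`, the grid size, and the number of repetitions are my own assumptions; this is not code from Thiele's paper or from RNetLogo.

```r
# Simplified stand-in for the NetLogo Fire model (an assumption, not the NetLogo code):
# returns the percentage of trees burned for a given tree density (0-100).
simulate_fire <- function(density, size = 101) {
  tree <- matrix(runif(size^2) < density / 100, nrow = size)
  tree[, 1] <- FALSE                 # column 1 is the ignition line, not forest
  initial_trees <- sum(tree)
  fire <- matrix(FALSE, size, size)
  fire[, 1] <- TRUE                  # the fire starts along the left edge
  burned <- 0
  while (any(fire)) {
    # mark every cell that has a burning four-neighbor (no wrap-around)
    nb <- matrix(FALSE, size, size)
    nb[-1, ]    <- nb[-1, ]    | fire[-size, ]
    nb[-size, ] <- nb[-size, ] | fire[-1, ]
    nb[, -1]    <- nb[, -1]    | fire[, -size]
    nb[, -size] <- nb[, -size] | fire[, -1]
    fire <- nb & tree                # green neighbors ignite
    tree <- tree & !fire             # burning trees are no longer green
    burned <- burned + sum(fire)
  }
  100 * burned / initial_trees
}

# Run `reps` simulations at each density; returns one numeric vector per density
rep_sim <- function(densities, reps, sim = simulate_fire) {
  lapply(densities, function(d) replicate(reps, sim(d)))
}

set.seed(2014)
d <- seq(55, 65, 1)                  # densities around the phase transition
res <- rep_sim(d, 5)
names(res) <- d
round(sapply(res, mean), 1)          # mean percent burned at each density
```

With RNetLogo, the `sim` argument would instead be a function that sets the density, calls setup, runs the model to completion, and reports the percentage burned.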

The plot below shows the variability of the percent of trees burned as a function of density in the transition region.

My code to generate the plots is available in the file NelLogo_blog, while all of the code from Thiele's JSS paper is available from the journal website.

Finally, here are a few more interesting links related to ABMs.

- On validating ABMs
- ABMs and

InsideBigData has published a new *Guide to Machine Learning*, in collaboration with Revolution Analytics. As the name suggests, the Guide provides an overview of machine learning techniques, with a focus on implementation with the R language and (for big-data applications) Revolution R Enterprise. You can download the Guide here (email registration required), or for a quick overview of the contents check out the series of posts by Daniel Gutierrez on the topics covered in the Guide:

- About the insideBigData Guide to Machine Learning
- Introduction to Machine Learning
- R – the Data Scientist’s Choice and Data Access
- Data Munging, Exploratory Data Analysis, and Feature Engineering
- Supervised Machine Learning
- Unsupervised Machine Learning
- Production Deployment with R
- Production Deployment Environments for R

insideBigData: Guide to Machine Learning