by Joseph Rickert
If I had to pick just one application to be the “killer app” for the digital computer I would probably choose Agent Based Modeling (ABM). Imagine creating a world populated with hundreds, or even thousands of agents, interacting with each other and with the environment according to their own simple rules. What kinds of patterns and behaviors would emerge if you just let the simulation run? Could you guess a set of rules that would mimic some part of the real world? This dream is probably much older than the digital computer, but according to Jan Thiele’s brief account of the history of ABMs that begins his recent paper, R Marries NetLogo: Introduction to the RNetLogo Package in the Journal of Statistical Software, academic work with ABMs didn’t really take off until the late 1990s.
Now, people are using ABMs for serious studies in economics, sociology, ecology, sociopsychology, anthropology, marketing and many other fields. No less of a complexity scientist than Doyne Farmer (of Dynamic Systems and Prediction Company fame) has argued in Nature for using ABMs to model the complexity of the US economy, and has published on using ABMs to drive investment models. in the following clip of a 2006 interview, Doyne talks about building ABMs to explain the role of subprime mortgages on the Housing Crisis. (Note that when asked about how one would calibrate such a model Doyne explains the need to collect massive amounts of data on individuals.)
Fortunately, the tools for building ABMs seem to be keeping pace with the ambition of the modelers. There are now dozens of platforms for building ABMs, and it is somewhat surprising that NetLogo, a tool with some whimsical terminology (e.g. agents are called turtles) that was designed for teaching children, has apparently become a defacto standard. NetLogo is Java based, has an intuitive GUI, ships with dozens of useful sample models, is easy to program, and is available under the GPL 2 license.
As you might expect, R is a perfect complement for NetLogo. Doing serious simulation work requires a considerable amount of statistics for calibrating models, designing experiments, performing sensitivity analyses, reducing data, exploring the results of simulation runs and much more. The recent JASS paper Facilitating Parameter Estimation and Sensitivity Analysis of AgentBased Models: a Cookbook Using NetLogo and R by Thiele and his collaborators describe the R / NetLogo relationship in great detail and points to a decade’s worth of reading. But the real fun is that Thiele’s RNetLogo package lets you jump in and start analyzing NetLogo models in a matter of minutes.
Here is part of an extended example from Thiele's JSS paper that shows R interacting with the Fire model that ships with NetLogo. Using some very simple logic, Fire models the progress of a forest fire.
Snippet of NetLogo Code that drives the Fire model
to go if not any? turtles ;; either fires or embers [ stop ] ask fires [ ask neighbors4 with [pcolor = green] [ ignite ] set breed embers ] fadeembers tick end ;; creates the fire turtles to ignite ;; patch procedure sproutfires 1 [ set color red ] set pcolor black set burnedtrees burnedtrees + 1 end
The general idea is that turtles represent the frontier of the fire run through a grid of randomly placed trees. Not shown in the above snippet is the logic that shows that the entire model is controlled by a single parameter representing the density of the trees.
This next bit of R code shows how to launch the Fire model from R, set the density parameter, and run the model.
# Launch RNetLogo and control an initial run of the # NetLogo Fire Model library(RNetLogo) nlDir < "C:/Program Files (x86)/NetLogo 5.0.5" setwd(nlDir) nl.path < getwd() NLStart(nl.path) model.path < file.path("models", "Sample Models", "Earth Science","Fire.nlogo") NLLoadModel(file.path(nl.path, model.path)) NLCommand("set density 70") # set density value NLCommand("setup") # call the setup routine NLCommand("go") # launch the model from R
Here we see the Fire model running in the NetLogo GUI after it was launched from RStudio.
This next bit of code tracks the progression of the fire as a function of time (model "ticks"), returns results to R and plots them. The plot shows the nonlinear behavior of the system.
# Investigate percentage of forest burned as simulation proceeds and plot library(ggplot2) NLCommand("set density 60") NLCommand("setup") burned < NLDoReportWhile("any? turtles", "go", c("ticks", "(burnedtrees / initialtrees) * 100"), as.data.frame = TRUE, df.col.names = c("tick", "percent.burned")) # Plot with ggplot2 p < ggplot(burned,aes(x=tick,y=percent.burned)) p + geom_line() + ggtitle("Nonlinear forest fire progression with density = 60")
As with many dynamical systems, the Fire model displays a phase transition. Setting the density lower than 55 will not result in the complete destruction of the forest, while setting density above 75 will very likely result in complete destruction. The following plot shows this behavior.
RNetLogo makes it very easy to programatically run multiple simulations and capture the results for analysis in R. The following two lines of code runs the Fire model twenty times for each value of density between 55 and 65, the region surrounding the pahse transition.
d < seq(55, 65, 1) # vector of densities to examine res < rep.sim(d, 20) # Run the simulation
The plot below shows the variability of the percent of trees burned as a function of density in the transition region.
My code to generate plots is available in the file: Download NelLogo_blog while all of the code from Thiele's JSS paper is available from the journal website.
Finally, here are a few more interesting links related to ABMs.
InsideBigData has published a new Guide to Machine Learning, in collaboration with Revolution Analytics. As the name suggests, the Guide provides an overview of machine learning techniques, with a focus on implementation with the R language and (for bigdata applications) Revolution R Enterprise. You can download the Guide here (email registration required), or for a quick overview of the contents check out the series of posts by Daniel Gutierrez on the topics covered in the Guide:
insideBigData: Guide to Machine Learning
by Joseph Rickert
UserR! 2014 got under way this past Monday with a very impressive array of tutorials delivered on the day that the conference organizers were struggling to cope with a record breaking crowd. My guess is that conference attendance is somewhere in the 700 range. Moreover, this the first year that I can remember that tutorials were free. The combination made for jampacked tutorials.
The first thing that jumps out just by looking at the tutorial schedule is the effort that RStudio made to teach at the conference. Of the sixteen tutorials given, four were presented by the RStudio team. Winston Chang conducted an introduction to interactive graphics with ggvis; Yihui Xie presented Dynamic Documents with R and knitr; Garrett Grolemund Interactive data display with Shiny and R and Hadley Wickham taught data manipulation with dplyr. Shiny is still captivating R users. I was particularly struck by a conversation I had with a undergraduate Stats major who seemed to be genuinely pleased and excited about being able to build her own Shiny apps. Kudos to the RStudio team.
Bob Muenchen brought this same kind of energy to his introduction to managing data with R. Bob has extensive experience with SAS and Stata and he seems to have a gift anticipating areas where someone proficient in one of these packages, but new to R, might have difficulty.
Matt Dowle presented his tutorial on Data Table and, from the chatter I heard, increased his growing list of data table converts. Dirk Eddelbuettel presented An ExampleDriven, Handson Introduction to Rcpp and Romain Francois taught C++ and Rcpp11 for beginners. Both Dirk and Romain work hard to make interfacing to C++ a real option for people who themselves are willing to make an effort.
I came late to Martin Morgan’s tutorial on Bioconductor but couldn’t get in the room. Fortunately, Martin prepared an extensive set of materials for his course which I hope to be able to work through.
Max Kuhn taught an introduction to applied predictive based on the recent book he wrote with Kjell Johnson. Both the slides and code are available. Virgilio Gomez Rubio presented a tutorial on applied spatial data analysis with R. His materials are available here. Ramnath Vaidyanathan Interactive Documents with R Have a look at both Ramnath’s slides, the map embedded in his abstract and his screencast.
Drew Schmidt taught a workshop on Programming with Big Data in R based on the pdbR package. The course page points to a rich set of introductory resources.
I very much wanted to attend Søren Højsgaard’s tutorial on Graphical Models and Bayesian Networks with R, but couldn’t make it. I did attend John Nash's tutorial on Nonlinear parameter optimization and modeling in R and I am glad that I did. This is a new field for me and I was fortunate to see something of John’s meticulous approach to the subject.
I was disappointed that I didn’t get to attend Thomas Petzoldt’s tutorial on Simulating differential equation models in R. This is an area that is not usually associated with R. The webpage for the tutorial is really worth a look.
I don't know if the conference organizers planned it that way, but as it turned out, the tutorial subjects chosen are an excellent showcase for the depth and diversity of applications that can be approached through R. Many thanks to the tutorial teachers and congratulations to the UseR! 2014 conference organizers for a great start to the week. I hope to have more to say about the conference in future posts.
by Joseph Rickert
Predictive Modeling or “Predictive Analytics”, the term that appears to be gaining traction in the business world, is driving the new “Big Data” information economy. Predictably, there is no shortage of material to be found on this subject. Some discussion of predictive modeling is sure to be found in any reasonably technical presentation of business decision making, forecasting, data mining, machine learning, data science, statistical inference or just plain science. There are hundreds of booksthat have something worthwhile to say about predictive modeling. However, in my judgment, Applied Predictive Modeling by Max Kuhn and Kjell Johnson (Springer 2013) ought to be at the very top of reading list of anyone who has some background in statistics, who is serious about building predictive models, and who appreciates rigorous analysis, careful thinking and good prose.
The authors begin their book by stating that “the practice of predictive modeling defines the process of developing a model in a way that we can understand and quantify the model’s prediction accuracy on future, yettobeseen data”. They emphasize that predictive modeling is primarily concerned with making accurate predictions and not necessarily building models that are easily interpreted. Neverless, they are careful to point out that “the foundation of an effective predictive model is laid with intuition and deep knowledge of the problem context”. The book is a masterful exposition of the modeling process delivered at high level of play, with the authors gently pushing the reader to understand the data, to carefully select models, to question and evaluate results, to quantify the accuracy of predictions and to characterize their limitations.
Kuhn and Johnson are intense but not oppressive. They come across like coaches who really, really want you to be able to do this stuff. They write simply and with great clarity. However, the material is not easy. I frequently, found myself rereading a passage and almost always found it to be worth the effort. This mostly happened when reading a careful discussion of a familiar topic (i.e. something I thought I understood). For example, Chapter 14 on Classification Trees and RuleBased models contains what I thought to be an illuminating discussion on the difference between building trees with grouped categories and taking the trouble to decompose a categorical predictor into binary dummy variables, in effect forcing binary splits for the categories.
Applied Predictive Modeling begins with chapter that introduces the case studies that referenced throughout the book. Thereafter, chapters are organized into four parts: General Strategies, Regression Models, Classification Models, Other Considerations and three appendices, including a brief introduction to R (too brief to teach someone R, but adequate to give a programmer new to R enough of an orientation to make sense of the R scripts included in the book). This organization has the virtue of allowing the authors to focus on the specifics of the various models while providing a natural way to repeat and reinforce fundamental principles. For example, Regression Trees and Classification Trees share a great deal in common and many authors treat them together. However, by splitting them into separate sections Kuhn and Johnson can focus on the performance measures that are peculiar to each kind of model while getting a second chance to explain fundamental principles and techniques such as bagging and boosting that are applicable to both kinds of models.
There are many ways to go about reading Applied Predictive Modeling. I can easily envision someone committed to mastering the material reading the text from cover to cover. However, the chapters are pretty much self contained, and the authors are very diligent about providing back references to topics they have covered previously. You can pretty much jump in anywhere and find your way around. Additionally, the authors take the trouble to include quite a bit of “forward referencing” which I found to be very helpful. As an example, In section 3.6, where the authors mention credit scoring with respect to a discussion on adding predictors to a model, they point ahead to section 4.5 which is short discussion of the credit scoring case study. This section, in turn, points ahead to section 11.2 and a discussion of evaluating predicted classes. These forward references encourage and facilitate latching on to a topic and then threading through the book to track it down.
Three major strengths of the book are its fundamental grounding in the principles of statistical inference, the thoroughness with which the case studies are presented, and its use of the R language. The statistical viewpoint is apparent both from the choice of topics presented and the authors’ overall approach to predictive modeling. Topics that are peculiar to a statistical approach include the presentation of stratified sampling and other sampling techniques in the discussion of data splitting, and the sections on partial least squares and linear discriminant analysis. The real statistical value of the text, however, is embedded in the Kuhn and Johnson’s methodology. They take great care to examine the consequences of modeling decisions and continually encourage the reader to challenge the results of particular models. The chapters on data preparation and model evaluation do an excellent job of informally presenting a formal methodlolgy for making inferences. Applied Predictive Modeling contains very few equations and very little statistical jargon but it is infused with statistical thinking. (A side effect of the text is to teach statistics without being too obvious about it. You will know you are catching on if you think the xkcd cartoon in chapter 19 is really funny.)
A nice feature about the case studies is that they are rich enough to illustrate several aspects of the model building process and are used effectively throughout the text. The discussion in Chapter 12 on preparing the Kaggle contest, University of Melbourne grant funding data set is particularly thorough. This kind of “blow by blow” discussion of why the authors make certain modeling decisions is invaluable.
The R language comes into play in several ways in the text. The most obvious is the section on computing that closes most chapter. These sections contain R code that illustrates the major themes presented in the chapter. To some extent, these brief R statements substitute for the equations that are missing from the text. They provide concrete visual representations of the key ideas accessible to anyone who makes the effort to learn very little R syntax. The chapter ending code is itself backed up with an R package available on CRAN, AppliedPredictiveModeling, that contains scripts to reproduce all of the analyses and plots in the text. (This feature makes the text especially wellsuited for self study.)
Applied Predictive Modeling is resplendent with R graphs and plots, many of them in color that are integral to the presentation of ideas but which also serve to illustrate how easily presentation level graphs can be created in R. Form definitely follows function here, and it makes for a rather pretty book. One of my favorite plots is the first part of Figure 11.3 reproduced below which shows the test set probabilities for a logistic regression model of the German Credit data set.
The authors point out that the estimates of bad credit in the right panel are skewed showing that most estimates predict very low probabilities for bad credit when the credit is, in fact, good  just what you want to happen. In contrast, the estimates of bad credit are flat in the left panel, “reflecting the model’s inability to distinguish bad credit cases”.
Finally, Applied Predictive Modeling can be view as an introduction to the caret package. There is great depth here. This is not a book that comes with a little bit of illustrative code, icing on a cake so to speak, rather the included code is just the tip of the iceberg. It provides a gateway to the caret package and the full functionality of R’s machine learning capabilities.
Applied Predictive Modeling is a remarkable text. At 600 pages, it is the succinct distillation of years of experience of two expert modelers working in the pharmaceutical industry. I expect that beginners and experienced model builders alike will find something of value here. On my shelf, it sits up there right next to Hastie, Tibshirani and Friedman’s The Elements of Statistical Learning.
by Wayne Smith, Ph.D. California State University, Northridge
Editor's note: This post was abstracted from the monthly newsletter of the Southern California Chapter of the ASA.
On May 13th and 14^{th}, the Intel International Science and Engineering Fair (Intel ISEF) the world’s largest international precollege competition, was held at the Los Angeles Convention Center.
I was blessed with the opportunity to represent the American Statistical Association (ASA). As one of approximately 30 statisticians, I helped assist in the judging of the statisticsrelated elements of numerous prescient and empirical projects presented by high school students from around the world. These students had already won other local and regional science and engineering competitions. We selected first, second, and third place winners, but 16 student teams in total received special recognition and goodie bags filled with software, books, and other items.
The photograph below shows the first place winner, Soham Daga, from New York who used Google Trends to develop a model ot prodict the likelihood of mortgage delinquency. An Interview with Soham can be found here.
I have no doubt that a lasting affinity with statistical professionals and supporting organizations will be a tangible outcome for these motivated, young researchers.
I was energized and transformed by the breadth and depth of the research methods and concomitant inferential analysis applied to address pressing issues in areas as diverse as health care, energy, sustainability, material science, pharmacology, biochemistry, financial economics, and many others. Along with my ASA colleagues, I discussed projects with students as young as 15. As one might expect, many of the High School seniors are attending top research universities in the Fall. I was especially impressed with the rich diversity of students, including groups of students from Qatar, Egypt, Tunisia, Brazil, Japan, Russia, and historically underrepresented areas in the U.S. such as Fresno, CA. Some of the students' work has been ongoing for more than a year, and the students offered background literature (with references!), purposeful hypotheses, detailed analysis and results (occasionally with tool manifests and explanatory code), and integrated conclusions.
Of the 80 or so projects I reviewed, I observed applications of the general linear model; repeated measures; logistic regression; nonparametric measures; classification, feature extraction, and dimensionality reduction; sundry machine learning approaches; and Monte Carlo simulations. I was equally impressed by these students' abilities in fundamental research tasks such as locating and using open source software (e.g., R), understanding and coherently explaining potential I/O and computationalbounds, finding and interpreting peerreviewed literature, and seeking out the assistance of relevant industry professionals. Additionally, the students' ebullient entrepreneurial spirit in the design and execution of physical proofofconcept prototypes and related statistical experiments was especially noteworthy. I came away from each project and each student/team discussion with a new understanding of a thorny issue, a vision for what the solution space and product and process possibilities might be, and perhaps most germane for a College instructor, a renewed calibration for the knowledge, skills, and abilities of a tapestry of young people in the broad areas of mathematical, statistical, and computational sciences. I felt visceral pride in the statistical calling of many of these young finalists, and I know that they will craft much social, intellectual, and economic value for many decades to come.
A side benefit of service at this event was the opportunity to interact with academic and professional colleagues representing a variety of statisticaleducation interests. In particular, I'd like to thank Madeline Bauer (USC/Keck), Theresa Utlaut (Intel), Jo Hardin (Pomona College), and Olga Korosteleva (CSULB) for their guidance in the judging process. At this event one can interact with professionals from dozens of other professional societies and technology firms as well.
This Intelsponsored event circulates annually among three U.S. cities. I strongly recommend that individuals with an general interest in statistics and data science volunteer at this event and at local SCASA and OCLBASA events in the future.
Many many thanks to all the statisticians who participated as judges and/or behind the scenes! Thanks to the ASA for the cash prizes and thanks to Chapman Hall/CRC, JMP, Minitab, O’Reilly Media, Revolution Analytics, Sage, Stata, and Taylor & Francis for the donated books, magazines, software and other items.
by Joseph Rickert
I was very happy to have been able to attend R / Finance 2014 which wrapped up a couple of weeks ago. In general, the talks were at a very high level of play, some dealing with brand new ideas and many presented at a significant level of technical or mathematical sophistication. Fortunately, most of the slides from the presentations are quite detailed and available at the conference site. Collectively, these presentations provide a view of the boundaries of the conceptual space imagined by the leaders in quantitative finance. Some of this space covers infrastructure issues involving ideas for pushing the limits of R (Some Performance Improvements for the R Engine) or building a new infrasturcture (New Ideas for Large Network Analysis) or (Building Simple Data Caches) for example. Others are involved with new computational tools (Solving Cone Constrained Convex Programs) or attempt to push the limits on getting some actionable insight from the mathematical abstrations: (Portfolio Inference withthei One Wierd Trick) or (Twinkle twinkle litle STAR: Smooth Transition AR Models in R) for example.
But while the talks may be illuminating, the real takeaways from the conference are the R packages. These tools embody the work of the thought leaders in the field of computational finance and are the means for anyone sufficiently motivated to understand this cutting edge work. By my count, 20 of the 44 tutorials and talks given at the conference were based on a particular R package. Some of the packages listed in the following table are wellestablished and others are workinprogress sitting out on RForge or GitHub, providing opportunities for the interested to get involved.
R Finance 2014 Talk 
Package 
Description 
Introduction to data.table 
Extension of the data frame 

An ExampleDriven Handson introduction to Rcpp 
Functions to facilitate integrating R with C++ 

Portfolio Optimization: Utility, Computation, Equities Applications 
Environment for reaching Financial Engineering and Computational Finance 

ReEvaluation of the Low Risk Anomaly via Matching 
Implementation of the Coarsened Exact matching Algorithm 

BCP Stability Analytics: New Directions in Tactical Asset Management 
Bayesian Analysis of Change Point Problems 

On the Persistence of Cointegration in Pairs Trading 
EngleGranger Cointegration Models 

Tests for Robust Versus Least Squares Factor Model Fits 
robust methods 

The R Package cccp: Solving Cone Constrained Convex Programs 
Solver for convex problems for cone constraints 

Twinkle, twinkle little STAR: Smooth Transition AR Models in R 
Modeling smooth transition models 

Asset Allocaton with Higher Order Moments and Factor Models 
Global optimization by differential evolution / Numerical methods for portfolio optimization 

Event Studies in R 
Event study and extreme event analysis 

An R package on Credit Default Swaps 
Provides tools for pricing credit default swaps 

New Ideas for Large Network Analysis, Implemented in R 
Implicitly restarted Lanczos methods for R 

Package “Intermediate and Long Memory Time Series 

Simulate & Detect Intermediate and Long Memory Processes / in development 
Stochvol: Dealing with Stochastic Volatility in Time Series 
Efficient Bayesian Inference for Stochastic Volatility (SV) Models 

Divide and Recombine for the Analysis of Large Complex Data with R 
Package for using R with Hadoop 

gpusvcalibration: Fast Stochastic Volatility Model Calibration using GPUs 
Fast calibration of stochastic volatility models for option pricing models 

The FlexBayes Package 
Provides an MCMC engine for the class of hierarchical feneralized linear models and connections to WinBUGS and OpenBUGS 

Building Simple Redis Data Caches 
Rcpp bindings for Redis that connects R to the Redis key/value store 

Package pbo: Probability of Backtest Overfitting 
Uses Combinatorial Symmetric Cross Validation to implement performance tests. 
Many of these packages / projects also have supplementary material that is worth chasing down. Be sure to take a look at Alexios Ghalanos recent post that provides an accessible introduction to his stellar keynote address.
Many thanks to the organizers of the conference who, once again, did a superb job, and to the many professionals attending who graciously attempted to explain their ideas to a dilletante. My impression was that most of the attendies thoroughly enjoyed themselves and that the general sentiment was expressed by the last slide of Stephen Rush's presentation:
by Mike Bowles
In two previous posts, A Thumbnail History of Ensemble Methods and Ensemble Packages in R, Mike Bowles — a machine learning expert and serial entrepreneur — laid out a brief history of ensemble methods and described a few of the many implementations in R. In this post Mike takes a detailed look at the Random Forests implementation in the RevoScaleR package that ships with Revolution R Enterprise.
Revolution Analytics' rxDForest() function provides an ideal tool for developing ensemble models on very large data sets. It allows the data scientist to do prototyping on a single CPU version of the random forest algorithm and to then shift with relative ease to a multicore version for generating a higherperformance model on an extremely large data set. It’s convenient that the singleCPU and multipleCPU versions operate on the same data, have many of the same input parameters and deliver the same types of performance summaries and analyses. Revolution Anaytics is one of a very small number of companies offering a true multicore version of Random Forests. (I only know of one other)*.
The computationally intensive part of ensemble methods is training binary decision trees, and the computationally intense part of training a binary tree is splitpoint determination. Binary trees comprise a number of binary decisions that are of the form (attributeX < some number). The nodes in the tree pose this binary question and its answer determines whether an example goes to the left or right out of the node. To train the binary tree every possible split point for every attribute has to be tried in order to pick the best one. It’s easy to see why this splitpoint selection process consumes so much time — particularly on very large data sets. In a standard tree formulation (CART for example) the number of tests is equal to the number of points in the data set (not the number of examples (rows), the number of rows times the number of attributes (columns). This issue has been the subject of research for the last ten or so years.
The Google PLANET paper discusses the sensible idea of approximating the splitpoint selection process by aggregating points into bins instead of checking every possible value. More recent researchers have developed methods for generating approximate data histograms on streaming data. These methods are wellsuited to the mapreduce environment and implemented in the Revolution Analytics version of binary decision trees and Random Forests. Their incorporation makes the computation faster and introduces a “binning” parameter that may be unfamiliar to longtime users of singleCPU versions of random forests.
The screen shots below the input and output for running the Revolution Analytics multicpu random forests on AWS. The software is being run through a server version of RStudio. Two CPUs are included in the cluster for building trees. The first screen shot shows the code input for building a predictive model on the UC Irvine data set for red wine taste scores. The code is the figure shows how little change is required for running Revolution Analytics’ multicore version versus using one of the random forest packages available through CRAN.
The next two screen shots show some of the familiar output that’s available. The first plot shows the oobprediction error as a function of the number of trees in the ensemble. The second plot gives variable importance. Those familiar with the wine taste data set will recognize that alcohol is correctly identified as the most significant feature for predicting wine taste.
* Editor's Note: H20 from 0xdata contains a multicore Random Forest implementation.
by James P. Peruvankal
Kaggle just announced a competition to predict which shoppers will become repeat buyers. To aid with algorithmic development, they have provided complete, basketlevel, preoffer shopping history for a large set of shoppers who were targeted for an acquisition campaign. Files containing the incentives offered to each shopper as well as their postincentive behavior are also provided.
This challenge provides almost 350 million rows of completely anonymised transactional data from over 300,000 shoppers. It is one of the largest problems run on Kaggle to date. Once unzipped, data size will be 22GB, more than what can fit into the memory of usual laptops.
If you like this sort of thing, a first look at the data ought to captivate your interest. The following plots shows the number of repeated trips to the store plotted against the offer value in dollars on the x axis. The data are shaded by market, a geographical area.
To get your own first look at the data, and maybe try out a few of the fast Parallel External Memory Algorithms included in Revolution R Enterprise, you might find it helpful to take advantage of Revolution Analytics offer to try out Revolution R Enterprise in the AWS cloud. (If you spin up a Linux box in AWS, you can go up to 64GB RAM.)
This contest is representative of the challenge coping with the exponential growth in realworld data projects. I am sure, we will see more of these kind of problems.
In addition to trying Revolution R Enterprise in the cloud, active Kaggle competitors can download the fullfeatured Revolution R Enterprise software and use it for free to create their own submissions.
Some of us Revolutionaries are jumping into the fray. See you at the competition!
by Joseph Rickert
Pasha Roberts, Chief Scientist at Talent Analytics, is writing a series of articles on Employee Churn for the Predictive Analytics Times that comprise a really instructive and valuable example of using R to do some basic predictive modeling. So far, Pasha has published Employee Churn 201 in which he makes a case for the importance of modeling employee churn, and Employee Churn 202 where he builds a fairly sophisticated interactive model from first principles using only RStudio and basic R functions. And, while the series is not even complete, I think that is is going to be unique because it is working well on multiple levels.
In Churn 201, Pasha uses R almost incidentally, to produce the following plot that illustrates the concepts involved in understanding the costs and benefits contributed by a single employee.
At the lowest level, this is a nice example of what might be called a “programming literate essay”. R clearly isn’t necessary just to build create a graphic. (Note the use of ggplot's annotate() capability.) But, if you look at the R code behind the scenes, you will see that Pasha has gone a bit further. In a few lines of annotated code he has sketched out a selfdocumenting model that someone else could use to get “back of the envelope” results for their business. The exercise is roughly at the level of what a business analyst might attempt in an Excel spreadsheet.
In Employee Churn 202, Pasha goes still further moving the series from essays alone to a modeling effort. He uses basic survival analysis ideas and simple R functions to create a sophisticated decision model that computes several performance measures including something he calls Expected Cumulative Net Benefit. This measures the net benefit to the corporation employees who leave for both “good” and “bad” reasons.
The following figure shows the simulation running in RStudio complete with interactive tools built with the manipulate() function to perform "what if" analyses and display the results.
Running the simulation is easy. All of the code is available on Github where the file, churn202.md, provides details on how things work. Once you have run the code in the churn202.R or issued the command source("churn202.R") from the console, running the function manipSim202() will produce the simulation. (Note that might be necessary to click on “gear” icon in the upper left hand corner of the plots panel to have the slide bar controls appear.) The function runSensitivityTests()varies each of the parameters in the simulation through a reasonable range of values, while holding the other parameters fixed, to show the sensitivity of Expected Net Chumulative Benefit to each parameter. The function runHistograms() produces histograms of the synthetic data that drive the simulation and hints at the data collection effort that would be required to run the simulation for real.
By placing the code on GitHub and inviting feedback, comments and pull requests Pasha has raised his literary efforts to the status of an open source, employee churn project without comprimising the clarity of his exposition. I, for one, am looking forward to the rest of thes series.
If you missed last week's webinar presented by Revolution Analytics' US Chief Scientist Mario Inchiosa, Decision Trees built in Hadoop plus more Big Data Analytics with Revolution R Enterprise, the slides and webinar replay are now available for download. The webinar includes a demo of building decision trees and regression trees in Revolution R Enterprise, and using the Tree Viewer to inspect the resulting tree, starting at the 30:20 mark.
If you'd like to learn more about the bigdata tree models in Revolution R Enterprise, check out the white paper, Big Data Decision Trees with R. You can also find the slides from Mario's webinar at the link below.
Revolution Analytics Webinars: Decision Trees built in Hadoop plus more Big Data Analytics with Revolution R Enterprise