InsideBigData has published a new Guide to Machine Learning, in collaboration with Revolution Analytics. As the name suggests, the Guide provides an overview of machine learning techniques, with a focus on implementation with the R language and (for big data applications) Revolution R Enterprise. You can download the Guide here (email registration required), or for a quick overview of the contents check out the series of posts by Daniel Gutierrez on the topics covered in the Guide:
insideBigData: Guide to Machine Learning
by Joseph Rickert
UseR! 2014 got under way this past Monday with a very impressive array of tutorials delivered on a day when the conference organizers were struggling to cope with a record-breaking crowd. My guess is that conference attendance is somewhere in the 700 range. Moreover, this is the first year that I can remember that tutorials were free. The combination made for jam-packed tutorials.
The first thing that jumps out just by looking at the tutorial schedule is the effort that RStudio made to teach at the conference. Of the sixteen tutorials given, four were presented by the RStudio team. Winston Chang conducted an introduction to interactive graphics with ggvis; Yihui Xie presented Dynamic Documents with R and knitr; Garrett Grolemund presented Interactive data display with Shiny and R; and Hadley Wickham taught data manipulation with dplyr. Shiny is still captivating R users. I was particularly struck by a conversation I had with an undergraduate Stats major who seemed to be genuinely pleased and excited about being able to build her own Shiny apps. Kudos to the RStudio team.
Bob Muenchen brought this same kind of energy to his introduction to managing data with R. Bob has extensive experience with SAS and Stata, and he seems to have a gift for anticipating areas where someone proficient in one of these packages, but new to R, might have difficulty.
Matt Dowle presented his tutorial on data.table and, from the chatter I heard, increased his growing list of data.table converts. Dirk Eddelbuettel presented An Example-Driven, Hands-on Introduction to Rcpp and Romain Francois taught C++ and Rcpp11 for beginners. Both Dirk and Romain work hard to make interfacing to C++ a real option for people who themselves are willing to make an effort.
I came late to Martin Morgan’s tutorial on Bioconductor but couldn’t get in the room. Fortunately, Martin prepared an extensive set of materials for his course which I hope to be able to work through.
Max Kuhn taught an introduction to applied predictive modeling based on the recent book he wrote with Kjell Johnson. Both the slides and code are available. Virgilio Gomez-Rubio presented a tutorial on applied spatial data analysis with R. His materials are available here. Ramnath Vaidyanathan presented Interactive Documents with R. Have a look at Ramnath's slides, the map embedded in his abstract, and his screencast.
Drew Schmidt taught a workshop on Programming with Big Data in R based on the pbdR package. The course page points to a rich set of introductory resources.
I very much wanted to attend Søren Højsgaard’s tutorial on Graphical Models and Bayesian Networks with R, but couldn’t make it. I did attend John Nash's tutorial on Nonlinear parameter optimization and modeling in R and I am glad that I did. This is a new field for me and I was fortunate to see something of John’s meticulous approach to the subject.
I was disappointed that I didn’t get to attend Thomas Petzoldt’s tutorial on Simulating differential equation models in R. This is an area that is not usually associated with R. The webpage for the tutorial is really worth a look.
I don't know if the conference organizers planned it that way, but as it turned out, the tutorial subjects chosen are an excellent showcase for the depth and diversity of applications that can be approached through R. Many thanks to the tutorial teachers and congratulations to the UseR! 2014 conference organizers for a great start to the week. I hope to have more to say about the conference in future posts.
by Joseph Rickert
Predictive Modeling or “Predictive Analytics”, the term that appears to be gaining traction in the business world, is driving the new “Big Data” information economy. Predictably, there is no shortage of material to be found on this subject. Some discussion of predictive modeling is sure to be found in any reasonably technical presentation of business decision making, forecasting, data mining, machine learning, data science, statistical inference or just plain science. There are hundreds of books that have something worthwhile to say about predictive modeling. However, in my judgment, Applied Predictive Modeling by Max Kuhn and Kjell Johnson (Springer 2013) ought to be at the very top of the reading list of anyone who has some background in statistics, who is serious about building predictive models, and who appreciates rigorous analysis, careful thinking and good prose.
The authors begin their book by stating that “the practice of predictive modeling defines the process of developing a model in a way that we can understand and quantify the model’s prediction accuracy on future, yet-to-be-seen data”. They emphasize that predictive modeling is primarily concerned with making accurate predictions and not necessarily with building models that are easily interpreted. Nevertheless, they are careful to point out that “the foundation of an effective predictive model is laid with intuition and deep knowledge of the problem context”. The book is a masterful exposition of the modeling process delivered at a high level of play, with the authors gently pushing the reader to understand the data, to carefully select models, to question and evaluate results, to quantify the accuracy of predictions and to characterize their limitations.
Kuhn and Johnson are intense but not oppressive. They come across like coaches who really, really want you to be able to do this stuff. They write simply and with great clarity. However, the material is not easy. I frequently found myself rereading a passage and almost always found it to be worth the effort. This mostly happened when reading a careful discussion of a familiar topic (i.e., something I thought I understood). For example, Chapter 14 on Classification Trees and Rule-Based Models contains what I thought to be an illuminating discussion of the difference between building trees with grouped categories and taking the trouble to decompose a categorical predictor into binary dummy variables, in effect forcing binary splits for the categories.
Applied Predictive Modeling begins with a chapter that introduces the case studies referenced throughout the book. Thereafter, chapters are organized into four parts (General Strategies, Regression Models, Classification Models, and Other Considerations) and three appendices, including a brief introduction to R (too brief to teach someone R, but adequate to give a programmer new to R enough of an orientation to make sense of the R scripts included in the book). This organization has the virtue of allowing the authors to focus on the specifics of the various models while providing a natural way to repeat and reinforce fundamental principles. For example, Regression Trees and Classification Trees have a great deal in common, and many authors treat them together. However, by splitting them into separate sections Kuhn and Johnson can focus on the performance measures peculiar to each kind of model while getting a second chance to explain fundamental principles and techniques, such as bagging and boosting, that are applicable to both kinds of models.
There are many ways to go about reading Applied Predictive Modeling. I can easily envision someone committed to mastering the material reading the text from cover to cover. However, the chapters are pretty much self-contained, and the authors are very diligent about providing back references to topics they have covered previously. You can pretty much jump in anywhere and find your way around. Additionally, the authors take the trouble to include quite a bit of “forward referencing”, which I found to be very helpful. For example, in section 3.6, where the authors mention credit scoring with respect to a discussion of adding predictors to a model, they point ahead to section 4.5, which is a short discussion of the credit scoring case study. This section, in turn, points ahead to section 11.2 and a discussion of evaluating predicted classes. These forward references encourage and facilitate latching on to a topic and then threading through the book to track it down.
Three major strengths of the book are its fundamental grounding in the principles of statistical inference, the thoroughness with which the case studies are presented, and its use of the R language. The statistical viewpoint is apparent both from the choice of topics presented and the authors’ overall approach to predictive modeling. Topics that are peculiar to a statistical approach include the presentation of stratified sampling and other sampling techniques in the discussion of data splitting, and the sections on partial least squares and linear discriminant analysis. The real statistical value of the text, however, is embedded in Kuhn and Johnson’s methodology. They take great care to examine the consequences of modeling decisions and continually encourage the reader to challenge the results of particular models. The chapters on data preparation and model evaluation do an excellent job of informally presenting a formal methodology for making inferences. Applied Predictive Modeling contains very few equations and very little statistical jargon, but it is infused with statistical thinking. (A side effect of the text is to teach statistics without being too obvious about it. You will know you are catching on if you think the xkcd cartoon in chapter 19 is really funny.)
A nice feature of the case studies is that they are rich enough to illustrate several aspects of the model building process and are used effectively throughout the text. The discussion in Chapter 12 on preparing the Kaggle contest data set on University of Melbourne grant funding is particularly thorough. This kind of “blow-by-blow” discussion of why the authors make certain modeling decisions is invaluable.
The R language comes into play in several ways in the text. The most obvious is the section on computing that closes most chapters. These sections contain R code that illustrates the major themes presented in the chapter. To some extent, these brief R statements substitute for the equations that are missing from the text. They provide concrete visual representations of the key ideas, accessible to anyone who makes the effort to learn a little R syntax. The chapter-ending code is itself backed up with an R package available on CRAN, AppliedPredictiveModeling, that contains scripts to reproduce all of the analyses and plots in the text. (This feature makes the text especially well-suited for self-study.)
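For a flavor of those chapter-ending computing sections, here is a minimal sketch in the same spirit, built on the caret package that the book draws on. The data set and model choice are my own illustrative stand-ins, not an excerpt from the text:

```r
# Sketch of a caret-style workflow: split the data, fit a model with
# cross-validated tuning, and evaluate on the held-out set.
library(caret)

data(iris)  # stand-in data set; the book uses its own case studies
set.seed(100)
inTrain  <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training <- iris[inTrain, ]
testing  <- iris[-inTrain, ]

# train() wraps fitting, resampling, and parameter tuning in one call
fit <- train(Species ~ ., data = training,
             method = "rf",
             trControl = trainControl(method = "cv", number = 5))

confusionMatrix(predict(fit, testing), testing$Species)
```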
Applied Predictive Modeling is resplendent with R graphs and plots, many of them in color, that are integral to the presentation of ideas but which also serve to illustrate how easily presentation-level graphs can be created in R. Form definitely follows function here, and it makes for a rather pretty book. One of my favorite plots is the first part of Figure 11.3, reproduced below, which shows the test set probabilities for a logistic regression model of the German Credit data set.
The authors point out that the estimates of bad credit in the right panel are skewed, showing that most estimates predict very low probabilities for bad credit when the credit is, in fact, good: just what you want to happen. In contrast, the estimates of bad credit are flat in the left panel, “reflecting the model’s inability to distinguish bad credit cases”.
Finally, Applied Predictive Modeling can be viewed as an introduction to the caret package. There is great depth here. This is not a book that comes with a little bit of illustrative code as icing on the cake, so to speak; rather, the included code is just the tip of the iceberg. It provides a gateway to the caret package and the full functionality of R’s machine learning capabilities.
Applied Predictive Modeling is a remarkable text. At 600 pages, it is the succinct distillation of years of experience of two expert modelers working in the pharmaceutical industry. I expect that beginners and experienced model builders alike will find something of value here. On my shelf, it sits up there right next to Hastie, Tibshirani and Friedman’s The Elements of Statistical Learning.
by Wayne Smith, Ph.D. California State University, Northridge
Editor's note: This post was abstracted from the monthly newsletter of the Southern California Chapter of the ASA.
On May 13th and 14th, the Intel International Science and Engineering Fair (Intel ISEF), the world’s largest international pre-college competition, was held at the Los Angeles Convention Center.
I was blessed with the opportunity to represent the American Statistical Association (ASA). As one of approximately 30 statisticians, I assisted in the judging of the statistics-related elements of numerous prescient and empirical projects presented by high school students from around the world. These students had already won other local and regional science and engineering competitions. We selected first, second, and third place winners, but 16 student teams in total received special recognition and goodie bags filled with software, books, and other items.
The photograph below shows the first place winner, Soham Daga, from New York, who used Google Trends to develop a model to predict the likelihood of mortgage delinquency. An interview with Soham can be found here.
I have no doubt that a lasting affinity with statistical professionals and supporting organizations will be a tangible outcome for these motivated, young researchers.
I was energized and transformed by the breadth and depth of the research methods and concomitant inferential analysis applied to address pressing issues in areas as diverse as health care, energy, sustainability, material science, pharmacology, biochemistry, financial economics, and many others. Along with my ASA colleagues, I discussed projects with students as young as 15. As one might expect, many of the high school seniors will be attending top research universities in the fall. I was especially impressed with the rich diversity of students, including groups of students from Qatar, Egypt, Tunisia, Brazil, Japan, Russia, and historically underrepresented areas in the U.S. such as Fresno, CA. Some of the students' work has been ongoing for more than a year, and the students offered background literature (with references!), purposeful hypotheses, detailed analysis and results (occasionally with tool manifests and explanatory code), and integrated conclusions.
Of the 80 or so projects I reviewed, I observed applications of the general linear model; repeated measures; logistic regression; nonparametric measures; classification, feature extraction, and dimensionality reduction; sundry machine learning approaches; and Monte Carlo simulations. I was equally impressed by these students' abilities in fundamental research tasks such as locating and using open source software (e.g., R), understanding and coherently explaining potential I/O and computational bounds, finding and interpreting peer-reviewed literature, and seeking out the assistance of relevant industry professionals. Additionally, the students' ebullient entrepreneurial spirit in the design and execution of physical proof-of-concept prototypes and related statistical experiments was especially noteworthy.

I came away from each project and each student/team discussion with a new understanding of a thorny issue, a vision for what the solution space and product and process possibilities might be, and perhaps most germane for a college instructor, a renewed calibration for the knowledge, skills, and abilities of a tapestry of young people in the broad areas of mathematical, statistical, and computational sciences. I felt visceral pride in the statistical calling of many of these young finalists, and I know that they will craft much social, intellectual, and economic value for many decades to come.
A side benefit of service at this event was the opportunity to interact with academic and professional colleagues representing a variety of statisticaleducation interests. In particular, I'd like to thank Madeline Bauer (USC/Keck), Theresa Utlaut (Intel), Jo Hardin (Pomona College), and Olga Korosteleva (CSULB) for their guidance in the judging process. At this event one can interact with professionals from dozens of other professional societies and technology firms as well.
This Intel-sponsored event circulates annually among three U.S. cities. I strongly recommend that individuals with a general interest in statistics and data science volunteer at this event and at local SCASA and OCLBASA events in the future.
Many, many thanks to all the statisticians who participated as judges and/or behind the scenes! Thanks to the ASA for the cash prizes, and thanks to Chapman Hall/CRC, JMP, Minitab, O’Reilly Media, Revolution Analytics, Sage, Stata, and Taylor & Francis for the donated books, magazines, software and other items.
by Joseph Rickert
I was very happy to have been able to attend R / Finance 2014, which wrapped up a couple of weeks ago. In general, the talks were at a very high level of play, some dealing with brand new ideas and many presented at a significant level of technical or mathematical sophistication. Fortunately, most of the slides from the presentations are quite detailed and available at the conference site. Collectively, these presentations provide a view of the boundaries of the conceptual space imagined by the leaders in quantitative finance. Some of this space covers infrastructure issues involving ideas for pushing the limits of R (Some Performance Improvements for the R Engine), building new infrastructure (New Ideas for Large Network Analysis), or caching data (Building Simple Data Caches), for example. Others are involved with new computational tools (Solving Cone Constrained Convex Programs) or attempt to push the limits on getting some actionable insight from the mathematical abstractions: (Portfolio Inference with this One Weird Trick) or (Twinkle, twinkle little STAR: Smooth Transition AR Models in R), for example.
But while the talks may be illuminating, the real takeaways from the conference are the R packages. These tools embody the work of the thought leaders in the field of computational finance and are the means for anyone sufficiently motivated to understand this cutting-edge work. By my count, 20 of the 44 tutorials and talks given at the conference were based on a particular R package. Some of the packages listed in the following table are well-established and others are works-in-progress sitting out on R-Forge or GitHub, providing opportunities for the interested to get involved.
R / Finance 2014 Talk | Package | Description
Introduction to data.table | data.table | Extension of the data frame
An Example-Driven, Hands-on Introduction to Rcpp | Rcpp | Functions to facilitate integrating R with C++
Portfolio Optimization: Utility, Computation, Equities Applications | | Environment for teaching Financial Engineering and Computational Finance
Re-Evaluation of the Low Risk Anomaly via Matching | | Implementation of the Coarsened Exact Matching algorithm
BCP Stability Analytics: New Directions in Tactical Asset Management | | Bayesian analysis of change point problems
On the Persistence of Cointegration in Pairs Trading | | Engle-Granger cointegration models
Tests for Robust Versus Least Squares Factor Model Fits | | Robust methods
The R Package cccp: Solving Cone Constrained Convex Programs | cccp | Solver for convex problems with cone constraints
Twinkle, twinkle little STAR: Smooth Transition AR Models in R | | Modeling smooth transition models
Asset Allocation with Higher Order Moments and Factor Models | | Global optimization by differential evolution / numerical methods for portfolio optimization
Event Studies in R | | Event study and extreme event analysis
An R package on Credit Default Swaps | | Tools for pricing credit default swaps
New Ideas for Large Network Analysis, Implemented in R | | Implicitly restarted Lanczos methods for R
Intermediate and Long Memory Time Series | | Simulate and detect intermediate and long memory processes (in development)
stochvol: Dealing with Stochastic Volatility in Time Series | stochvol | Efficient Bayesian inference for stochastic volatility (SV) models
Divide and Recombine for the Analysis of Large Complex Data with R | | Package for using R with Hadoop
gpusvcalibration: Fast Stochastic Volatility Model Calibration using GPUs | gpusvcalibration | Fast calibration of stochastic volatility models for option pricing
The FlexBayes Package | FlexBayes | MCMC engine for hierarchical generalized linear models, with connections to WinBUGS and OpenBUGS
Building Simple Redis Data Caches | | Rcpp bindings for Redis, connecting R to the Redis key/value store
Package pbo: Probability of Backtest Overfitting | pbo | Uses combinatorial symmetric cross-validation to implement performance tests
Many of these packages / projects also have supplementary material that is worth chasing down. Be sure to take a look at Alexios Ghalanos' recent post that provides an accessible introduction to his stellar keynote address.
Many thanks to the organizers of the conference who, once again, did a superb job, and to the many professionals attending who graciously attempted to explain their ideas to a dilettante. My impression was that most of the attendees thoroughly enjoyed themselves and that the general sentiment was expressed by the last slide of Stephen Rush's presentation:
by Mike Bowles
In two previous posts, A Thumbnail History of Ensemble Methods and Ensemble Packages in R, Mike Bowles — a machine learning expert and serial entrepreneur — laid out a brief history of ensemble methods and described a few of the many implementations in R. In this post Mike takes a detailed look at the Random Forests implementation in the RevoScaleR package that ships with Revolution R Enterprise.
Revolution Analytics' rxDForest() function provides an ideal tool for developing ensemble models on very large data sets. It allows the data scientist to do prototyping on a single-CPU version of the random forest algorithm and then to shift with relative ease to a multicore version for generating a higher-performance model on an extremely large data set. It’s convenient that the single-CPU and multiple-CPU versions operate on the same data, have many of the same input parameters, and deliver the same types of performance summaries and analyses. Revolution Analytics is one of a very small number of companies offering a true multicore version of Random Forests. (I only know of one other.)*
The computationally intensive part of ensemble methods is training binary decision trees, and the computationally intensive part of training a binary tree is split-point determination. Binary trees comprise a number of binary decisions of the form (attribute X < some number). The nodes in the tree pose this binary question, and its answer determines whether an example goes to the left or right out of the node. To train the binary tree, every possible split point for every attribute has to be tried in order to pick the best one. It’s easy to see why this split-point selection process consumes so much time, particularly on very large data sets. In a standard tree formulation (CART, for example) the number of tests is not just the number of examples (rows), but the number of rows times the number of attributes (columns). This issue has been the subject of research for the last ten or so years.
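To make the cost concrete, here is a toy sketch of exhaustive split-point search for a single numeric attribute, scored by weighted Gini impurity. This is my own illustration of the general idea, not RevoScaleR's implementation:

```r
# Gini impurity of a set of class labels
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Try every observed value of attribute x as a split point and return
# the one minimizing the weighted impurity of the two child nodes.
best_split <- function(x, y) {
  candidates <- sort(unique(x))[-1]  # drop the minimum: it gives an empty left node
  scores <- sapply(candidates, function(s) {
    left <- y[x < s]; right <- y[x >= s]
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  candidates[which.min(scores)]
}

# One attribute already costs one impurity evaluation per distinct value;
# a full tree repeats this for every attribute at every node.
best_split(iris$Petal.Length, iris$Species)
```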
The Google PLANET paper discusses the sensible idea of approximating the split-point selection process by aggregating points into bins instead of checking every possible value. More recent researchers have developed methods for generating approximate data histograms on streaming data. These methods are well-suited to the map-reduce environment and are implemented in the Revolution Analytics version of binary decision trees and Random Forests. Their incorporation makes the computation faster and introduces a “binning” parameter that may be unfamiliar to longtime users of single-CPU versions of random forests.
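The binning idea can be illustrated in a few lines. This is a sketch of the general approach only; RevoScaleR's streaming histograms are more sophisticated:

```r
# Instead of testing every observed value, summarize the attribute into
# a fixed number of quantile bins and test only the bin boundaries.
binned_candidates <- function(x, nbins = 32) {
  unique(quantile(x, probs = seq(0, 1, length.out = nbins + 1)))
}

x <- rnorm(1e6)
length(unique(x))             # on the order of a million exact candidates
length(binned_candidates(x))  # at most nbins + 1 approximate candidates
```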
The screen shots below show the input and output for running the Revolution Analytics multi-CPU random forests on AWS. The software is being run through a server version of RStudio. Two CPUs are included in the cluster for building trees. The first screen shot shows the code input for building a predictive model on the UC Irvine data set of red wine taste scores. The code in the figure shows how little change is required to run Revolution Analytics’ multicore version versus one of the random forest packages available through CRAN.
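A call of the kind shown in the screen shots looks roughly like the following. The file name, variable selection, and parameter settings are my own guesses for illustration, not the author's exact code:

```r
# Hedged sketch of fitting a RevoScaleR random forest to the UCI
# red wine quality data (column names assumed from the UCI file).
library(RevoScaleR)  # ships with Revolution R Enterprise

wine <- rxImport(inData = "winequality-red.csv")

wineForest <- rxDForest(quality ~ alcohol + sulphates + pH + chlorides,
                        data = wine, nTree = 100, importance = TRUE)

plot(wineForest)          # OOB error vs. number of trees
rxVarImpPlot(wineForest)  # variable importance
```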
The next two screen shots show some of the familiar output that’s available. The first plot shows the OOB (out-of-bag) prediction error as a function of the number of trees in the ensemble. The second plot gives variable importance. Those familiar with the wine taste data set will recognize that alcohol is correctly identified as the most significant feature for predicting wine taste.
* Editor's Note: H2O from 0xdata contains a multicore Random Forest implementation.
by James P. Peruvankal
Kaggle just announced a competition to predict which shoppers will become repeat buyers. To aid with algorithmic development, they have provided complete, basket-level, pre-offer shopping history for a large set of shoppers who were targeted for an acquisition campaign. Files containing the incentives offered to each shopper as well as their post-incentive behavior are also provided.
This challenge provides almost 350 million rows of completely anonymised transactional data from over 300,000 shoppers. It is one of the largest problems run on Kaggle to date. Once unzipped, the data will be 22GB, more than can fit into the memory of a typical laptop.
If you like this sort of thing, a first look at the data ought to capture your interest. The following plot shows the number of repeated trips to the store plotted against the offer value in dollars on the x axis. The data are shaded by market, a geographical area.
To get your own first look at the data, and maybe try out a few of the fast Parallel External Memory Algorithms included in Revolution R Enterprise, you might find it helpful to take advantage of Revolution Analytics offer to try out Revolution R Enterprise in the AWS cloud. (If you spin up a Linux box in AWS, you can go up to 64GB RAM.)
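As a sketch of what that first look might involve using Revolution R Enterprise's external-memory functions, something like the following would work. The file and column names here are my assumptions about the contest data, not tested code:

```r
# Convert the large CSV to the XDF format once, reading it in chunks
# that fit comfortably in RAM; subsequent analyses stream over the file.
library(RevoScaleR)

rxImport(inData = "transactions.csv", outFile = "transactions.xdf",
         rowsPerRead = 1e6, overwrite = TRUE)

rxGetInfo("transactions.xdf", getVarInfo = TRUE)        # variables and row count
rxSummary(~ purchaseamount, data = "transactions.xdf")  # streamed summary
```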
This contest is representative of the challenge of coping with the exponential growth in real-world data projects. I am sure we will see more of these kinds of problems.
In addition to trying Revolution R Enterprise in the cloud, active Kaggle competitors can download the full-featured Revolution R Enterprise software and use it for free to create their own submissions.
Some of us Revolutionaries are jumping into the fray. See you at the competition!
by Joseph Rickert
Pasha Roberts, Chief Scientist at Talent Analytics, is writing a series of articles on Employee Churn for the Predictive Analytics Times that comprise a really instructive and valuable example of using R to do some basic predictive modeling. So far, Pasha has published Employee Churn 201, in which he makes a case for the importance of modeling employee churn, and Employee Churn 202, where he builds a fairly sophisticated interactive model from first principles using only RStudio and basic R functions. And, while the series is not even complete, I think it is going to be unique because it works well on multiple levels.
In Churn 201, Pasha uses R almost incidentally, to produce the following plot that illustrates the concepts involved in understanding the costs and benefits contributed by a single employee.
At the lowest level, this is a nice example of what might be called a “programming literate essay”. R clearly isn’t necessary just to create a graphic. (Note the use of ggplot's annotate() capability.) But, if you look at the R code behind the scenes, you will see that Pasha has gone a bit further. In a few lines of annotated code he has sketched out a self-documenting model that someone else could use to get “back of the envelope” results for their business. The exercise is roughly at the level of what a business analyst might attempt in an Excel spreadsheet.
In Employee Churn 202, Pasha goes still further, moving the series from essays alone to a modeling effort. He uses basic survival analysis ideas and simple R functions to create a sophisticated decision model that computes several performance measures, including something he calls Expected Cumulative Net Benefit. This measures the net benefit to the corporation from employees who leave for both “good” and “bad” reasons.
The following figure shows the simulation running in RStudio complete with interactive tools built with the manipulate() function to perform "what if" analyses and display the results.
Running the simulation is easy. All of the code is available on GitHub, where the file churn202.md provides details on how things work. Once you have run the code in churn202.R or issued the command source("churn202.R") from the console, running the function manipSim202() will produce the simulation. (Note that it might be necessary to click on the “gear” icon in the upper left-hand corner of the plots panel to make the slider controls appear.) The function runSensitivityTests() varies each of the parameters in the simulation through a reasonable range of values, while holding the other parameters fixed, to show the sensitivity of Expected Cumulative Net Benefit to each parameter. The function runHistograms() produces histograms of the synthetic data that drive the simulation and hints at the data collection effort that would be required to run the simulation for real.
By placing the code on GitHub and inviting feedback, comments, and pull requests, Pasha has raised his literary efforts to the status of an open source employee churn project without compromising the clarity of his exposition. I, for one, am looking forward to the rest of the series.
If you missed last week's webinar presented by Revolution Analytics' US Chief Scientist Mario Inchiosa, Decision Trees built in Hadoop plus more Big Data Analytics with Revolution R Enterprise, the slides and webinar replay are now available for download. The webinar includes a demo of building decision trees and regression trees in Revolution R Enterprise, and using the Tree Viewer to inspect the resulting tree, starting at the 30:20 mark.
If you'd like to learn more about the big-data tree models in Revolution R Enterprise, check out the white paper, Big Data Decision Trees with R. You can also find the slides from Mario's webinar at the link below.
Revolution Analytics Webinars: Decision Trees built in Hadoop plus more Big Data Analytics with Revolution R Enterprise
by Joseph Rickert
One of the remarkable features of the R language is its adaptability. Motivated by R’s popularity, and helped by R’s expressive power and transparency, developers working on other platforms display what looks like inexhaustible creativity in providing seamless interfaces to software that complements R’s strengths. The H2O R package that connects to 0xdata’s H2O software (Apache 2.0 License) is an example of this kind of creativity.
According to the 0xdata website, H2O is “The Open Source In-Memory, Prediction Engine for Big Data Science”. Indeed, H2O offers an impressive array of machine learning algorithms. The H2O R package provides functions for building GLM, GBM, K-means, Naive Bayes, Principal Components Analysis, Principal Components Regression, Random Forests and Deep Learning (multilayer neural net) models. Examples with timing information of running all of these models on fairly large data sets are available on the 0xdata website. Execution speeds are very impressive. In this post, I thought I would start a little slower and look at H2O from an R point of view.
H2O is a Java Virtual Machine that is optimized for doing “in memory” processing of distributed, parallel machine learning algorithms on clusters. A “cluster” is a software construct that can be fired up on your laptop, on a server, or across the multiple nodes of a cluster of real machines, including computers that form a Hadoop cluster. According to the documentation, a cluster’s “memory capacity is the sum across all H2O nodes in the cluster”. So, as I understand it, if you were to build a 16 node cluster of machines each having 64GB of DRAM, and you installed H2O on every node, then you could run the H2O machine learning algorithms using a terabyte of memory.
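That memory arithmetic is easy to check in R itself (the 16-node, 64GB figures are just the hypothetical example from the paragraph above):

```r
# Hypothetical cluster from the example above: 16 nodes x 64GB of DRAM each
nodes <- 16
gb_per_node <- 64
total_gb <- nodes * gb_per_node
total_gb  # 1024 GB, i.e. one terabyte of aggregate H2O heap
```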
Underneath the covers, the H2O JVM sits on an in-memory, non-persistent key-value (KV) store that uses a distributed Java memory model. The KV store holds state information, all results and the big data itself. H2O keeps the data in a heap. When the heap gets full, i.e. when you are working with more data than physical DRAM, H2O swaps to disk. (See Cliff Click’s blog for the details.) The main point here is that the data is not in R. R only has a pointer to the data, an S4 object containing the IP address, port and key name for the data sitting in H2O.
The R H2O package communicates with the H2O JVM over a REST API. R sends RCurl commands and H2O sends back JSON responses. Data ingestion, however, does not happen via the REST API. Rather, an R user calls a function that causes the data to be directly parsed into the H2O KV store. The H2O R package provides several functions for doing this, including: h2o.importFile(), which imports and parses files from a local directory; h2o.importURL(), which imports and parses files from a website; and h2o.importHDFS(), which imports and parses HDFS files sitting on a Hadoop cluster.
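As a rough illustration of the R side of that exchange, the snippet below parses a hand-written JSON string of the general sort H2O might send back. The field names here are invented for illustration, not taken from the actual H2O REST API, and I use the jsonlite package rather than whatever parser the h2o package uses internally:

```r
library(jsonlite)  # a JSON parser; the h2o package may use a different one internally

# A made-up JSON response, loosely in the style of an H2O status reply
response <- '{"ip": "127.0.0.1", "port": 54321, "key": "iris.hex"}'

# fromJSON() turns the JSON text into an R list with named components
parsed <- fromJSON(response)
parsed$ip    # "127.0.0.1"
parsed$port  # 54321
```

This mirrors the point above: what R holds locally is just a small description (IP address, port, key) of data that lives in the H2O JVM.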
So much for the background: let’s get started with H2O. The first thing you need to do is to get Java running on your machine. If you don’t already have Java, the default download ought to be just fine. Then fetch and install the H2O R package. Note that the h2o.jar executable is currently shipped with the h2o R package. The following code from the 0xdata website ran just fine from RStudio on my PC:
# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Next, we download, install and initialize the H2O package for R.
install.packages("h2o", repos=(c("http://s3.amazonaws.com/h2o-release/h2o/rel-kahan/5/R", getOption("repos"))))
library(h2o)
localH2O = h2o.init()

# Finally, let's run a demo to see H2O at work.
demo(h2o.glm)
Note that the function h2o.init() uses the defaults to start up H2O on your local machine. Users can also provide parameters to specify an IP address and port number in order to connect to a remote instance of H2O running on a cluster. h2o.init(Xmx="10g") will start up the H2O KV store with 10GB of RAM. demo(h2o.glm) runs the GLM demo to let you know that everything is working just fine. I will save examining the model for another time. Instead, let's look at some other H2O functionality.
The first thing to get straight with H2O is to be clear about when you are working in R and when you are working in the H2O JVM. The H2O R package implements several R functions that are wrappers to H2O native functions. "H2O supports an R-like language" (see A Note on R), but sometimes things behave differently than an R programmer might expect.
For example, the R code:
y <- apply(iris[,1:4], 2, sum)
y
produces the following result:
Sepal.Length Sepal.Width Petal.Length Petal.Width
876.5 458.6 563.7 179.9
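For what it's worth, base R also offers colSums() as a more direct way to get the same column totals, which makes a handy sanity check for the comparison that follows:

```r
# colSums() computes the same per-column totals as apply(..., 2, sum)
y <- apply(iris[, 1:4], 2, sum)
z <- colSums(iris[, 1:4])
all.equal(y, z)  # TRUE
```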
Now, let's see how things work in H2O. The following code loads the H2O package, starts a local instance of H2O, uploads the iris data set into the H2O instance from the H2O R package, and produces a very R-like summary.
library(h2o)           # Load H2O library
localH2O = h2o.init()  # Initialize a local H2O instance
# Upload the iris file from the H2O package into the local H2O instance
iris.hex <- h2o.uploadFile(localH2O,
                           path = system.file("extdata", "iris.csv", package="h2o"),
                           key = "iris.hex")
summary(iris.hex)
However, the apply() function from the H2O R package behaves a bit differently:
x <- apply(iris.hex[,1:4], 2, sum)
x
IP Address: 127.0.0.1
Port : 54321
Parsed Data Key: Last.value.17
Instead of returning the results, it returns the attributes of the file in which the results are stored. You can see this by looking at the structure of x.
str(x)
Formal class 'H2OParsedData' [package "h2o"] with 3 slots
..@ h2o :Formal class 'H2OClient' [package "h2o"] with 2 slots
.. .. ..@ ip : chr "127.0.0.1"
.. .. ..@ port: num 54321
..@ key : chr "Last.value.17"
..@ logic: logi FALSE
H2O dataset 'Last.value.17': 4 obs. of 1 variable:
$ C1: num 876.5 458.1 563.8 179.8
You can get the data out by coercing x into a data frame.
df <- as.data.frame(x)
df
C1
1 876.5
2 458.1
3 563.8
4 179.8
So, as one might expect, there are some differences that take a little getting used to. However, the focus ought not to be on the differences from R but on the potential of having some capabilities for manipulating huge data sets from within R. In combination, the H2O R package functions h2o.ddply() and h2o.addFunction(), which permits users to push a new function into the H2O JVM, do a fine job of providing some ddply() features to H2O data sets.
The following code loads one year of the airlines data set from my hard drive into the H2O instance, gives me the dimensions of the data, and lets me know what variables I have.
path <- "C:/DATA/Airlines_87_08/2008.csv"
air2008.hex <- h2o.uploadFile(localH2O, path = path, key = "air2008")
dim(air2008.hex)
[1] 7009728 29
colnames(air2008.hex)
Then, using h2o.addFunction(), define a function to compute the average departure delay, and create a new H2O data set without the DepDelay missing values that would otherwise blow up the added function.
# Define a function to compute an average for column 16 (DepDelay)
fun = function(df) { sum(df[,16])/nrow(df) }
h2o.addFunction(localH2O, fun) # Push the function to H2O
# Filter out missing values
air2008.filt = air2008.hex[!is.na(air2008.hex$DepDelay),]
head(air2008.filt)
Finally, run h2o.ddply() to get average departure delay by day of the week and pull down the results from H2O.
airlines.ddply = h2o.ddply(air2008.filt, "DayOfWeek", fun)
as.data.frame(airlines.ddply)
DayOfWeek C1
1 2 8.976897
2 6 8.645681
3 7 11.568973
4 4 9.772897
5 1 10.269990
6 5 12.158036
7 3 8.289761
Exactly what you would expect!
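For readers who want to see what h2o.ddply() is emulating, here is the same split-apply-combine computation in base R on a small made-up data frame (the numbers are invented for illustration, not taken from the airline data):

```r
# Toy stand-in for the airline data: departure delays by day of week
toy <- data.frame(
  DayOfWeek = c(1, 1, 2, 2, 2, 3),
  DepDelay  = c(10, 12, 5, 7, 9, 4)
)

# Equivalent of fun = sum(df[,16]) / nrow(df): a per-group mean
res <- aggregate(DepDelay ~ DayOfWeek, data = toy, FUN = mean)
res
#   DayOfWeek DepDelay
# 1         1       11
# 2         2        7
# 3         3        4
```

The difference, of course, is that aggregate() requires the data to fit in R's memory, while h2o.ddply() runs the pushed function inside the H2O JVM.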
Having h2o.ddply() limited to functions that can be pushed to H2O may seem limiting to some. However, in the context of working with huge data sets I don't see this as a problem. Presumably, the real data cleaning and preparation will be accomplished by other tools that are appropriate for the environment (e.g. Hadoop) where the data resides. In a future post, I hope to more closely examine H2O's machine learning algorithms. As it stands, from an R perspective H2O appears to be an impressive accomplishment and a welcome addition to the open source world.