by Joseph Rickert
I think we can be sure that when American botanist Edgar Anderson meticulously collected data on three species of iris in the early 1930s he had no idea that these data would produce a computational storm that would persist well into the 21st century. The calculations started, presumably by hand, when R. A. Fisher selected this data set to illustrate the techniques described in his 1936 paper on discriminant analysis. However, they really got going in the early 1970s when the pattern recognition and machine learning community began using it to test new algorithms or illustrate fundamental principles. (The earliest reference I could find was: Gates, G.W. "The Reduced Nearest Neighbor Rule". IEEE Transactions on Information Theory, May 1972, 431-433.) Since then, the data set (or one of its variations) has been used to test hundreds, if not thousands, of machine learning algorithms. The UCI Machine Learning Repository, which contains what is probably the "official" iris data set, lists over 200 papers referencing the iris data.
So why has the iris data set become so popular? Like most success stories, randomness undoubtedly plays a huge part. However, Fisher's selecting it to illustrate a discrimination algorithm brought it to people's attention, and the fact that the data set contains three classes, only one of which is linearly separable from the other two, makes it interesting.
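That separability is easy to verify yourself, since the iris data ships with every R installation. In the sketch below, the 2.5 cm petal-length cutoff is my choice for illustration, not anything from Fisher's paper:

```r
# setosa is linearly separable: a single petal-length threshold isolates it,
# while versicolor and virginica overlap each other
with(iris, tapply(Petal.Length, Species, range))

# every setosa falls below 2.5 cm; every other flower falls above it
table(iris$Species, ifelse(iris$Petal.Length < 2.5, "below 2.5", "above 2.5"))
```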
For similar reasons, the airlines data set used in the 2009 ASA Sections on Statistical Computing and Statistical Graphics Data Expo has gained a prominent place in the machine learning world and is well on its way to becoming the "iris data set for big data". It shows up in all kinds of places. (In addition to this blog, it made its way into the RHIPE documentation and figures in several college course modeling efforts.)
Some key features of the airlines data set are:
An additional, really nice feature of the airlines data set is that it keeps getting bigger! RITA, the Research and Innovative Technology Administration Bureau of Transportation Statistics, continues to collect data, which can be downloaded as .csv files. For your convenience, we have made a 143M+ record version of the data set, containing all of the RITA records from 1987 through the end of 2012, available for download on the Revolution Analytics test data site.
The following analysis from Revolution Analytics’ Sue Ranney uses this large version of the airlines data set and illustrates how a good model, driven with enough data, can reveal surprising features of a data set.
# Fit a Tweedie GLM
tm <- system.time(
  glmOut <- rxGlm(ArrDelayMinutes ~ Origin:Dest + UniqueCarrier + F(Year) +
                    DayOfWeek:F(CRSDepTime),
                  data = airData,
                  family = rxTweedie(var.power = 1.15),
                  cube = TRUE, blocksPerRead = 30)
)
tm

# Build a data frame for three airlines: Delta (DL), Alaska (AS), Hawaiian (HA)
airVarInfo <- rxGetVarInfo(airData)
predData <- data.frame(
  UniqueCarrier = factor(rep(c("DL", "AS", "HA"), times = 168),
                         levels = airVarInfo$UniqueCarrier$levels),
  Year = as.integer(rep(2012, times = 504)),
  DayOfWeek = factor(rep(c("Mon", "Tues", "Wed", "Thur", "Fri", "Sat", "Sun"),
                         times = 72),
                     levels = airVarInfo$DayOfWeek$levels),
  CRSDepTime = rep(0:23, each = 21),
  Origin = factor(rep("SEA", times = 504), levels = airVarInfo$Origin$levels),
  Dest = factor(rep("HNL", times = 504), levels = airVarInfo$Dest$levels)
)

# Use the model to predict the arrival delay for the three airlines and plot
predDataOut <- rxPredict(glmOut, data = predData, outData = predData,
                         type = "response")
rxLinePlot(ArrDelayMinutes_Pred ~ CRSDepTime | UniqueCarrier,
           groups = DayOfWeek, data = predDataOut, layout = c(3, 1),
           title = "Expected Delay: Seattle to Honolulu by Departure Time, Day of Week, and Airline",
           xTitle = "Scheduled Departure Time",
           yTitle = "Expected Delay")
Here, rxGlm() fits a Tweedie generalized linear model that looks at arrival delay as a function of the interaction between origin and destination airports, carriers, year, and the interaction between days of the week and scheduled departure time. This function kicks off a considerable amount of number crunching, as Origin is a factor variable with 373 levels and Dest, also a factor, has 377 levels. The F() function makes Year and CRSDepTime factors "on the fly" as the model is being fit. The resulting model ends up with 140,852 coefficients, 8,626 of which are not NA. The calculation takes 12.6 minutes to run on a 5-node (4 cores and 16GB of RAM per node) IBM Platform LSF cluster.
The rest of the code uses the model to predict arrival delay for three airlines and plots the fitted values by day of the week and departure time.
It looks like Saturday is the best time to fly for these airlines. Note that none of the structure revealed in these curves was put into the model, in the sense that there are no polynomial terms in the model.
It will take a few minutes to download the zip file with the 143M airlines records, but please do, and let us know how your modeling efforts go.
I was honoured to be invited earlier this month to the Directions in Statistical Computing (DSC) meeting in Brixen, Italy. DSC is one of two meetings run by the R Project and, unlike the useR! conference, is a much smaller, more intimate meeting (DSC 2014 had about 30 participants). If you haven't come across the DSC meeting before (quite possible, given that it had last been held in 2009), R Core Group member Martyn Plummer has a nice overview of DSC.
A focus of the first day of the conference was the performance of the R computation engine. The organizers invited representatives from all of the "alternative" R engine implementations, and I believe it marked the first time that developers involved with pqR, Renjin, FastR, Riposte, and TERR were gathered in the same place. (The CXXR project was unfortunately not represented.) Jan Vitek [slides] presented a fascinating comparison of the various projects, based on his interviews with the developers.
It was interesting to see the commonalities in many of the approaches. Three projects, Renjin [slides], FastR [slides] and Riposte [slides], use just-in-time compilation and an optimized bytecode engine. All have achieved impressive performance gains, but have struggled with compatibility (and especially with being able to run the 6000+ CRAN packages). But it's clear that their work is having an influence on R itself: Thomas Kalibera [slides] (who previously worked on the FastR project) is working with Luke Tierney and Jan Vitek to improve the performance of R's bytecode interpreter.
Other approaches are also being pursued to improve the performance of the R engine. Luke Tierney [slides] described new improvements in R 3.1 to streamline the reference counting system, and noted that several of the performance improvements implemented by Radford Neal [slides] in pqR have already been incorporated into the R engine. And Helena Kotthaus [slides] has done some very exciting work to profile the performance of the R engine which has already led to performance improvements when virtual memory is being used.
Overall, it was exciting to see collaboration and research into R as a language, and especially the attention from the computer science community to the implementation of R. As Robert Gentleman (co-creator of R and conference lead) noted, R now has a new community beyond statisticians and data scientists: computer scientists. It's exciting to see how R is incorporating learning and innovation from this new community.
For more on DSC 2014, see the reports from Martyn Plummer on Day 1 and Day 2 of the conference. The full program, with links to download the slide presentation, is at the link below.
DSC 2014: Schedule (and slide downloads)
by Joseph Rickert
UseR! 2014 got under way this past Monday with a very impressive array of tutorials, delivered on a day when the conference organizers were struggling to cope with a record-breaking crowd. My guess is that conference attendance is somewhere in the 700 range. Moreover, this is the first year that I can remember that tutorials were free. The combination made for jam-packed tutorials.
The first thing that jumps out just by looking at the tutorial schedule is the effort that RStudio made to teach at the conference. Of the sixteen tutorials given, four were presented by the RStudio team. Winston Chang conducted an introduction to interactive graphics with ggvis; Yihui Xie presented Dynamic Documents with R and knitr; Garrett Grolemund taught interactive data display with Shiny and R; and Hadley Wickham taught data manipulation with dplyr. Shiny is still captivating R users. I was particularly struck by a conversation I had with an undergraduate Stats major who seemed to be genuinely pleased and excited about being able to build her own Shiny apps. Kudos to the RStudio team.
Bob Muenchen brought this same kind of energy to his introduction to managing data with R. Bob has extensive experience with SAS and Stata, and he seems to have a gift for anticipating areas where someone proficient in one of these packages, but new to R, might have difficulty.
Matt Dowle presented his tutorial on data.table and, from the chatter I heard, increased his growing list of data.table converts. Dirk Eddelbuettel presented An Example-Driven, Hands-on Introduction to Rcpp and Romain Francois taught C++ and Rcpp11 for beginners. Both Dirk and Romain work hard to make interfacing to C++ a real option for people who are willing to make an effort themselves.
I came late to Martin Morgan's tutorial on Bioconductor and couldn't get in the room. Fortunately, Martin prepared an extensive set of materials for his course, which I hope to be able to work through.
Max Kuhn taught an introduction to applied predictive modeling based on the recent book he wrote with Kjell Johnson. Both the slides and code are available. Virgilio Gomez Rubio presented a tutorial on applied spatial data analysis with R; his materials are available here. Ramnath Vaidyanathan presented Interactive Documents with R. Have a look at Ramnath's slides, the map embedded in his abstract, and his screencast.
Drew Schmidt taught a workshop on Programming with Big Data in R based on the pbdR package. The course page points to a rich set of introductory resources.
I very much wanted to attend Søren Højsgaard’s tutorial on Graphical Models and Bayesian Networks with R, but couldn’t make it. I did attend John Nash's tutorial on Nonlinear parameter optimization and modeling in R and I am glad that I did. This is a new field for me and I was fortunate to see something of John’s meticulous approach to the subject.
I was disappointed that I didn’t get to attend Thomas Petzoldt’s tutorial on Simulating differential equation models in R. This is an area that is not usually associated with R. The webpage for the tutorial is really worth a look.
I don't know if the conference organizers planned it that way, but as it turned out, the tutorial subjects chosen are an excellent showcase for the depth and diversity of applications that can be approached through R. Many thanks to the tutorial teachers and congratulations to the UseR! 2014 conference organizers for a great start to the week. I hope to have more to say about the conference in future posts.
by Joseph Rickert
I was very happy to have been able to attend R/Finance 2014, which wrapped up a couple of weeks ago. In general, the talks were at a very high level of play, some dealing with brand new ideas and many presented at a significant level of technical or mathematical sophistication. Fortunately, most of the slides from the presentations are quite detailed and available at the conference site. Collectively, these presentations provide a view of the boundaries of the conceptual space imagined by the leaders in quantitative finance. Some of this space covers infrastructure issues involving ideas for pushing the limits of R (Some Performance Improvements for the R Engine) or building new infrastructure (New Ideas for Large Network Analysis) or (Building Simple Data Caches), for example. Others are involved with new computational tools (Solving Cone Constrained Convex Programs) or attempt to push the limits on getting some actionable insight from the mathematical abstractions: (Portfolio Inference with One Weird Trick) or (Twinkle, twinkle little STAR: Smooth Transition AR Models in R), for example.
But while the talks may be illuminating, the real takeaways from the conference are the R packages. These tools embody the work of the thought leaders in the field of computational finance and are the means for anyone sufficiently motivated to understand this cutting edge work. By my count, 20 of the 44 tutorials and talks given at the conference were based on a particular R package. Some of the packages listed in the following table are well-established and others are works-in-progress sitting out on R-Forge or GitHub, providing opportunities for the interested to get involved.
R/Finance 2014 Talk | Package Description

Introduction to data.table | Extension of the data frame
An Example-Driven Hands-on Introduction to Rcpp | Functions to facilitate integrating R with C++
Portfolio Optimization: Utility, Computation, Equities Applications | Environment for teaching Financial Engineering and Computational Finance
Re-Evaluation of the Low Risk Anomaly via Matching | Implementation of the Coarsened Exact Matching algorithm
BCP Stability Analytics: New Directions in Tactical Asset Management | Bayesian analysis of change point problems
On the Persistence of Cointegration in Pairs Trading | Engle-Granger cointegration models
Tests for Robust Versus Least Squares Factor Model Fits | Robust methods
The R Package cccp: Solving Cone Constrained Convex Programs | Solver for convex problems with cone constraints
Twinkle, twinkle little STAR: Smooth Transition AR Models in R | Modeling smooth transition models
Asset Allocation with Higher Order Moments and Factor Models | Global optimization by differential evolution / numerical methods for portfolio optimization
Event Studies in R | Event study and extreme event analysis
An R package on Credit Default Swaps | Tools for pricing credit default swaps
New Ideas for Large Network Analysis, Implemented in R | Implicitly restarted Lanczos methods for R
Package "Intermediate and Long Memory Time Series" | Simulate and detect intermediate and long memory processes (in development)
stochvol: Dealing with Stochastic Volatility in Time Series | Efficient Bayesian inference for stochastic volatility (SV) models
Divide and Recombine for the Analysis of Large Complex Data with R | Package for using R with Hadoop
gpusvcalibration: Fast Stochastic Volatility Model Calibration using GPUs | Fast calibration of stochastic volatility models for option pricing
The FlexBayes Package | MCMC engine for hierarchical generalized linear models, with connections to WinBUGS and OpenBUGS
Building Simple Redis Data Caches | Rcpp bindings for Redis, connecting R to the Redis key/value store
Package pbo: Probability of Backtest Overfitting | Uses combinatorial symmetric cross-validation to implement performance tests
Many of these packages / projects also have supplementary material that is worth chasing down. Be sure to take a look at Alexios Ghalanos' recent post that provides an accessible introduction to his stellar keynote address.
Many thanks to the organizers of the conference who, once again, did a superb job, and to the many professionals attending who graciously attempted to explain their ideas to a dilettante. My impression was that most of the attendees thoroughly enjoyed themselves and that the general sentiment was expressed by the last slide of Stephen Rush's presentation:
Many companies are considering switching from SAS to R for statistical data analysis, and may be wondering how R compares in performance and data size scalability to the legacy SAS systems (base SAS and SAS/Stat) they are currently using. Performance and scalability for R is exactly what Revolution R Enterprise (RRE) was designed for. In a recent webinar, Thomas Dinsmore described a benchmarking process to compare performance of legacy SAS and RRE. (The benchmarking process is described in the white paper Revolution R Enterprise: Faster Than SAS, and you can see the code behind the benchmarking process here.) In the webinar, Thomas revealed the following results:
Also in the webinar, John Wallace, founder and CEO of DataSong, described how performance and scalability requirements led to the selection in 2011 of Revolution R Enterprise as the analytics engine in their software-as-a-service platform. DataSong's industry-leading marketing analytics system currently analyzes more than $3 billion in marketing spend by major retailers.
The slides from the webinar are embedded above, and you can watch and download the full webinar at the link below.
Revolution Analytics webinars: Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
by Joseph Rickert
In last week’s post, I sketched out the history of Generalized Linear Models and their implementations. In this post I’ll attempt to outline how GLM functions evolved in R to handle large data sets.
The first function to make it possible to build GLM models with datasets that are too big to fit into memory was bigglm() from Thomas Lumley's biglm package, which was released to CRAN in May 2006. bigglm() is an example of an external memory or "chunking" algorithm. This means that data is read from some source on disk and processed one chunk at a time. Conceptually, chunking algorithms work as follows: a program reads a chunk of data into memory, performs intermediate calculations to compute the required sufficient statistics, saves the results and reads the next chunk. The process continues until the entire dataset is processed. Then, if necessary, the intermediate results are assembled into a final result.
According to the documentation trail, bigglm() is based on Alan Miller's 1991 refinement (algorithm AS 274, implemented in Fortran 77) to W. Morven Gentleman's 1975 Algol algorithm (AS 75). Both of these algorithms work by updating the Cholesky decomposition of the design matrix with new observations. For a model with p variables, only the p x p triangular Cholesky factor and a new row of data need to be in memory at any given time.
bigglm() does not do the chunking for you. Working with the algorithm requires figuring out how to feed it chunks of data from a file or a database that are small enough to fit into memory with enough room left for processing. (Have a look at the make.data() function defined on page 4 of the biglm pdf for the prototype example of chunking by passing a function to bigglm()'s data argument.) bigglm() and the biglm package offer few features for working with data. For example, bigglm() can handle factors, but it assumes that the factor levels are consistent across all chunks. This is very reasonable under the assumption that the appropriate place to clean and prepare the data for analysis is the underlying database.
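The chunking protocol is easiest to see in code. The sketch below follows the make.data() prototype from the biglm documentation, but the file name, chunk size, and simulated data are mine: bigglm() calls the supplied function once with reset = TRUE to rewind the source, then repeatedly with reset = FALSE until it returns NULL.

```r
library(biglm)

# Simulate a "large" file on disk (no header row; columns are x, z, y)
set.seed(1)
n <- 10000
dat <- data.frame(x = rnorm(n), z = rnorm(n))
dat$y <- rbinom(n, 1, plogis(0.5 * dat$x - 0.25 * dat$z))
write.table(dat, "big.csv", sep = ",", row.names = FALSE, col.names = FALSE)

# A make.data()-style chunk feeder: rewinds on reset = TRUE, otherwise
# returns the next chunk of rows, or NULL once the file is exhausted
make.data <- function(filename, chunksize) {
  conn <- NULL
  function(reset = FALSE) {
    if (reset) {
      if (!is.null(conn)) close(conn)
      conn <<- file(filename, open = "r")
    } else {
      chunk <- try(read.table(conn, sep = ",", nrows = chunksize,
                              col.names = c("x", "z", "y")), silent = TRUE)
      if (inherits(chunk, "try-error")) NULL else chunk
    }
  }
}

fit <- bigglm(y ~ x + z, data = make.data("big.csv", chunksize = 1000),
              family = binomial())
coef(fit)
```

Because the model is fit by iteratively reweighted least squares, bigglm() makes several passes over the source, which is why the reset mechanism matters.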
The next step in the evolution of building GLM models with R was the development of memory-mapped data structures, along with the appropriate machinery to feed bigglm() data stored on disk. In late 2007, Dan Alder et al. released the ff package, which provides data structures that, from R's point of view, make data residing on disk appear as if it were in RAM. The basic idea is that only a chunk (pagesize) of the underlying data file is mapped into memory, and this data can be fed to bigglm(). This strategy really became useful in 2011 when Edwin de Jonge, Jan Wijffels and Jan van der Laan released ffbase, a package of statistical functions designed to exploit ff's data structures. ffbase contains quite a few functions, including some for basic data manipulation such as ffappend() and ffmatch(). For an excellent example of building a bigglm() model with a fairly large data set, have a look at the post from the folks at BNOSAC. This is one of the most useful, hands-on posts with working code for building models with R and large data sets to be found. (It may be a testimony to the power of provocation.)
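With ffbase in hand, the hand-rolled chunk feeder disappears from user code entirely. A minimal sketch (the simulated data and variable names are mine, not from the BNOSAC post):

```r
library(biglm)
library(ffbase)  # loads ff and provides a bigglm() method for ffdf objects

set.seed(1)
d <- data.frame(x = rnorm(5000))
d$y <- rbinom(5000, 1, plogis(d$x))

fd <- as.ffdf(d)  # the data now live in memory-mapped ff files on disk
fit <- bigglm(y ~ x, data = fd, family = binomial(), chunksize = 1000)
coef(fit)
```

The chunksize argument controls how many rows are pulled into RAM per pass, so the memory footprint stays bounded no matter how large the ffdf is.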
Not long after ff debuted (June 2008), Michael Kane, John Emerson and Peter Haverty released bigmemory, a package for working with large matrices backed by memory-mapped files. Thereafter followed a sequence of packages in the Big Memory Project, including biganalytics, for exploiting the computational possibilities opened up by bigmemory. The bigmemory packages are built on the Boost Interprocess C++ library and were designed to facilitate parallel programming with foreach, snow, Rmpi and multicore, and to enable distributed computing from within R. The biganalytics package contains a wrapper function for bigglm() that enables building GLM models from very large files mapped to big.matrix objects with just a few lines of code.
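Those few lines of code look roughly like the sketch below (simulated numeric data, my names; note that a big.matrix holds a single numeric type, so factors would have to be coded as numbers beforehand):

```r
library(bigmemory)
library(biganalytics)

# Build an all-numeric matrix and wrap it as a big.matrix; adding a
# backingfile argument to as.big.matrix() would map it to a file on disk
set.seed(1)
m <- cbind(x = rnorm(5000), y = 0)
m[, "y"] <- 2 * m[, "x"] + rnorm(5000)
bm <- as.big.matrix(m)

# biganalytics' wrapper feeds the big.matrix to biglm's bigglm() in chunks
fit <- bigglm.big.matrix(y ~ x, data = bm, family = gaussian())
coef(fit)
```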
The initial release in early August 2010 of the RevoScaleR package for Revolution R Enterprise included rxLogit(), a function for building logistic regression models on very large data sets. rxLogit() was one of the first of RevoScaleR's Parallel External Memory Algorithms (PEMAs). These algorithms are designed specifically for high performance computing with large data sets on a variety of distributed platforms. In June 2012, Revolution Analytics followed up with rxGlm(), a PEMA that implements all of the standard GLM link/family pairs as well as Tweedie models and user-defined link functions. As with all of the PEMAs, scripts including rxGlm() may be run on different platforms just by changing a few lines of code that specify the user's compute context. For example, a statistician could test out a model on a local PC or cluster and then change the compute context to run it directly on a Hadoop cluster.
The only other Big Data GLM implementation accessible through an R package of which I am aware is the h2o.glm() function that is part of 0xdata's JVM implementation of machine learning algorithms, which was announced in October 2013. As opposed to the external memory R implementations described above, H2O functions run in the distributed memory created by the H2O process. Look here for h2o.glm() demo code.
And that's it: I think this brings us up to date with R-based (or R-accessible) functions for running GLMs on large data sets.
by Seth Mottaghinejad, Analytic Consultant for Revolution Analytics
In the last article, we showed two separate R implementations of the Collatz conjecture: 'nonvec_collatz' and 'vec_collatz', with the latter being more efficient than the former because of the way it takes advantage of vectorization in R. Let's once again take a look at 'vec_collatz':
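(The original listing for 'vec_collatz' is not reproduced here; the sketch below is my reconstruction of a vectorized implementation in that spirit, not necessarily Seth's exact code. It keeps the whole vector in play at once, updating every not-yet-finished element on each pass.)

```r
vec_collatz <- function(ints) {
  iters <- integer(length(ints))   # step counts, one per input value
  active <- ints != 1              # elements still being iterated
  while (any(active)) {
    nn <- ints[active]
    even <- nn %% 2 == 0
    nn[even]  <- nn[even] / 2       # halve the even elements...
    nn[!even] <- 3 * nn[!even] + 1  # ...and apply 3n + 1 to the odd ones
    ints[active]  <- nn
    iters[active] <- iters[active] + 1L
    active <- ints != 1
  }
  iters
}

vec_collatz(1:10)
# [1]  0  1  7  2  5  8 16  3 19  6
```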
Today we will learn a third, and far more efficient, way of implementing the Collatz conjecture. It involves rewriting the function in C++ and using the 'Rcpp' package in R to compile and run the function without ever leaving the R environment.
library("Rcpp")
One important difference between R and C++ is that when you write a C++ function, you need to declare your variable types. The C++ code chunk shown below creates a function called 'cpp_collatz' which takes an input of type 'IntegerVector' and whose output is of type 'IntegerVector'. Unlike in R, where explicit loops can slow your code down, loops in C++ are usually very efficient, even though they are tedious to write.
cpptxt <- '
IntegerVector cpp_collatz(IntegerVector ints) {
  IntegerVector iters(ints.size());
  for (int i = 0; i < ints.size(); i++) {
    int nn = ints(i);
    while (nn != 1) {
      if (nn % 2 == 0) nn /= 2;
      else nn = 3 * nn + 1;
      iters(i) += 1;
    }
  }
  return iters;
}'
cpp_collatz <- cppFunction(cpptxt)

set.seed(20)
cpp_collatz(sample(20))
 [1] 20 17  8 19  4 20  1  0  2  5  3 16  9  7 17  7 12  6 14  9
Let's now redo our C++ implementation in a slightly different way. We would rather not have C++ code interspersed with R code: not only does it make it hard to read, but we also won't be able to take advantage of syntax highlighting specific to C++ (among other annoyances). So let's store the C++ code in a file we call 'collatz.cpp' and use the 'sourceCpp' function in R to call it. Here is the content of 'collatz.cpp':
cat(paste(readLines(file("collatz.cpp")), collapse = "\n"))

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
int collatz(int nn) {
  int ii = 0;
  while (nn != 1) {
    if (nn % 2 == 0) nn /= 2;
    else nn = 3 * nn + 1;
    ii += 1;
  }
  return ii;
}

// [[Rcpp::export]]
IntegerVector cpp_collatz(IntegerVector ints) {
  IntegerVector iters(ints.size());
  for (int i = 0; i < ints.size(); i++) {
    iters(i) = collatz(ints(i));
  }
  return iters;
}

// [[Rcpp::export]]
IntegerVector sug_collatz(IntegerVector ints) {
  return sapply(ints, collatz);
}
There are three things worth mentioning about the above code chunk:
To compile the C++ code, just type the following in R:
sourceCpp("collatz.cpp")
Assuming that we have a C++ compiler installed, it will take a few seconds to run. We can now type the following into the R console:
cpp_collatz
function (ints) 
.Primitive(".Call")(<pointer: 0x000000006e042180>, ints)

cpp_collatz(1:10)  # seems to be working fine.
 [1]  0  1  7  2  5  8 16  3 19  6
sug_collatz(1:10)  # same as above.
 [1]  0  1  7  2  5  8 16  3 19  6
And now let's get to the reason we bothered with C++ in the first place: efficiency. There are four comparisons we're interested in:
We can think of the last case as being a hybrid approach. One reason to include it is because someone may share a complicated piece of C++ code that works but is not vectorized and vectorizing it may turn out to be a nontrivial task (programmers are supposed to be lazy after all!).
collatz_benchmark <- function(nums, ...) {
  require(rbenchmark)
  benchmark(
    cpp_collatz(nums),      # runs completely in C++
    sug_collatz(nums),      # runs completely in C++
    vec_collatz(nums),      # runs completely in R in a vectorized fashion
    sapply(nums, collatz),  # runs collatz in C++ but sapply in R
    columns = c("test", "replications", "elapsed", "relative"),
    ...
  )
}
Let's compare the two functions for all integers from 1 to 10^4:
collatz_benchmark(1:10^4, replications = 20)
                   test replications elapsed relative
1     cpp_collatz(nums)           20    0.03    1.000
4 sapply(nums, collatz)           20    0.53   17.667
2     sug_collatz(nums)           20    0.03    1.000
3     vec_collatz(nums)           20   51.36 1712.000
And for all integers from 1 to 10^5:
collatz_benchmark(1:10^5, replications = 20)
                   test replications elapsed  relative
1     cpp_collatz(nums)           20    0.45     1.071
4 sapply(nums, collatz)           20    5.76    13.714
2     sug_collatz(nums)           20    0.42     1.000
3     vec_collatz(nums)           20  753.08 1793.048
As we can see 'cpp_collatz' and 'sug_collatz' are almost identical when it comes to efficiency and both are far more efficient than 'vec_collatz', and increasingly so for larger sequences of integers. Also notice the relative efficiency of the hybrid approach compared to 'vec_collatz'.
Let's benchmark on a sample of 1000 integers from 1 to 10^6 for a more realistic comparison of four approaches:
set.seed(20)
collatz_benchmark(sample(1:10^6, 1000), replications = 100)  # will take a while to run!
                   test replications elapsed relative
1     cpp_collatz(nums)          100    0.05    1.667
4 sapply(nums, collatz)          100    0.23    7.667
2     sug_collatz(nums)          100    0.03    1.000
3     vec_collatz(nums)          100   31.15 1038.333
The results are remarkable: the C++ function 'sug_collatz' is the winner, with 'cpp_collatz' a close second. The slight advantage of 'sug_collatz' may be due to 'sapply' in C++ using a more efficient method for running the loop (such as iterators). Moreover, both functions are about 1000 times faster than 'vec_collatz'! Even though 'vec_collatz' was specifically written to take advantage of vectorization in R, it still pales in comparison to the might of C++. Even more surprising is that the hybrid approach is only about 8 times slower than the C++ approach, which is not bad at all. But if your "vectorization" consists of wrapping a function in 'sapply', then it's best to do it in C++ as we did with 'sug_collatz' than to do it in R.
One takeaway lesson for us is that instead of spending a lot of time and effort making our R code as efficient as possible, it may be worth investing that time in learning how to code in a language like C++, especially since packages like 'Rcpp' and the 'Rcpp sugar' extension give us the best of both worlds. Another takeaway is that hybrid solutions are a good alternative in cases when our C++ code is efficient but not vectorized.
by Thomas Dinsmore
Regular readers of this blog may be familiar with our ongoing effort to benchmark Revolution R Enterprise (RRE) across a range of use cases and on different platforms. We take these benchmarks seriously at Revolution Analytics, and constantly seek to improve the performance of our software.
Previously, we shared results from a performance test conducted by Allstate. In that test, RRE ran a GLM analysis in five minutes; SAS took five hours to complete the same task. A reader objected that the test was unfair because SAS ran on a single machine, while RRE ran on a five node cluster. It's a fair point, except that given the software in question (PROC GLM in SAS/STAT) the performance would be the same on five nodes or a million nodes, since PROC GLM can scale up but not out.
Arguing that the Allstate benchmark was "apples to oranges", SAS responded by publishing its own apples-to-oranges benchmark. In this benchmark, SAS demonstrated that its new HPGENSELECT procedure is very fast when it runs on a 144-node grid with 2,304 cores. As noted in the paper, this performance is only possible if you license more software, since HPGENSELECT can only run in Distributed mode if the customer licenses SAS High Performance Statistics.
We will be happy to stipulate that PROC HPGENSELECT runs faster on 2,304 cores than RRE on 20 cores.
As a matter of best practices, software benchmarks should run in comparable hardware environments, so that we can attribute performance differences to the software alone and not to differences in available computing resources. Consequently, we engaged an outside vendor with experience running SAS in clustered environments to perform an "apples to apples" benchmark of RRE vs. SAS. The consultant used a clustered computing environment consisting of five four-core commodity servers (with 16GB RAM each) running CentOS, Ethernet connections and a separate NFS server.
We tested RRE 7 versus SAS Release 9.4, with Base SAS, SAS/STAT and SAS Grid Manager. (We did not test with SAS High Performance Statistics because we could find no vendors with experience using this new software. We note that more than two years into General Availability, SAS appears to have no public reference customers for this software.) In our experience, when customers ask how we perform compared to SAS, they are most interested in how we compare with the SAS software they already use.
To test Revolution R Enterprise ScaleR, we first deployed IBM Platform LSF and Platform MPI Release 9 on the grid, then installed Revolution R Enterprise Release 7 on each node. SAS Grid Manager uses an OEM version of IBM Platform LSF that cannot run concurrently with the standard version from IBM, so we configured the environment and ran the tests sequentially.
To simplify test replication across different environments, we used data manufactured through a random process. The time needed to manufacture the data is not included in the benchmark results. Prior to running the actual tests, we loaded the randomized data into each software product’s native file system: for SAS, a SAS Data Set; for Revolution R Enterprise, an XDF file.
Although we have benchmarked Revolution R Enterprise on data sets as large as a billion rows, typical data sets used by even the largest enterprises tend to be much smaller. We chose to perform the tests on wide files of 591 columns and row counts ranging from 100,000 to 5,000,000, file sizes that represent what we consider to be typical for many analysts. We also ran scoring tests on “narrow” files of 21 columns with row counts ranging up to 50,000,000.
Rather than comparing performance on a single task, we prepared a list of multiple tasks, then wrote programs in SAS and RRE to implement the tests. Readers will find the benchmarking scripts here on GitHub, together with a script to produce the manufactured data.
To implement a fair test, we asked the SAS consultant to review the SAS programs and enable them for best performance in the clustered computing environment.
Detailed results of the benchmark test are shown here, in our published white paper.
We invite readers to run the scripts in their own environments and to let us know the results they achieve.
Revolution Analytics Whitepapers: Revolution R Enterprise: Faster Than SAS
by Joseph Rickert
The seven lightning talks presented to the Bay Area useR Group on Tuesday night were not only really interesting (in some cases downright entertaining) in their own right, but they also illustrated the diversity of R applications and the extent to which R has become embedded in the corporate world. Two presentations with a whimsical touch were Gaston Sanchez’s talk on Arc Diagrams with R and Ram Narasimhan’s presentation comparing the weather of various cities. Gaston showed a statistical text analysis of the movie scripts from three Star Wars episodes using arc diagram representations. Gaston did some original work here in creating the arc diagram plots and showed how to use R’s tm and igraph packages to extract text and compute adjacency matrices. The Star Wars analysis code and the arcdiagram code are both available.
Ram’s talk was based on his weather data package (V0.3 on CRAN and V0.4 at GitHub) which has become a very useful and popular tool for scraping weather data from airports and weather stations around the world. The following plot shows how various cities rank according to his wife’s personal comfort score.
Also have a look at Ram’s Shiny app next time you are wondering whether you should visit San Francisco or Honolulu.
Presentations from Sara Brumbaugh on Running R from Excel, Winston Chen on Data Analysis with RStudio and MongoDB, and Cliff Click and Nidhi Mehta on Using H2O with R all made cases for integrating R with other corporate tools. Sara showed how to combine R scripts and Excel VBA code to pass inputs and parameters from a worksheet to a batch process, and back again. She showed several practical examples as well as quite a few virtuoso Excel tricks, like storing an R script in a hidden Excel worksheet.
Winston’s talk emphasized how R’s visualization capabilities alone are enough to earn it a place in a big-league machine learning shop. The platform stack at Winston’s company, Fliptop, is built around Java/Scala, MongoDB/MySQL and Python. But with all of that power they still didn’t have a good way to do data visualization and exploratory data analysis. Winston showed some examples, with code, of how they use RStudio to pull data from MongoDB into an R data frame where they can plot it.
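Winston’s exact code wasn’t shared; a minimal sketch of the same idea, here using the mongolite package with made-up database, collection, and field names, might look like this:

```r
# Minimal sketch of pulling MongoDB documents into an R data frame for
# plotting. Uses the mongolite package; the database, collection, and
# field names below are made up for illustration.
library(mongolite)

con <- mongo(collection = "events", db = "analytics",
             url = "mongodb://localhost")

# find() flattens the matching documents into an R data frame
events <- con$find(query  = '{"type": "signup"}',
                   fields = '{"date": 1, "count": 1}')

plot(events$date, events$count, type = "l",
     xlab = "Date", ylab = "Signups")
```

Once the documents land in a data frame, all of R’s usual plotting and exploratory tools apply unchanged.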
Cliff, 0xdata’s CTO, gave a succinct overview of how the H2O JVM can free R from its memory and speed limitations and make it possible to run machine learning algorithms from the R environment on huge data sets. According to Cliff, if you built a 16-node cluster of machines, each with 64 GB of RAM and all running H2O, you could have a terabyte cluster for H2O’s in-memory analytics and run logistic regression, gbm, neural nets, random forests and other machine learning algorithms through the R to H2O interface. Cliff emphasized that H2O implements a "group-by" feature that works very much like plyr’s ddply function, making it possible to do R-style analyses on big data. Nidhi followed up by running several of the examples that can be found on the 0xdata website. Nidhi showed real grace under pressure, and made the speed of the H2O algorithms seem all the more impressive by running live demos one after the other while the clock on the 12-minute presentation time limit was running out.
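For readers unfamiliar with the ddply idiom that H2O mirrors, here is the small-data version in plain plyr; the column names are illustrative:

```r
# The plyr-style "group-by" that H2O's feature mirrors: split a data
# frame by a key, apply a summary, and recombine the results.
# Column names here are illustrative.
library(plyr)

flights <- data.frame(
  carrier = c("AA", "AA", "UA", "UA", "UA"),
  delay   = c(12, 3, 25, 0, 7)
)

ddply(flights, .(carrier), summarise,
      mean_delay = mean(delay),
      n          = length(delay))
```

H2O’s version runs the same split-apply-combine pattern across the cluster’s memory rather than in a single R session.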
Finally, two presentations, the first by Raman Kapur on Managing Enterprise Cyber Risk through Big Data & Analytics, and the second by Giovanni Seni on Intuit’s new Rego package, showed how R applications can form the foundation of a production system. After providing some background information on the prevalence of information security breaches, Raman talked about how Foundation’s Edge has built Avana, an R-based system to model the risk profile of a corporation’s business units.
Giovanni gave a brief introduction to the rule-based ensemble methods developed by Friedman and Popescu and worked through an example using the Rego package, which is newly available on GitHub. Giovanni, who has considerable experience with ensemble methods (have a look at the book he wrote with John Elder), said that he favors rule-based methods because of their interpretability. He stressed that in addition to building predictive models, data scientists are often seeking insight into how complex systems work. Rule-based ensemble models are useful for both purposes, often outperforming tree-based classifiers for prediction. A notable feature of the Rego package is that it has a command-line, batch interface. Here we have an R package that is meant to do the heavy lifting in a production system.
key link: BARUG presentations
by Joseph Rickert
Recently, I had the opportunity to be a member of a job panel for Mathematics, Economics and Statistics students at my alma mater, CSUEB (California State University East Bay). In the context of preparing for a career in data science, a student at the event asked: “Where can I find good data sets?”. This triggered a number of thoughts, the first being that it was time to update the list of data sets that I maintain and blog about from time to time. So, thanks to that reminder, I have added a few new links to the page, including a new section called Data Science Practice that links to some of the data sets used as examples in Doing Data Science by Rachel Schutt and Cathy O’Neil. Additionally, I have provided a direct link to the Big Data tag on Infochimps and pointed out that multiple song data sets are available.
However, to do justice to the student’s question, it is necessary to give some thought to exactly what a “good” practice data set might look like. Here are three characteristics that I think a practice data set should have to be good:
Here are three data sets that meet these criteria in ascending order of degree of difficulty:
The first suggestion is the MovieLens data set, which contains 10 million ratings applied to over 10,000 movies by more than 71,000 users. The download comes in two sizes: the full set and a 100K subset. Both versions require working with multiple files.
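As a starting point, the 100K subset keeps its ratings in a tab-separated file with no header row; a sketch of reading it, assuming the unpacked download directory, looks like this:

```r
# Sketch: read the 100K MovieLens subset. The ratings live in a
# tab-separated file (u.data) with no header; the path assumes the
# unpacked download directory.
ratings <- read.table("ml-100k/u.data", sep = "\t",
                      col.names = c("user", "item", "rating", "timestamp"))

# Ratings are integers 1-5; a quick distribution check:
table(ratings$rating)
```

Joining these ratings against the separate movie and user files is exactly the multi-file practice the data set is good for.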
Near the top of anybody’s list of practice data sets, and second on my little list because of degree of difficulty, is the airlines data set from the 2009 ASA challenge. This data set, which contains the arrival and departure information for all domestic flights from 1987 to 2008, has become the “iris” data set for Big Data. With over 123M rows it is too big to fit into your laptop’s memory, and with 29 variables of different types it is rich enough to suggest several analyses. Moreover, although the version of the data set maintained on the ASA website is fixed, and therefore perfect for benchmarking, the Research and Innovative Technology Administration Bureau of Transportation Statistics continues to add to the data on a monthly basis. Go to RITA to get all of the data collected since the ASA competition ended.
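Even though the full file will not fit in memory, many summaries can be computed one chunk at a time; a sketch for a single year’s CSV, with the chunk size and the 15-minute delay threshold chosen only for illustration, might look like this:

```r
# Sketch: tally total and delayed flights from one year of the expo data
# without loading the whole CSV. Column names follow the ASA/RITA files;
# the chunk size and 15-minute delay cutoff are illustrative choices.
path <- "2008.csv"
con  <- file(path, "r")
hdr  <- strsplit(readLines(con, n = 1), ",")[[1]]   # consume header row

counts <- c(total = 0, delayed = 0)
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, col.names = hdr, nrows = 100000),
    error = function(e) NULL)    # read.csv errors at end of input
  if (is.null(chunk)) break
  dep <- suppressWarnings(as.numeric(chunk$DepDelay))
  counts["total"]   <- counts["total"]   + sum(!is.na(dep))
  counts["delayed"] <- counts["delayed"] + sum(dep > 15, na.rm = TRUE)
  if (nrow(chunk) < 100000) break
}
close(con)
counts
```

The same pattern (read a chunk, update running statistics, discard the chunk) scales to the full 123M-row file, which is precisely what makes this data set such good out-of-memory practice.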
Last on my short list is the Million Song data set. This contains features and metadata for one million songs originally provided by the music intelligence company Echo Nest. The data is in the specialized HDF5 format, which makes it somewhat of a challenge to access. The data set maintainers do provide wrapper functions to facilitate downloading the data and avoiding some of the complexities of the HDF5 format. However, there are no R wrappers! The last time I checked, the maintainers had a paragraph about there being a problem with their code, along with an invitation for R experts to contact them. (This would clearly be for extra points.) For more details about the contents of the data set, look here.
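Lacking official R wrappers, one way to get started is a generic HDF5 reader; a sketch using the Bioconductor rhdf5 package follows, where the example file name and the internal group path come from the data set’s documented layout and should be verified against the h5ls() listing:

```r
# Sketch: read one Million Song Dataset .h5 file with the Bioconductor
# rhdf5 package (not an official wrapper). The file name is one example
# song file; the group path "/metadata/songs" follows the data set's
# documented layout and should be checked against h5ls() output.
library(rhdf5)

f <- "TRAXLZU12903D05F94.h5"
h5ls(f)                        # inspect the file's structure first

meta <- h5read(f, "/metadata/songs")
meta$title
meta$artist_name
```

Working out the rest of the layout this way, one group at a time, is a reasonable path toward the R wrappers the maintainers are asking for.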
As a final note, it is much easier to use R to analyze the Public Data Sets available through Amazon Web Services now that you can run Revolution R Enterprise in the Amazon Cloud. We hope to have more to say about exactly how to go about doing this in a future post. However, everything you need to get started is in place, including a 14-day free trial (Amazon charges apply) for Revolution R Enterprise. All you need is your own Amazon account.
Please let me know if you have additional links to useful, publicly available data sets that I have missed. We very much appreciate the contributions blog readers have made to the list of data sets.