I've hinted this was coming a few times before, but with today's press release the announcement is official: the next release of Revolution R Enterprise will include "Big Data" capabilities thanks to the new RevoScaleR package. We're pretty excited at how it's turned out: it's kinda amazing to be able to use R's formula syntax like this:
arrDelayLm2 <- rxLinMod(ArrDelay ~ DayOfWeek:F(CRSDepTime))
and be able to do a regression on 120+ million rows (more than 13 GB) of data in just a few seconds using an ordinary laptop. With the more powerful multicore machines now on the market, the parallel processing algorithms in RevoScaleR really scream. You can see some of the details of how the RevoScaleR package works in the white paper "Big Data Analysis with Revolution R Enterprise", and I'll be giving a presentation about the new Big Data capabilities of Revolution R Enterprise in a live webcast on August 25.
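To give a feel for why a regression on data too big for memory is feasible at all, here is a minimal sketch in plain R of one common approach: read the data a chunk at a time, accumulate the cross-products X'X and X'y, and solve for the coefficients at the end. This is an illustration under my own assumptions, not the RevoScaleR implementation; the simulated data, the DepHour column, and the simplified formula are all made up for the example.

```r
# Sketch only: chunked least squares in base R.
# Not RevoScaleR code; the data and the DepHour column are simulated.
set.seed(42)
n <- 10000
dat <- data.frame(
  DayOfWeek = factor(sample(c("Mon", "Tue", "Wed"), n, replace = TRUE)),
  DepHour   = runif(n, 0, 24)
)
dat$ArrDelay <- 5 + 2 * (dat$DayOfWeek == "Tue") + 0.5 * dat$DepHour + rnorm(n)

form <- ArrDelay ~ DayOfWeek + DepHour

# Walk through the rows in blocks, as if each block were read from disk,
# and accumulate the cross-products needed for the normal equations.
chunk_size <- 1000
xtx <- 0
xty <- 0
for (s in seq(1, n, by = chunk_size)) {
  rows  <- s:min(s + chunk_size - 1, n)
  chunk <- dat[rows, ]
  X <- model.matrix(form, data = chunk)  # factor levels are kept per chunk
  y <- chunk$ArrDelay
  xtx <- xtx + crossprod(X)     # running sum of X'X
  xty <- xty + crossprod(X, y)  # running sum of X'y
}

# Solve (X'X) beta = X'y once all chunks have been seen
beta_chunked <- drop(solve(xtx, xty))

# Compare with the ordinary in-memory fit
beta_full <- coef(lm(form, data = dat))
print(cbind(chunked = beta_chunked, full = beta_full))
```

Only the small cross-product matrices stay in memory, so the memory needed depends on the number of coefficients rather than the number of rows. The chunks are also independent of one another, which is what makes this kind of computation a natural fit for multiple cores.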
One aspect of the announcement that really seems to have generated attention is how you can use Revolution R Enterprise with Hadoop to process super-massive data sets. (As I write this, before the press release is even on the wires, this article by Dave Rosenberg at CNET has already been retweeted over 100 times.) Hadoop and RevoScaleR complement each other well: like a freight train, Hadoop can do the heavy lifting of preprocessing a distributed data set to get it ready for statistical analysis, and then, like a race car, RevoScaleR fits the statistical model. We'll be coming out very soon with a white paper authored by Saptarshi Guha (author of the Rhipe integration between Hadoop and R) demonstrating how he used Hadoop to extract individual conversations from packet-level VOIP data, and then used RevoScaleR to perform a regression analysis on those calls. We'll have more information about that analysis here on the blog in the next couple of weeks.
Revolution Analytics: Revolutionary New Levels of Performance and Scalability for Big Data Analysis
Will this be available for standard R, or is this a proprietary Revolution R package?
Posted by: Ryan | August 03, 2010 at 06:34
RevoScaleR is part of Revolution R Enterprise, available to paying commercial subscribers (and free to academia).
Posted by: David Smith | August 03, 2010 at 06:45
Is it possible to get a copy of the AirlineData87to08 (in a standard format, say CSV) so we can try the equivalent examples on standard R to get a feel for how bad it would be with one of the usual packages?
Posted by: Allan Engelhardt | August 03, 2010 at 10:10
Oh, Google is my friend and I guess it is this data set you are using:
http://stat-computing.org/dataexpo/2009/the-data.html
Posted by: Allan Engelhardt | August 03, 2010 at 10:19
A very interesting post, and white paper describing the big data capabilities.
It would be interesting to see how rxCrossTabs and rxDataStepXdf compare to 'proc freq' and the 'data step' in SAS. Your system can beat SAS on price and also maybe on performance in manipulating big data sets.
The white paper describes the fitting of a linear and logistic regression with these huge sets. That is impressive, but would you really need millions of rows to estimate 5 to 10 model coefficients...?
Posted by: Longhow Lam | August 03, 2010 at 10:39
Any plans to do a Mac port for Revolution R Enterprise?
Posted by: Frank | August 03, 2010 at 11:57
I would love to see a Mac port for the Revolution R Enterprise edition as well.
Posted by: wireless headset | May 25, 2011 at 13:06