You might think that doing advanced statistical analysis on Big Data is out of reach for those of us without access to expensive hardware and software. For example, back in April SAS proudly demonstrated running logistic regression on a billion records (and "just a few" variables) in less than 80 seconds. But that feat required some serious hardware: two racks of Greenplum's Data Computing Appliance (DCA). Each rack of the DCA has 16 servers, 192 Intel cores, and 768 GB of RAM, and pricing starts at $1 million. Add on SAS license fees for its High Performance Analytics suite, and you're talking serious money.
We're currently beta testing Revolution R Enterprise 5.0, which adds the ability to use a cluster of commodity machines running Windows Server to perform statistical analysis on huge data sets. In the video below, Revolution Analytics' Sue Ranney takes the beta for a spin and uses the RevoScaleR package to run a logistic regression on 1.2 billion records of data on our 5-node cluster:
For comparison, each of the five nodes in our cluster has an Intel Xeon E3-1230 quad-core processor (3.2 GHz, 8 MB cache), 16 GB of RAM, and a 1 TB hard drive. Total hardware cost: around $5,000. All the machines are running Windows Server 2008 with the Windows HPC Pack and Revolution R Enterprise 5.0 beta 1.
And the time for that 1.2 billion row regression? 75 seconds: just as fast, and at less than 1% of the hardware cost. See the details in the video linked below.
Revolution Analytics YouTube Channel: Logistic Regression in R with a Billion Rows
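For those curious what the code behind the demo looks like, here's a minimal sketch of the kind of RevoScaleR call involved. The file name, variable names, and compute-context arguments below are illustrative placeholders, not the exact settings used in the video:

    library(RevoScaleR)

    # Point RevoScaleR at the Windows HPC cluster. The argument values here
    # are placeholders; check the RxHpcServer documentation for your setup.
    cluster <- RxHpcServer(headNode = "cluster-head",
                           shareDir = "\\\\cluster-head\\share")
    rxSetComputeContext(cluster)

    # The data sit in RevoScaleR's .xdf format, which is processed in chunks,
    # so the 1.2 billion rows never need to fit in memory at once.
    loans <- RxXdfData("mortDefault.xdf")

    # Distributed logistic regression on a few predictors (names hypothetical)
    fit <- rxLogit(default ~ creditScore + houseAge + yrsEmploy + ccDebt + year,
                   data = loans)
    summary(fit)

Once the compute context is set, the same rxLogit call you would run on a laptop is farmed out across the nodes of the cluster; that is the main design point of RevoScaleR's compute contexts.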
What if we had a cluster of Linux machines rather than Windows?
Posted by: Matt | July 18, 2011 at 08:46
@Matt, that's next on the Revolution roadmap. Here's a question for you and other readers: what job control / scheduling software would we need to support to make this useful to you on a Linux-based cluster?
Posted by: David Smith | July 18, 2011 at 09:06
@David: If I'm not mistaken, at all of the (academic) institutions I've been in, the cluster's job queue was managed by Sun/Oracle Grid Engine.
Posted by: Steve Lianoglou | July 18, 2011 at 12:22
There is also a fairly large community using the SLURM stack distributed by LLNL (https://computing.llnl.gov/linux/slurm/), which is highly scalable and runs on a number of machines in the Top500 list.
We've used SLURM quite successfully with R on a small cluster, just using the Rscript scripting executable. See http://pscluster.berkeley.edu for more information or ping me if you want details.
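A minimal pattern (the paths, file names, and sbatch flags below are just placeholders) looks something like this:

    #!/usr/bin/env Rscript
    # analysis.R -- one chunk of the job per SLURM array task.
    # Submitted with something like:
    #   sbatch --array=1-20 --wrap="Rscript analysis.R"

    task <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "1"))

    # Each task fits its model on its own slice of the data.
    infile <- sprintf("data/chunk_%02d.csv", task)
    dat <- read.csv(infile)

    fit <- glm(default ~ ., data = dat, family = binomial())
    saveRDS(fit, sprintf("results/fit_%02d.rds", task))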
Posted by: Mark Huberty | July 19, 2011 at 09:59
I have heard a lot of good things about the job scheduler known as Condor (http://www.cs.wisc.edu/condor/), though I have no personal experience with it. I know it supports MPI, which is really nice.
Posted by: Chris | July 19, 2011 at 17:00
The article looks misleading.
First, SAS high-performance logistic regression is able to handle tens of thousands of variables very easily. The demo using data with just a few variables does not mean it can only handle data of that kind.
Second, the SAS high-performance suite can run on all kinds of cluster infrastructure. Being able to run with Greenplum for alongside-the-database analysis is just a plus, which shows its capability to handle very large problems with very advanced hardware.
Posted by: Alan Zhao | August 24, 2011 at 11:19
I'm Sue Ranney and I created the video in the blog post. SAS isn't very forthcoming with benchmarks, so we can only make comparisons with what's public. The article cited in the blog provides two pieces of information on SAS’ performance:
In early April, SAS demonstrated the power of high performance analytics at its Global Forum meeting. In the first case, two racks (16 nodes) of Greenplum's Data Computing Appliance (DCA) were used to run a logistic regression of bank loan defaults across a database with a billion records, applying just a few variables. The regression was able to complete in less than 80 seconds (as compared to 20 hours for an unspecified serial implementation). Another demonstration, this time on a 24-node Teradata platform, used 1,800 variables applied to 50 million observations. In this case, the analysis finished in 42 seconds.
My video provides an example of RevoScaleR on 5 nodes of commodity hardware with a total of 20 cores (total cost about $5K). Running a logistic regression with over a billion records, applying just a few variables, takes under 80 seconds. On the same cluster, we have also run examples of linear and logistic regressions with over 1,800 variables applied to about a billion observations (instead of just 50 million), with timings of about 12 seconds and 112 seconds, respectively.
Posted by: Sue Ranney | August 29, 2011 at 10:30
My implementation result:
10,000,000 rows, 10 characteristics, C#, 18 sec.
Intel Core i5-2410M
4 GB RAM
Posted by: Kurov Dmitro | July 24, 2013 at 00:12
I'm sorry, I'm a newbie. :)
I don't have enough memory for a billion records. :(
Posted by: Kurov Dmytro | July 24, 2013 at 00:18