
July 18, 2011



What if we had a cluster of Linux machines rather than Windows?

@Matt, that's next on the Revolution roadmap. Here's a question for you and other readers: what job control / scheduling software would we need to support to make this useful to you on a Linux-based cluster?

@David: If I'm not mistaken, at all of the (academic) institutions I've been in, the job queue for the cluster was managed by Sun/Oracle Grid Engine.

There is also a fairly large community using the SLURM stack distributed by LLNL (https://computing.llnl.gov/linux/slurm/), which is highly scalable and runs on a number of machines in the Top500 list.

We've used SLURM quite successfully with R on a small cluster, just using the Rscript scripting executable. See http://pscluster.berkeley.edu for more information or ping me if you want details.
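For readers who haven't combined the two before, the approach described above amounts to wrapping an Rscript call in a SLURM batch script. A minimal sketch follows; the script name, file names, and resource values are illustrative, not taken from the pscluster setup:

```shell
#!/bin/bash
#SBATCH --job-name=r-analysis        # name shown in squeue
#SBATCH --ntasks=1                   # one R process per job
#SBATCH --cpus-per-task=4            # cores available to that process
#SBATCH --mem=8G                     # memory limit for the job
#SBATCH --time=01:00:00              # wall-clock limit (HH:MM:SS)
#SBATCH --output=r-analysis-%j.log   # %j expands to the SLURM job ID

# Run the R analysis non-interactively with the Rscript executable
Rscript analysis.R
```

Saved as (say) `run_analysis.sh`, this would be submitted with `sbatch run_analysis.sh`. Many independent runs can be launched as a job array (`sbatch --array=1-100 run_analysis.sh`), with each task reading the `SLURM_ARRAY_TASK_ID` environment variable to select its input.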

I have heard a lot of good things about the Condor job scheduler (http://www.cs.wisc.edu/condor/), though I have no personal experience with it. I know it supports MPI, which is really nice.

The article looks misleading.

First, SAS high-performance logistic regression can easily handle tens of thousands of variables. A demo using data with just a few variables does not mean it can only handle data of that kind.

Second, the SAS high-performance suite can run on all kinds of cluster infrastructure. Being able to run with Greenplum for alongside-the-database analysis is just a plus, which shows its capability for handling very large problems on very advanced hardware.

I'm Sue Ranney and I created the video in the blog post. SAS isn't very forthcoming with benchmarks, so we can only make comparisons with what's public. The article cited in the blog provides two pieces of information on SAS's performance:

In early April, SAS demonstrated the power of high performance analytics at its Global Forum meeting. In the first case, two racks (16 nodes) of Greenplum's Data Computing Appliance (DCA) were used to run a logistic regression of bank loan defaults across a database with a billion records, applying just a few variables. The regression was able to complete in less than 80 seconds (as compared to 20 hours for an unspecified serial implementation). Another demonstration, this time on a 24-node Teradata platform, used 1,800 variables applied to 50 million observations. In this case, the analysis finished in 42 seconds.

My video provides an example of RevoScaleR on 5 nodes of commodity hardware with a total of 20 cores (total cost about $5K). Running a logistic regression with over a billion records, applying just a few variables, takes under 80 seconds. On the same cluster, we have also run examples of linear and logistic regressions with over 1,800 variables applied to about a billion observations (instead of just 50 million), with timings of about 12 seconds and 112 seconds, respectively.

My implementation result: 10,000,000 rows, 10 characteristics, C#, 18 seconds, on an Intel Core i5-2410M.

I'm sorry, I'm a newbie) I don't have enough memory for a billion records((((

