October 25, 2012



Since a Poisson regression was used, it would be nice to also see, as a benchmark, the computational speed of the biglm package in open-source R. Just import your CSV into SQLite and run biglm to fit the Poisson regression. biglm also loads the data into R in chunks to update the model, so it looks more similar to the RevoScaleR setup than just running plain glm in R.
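
To make the chunked-update idea concrete, here is a minimal sketch in Python (standard library only), not biglm's or RevoScaleR's actual implementation. It assumes a hypothetical SQLite table with numeric columns `x` and `y`, a single covariate, and a log link; the function name `fit_poisson_chunked` and the table name `claims` are made up for illustration. Each pass streams rows in chunks and accumulates the weighted normal equations of iteratively reweighted least squares, so only one chunk is in memory at a time.

```python
import math
import sqlite3


def fit_poisson_chunked(conn, table, chunk_size=1000, max_iter=50, tol=1e-10):
    """Fit log(E[y]) = b0 + b1 * x by IRLS, streaming rows from SQLite.

    Only `chunk_size` rows are held in memory at a time; each pass
    accumulates the 2x2 weighted normal equations X'WX and X'Wz.
    Assumption: `table` is a trusted name with numeric columns x and y.
    """
    # Start from the intercept-only fit: b0 = log(mean(y)), b1 = 0.
    (mean_y,) = conn.execute(f"SELECT AVG(y) FROM {table}").fetchone()
    b0, b1 = math.log(mean_y), 0.0

    for _ in range(max_iter):
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        cursor = conn.execute(f"SELECT x, y FROM {table}")
        while True:
            rows = cursor.fetchmany(chunk_size)  # one chunk at a time
            if not rows:
                break
            for x, y in rows:
                eta = b0 + b1 * x          # linear predictor
                mu = math.exp(eta)         # fitted mean; also the IRLS weight
                z = eta + (y - mu) / mu    # working response
                s_w += mu
                s_wx += mu * x
                s_wxx += mu * x * x
                s_wz += mu * z
                s_wxz += mu * x * z

        # Solve the 2x2 system [[s_w, s_wx], [s_wx, s_wxx]] b = [s_wz, s_wxz].
        det = s_w * s_wxx - s_wx * s_wx
        new_b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det

        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1


if __name__ == "__main__":
    # Hypothetical demo data: y follows exp(0.5 + 0.3 * x) exactly.
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE claims (x REAL, y REAL)")
    demo.executemany(
        "INSERT INTO claims VALUES (?, ?)",
        [(i / 10, math.exp(0.5 + 0.3 * i / 10)) for i in range(50)],
    )
    print(fit_poisson_chunked(demo, "claims", chunk_size=7))
```

The data are re-read once per iteration, which is the trade-off of this style of fitting: memory use stays bounded by the chunk size, at the cost of repeated passes over the table.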

How did the results compare with what you get from sampling?


Yes. Snedecor was right. This notion that every last observation must be used is silly.

The comparisons all use different hardware specifications. For example, the comparison could just as well have been run with SAS using a 5-node LSF cluster configuration.

By the same token, a map-reduce program can be efficient or incredibly sloppy. So I am not surprised to see a single result > 10 hrs.

Very nice article on the use of Hadoop for an insurance model.


Little is said about exactly what "Revolution R Enterprise" is, for those of us who don't use it. However, it seems obvious that it is not a simple procedure like SAS's PROC GENMOD, which was used as the alternative process for comparison. I haven't started using the R language yet but am close. However, as a long-time expert-level SAS® DATA step programmer, I know for a fact that even with SAS® software, simply using some procedures "as is" can easily lead to serious RAM-related problems. The number of variables (columns) is very likely to get you into trouble faster than the number of observations (rows), unless you try to save data values in arrays (which would be poor programming with large data sets). Surprisingly, the author doesn't seem to care about that, but he probably has his reasons.

These RAM-related problems are relatively easily solved within a macro system that combines the DATA step and the SAS® macro facility with the main procedure considered; temporarily saving data on magnetic storage (a hard drive), retrieving them when needed, and deleting these temporary files afterwards also always comes in handy. I strongly suspect this general scheme is possible with the R language but do not know for sure. It also seems obvious that "Revolution R Enterprise" is NOT a simple procedure but falls within the realm of what I described as "a macro system that uses the DATA step with the main procedure considered". I thus disagree with the conclusions of this presentation because they don't seem to compare apples to apples. Software developers can only go so far in solving any data analyst's problems. It is incumbent upon us to develop the expertise to get around whatever problems arise, and this can only be done through expert-developer programming.

Could you also share information on the following:

1) Quality results of the model obtained
2) Number of iterations used by RevoScaleR rxGlm
3) How many categorical variables and how many numerical
4) Average number of levels per categorical variable
5) Maximum number of levels for a categorical variable

I have seen that SAS released, in June 2013, a procedure called HPGENSELECT: it is almost a high-performance version of GENMOD.
Would it be possible to include this new procedure in the benchmark?

I work with both SAS and R. I especially like melding the two. I also suspect that if you had given the project to SAS, you would have had different results. I work on a SAS/Grid and would put that against any engine you can name. I would disagree with the results. You are testing under different environments.

The comments to this entry are closed.
