
October 25, 2012

Listed below are links to weblogs that reference Allstate compares SAS, Hadoop and R for Big-Data Insurance Models:

Comments

Since a Poisson regression was used, it would be nice to also see, as a benchmark, the computational speed of the biglm package in open-source R. Just import your CSV into SQLite and run biglm to obtain your Poisson regression. biglm also loads the data into R in chunks to update the model, so it looks more similar to the RevoScaleR setup than just running plain glm in R.
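To make the chunked idea concrete, here is a minimal sketch in plain Python of fitting a Poisson regression while only ever holding one chunk of rows in memory. It does not use biglm or RevoScaleR and makes no claim to reproduce their internals; it simply re-reads the data once per iteration of iteratively reweighted least squares and accumulates the small matrices X'WX and X'Wz chunk by chunk. The names `read_chunks`, `fit_poisson_chunked`, `solve`, the column names `x` and `y`, and the tiny in-memory CSV are all assumptions made up for illustration.

```python
import csv
import io
import math

# Hypothetical toy data: a count response y and a single predictor x.
CSV_TEXT = "x,y\n0,1\n0,2\n1,2\n1,4\n2,5\n2,7\n"


def solve(a, b):
    """Solve a small dense system a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def read_chunks(open_source, chunk_size):
    """Yield lists of (design_row, y); only one chunk is held in memory at a time."""
    with open_source() as fh:
        chunk = []
        for rec in csv.DictReader(fh):
            chunk.append(([1.0, float(rec["x"])], float(rec["y"])))  # intercept + x
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk


def fit_poisson_chunked(open_source, chunk_size=2, max_iter=50, tol=1e-10):
    """Poisson regression (log link) by IRLS, one full pass over the data per iteration."""
    beta = None
    for _ in range(max_iter):
        xtwx, xtwz = None, None
        for chunk in read_chunks(open_source, chunk_size):
            for x, y in chunk:
                if beta is None:
                    mu = y + 0.1          # starting values for the first pass
                    eta = math.log(mu)
                else:
                    eta = sum(b * xi for b, xi in zip(beta, x))
                    mu = math.exp(eta)
                z = eta + (y - mu) / mu   # working response; the weight is mu
                if xtwx is None:
                    p = len(x)
                    xtwx = [[0.0] * p for _ in range(p)]
                    xtwz = [0.0] * p
                for i in range(len(x)):
                    xtwz[i] += x[i] * mu * z
                    for j in range(len(x)):
                        xtwx[i][j] += x[i] * mu * x[j]
        new_beta = solve(xtwx, xtwz)
        converged = beta is not None and max(
            abs(a - b) for a, b in zip(new_beta, beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta


# open_source must return a fresh file-like object so every pass can re-read the data.
beta = fit_poisson_chunked(lambda: io.StringIO(CSV_TEXT), chunk_size=2)
print(beta)
```

For a real file, the callable could be something like `lambda: open("claims.csv", newline="")` (a hypothetical file name). Memory use depends on the chunk size and the number of coefficients rather than the number of rows, which is the property the comment above attributes to chunked fitting.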

How did the results compare with what you get from sampling?

Blaise:

Yes. Snedecor was right. This notion that every last one must be used is silly.

All of the comparisons use different hardware specifications. For example, the SAS comparison could also have been run on a 5-node LSF cluster configuration.

By the same token, a map-reduce program can be efficient or incredibly sloppy. So I am not surprised to see a single result > 10 hrs.

Very nice article on the use of Hadoop for insurance models.

Little is said about exactly what "Revolution R Enterprise" is for those of us who don't use it. However, it seems obvious that it is not a simple procedure like SAS' PROC GENMOD, which was used as the alternative process for comparison. I haven't started using the R language yet but am close. However, as a long-time expert-level SAS® DATA step programmer, I know for a fact that even with SAS® software, simply using some procedures "as is", you can easily run into serious RAM-related problems. The number of variables (columns) is very likely to get you into trouble faster than the number of observations (rows), unless you try to save data values in arrays (which would be poor programming with large data sets). Surprisingly, the author doesn't seem to care about that, but he probably has his reasons.

These RAM-related problems are relatively easy to solve within a macro system that combines the DATA step and the SAS® macro facility with the main procedure under consideration. It also comes in handy to temporarily save data on magnetic storage (hard drive), retrieve them when needed, and delete the temporary files afterwards. I strongly suspect this general scheme is possible in the R language, but I do not know for sure. It also seems obvious that "Revolution R Enterprise" is NOT a simple procedure but falls within the realm of what I described as "a macro system that uses the DATA step with the main procedure considered". I thus disagree with the conclusions of this presentation because they don't seem to compare apples to apples. Software developers can only go so far in solving any data analyst's problems. It is incumbent upon us to develop the expertise to get around whatever problems arise, and this can only be done through expert-developer programming.

