At the Strata conference in New York today, Steve Yun (Principal Predictive Modeler at Allstate's Research and Planning Center) described the various ways he tackled the problem of fitting a generalized linear model to 150M records of insurance data. He evaluated several approaches:
- Proc GENMOD in SAS
- Installing a Hadoop cluster
- Using open-source R (both on the full data set, and with sampling)
- Running the data through Revolution R Enterprise
Steve described each of the approaches as follows.
Approach 1 is current practice. (Allstate is a big user of SAS, but there's a growing contingent of R users.) Proc GENMOD takes around 5 hours to return results for a Poisson model with 150 million observations and 70 degrees of freedom. "It's difficult to be productive on a tight schedule if it takes over 5 hours to fit one candidate model!", Steve said.
Approach 2: It was hoped that installing a Hadoop cluster and running the model there would improve performance. According to Steve, "a lot of plumbing was required": this involved coding the matrix equations for iteratively-reweighted least squares as a map-reduce task and manually coding the factor variables as indicator columns in the design matrix. Unfortunately, each iteration took about 1.5 hours, with 5-10 iterations required for convergence. (Even then, there were problems with singularities in the design matrix.)
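To give a sense of the plumbing involved, here is the core iteratively-reweighted least squares update for a Poisson model in plain, in-memory R (a sketch on simulated data, not Allstate's map-reduce code). The two cross-product sums are what a map-reduce implementation has to accumulate block-by-block over the design matrix:

```r
# Sketch only: simulated data, small enough to fit in memory.
set.seed(1)
n <- 10000
X <- cbind(1, rnorm(n), rnorm(n))           # design matrix: intercept + 2 covariates
y <- rpois(n, exp(X %*% c(-1, 0.3, 0.5)))   # Poisson response with log link

beta <- rep(0, ncol(X))
for (iter in 1:25) {
  eta <- drop(X %*% beta)
  mu  <- exp(eta)                # inverse of the log link
  w   <- mu                      # IRLS weights for Poisson / log link
  z   <- eta + (y - mu) / mu     # working response
  # In a map-reduce version, X'WX and X'Wz are computed per data block
  # in the map step and summed in the reduce step before solving.
  XtWX <- crossprod(X, w * X)
  XtWz <- crossprod(X, w * z)
  beta_new <- drop(solve(XtWX, XtWz))
  if (max(abs(beta_new - beta)) < 1e-8) { beta <- beta_new; break }
  beta <- beta_new
}
beta   # should be close to coef(glm(y ~ X[, -1], family = poisson()))
```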
Approach 3: Perhaps installing R on a server with lots of RAM would help. (Because open-source R runs in-memory, you need RAM on the order of several times the size of the data to make it work.) Alas, not even a 250 GB server was sufficient: even after waiting three days, the data couldn't be loaded. Sampling the data down into 10 partitions was more successful, and allowed the use of the glmnet package with L1 regularization to automate the variable selection process. But each glmnet fit on a partition still took over 30 minutes, and Steve said it would be difficult for managers to accept a process that involved sampling.
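For flavour, here is a minimal sketch of one such per-partition fit (simulated data and hypothetical variable names, not Steve's actual code): a cross-validated, L1-regularized Poisson model with glmnet, where coefficients shrunk exactly to zero drop out of the model.

```r
# Sketch only: stand-in data for one of the 10 sampled partitions.
library(glmnet)

set.seed(2)
n  <- 50000
df <- data.frame(driver_age  = rnorm(n),
                 vehicle_age = rnorm(n),
                 territory   = factor(sample(LETTERS[1:5], n, replace = TRUE)))
x  <- model.matrix(~ driver_age + vehicle_age + territory, df)[, -1]  # drop intercept
y  <- rpois(n, exp(-1 + 0.2 * x[, "driver_age"]))

# Lasso (alpha = 1) Poisson fit with lambda chosen by cross-validation;
# predictors whose coefficients are exactly zero are effectively de-selected.
cvfit <- cv.glmnet(x, y, family = "poisson", alpha = 1)
coef(cvfit, s = "lambda.1se")
```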
Approach 4: Steve turned to Revolution Analytics' Joe Rickert to evaluate how long the same model would take using the big-data RevoScaleR package in Revolution R Enterprise. Joe loaded the data onto a 5-node cluster (20 cores total), and used the distributed rxGlm function, which was able to process the data in 5.7 minutes. Joe demonstrated this process live during the session.
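For readers who haven't seen RevoScaleR, a call along these lines gives the flavour of rxGlm (a sketch with hypothetical file and variable names, not Joe's actual script; it requires Revolution R Enterprise, and the compute-context setup for an LSF cluster is site-specific, so a local parallel context stands in here):

```r
# Sketch only: requires Revolution R Enterprise; names are hypothetical.
library(RevoScaleR)

# Stand-in compute context; the live demo distributed the work across a
# 5-node LSF cluster.
rxSetComputeContext(RxLocalParallel())

# rxGlm streams the .xdf file in blocks rather than loading it all into RAM.
fit <- rxGlm(claims ~ driver_age + vehicle_age + territory,
             data   = "policies.xdf",   # data previously imported with rxImport()
             family = poisson(link = "log"))
summary(fit)
```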
So in summary, here's how the four approaches fared:
Approach | Platform | Time to fit
--- | --- | ---
1: SAS | 16-core Sun Server | 5 hours
2: rmr / map-reduce | 10-node (8 cores/node) Hadoop cluster | > 10 hours
3: Open source R | 250 GB Server | Impossible (> 3 days)
4: RevoScaleR | 5-node (4 cores/node) LSF cluster | 5.7 minutes
That's quite a difference! So what have we learned:
- SAS works, but is slow.
- It's possible to program the model in Hadoop, but it's even slower.
- The data is too big for open-source R, even on a very large server.
- Revolution R Enterprise gets the same results as SAS, but about 50x faster.
Steve and Joe's slides and video of the presentation, Start Small Before Going Big, will be available on the Strata website in due course.
Since a Poisson regression was used, it would be nice to also see, as a benchmark, the computational speed when using the biglm package in open-source R. Just import your CSV into SQLite and run bigglm() to obtain your Poisson regression. biglm also loads the data into R in chunks in order to update the model, so that looks more similar to the RevoScaleR setup than just running plain glm in R.
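For example, a minimal sketch of that kind of chunked fit with bigglm() (hypothetical file and column names; with factor predictors you would also need to keep the levels consistent across chunks):

```r
library(biglm)

# bigglm() accepts a function as its data argument: it calls it with
# reset = TRUE to rewind, then with reset = FALSE for successive chunks
# (returning NULL once the data is exhausted).
make_chunk_reader <- function(path, chunksize = 1e6) {
  con <- NULL
  function(reset = FALSE) {
    if (reset) {
      if (!is.null(con)) close(con)
      con <<- file(path, open = "r")
      readLines(con, n = 1)            # skip the header row
      return(invisible(NULL))
    }
    chunk <- tryCatch(
      read.table(con, nrows = chunksize, sep = ",",
                 col.names  = c("claims", "driver_age", "vehicle_age"),
                 colClasses = c("integer", "numeric", "numeric")),
      error = function(e) NULL)        # read.table errors once the file is exhausted
    if (is.null(chunk) || nrow(chunk) == 0) NULL else chunk
  }
}

fit <- bigglm(claims ~ driver_age + vehicle_age,
              data   = make_chunk_reader("policies.csv"),
              family = poisson(link = "log"))
summary(fit)
```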
Posted by: Jan | October 26, 2012 at 01:01
How did the results compare with what you get from sampling?
Posted by: Blaise F Egan | October 26, 2012 at 01:44
Blaise:
Yes. Snedecor was right. This notion that every last observation must be used is silly.
Posted by: Robert Young | October 26, 2012 at 05:24
All of the comparisons use different hardware specifications. For example, the comparison could have been run with SAS using a 5-node LSF cluster configuration as well.
By the same token, a map-reduce program can be efficient or incredibly sloppy, so I am not surprised to see a single result of > 10 hours.
Posted by: Ralph Winters | October 26, 2012 at 08:39
Very nice article on the use of Hadoop for an insurance model.
Posted by: srini | November 01, 2012 at 03:02
Little is said about exactly what "Revolution R Enterprise" is for those of us who don't use it. However, it seems obvious that it is not a simple procedure like SAS's PROC GENMOD, the alternative used for comparison. I haven't started using the R language yet, but am close. However, as a long-time expert-level SAS® DATA step programmer, I know for a fact that even with SAS® software, simply using some procedures "as is", you can easily run into serious RAM-related problems. The number of variables (columns) is very likely to get you into trouble faster than the number of observations (rows), unless you try to save data values in arrays (which would be poor programming with large data sets). Surprisingly, the author doesn't seem to care about that, but he probably has his reasons.
These RAM-related problems are relatively easily solved with a macro system that wraps the main procedure being considered in the DATA step and the SAS® macro facility; temporarily saving data to disk, retrieving it when needed, and deleting these temporary files always comes in handy as well. I strongly suspect this general scheme is also possible in the R language, but do not know for sure. It also seems obvious that "Revolution R Enterprise" is NOT a simple procedure, but falls within the realm of what I described as "a macro system that wraps the main procedure being considered". I thus disagree with the conclusions of this presentation because they don't seem to compare apples to apples. Software developers can only go so far in solving any data analyst's problems. It is incumbent upon us to develop the expertise to get around whatever problems arise, and this can only be done through expert-level programming.
Posted by: Bikila bi Gwet | May 09, 2013 at 09:48
Could you also share information on the following -
1) Quality of the resulting model fit
2) Number of iterations used by RevoScaleR rxGlm
Also,
3) How many categorical variables and how many numerical
4) Average number of levels per categorical variable
5) Maximum number of levels for a categorical variable
Posted by: Anand S | July 09, 2013 at 21:10
I have seen that SAS released, in June 2013, a procedure called HPGENSELECT: it is almost a high-performance version of GENMOD.
Would it be possible to include this new procedure in the benchmark?
Posted by: Collet Jérôme | July 17, 2013 at 00:34
I work with both SAS and R. I especially like melding the two. I also suspect that if you had given the project to SAS, you would have had different results. I work on a SAS/Grid and would put that against any engine you can name. I would disagree with the results. You are testing under different environments.
Posted by: Mark Ezzo | August 15, 2013 at 06:55