R's glm function for generalized linear modeling is very powerful and flexible: it supports all of the standard model types (binomial/logistic, Gamma, Poisson, etc.), and via the family argument you can in fact fit any distribution in the exponential family. But if you want to use it on a data set with millions of rows, and especially with more than a couple of dozen variables (or even just a few categorical variables with many levels), the computation quickly grows in time as the data gets larger, and may even exhaust the available memory.
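For anyone who hasn't used it, the call looks like this. Here's a minimal sketch fitting a Poisson model on simulated data (the data and variable names are made up purely for illustration):

# Simulate a small data set: a numeric predictor and a 5-level factor
set.seed(1)
df <- data.frame(
  x = rnorm(1000),
  g = factor(sample(letters[1:5], 1000, replace = TRUE))
)
df$y <- rpois(1000, lambda = exp(0.5 + 0.3 * df$x))

# The family argument selects the exponential-family distribution and link
fit <- glm(y ~ x + g, data = df, family = poisson(link = "log"))
summary(fit)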
The rxGlm function included in the RevoScaleR package in Revolution R Enterprise 6 has the same capabilities as R's glm, but is designed to work with big data, and to speed up the computation using the power of multiple processors and nodes in a distributed grid. In the analysis of census data in the video below, fitting a Tweedie model on 5 million observations and 265 variables takes around 25 seconds on a laptop. A similar analysis using 14 million observations, run on a 5-node Windows HPC Server cluster, takes just 20 seconds.
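We can't reproduce the demo code here, but a sketch of what such an rxGlm call looks like is below. The data set, variable names, and Tweedie variance power are illustrative assumptions, not the exact code from the demo; it assumes the RevoScaleR package is available:

library(RevoScaleR)

# censusData could be an in-memory data frame or an .xdf data source;
# F() tells RevoScaleR to treat a numeric variable as a factor on the fly
tweedieFit <- rxGlm(wages ~ sex + age + F(region),
                    data = censusData,
                    family = rxTweedie(var.power = 1.5))
summary(tweedieFit)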
This demonstration was part of last week's webinar on Revolution R Enterprise 6. If you're not familiar with Revolution R Enterprise, the first 10 minutes is an overview of the differences from open-source R, and the remaining 20 minutes describes the new features in version 6. Follow the link below to check out the replay.
Revolution Analytics webinars: 100% R and More: Plus What's New in Revolution R Enterprise 6.0
This is pretty cool. So, what's the reason for this big divergence? E.g., what makes xdf so much better? Besides xdf and parallelism, is there anything else? Does logistic regression scale in the same way? I'd love to see that in a future presentation.
Posted by: nick | June 29, 2012 at 09:48
I'm Sue Ranney and I ran the GLM timings on my laptop for the plot in this blog post. There are three main reasons for the big divergence: efficient use of memory (data is not copied unless absolutely necessary), efficient handling of categorical variables (saves memory and time), and efficient use of threads for parallelization. Logistic regression scales the same way. By the way, these timings were all done using in-memory data frames. The RevoScaleR analysis functions continue to scale up for huge data sets if data is stored in the efficient .xdf file format.
Posted by: Sue Ranney | July 10, 2012 at 10:25
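For readers curious about the .xdf workflow Sue mentions, here is a minimal sketch: import the raw data once into the .xdf format, then point the analysis functions at the file. The file names and variables are hypothetical, and it assumes RevoScaleR is installed:

library(RevoScaleR)

# One-time import: convert a large CSV to the .xdf binary format,
# which is processed in chunks rather than loaded whole into memory
rxImport(inData = "census.csv", outFile = "census.xdf", overwrite = TRUE)

# Analysis functions such as rxLogit accept the .xdf file directly
logitFit <- rxLogit(employed ~ sex + age + F(region),
                    data = "census.xdf")
summary(logitFit)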