By Joseph Rickert
In a recent blog post, David Smith reported on a talk that Steve Yun and I gave at STRATA in NYC about building and benchmarking Poisson GLM models on various platforms. The results presented showed that the rxGlm function from Revolution Analytics' RevoScaleR package, running on a five-node cluster, outperformed a MapReduce/Hadoop implementation as well as an implementation of legacy software running on a large server. An alert R user posted the following comment on the blog:
As a poisson regression was used, it would be nice to also see as a benchmark the computational speed when using the biglm package in open source R? Just import your csv in sqlite and run biglm to obtain your poisson regression. Biglm also loads in data in R in chunks in order to update the model so that looks more similar to the RevoScaleR setup then just running plain glm in R.
This seemed like a reasonable, simple enough experiment. So we tried it. The benchmark results presented at STRATA were done on a 145 million record file, but as a first step I thought that I would try it on a 14 million record subset that I already had loaded on my PC, a quad-core Dell with an i7 processor and 8GB of RAM. It took almost an hour to build the SQLite database.
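The loading code itself isn't reproduced here, but it amounted to pushing the csv into SQLite a chunk at a time. A minimal sketch of that kind of load, with the file name, table name and chunk size as assumptions, looks like this:

# Sketch only: chunked csv-to-SQLite load; file name, table name and chunk size are assumptions
library(RSQLite)

DB <- file.path(getwd(), "AdataT2SQL")
conn <- dbConnect(dbDriver("SQLite"), dbname = DB)

infile <- file("AdataT2_10Pct.csv", open = "r")
header <- gsub('"', "", strsplit(readLines(infile, n = 1), ",")[[1]])
chunksize <- 100000
repeat {
  chunk <- tryCatch(read.csv(infile, header = FALSE, col.names = header,
                             nrows = chunksize, stringsAsFactors = FALSE),
                    error = function(e) NULL)          # NULL when the file is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break
  dbWriteTable(conn, "AT2_10Pct", chunk, append = TRUE, row.names = FALSE)
}
close(infile)
dbDisconnect(conn)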
. . . and then just a couple of lines of code to set up the connections and run the model.
# BIGGLM
library(biglm)
library(RSQLite)   # for dbConnect, dbDriver and dbGetQuery

#--------------------
# Set up the database connections
tablename <- "main.AT2_10Pct"
DB <- file.path(getwd(), "AdataT2SQL")
conn <- dbConnect(dbDriver("SQLite"), dbname = DB)
# 'formula' is the Poisson model formula used in the benchmark (defined earlier, not shown)
modelvars <- all.vars(formula)
query <- paste("select ", paste(modelvars, collapse = ", "), " from ", tablename)

#--------------------
# Run bigglm
gc()
system.time(model <- bigglm(formula = formula, data = dbGetQuery(conn, query),
                            family = poisson(), chunksize = 10000, maxit = 10))
Unfortunately, the model didn’t run to completion. The error messages returned were of the form:
# Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
#   contrasts can be applied only to factors with 2 or more levels
# In addition: There were 50 or more warnings (use warnings() to see the first 50)
# Timing stopped at: 186.39 80.81 470.33

warnings()
# 1: In model.matrix.default(tt, mf) : variable 'V1' converted to a factor
# 2: In model.matrix.default(tt, mf) : variable 'V2' converted to a factor
This error suggests that while chunking through the data, bigglm came across a character variable that needed to be converted into a factor. But since there was only one value for that variable in the chunk that was in memory, bigglm threw an error.
In general, factors present a significant challenge for external memory algorithms. Not only might an algorithm fail to create factor variables; even when the algorithm runs, there may be unanticipated consequences that cause big trouble downstream. For example, variations in text can cause attempts at automatic factor conversion to produce several versions of the same level. This, in turn, may make it impossible to merge files, or cause an attempt to predict results on a hold-out data set to fail because the factor levels are different. Even more insidiously, when hundreds of variables are involved in a model, an analyst might not notice a few bogus factor levels.
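A toy example (not the benchmark data) shows both failure modes: a chunk in which a factor collapses to a single level, and a hold-out set whose levels don't quite match the training data.

# Toy data, for illustration only
set.seed(1)
train <- data.frame(y = rpois(100, 2),
                    region = c("east", "west")[rbinom(100, 1, 0.5) + 1],
                    stringsAsFactors = FALSE)

# A chunk in which only one level is present reproduces the contrasts error above
one.level.chunk <- subset(train, region == "east")
try(glm(y ~ region, data = one.level.chunk, family = poisson()))
# Error: contrasts can be applied only to factors with 2 or more levels

# A hold-out set with a slightly different spelling of a level breaks prediction
fit <- glm(y ~ region, data = train, family = poisson())
holdout <- data.frame(region = c("east", "West "),   # note the stray capital and space
                      stringsAsFactors = FALSE)
try(predict(fit, newdata = holdout))
# Error: factor region has new levels "West "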
bigglm does not provide a mechanism for setting factor levels on the fly. In my opinion, far from being a fault, this was an intelligent design choice. rxGlm, RevoScaleR's function for building GLM models, does provide some capability to work with factors on the fly, but this is not recommended practice: too many things can go wrong. The recommended way to do things is to use RevoScaleR's rxFactors function on data stored in RevoScaleR's native .XDF file format. rxFactors gives the user very fine control of factor variables: factor levels can be set, sorted, created and merged.
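As a rough sketch of what that looks like (the file, variable and level names below are invented, and the exact factorInfo options are best checked against the RevoScaleR documentation), a single call pins down the levels of a variable in an .XDF file before any model is fit:

# Sketch only: file, variable and level names are invented for illustration
rxFactors(inData = "AdataT2.xdf",
          outFile = "AdataT2_factored.xdf",
          factorInfo = list(region = list(levels = c("east", "west", "north", "south"))),
          overwrite = TRUE)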
The analogous course of action with bigglm would be to set up the factor variables properly in the database. Whenever I have database problems, my go-to guy is my colleague Steve Weller. Steve loaded the data into a MySQL database installed on a quad-core PC with 8 GB of RAM running Windows 2008 Server R2 Standard. He manually added new indicator variables to the database corresponding to the factor levels in the original model, and built a model that was almost statistically equivalent to the original (we never quite got the contrasts right) but good enough to benchmark. It took bigglm about 27 minutes to run working off the MySQL database; by comparison, rxGlm completed in less than a minute on Steve's test machine. We have not yet tried to run bigglm on the entire 145 million record dataset. It would be nice to know whether bigglm scales linearly with the number of records. If it does, that would bring bigglm in at about 4.5 hours to process the entire data set, considerably longer than the 54 minutes it took to process the large data set with RevoScaleR on my PC.
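For anyone who wants to reproduce the workaround, the idea is to pre-compute 0/1 indicator columns in the database so that bigglm never has to build factor contrasts from a chunk. A minimal sketch against the SQLite table used above (the column and level names are invented, and one level is left out as the reference):

# Sketch only: add indicator columns for a hypothetical 'region' variable,
# treating 'east' as the reference level (it gets no column)
library(RSQLite)
conn <- dbConnect(dbDriver("SQLite"), dbname = file.path(getwd(), "AdataT2SQL"))

dbExecute(conn, "ALTER TABLE main.AT2_10Pct ADD COLUMN region_west INTEGER")
dbExecute(conn, "UPDATE main.AT2_10Pct
                 SET region_west = CASE WHEN region = 'west' THEN 1 ELSE 0 END")
dbExecute(conn, "ALTER TABLE main.AT2_10Pct ADD COLUMN region_north INTEGER")
dbExecute(conn, "UPDATE main.AT2_10Pct
                 SET region_north = CASE WHEN region = 'north' THEN 1 ELSE 0 END")

# The model formula then uses the numeric indicators in place of the factor, e.g.
# y ~ region_west + region_north + ...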
It would be nice to hear from R users who have built bigglm models on large data sets. Unfortunately, I cannot make the proprietary data set used in the benchmark available. However, it should not be too difficult to find a suitable publicly-available substitute. Our benchmark data set had 145,814,000 rows and 139 variables. These included integer, numeric, character and factor variables. There were 40 independent variables in the original model. If you try, be prepared to spend a little time on the project. It is not likely to be as easy as the beguiling phrase in the comment to the blog post (“Just import your csv in sqlite and run biglm…”) would indicate.
Nice work Joe!
Posted by: Derek Norton | November 13, 2012 at 09:50
Thanks for picking up the question and trying to benchmark it with biglm. Too bad we can't have access to the data.
Following your post, I decided two days ago to integrate biglm with package ffbase (available here http://code.google.com/p/fffunctions/ but I have made the current package also available at dropbox - see below).
It wasn't that hard to integrate, and now there is a simple wrapper for running bigglm on data in an ffdf.
So if you are running the following code (4 lines) on your dataset, we could get a good benchmark (if it doesn't fail of course ;)) between your parallel setup and a single process.
library(biglm)
library(ff)
download.file("http://dl.dropbox.com/u/25690064/ffbase_0.7.tar.gz", "ffbase_0.7.tar.gz")
system("R CMD INSTALL ffbase_0.7.tar.gz")
library(ffbase)
x <- read.csv.ffdf(file = file.path(getwd(), "AdataT2.csv"))
system.time(model <- bigglm(formula = yourmodel, data = x, family = poisson(), chunksize=100000, maxit=10))
Posted by: Jan | November 13, 2012 at 11:03
Thanks Jan, you made my day by asking for the most adequate comparison: combining biglm with ff. Thanks for providing such an elegant ffbase convenience wrapper which makes the biglm example in ff's help to ?chunk.ffdf more accessible. Indeed csv import to ffdf automatically handles the factor levels, so once bigglm commences on the first chunk, all the levels are known. Also the performance should be better than with SQLite. Of course we don't expect the speed of parallel execution, but at least we expect a result for little effort and investment. And once we need speed there is rxGlm which seems to do an excellent job. Curious for the final timings.
Jens
Posted by: Jens Oehlschlägel | November 13, 2012 at 13:08
I've used biglm with a large dataset with factor levels. My solution was to simply convert factor levels to multiple indicator variables in SQL.
Take care, though: the standard biglm doesn't work with Microsoft SQL. The wrapper needs to change slightly, as it uses SQL that is not compatible with MS SQL.
Posted by: Louis | November 16, 2012 at 05:41
What is the status of the benchmarking with bigglm + ffdf?
Thanks
Posted by: Chris | January 09, 2013 at 01:59
I also use indicator variables prepared in SQL with large datasets for bigglm.
Posted by: Louis | January 16, 2013 at 01:30
Hello
I would like to use zoo and other packages (cointegration, wavelet analysis) with large datasets (10 to 100 GB).
What tool (sqldf, RevoScaleR, RHadoop, ff, bigglm...) would be the best option?
I mean, I don't want to just use the commands these tools provide to perform calculations; I would like to forward that data to zoo. But I guess zoo will complain because it's not designed to work with streaming data?
How can I do it?
regards
Posted by: skan | March 26, 2013 at 18:44