
November 13, 2012



Nice work Joe!

Thanks for picking up the question and trying to benchmark it with biglm. Too bad we can't have access to the data.

Following your post, I decided two days ago to integrate biglm with package ffbase (available here http://code.google.com/p/fffunctions/ but I have made the current package also available at dropbox - see below).
It wasn't that hard to integrate, and now there is a simple wrapper for running bigglm on data in an ffdf.

So if you run the following code on your dataset, we could get a good benchmark (if it doesn't fail, of course ;)) between your parallel setup and a single process.

download.file("http://dl.dropbox.com/u/25690064/ffbase_0.7.tar.gz", "ffbase_0.7.tar.gz")
system("R CMD INSTALL ffbase_0.7.tar.gz")
library(biglm)   # provides the bigglm generic
library(ffbase)  # loads ff and the bigglm wrapper for ffdf
x <- read.csv.ffdf(file = file.path(getwd(), "AdataT2.csv"))
# replace yourmodel with your own model formula
system.time(model <- bigglm(formula = yourmodel, data = x, family = poisson(), chunksize = 100000, maxit = 10))

Thanks Jan, you made my day by asking for the most adequate comparison: combining biglm with ff. Thanks for providing such an elegant ffbase convenience wrapper, which makes the biglm example in ff's help page for ?chunk.ffdf more accessible. Indeed, CSV import to ffdf automatically handles the factor levels, so once bigglm starts on the first chunk, all the levels are known. Also, the performance should be better than with SQLite. Of course we don't expect the speed of parallel execution, but at least we expect a result for little effort and investment. And once we need speed, there is rxGlm, which seems to do an excellent job. Curious about the final timings.
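
To illustrate why it matters that all factor levels are known before the first chunk is processed, here is a minimal sketch in base R. The toy data frame, the column names y and g, and the two-chunk split are made up for illustration and have nothing to do with the benchmark data; the sketch only shows that design matrices built chunk by chunk disagree unless the levels are fixed up front.

```r
# Toy data (hypothetical): a grouping variable whose levels
# do not all appear in every chunk.
df <- data.frame(
  y = c(1, 3, 2, 5, 4, 6),
  g = c("a", "a", "b", "b", "c", "c"),
  stringsAsFactors = FALSE
)

# Two chunks of three rows: chunk 1 sees a, a, b; chunk 2 sees b, c, c.
chunks <- split(df, rep(1:2, each = 3))

# Levels inferred separately per chunk: the design matrices
# end up with different columns, so they cannot be combined.
mm_naive <- lapply(chunks, function(ch) {
  model.matrix(y ~ factor(g), data = ch)
})

# Levels fixed globally before chunking: every chunk yields the
# same columns, even for levels it does not contain.
all_levels <- sort(unique(df$g))
mm_fixed <- lapply(chunks, function(ch) {
  ch$g <- factor(ch$g, levels = all_levels)
  model.matrix(y ~ g, data = ch)
})

lapply(mm_naive, colnames)  # column names differ between chunks
lapply(mm_fixed, colnames)  # column names are identical
```

With the levels fixed, a chunk that lacks a level simply gets an all-zero column for it, which is what a chunk-wise fitting routine needs in order to accumulate results across chunks.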

I've used biglm with a large dataset with factor levels. My solution was to simply convert factor levels to multiple indicator variables in SQL.
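
The indicator-variable approach could be sketched roughly as follows. This is only an illustration in base R: the helper indicator_sql and the table and column names (claims, y, region) are hypothetical, and it assumes the level names are simple alphanumeric strings that need no escaping. It builds one CASE WHEN expression per level, dropping the first level as the reference category.

```r
# Hypothetical helper: build a SELECT statement that expands a
# categorical column into 0/1 indicator columns, one per level,
# omitting the first level as the reference category.
# Assumes level names are plain alphanumeric strings (no quoting needed).
indicator_sql <- function(response, column, levels, table) {
  lv <- levels[-1]
  cols <- sprintf(
    "CASE WHEN %s = '%s' THEN 1 ELSE 0 END AS %s_%s",
    column, lv, column, lv
  )
  paste0(
    "SELECT ", response, ", ",
    paste(cols, collapse = ", "),
    " FROM ", table
  )
}

query <- indicator_sql(
  response = "y",
  column   = "region",
  levels   = c("north", "south", "west"),
  table    = "claims"
)
cat(query, "\n")
```

The resulting query uses only CASE WHEN, so it should be portable across most SQL dialects; the indicator columns can then appear in the model formula as ordinary numeric predictors.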

Take care, though: the standard biglm doesn't work with Microsoft SQL. The wrapper needs to be changed slightly, as it uses SQL that is not compatible with MS SQL.

What is the status of the benchmarking with bigglm + ffdf?

I also use indicator variables prepared in SQL with large datasets for bigglm.


I would like to use zoo and other packages (cointegration, wavelet analysis) with large datasets (10 to 100 GB).

What tool (sqldf, RevoScaleR, RHadoop, ff, bigglm...) would be the best option?

I mean, I don't just want to use the commands that these tools provide to perform calculations; I would like to pass that data on to zoo. But I guess zoo will complain, because it's not designed to work with streaming data?

How can I do it?

