
April 30, 2013

Comments


Gigabytes is not a useful unit of dataset size. It's an amount of memory. Are these 1-byte integers, 8-byte floats, or the size of a CSV file when each number is written out to 23 decimal places?

For the usual grid-type dataset, stating the number of rows and columns and the type of each column (numeric, integer, categorical) is a much more informative measure.

It doesn't seem that the authors did their due diligence with regard to the R ecosystem. I would just add:

1. Logistic regression can be performed out of memory in open-source R using the biglm or speedglm packages (a bigglm() sketch follows this list).

2. Out-of-memory random forests are available via the bigrf package.

3. It seems pretty naive to leave things like the % correctly classified blank for R just because the summary function doesn't report it. Sure, R doesn't spit out tons of irrelevant quantities by default, but that is by design, not by deficiency. Is table(data$variable, predict(model, newdata = data)) too hard for them? (A fuller confusion-matrix sketch follows this list.)

4. The whole "object orientation leads to big-data problems" argument really makes me wonder whether they have a solid understanding of computer science. R has more limited functionality for out-of-memory data, but that is not due to object orientation.

5. Regarding reading in data: the claim that "missing values must be designated 'NA' in the original data" is false. See the na.strings parameter of read.table (sketched after this list). Then again, who knows how they are reading in their data, since it is never stated.

6. What is going on with random forests? They report 29% correct classification for SAS, 0.8% for R, and 0.1% for Mahout. No mention of these insanely different numbers is made in the text.
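
To illustrate point 1, here is a minimal sketch of out-of-memory logistic regression with biglm::bigglm(). The file name, column names, and chunk size are all hypothetical; bigglm() only needs a function that hands it one chunk of rows at a time.

library(biglm)

## bigglm() accepts a function of one argument 'reset': called with
## reset = TRUE it rewinds to the start of the data, and with
## reset = FALSE it returns the next chunk (or NULL when exhausted),
## so the full dataset never has to sit in memory at once.
make_reader <- function(path, chunk_rows = 10000) {
  con <- NULL
  col_names <- names(read.csv(path, nrows = 1))   # grab the header once
  function(reset = FALSE) {
    if (reset) {
      if (!is.null(con)) close(con)
      con <<- file(path, open = "r")
      readLines(con, n = 1)                       # skip the header line
      return(invisible(NULL))
    }
    chunk <- tryCatch(
      read.csv(con, header = FALSE, nrows = chunk_rows,
               col.names = col_names),
      error = function(e) NULL)                   # EOF: no more chunks
    if (is.null(chunk) || nrow(chunk) == 0) NULL else chunk
  }
}

## 'churn', 'tenure', and 'spend' are hypothetical column names.
fit <- bigglm(churn ~ tenure + spend,
              data   = make_reader("big_data.csv"),
              family = binomial())
summary(fit)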
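
And for point 3, the percent correctly classified takes a couple of lines once you have predictions. A minimal sketch, assuming a logistic model 'fit' from glm() (or similar) and a holdout data frame with the true 0/1 outcome in holdout$churn; all names are hypothetical.

## Predicted probabilities, thresholded at 0.5 (tune the cutoff as needed)
p    <- predict(fit, newdata = holdout, type = "response")
pred <- as.integer(p > 0.5)

table(observed = holdout$churn, predicted = pred)  # confusion matrix
mean(pred == holdout$churn)                        # proportion correctly classified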
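
On point 5, mapping other missing-value codes to NA happens at import time. A one-line sketch, with the file name and the codes purely hypothetical:

## Any of these strings in the raw file become NA on import
d <- read.csv("big_data.csv",
              na.strings = c("NA", "", ".", "-999", "NULL"))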

It is also interesting that they mention issues with R's use of RAM but give it the machine with the least of it. The machine used for Enterprise Miner was identical except that it had 21 GB of RAM instead of 15. The laptop I'm on now has more than that (16 GB).
What is step 4 in the time-to-analyze for R, "readable format (.csv)"? If they used the RMySQL package, are they saying they only used it for one part, and that they manually loaded a generated CSV file into SAS in order to put the data into MySQL? Obviously that round trip takes time (SAS -> R -> CSV -> SAS -> MySQL -> R), and RMySQL makes the detour unnecessary (see the sketch after this comment). There also seems to be no preparation at all on the SAS side: you just have source data and ten minutes later you have a model. I am not sure that is a good thing; how many assumptions does it make for you, and are they good or bad? They show time spent partitioning the data in Mahout and R but not in SAS, yet they talk about that step happening, and the step even seems to have some questionable practice behind it.
Are they saying that they actually trained the SAS model on an oversampled set because the tool forces this, with no argument to override the behavior? They claim R and SAS HPAS do not support ensembles, yet both have random forests, which are ensembles. The NA and accuracy issues make it look as though anything without a GUI widget or drop-down simply wasn't done. It would be nice if the analysis were reproducible, but that is not easy with point-and-click tools.
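
For what it's worth, RMySQL can pull a table straight into R with no CSV detour at all. A minimal sketch, with the database name, table name, and credentials all hypothetical:

library(DBI)
library(RMySQL)

## Connect, run a query, and get the result back as an ordinary data frame
con <- dbConnect(MySQL(), dbname = "bench", host = "localhost",
                 user = "analyst", password = "secret")
dat <- dbGetQuery(con, "SELECT * FROM transactions")
dbDisconnect(con)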

Can you also comment on the following paper, which claims R's mixed-model procedure is inferior to SAS's in terms of inflated type I error? Thanks.
http://onlinelibrary.wiley.com/doi/10.1002/sim.4265/abstract


Non-linear mixed models can be pretty sensitive to the number of quadrature points used. They used 20 points, which I assume would be enough for the models being estimated, but perhaps not if they are getting results significantly different from the SAS algorithm. A quick sensitivity check is sketched below.
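
This is not the paper's exact model, just an illustration: one cheap check in R is to refit a generalized linear mixed model with lme4::glmer at increasing nAGQ values (the data frame, response, and grouping factor here are hypothetical) and see whether the estimates move.

library(lme4)

## Adaptive Gauss-Hermite quadrature with 1, 5, 10, and 20 points;
## nAGQ > 1 requires a single scalar random effect, as here.
fits <- lapply(c(1, 5, 10, 20), function(q)
  glmer(resp ~ trt + (1 | subject), data = d,
        family = binomial, nAGQ = q))
t(sapply(fits, fixef))   # fixed effects should stabilize as nAGQ grows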

Their idea of a benchmark is running programs on four completely different computers?

R correctly predicted 0.8% vs 29% for SAS? I'd think more Kaggle users would be using SAS if this were true.

Great post - I'm looking forward to your next post on the findings.

I agree with Barry Rowlingson - gigabytes is a bad measure of dataset size. In order to reproduce the results, the authors of the SAS paper should make the actual test datasets available.

I also agree with Wallace Campbell - benchmarks should be run on the same computer, or at least computers with the same hardware and configuration.

The largest dataset used for this paper has 2 million observations. I'm shocked that this is their definition of "big data." We routinely deal with datasets that have more than 50 million observations.

This is one of the many problems with the phrase "big data" - no one defines how big "big" is.

I want to see benchmarks of truly big datasets - on the order of 4 billion or more "facts." Anyone know of anything like this?

