by Thomas Dinsmore
On April 26, SAS published on its website an undated Technical Paper entitled Big Data Analytics: Benchmarking SAS, R and Mahout. In the paper, the authors (Allison J. Ames, Ralph Abbey and Wayne Thompson) describe a recent project to compare model quality, product completeness and ease of use for two SAS products together with open source R and Apache Mahout.
Today and next week, I will post a two-part review of the SAS paper. In today's post I will cover simple factual errors and disclosure issues; next week's post will cover the authors' methodology and findings.
Mistakes and Errors
This section covers simple mistakes by the authors.
(1) In Table 2, the authors claim to have tested Mahout "7.0". I assume they mean Mahout 0.7, the most current release.
(2) In the "Overall Completeness" section, the authors write that "R uses the change in Akaike's Information Criteria (AIC) when it evaluates variable importance in stepwise logistic regression wheras [sic] SAS products use a change in R-squared as a default." This statement is wrong. SAS products use R-squared to evaluate variable importance in stepwise linear models, but not in logistic regression, where the R-squared concept does not apply. A review of the SAS documentation confirms that SAS products use the Wald chi-square statistic to evaluate variable importance in stepwise logistic regression. (A short sketch of AIC-based stepwise selection in R follows this list.)
(3) Table 3 in the paper states that R does not support ensemble models. This is incorrect. See, for example, the randomForest, gbm and ipred packages, among many others; two of these appear in the sketch after this list.
(4) The "Overall Modeler Effort" section includes this statement: "Bewerunge (2011) found that R could not model a data set larger than 1.3 GB because of the object-oriented programming environment within R." The cited paper (linked here) makes no general statements about R; it simply notes the capacity of one small PC and does not demonstrate any link between R's object-oriented approach and its use of memory. The authors also fail to state that in Bewerunge's tests R ran faster than SAS in every test it was able to run, and that Bewerunge (a long-time SAS Alliance Partner) drew no conclusions about the relative merits of SAS and R.
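To make points (2) and (3) concrete, here is a minimal R sketch. It is not drawn from the SAS paper; the data frame dat and its variables are made up purely for illustration.

    # Hypothetical data: a binary response y and three predictors (not the benchmark data).
    set.seed(1)
    dat <- data.frame(x1 = rnorm(500), x2 = rnorm(500), x3 = rnorm(500))
    dat$y <- rbinom(500, 1, plogis(0.8 * dat$x1 - 0.5 * dat$x2))

    # (2) Stepwise logistic regression in base R selects variables by change in AIC.
    full <- glm(y ~ x1 + x2 + x3, data = dat, family = binomial())
    step_fit <- step(full, direction = "both", trace = FALSE)
    summary(step_fit)

    # (3) Two of the many R packages that fit ensemble models.
    library(randomForest)    # bagged ensembles of classification trees
    rf_fit <- randomForest(factor(y) ~ x1 + x2 + x3, data = dat, ntree = 200)

    library(gbm)             # stochastic gradient boosting
    gbm_fit <- gbm(y ~ x1 + x2 + x3, data = dat, distribution = "bernoulli",
                   n.trees = 200, interaction.depth = 2)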
Disclosure Issues
Benchmarking studies should provide sufficient information about how the testing was performed; this makes it possible for readers to make informed decisions about how well the results generalize to everyday experience. For tests of model quality, publishing the actual data used in the benchmark ensures that the results are replicable.
As we know from the debate over the Reinhart-Rogoff findings, even the best-trained and credentialed individuals can commit simple coding errors. We invite the authors to make the data used in the benchmark study available to the SAS and R communities.
We also believe that additional disclosures by the authors would help readers evaluate the methodology and interpret the paper's findings. These include:
(1) Additional detail about the testing environment. I'll remark on the obvious differences in the hardware provisioning in next week's post, but for now I will simply note that the HPA environment described in the paper does not appear to match any existing Greenplum production appliances;
(2) The actual R packages used for the benchmark;
(3) The size of the data sets (in gigabytes);
(4) Actual sample sizes for the training and validation sets for each method, together with more detail about the sampling methods used;
(5) Details of the model parameter settings used for each method and product;
(6) The value of "priors" used for each model run (which alone may explain the observed differences in event precision);
(7) In the results tables, detailed model quality statistics for each test, including sensitivity, specificity, precision and accuracy, together with the actual confusion matrices and method-specific diagnostics (a short R illustration of these statistics follows this list);
(8) Detailed model quality tables for the Marketing and Telecom problems, which are not disclosed in the paper.
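As a point of reference for item (7), these statistics are simple functions of the confusion matrix. A minimal R illustration follows, using made-up class labels rather than the benchmark data:

    # Hypothetical observed and predicted class labels (0 = non-event, 1 = event).
    observed  <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0)
    predicted <- c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0)

    cm <- table(observed, predicted)          # 2 x 2 confusion matrix
    tp <- cm["1", "1"]; tn <- cm["0", "0"]
    fp <- cm["0", "1"]; fn <- cm["1", "0"]

    sensitivity <- tp / (tp + fn)             # true positive rate
    specificity <- tn / (tn + fp)             # true negative rate
    precision   <- tp / (tp + fp)             # positive predictive value
    accuracy    <- (tp + tn) / sum(cm)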
We invite readers to review the paper and share their thoughts in the Comments section below.
Continue reading: SAS Big Data Analytics Benchmark (Part Two)
Derek Norton, Joseph Rickert, Bill Jacobs, Mario Inchiosa, Lee Edlefsen and David Smith all contributed to this post.
Gigabytes is not a useful unit of dataset size. It's an amount of memory. Are these 1-byte integers, 8-byte floats, or the size of a CSV file when each number is written out to 23 decimal places?
For the usual grid-type of dataset, stating the rows, columns, and types of each column (numeric, integer, categorical) is a much more informative measure.
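For instance (a quick R illustration with made-up matrices, not the benchmark data):

    # The same 1,000,000 x 10 table occupies very different amounts of memory
    # depending on the storage type of its columns.
    m_int <- matrix(1L,  nrow = 1e6, ncol = 10)    # 4-byte integers
    m_dbl <- matrix(1.0, nrow = 1e6, ncol = 10)    # 8-byte doubles
    print(object.size(m_int), units = "Mb")        # roughly 38 Mb
    print(object.size(m_dbl), units = "Mb")        # roughly 76 Mb
    # Written out as CSV with many decimal places, the same numbers grow larger still.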
Posted by: Barry Rowlingson | April 30, 2013 at 08:51
It doesn't seem that the authors did their due diligence with regard to the R ecosystem. I would just add:
1. Logistic regression can be performed out of memory in open-source R using the biglm or speedglm packages (see the sketch after this list).
2. Out of memory random forests are available via bigrf.
3. It seems pretty naive to leave things like the % correctly classified blank for R because they are not reported by the summary function. Sure, R doesn't spit out tons of irrelevant quantities by default, but that is by design not by deficiency. Is table(data$variable,predict(model,newdata=data)) too hard for them?
4. The whole "object orientation leads to big data problems" argument really makes me wonder whether they have a solid understanding of computer science. R has more limited functionality for out-of-memory data, but that is not due to object orientation.
5. Regarding reading in data: "missing values must be designated "NA" in the original data" is a false statement. See the na.strings parameter of read.table. Though who knows how they are reading in their data as it is not stated.
6. What is going on with random forests??? They report 29% correct classification for SAS, 0.8% for R, and 0.1% for Mahout. No mention of these insanely different numbers is given in the text.
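A minimal sketch of points 1, 3 and 5 (the file name, data frame and column names below are placeholders, not the data from the paper):

    library(biglm)

    # Point 5: read.csv maps arbitrary codes to NA via na.strings.
    dat <- read.csv("input.csv", na.strings = c("NA", "", "."))

    # Point 1: a logistic regression fitted in bounded memory, processing the data in chunks.
    fit <- bigglm(y ~ x1 + x2, data = dat, family = binomial(), chunksize = 10000)
    summary(fit)

    # Point 3: percent correctly classified from an ordinary glm, via a confusion matrix.
    glm_fit <- glm(y ~ x1 + x2, data = dat, family = binomial())
    pred <- as.integer(predict(glm_fit, newdata = dat, type = "response") > 0.5)
    tab <- table(observed = dat$y, predicted = pred)
    sum(diag(tab)) / sum(tab)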
Posted by: Ian Fellows | April 30, 2013 at 12:15
It is also interesting that they mention issues with the use of RAM in R but give R the machine with the least of it. The machine used for Enterprise Miner was exactly the same, except that it had 21 GB of RAM instead of 15. The laptop I'm on now has more than that (16 GB).
What is step 4 in the time to analyze for R, "readable format (.csv)"? If they used the RMySQL package, are they saying that they used it for only one part, but manually loaded a generated CSV file into SAS in order to put it into MySQL? It seems obvious that would take some time (SAS -> R -> CSV -> SAS -> MySQL -> R). There also seems to be nothing involved with SAS: you just have source data, and in 10 minutes you have a model. I am not sure that is a good thing; how many assumptions does it implement for you, and are they good or bad? They show time spent partitioning the data in Mahout and R but not in SAS, yet they talk about that step happening. That step even seems to have some questionable practice behind it.
Are they saying that they actually trained the SAS model on an oversampled set because the tool forces this, with no arguments to override that behavior? R and SAS HPAS supposedly do not support ensembles, yet both have random forests, which are ensembles. The NA and accuracy issues almost make it seem that if there wasn't a GUI widget or drop-down to do something, it wasn't done. It would be nice if the study were reproducible, but that is not easy with point-and-click tools.
Posted by: kjd | April 30, 2013 at 16:35
Can you also comment on the following paper, which claims that R's mixed model procedure is inferior to SAS's in terms of inflated Type I error? Thanks.
http://onlinelibrary.wiley.com/doi/10.1002/sim.4265/abstract
Posted by: Sean | April 30, 2013 at 20:58
Non-linear mixed models can be pretty sensitive to the number of quadrature points used. They used 20 points, which I assume would be enough for the models being estimated, but perhaps not if they are getting significantly different results from the SAS algorithm.
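For example, in lme4 the number of adaptive Gauss-Hermite quadrature points can be raised explicitly (a sketch with made-up variable names, assuming a logistic random-intercept model):

    library(lme4)
    # nAGQ sets the number of adaptive Gauss-Hermite quadrature points for a model
    # with a single scalar random effect; nAGQ = 1 is the Laplace approximation.
    fit <- glmer(y ~ x + (1 | cluster), data = dat, family = binomial, nAGQ = 20)
    summary(fit)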
Posted by: Ian Fellows | May 01, 2013 at 09:21
Their idea of a benchmark is running programs on four completely different computers?
R correctly predicted 0.8% vs 29% for SAS? I'd think more Kaggle users would be using SAS if this were true.
Posted by: Wallace Campbell | May 01, 2013 at 19:34
Great post - I'm looking forward to your next post on the findings.
I agree with Barry Rowlingson - gigabytes is a bad measure of dataset size. In order to reproduce the results, the authors of the SAS paper should make the actual test datasets available.
I also agree with Wallace Campbell - benchmarks should be run on the same computer, or at least computers with the same hardware and configuration.
Posted by: Liz Derr | May 05, 2013 at 17:01
The largest data set used for this paper has 2 million observations. I'm shocked that this is their definition of "big data." We routinely deal with data sets that have more than 50 million observations.
This is one of the many problems with the phrase "big data" - no one defines how big "big" is.
I want to see benchmarks of truly big datasets - on the order of 4 billion or more "facts." Anyone know of anything like this?
Posted by: Liz Derr | May 05, 2013 at 17:10