Responding to the birth rates analysis in the post earlier this week on big-data analysis with Revolution R Enterprise, Luis Apiolaza asks at the Quantum Forest blog: do we really need to deal with big data in R?
My basic question is why would I want to deal with all those 100 million records directly in R? Wouldn’t it make much more sense to reduce the data to a meaningful size using the original database, up there in the cloud, and download the reduced version to continue an in-depth analysis?
As Luis points out (and as most of us know from experience), 90% of statistical data analysis is data preparation. Many "big data" problems are in fact analyses of small data sets that have been carefully (and often painfully) extracted from a data store we'd refer to today as "Big Data". And while we could use another tool to do that extraction, personally I'd prefer to do it in R myself. Not just because needing another tool probably means delays, authorizations, and having to ask a DBA nicely, but also because the extraction process itself (in my opinion) requires a certain level of statistical expertise. For me, at least, it's often an iterative process of identifying the variables I need, choosing the right way to do the aggregation/smoothing/dimension reduction, and deciding how to handle missing values and data quality issues ... the list goes on and on. Being able to do that extraction in the R language alone is a great boon, especially when the source data set is very large. That's why we created the rxDataStep function in RevoScaleR. (You can read more about rxDataStep in our new white paper, The RevoScaleR Data Step White Paper.)
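To make that concrete, here's a minimal sketch of what such an extraction can look like with rxDataStep. The file names, variable names, and birth-weight cut-off are illustrative assumptions only; they aren't taken from the birth rates analysis.

```r
# A minimal sketch (not from the original post): reducing a large .xdf file
# to a small, analysis-ready extract with rxDataStep. File names, variable
# names, and the cut-off below are hypothetical.
library(RevoScaleR)

rxDataStep(
  inData       = "births_large.xdf",                   # large source file on disk
  outFile      = "births_subset.xdf",                  # much smaller extract
  rowSelection = !is.na(DBWT),                         # drop records with missing birth weight
  varsToKeep   = c("DBWT", "SEX", "MAGER", "DOB_YY"),  # keep only the variables needed
  transforms   = list(lowBirthWeight = DBWT < 2500),   # derived indicator variable
  overwrite    = TRUE
)

# With outFile omitted, rxDataStep returns the (now small) extract as an
# ordinary data frame for in-depth analysis with standard R tools.
births <- rxDataStep(inData = "births_subset.xdf")
```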
Then again, some statistical problems simply do require analysis of very large data sets wholesale. Some of the commenters on Luis's post provide their own examples, and Revolution Analytics' CEO Norman Nie has written a white paper identifying five situations where analysis of large data sets in R is useful:
- Use Data Mining to Make Predictions
- Make Predictive Models More Powerful
- Find and Understand Rare Events
- Extract and Analyze ‘Low Incidence Populations’
- Avoid Dependence on ‘Statistical Significance’
You can read Norman's explanations of these uses of Big Data in the white paper, The Rise of Big Data Spurs a Revolution in Big Analytics, available for download at the link below.
Revolution Analytics White Papers: The Rise of Big Data Spurs a Revolution in Big Analytics
Isn't it fun to live in this age when we can exchange ideas so easily? Take care,
Luis
Posted by: Luis | November 22, 2011 at 17:47
Great reply David - thank you.
Posted by: Tal Galili | November 23, 2011 at 02:38
Sue's demo is great, and I like it very much. Just one thing to mention: her example is tied to Windows HPC Server, and not everyone has access to that kind of hardware. Would it be possible for her to give an example on Amazon Web Services? For example, she could set up five nodes, make one of them the head node that distributes tasks to the four worker nodes, gathers the results, and sends them back to the requesting laptop. Big data analysis and parallel computing on the Amazon cloud sounds more accessible for everyone.
Posted by: Vivian Zhang | November 23, 2011 at 20:42
In Susan's earlier slides from useR! 2011, there is a lot of dirty work needed to define each variable properly and assemble them into the "birthAll" dataset, such as SEX = list(type="factor", start=35, width=1, levels=c("1", "2"), newLevels=c("Male", "Female"), description="Sex of Infant"). I spent several hours tonight matching columns to their names. Would it be possible for Susan to share this "birthAll"? Certainly, it would be fun to have this dataset and actually play with it in various distributed environments.
Posted by: Vivian Zhang | November 24, 2011 at 01:59