by Joseph Rickert
When talking with data scientists and analysts who work with large-scale data analytics platforms such as Hadoop about the best way to do some sophisticated modeling task, it is not uncommon for someone to say, "We have all of the data. Why not just use it all?" At first, this sort of comment sounds pragmatic and reasonable to almost everyone. After all, wouldn’t a model based on all of the data be better than a model based on a subsample? Well, maybe not. It depends, of course, on the problem at hand as well as on time and computational constraints. To illustrate the kinds of challenges that large data sets present, let’s look at something very simple using the airlines data set from the 2009 ASA challenge.
Here are some of the results for a regression of ArrDelay on CRSDepTime with a random sample of 12,283 records drawn from that data set:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.85885    0.80224  -1.071    0.284
# CRSDepTime   0.56199    0.05564  10.100 2.22e-16
# Adjusted R-squared: 0.008157
And here are some results from the same model using 120,947,440 records:
#               Estimate Std. Error t value Pr(>|t|)
# (Intercept) -2.4021635  0.0083532  -287.6 2.22e-16 ***
# CRSDepTime   0.6990404  0.0005826  1199.9 2.22e-16 ***
# Adjusted R-squared: 0.01176
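One way to read the two summaries side by side is to ask how much of the change is simply the effect of sample size. For ordinary least squares, the standard error of a coefficient shrinks roughly in proportion to 1/sqrt(n). The short Python sketch below is an illustration, not part of the original analysis. It compares that rule of thumb with the standard errors reported for CRSDepTime. The variable names are made up for this example; the numbers are copied from the output above.

```python
import math

# Figures copied from the two regression summaries above.
n_sample, se_sample = 12_283, 0.05564
n_full, se_full = 120_947_440, 0.0005826

# If the standard error scales like 1/sqrt(n), this is the expected shrink factor.
expected_shrink = math.sqrt(n_full / n_sample)

# The shrink factor actually seen in the reported standard errors.
observed_shrink = se_sample / se_full

print(f"expected shrink in Std. Error: {expected_shrink:.1f}x")
print(f"observed shrink in Std. Error: {observed_shrink:.1f}x")
```

Both factors come out a little under 100, so the enormous t values in the second fit mostly reflect the larger n. The adjusted R-squared, which does not improve automatically with more records, barely moves.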
More data didn’t yield an obviously better model! I don’t think anyone would really find this to be much of a surprise: we are dealing with a not very good model to begin with, and the adjusted R-squared is roughly 0.01 in both fits. Nevertheless, the example does provide the opportunity to investigate how estimates of the coefficients change with sample size. This next graph shows the slope coefficient plotted against sample size, with sample sizes ranging from 12,283 to 12,094,709 records. Each regression was done on a random sample that includes about 12,000 points more than the previous one. The graph also shows, in red, the standard confidence interval for the coefficient at each point. Notice that after some initial instability, the coefficient estimates settle down to something close to the value of beta obtained using all of the data.
The rapid approach to the full-data-set value of the coefficient is even more apparent in the following graph, which shows the difference between the estimated value of the beta coefficient at each sample and the value obtained using all of the data. The maximum difference from the fourth sample on is 0.07. This is pretty close indeed. In cases like this, if you believed that your samples were representative of the entire data set, working with all of the data to evaluate possible models would be a waste of time and possibly counterproductive.
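The behavior described in the last two paragraphs can be reproduced in miniature. The sketch below is a Python illustration on synthetic data, not the airline data and not the original RevoScaleR code. It fits a simple regression on ever larger random samples and records the slope, an approximate 95% confidence half-width, and the distance from the "full data" slope. The true slope, noise level, and sizes are assumptions chosen only to mimic a weak, noisy relationship.

```python
import math
import random

def ols_slope(xs, ys):
    """Return (slope, standard error of the slope) for simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se

rng = random.Random(42)   # fixed seed so the run is repeatable
TRUE_SLOPE = 0.7          # assumed value, loosely echoing the full-data fit
N_TOTAL = 20_000          # stand-in for the "full" data set
STEP = 500                # each sample adds this many records

# Synthetic data: a weak linear signal buried in a lot of noise (low R-squared).
xs = [rng.uniform(0, 24) for _ in range(N_TOTAL)]
ys = [TRUE_SLOPE * x + rng.gauss(0, 30) for x in xs]

full_slope, _ = ols_slope(xs, ys)

# The records are generated independently, so the first n of them
# form a random sample of the whole set.
path = []
for n in range(STEP, N_TOTAL + 1, STEP):
    slope, se = ols_slope(xs[:n], ys[:n])
    path.append((n, slope, 1.96 * se, abs(slope - full_slope)))

# Show the first and last few samples.
for n, slope, half_width, diff in path[:3] + path[-3:]:
    print(f"n={n:>6}  slope={slope:.4f} +/- {half_width:.4f}  |diff from full|={diff:.4f}")
```

The exact numbers depend on the seed, but the confidence half-width narrows as n grows, and the last sample, being the whole set, matches the full-data slope exactly.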
I am certainly not arguing that one never wants to use all of the data. For one thing, when scoring a model or making predictions, the goal is to do something with all of the records. Moreover, in more realistic modeling situations where there are thousands of predictor variables, 120M observations might not be enough data to conclude anything. A large model can digest degrees of freedom very quickly and severely limit the ability to make any kind of statistical inference. I do want to argue, however, that with large data sets the ability to work with random samples of the data confers the freedom to examine several models quickly, with considerable confidence that the results would be decent estimates of what would be obtained by using the full data set.
I did the random sampling and regressions in my little example using functions from Revolution Analytics’ RevoScaleR package. Initially, all of the data was read from the csv files that comprise the FAA data set into the binary .xdf file format used by RevoScaleR. Then the random samples were selected using RevoScaleR’s rxDataStep function, which was designed to manipulate large data sets quickly. The data step reads each record, draws a random number with a value between 1 and 9999, and assigns it to the variable urns.
Random samples for each regression were drawn by looping through the appropriate values of the variables. Notice how the call to R’s runif() function happens within the transforms parameter of rxDataStep. It took about 33 seconds to do the full regression on my laptop, which made it feasible to undertake the extravagant number of calculations necessary to do the 1,000 regressions in a few hours after dinner.
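The urn idea itself is independent of RevoScaleR and can be sketched in a few lines of plain Python. The example below is a small illustration on synthetic records, not the original code. It assumes that sample k consists of every record whose urn number is at most k, which is one reading of the statement that each sample adds about 12,000 points to the previous one. The record layout, the number of urns, and the helper `slope` are all hypothetical.

```python
import random

rng = random.Random(7)    # fixed seed so the run is repeatable
N_URNS = 100              # the post uses 9999; smaller here to keep the demo quick
N_RECORDS = 20_000        # stand-in for the full data set

# Step 1: one pass over the records, attaching a random urn number to each.
# Synthetic (dep_time, arr_delay) pairs stand in for the airline columns.
records = []
for _ in range(N_RECORDS):
    dep_time = rng.uniform(0, 24)
    arr_delay = 0.7 * dep_time + rng.gauss(0, 30)
    urn = rng.randint(1, N_URNS)   # uniform integer in 1..N_URNS, inclusive
    records.append((urn, dep_time, arr_delay))

def slope(rows):
    """Least-squares slope of arr_delay on dep_time."""
    n = len(rows)
    mean_x = sum(r[1] for r in rows) / n
    mean_y = sum(r[2] for r in rows) / n
    sxx = sum((r[1] - mean_x) ** 2 for r in rows)
    sxy = sum((r[1] - mean_x) * (r[2] - mean_y) for r in rows)
    return sxy / sxx

# Step 2: loop over urn thresholds. Sample k holds every record with urn <= k,
# so each sample contains the previous one plus roughly N_RECORDS / N_URNS new rows.
results = []
for k in range(1, 11):
    sample = [r for r in records if r[0] <= k]
    results.append((k, len(sample), slope(sample)))

for k, size, b in results:
    print(f"urns <= {k:>2}: {size:>5} records, slope = {b:.3f}")
```

Because the urn numbers are assigned once, up front, every later sample is just a filter on that column, and no record needs to be reshuffled or redrawn between regressions.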
I think there are three main takeaways from this exercise:
- Lots of data does not necessarily equate to “Big Data”.
- For exploratory modeling you want to work in an environment that allows for rapid prototyping and provides the statistical tools for model evaluation and visualization. There is no better environment than R for this kind of work, and Revolution’s distribution of R offers the ability to work with very large samples.
- The ability to draw random samples from large data sets is the way to balance accuracy against computational constraints.
To my way of thinking, the single most important capability to implement in any large-scale data platform that is going to support sophisticated analytics is the ability to quickly construct high-quality random samples.