by Joseph Rickert
When talking with data scientists and analysts who work with large-scale analytics platforms such as Hadoop about the best way to approach a sophisticated modeling task, it is not uncommon for someone to say, "We have all of the data. Why not just use it all?" At first this sounds pragmatic and reasonable to almost everyone. After all, wouldn't a model based on all of the data be better than a model based on a subsample? Well, maybe not: it depends, of course, on the problem at hand as well as on time and computational constraints. To illustrate the kinds of challenges that large data sets present, let's look at something very simple using the airlines data set from the 2009 ASA challenge.
Here are some of the results for a regression of ArrDelay on CRSDepTime fit to a random sample of 12,283 records drawn from that data set:
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.85885    0.80224  -1.071     0.284
# CRSDepTime   0.56199    0.05564  10.100  2.22e-16
#
# Multiple R-squared: 0.008238
# Adjusted R-squared: 0.008157
And here are some results from the same model using 120,947,440
records:
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)
# (Intercept) -2.4021635  0.0083532  -287.6 2.22e-16 ***
# CRSDepTime   0.6990404  0.0005826  1199.9 2.22e-16 ***
#
# Multiple R-squared: 0.01176
# Adjusted R-squared: 0.01176
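Output in this shape comes straight from calling summary() on an ordinary lm() fit. As a minimal sketch, assuming the sampled records are already in an in-memory data frame called samp (a name invented here for illustration):

# Fit the same simple model on an in-memory sample and print the usual summary
fit <- lm(ArrDelay ~ CRSDepTime, data = samp)
summary(fit)   # coefficient table, R-squared, adjusted R-squared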
More data didn't yield an obviously better model! I don't think anyone would really find this to be much of a surprise; it is not a very good model to begin with. Nevertheless, the example does provide the opportunity to investigate how estimates of the coefficients change with sample size. The next graph shows the estimate of the slope coefficient plotted against sample size, with sample sizes ranging from 12,283 to 12,094,709 records. Each regression was done on a random sample that includes about 12,000 more points than the previous one. The graph also shows, in red, the confidence interval for the coefficient at each sample size. Notice that after some initial instability, the coefficient estimates settle down to something close to the value of beta obtained using all of the data.
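A rough sketch of how such a plot could be produced in base R is below. The growing-sample loop, the lm() fit, and the object names (air for the data frame, beta_full for the full-data slope) are assumptions; the post's own computation used RevoScaleR functions on the .xdf file, as described further down.

# Sketch: refit the model on nested random samples of increasing size
set.seed(42)
n_total <- nrow(air)
step    <- 12283                         # grow each sample by ~12,000 rows
sizes   <- seq(step, n_total, by = step)

coefs <- lower <- upper <- numeric(length(sizes))
rows  <- sample(n_total)                 # one shuffled index, reused cumulatively
for (i in seq_along(sizes)) {
  fit      <- lm(ArrDelay ~ CRSDepTime, data = air[rows[1:sizes[i]], ])
  coefs[i] <- coef(fit)["CRSDepTime"]
  ci       <- confint(fit, "CRSDepTime")
  lower[i] <- ci[1]; upper[i] <- ci[2]
}

plot(sizes, coefs, type = "l", xlab = "Sample size", ylab = "Slope estimate")
lines(sizes, lower, col = "red"); lines(sizes, upper, col = "red")
abline(h = beta_full, lty = 2)           # assumed full-data value, for reference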
The rapid approach to the full-data-set value of the coefficient is even more apparent in the following graph, which shows the difference between the estimated value of the beta coefficient at each sample size and the value obtained using all of the data. From the fourth sample on, the maximum difference is 0.07, which is pretty close indeed. In cases like this, if you believed that your samples were representative of the entire data set, working with all of the data to evaluate possible models would be a waste of time and possibly counterproductive.
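Given the vector of estimates from the sketch above (again, coefs and beta_full are assumed names), the corresponding check is a one-liner:

# Largest deviation from the full-data slope, ignoring the first few samples
max(abs(coefs[-(1:3)] - beta_full))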
I am certainly not arguing that one never wants to use all of the data. For one thing, when scoring a model or making predictions the goal is to do something with every record. Moreover, in more realistic modeling situations where there are thousands of predictor variables, 120 million observations might not be enough data to conclude anything: a large model can consume degrees of freedom very quickly and severely limit the ability to make any kind of statistical inference. I do want to argue, however, that with large data sets the ability to work with random samples confers the freedom to examine several models quickly, with considerable confidence that the results are decent estimates of what would be obtained using the full data set.
I did the random sampling and regressions in my little example using functions from Revolution Analytics' RevoScaleR package. First, all of the data were read from the CSV files that make up the FAA data set into the binary .xdf file format used by RevoScaleR. The random samples were then selected using RevoScaleR's rxDataStep function, which is designed to manipulate large data sets quickly. The code below reads each record, draws a random integer between 1 and 9,999, and assigns it to the variable urns:
rxDataStep(inData = working.file,      # the .xdf file holding all ~120M records
           outFile = working.file,     # write the new urns column back to the same file
           transforms = list(urns = as.integer(runif(.rxNumRows, 1, 10000))),
           overwrite = TRUE)
Random samples for each regression were drawn by looping through the appropriate values of the urns variable. Notice how the call to R's runif() function happens within the transforms parameter of rxDataStep. The full regression took about 33 seconds on my laptop, which made it feasible to undertake the extravagant number of calculations needed for the 1,000 regressions in a few hours after dinner.
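The sampling loop itself isn't shown in the post, but one plausible shape for it, assuming RevoScaleR's rxLinMod with a rowSelection on the urns column, is sketched below; the cutoff arithmetic, object names, and coefficient extraction are my own assumptions rather than the original code.

# Sketch: with urns roughly uniform on 1..9999 over ~120M rows, selecting
# urns <= k yields a random sample of about k * 12,000 records.
slopes <- numeric(1000)
for (k in 1:1000) {
  fit <- rxLinMod(ArrDelay ~ CRSDepTime,
                  data = working.file,
                  rowSelection = urns <= kk,
                  transformObjects = list(kk = k))   # pass the loop index into the selection
  slopes[k] <- fit$coefficients["CRSDepTime", 1]     # slope estimate (extraction may vary by version)
}

Each call scans the .xdf file chunk by chunk, so running 1,000 such fits is mainly a matter of elapsed time rather than memory.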
I think there are three main takeaways from this exercise:
- Lots of data does not necessarily equate to "Big Data".
- For exploratory modeling you want to work in an environment that allows for rapid prototyping and provides the statistical tools for model evaluation and visualization. There is no better environment than R for this kind of work, and Revolution's distribution of R offers the ability to work with very large samples.
- The ability to draw random samples from large data sets is the way to balance accuracy against computational constraints.
To my way of thinking, the single most important capability to implement in any large-scale data platform that is going to support sophisticated analytics is the ability to quickly construct high-quality random samples.