I have always thought it odd that statisticians who live and die by the formal machinery of Neyman-Pearson hypothesis testing will also examine residual plots and qqplots to assess the validity of their models. To be fair, looking at plots is far from being all that is done, but still – looking? Even after having gained some experience myself in these matters, there still lingers a cognitive dissonance. Think of all the anguish associated with setting up the experiment, choosing the level of the test and steeling oneself to accept the grim tyranny of the p-value. Given this, how could anyone without the prescience of a guild navigator be guided by just looking?

Well, I got over it. Now, I hate to build any model without looking at something. But this need to look has made working with big data files emotionally challenging: it’s just not practical to plot galaxies of millions or billions of points. To feel better about things, I have taken to sampling.

The code for this post, adapted from Maindonald and Braun’s book “Data Analysis and Graphics Using R”, uses Revolution Analytics’ RevoScaleR package to run a simple linear regression on the entire airlines challenge data set (120M+ observations) and then samples from an Xdf file containing the residuals to produce lots and lots of qqplots. Each graph contains a qqplot of the sampled residuals in the lower left corner, surrounded by 7 reference plots drawn from a normal distribution whose mean and variance equal the sample statistics. Figure 1 is one such cluster of plots. By itself, it may not mean much, but if thousands of the samples looked like this, I would be inclined to say that the residuals are not close to being normal.
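For readers who want to try the layout without RevoScaleR, here is a minimal base R sketch of the idea. It is not the code described above: the residuals are a made-up skewed stand-in (`fake_resid`), and the helper names `make_qq_panels` and `plot_qq_panels` are hypothetical. With real data you would substitute a sample drawn from your own residuals.

```r
set.seed(42)

# Stand-in for real regression residuals (an assumption for illustration):
# a centred exponential, so right-skewed and clearly not normal.
fake_resid <- rexp(1e5, rate = 1) - 1

# Hypothetical helper: draw one sample of residuals and 7 normal reference
# samples with the same size, mean and standard deviation as that sample.
make_qq_panels <- function(resid, n = 1000) {
  samp <- sample(resid, size = min(n, length(resid)))
  refs <- replicate(7,
                    rnorm(length(samp), mean = mean(samp), sd = sd(samp)),
                    simplify = FALSE)
  names(refs) <- paste("reference", 1:7)
  c(list(residuals = samp), refs)
}

# Hypothetical helper: draw the 8 qqplots in a 2 x 4 grid. The grid fills
# row by row, so the fifth slot is the lower left corner; the residual
# panel goes there and the references fill the other seven slots.
plot_qq_panels <- function(panels) {
  old <- par(mfrow = c(2, 4), mar = c(4, 4, 2, 1))
  on.exit(par(old))
  for (i in c(2:5, 1, 6:8)) {
    qqnorm(panels[[i]], main = names(panels)[i])
    qqline(panels[[i]])
  }
}

panels <- make_qq_panels(fake_resid, n = 1000)
plot_qq_panels(panels)
```

Calling `plot_qq_panels(make_qq_panels(fake_resid))` repeatedly gives a fresh sample and fresh reference panels each time, which is the "look at many samples" routine in miniature. If the residual panel keeps standing out from its seven neighbours, that is the visual evidence against normality.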

If you like the idea, adapt this code to navigate through your favorite galactic-size residual space, prepare the orange spice and, guided by your intuition, gaze intently for an afternoon. (It takes my 8-core Dell laptop about 13 seconds to paint the screen with each plot.)

Figure 1: qqplots for a sample of residuals from the regression of arrival delay on arrival time, lm(ArrDelay ~ ArrTime)
