By Joseph Rickert
“Listen Corso, there are no innocent readers anymore. Each overlays the text with his own perverse view. A reader is the total of all he’s read, in addition to the films and television he’s seen. To the information supplied by the author he’ll always add his own.” (Arturo Pérez-Reverte: The Club Dumas)
And so it is with data scientists. There are no innocent data scientists. All data analysis, even an exercise in simple descriptive statistics, requires bringing a vast amount of information to the analysis. First there is the metadata: information about the structure of the data, how and when it was collected, and to what end. Then there is the information that the data scientist brings. This includes information about the intent of the analysis and the suitability of the data to support that intent, as well as the assumptions that the data scientist is compelled to make, even if he or she does not feel compelled to articulate them explicitly. What population does the data set represent? Is it a sample or the entire population? Is it a random sample? Are the observations independent? What assumptions need to be tested?
Next, there is information supporting the choice of analytic techniques that may seem appropriate; information about the assumptions underlying those techniques; information about similar analyses conducted in the past and so on. Even in the simplest case, a full accounting of all of the information brought by a data scientist to an analysis would show the data embedded in a vast matrix of supporting information, much of it subjective, contingent and inextricably tangled with the experience of the data scientist: the residue of every data set seen, every model built, every insight, every blind alley.
“Let the data speak for themselves” is an utterance one still hears with some frequency these days. I take it as an expression of naive optimism: the idea that if we would just get out of the way and let the data speak, we would get much closer to the truth. Classically trained statisticians know this to be a dangerous dream. For many, if not most, of this “small data – big inference” group, there are no data until the analysis has been conceptualized. First, one must design the experiment, select a method or model based on an a priori understanding of the goals of the analysis and the experimental conditions, formulate the null hypothesis, settle on the tests for significance, and then, and only then, collect the data. Under this setup the data are constrained to commenting on a very narrowly scripted story. Any elaboration could only be an exercise in self-deception and bad science.
Data scientists have also been aware of these issues for some time. More than a decade ago, in his paper Statistical Modeling: The Two Cultures, Leo Breiman, probably the first superstar data miner, described what he called the Rashomon Effect: for large, complex data sets there are often many models, each using different variables, that meet the same performance criteria. Leo writes: “there is often a multitude of different descriptions ... in a class of functions giving about the same minimum error rate”, and goes on to describe a hypothetical example in which three models, each built with different variables from the same data set, yield residual sums of squares or test-set errors within 1% of each other. Leo comments: “Which one is better? The problem is that each one tells a different story about which variables are important”.
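The Rashomon Effect is easy to reproduce on synthetic data. The sketch below (a hypothetical illustration, not Breiman's own example) builds three highly correlated predictors and fits a separate ordinary least-squares model on each; all three achieve nearly the same residual sum of squares, yet each "story" credits a different variable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# One latent signal observed through three noisy, highly correlated proxies.
z = rng.normal(size=n)
x1 = z + 0.02 * rng.normal(size=n)
x2 = z + 0.02 * rng.normal(size=n)
x3 = z + 0.02 * rng.normal(size=n)
y = 2.0 * z + rng.normal(size=n)

def rss(x, y):
    """Residual sum of squares of a one-variable OLS fit with intercept."""
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

# Three competing models, each using a different variable.
models = {"x1": rss(x1, y), "x2": rss(x2, y), "x3": rss(x3, y)}
best = min(models.values())
for name, val in models.items():
    print(f"model using {name}: RSS {100 * (val / best - 1):.2f}% above the best")
```

Each model's error sits within a fraction of a percent of the others, so the performance criterion alone cannot say which variable "really" matters.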
The data never really speak for themselves. At best, they tell a plausible story for which they have been well-rehearsed.
For a very entertaining and topical presentation of some of the issues raised here, have a look at Kate Crawford’s keynote address at Strata last week.