*By Joseph Rickert*

“Listen Corso, there are no innocent readers anymore. Each
overlays the text with his own perverse view.
A reader is the total of all he’s read, in addition to the films and
television he’s seen. To the information supplied by the author he’ll always add
his own.” (Arturo Pérez-Reverte, *The Club Dumas*)

And so it is with data scientists. There are no innocent
data scientists. All data analysis, even
an exercise in simple descriptive statistics, requires bringing a vast amount
of information to the analysis. First, there is the metadata: information about
the structure of the data, how and when it was collected and to what end. Then,
there is the information that the data scientist brings. This includes information
about the intent of the analysis and the suitability of the data to support
that intent, as well as the assumptions that the data scientist is
compelled to make even if he or she does not feel compelled to articulate
them explicitly. What population does the data set represent? Is it a sample or the entire population? Is it
a random sample? Are the observations independent? What assumptions need to be
tested?

Next, there is information supporting the choice of analytic techniques
that may seem appropriate; information about the assumptions underlying those
techniques; information about similar analyses conducted in the past and so
on. Even in the simplest case, a full
accounting of all of the information brought by a data scientist to an analysis
would show the data embedded in a vast matrix of supporting information, much
of it subjective, contingent and inextricably tangled with the experience of
the data scientist: the residue of every data set seen, every model built,
every insight, every blind alley.

“Let the data speak for themselves” is an utterance one
still hears with some frequency these days. I take it as an expression of naive
optimism: if we would just get out of the way and let the data speak, we would
get much closer to the truth. Classically trained statisticians know this to be
a dangerous dream. For many, if not most of this “small data – big inference”
group, there are no data until the analysis has been conceptualized. First, one
must design the experiment, select a method or model based on an *a priori*
understanding of the goals of the analysis and the experimental conditions,
formulate the null hypothesis, settle on the tests for significance and then,
and only then, collect the data. Under this set-up the data are constrained to
commenting on a very narrowly scripted story. Any elaboration could only be an
exercise in self-deception and bad science.
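The workflow just described can be sketched in a few lines of Python. Everything below is a hypothetical illustration: the group names, sample size, and effect size are invented, and the only point is the ordering of the steps, with data collection coming last.

```python
import random
from statistics import NormalDist, mean, variance

# 1. Design the experiment: two groups, sample size chosen in advance.
n = 500
# 2. Formulate the null hypothesis: the two group means are equal.
# 3. Settle on the test (two-sample z-test) and significance level a priori.
alpha = 0.05

# 4. Then, and only then, collect the data (simulated here with an
#    invented treatment effect of 0.5).
random.seed(42)
control = [random.gauss(0.0, 1.0) for _ in range(n)]
treated = [random.gauss(0.5, 1.0) for _ in range(n)]

# The pre-specified test: a normal approximation is reasonable at this n.
se = (variance(control) / n + variance(treated) / n) ** 0.5
z = (mean(treated) - mean(control)) / se
p = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, p = {p:.4g}, reject H0 at alpha={alpha}: {p < alpha}")
```

The data here can only answer the one question that was scripted for them in steps 1 through 3; nothing in the output licenses any other conclusion.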

Data scientists have also been aware of these issues for
some time. More than a decade ago, in his paper *Statistical Modeling: The Two
Cultures*, Leo Breiman, probably the first
superstar data miner, described what he called the Rashomon Effect: for large,
complex data sets there are often many models, each using different variables
that meet the same performance criteria. Leo writes: “there is often a
multitude of different descriptions ... in a class of functions giving about the
same minimum error rate”, and goes on to describe a hypothetical example in
which three models, each built with different variables from the same data set,
achieve residual sums of squares or test set errors within 1% of each
other. Leo comments: “Which one is better? The problem is that each one tells a
different story about which variables are important”.

The data never really speak for themselves. At best, they tell a plausible story for
which they have been well-rehearsed.

For a very entertaining and topical presentation of some of
the issues raised here have a look at Kate Crawford’s keynote address at Strata
last week.