There's a reason why data scientists spend so much time exploring data using graphics. Relying only on data summaries like means, variances, and correlations can be dangerous, because wildly different data sets can give similar results. This is a principle that has been demonstrated in statistics classes for decades with Anscombe's Quartet: four scatterplots that, despite being qualitatively different, all share the same means and variances of x and y and the same correlation between them.
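For readers who want to see the numbers, here is a minimal Python sketch that computes those summaries for each of the four sets. It uses only the standard library; the values are the classic Anscombe data (the same numbers R ships as `anscombe`), typed in directly rather than loaded from a package, and the helper name `summarize` is just an illustrative choice.

```python
import math

# Anscombe's quartet (Anscombe, 1973); the same values R ships as `anscombe`.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # x is shared by sets I-III
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return the means and sample variances of x and y, plus Pearson's r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),
        "var_y": syy / (n - 1),
        "r": sxy / math.sqrt(sxx * syy),
    }


# Print one row of summaries per data set so they can be compared by eye.
for name, (xs, ys) in quartet.items():
    stats = summarize(xs, ys)
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The x summaries match exactly across the four sets, and the y summaries and correlations agree to within about 0.01, even though the scatterplots look nothing alike.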

(You can easily check this in R by loading the data with `data(anscombe)`.) But what you might not realize is that it's possible to generate bivariate data with a given mean, standard deviation, and correlation in *any* shape you like, even a dinosaur:

The paper linked below describes a method of perturbing the points in a scatterplot, moving them towards a given shape while keeping the statistical summaries close to their fixed target values. The shapes include a star, a cross, and the "DataSaurus" (first created by Alberto Cairo). The authors have published a dataset they call the "DataSaurus Dozen" (also available as an R package on GitHub, with thanks to Steph Locke) containing the 12 scatterplots shown. Interestingly, even the transitional frames in the animations above maintain the same summary statistics to two decimal places. Python was used to generate the data sets (and the code should be available at the link below soon).
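To make the idea concrete, here is a rough Python sketch of that kind of perturbation loop. It is not the authors' code: the circular target, the step size, the linear cooling schedule, and every function name (`summary`, `same_stats`, `circle_distance`, `perturb_towards_circle`) are assumptions chosen for illustration. Each iteration nudges one random point, keeps the nudge if it moves the point closer to the target shape (or, occasionally, at random while the "temperature" is high), and then undoes it if any summary statistic no longer rounds to the same two decimal places.

```python
import math
import random


def summary(points):
    """Return (mean x, mean y, sd x, sd y, Pearson r) for a list of (x, y)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return (mx, my, math.sqrt(sxx / (n - 1)), math.sqrt(syy / (n - 1)),
            sxy / math.sqrt(sxx * syy))


def same_stats(a, b, decimals=2):
    """True if every statistic agrees after rounding to `decimals` places."""
    return all(round(u, decimals) == round(v, decimals) for u, v in zip(a, b))


def circle_distance(point, center=(50.0, 50.0), radius=21.0):
    """Distance from a point to the target circle (an assumed target shape)."""
    return abs(math.hypot(point[0] - center[0], point[1] - center[1]) - radius)


def perturb_towards_circle(points, iterations=20000, step=0.5, seed=0):
    """Nudge points towards the circle while holding the summaries fixed."""
    rng = random.Random(seed)
    points = list(points)          # work on a copy of the input
    target = summary(points)       # statistics that must be preserved
    for i in range(iterations):
        temperature = 0.4 * (1 - i / iterations)  # assumed linear cooling
        k = rng.randrange(len(points))
        old = points[k]
        new = (old[0] + rng.gauss(0, step), old[1] + rng.gauss(0, step))
        closer = circle_distance(new) < circle_distance(old)
        # Accept moves that help, plus some that do not while still "hot".
        if not (closer or rng.random() < temperature):
            continue
        points[k] = new
        if not same_stats(summary(points), target):
            points[k] = old        # statistics drifted: undo the move
    return points


if __name__ == "__main__":
    rng = random.Random(1)
    cloud = [(rng.gauss(50, 15), rng.gauss(50, 15)) for _ in range(100)]
    ring = perturb_towards_circle(cloud)
    print("before:", [round(v, 2) for v in summary(cloud)])
    print("after: ", [round(v, 2) for v in summary(ring)])
```

The radius of 21 is picked to be roughly the square root of 2 times the starting standard deviation of 15, so that a ring is compatible with the spread being preserved. Plotting `cloud` and `ring` side by side should show the points drifting towards the circle while the printed summaries stay the same.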

Read the paper linked below for more details, and always remember: look at your data!

AutoDesk Research: Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing

Impressive, although keep in mind the near-zero correlation in the examples, which makes the problem more tractable. Drawing a picture of a dinosaur with a correlation of 0.81 would be much more difficult, I think.

Posted by: Kevin Wright | May 02, 2017 at 09:02

Kevin,

These examples should be mandatory for every freshman course in data analysis.

Thank you for making me aware of how I can lie with statistics ;-)

Posted by: Randy Betancourt | May 02, 2017 at 14:57