As a discipline, Data Science is growing up fast. That's my key takeaway from the 2012 Data Science Summit.
At the inaugural 2011 Data Science Summit (you can see some highlights in this recap video), the focus was on the Big Data part of Data Science: issues with streaming data, how to store big data, technology platforms, that kind of thing. This year's summit was much more focused on the "Science" part of Data Science: applications of Big Data, and statistical issues related to the analysis of Big Data. A few examples:
- Nate Silver (political forecaster for the NYT) talked not just about building models and making predictions, but also the importance of, in his words, "embracing uncertainty". A prediction often isn't useful without an assessment of its uncertainty (or risk). He gave this real-life example: a flood-level prediction of 49 feet doesn't mean a city can rest easy because the levees are 51 feet high. The weather service failed to mention that there was a plus-or-minus 9-foot margin of error on that prediction, or about a 50-50 chance the city would be flooded. (It was.)
- Michael Chui (author of the McKinsey Big Data report) said that schools should be teaching more Statistics, and less Calculus, so that graduates have a better grasp of issues like sampling and selection bias.
- Michael Brown (CTO of ComScore) talked about the need to understand the impact of recall bias and outliers.
- Jeremy Howard (Chief Data Scientist of Kaggle) warned of the dangers of observation bias inherent in "data exhaust", and extolled the benefits of statistical experiments to distinguish between causality and correlation.
- Tony Jebara (co-founder of Sense Networks) expressed the need for the focus of predictive analytics to graduate from mere accuracy to making models interpretable, and predictions actionable.
- Hadley Wickham (R package author and educator) described the variety of application areas for Data Science, from cheesemakers to airport designers, and from sports teams to cruise lines.
These are all important statistical issues, which until recently have taken a back seat to the technological and operational issues of data science. It's great to see the practice maturing, and this new focus will lead to data applications which are not just more powerful, but more reliable and more impactful as well. Data Science has come of Statistical age.
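To make the flood-level example concrete, here is a minimal sketch of how a point forecast plus a margin of error turns into a risk estimate. It rests on assumptions that are mine, not the talk's: the flood crest is treated as normally distributed around the 49-foot forecast, and the 9-foot margin is read two ways, as one standard deviation and as a 95% half-width. The function name `overtop_probability` is hypothetical.

```python
from statistics import NormalDist

def overtop_probability(forecast_ft, levee_ft, sd_ft):
    """Chance the crest exceeds the levee, ASSUMING crest ~ Normal(forecast, sd)."""
    crest = NormalDist(mu=forecast_ft, sigma=sd_ft)
    return 1.0 - crest.cdf(levee_ft)

forecast_ft = 49.0   # predicted flood level (from the example)
levee_ft = 51.0      # levee height (from the example)
margin_ft = 9.0      # quoted plus-or-minus margin of error

# Reading 1 (assumption): the margin is one standard deviation.
p_if_sd = overtop_probability(forecast_ft, levee_ft, margin_ft)

# Reading 2 (assumption): the margin is a 95% half-width, so sd = margin / 1.96.
p_if_95 = overtop_probability(forecast_ft, levee_ft, margin_ft / 1.96)

print(f"margin as 1 sd:           {p_if_sd:.0%} chance of overtopping")
print(f"margin as 95% half-width: {p_if_95:.0%} chance of overtopping")
```

Under these particular assumptions the estimate lands somewhere between roughly one in three and two in five, and a different distribution or a different reading of the margin would move it. Either way the point of the example holds: a 2-foot cushion against a 9-foot margin of error is no reason to rest easy.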