A New York Times article yesterday discovers the 80-20 rule: that 80% of a typical data science project is sourcing, cleaning, and preparing the data, while the remaining 20% is actual data analysis. The article gives short shrift to this important task by calling it "janitorial work", but whether you call it data munging, data wrangling, or anything else, it's a critical part of data science. I'm in agreement with Jeffrey Heer, professor of computer science at the University of Washington and a co-founder of Trifacta, who is quoted in the article saying,
“It’s an absolute myth that you can send an algorithm over raw data and have insights pop up.”
As an illustration of this point, check out the essay by Julia Evans, Machine learning isn't Kaggle competitions (hat tip: Drew Conway). A Kaggle competition typically presents a nice, clean, regularized data set to the competitors, but this isn't representative of the real-world process of making predictions from data. As Julia points out:
Cleaning up data to the point where you can work with it is a huge amount of work. If you’re trying to reconcile a lot of sources of data that you don’t control like in this flight search example, it can take 80% of your time.
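To make the reconciliation problem Evans describes concrete, here's a minimal sketch in Python (using only the standard library; the flight data, column names, and formats are invented for illustration). Two hypothetical sources record the same flights with different delimiters, column names, date formats, and casing, and none of them line up until you normalize everything:

```python
import csv
import io

# Two hypothetical "sources" of flight data with inconsistent conventions:
# different delimiters, column names, date formats, and airport-code casing.
source_a = """flight,origin,dep_date
BA 117,lhr,2014-08-19
ua 900,SFO,2014-08-20
"""

source_b = """FlightNo;From;Date
BA117;LHR;19/08/2014
DL42;JFK;20/08/2014
"""

def normalize_flight(code):
    """Strip spaces and uppercase, so 'BA 117' and 'BA117' match."""
    return code.replace(" ", "").upper()

def normalize_date(d):
    """Coerce both YYYY-MM-DD and DD/MM/YYYY into ISO format."""
    if "/" in d:
        day, month, year = d.split("/")
        return f"{year}-{month}-{day}"
    return d

def clean(text, delimiter, colmap):
    """Read one source and map its idiosyncratic columns onto a common schema."""
    rows = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [
        {
            "flight": normalize_flight(row[colmap["flight"]]),
            "origin": row[colmap["origin"]].upper(),
            "date": normalize_date(row[colmap["date"]]),
        }
        for row in rows
    ]

records = (
    clean(source_a, ",", {"flight": "flight", "origin": "origin", "date": "dep_date"})
    + clean(source_b, ";", {"flight": "FlightNo", "origin": "From", "date": "Date"})
)

# Deduplicate on the normalized (flight, date) key: the same BA117 departure
# appears in both sources under different spellings.
unique = {(r["flight"], r["date"]): r for r in records}
# len(unique) == 3: the two BA117 records collapse into one.
```

Even this toy version needs per-source column maps and ad hoc normalizers, and real feeds add missing values, encoding problems, and fields that only domain knowledge can reconcile, which is why the cleaning step swallows so much of a project.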
While there are projects underway to help automate the data cleaning process and reduce the time it takes, the task of automation is made difficult by the fact that the process is as much art as science, and no two data preparation tasks are the same. That's why flexible, high-level languages like R are a key part of the process. As Mitchell Sanders notes in a TechRepublic article,
Data science requires a difficult blend of domain knowledge, math and statistics expertise, and code hacking skills. In particular, he suggests that expert knowledge of tools like R and SAS is critical: "If you can't use the tools, you can't analyze the data."
This is a critical step to gaining any kind of insight from data, which is why data scientists still command premium salaries today, according to data from Indeed.com.
It's interesting to see something like this in a popular press article (was it also in the print version?). While it's nice to see the recognition, I would argue that the breakdown is more like 95-5 than 80-20 (plus or minus 10%.. :).
Posted by: Frank | August 19, 2014 at 16:24
There are a number of startups in this space working to solve the problem of data cleaning. Tamr (http://www.tamr.com/), Paxata (http://www.paxata.com/), and Trifacta (http://www.trifacta.com/) are good examples of efforts to ease this pain, especially in an enterprise setting.
Posted by: James | August 20, 2014 at 14:38
It sure is, and unfortunately it's a task where organizations still struggle. According to a recent IDG SAS survey, only 10 percent say they are extremely capable in this area.
Peter Fretty
Posted by: Pfretty | August 28, 2014 at 09:06