Just what is Data Science, anyway? Here's one take:
Ever since the term "Data Scientist" was coined by DJ Patil and Jeff Hammerbacher in 2009, there's been a vigorous debate on what the term actually means. More than 80% of statisticians consider themselves data scientists, but Data Science is more than just Statistics. (My own take is that Data Science is a valuable rebranding of computer science and applied statistics skills.)
To help bring clarity to the issue, Data Scientist and R user Harlan Harris has published a great presentation he gave at the Data Science DC meetup group, "What is Data Science anyway?". The presentation recaps the key data science discussions over the last few years, from Hal Varian ("the sexy job in the next 10 years will be Statistics"), Mike Driscoll ("sexy skills of data geeks"), Nathan Yau ("data scientists: people who can do it all"), Mike Loukides ("Data science enables the creation of data products"), Hilary Mason ("Data science is clearly a blend of the hackers"), Drew Conway ("The Data Science Venn Diagram") and many others.
In fact, the entire presentation serves as a literature review for the birth of "Data Science" as a concept, and would make excellent fodder for the "Data Science" page on Wikipedia which, sadly, is still a blank page.
One thing that seems certain: Data Science is here to stay. Companies are clamoring to hire people with data science skills, and the excitement at data science events like the recent Strata conference in New York is palpable. (This review of the conference even says that "Data Scientists are the new rock stars of the technology world.") As with all new concepts, the definition of "data science" may seem a bit cloudy now, but I'd wager that in 5 years or less "Data Scientist" will be as natural a job title as "Product Manager" or "Engineer". Harlan notes a most apposite analogy made by John Cook, who looked at the "computer programmer" role:
"Take an expert programmer back in time 100 years. What are his skills? Maybe he’s pretty good at math. He has good general problem solving skills, especially logic. He has dabbled a little in linguistics, physics, psychology, business, and art. He has an interesting assortment of knowledge, but he’s not a master of any recognized trade."
Apply the same quote to "data scientists", and I'd wager that in just 5 years or so we'll look at all the data scientists working around us and wonder how companies ever survived without them.
Harlan shares further thoughts on his presentation at the link below.
Harlan Harris: Data Science, Moore’s Law, and Moneyball
Thanks for the summary. I'm having a hard time adjusting to the term "data scientist" because, to me, it is what statisticians have been doing for years. I would argue that Tukey and Cleveland were "data scientists" way back when. Perhaps they are the "Fathers of Data Science"?
One thing that I hope: that we, as statisticians, do not surrender the moniker to other groups such as graphics designers. I hope that the terms "data scientist" and "statistician" remain linked in the eyes of the media and the public. We can all help achieve that goal by being active participants in the discussions and topics that you've presented here, and by making sure that the media understands that statistics is the science behind understanding data.
Posted by: Rick Wicklin | September 29, 2011 at 06:41
Data Science is what used to be called Operations Research. The current jargon is largely a creation of O'Reilly publishing, just like Web 2.0 was.
As such, it's about a way to sell stuff to marginally educated folk who hope to be well paid. We saw similar with ISO-9000 and Six Sigma. This is not much different.
Posted by: Robert Young | September 29, 2011 at 10:46
If you were designing a curriculum for an undergraduate major in data science, what would it include? Here's what I'd have:
-Intro CS sequence, with a significant algorithms component (probably 2 courses)
-Intro Stats sequence (also 2 courses, at least when I was an undergrad, one for prob/stats, one for econometrics)
-Course on data structures, probably in the CS dept
-Course on model building, like an advanced econometrics course or something with a big time series component
-Course on Machine Learning (ideally designed to minimize overlap with model building)
-Course on Data Visualization (nothing like this was offered when I was an undergrad, but maybe this could be taught from Tufte books?)
At the end the person should somehow have a working knowledge of SQL, Java/C++, R and possibly Hadoop, though that could be too advanced for an undergrad curriculum. Maybe not though.
Your thoughts?
Posted by: Mike Nute | October 03, 2011 at 10:43
> If you were designing a curriculum for an undergraduate major in data science, what would it include?
A Kaggle-in-Class competition: http://inclass.kaggle.com
Posted by: Angus Hammond | October 03, 2011 at 12:32
> If you were designing a curriculum for an undergraduate major in data science, what would it include?
Nice list, Mike. Maybe something on text analytics/information retrieval, and I'd definitely add a Kaggle In Class competition: http://inclass.kaggle.com
Posted by: Angus Hammond | October 03, 2011 at 12:33
I am rather surprised that the proponents of the data science movement do not pay homage to Cleveland's paper entitled "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics" in ISI Review. It's #6 (along with PS and PDF links) here: http://stat.bell-labs.com/wsc/publish.html
Notice that the term is not brand new, as this paper was written in 2001.
Posted by: John Ramey (@ramhiser) | October 03, 2011 at 15:55
John, thank you for posting that! I wasn't aware of Cleveland's paper, and I share your surprise that so few others are aware of it either. Cleveland proposes a curriculum for applied statistics ("data science") that incorporates a wide-ranging and computationally sophisticated set of statistical topics. On the other hand, he mentions few ideas from other fields that I think a more current view of Data Science would want to incorporate. In particular, programming skills, the Machine Learning point of view, Operations Research, and Data Visualization make minimal contributions to his proposed curriculum. That said, his suggestion to explicitly include Pedagogy makes a lot of sense, as "communication" is a key component of data science projects.
Posted by: Harlan | October 05, 2011 at 11:48
It's been a few months, and Revolutions is dipping its toes into PL/R-like integrators. Too bad that RDBMS hasn't been mentioned. There is no other actual data model out there. Hierarchy is ad hoc, as are the various "NoSQL" implementations of the same. If you want to corral exploding data, don't blow it up in the first place.
Posted by: Robert Young | March 02, 2012 at 12:57
As a programmer, I find myself having to do the kind of operations outlined here. Now I know what it's called: Data Science.
Imagine how valuable this kind of research would be for large companies, to have skilled programmers and mathematicians analyzing their data sets and identifying key metrics, patterns and reporting elements. Think I might add this as a point of interest on my resume.
Posted by: Marcus from Simple 1300 Numbers | November 11, 2012 at 22:31
The US National Science Foundation used the term "Data Scientist" as far back as its September 2005 article.
See the link: http://www.nsf.gov/pubs/2005/nsb0540/ or
http://www.nsf.gov/pubs/2005/nsb0540/nsb0540.pdf
So it is wrong to say that DJ Patil and Jeff Hammerbacher coined this term in 2009, as claimed on a few internet sites.
Posted by: Sree | March 07, 2013 at 05:07