The terms "Data Science" and "Data Scientist" have only been in common usage for a little over a year, but they've really taken off since then: many companies are now hiring for "data scientists", and entire conferences are run under the name of "data science". But despite the widespread adoption, some have resisted the change from the more traditional terms like "statistician" or "quant" or "data analyst".
Personally, I love the term. As a statistician, I was getting tired of explaining that no, I don't spend my time writing down baseball or cricket scores. I think "Data Science" better describes what we actually do: a combination of computer hacking, data analysis, and problem solving. Pete Warden, initially resistant to the terminology, has since come around to the benefits of the phrase. (Pete, by the way, is the creator of the awesome Data Science Toolkit, an open-source server with APIs for handy data-related tasks, like identifying proper names in unstructured text, or converting street addresses to latitude/longitude.) In his post at O'Reilly Radar, he addresses the following objections to the use of the term "data science":
- Data Science is not a real science. ("Anything that needs science in the name is not a real science")
- It's an unnecessary label (why not just stick with statistician, etc.?)
- The name doesn't even make sense (what science doesn't involve data?)
- There's no definition (personally, I think Drew Conway's Data Science Venn Diagram is an excellent definition, expanded in his paper in IQT Quarterly)
Check out Pete's full post for his refutations of these points. Pete concludes by saying it's time for the community to rally around "Data Science":
I'm betting a lot on the persistence of the term. If I'm wrong the Data Science Toolkit will end up sounding as dated as "surfing the information super-highway." I think data science, as a phrase, is here to stay though, whether we like it or not. That means we as a community can either step up and steer its future, or let others exploit its current name recognition and dilute it beyond usefulness. If we don't rally around a workable definition to replace the current vagueness, we'll have lost a powerful tool for explaining our work.
O'Reilly Radar: Why the term "data science" is flawed but useful
Full disclosure: I'm a statistician as well.
I'm definitely in the anti-"Data Science"-as-a-label camp because I think it minimizes other extremely useful areas of statistics. Statistics is, at least according to Moore and McCabe, "the science of collecting, organizing, and interpreting numerical facts, which we call data." Whether you are collecting data using Python or R from some website's API or giving a survey to cancer survivors, you are collecting and possibly organizing the data. If you are making inferences with this data, you are doing statistics in either scenario.
IMO, all of statistics is inherently data science. I realize that the data science movement is trying to identify a part of statistics specific to Web 2.0-type problems. However, I think that by identifying "science" with the Web 2.0 crowd, we are indirectly de-emphasizing that statisticians working outside of Web 2.0 are doing statistical science as well, e.g. in biostatistics, renewable energy, etc.
Having said that, I did recently author the R package for the Data Science Toolkit. :) The tools that Pete made available are pretty sweet and I felt that the R community could take advantage. A brief write-up can be found here:
http://thelogcabin.wordpress.com/2011/05/02/r-and-the-data-science-toolkit/
and the code is given at github.com/rtelmore.
Cheers!
Ryan
@rtelmore
Posted by: Ryan | May 11, 2011 at 09:44
Thanks for the comments, Ryan! (I made your link clickable so it's easier for others to follow -- thanks for sharing.)
I'd counter slightly to say that "all of *applied* statistics is inherently data science" - I think one of the benefits of the term "data science" is that it implies the application of statistics to real-world problems (and not just Web 2.0 problems -- I think Data Science is broader than that).
Posted by: David Smith | August 04, 2011 at 13:49
Although the name is nice and useful, I really think it should have been called something like "data literate scientist". I am in the Operations Research department and we have two types of people: those who feel comfortable with data and those who don't. Both groups do operations research, but one of them doesn't speak the data language. The group that knows that language calls itself "data scientists", but we have almost nothing in common with the data scientist who works at the FDA, or the data scientist who works at Facebook. We are passionate about a different set of problems.
Anyways, to me data literacy is just one new skill that anybody should master, just like language or communication skills. It is not another field or profession; it is just one skill.
Posted by: Siah | August 04, 2011 at 13:57
Thank you. Very useful.
Yes, the terms "Data Science" and "Data Scientist" have only been in common usage for a little over a year, but the terms themselves were introduced and used many years ago. Some references follow:
[1] http://datascience.fudan.edu.cn/
[2] Dataology and Data Science: Up to Now [OL]. [16 June 2011] http://www.paper.edu.cn/index.php/default/en_releasepaper/content/4432156.
[3] Data Explosion, Data Nature and Dataology. In Proceedings of International Conference on Brain Informatics (BI'09). 2009.
[4] Dataology and Data Science. (in Chinese with English abstract). Fudan University Press. 2009. ISBN 978-7-309-06956-3 /T.350.
Posted by: Yun Xiong | May 17, 2012 at 14:05