September 28, 2011

Listed below are links to weblogs that reference Data Science: a literature review:

Comments

Thanks for the summary. I'm having a hard time adjusting to the term "data scientist" because, to me, it is what statisticians have been doing for years. I would argue that Tukey and Cleveland were "data scientists" way back when. Perhaps they are the "Fathers of Data Science"?

One thing I hope: that we, as statisticians, do not surrender the moniker to other groups such as graphic designers. I hope that the terms "data scientist" and "statistician" remain linked in the eyes of the media and the public. We can all help achieve that goal by being active participants in the discussions and topics that you've presented here, and by making sure that the media understands that statistics is the science behind understanding data.

Data Science is what used to be called Operations Research. The current jargon is largely a creation of O'Reilly publishing, just like Web 2.0 was.

As such, it's about a way to sell stuff to marginally educated folk who hope to be well paid. We saw something similar with ISO-9000 and Six Sigma. This is not much different.

If you were designing a curriculum for an undergraduate major in data science, what would it include? Here's what I'd have:

-Intro CS sequence, with a significant algorithms component (probably 2 courses)
-Intro Stats sequence (also 2 courses, at least when I was an undergrad, one for prob/stats, one for econometrics)
-Course on data structures, probably in the CS dept
-Course on model building, like advanced econometrics or something with a big time-series component
-Course on Machine Learning (ideally designed to minimize overlap with model building)
-Course on Data Visualization (nothing like this was offered when I was an undergrad, but maybe this could be taught from Tufte books?)

At the end, the person should somehow have a working knowledge of SQL, Java/C++, R and possibly Hadoop, though that could be too advanced for an undergrad curriculum. Maybe not, though.

Your thoughts?

> If you were designing a curriculum for an undergraduate major in data science, what would it include?

A Kaggle-in-Class competition: http://inclass.kaggle.com

> If you were designing a curriculum for an undergraduate major in data science, what would it include?

Nice list, Mike. Maybe something on text analytics/information retrieval, and I'd definitely add a Kaggle In Class competition: http://inclass.kaggle.com

I am rather surprised that the proponents of the data science movement do not pay homage to Cleveland's paper entitled "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics" in ISI Review. It's #6 (along with PS and PDF links) here: http://stat.bell-labs.com/wsc/publish.html

Notice that the term is not brand new, as this paper was written in 2001.

John, thank you for posting that! I wasn't aware of Cleveland's paper, and I share your surprise that so few others are. Cleveland proposes a curriculum for applied statistics ("data science") that incorporates a wide-ranging and computationally sophisticated set of statistical topics. On the other hand, he mentions few ideas from other fields that I think a more current view of Data Science would want to incorporate. In particular, programming skills, the Machine Learning point of view, Operations Research, and Data Visualization make minimal contributions to his proposed curriculum. That said, his suggestion to explicitly include Pedagogy makes a lot of sense, as "communication" is a key component of data science projects.

It's been a few months, and Revolutions is dipping its toes into PL/R-like integrators. Too bad that RDBMS hasn't been mentioned. There is no other actual data model out there. Hierarchy is ad hoc, as are the various "NoSQL" implementations of the same. If you want to corral exploding data, don't blow it up in the first place.

As a programmer, I find myself having to do the kind of operations outlined here. Now I know what it's called: Data Science.
Imagine how valuable this kind of research would be for large companies, to have skilled programmers and mathematicians analyzing their data sets and identifying key metrics, patterns and reporting elements. Think I might add this as a point of interest on my resume.

The National Science Foundation of the USA used the term "Data Scientist" back in its September 2005 article.
See the link: http://www.nsf.gov/pubs/2005/nsb0540/ or
http://www.nsf.gov/pubs/2005/nsb0540/nsb0540.pdf

So it is wrong to say that DJ Patil and Jeff Hammerbacher coined this term in 2009, as claimed on a few internet sites.

The comments to this entry are closed.

