by Joseph Rickert
Every month I look forward to getting my copy of AMSTATNEWS, the monthly magazine of the American Statistical Association, in the mail. This July, I was both pleased and bemused by ASA President Marie Davidian’s article Aren’t We Data Science? I was pleased to see a follow-up to last month’s article The ASA and Big Data, but mystified by the overall tone of the article. It portrays mainstream academic statisticians as being left behind by the rise of Big Data, some trapped in moribund departments that are unwilling to change, overlooked by university administrators who see them as “small data” scholars without the tools for “Big Data” and “Big Questions”, and surprised to find out that they, indeed, are not data science. Midway through the article President Davidian asks: “What skills does a statistician need to engage in data science activities, and how should we be preparing statistics students?” A good bit of what follows is a call to arms, exhorting statisticians to immerse themselves in “real-world” problems, participate in data science meetup groups and reach out to local business and research organizations to accumulate case studies for their students. The article closes with the dramatic and sobering suggestion that statisticians ask themselves: “How would you feel if there were no departments of statistics 50 years from now?”
What really surprised me in President Davidian’s article was the mere passing reference to R. R was allocated less text than Python. I have nothing at all against Python, but R is a fundamental tool of modern computational statistics that provides the very bridge to data science that President Davidian is seeking. In recent years, survey after survey has singled out R as being one of the top tools for data scientists and R is the go-to tool for Kaggle data mining competitions.
Not only do statisticians already have a way into data science, but R developers are relentless in their drive to extend the reach of R into Big Data. This July’s useR! conference, for example, included presentations on: ffbase, a package for dealing with data sets too big for memory; bigvis, a package for visualizing large data sets based on the sound statistical principles of aggregation and smoothing; a tutorial on the Rcpp package, which makes it relatively easy for a statistician with intermediate-level R skills to harness the speed of C++ for computationally challenging data science problems; and a case study of a Big Data analytics firm using the GLM implementation in the RevoScaleR package to apply survival analysis techniques to massive internet marketing datasets. (Note that Revolution Analytics makes the production-level RevoScaleR package available to academics for free. Students are just a download away from getting experience with the very tools used by industry to do statistics on Big Data.)
The fact that President Davidian overlooked the value of R to Big Data and data science is unfortunate but probably reflects the relatively low status accorded to software development in statistics departments. Academic statisticians are among the most tireless and prolific contributors to the open source R project. However, it is my impression that they receive little or no formal recognition for their contributions to R from their respective departments of statistics. Although my responsibilities with Revolution Analytics require that I have considerable contact with academic statisticians, I am not myself an academic and perhaps not in a position to see what really makes for a successful statistics department. Nevertheless, from the perspective of a person working on “real world” problems it is difficult to see why a paper cited a couple of hundred times over the course of several years should be reckoned to have more impact than an R package that sees daily use by thousands of statisticians and data scientists. Certainly it would be helpful for the ASA to sponsor a “conference on statistics and data science featuring top data scientists and statisticians as speakers” as President Davidian suggests. However, if departments of statistics want to improve their chances of being around in 50 years they could make a bigger investment in their future by recognizing and encouraging the contributions of R developers and other tool makers.
If I may be so bold: Data Science is about calculating the parameters of census data. Census in the technical meaning: all the individuals are measured. Baby stat books, in the first chapters, introduce the baby stat student to Descriptive Statistics. These really aren't statistics, but just the calculated parameters.
Big Data & Data Science are not engaged in statistics, by and large. They're just census takers, trying to make some sense of the tsunami of bytes they've unleashed on themselves. Big Data & Data Science are largely engaged in finding unknown needles in multitudes of haystacks. They aren't doing statistics. And statistics departments could cover what little "theory" might be involved in a week's worth of classes.
The Big Data & Data Science zealots have gotten all dewy eyed over the NSA vacuum cleaner, lining up to get some access to those Cray machines, using Map/Reduce to find some needle. Credit card companies and insurance companies and the like employ them to do similarly: find the 1 in a 1,000,000 client that might, just might, end up costing them a bundle. A complete waste of money and time and effort. Just because we can look for needles in haystacks doesn't mean that such digging is worthwhile. For the cost expended, those had better be some mighty large platinum needles.
Bah, humbug.
Posted by: Robert Young | August 01, 2013 at 18:22
Thanks for the post and for highlighting the latest things in R presented at the useR! 2013 conference. I think one thing that might also help is to have presentations from useR! conferences available for those not able to attend. For example, with Python conferences (PyCon and SciPy), videos of several tutorials and talks are available just after the conference and serve as very useful resources (www.pyvideo.org). Even if it is hard to have videos, just the presentations would be helpful (like the ones from the R in Insurance conference, whose organizers posted the presentations on GitHub).
Posted by: Shankar | August 01, 2013 at 21:37
To some degree, Data Science is simply the v2.0 name for Data Mining. Data Mining has a (reasonably well-deserved) reputation for being ad hoc and over-promising, so smart practitioners had to distance themselves from that.
A lot of people are making lots of money right now based on the hype of Big Data, because they know the "how" (Hadoop, et al), not necessarily the "what". As we've seen with other technologies, this kind of thing will eventually become a commodity. Hadoop makes a particular class of clustered tasks easy, and Pig, Mahout, etc, build on Hadoop. Eventually, the R 'foreach' package will build on one of these technologies, and Revolution will do that and more, and the pendulum will again swing back towards the "what" instead of the "how".
At which point, we'll go to name v3.0 and call the new field Massive Analytics or something like that.
Posted by: Wayne | August 02, 2013 at 06:30
The importance of R to modern statistics cannot be overstated. Those R core members deserve to have their place in the Statistics Hall of Fame.
Posted by: Jason Liao | August 02, 2013 at 07:40
Not to mention that every so-called replacement for R, I mean Julia or Python with Pandas, borrowed a lot from R (DataFrames, Factors, etc.).
Nowadays, I cannot imagine statistics or data mining without R.
I also want to thank Revolution for their work in promoting R everywhere, and especially in industry (I hope that we'll have a Linux version of Revolution pretty soon).
Posted by: dickoa | August 02, 2013 at 08:15
One item that would help further with promoting R is to have useR! presentations available online for later viewing by folks who didn't attend the conference. For example, Python conferences tend to have videos of tutorials and several talks available right after the conference (PyCon, SciPy, etc. at www.pyvideo.org). It would be nice to have at least the presentations (if not videos) of useR! conference talks. The R in Insurance conference is a good example, where a GitHub repo has slides from the conference.
Posted by: shankar | August 02, 2013 at 08:47
As a scientist with a non statistical background I was very thankful for all the R help I received from our department of Statistics.
They themselves did quite a bit of work in R and were happy to share code with us, and more importantly explain what happened. For us R helped us to handle our 'big data' from our -omics studies, so being versed in R was fairly essential in getting basic stuff done.
Their contribution on the statistical side was something we couldn't have done, but since we used the same language we could understand what they were doing and reproduce it as script kiddies ourselves.
So R avoids, for us, a black box called statistics and it allows us to be less dependent on the statistics department while having access to their toolbox and expertise.
Posted by: Nescio | August 02, 2013 at 09:04
I am a statistician with an MSc and a PhD. For 20 years I have worked on real-world problems, far away from the academic world. I'm a data scientist because I work with real problems and because I am a statistician - if I had no academic background in statistics I could not be a data scientist. I am very grateful to the developers and contributors of R - for me it was the greatest contribution to the development of data analysis - and I really cannot believe that a department of statistics would not recognize or encourage a researcher to contribute to R.
Posted by: Olga Yoshida | August 02, 2013 at 21:35
It depends on what you mean by statistician. While I call myself a data scientist, I am still a statistician. But I am very different from an ASA or university-trained statistician. So different, indeed, that it's almost like comparing a physicist and a geographer. As a result, and to avoid confusion, I have stopped publicly calling myself a statistician, though I still do privately.
Posted by: Vincent Granville | August 03, 2013 at 17:18
What concerned me about the article was the constant mention of SAS and only a passing reference to R. I've seen the back-room, lobbyist-like behavior of MATLAB and SAS to stay ingrained in academia, but I wonder how deep SAS has reached into the ASA.
Posted by: Kevin Davenport (@KevinLDavenport) | August 06, 2013 at 10:26
-- I wonder how deep SAS has reached into the ASA.
"The American Statistical Association (ASA), an 18,000-member scientific and educational society based in Alexandria, VA, has elected a longtime employee of business analytics leader SAS as its future president. Robert N. Rodriguez, Senior Director of R&D, will serve as the ASA’s 107th president when his term begins on Jan. 1, 2012. Three other SAS employees were also recently elected to positions within the association."
here: http://www.businesswire.com/news/home/20100607005155/en/SAS-Director-President-American-Statistical-Association
Deep enough?
Posted by: Robert Young | August 07, 2013 at 18:20