By Fang Zhou, Microsoft Data Scientist; Hong Ooi, Microsoft Senior Data Scientist; and Graham Williams, Microsoft Director of Data Science
Education is a relatively late adopter of predictive analytics and machine learning as a management tool. A keen desire for improving educational outcomes for society is now leading universities and governments to perform student predictive analytics to provide better-informed and timely decision making.
Student predictive analytics often aims to solve two key problems:
Education systems face enormous diversity across regions and countries. Two case studies demonstrate the novel and unique landscape for machine learning in the education world.
Microsoft Data Scientists assisted with the analysis in both cases and we present details below with R code provided in a git repository to replicate the modelling on artificial data.
NAPLAN is a standardized testing system used by all schools in Australia to assess students’ basic skills—reading, writing, grammar, spelling and numeracy. A majority of students take the five tests in years 3, 5, 7, and 9. A goal of this use case is to identify talent based on NAPLAN test scores and to set individual targets across school cohorts.
The data were collected from 83,000 students across almost 140 schools in a major city. The data included information about yearly NAPLAN testing, student demographics, school records and school attributes.
We addressed the task as a regression problem, taking random effects for student and school into account. The lmer function from the R package lme4 was used to fit a mixed effects regression model to the NAPLAN test score data.
With this mixed-effects regression model we can measure the influence of the fixed effects in the presence of variation in students and schools, as well as fairly assess the quality of a student or a school while taking other factors into account. It is observed that students/schools with very similar characteristics can perform quite differently in NAPLAN tests. Also, a school/student with poor or with good NAPLAN scores can be characterized by combinations of variables exposed in the data.
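To make the approach concrete, here is a minimal sketch of a mixed-effects fit with lmer on simulated data. The column names (score, year, student, school), the simulated effect sizes, and the model formula are all assumptions chosen for illustration; they are not the specification used in the actual NAPLAN analysis.

```r
library(lme4)  # provides lmer(), fixef() and ranef()

set.seed(42)
n_schools <- 20
students_per_school <- 30
n_students <- n_schools * students_per_school

# Hypothetical student table: each student belongs to exactly one school
students <- data.frame(
  student = factor(seq_len(n_students)),
  school  = factor(rep(seq_len(n_schools), each = students_per_school))
)

# Simulated random effects (assumed magnitudes, for illustration only)
school_effect  <- rnorm(n_schools, sd = 20)
student_effect <- rnorm(n_students, sd = 30)

# Cross join: one row per student per test year (3, 5, 7 and 9)
scores <- merge(students, data.frame(year = c(3, 5, 7, 9)))
scores$score <- 300 + 40 * scores$year +
  school_effect[as.integer(scores$school)] +
  student_effect[as.integer(scores$student)] +
  rnorm(nrow(scores), sd = 25)

# Fixed effect for year, random intercepts for school and student
fit <- lmer(score ~ year + (1 | school) + (1 | student), data = scores)

fixef(fit)               # estimated fixed effects
head(ranef(fit)$school)  # estimated school-level deviations
```

The school-level random intercepts returned by ranef are the sort of quantity one might inspect when comparing schools after accounting for the other terms in the model.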
The model was deployed into a cloud solution exposing this customized R model. Through the interface the education administrators can now easily gain insight into student performance. For example, trends can be detected for student scores over multiple years, key factors affecting academic achievement become apparent, comparative quality of education across schools can be explored, and talent can be identified and shared.
Many governments focus on reducing the number of school dropouts, thereby increasing the overall skill levels of their citizens and, in turn, human capital. This is certainly the case in Andhra Pradesh and other states in India.
To achieve this objective, complex data covering student performance, socio-economic situation, school infrastructure, and teacher skills is combined with external sources from NGOs and government agencies working in education.
Microsoft's solution has involved building and deploying machine learning models for binary classification that can predict the likelihood of a student dropping out, in addition to other educational outcomes at the school, district and state-levels.
R-based models using the latest advances in boosted decision trees were implemented within Azure Machine Learning Studio to achieve model performances with accuracy of 89%, precision of 94%, recall of 62%, F1 score of 75% and an AUC of 89%. With such high accuracy for predicting student drop-out governments are taking proactive measures to generate effective and targeted strategies for reducing student attrition. Non-academic characteristics often external to the school are also identified as playing significant roles in drop-out rates and support a focus for strategies which ensure a solid educational base for the future prosperity of the community.
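As a quick consistency check on the reported figures, the F1 score is the harmonic mean of precision and recall. The short sketch below uses only the precision and recall quoted above; the variable names are ours.

```r
# Precision and recall as reported for the drop-out model
precision <- 0.94
recall    <- 0.62

# F1 is the harmonic mean of precision and recall
f1 <- 2 * precision * recall / (precision + recall)
round(f1, 2)  # 0.75, matching the reported F1 score of 75%
```

Note that the harmonic mean sits closer to the lower of the two values, which is why an F1 of 75% is compatible with a precision as high as 94%.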
The student predictive analytics solution runs in the Azure cloud and leverages Cortana Intelligence Suite components. Before Azure is used for insights into student performance or drop-out, the education data is acquired by data integration components in a prescribed format, then transformed, merged and cleansed. Azure SQL Database stores both current academic year data and historic data. We then leverage an Azure Machine Learning model (with R customization) trained on historic data to predict a student’s NAPLAN test score or whether a student in the current academic year is likely to drop out. An Azure Data Factory pipeline is developed and deployed to automatically drive the gathering of data, the transformation of data into a format suitable for the Azure ML model, and the loading of the processed data (with prediction results) back to the target Azure SQL Database for reporting in a Power BI dashboard.
The student predictive analytics solution we’ve shown here demonstrates how to extend the capabilities of R with the Cortana Intelligence Suite by integrating custom R code with Azure ML studio, to solve problems in the education world. For a quick guide on how to use R in Azure ML studio, see these instructions and this online tutorial. For additional examples of Cortana Intelligence based solutions, see the Cortana Intelligence Gallery.
Based on our experience with these use cases, and with an eye to future cases we learn of or are involved in, we have developed and maintain the Data Science Design Pattern for Education Analytics. This includes implementations of both Student Score Modeling and Student Drop-Out Prediction. This pattern provides a starting point for a data scientist exploring a new dataset in the education world using R. The GitHub repository includes a sample dataset and R scripts to build the models described above. Data Scientists working in the Education domain can replicate this modelling approach using their own internal datasets.
By no means is this the endpoint of the data science journey. The pattern is under regular revision and improvement and is provided as-is. To try out this pattern please download the provided R Markdown and Jupyter Notebook files. We welcome feedback on this pattern, and you are welcome to comment and contribute at the GitHub repository linked below.
Fang Zhou (GitHub): Data Science Design Pattern for Educational Analytics
by Joseph Rickert
Over the last 6 years, thousands of students and faculty have downloaded Revolution R Enterprise (RRE) from Revolution Analytics for free, making it possible for them to do statistical modeling on large data sets with the same R language used by savvy statisticians and data scientists in business and industry. In addition to this individual scholar program (ISP), Revolution Analytics launched two initiatives in 2014 to provide academic institutions and non-profit public service companies with site licenses for the nominal annual licensing fee of $999. Both the Academic Institution Program (AIP) and Public Service program (PSP) enabled qualifying institutions to install RRE on servers and Hadoop clusters without restrictions. Now, seven months after Microsoft’s acquisition of Revolution Analytics, all three of these programs are being folded into Microsoft programs that will make it even easier for individual students and institutions to get started with the newest release of RRE, now known as Microsoft R Server.
On December 31, 2015 all three programs — ISP, AIP and PSP — came to an end. ISP participants may continue to use the software they have under the terms of the original license. Institutions currently participating in Revolution Analytics’ AIP and PSP programs will be contacted by Microsoft representatives to transition them to Microsoft programs.
Microsoft R Server is available for academic use under Microsoft’s DreamSpark programs. Students can download Microsoft R Server 2016 for free via DreamSpark for Students. Universities and other qualifying academic institutions will be able to obtain licenses for Microsoft R Server 2016 as part of Microsoft’s DreamSpark for academic institutions program. Academic institutions will have two choices for participating in the DreamSpark program. DreamSpark Standard is campus-wide and includes a subset of tools including Visual Studio Professional, Windows Server, and SQL Server and is available for an annual licensing fee of $99 (or $199 for 3 years). DreamSpark Premium is only for a single STEM-related department or school and contains premium titles including Windows 10 client, Visual Studio Enterprise, Visio, and Project. The annual licensing fee is $499.
Microsoft R Server 2016 runs on Windows and Linux operating systems, in Teradata databases and on a number of Hadoop platforms. The product names to look for on the DreamSpark web pages are:
Providing even more students with access to Microsoft R Server is a pretty big deal. Microsoft R Server extends the reach of R into big data, distributed processing environments by providing a framework for manipulating large data sets one chunk at a time so that all of the data being analyzed does not have to simultaneously fit into memory. Moreover, the RevoScaleR package, which ships only with Microsoft R Server, provides a number of inherently parallel, distributed algorithms for statistical analysis and machine learning. These include high-performance implementations of Generalized Linear Models, K-means clustering, the Naïve Bayes classifier, decision trees, random forests and much more.
These algorithms automatically distribute computations across all of the available resources. Users need only specify a compute context that points to that data. When SQL Server 2016 becomes available midyear, students will be able to fit predictive models directly in a SQL database.
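The chunk-at-a-time idea can be illustrated in plain base R without RevoScaleR. The sketch below is only an illustration of the principle, not how Microsoft R Server is implemented: it writes some simulated numbers to a temporary file, then computes their mean by reading a fixed number of lines at a time and keeping only a running sum and count in memory. The file, the chunk size and the variable names are all assumptions.

```r
# Simulated data written to a temporary file, standing in for a data set
# that is too large to load in one piece
set.seed(1)
x <- rnorm(25000, mean = 10)
path <- tempfile(fileext = ".txt")
writeLines(sprintf("%.10f", x), path)

chunk_size <- 10000  # lines read per pass (arbitrary choice)
total <- 0           # running sum across chunks
n <- 0               # running count across chunks

con <- file(path, open = "r")
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break  # end of file reached
  values <- as.numeric(lines)
  total <- total + sum(values)
  n <- n + length(values)
}
close(con)
unlink(path)

chunked_mean <- total / n
all.equal(chunked_mean, mean(x))  # agrees with the in-memory mean
```

The same pattern works for any statistic that can be assembled from per-chunk summaries, such as sums, counts and cross-products.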
Microsoft DreamSpark: Download Microsoft R Server
by Joseph Rickert
I had barely begun reading Statistics Done Wrong: The Woefully Complete Guide by Alex Reinhart (No Starch Press, 2015) when I started to wonder about the origin of the aphorism "Don't shoot the messenger." It occurred to me that this might be a reference to a primitive emotion that wells up unbidden when you hear bad news in such a way that you know things are not going to get better any time soon.
It was on page 4 that I read: "Even properly done statistics can't be trusted." Ouch! Now, to be fair, the point the author is trying to make here is that it is often not possible, based solely on the evidence contained in a scientific paper, to determine if an author sifted through his data until he turned up something interesting. But, coming as it does after mentioning J.P.A. Ioannidis' conclusion that most published research findings are probably false, that the average scores of medical school faculty on tests of basic statistical knowledge don’t get much better than 75%, and that both pharmaceutical companies and the scientific journals themselves bias research by failing to publish studies with negative results, Reinhart’s sentence really stings. Moreover, Reinhart is so zealous in his efforts to expose the numerous ways a practicing scientist can go wrong in attempting to "employ statistics" that it is reasonable (despite the optimism he expresses in the final chapters) for a reader in the book’s target demographic of practicing scientists with little formal training in statistics to conclude that the subject is just insanely difficult.
Is the practice of statistics just too difficult? Before permitting myself a brief comment on this I’ll start with an easier and more immediate question: Is this book worth reading? To this question, the answer is an unqualified yes.
Anyone starting out on a journey would like to know ahead of time where the road is dangerous, where the hard climbs are, and most of all: where be the dragons? Statistics Done Wrong is as good a map to the traps lurking in statistical analysis adventures as you are ever likely to find. In less than 150 pages it covers the pitfalls of p-values, the perils of being underpowered, the disappointments of false discoveries, the follies of mistaking correlation for causation, the evils of torturing data and the need for exploratory analysis to avoid Simpson’s paradox.
About three quarters of the way into the book (Chapter 8), Reinhart moves beyond basic hypothesis testing to consider some of the problems associated with fitting linear models. There follows a succinct but lucid presentation of some essential topics including overfitting, unnecessary dichotomization, variable selection via stepwise regression, the subtle ways in which one can be led into mistaking correlation for causation, the need for clarity in dealing with missing data and the difficulties of recognizing and accounting for bias.
That is a lot of ground to cover, but Reinhart manages it with some style and with an eye for relevant contemporary issues. For example, in his discussion on statistical significance Reinhart says:
And because any medication or intervention usually has some real effect, you can always get a statistically significant result by collecting so much data that you detect extremely tiny but relatively unimportant differences (p9).
And then, he follows up with a very amusing quote from Bruce Thompson's 1992 paper that wryly explains that significance tests on large data sets are often little more than confirmations of the fact that a lot of data was collected. Here we have a “Big Data” problem, deftly dealt with in 1992, but in a journal that no data scientist is ever likely to have read.
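The point is easy to see in a small simulation. The following sketch is my own illustration, not an example from the book: it assumes a tiny true difference of 0.02 standard deviations between two groups and runs a two-sample t-test at three arbitrary sample sizes.

```r
set.seed(123)
tiny_effect <- 0.02  # assumed true difference in means, in SD units

# Simulate two groups of size n and return the t-test p-value
p_for_n <- function(n) {
  control   <- rnorm(n, mean = 0)
  treatment <- rnorm(n, mean = tiny_effect)
  t.test(treatment, control)$p.value
}

# Same tiny effect, increasingly large samples per group
p_values <- sapply(c(100, 10000, 1000000), p_for_n)
p_values
```

With a million observations per group the standard error of the difference is roughly 0.0014, so even this practically negligible effect lies many standard errors from zero and the test reports overwhelming significance.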
The bibliography contained in the notes to each chapter of Statistics Done Wrong is a major strength of the book. Nearly every transgression recorded and every lamentable tale of the sorry state of statistical practice is backed up with a reference to the literature. This impressive exercise in scholarly research adds some weight and depth to the book’s contents and increases its usefulness as a guide.
Also, to my surprise and great delight, Reinhart manages a short discussion that elucidates the differences between R. A. Fisher’s conception of p-values and the treatment given by Neyman and Pearson in their formal theory of hypothesis testing. The confounding of these two very different approaches in what Gigerenzer et al. call the “Null Ritual” is perhaps the root cause of most of the misuse and abuse of significance testing in the scientific literature. However, you can examine dozens of the most popular textbooks on elementary statistics and find no mention of it.
In the closing chapters of Statistics Done Wrong, Reinhart effects a change of tone and discusses some of the structural difficulties with the practice of statistics in the medical and health sciences that have contributed to the present pandemic of the publication of false, misleading or just plain useless results. Topics include the lack of incentives for researchers to publish inconclusive and negative results, the reluctance of many researchers to share data and the willingness of some to attempt to game the system by deliberately publishing “doctored” results. Reinhart handles these topics nicely and uses them to motivate contemporary work on reproducible research and the need to cultivate a culture of reproducible and open research. Reinhart ends the book with recommendations for the new researcher that allow him to finish the book on a surprisingly upbeat note. The bearer of bad news concludes by offering hope.
I highly recommend Statistics Done Wrong to be read as the author intended: as supplementary material. In the preface, Reinhart writes:
But this is not a textbook, so I will not teach you how to use these techniques in any technical detail. I only hope to make you aware of the most common problems so you are able to pick the statistical technique best suited to your question.
Statistics Done Wrong is the kind of study guide that I think could benefit almost anyone slogging through a statistical analysis for the first time. It seems to me that the author achieves his stated goal with admirable economy and just a few shortcomings. The book, which entirely avoids the use of mathematical symbolism, would have benefited from precise definitions of the key concepts presented (p-values, confidence intervals etc.) and from a little R code to back these definitions. These are, however, relatively minor failings.
Now, back to the big question: is the practice of statistics just too difficult? Yes, I think that the catalogue of errors and numerous opportunities for going wrong documented by Reinhart indicates that the practice of statistics is more difficult than it needs to be. My take on why this is so is expressed (perhaps inadvertently) by Reinhart in the statement of his goal for the book quoted above. As long as statistics is conceived and taught as the process of selecting the right technique to answer isolated questions, rather than as an integrated system for thinking with data, we are all going to have a difficult time of it.
by Joseph Rickert
Last week, I was fortunate enough to attend the R Summit & Workshop, an invitation-only event held at the Copenhagen Business School. The abstracts for the public talks presented are online and well worth a look. Collectively they provide a snapshot of the state of development of R and the R Community as well as some insight into the directions in which researchers are moving to expand the boundaries of R.
Real highlights of the event were talks by Jennifer Bryan and Mine Çetinkaya-Rundel, two educators who are channeling enormous amounts of energy into teaching statistics and statistical programming, and into developing new pedagogical methods to improve the learning experience for both students and teachers alike. Both Mine and Jennifer are committed to R, as well as to using state of the art developer tools such as RStudio, R Markdown, Git and GitHub.
If you clicked on the link to Jennifer’s university home page above and expected to see more content there you are probably not running with the in crowd. Anybody who aspires to bask in the faintest glow of tech cool is hanging out on GitHub. So go to github.com/jennybc to find a place where social media, software development and best practices collide to generate a state-of-the-art learning platform.
What Jennifer has created there is the R version of a full immersion experience for learning a new language. Just like you can’t separate the highs and awkward lows of human interactions from tripping over the grammar while learning Japanese on the streets of Tokyo, at jennybc you have to cope with GitHub, R Markdown and complying with best practices while doing your homework, seeking help, and learning from your peers.
As Jennifer writes in her synopsis of her R Summit & Workshop talk:
I've formed strong opinions about workflows for R Markdown + GitHub and what the big wins are.
Mine, who focuses on undergraduate education, points out that: “R is attractive because, unlike software designed specifically for courses at this level, it is relevant beyond the introductory statistics classroom, and is more powerful and flexible.” Poke around Mine’s GitHub page and you will see that she is all about R, open source, reproducibility and teaching good habits and right values. In addition to her work in the classroom, Mine has developed the Coursera course Data Analysis and Statistical Inference, is a coauthor of three free, R-based textbooks and is a driving force behind the ASA Datafest competitions.
Students are often ambivalent as to whether they are looking for education or training. But if you are a student of either Mine or Jennifer you are going to get some of both, and have a real shot at launching a productive R-fueled career.
by Bob Horton, Data Scientist, Revolution Analytics
From electronic medical records to genomic sequences, the data deluge is affecting all aspects of health care. The Masters of Science in Health Informatics (MSHI) program at the University of San Francisco, now in its second year, is designed to help students develop the practical computing skills and quantitative perspicacity they need to manage and exploit this wealth of data in health care applications.
This spring, I am privileged to participate in this effort by developing and teaching a new course, “Statistical Computing for Biomedical Data Analytics”, intended to motivate and prepare students for further studies in data science, such as the intensive summer courses of the MSAN bootcamp. The syllabus is on github.
As you’ve probably guessed, we will be using R. Other courses in the curriculum use Python, which seems to be favored by engineers; in contrast, R was developed by and for statisticians. We want the students to be exposed to both perspectives, and to have the technical background needed to make use of the extensive repositories of code available from CRAN and Bioconductor.
Data science is an interdisciplinary endeavor born of the synergy between computing, statistics, data management, and visualization. This can make it challenging to get started, because you have to know so many things before you get to the good stuff. We’re going to try to ease into it by starting with computational explorations of mathematical and statistical concepts. R is a fantastic environment for this; you can see a bell-shaped curve emerge from an example as simple as
plot(0:20, choose(n=20, k=0:20))
Note the expressive power of the vector of k values, and the easy convenience of having a world of statistical functions at your fingertips. Imagine how this little plot would have delighted Sir Francis Galton.
Data science is a journey. The enormous breadth of material and the rapid pace of development mean that the most important thing to learn is how to learn more. We’ll explore many fantastic resources for learning data science and R. For example, Coursera has excellent offerings, exemplified by the series of mini-courses from Johns Hopkins; our students will take at least one of these as a course project.
Of course, the R community itself is the biggest and most important resource. One class will be a field trip to a Bay Area useR Group (BARUG) meeting, and the comments in response to this post will be required reading. Ideas or suggestions regarding the syllabus or course materials from the github repository are welcome, as are observations or ruminations on the process of learning data science and R.
Finally, we are very interested in helping our students find outstanding internship opportunities in health-related organizations. Please don’t hesitate to contact me through community@revolutionanalytics.com if you are interested in working with us. Stay tuned for progress reports.
Johns Hopkins Biostatistics Professor (and presenter of Data Analysis at Coursera) Jeff Leek has published his list of awesome things other people did in 2014. It's well worth following the links in his 38 entries, where you'll find a wealth of useful resources in teaching, statistics, data science, and data visualization.
Many of the entries are related to R, including shout-outs to: the data wrangling, exploration, and analysis with R class at UBC; this paper on R Markdown and reproducible analysis; Hadley Wickham's R Packages; Hilary Parker's guide to writing R packages from scratch; the broom package (for tidying up statistical output in R); Karl Broman's hipsteR tutorial; Rocker (Docker containers for R); and Packrat and R markdown v2 from RStudio. I was also chuffed to see that this blog got a mention, too:
Another huge reason for the movement with R has been the outreach and development efforts of the Revolution Analytics folks. The Revolutions blog has been a must read this year.
Thanks, Jeff! Check out Jeff's complete list of awesome things at SimplyStatistics by following the link below.
SimplyStatistics: A non-comprehensive list of awesome things other people did in 2014
by Joseph Rickert
The American Statistical Association (ASA) Undergraduate Guidelines Workgroup recently published the report Curriculum Guidelines for Undergraduate Programs in Statistical Science. Although intended for educators setting up or revamping Stats programs at colleges and universities, this concise, 17 page document should be good reading for anyone who wants to take charge of their own education in learning to "think with data". Whether you are just getting started with your education or you are a working professional contemplating what to learn next to expand your knowledge and update your skills you should find the ASA report helpful.
The report places good statistical practice firmly on the foundation of the scientific method and locates statistical knowledge and skills squarely in the center of modern data analysis.
However, it is far from being a complacent panegyric to statistics. The ASA report challenges educators to help students see that the "discipline of statistics is more than a collection of unrelated tools" and explicitly calls for an increased emphasis on data science and big league computational skills. Graduates of statistical programs:
should be facile with professional statistical software and other appropriate tools for data exploration, cleaning, validation, analysis, and communication. They should be able to program in a higher-level language, to think algorithmically, to use simulation-based statistical techniques . . . Graduates should be able to manage and manipulate data, including joining data from different sources and formats and restructuring data into a form suitable for analysis.
The expectations for communication skills are particularly noteworthy. The report says:
Graduates should be expected to write clearly, speak fluently, and construct effective visual displays and compelling written summaries. They should demonstrate ability to collaborate in teams and to organize and manage projects.
One could argue about the details of the topics that should be included in an undergraduate program. But clearly the committee is aiming for far more than producing minimally competent, employable graduates. They are outlining a way of life, a competent way of being in a data-driven world.
Hidden among the white papers listed on the ASA curriculum guidelines page is a treasure: Tim Hesterberg's paper on What Teachers Should Know about the Bootstrap: Resampling in the Undergraduate Curriculum. This is a lucid and fairly deep explication of bootstrapping and resampling techniques that deserves wide circulation. Tim writes that he had three goals in producing the paper: "(1) To show the enormous potential of bootstrapping and permutation tests to help students understand statistical concepts . . . (2) To dig deeper . . . (3) To change statistical practice . . ."
Point (3) may sound astoundingly ambitious. However, it is grounded in a revolution that has been quietly gaining strength and whose time has come. Textbooks that rely on R-based simulations to teach probability (e.g. Baclawski) and statistics (e.g. Matloff) have been available for some time, and Tim points out that undergraduate textbooks such as Chihara and Hesterberg, which use resampling as the fundamental unifying idea, are beginning to appear. Moreover, data scientists outside of the community of academic statisticians are well aware that programming skills more than compensate for a traditional statistics education that presents the subject as a collection of unrelated tests and techniques, as this Strata Hadoop World presentation from John Rauser makes clear.
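For readers who have not seen resampling in action, here is a minimal bootstrap sketch in base R. It is my own illustration rather than an example taken from Hesterberg's paper: the simulated sample, the choice of the mean as the statistic and the 2,000 resamples are all assumptions.

```r
set.seed(2014)

# A small skewed sample standing in for real data (simulated)
sample_data <- rexp(50, rate = 1 / 10)

# Resample with replacement and recompute the mean each time
boot_means <- replicate(2000, mean(sample(sample_data, replace = TRUE)))

# Bootstrap standard error and a simple percentile interval for the mean
boot_se <- sd(boot_means)
boot_ci <- quantile(boot_means, c(0.025, 0.975))

# The textbook formula for the standard error, for comparison
formula_se <- sd(sample_data) / sqrt(length(sample_data))

boot_se
formula_se
boot_ci
```

The two standard errors should come out close to each other, which is the kind of hands-on check that lets students see what a sampling distribution is instead of taking a formula on faith.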
It is very good, indeed, to see the ASA leading the charge for change.
If you're a current graduate or undergraduate student and have a knack for data visualization, why not submit a paper to the 2014 ASA Statistical Graphics Student Paper Competition? Many of the past winners used R to create interesting displays of data, or created a new package for R (general statistical computing applications are also eligible). Winners will get free registration and a travel allowance to attend the Joint Statistical Meetings, which in 2015 will be held in Seattle. (You'll also get to attend the Statistical Computing Mixer at JSM and participate in the famous door prize giveaway!)
To be considered, you'll need to submit a six-page manuscript by December 14, 2014. I've copied the complete details of the announcement below. Good luck!
The Statistical Computing and Statistical Graphics Sections of the ASA are co-sponsoring a student paper competition on the topics of Statistical Computing and Statistical Graphics. Students are encouraged to submit a paper in one of these areas, which might be original methodological research, some novel computing or graphical application in statistics, or any other suitable contribution (for example, a software-related project). The selected winners will present their papers in a topic-contributed session at the 2015 Joint Statistical Meetings. The Sections will pay registration fees for the winners as well as a substantial allowance for transportation to the meetings and lodging.
Anyone who is a student (graduate or undergraduate) on or after September 1, 2014 is eligible to participate. An entry must include an abstract, a six page manuscript (including figures, tables and references), blinded versions of the abstract and manuscript (with no authors and no references that easily lead to identifying the authors), a C.V., and a letter from a faculty member familiar with the student's work. The applicant must be the first author of the paper. The faculty letter must include a verification of the applicant's student status and, in the case of joint authorship, should indicate what fraction of the contribution is attributable to the applicant. We prefer that electronic submissions of papers be in Postscript or PDF. All materials must be in English.
Students may submit papers to no more than two sections and may accept only one section's award. Students must inform both sections applied to when he or she wins and accepts an award, thereby removing the student from the award competition for the second section.
All application materials MUST BE RECEIVED by 5:00 PM EST, Sunday, December 14, 2014 at the address below. They will be reviewed by the Student Paper Competition Award committee of the Statistical Computing and Graphics Sections. The selection criteria used by the committee will include innovation and significance of the contribution as well as the professional quality of the manuscript. Award announcements will be made by January 15th, 2015.
Additional important information on the competition can be accessed on ASA's "Student Paper Competition/Travel Award to Attend the Joint Statistical Meetings".
Inquiries and application materials should be emailed or mailed to:
Student Paper Competition
c/o Aarti Munjal
Colorado School of Public Health
University of Colorado Denver
aarti.munjal@ucdenver.edu
ASA Statistical Computing and Statistical Graphics Section: Student Paper Competition 2014
by Norman Matloff
The American Statistical Association (ASA) leadership, and many in Statistics academia, have been undergoing a period of angst over the last few years. They worry that the field of Statistics is headed for a future of reduced national influence and importance, with the feeling that:
I had been aware of these issues for quite a while, and thus was pleasantly surprised last year to see then-ASA president Marie Davidian write a plaintive editorial titled, “Aren’t We Data Science?”
Good, the ASA is taking action, I thought. But even then I was startled to learn during JSM 2014 (a conference tellingly titled “Statistics: Global Impact, Past, Present and Future”) that the ASA leadership is so concerned about these problems that it has now retained a PR firm.
This is probably a wise move–most large institutions engage in extensive PR in one way or another–but it is a sad statement about how complacent the profession has become. Indeed, it can be argued that the action is long overdue; as a friend of mine put it, “They [the statistical profession] lost the PR war because they never fought it.”
In this post, I’ll tell you the rest of the story, as I see it, viewing events as a statistician, computer scientist and R activist.
CS vs. Statistics
Let’s consider the CS issue first. Recently a number of new terms have arisen, such as data science, Big Data, and analytics, and the popularity of the term machine learning has grown rapidly. To many of us, though, this is just “old wine in new bottles,” with the “wine” being Statistics. But the new “bottles” are disciplines outside of Statistics–especially CS.
I have a foot in both the Statistics and CS camps. I’ve spent most of my career in the Computer Science Department at the University of California, Davis, but I began my career in Statistics at that institution. My mathematics doctoral thesis at UCLA was in probability theory, and my first years on the faculty at Davis focused on statistical methodology. I was one of the seven charter members of the Department of Statistics. Though my departmental affiliation later changed to CS, I never left Statistics as a field, and most of my research in Computer Science has been statistical in nature. With such “dual loyalties,” I’ll refer to people in both professions via third-person pronouns, not first, and I will be critical of both groups. However, in keeping with the theme of the ASA’s recent actions, my essay will be Stat-centric: What is poor Statistics to do?
Well then, how did CS come to annex the Stat field? The primary cause, I believe, came from the CS subfield of Artificial Intelligence (AI). Though there always had been some probabilistic analysis in AI, in recent years the interest has been almost exclusively in predictive analysis–a core area of Statistics.
That switch in AI was due largely to the emergence of Big Data. No one really knows what the term means, but people “know it when they see it,” and they see it quite often these days. Typical data sets range from large to huge to astronomical (sometimes literally the latter, as cosmology is one of the application fields), necessitating that one pay close attention to the computational aspects. Hence the term data science, combining quantitative methods with speedy computation, and hence another reason for CS to become involved.
Involvement is one thing, but usurpation is another. Though not a deliberate action by any means, CS is eclipsing Stat in many of Stat’s central areas. This is dramatically demonstrated by statements like, “With machine learning methods, you don’t need statistics”–a punch in the gut for statisticians who realize that machine learning really IS statistics. ML goes into great detail in certain aspects, e.g. text mining, but in essence it consists of parametric and nonparametric curve estimation methods from Statistics, such as logistic regression, LASSO, nearest-neighbor classification, random forests, the EM algorithm and so on.
Though the Stat leaders seem to regard all this as something of an existential threat to the well-being of their profession, I view it as much worse than that. The problem is not that CS people are doing Statistics, but rather that they are doing it poorly: Generally the quality of CS work in Stat is weak. It is not a problem of quality of the researchers themselves; indeed, many of them are very highly talented. Instead, there are a number of systemic reasons for this, structural problems with the CS research “business model”:
All this matters – a LOT. In my opinion, the above factors result in highly lamentable opportunity costs. Clearly, I’m not saying that people in CS should stay out of Stat research. But the sad truth is that the usurpation process is causing precious resources–research funding, faculty slots, the best potential grad students, attention from government policymakers, even attention from the press–to go quite disproportionately to CS, even though Statistics is arguably better equipped to make use of them. This is not a CS vs. Stat issue; Statistics is important to the nation and to the world, and if scarce resources aren’t being used well, it’s everyone’s loss.
Making Statistics Attractive to Students
This of course is an age-old problem in Stat. Let’s face it–the very word statistics sounds hopelessly dull. But I would argue that a more modern development is making the problem a lot worse – the Advanced Placement (AP) Statistics courses in high schools.
Professor Xiao-Li Meng has written extensively about the destructive nature of AP Stat. He observed, “Among Harvard undergraduates I asked, the most frequent reason for not considering a statistical major was a ‘turn-off’ experience in an AP statistics course.” That says it all, doesn’t it? And though Meng’s views predictably sparked defensive replies in some quarters, I’ve had exactly the same experiences as Meng in my own interactions with students. No wonder students would rather major in a field like CS and study machine learning–without realizing it is Statistics. It is especially troubling that Statistics may be losing the “best and brightest” students.
One of the major problems is that AP Stat is usually taught by people who lack depth in the subject matter. A typical example: a student complained to me that his AP Stat teacher could not answer his question as to why it is customary to use n-1 rather than n in the denominator of s^2, even though he had attended a top-quality high school in the heart of Silicon Valley. But even that lapse is really minor compared to the lack among the AP teachers of the broad overview typically possessed by Stat professors teaching university courses, in terms of what can be done with Stat, what the philosophy is, what the concepts really mean and so on. AP courses are ostensibly college level, but the students are not getting college-level instruction. The “teach to the test” syndrome that pervades AP courses in general exacerbates this problem.
The most exasperating part of all this is that AP Stat officially relies on TI-83 pocket calculators as its computational vehicle. The machines are expensive, and after all we are living in an age in which R is free! Moreover, the calculators lack the dazzling graphics and the capacity to analyze nontrivial data sets that R provides – exactly the kinds of things that motivate young people.
So, unlike the “CS usurpation problem,” whose solution is unclear, here is something that actually can be fixed reasonably simply. If I had my druthers, I would simply ban AP Stat, and actually, I am one of those people who would do away with the entire AP program. Obviously, there are too many deeply entrenched interests for this to happen, but one thing that can be done for AP Stat is to switch its computational vehicle to R.
As noted, R is free and multiplatform, with outstanding graphical capabilities. There is no end to the number of data sets teenagers would find attractive for R use, say the Million Song Dataset.
As to a textbook, there are many introductions to Statistics that use R, such as Michael Crawley’s Statistics: An Introduction Using R, and Peter Dalgaard’s Introductory Statistics with R. But to really do it right, I would suggest that a group of Stat professors collaboratively write an open-source text, as has been done for instance for Chemistry. Examples of interest to high schoolers should be used, say this engaging analysis on OK Cupid.
This is not a complete solution by any means. There still is the issue of AP Stat being taught by people who lack depth in the field, and so on. And even switching to R would meet with resistance from various interests, such as the College Board and especially the AP Stat teachers themselves.
But given all these weighty problems, it certainly would be nice to do something, right? Switching to R would be doable–and should be done.
[Crossposted with permission from the Mad (Data) Scientist blog.]
Check out this tweet:
Just found out that @rdpeng @jtleek @bcaffo have enrolled 1,000,000 students in statistics classes on @coursera in last 2 years.
— Simply Statistics (@simplystats) May 30, 2014
Those courses include Roger Peng's "Computing for Data Analysis" and Jeff Leek's "Data Analysis", and they all use the R language. That means more than 1,000,000 new students have been exposed to R in the last two years! On this basis alone it seems clear that the estimate of two million R users worldwide is an underestimate.