by Norman Matloff
The American Statistical Association (ASA) leadership, and many in Statistics academia, have been undergoing a period of angst the last few years. They worry that the field of Statistics is headed for a future of reduced national influence and importance, with the feeling that:
- The field is to a large extent being usurped by other disciplines, notably Computer Science (CS).
- Efforts to make the field attractive to students have largely been unsuccessful.
I had been aware of these issues for quite a while, and thus was pleasantly surprised last year to see then-ASA president Marie Davidian write a plaintive editorial titled, “Aren’t We Data Science?”
Good, the ASA is taking action, I thought. But even then I was startled to learn during JSM 2014 (a conference tellingly titled “Statistics: Global Impact, Past, Present and Future”) that the ASA leadership is so concerned about these problems that it has now retained a PR firm.
This is probably a wise move–most large institutions engage in extensive PR in one way or another–but it is a sad statement about how complacent the profession has become. Indeed, it can be argued that the action is long overdue; as a friend of mine put it, “They [the statistical profession] lost the PR war because they never fought it.”
In this post, I’ll tell you the rest of the story, as I see it, viewing events as a statistician, computer scientist and R activist.
CS vs. Statistics
Let’s consider the CS issue first. Recently a number of new terms have arisen, such as data science, Big Data, and analytics, and the popularity of the term machine learning has grown rapidly. To many of us, though, this is just “old wine in new bottles,” with the “wine” being Statistics. But the new “bottles” are disciplines outside of Statistics–especially CS.
I have a foot in both the Statistics and CS camps. I’ve spent most of my career in the Computer Science Department at the University of California, Davis, but I began my career in Statistics at that institution. My mathematics doctoral thesis at UCLA was in probability theory, and my first years on the faculty at Davis focused on statistical methodology. I was one of the seven charter members of the Department of Statistics. Though my departmental affiliation later changed to CS, I never left Statistics as a field, and most of my research in Computer Science has been statistical in nature. With such “dual loyalties,” I’ll refer to people in both professions via third-person pronouns, not first, and I will be critical of both groups. However, in keeping with the theme of the ASA’s recent actions, my essay will be Stat-centric: What is poor Statistics to do?
Well then, how did CS come to annex the Stat field? The primary cause, I believe, came from the CS subfield of Artificial Intelligence (AI). Though there always had been some probabilistic analysis in AI, in recent years the interest has been almost exclusively in predictive analysis–a core area of Statistics.
That switch in AI was due largely to the emergence of Big Data. No one really knows what the term means, but people “know it when they see it,” and they see it quite often these days. Typical data sets range from large to huge to astronomical (sometimes literally the latter, as cosmology is one of the application fields), necessitating that one pay key attention to the computational aspects. Hence the term data science, combining quantitative methods with speedy computation, and hence another reason for CS to become involved.
Involvement is one thing, but usurpation is another. Though not a deliberate action by any means, CS is eclipsing Stat in many of Stat’s central areas. This is dramatically demonstrated by statements like, “With machine learning methods, you don’t need statistics”–a punch in the gut for statisticians who realize that machine learning really IS statistics. ML goes into great detail in certain aspects, e.g. text mining, but in essence it consists of parametric and nonparametric curve estimation methods from Statistics, such as logistic regression, LASSO, nearest-neighbor classification, random forests, the EM algorithm and so on.
Though the Stat leaders seem to regard all this as something of an existential threat to the well-being of their profession, I view it as much worse than that. The problem is not that CS people are doing Statistics, but rather that they are doing it poorly: Generally the quality of CS work in Stat is weak. It is not a problem of quality of the researchers themselves; indeed, many of them are very highly talented. Instead, there are a number of systemic reasons for this, structural problems with the CS research “business model”:
- CS, having grown out of research on fast-changing software and hardware systems, became accustomed to the “24-hour news cycle”–very rapid publication rates, with the venue of choice being (refereed) frequent conferences rather than slow journals. This leads to research work being less thoroughly conducted, and less thoroughly reviewed, resulting in poorer quality work. The fact that some prestigious conferences have acceptance rates in the teens or even lower doesn’t negate these realities.
- Because CS Depts. at research universities tend to be housed in Colleges of Engineering, there is heavy pressure to bring in lots of research funding, and produce lots of PhD students. Large amounts of time are spent on trips to schmooze funding agencies and industrial sponsors, writing grants, meeting conference deadlines and managing a small army of doctoral students–instead of time spent in careful, deep, long-term contemplation about the problems at hand. This is made even worse by the rapid change in the fashionable research topic du jour, making it difficult to go into a topic in any real depth. Offloading the actual research onto a large team of grad students can result in faculty not fully applying the talents they were hired for; I’ve seen too many cases in which the thesis adviser is not sufficiently aware of what his/her students are doing.
- There is rampant “reinventing the wheel.” The above-mentioned lack of “adult supervision” and lack of long-term commitment to research topics results in weak knowledge of the literature. This is especially true for knowledge of the Stat literature, which even the “adults” tend to have very little awareness of. For instance, consider a paper on the use of unlabeled training data in classification. (I’ll omit names.) One of the two authors is one of the most prominent names in the machine learning field, and the paper has been cited over 3,000 times, yet the paper cites nothing in the extensive Stat literature on this topic, consisting of a long stream of papers from 1981 to the present.
- Again for historical reasons, CS research is largely empirical/experimental in nature. This causes what in my view is one of the most serious problems plaguing CS research in Stat – lack of rigor. Mind you, I am not saying that every paper should consist of theorems and proofs or be overly abstract; data- and/or simulation-based studies are fine. But there is no substitute for precise thinking, and in my experience, many (nominally) successful CS researchers in Stat do not have a solid understanding of the fundamentals underlying the problems they work on. For example, a recent paper in a top CS conference incorrectly stated that the logistic classification model cannot handle non-monotonic relations between the predictors and response variable; actually, one can add quadratic terms, and so on, to models like this (see the short sketch following this list).
- This “engineering-style” research model causes a cavalier attitude towards underlying models and assumptions. Most empirical work in CS doesn’t have any models to worry about. That’s entirely appropriate, but in my observation it creates a mentality that inappropriately carries over when CS researchers do Stat work. A few years ago, for instance, I attended a talk by a machine learning specialist who had just earned her PhD at one of the very top CS Departments in the world. She had taken a Bayesian approach to the problem she worked on, and I asked her why she had chosen that specific prior distribution. She couldn’t answer – she had just blindly used what her thesis adviser had given her–and moreover, she was baffled as to why anyone would want to know why that prior was chosen.
- Again due to the history of the field, CS people tend to have grand, starry-eyed ambitions–laudable, but a double-edged sword. On the one hand, this is a huge plus, leading to highly impressive feats such as recognizing faces in a crowd. But this mentality leads to an oversimplified view of things, with everything being viewed as a paradigm shift. Neural networks epitomize this problem. Enticing phrasing such as “Neural networks work like the human brain” blinds many researchers to the fact that neural nets are not fundamentally different from other parametric and nonparametric methods for regression and classification. (Recently I was pleased to discover–“learn,” if you must–that the famous book by Hastie, Tibshirani and Friedman complains about what they call “hype” over neural networks; sadly, theirs is a rare voice on this matter.) Among CS folks, there is a failure to understand that the celebrated accomplishments of “machine learning” have been mainly the result of applying a lot of money, a lot of people’s time, a lot of computational power and prodigious amounts of tweaking to the given problem – not because fundamentally new technology has been invented.
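As promised above, here is a minimal R sketch of the point about logistic models and non-monotonicity. It is not taken from the paper in question; the simulation settings are invented purely for illustration. The response probability is deliberately made to peak at x = 0 and fall off on both sides, and an ordinary logistic model with a quadratic term captures that shape.

```r
# Minimal illustration (invented data): logistic regression with a quadratic term
# handles a non-monotonic relation between the predictor and the response.
set.seed(1)
x <- runif(500, -3, 3)
p <- plogis(1 - x^2)          # probability peaks at x = 0 and falls off on both sides
y <- rbinom(500, 1, p)

fit_linear    <- glm(y ~ x,          family = binomial)  # forced to be monotonic in x
fit_quadratic <- glm(y ~ x + I(x^2), family = binomial)  # captures the non-monotonic shape

summary(fit_quadratic)        # the I(x^2) coefficient comes out strongly negative, as built in
```

The same device extends to interaction terms, splines and so on; nothing in the logistic model family imposes monotonicity.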
All this matters – a LOT. In my opinion, the above factors result in highly lamentable opportunity costs. Clearly, I’m not saying that people in CS should stay out of Stat research. But the sad truth is that the usurpation process is causing precious resources–research funding, faculty slots, the best potential grad students, attention from government policymakers, even attention from the press–to go quite disproportionately to CS, even though Statistics is arguably better equipped to make use of them. This is not a CS vs. Stat issue; Statistics is important to the nation and to the world, and if scarce resources aren’t being used well, it’s everyone’s loss.
Making Statistics Attractive to Students
This of course is an age-old problem in Stat. Let’s face it–the very word statistics sounds hopelessly dull. But I would argue that a more modern development is making the problem a lot worse – the Advanced Placement (AP) Statistics courses in high schools.
Professor Xiao-Li Meng has written extensively about the destructive nature of AP Stat. He observed, “Among Harvard undergraduates I asked, the most frequent reason for not considering a statistical major was a ‘turn-off’ experience in an AP statistics course.” That says it all, doesn’t it? And though Meng’s views predictably sparked defensive replies in some quarters, I’ve had exactly the same experiences as Meng in my own interactions with students. No wonder students would rather major in a field like CS and study machine learning–without realizing it is Statistics. It is especially troubling that Statistics may be losing the “best and brightest” students.
One of the major problems is that AP Stat is usually taught by people who lack depth in the subject matter. A typical example: a student who had attended a top-quality high school in the heart of Silicon Valley complained to me that his AP Stat teacher could not answer his question as to why it is customary to use n-1 rather than n in the denominator of s^2. But even that lapse is really minor, compared to the lack among the AP teachers of the broad overview typically possessed by Stat professors teaching university courses, in terms of what can be done with Stat, what the philosophy is, what the concepts really mean and so on. AP courses are ostensibly college level, but the students are not getting college-level instruction. The “teach to the test” syndrome that pervades AP courses in general exacerbates this problem.
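Incidentally, the student's question has an answer a teacher could demonstrate in a few lines of R. Here is a minimal sketch, using made-up simulation settings: dividing by n systematically underestimates the true variance, while dividing by n-1 removes the bias.

```r
# Illustrative simulation (invented settings): why divide by n-1 rather than n?
set.seed(1)
n <- 10
sims <- replicate(100000, {
  x <- rnorm(n, mean = 0, sd = 2)           # true variance is 4
  ss <- sum((x - mean(x))^2)
  c(divide_by_n = ss / n, divide_by_n_minus_1 = ss / (n - 1))
})
rowMeans(sims)  # the /n average is about (n-1)/n * 4 = 3.6; the /(n-1) average is about 4
```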
The most exasperating part of all this is that AP Stat officially relies on TI-83 pocket calculators as its computational vehicle. The machines are expensive, and after all we are living in an age in which R is free! Moreover, the calculators lack the dazzling graphics and the ability to analyze nontrivial data sets that R provides – exactly the kinds of things that motivate young people.
So, unlike the “CS usurpation problem,” whose solution is unclear, here is something that actually can be fixed reasonably simply. If I had my druthers, I would simply ban AP Stat, and actually, I am one of those people who would do away with the entire AP program. Obviously, there are too many deeply entrenched interests for this to happen, but one thing that can be done for AP Stat is to switch its computational vehicle to R.
As noted, R is free and multiplatform, with outstanding graphical capabilities. There is no end to the data sets teenagers would find attractive to explore with R, say the Million Song Data Set.
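As a hint of what a pocket calculator cannot do, here is a hedged sketch of the kind of thing a beginner can produce in a few lines of base R. It uses the built-in mtcars data set as a stand-in, since the Million Song Data Set requires a separate download.

```r
# A few lines of base R graphics, using the built-in mtcars data set as a stand-in.
data(mtcars)
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Fuel economy vs. weight",
     pch = 19, col = "steelblue")
abline(lm(mpg ~ wt, data = mtcars), col = "tomato", lwd = 2)  # overlay a fitted line
```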
As to a textbook, there are many introductions to Statistics that use R, such as Michael Crawley’s Statistics: An Introduction Using R, and Peter Dalgaard’s Introductory Statistics with R. But to really do it right, I would suggest that a group of Stat professors collaboratively write an open-source text, as has been done for instance for Chemistry. Examples of interest to high schoolers should be used, say this engaging analysis on OK Cupid.
This is not a complete solution by any means. There still is the issue of AP Stat being taught by people who lack depth in the field, and so on. And even switching to R would meet with resistance from various interests, such as the College Board and especially the AP Stat teachers themselves.
But given all these weighty problems, it certainly would be nice to do something, right? Switching to R would be doable–and should be done.
[Crossposted with permission from the Mad (Data) Scientist blog.]
> So, unlike the “CS usurpation problem,” whose solution is unclear,
If anybody thinks this "usurpation" is a "problem", then they've got the wrong attitude from the beginning. Computer programming is a very useful skill, and critical for anybody serious about a career in statistics. Just because those arrogant people in the CS department know how to code, it doesn't become "forbidden knowledge" that statisticians should avoid.
A broad understanding of stats includes a lot of things beyond mathematical formalism. For example, we often need to pick up domain knowledge relevant to the dataset we are working on - genetics for example. We should learn computer programming with the same enthusiasm.
CS, and in particular the skills of manipulating data structures and algorithms, should not be seen as the "enemy". Mathematically-minded people usually find it easy to learn how to code, *if* they are encouraged to do so. Stats departments should focus on improving their own skills, instead of jealously attacking the CS people.
Posted by: Aaron McDaid | August 26, 2014 at 14:46
@Aaron McDaid
Every statistician in academia already knows how to program. Nobody is saying that they shouldn't.
It's sort of mindblowing that that's how you interpreted the article.
Posted by: Steve Smith | August 26, 2014 at 16:13
This article sounds like a sore loser who just lost a grant and all his grad students
Posted by: Aman Kajaria | August 26, 2014 at 18:27
Link *destructive nature of AP Stat* doesn't work - it has an extra prefix.
Posted by: pdm55 | August 26, 2014 at 20:57
"machine learning really IS statistics"
I think that the main difference between the two fields is that machine learning is mostly done in areas where probabilistic modeling is merely a means to an end, and it is understood that one could have a model with almost no uncertainty given a sufficiently powerful model. This is largely the case in vision and speech recognition, where humans are able to make nearly perfect predictions and there is very little or no actual noise in the data.
There is still some value in estimating a probability distribution over the various classes, but it is just a tool for making more accurate predictions.
"neural nets are not fundamentally different from other parametric and nonparametric methods for regression and classification"
Yes they are. There is a world of difference between deep algorithms like neural networks and shallow algorithms like random forests and logistic regression both in theory and in practice. There are a number of problems where deep functions are more efficient and generalize better than shallow functions. N-bit parity is a common example.
In the ImageNet competitions, convolutional neural networks outperform other algorithms by large margins. This is not just a result of having faster computers or bigger data sets. You could run an SVM/logistic regression with hand-engineered or randomly generated features and far more computing power and get worse results.
"fundamentals underlying the problems they work on. For example, a recent paper in a top CS conference incorrectly stated that the logistic classification model cannot handle non-monotonic relations between the predictors and response variable; actually, one can add quadratic terms, and so on, to models like this."
Presumably he meant that it cannot learn it automatically.
Posted by: Alex | August 26, 2014 at 22:34
> Switching to R would be doable–and should be done.
An article that complains about "statistics losing image among students" and then goes on and recommends a solution that will hurt just as much in the long term? One of the reasons CS people at my institution are turned off by most statistics classes is R. Yes, it's still better and more engaging than say, STATA or SPSS. But as a programming language it is horrible and needs to die in a fire. This is a major turn-off point at least in my experience. If Statistics wants to stop losing ground on CS, I propose taking a page out of their book and switching to a nice, clean, all-purpose programming language. Python has grown to offer the same tools as R has (given pandas, statsmodels, matplotlib and sklearn), but does most of the work with much more class and is a sane programming language.
Posted by: Tom | August 26, 2014 at 23:35
I studied stats and economics in uni (and stats was one of my stronger subjects), later did CS and work as a programmer, and recently took a stats course as part of a data analytics course. Overall I would say now that statistics is not a natural subject, basically invented by a few guys like Fisher and Pearson and hence, though possible to pass, is impossible to remember a few weeks after the final exam. It's tyrannical in its formalism - without even mentioning one is usually learning frequentism. All my lecturers/tutors on the subject had 0 creativity, were slaves to method (which is why they were good at it) and arrogant. People like CS because it allows one's creativity to flourish rather than be imprisoned under a pile of rules where there is only one way and one answer.
Posted by: james | August 27, 2014 at 03:40
I'm about to start a Ph.D. in Robotics tomorrow. I'll be studying machine learning, computer vision, etc. among other things. I'm aware of this issue but don't know much statistics, and I'm interested in learning what I could do to be part of the solution. What would you suggest someone like me learn so I can hold my own? Which classes should I take? Should I try to get someone in the statistics department on my Ph.D. defense committee in a few years?
Posted by: Andrew Hundt | August 27, 2014 at 06:39
When I was a student and professor, introductory statistics was a less demanding alternative to calculus. It argued from authority rather than reason. I've listened to eminent statisticians argue that only accepted/vetted tests should be used. While my best collaborations are with statisticians, many of the statisticians I've known have no facility or respect for numerical techniques or programming. It is a sick culture that dominates an exciting important field.
Posted by: Andrew Fraser | August 27, 2014 at 06:39
Andrew Hundt: if you want a textbook, Larry Wasserman's _All of Statistics_ was basically designed for technically strong CS students who want to learn more about statistics.
In response to the article -- the bitterness on this side of the argument always puzzles me. I keep hearing statisticians complain that they invented cross-validation, nonparametric methods, etc. And they are right! So why don't statisticians *teach* these things in introductory courses? You can take two or three semesters of undergrad stats and never hear of those things. But a college sophomore can understand their usefulness and take away great insights from them. And indeed, ML courses teach them straight away, even to students with little statistical/mathematical background.
Apart from educational and cultural issues, there's also just differences in emphasis. You typically don't see much stochastic gradient descent, optimization theory, or parallelized computation, or, yes, neural networks, talked about too much in statistics journals or courses. These things are pretty important. Artificial intelligence sorts of problems -- images, speech, language, for example -- just lend themselves to a different set of emphases than Fisher-era problems in statistics. Of course there are many commonalities, but the fact commonalities exist isn't, by itself, enough to solve these problems or train students to make progress.
Posted by: Brendan O'Connor | August 27, 2014 at 07:13
Andrew Hundt: oh, you're at CMU! The book partly overlaps with Stat 705, which Larry used to teach. Though it's technically less in-depth and covers some other things too. But I'm not sure if they've changed the 705 curriculum (I took it a few years ago.)
Posted by: Brendan O'Connor | August 27, 2014 at 07:14
Thanks for the interesting comments, everyone. I can't reply in detail here, but I'll make a couple of quick remarks.
To the poster who insists that neural networks are fundamentally different from other methods, and who offers as evidence how well NNs did on some application, I would ask that he read my post again, especially the part about devoting enormous resources to a problem.
To the student about to start a doctorate in Robotics, with heavy doses of machine learning and the like, I say good for you! I'm sure you'll enjoy it. As to Stat, I'd recommend you learn the principles behind regression and classification well, and a bit of undergraduate level math stat; these could be done via coursework, or by reading. I of course would recommend my own book, open source, at heather.cs.ucdavis.edu/probstatbook
Concerning the poster who wants to murder R :-) I suspect that his students were simply responding to his own biases. I agree that Python is cleaner and more elegant, but so what? R gets the job done and is far more powerful than Python in what counts, namely Stat methods. I've used R in CS classes for years, and have never heard any complaints.
Posted by: Norm Matloff | August 27, 2014 at 09:38
1+1 >2, no one's losing the battle.
Posted by: Leah Jiang | August 27, 2014 at 09:41
Re: Tom | August 26, 2014 at 23:35
One of the reason CS people at my institution are turned off by most statistics classes is R. Yes, it's still better and more engaging than say, STATA or SPSS. But as a programming language it is horrible and needs to die in a fire.
Would "CS people" be equally turned off by using bash for system administration scripting? The relationship between R and Statistics and bash/Powershell* and IT can be easily seen to be analogous. Programming languages are more than just syntax, they are platforms that constrain our thought (similar to natural languages - see the Sapir-Whorf hypothesis in linguistics). When you use a general-purpose platform, such as Python, you are not limited in what you can do, so you can easily write distracted code that falls outside of the paradigm you are working within - this can be a blessing for creativity, but a curse for fields that require the use of a common internally consistent framework, such as statistics. Domain-specific languages, while certainly atrocious in their design (don't get me started on MATLAB), at least make sure that analyses are reasonably reproducible and homogeneous, since everyone must use the same limited subset of functions (and by "everyone" I mean the community of users for a given language, which is also crucial).
It is my conviction that both domain-specific and general purpose languages have their place- general languages like Python can be for more idiosyncratic, creative configurations and platforms like bash and R can be for more stereotyped tasks. Sometimes you only need a solid and well-developed pair of scissors, not a Swiss Army Knife with a functional, but lower quality pair of scissors. For mature fields, tools that do one thing and do it well are essential.
(* Yes I know, Perl/Python are also used for some system administration tasks, but I am assuming that those tasks are substantially more idiosyncratic than basic init scripts and the like. I am open to being proven wrong on this point.)
Posted by: Jaipelai | August 27, 2014 at 09:42
It's tyrannical in its formalism - without even mentioning one is usually learning frequentism. All my lecturers/tutors on the subject had 0 creativity, were slaves to method (which is why they were good at it) and arrogant. People like CS because it allows one's creativity to flourish rather than be imprisoned under a pile of rules where there is only one way and one answer.
That's a very strange experience you had with statistics- very different from my understanding of how the field works. There should be plenty of creativity involved in stats, as it is much less formal and rigorous than other branches of math.
My best understanding of statistics is that it is more mathematically-informed argumentation than anything else, and that statisticians are more like lawyers in some ways than mathematicians (there is an interesting etymology here in the sense that statistics is derived from a root word meaning "state affairs"). The chief difference would be that lawyers argue from man-made laws, while statisticians strive to base their arguments off of natural laws. If lawyers can be creative in their argumentation, then so can statisticians.
Posted by: Jaipelai | August 27, 2014 at 09:50
One last relevant cross-post before I disappear.
Posted by: Jaipelai | August 27, 2014 at 10:00
Statistics is to data science what astronomy is to physics. To understand the numerous and fundamental differences, read my article "16 analytic disciplines compared to data science" at http://www.datasciencecentral.com/profiles/blogs/17-analytic-disciplines-compared
Posted by: Vincent Granville | August 27, 2014 at 10:27
"For a long time I have thought I was a statistician" is JW Tukey's opening in "The future of data analysis”. If Tukey resigned from statistics in 1969, CS and Data Science are just scapegoats. James would be surprised to find his argument in that old paper.
Posted by: piccolbo | August 27, 2014 at 16:37
Good, resonant article (I took a Data Science course recently and was mildly horrified) but I am tempted to view all this as a kind of poetic justice: CS in the early 21st century giving (orthodox/frequentist) Statistics a taste of the medicine it gave Science in the 20th century.
@Jaipelai “Would "CS people" be equally turned off by using bash for system administration scripting?”
I wonder if this is less a case of "CS people" complaining about R and more a case of "blub programmers" complaining - I suspect the complaining would become even louder if R became “more powerful than they can possibly imagine”: https://www.stat.auckland.ac.nz/~ihaka/downloads/Compstat-2008.pdf
Posted by: phayes | August 27, 2014 at 17:05
I was intrigued by your article as I've been trying to understand why AP statistics has actually become so popular in recent years. A comparison with CS is also revealing.
Posted by: matt parry | August 28, 2014 at 00:40
"CS research is largely empirical/experimental in nature"
Please, do not confuse CS with engineering. Take a look at the Turing Award winners (http://amturing.acm.org/byyear.cfm), just to see where theory is!
Posted by: David | August 28, 2014 at 05:33
"machine learning really IS statistics."
That's a bit like saying physics IS mathematics. Machine learning APPLIES statistics. It's different. Statistics is an important tool in machine learning, perhaps one of the most important right now. But neuroscience and biology are also historically quite important to the field of machine learning (yes: neural networks are directly inspired by the brain, and genetic algorithms by biological evolution).
Posted by: John | August 28, 2014 at 08:05
While working on her undergrad in Psych, my wife's first statistics professor was blind - literally. His TA was a Golden Retriever. Her notes were filled with the words Delta, Sigma and so on.
I am a programmer of many years experience and no formal statistics training (other than helping my wife in that course :)) and would respectfully suggest that the field of Statistics would do well to embrace the tools offered by CS and use them to develop, enhance and improve their field.
Posted by: Murray Whipps | August 28, 2014 at 09:18
Of course Computer Science is going to be more attractive than Statistics.
This is because of Sutton's Zeroth Law of Statistics: T=Sc, where:
S is the total amount of knowledge available in the field of Statistics;
c is a constant, greater than 6;
T is the amount of tedium experienced in assimilating that knowledge.
Posted by: Dan Sutton | August 28, 2014 at 09:26
As a biostatistician, I think the critique is important, though rather one-sided, and I agree with most of it. However, rather than viewing this as an us-versus-them problem, I think the solution is better intentional integration of the specialties of (and specialists from) Computer Science and Engineering (Big Data and Machine Learning) and Statistics.
ANY recognition or prediction problem addressed by an algorithm of any sort (whether it's called a statistical model, AI, or something else) must be concerned with the probability that the recognition or prediction is accurate. For most problems, the likelihood of various degrees of relative inaccuracy is also of vital importance. Ultimately, these are questions of assessing probability and, by definition, are problems addressable by the fields of Probability and Statistics.
I think the lack of response to this article from other statisticians is telling and rather depressing. It illustrates the very apathy that you are writing about.
I’m concerned when analysis consumers are enamored of new buzz words coming from the former of these two fields. I’m concerned when anyone (statistician, engineer, or computer scientist) relies on methods and theory that they don’t adequately understand. I’m also concerned when statisticians bury their heads in the sand and don’t take an active role in better unifying the practice of these two fields—and statisticians are guilty of this way too often.
I am offended by some proponents of other fields who deny the relevance of Statistics as the underpinning of what they are doing or who show something between disdain and indifference at the idea of collaborating more closely with Statisticians. On the other hand, I view as foolish those statisticians who are likewise uninterested in collaboration, continuing to bury their heads and deny the relevance and importance of technical advances and potential applications that have arisen from the efforts of Machine Learning and Big Data science.
Posted by: Doug Langbehn | August 28, 2014 at 10:01
I'm a CS grad (PhD, Stanford), the builder of math/stat modeling software (see www.civilized.com), and a great admirer of the theory of statistics. I think part of the issue is that the depth of statistics is rarely plumbed - how many non-statisticians can appreciate Feller, Vol. 2, or Kendall's Theory of Statistics? Well, some can, but most CS students (and faculty) would benefit from a few good courses introducing what "real" statistics is and something about its deep mathematical roots. (I realize you can say the same about algebra, analysis, etc. as well. There is too much to know and too little time to learn it.)
There is also the CS disease of novelty-questing (defined by new jargon applied to old domains) that festers because of the need to sell something (in the commercial world) or to get something funded (in the academic world).
I recommend that Statistics faculty teach in the CS department, or at least that interdisciplinary programs be pursued that let CS students learn that many wheels have already been invented. That doesn't mean they can't be improved on, but at least, maybe the names will not change so much.
(How many CS students are studying "The Art of Computer Programming" these days? There is a lot of statistical theory there, and properly credited too.)
Posted by: gary knott | August 28, 2014 at 14:01
I come from a stats background. In my opinion, the main reason CS wins over stats is because the CS folks are *getting sh*# done*. The stats people are stuck in the ivory tower arguing about Bayesian versus frequentist, and doing proofs. At the end of the day, the model is just one small part of making data actionable (not to burst your bubble, but probably the easiest part). The CS folks are able to do things soup to nuts -- capture data, model it, embed it in an application, scale the application, and push it out into the world. In general the stats guys only focus on the modeling, and are left doing mathematical masturbation on the iris data set while CS guys are out building startups and high frequency trading algos off "bad models".
Posted by: Scotty Nelson | August 28, 2014 at 20:53
In response to Nelson's comment above, I agree that the bottom line is that CS folks have gotten things done when the field overtly called Statistics has often failed to heed a call for useful action. However, one of my biggest sources of frustration with CS is the disdain (rather obvious in the above case) with which many in that field view Statistics. Yes, CS folks get things done. I've had several experiences though where the things they're doing could clearly be done better with some tweaking based on a better understanding of statistical theory. I've personally had too many experiences where my nominal colleagues from CS (working on the same overall project) have rejected attempts at collaboration.
In my earlier posting, I called for better collaboration between the two disciplines, and this is still my main message. Statistics has dropped the ball too many times in trying to tackle important real-world problems. The more pragmatic attitude of many in the CS field has led to solutions—sometimes spectacularly successful solutions—to many of these problems. But in many other cases, the solutions could be better if the statisticians are willing to get their hands dirty and the CS folks explicitly acknowledge and take full advantage of the discipline from which many of their tools originate.
Posted by: Doug Langbehn | August 29, 2014 at 10:28
www.amstat.org/about/ethicalguidelines.cfm
Re: Stats vs CS
Both are necessary. Neither is sufficient. But before data mining to test theories about engineering the happiness of Facebook users I strongly advise reading the American Statistical Association's code of ethics. I'm sure CS and MBA programs have something similar but I've found this one particularly good at "adjusting" one's perspective.
Posted by: Jeff Zahir | August 30, 2014 at 14:21
Yoshua Bengio's take on how deep learning is fundamentally different from other parametric curve fitting methods:
"The BIG difference between deep learning and classical non-parametric statistical machine learning is that we go beyond the SMOOTHNESS assumption and add other priors such as
- the existence of these underlying generative factors (= distributed representations)
- assuming that they are organized hierarchically by composition (= depth)
- assuming that they are causes of the observed data (allows semi-supervised learning to work)
- assuming that different tasks share different subsets of factors (allows multi-task learning and transfer learning to tasks with very few labeled examples)
- assuming that the top-level factors are related in simple ways to each other (makes it possible to stick a simple classifier on top of unsupervised learning and get decent results, for example)"
https://plus.google.com/+YoshuaBengio/posts/GJY53aahqS8
Posted by: Alex Lamb | September 05, 2014 at 16:08
some good some bad here...
but author dude, this is just an idiotic statement to make "The most exasperating part of all this is that AP Stat officially relies on TI-83 pocket calculators as its computational vehicle. The machines are expensive, and after all we are living in an age in which R is free! "
Please come back to reality, author dude. R is indeed free (not sure what the exclamation point is trying to prove - we all know that). Curiously, the hardware needed to run R is not! [that's for emphasis].
According to amazon.com, I (or a high school student, or his/her family or school district) can pick up a used TI for $45 or so. That could be one family's daily take-home pay for, sadly, a nonzero percentage of people in America. What kind of R-compatible laptop can I get for that? Does R run nicely on a Chromebook? Do high school students want to run Linux? Do you honestly believe every school has tons of laptops available to their students for home and school use?
Technology is both great and expensive. Also, a high school student can get into a lot less trouble trying to use social media on a TI calculator (hint - they can't) than on a school-issued laptop, if they are even issued one.
So please don't just shout and say, yup, the problem is these kids aren't getting the best technology out there, fix that and we're done. It's content, enthusiasm, training, and teamwork.
Posted by: bob bobster | September 06, 2014 at 12:07
I'm surprised no one has mentioned engineers reinventing statistics under the moniker "uncertainty quantification." There are many similarly dissatisfied statisticians and relatively ill-informed engineers (relative to their engineering expertise, that is) in that developing discussion.
Posted by: Paul | September 18, 2014 at 12:40