
August 26, 2014



> So, unlike the “CS usurpation problem,” whose solution is unclear,

If anybody thinks this "usurpation" is a "problem", then they've got the wrong attitude from the beginning. Computer programming is a very useful skill, and critical for anybody serious about a career in statistics. Just because those arrogant people in the CS department know how to code doesn't make it "forbidden knowledge" that statisticians should avoid.

A broad understanding of stats includes a lot of things beyond mathematical formalism. For example, we often need to pick up domain knowledge relevant to the dataset we are working on - genetics for example. We should learn computer programming with the same enthusiasm.

CS, and in particular the skills of manipulating data structures and algorithms, should not be seen as the "enemy". Mathematically minded people usually find it easy to learn how to code, *if* they are encouraged to do so. Stats departments should focus on improving their own skills, instead of jealously attacking the CS people.

@Aaron McDaid

Every statistician in academia already knows how to program. Nobody is saying that they shouldn't.

It's sort of mindblowing that that's how you interpreted the article.

This article sounds like a sore loser who just lost a grant and all his grad students.

Link *destructive nature of AP Stat* doesn't work - it has an extra prefix.

"machine learning really IS statistics"

I think that the main difference between the two fields is that machine learning is mostly done in areas where probabilistic modeling is merely a means to an end, and it is understood that one could have almost no uncertainty given a sufficiently powerful model. This is largely the case in vision and speech recognition, where humans are able to make nearly perfect predictions and there is very little or no actual noise in the data.

There is still some value in estimating a probability distribution over the various classes, but it is just a tool for making more accurate predictions.

"neural nets are not fundamentally different from other parametric and nonparametric methods for regression and classification"

Yes, they are. There is a world of difference between deep algorithms like neural networks and shallow algorithms like random forests and logistic regression, both in theory and in practice. There are a number of problems where deep functions are more efficient and generalize better than shallow functions. N-bit parity is a common example.

In the ImageNet competitions, convolutional neural networks outperform other algorithms by large margins. This is not just a result of having faster computers or bigger data sets. You could run an SVM/logistic regression with hand-engineered or randomly generated features and far more computing power and get worse results.
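The parity example can be made concrete. A depth-2 threshold network computes n-bit parity with just n hidden units, while no single linear threshold on the raw bits can compute it at all for n >= 2 (parity is not linearly separable). A small illustrative sketch of my own, not taken from the comment:

```python
from itertools import product

def step(z):
    """Hard threshold unit."""
    return 1 if z > 0 else 0

def parity_net(bits):
    """Depth-2 threshold network: n hidden units compute the parity of n bits."""
    n = len(bits)
    s = sum(bits)
    # Hidden unit k fires exactly when at least k input bits are on.
    hidden = [step(s - (k - 0.5)) for k in range(1, n + 1)]
    # The alternating sum of the hidden units is 1 when s is odd, 0 when even.
    out = sum(h if k % 2 == 1 else -h for k, h in enumerate(hidden, 1))
    return step(out - 0.5)

# Exhaustive check for n = 4: the network matches parity on all 16 inputs.
print(all(parity_net(b) == sum(b) % 2 for b in product([0, 1], repeat=4)))  # True
```

A shallow expansion, by contrast, needs exponentially many product features to represent parity exactly, which is the efficiency gap being pointed at.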

"fundamentals underlying the problems they work on. For example, a recent paper in a top CS conference incorrectly stated that the logistic classification model cannot handle non-monotonic relations between the predictors and response variable; actually, one can add quadratic terms, and so on, to models like this."

Presumably he meant that it cannot learn it automatically.
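A quick sketch of the correction being quoted, in plain Python with made-up data (the dataset and the gradient-descent fitter are purely illustrative): a logistic model that is linear in x cannot capture a U-shaped relation, but adding an x^2 term handles it fine.

```python
import math
import random

random.seed(0)
# Non-monotonic relation: y = 1 exactly when |x| > 1.5 (a U-shape in x).
xs = [random.uniform(-3, 3) for _ in range(100)]
ys = [1.0 if abs(x) > 1.5 else 0.0 for x in xs]

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp for numerical safety
    return 1.0 / (1.0 + math.exp(-z))

def fit(features, ys, lr=0.2, steps=4000):
    """Plain batch gradient descent on the logistic log-likelihood."""
    w, b, n = [0.0] * len(features[0]), 0.0, len(ys)
    for _ in range(steps):
        gw, gb = [0.0] * len(w), 0.0
        for f, y in zip(features, ys):
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)
            for i, fi in enumerate(f):
                gw[i] += (p - y) * fi
            gb += p - y
        w = [wi - lr * gi / n for wi, gi in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def accuracy(features, ys, w, b):
    hits = sum((sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) > 0.5) == (y == 1.0)
               for f, y in zip(features, ys))
    return hits / len(ys)

linear = [[x] for x in xs]       # logistic model, linear in x
quad = [[x, x * x] for x in xs]  # same model with a quadratic term added

w1, b1 = fit(linear, ys)
w2, b2 = fit(quad, ys)
print(accuracy(linear, ys, w1, b1))  # poor: no single monotone cut in x works
print(accuracy(quad, ys, w2, b2))    # high: the x^2 term restores separability
```

The same trick extends to interaction terms and splines; the model is linear in its parameters, not in the raw predictors.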

> Switching to R would be doable–and should be done.

An article that complains about "statistics losing image among students" and then goes on to recommend a solution that will hurt just as much in the long term? One of the reasons CS people at my institution are turned off by most statistics classes is R. Yes, it's still better and more engaging than, say, STATA or SPSS. But as a programming language it is horrible and needs to die in a fire. This is a major turn-off, at least in my experience. If Statistics wants to stop losing ground to CS, I propose taking a page out of their book and switching to a nice, clean, all-purpose programming language. Python has grown to offer the same tools as R (given pandas, statsmodels, matplotlib and sklearn), but does most of the work with much more class and is a sane programming language.


I studied stats and economics at uni (and stats was one of my stronger subjects); later I did CS, work as a programmer, and recently took a stats course as part of a data analytics course. Overall I would say now that statistics is not a natural subject, basically invented by a few guys like Fisher and Pearson, and hence, though possible to pass, is impossible to remember a few weeks after the final exam. It's tyrannical in its formalism, without even mentioning that one is usually learning frequentism. All my lecturers/tutors on the subject had zero creativity, were slaves to method (which is why they were good at it), and were arrogant. People like CS because it allows one's creativity to flourish rather than being imprisoned under a pile of rules where there is only one way and one answer.

I'm about to start a Ph.D. in Robotics tomorrow. I'll be studying machine learning, computer vision, etc. among other things. I'm aware of this issue but don't know much statistics, but I'm interested in learning what I could do to be part of the solution. What would you suggest someone like me learn so I can hold my own? Which classes should I take? Should I try to get someone in the statistics department on my Ph.D. defense committee in a few years?

When I was a student and professor, introductory statistics was a less demanding alternative to calculus. It argued from authority rather than reason. I've listened to eminent statisticians argue that only accepted/vetted tests should be used. While my best collaborations are with statisticians, many of the statisticians I've known have no facility or respect for numerical techniques or programming. It is a sick culture that dominates an exciting important field.

Andrew Hundt: if you want a textbook, Larry Wasserman's _All of Statistics_ was basically designed for technically strong CS students who want to learn more about statistics.

In response to the article -- the bitterness on this side of the argument always puzzles me. I keep hearing statisticians complain that they invented cross-validation, nonparametric methods, etc. And they are right! So why don't statisticians *teach* these things in introductory courses? You can take two or three semesters of undergrad stats and never hear of those things. But a college sophomore can understand their usefulness and take away great insights from them. And indeed, ML courses teach them straight away, even to students with little statistical/mathematical background.
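It really is teachable at that level. Here is a sketch of k-fold cross-validation in plain Python (the synthetic data and the two candidate models are my own, purely to illustrate): hold out each fold in turn, fit on the rest, and score on the held-out part.

```python
import random

random.seed(1)
# Synthetic data with a genuine linear trend plus noise.
xs = [random.uniform(0, 10) for _ in range(100)]
ys = [2.0 * x + random.gauss(0, 1) for x in xs]

def fit_mean(xtr, ytr):
    """Candidate 1: ignore x, always predict the training mean."""
    m = sum(ytr) / len(ytr)
    return lambda x: m

def fit_line(xtr, ytr):
    """Candidate 2: simple least-squares line."""
    mx, my = sum(xtr) / len(xtr), sum(ytr) / len(ytr)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xtr, ytr))
    sxx = sum((x - mx) ** 2 for x in xtr)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def cv_mse(fitter, xs, ys, k=5):
    """k-fold cross-validated mean squared prediction error."""
    n = len(xs)
    fold = n // k
    err = 0.0
    for i in range(k):
        lo, hi = i * fold, (i + 1) * fold
        model = fitter(xs[:lo] + xs[hi:], ys[:lo] + ys[hi:])  # train off-fold
        err += sum((model(x) - y) ** 2 for x, y in zip(xs[lo:hi], ys[lo:hi]))
    return err / (k * fold)

print(cv_mse(fit_mean, xs, ys))  # large: this model ignores the trend
print(cv_mse(fit_line, xs, ys))  # small: close to the noise variance
```

The held-out error, not the training fit, is what picks the better model; that single idea is within reach of any sophomore.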

Apart from educational and cultural issues, there are also just differences in emphasis. You typically don't see much stochastic gradient descent, optimization theory, parallelized computation, or, yes, neural networks in statistics journals or courses. These things are pretty important. Artificial intelligence sorts of problems -- images, speech, language, for example -- just lend themselves to a different set of emphases than Fisher-era problems in statistics. Of course there are many commonalities, but the fact that commonalities exist isn't, by itself, enough to solve these problems or train students to make progress.

Andrew Hundt: oh, you're at CMU! The book partly overlaps with Stat 705, which Larry used to teach. Though it's technically less in-depth and covers some other things too. But I'm not sure if they've changed the 705 curriculum (I took it a few years ago.)

Thanks for the interesting comments, everyone. I can't reply in detail here, but I'll make a couple of quick remarks.

To the poster who insists that neural networks are fundamentally different from other methods, and who offers as evidence how well NNs did on some application, I would ask that he read my post again, especially the part about devoting enormous resources to a problem.

To the student about to start a doctorate in Robotics, with heavy doses of machine learning and the like, I say good for you! I'm sure you'll enjoy it. As to Stat, I'd recommend you learn the principles behind regression and classification well, and a bit of undergraduate level math stat; these could be done via coursework, or by reading. I of course would recommend my own book, open source, at heather.cs.ucdavis.edu/probstatbook

Concerning the poster who wants to murder R :-), I suspect that his students were simply responding to his own biases. I agree that Python is cleaner and more elegant, but so what? R gets the job done and is far more powerful than Python in what counts, namely Stat methods. I've used R in CS classes for years, and have never heard any complaints.

1+1 > 2; no one's losing the battle.

Re: Tom | August 26, 2014 at 23:35
One of the reason CS people at my institution are turned off by most statistics classes is R. Yes, it's still better and more engaging than say, STATA or SPSS. But as a programming language it is horrible and needs to die in a fire.

Would "CS people" be equally turned off by using bash for system administration scripting? The relationship between R and Statistics is easily seen to be analogous to that between bash/Powershell* and IT. Programming languages are more than just syntax; they are platforms that constrain our thought (similar to natural languages - see the Sapir-Whorf hypothesis in linguistics). When you use a general-purpose platform such as Python, you are not limited in what you can do, so you can easily write code that wanders outside the paradigm you are working within - this can be a blessing for creativity, but a curse for fields that require the use of a common internally consistent framework, such as statistics. Domain-specific languages, while certainly atrocious in their design (don't get me started on MATLAB), at least make sure that analyses are reasonably reproducible and homogeneous, since everyone must use the same limited subset of functions (and by "everyone" I mean the community of users for a given language, which is also crucial).

It is my conviction that both domain-specific and general purpose languages have their place- general languages like Python can be for more idiosyncratic, creative configurations and platforms like bash and R can be for more stereotyped tasks. Sometimes you only need a solid and well-developed pair of scissors, not a Swiss Army Knife with a functional, but lower quality pair of scissors. For mature fields, tools that do one thing and do it well are essential.

(* Yes I know, Perl/Python are also used for some system administration tasks, but I am assuming that those tasks are substantially more idiosyncratic than basic init scripts and the like. I am open to being proven wrong on this point.)

> It's tyrannical in its formalism, without even mentioning that one is usually learning frequentism. All my lecturers/tutors on the subject had zero creativity, were slaves to method (which is why they were good at it), and were arrogant. People like CS because it allows one's creativity to flourish rather than being imprisoned under a pile of rules where there is only one way and one answer.

That's a very strange experience you had with statistics- very different from my understanding of how the field works. There should be plenty of creativity involved in stats, as it is much less formal and rigorous than other branches of math.

My best understanding of statistics is that it is more mathematically-informed argumentation than anything else, and that statisticians are more like lawyers in some ways than mathematicians (there is an interesting etymology here in the sense that statistics is derived from a root word meaning "state affairs"). The chief difference would be that lawyers argue from man-made laws, while statisticians strive to base their arguments off of natural laws. If lawyers can be creative in their argumentation, then so can statisticians.

One last relevant cross-post before I disappear.

Statistics is to data science what astronomy is to physics. To understand the numerous and fundamental differences, read my article "16 analytic disciplines compared to data science" at http://www.datasciencecentral.com/profiles/blogs/17-analytic-disciplines-compared

"For a long time I have thought I was a statistician" is J. W. Tukey's opening in "The Future of Data Analysis". If Tukey resigned from statistics in 1969, CS and Data Science are just scapegoats. James would be surprised to find his argument in that old paper.

Good, resonant article (I took a Data Science course recently and was mildly horrified) but I am tempted to view all this as a kind of poetic justice: CS in the early 21st century giving (orthodox/frequentist) Statistics a taste of the medicine it gave Science in the 20th century.

@Jaipelai “Would "CS people" be equally turned off by using bash for system administration scripting?”

I wonder if this is less a case of "CS people" complaining about R and more a case of "blub programmers" complaining - I suspect the complaining would become even louder if R became “more powerful than they can possibly imagine”: https://www.stat.auckland.ac.nz/~ihaka/downloads/Compstat-2008.pdf

I was intrigued by your article as I've been trying to understand why AP statistics has actually become so popular in recent years. A comparison with CS is also revealing.

"CS research is largely empirical/experimental in nature"

Please, do not confuse CS with engineering. Take a look at the Turing Award recipients (http://amturing.acm.org/byyear.cfm), just to see where the theory is!

"machine learning really IS statistics."

That's a bit like saying physics IS mathematics. Machine learning APPLIES statistics. It's different. Statistics is an important tool in machine learning, perhaps one of the most important right now. But neuroscience and biology are also historically quite important to the field of machine learning (yes: neural networks are directly inspired by the brain, and genetic algorithms by biological evolution).

While working on her undergrad in Psych, my wife's first statistics professor was blind - literally. His TA was a Golden Retriever. Her notes were filled with the words Delta, Sigma and so on.

I am a programmer of many years' experience and no formal statistics training (other than helping my wife in that course :)) and would respectfully suggest that the field of Statistics would do well to embrace the tools offered by CS and use them to develop, enhance and improve their field.

Of course Computer Science is going to be more attractive than Statistics.

This is because of Sutton's Zeroth Law of Statistics: T = Sc, where:

S is the total amount of knowledge available in the field of Statistics;
c is a constant, greater than 6;
T is the amount of tedium experienced in assimilating that knowledge.

As a biostatistician, I think the critique is important, though rather one-sided, and I agree with most of it. However, rather than viewing this as an us-versus-them problem, I think the solution is better intentional integration of the specialties of (and specialists from) Computer Science and Engineering (Big Data and Machine Learning) and Statistics.

ANY recognition or prediction problem addressed by an algorithm of any sort (whether it's called a statistical model, AI, or something else) must be concerned with the probability that the recognition or prediction is accurate. For most problems, the likelihood of various degrees of relative inaccuracy is also of vital importance. Ultimately, these are questions of assessing probability and, by definition, are problems addressable by the fields of Probability and Statistics.

I think the lack of response to this article from other statisticians is telling and rather depressing. It illustrates the very apathy that you are writing about.

I’m concerned when analysis consumers are enamored of new buzz words coming from the former of these two fields. I’m concerned when anyone (statistician, engineer, or computer scientist) relies on methods and theory that they don’t adequately understand. I’m also concerned when statisticians bury their heads in the sand and don’t take an active role in better unifying the practice of these two fields—and statisticians are guilty of this way too often.

I am offended by some proponents of other fields who deny the relevance of Statistics as the underpinning of what they are doing, or who show something between disdain and indifference at the idea of collaborating more closely with Statisticians. On the other hand, I view as foolish those statisticians who are likewise uninterested in collaboration, continuing to bury their heads and deny the relevance and importance of technical advances and potential applications that have arisen from the efforts of Machine Learning and Big Data science.

I'm a CS grad (PhD, Stanford), the builder of math/stat modeling software (see www.civilized.com), and a great admirer of the theory of statistics. I think part of the issue is that the depth of statistics is rarely plumbed: how many non-statisticians can appreciate Feller, Vol. 2, or Kendall's Theory of Statistics? Well, some can, but most CS students (and faculty) would benefit from a few good courses introducing what "real" statistics is and something about its deep mathematical roots. (I realize you can say the same about algebra, analysis, etc. as well. There is too much to know and too little time to learn it.)

There is also the CS disease of novelty-questing (defined by new jargon applied to old domains) that festers because of the need to sell something (in the commercial world) or to get something funded (in the academic world).

I recommend that Statistics faculty teach in the CS department, or at least that interdisciplinary programs be pursued that let CS students learn that many wheels have already been invented. That doesn't mean they can't be improved on, but at least maybe the names will not change so much.

(How many CS students are studying "The Art of Computer Programming" these days? There is a lot of statistical theory there, and properly credited too.)

I come from a stats background. In my opinion, the main reason CS wins over stats is that the CS folks are *getting sh*# done*. The stats people are stuck in the ivory tower arguing about Bayesian versus frequentist, and doing proofs. At the end of the day, the model is just one small part of making data actionable (not to burst your bubble, but probably the easiest part). The CS folks are able to do things soup to nuts -- capture data, model it, embed it in an application, scale the application, and push it out into the world. In general the stats guys focus only on the modeling, and are left doing mathematical masturbation on the iris data set while the CS guys are out building startups and high-frequency trading algos off "bad models".

In response to Nelson's comment above, I agree that the bottom line is that CS folks have gotten things done when the field overtly called Statistics has often failed to heed a call for useful action. However, one of my biggest sources of frustration with CS is the disdain (rather obvious in the above case) with which many in that field view Statistics. Yes, CS folks get things done. I've had several experiences though where the things they're doing could clearly be done better with some tweaking based on a better understanding of statistical theory. I've personally had too many experiences where my nominal colleagues from CS (working on the same overall project) have rejected attempts at collaboration.
In my earlier posting, I called for better collaboration between the two disciplines, and this is still my main message. Statistics has dropped the ball too many times in trying to tackle important real-world problems. The more pragmatic attitude of many in the CS field has led to solutions, sometimes spectacularly successful solutions, to many of these problems. But in many other cases, the solutions could be better if the statisticians were willing to get their hands dirty and the CS folks explicitly acknowledged and took full advantage of the discipline from which many of their tools originate.

Re: Stats vs CS
Both are necessary. Neither is sufficient. But before data mining to test theories about engineering the happiness of Facebook users I strongly advise reading the American Statistical Association's code of ethics. I'm sure CS and MBA programs have something similar but I've found this one particularly good at "adjusting" one's perspective.

Yoshua Bengio's take on how deep learning is fundamentally different from other parametric curve-fitting methods:

"The BIG difference between deep learning and classical non-parametric statistical machine learning is that we go beyond the SMOOTHNESS assumption and add other priors such as

- the existence of these underlying generative factors (= distributed representations)

- assuming that they are organized hierarchically by composition (= depth)

- assuming that they are causes of the observed data (allows semi-supervised learning to work)

- assuming that different tasks share different subsets of factors (allows multi-task learning and transfer learning to tasks with very few labeled examples)

- assuming that the top-level factors are related in simple ways to each other (makes it possible to stick a simple classifier on top of unsupervised learning and get decent results, for example)"


some good some bad here...

But author dude, this is just an idiotic statement to make: "The most exasperating part of all this is that AP Stat officially relies on TI-83 pocket calculators as its computational vehicle. The machines are expensive, and after all we are living in an age in which R is free!"

Please come back to reality, author dude. R is indeed free (not sure what the exclamation point is trying to prove; we all know that). Curiously, the hardware needed to run R is not! [that's for emphasis]

According to amazon.com, I (or a high school student, or his/her family or school district) can pick up a used TI for $45 or so. That could be one family's daily take-home pay for, sadly, a nonzero percentage of people in America. What kind of R-compatible laptop can I get for that? Does R run nicely on a Chromebook? Do high school students want to run Linux? Do you honestly believe every school has tons of laptops available to their students for home and school use?

Technology is both great and expensive. Also, a high school student can get into a lot less trouble trying to use social media on a TI calculator (hint: they can't) than on a school-issued laptop, if they are even issued one.

So please don't just shout and say: yup, the problem is these kids aren't getting the best technology out there; fix that and we're done. It's content, enthusiasm, training, and teamwork.

I'm surprised no one has mentioned engineers reinventing statistics under the moniker "uncertainty quantification." There are many similarly dissatisfied statisticians and relatively ill-informed engineers (relative to their engineering expertise, that is) in that developing discussion.

The comments to this entry are closed.
