
September 30, 2009



The following post by John Langford illustrates some additional differences between the two fields: http://hunch.net/?p=318

But by far the best description of the differences between ML and Stats is the one Leo Breiman gives in his Wald Lectures (http://www.stat.berkeley.edu/~breiman/). Breiman is in a unique position: born a probabilist, he went into statistical consulting for a decade or so and emerged at the intersection of ML and Statistics.

I see three major differences between the two fields.

1. ML focuses on finite-sample bounds; that's why Bayesian and PAC approaches are so popular in the literature. Conversely, a lot of Statistics works on "endless asymptotics" (Breiman again).

2. A majority of Statistics research works on structural models in a parametric framework. You *know* something about your problem, and you use custom tools for that problem, as in Econometrics or Industrial Statistics. As a result, Statistics suffers from overspecialization and inertia (a lot of established areas are not so interesting anymore, but still attract brains). ML, by contrast, has only a few well-defined canonical problems in mind: supervised/unsupervised learning, active learning, and a few others.

3. ML explicitly includes computational issues in the analysis, whereas in Statistics this is an acquired taste.
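To make the contrast in point 1 concrete, here is one standard pair of results for a sample mean. This is an illustrative choice on my part, not something taken from Breiman's lectures. Assume X_1, ..., X_n are i.i.d. with values in [0, 1], mean \mu, and variance \sigma^2. The first statement (Hoeffding's inequality) holds for every finite n; the second (the central limit theorem) only describes the limit as n grows.

```latex
% Finite-sample bound (Hoeffding): valid for every n >= 1 and every eps > 0
\Pr\left( \left| \bar{X}_n - \mu \right| \ge \varepsilon \right)
  \le 2 \exp\left( -2 n \varepsilon^2 \right)

% Asymptotic statement (CLT): describes only the limit n -> infinity
\sqrt{n} \left( \bar{X}_n - \mu \right)
  \xrightarrow{\,d\,} \mathcal{N}\left( 0, \sigma^2 \right)
```

The first kind of guarantee is what PAC-style analyses are built on; the second is the kind of result the "endless asymptotics" remark is aimed at.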

The third point is relevant to R: it was designed by computational statisticians, not computer scientists, and its initial design choices (weak typing, single-threading, an interpreter) are affecting its ability to keep up. I remember Brian Ripley asking for which problems R is inadequate because it cannot handle enough data. The answer is: a lot! Many real-world data sets are in the range 1GB-1TB, and some are bigger. The Netflix Prize is just one case in point, and this year's KDD Cup (large data set) wasn't suitable for R either. On the plus side, R has all the flexibility and versatility a statistician needs.
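For a rough sense of scale, here is a back-of-the-envelope sketch in Python. It assumes the whole table is loaded into memory as dense 8-byte doubles (the storage R uses for numeric vectors) and ignores all per-object overhead. The function name and the table shapes are hypothetical, picked only for illustration.

```python
# Rough in-memory size of a dense numeric table, assuming 8 bytes per value
# (a double) and no overhead for headers, copies, or missing-value handling.
BYTES_PER_DOUBLE = 8


def dense_table_gib(n_rows: int, n_cols: int) -> float:
    """Return the size in GiB of an n_rows x n_cols table of doubles."""
    return n_rows * n_cols * BYTES_PER_DOUBLE / 2**30


if __name__ == "__main__":
    # Hypothetical shapes, not taken from any real data set.
    shapes = [(1_000_000, 10), (100_000_000, 10), (1_000_000_000, 100)]
    for n_rows, n_cols in shapes:
        size = dense_table_gib(n_rows, n_cols)
        print(f"{n_rows:>13,} rows x {n_cols:>3} cols: {size:10.2f} GiB")
```

Any intermediate copy made during an analysis adds to this figure, so the practical ceiling is usually well below the machine's physical RAM.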
