
September 30, 2009


Listed below are links to weblogs that reference The difference between Statistics and Machine Learning:

Comments


The following post by John Langford illustrates some additional differences between the two fields: http://hunch.net/?p=318

But by far the best description of the differences between ML and Statistics is Leo Breiman's, in his Wald Lectures (http://www.stat.berkeley.edu/~breiman/). Breiman is in a unique position: born a probabilist, he went into statistical consulting for a decade or so and emerged at the intersection of ML and Statistics.

I see three major differences between the two fields.

1. ML focuses on finite-sample bounds; that's why Bayesian and PAC approaches are so popular in the literature. Conversely, a lot of Statistics works on "endless asymptotics" (Breiman again).

2. Most Statistics research works with structural models in a parametric framework. You *know* something about your problem, and you use custom tools for that problem, as in Econometrics or Industrial Statistics. As a result, the field suffers from overspecialization and inertia (a lot of established areas are not so interesting anymore, but still attract brains). ML, by contrast, has a small set of well-defined canonical problems in mind: supervised/unsupervised learning, active learning, and a few others.

3. ML explicitly includes computational issues in the analysis, whereas in Statistics this is an acquired taste.
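To make the first point concrete, here is a minimal sketch contrasting a finite-sample guarantee with an asymptotic approximation for the mean of i.i.d. observations in [0, 1]. It uses Hoeffding's inequality as one example of a finite-sample bound and the usual normal approximation as the asymptotic counterpart. The function names and the worst-case choice `sigma=0.5` are my own assumptions for illustration, not something taken from Breiman or Langford.

```python
import math
from statistics import NormalDist


def hoeffding_halfwidth(n, delta):
    """Finite-sample bound from Hoeffding's inequality.

    For n i.i.d. draws in [0, 1], the sample mean lies within this
    distance of the true mean with probability at least 1 - delta,
    for every n (no limiting argument involved).
    """
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def clt_halfwidth(n, delta, sigma=0.5):
    """Asymptotic half-width from the normal approximation.

    sigma=0.5 is the largest standard deviation a [0, 1] variable can
    have (an assumption chosen to make the comparison fair). The
    coverage is only approximate for finite n.
    """
    z = NormalDist().inv_cdf(1.0 - delta / 2.0)
    return z * sigma / math.sqrt(n)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, hoeffding_halfwidth(n, 0.05), clt_halfwidth(n, 0.05))
```

The Hoeffding interval is wider at every n: that is the price of a guarantee that holds for any sample size, whereas the normal interval is tighter but only justified in the limit.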


The last point is relevant to R: it was designed by computational statisticians, not computer scientists, and the initial design choices (weak typing, single-threaded execution, an interpreter) are affecting its ability to keep up. I remember Brian Ripley asking for which problems R is inadequate because the amount of data it can handle is too small. The answer is: a lot! Many real-world data sets are in the range 1GB-1TB, and some are bigger. The Netflix prize is just a case in point, and this year's KDD cup (large data set) wasn't suitable for R either. On the plus side, R has all the flexibility and versatility a statistician needs.
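For data in that size range, the usual workaround is to avoid loading everything into memory and to compute statistics in a single pass instead. Below is a minimal sketch of that idea using Welford's online algorithm for the mean and variance. It is written in Python purely for illustration; the function name and the one-number-per-line input format are assumptions, and nothing here describes how R or any particular package handles large data.

```python
import io


def streaming_mean_var(lines):
    """Single-pass mean and sample variance (Welford's algorithm).

    `lines` is any iterable of text lines holding one number each,
    for example an open file handle. Only three running values are
    kept in memory, regardless of how long the input is.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        x = float(line)
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    var = m2 / (n - 1) if n > 1 else float("nan")
    return n, mean, var


if __name__ == "__main__":
    # Small in-memory stand-in for a file too large to load at once.
    demo = io.StringIO("1\n2\n3\n4\n")
    print(streaming_mean_var(demo))
```

With a real file, the same function could be called as `streaming_mean_var(open("big_file.txt"))`, where the file name is hypothetical; the file is then read line by line rather than all at once.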

