I get my daily R fortune by following Rfortunes on Twitter. This one came up the other day:

To paraphrase provocatively, 'machine learning is statistics minus any checking of models and assumptions'. Brian D. Ripley.

In a similar vein, back in December Brendan O'Connor remarked upon Rob Tibshirani's comparison of machine learning and statistics, reproduced here:

**Glossary**

| Machine learning | Statistics |
|---|---|
| network, graphs | model |
| weights | parameters |
| learning | fitting |
| generalization | test set performance |
| supervised learning | regression/classification |
| unsupervised learning | density estimation, clustering |
| large grant = $1,000,000 | large grant = $50,000 |
| nice place to have a meeting: Snowbird, Utah, French Alps | nice place to have a meeting: Las Vegas in August |

It's certainly a pithy comparison. Brendan O'Connor concurs that the differences between the two are more superficial than substantive, and his thoughts on the cultural differences between the two disciplines are very interesting. Amongst other things, his comparison of two similar courses at Stanford (one from the Computer Science department, one from Statistics) leads him to conclude:

ML sounds like it’s young, vibrant, interesting to learn, and growing; Stats does not.

So, do statisticians "merely" have an image problem in this field, or is there something more substantive at play? Perhaps protests like this are in our future...

*CMU machine learning students "protest" at the G20 summit in Pittsburgh, September 25 2009. Photo by Arthur Gretton on Flickr.*

AI and Social Science: Statistics vs. Machine Learning, fight! (via @Cmastication)

The following post by John Langford illustrates some additional differences between the two fields: http://hunch.net/?p=318

But by far the best description of the differences between ML and Stats is by Leo Breiman in his Wald Lectures: http://www.stat.berkeley.edu/~breiman/ Breiman is in a unique position. Born a probabilist, he went into statistical consulting for a decade or so, and emerged at the intersection of ML and Statistics.

I see three major differences between the two fields.

1. ML focuses on finite-sample bounds; that's why Bayesian and PAC approaches are so popular in the literature. Conversely, much of Statistics works on "endless asymptotics" (Breiman again).

2. A majority of Statistics research works on structural models in a parametric framework. You *know* something about your problem, and use custom tools for that problem: econometrics, industrial statistics, and so on. As a result, the field suffers from overspecialization and inertia (a lot of established areas are not so interesting anymore, but still attract brains). ML, by contrast, keeps only a few well-defined canonical problems in mind: supervised/unsupervised learning, active learning, and a handful of others.

3. ML explicitly includes computational issues in the analysis, whereas in Statistics this is an acquired taste.

The last point is relevant to R, since it was designed by computational statisticians, not computer scientists, and the initial design choices (weak typing, single-threading, an interpreter) are affecting its ability to keep up. I remember Brian Ripley asking for which problems R is inadequate because it can handle only small amounts of data. The answer: a lot of them! Many real-world data sets are in the 1GB-1TB range, and some are bigger. The Netflix Prize is just a case in point, but this year's KDD Cup (large data set) wasn't suitable for R either. On the plus side, R has all the flexibility and versatility a statistician needs.
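To make the in-memory constraint concrete, here is a back-of-the-envelope calculation (sketched in Python only because the arithmetic is language-agnostic; the data-set dimensions are illustrative, not from the KDD Cup or Netflix data):

```python
# Base R (circa 2009) keeps all working data in RAM as dense copies.
# A numeric matrix of 100 million rows by 10 columns stores 1e9 doubles:
n_cells = int(1e8) * 10        # 1,000,000,000 doubles
bytes_needed = n_cells * 8     # 8 bytes per double-precision float
gib = bytes_needed / 2**30     # convert to GiB
print(round(gib, 2))           # ~7.45 GiB, before any copies are made
```

Since many R modeling functions copy their arguments, the practical ceiling sits well below physical RAM, which is why data sets in the 1GB-1TB range were out of reach for a stock R installation of that era.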

Posted by: gappy | September 30, 2009 at 08:46