Comments on "The difference between Statistics and Machine Learning" (https://blog.revolutionanalytics.com/2009/09/the-difference-between-statistics-and-machine-learning/)
gappy (http://www.twitter.com/gappy3000) commented on 2009-09-30T15:46:32Z:
<p>The following post by John Langford illustrates some additional differences between the two fields: http://hunch.net/?p=318</p>
<p>But by far the best description of the differences between ML and Stats is by Leo Breiman in his Wald Lectures: http://www.stat.berkeley.edu/~breiman/ Breiman is in a unique position. Born a probabilist, he went into statistical consulting for a decade or so, and emerged at the intersection of ML and Statistics. </p>
<p>I see three major differences between the two fields.</p>
<p>1. ML focuses on finite-sample bounds; that's why Bayesian and PAC approaches are so popular in the literature. Conversely, a lot of Statistics works on "endless asymptotics" (Breiman again).</p>
<p>2. A majority of Statistics research works on structural models in a parametric framework: you *know* something about your problem and use custom tools for it, as in Econometrics or Industrial Statistics. As a result, Statistics suffers from overspecialization and inertia (many established areas are no longer so interesting, but still attract brains). ML, by contrast, has a few well-defined canonical problems in mind: supervised/unsupervised learning, active learning, and a few others.</p>
<p>3. ML explicitly includes computational issues in the analysis, whereas in Statistics this is an acquired taste.</p>
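<p>To make point 1 a bit more concrete, here is a rough sketch (not taken from Langford's or Breiman's material) comparing a finite-sample guarantee with an asymptotic one for the mean of n i.i.d. samples bounded in [0, 1]. The function names are made up for illustration. The finite-sample half-width comes from Hoeffding's inequality; the asymptotic one is the usual normal approximation, using the worst-case standard deviation of 0.5 as an assumption.</p>

```python
import math
from statistics import NormalDist


def hoeffding_half_width(n: int, delta: float) -> float:
    """Finite-sample bound (Hoeffding): for n i.i.d. samples in [0, 1],
    |sample mean - true mean| <= this value with probability >= 1 - delta,
    for every n, not only in the limit."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def clt_half_width(n: int, delta: float, sigma: float = 0.5) -> float:
    """Asymptotic normal approximation; justified only as n grows large.
    sigma = 0.5 is the worst case for a variable bounded in [0, 1] (assumed here)."""
    z = NormalDist().inv_cdf(1.0 - delta / 2.0)
    return z * sigma / math.sqrt(n)


if __name__ == "__main__":
    for n in (10, 100, 10_000):
        print(n, hoeffding_half_width(n, 0.05), clt_half_width(n, 0.05))
```

<p>At delta = 0.05 the Hoeffding interval is wider than the normal-approximation one for every n, which is the price of a guarantee that holds at any sample size rather than only asymptotically.</p>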
<p>The third point is relevant to R, since it was designed by computational statisticians, not computer scientists, and the initial design choices (weak typing, single-threading, an interpreter) are affecting its ability to keep up. I remember Brian Ripley asking for which problems R is inadequate because it can handle only small amounts of data. The answer is: a lot! Many real-world data sets are in the range 1GB-1TB, and some are bigger. The Netflix Prize is just one case in point, and this year's KDD Cup (large data set) wasn't suitable for R either. On the plus side, R has all the flexibility and versatility a statistician needs.</p>