By Mike Bowles
Ensemble methods are a backbone of modern machine learning. They can, however, be a daunting subject for someone approaching them for the first time, so we asked Mike Bowles, machine learning expert and serial entrepreneur, to provide some context.
Ensemble Methods are among the most powerful and easiest to use of predictive analytics algorithms, and the R programming language has an outstanding collection that includes the best performers – Random Forest, Gradient Boosting and Bagging – as well as big-data versions available through Revolution Analytics.
The phrase “Ensemble Methods” generally refers to building a large number of somewhat independent predictive models and then combining them by voting or averaging to yield very high performance; ensemble methods have been called crowdsourcing for machines. Bagging, Boosting and Random Forests all share the objective of improving performance beyond what’s achievable with a single binary decision tree, but they take different approaches to getting there.
Bagging and Random Forests were developed to overcome variance and stability issues with binary decision trees. The term “Bagging” was coined by the late Professor Leo Breiman of Berkeley. Professor Breiman was instrumental in the development of decision trees for statistical learning and recognized that training and averaging a multitude of trees on different random subsets of the data would reduce variance and improve stability. The term is a shortening of “Bootstrap Aggregating,” and the connection to bootstrap sampling is direct: each tree in the ensemble is trained on rows drawn at random, with replacement, from the training data.
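To make the mechanics concrete, here is a minimal bagging sketch in R built on rpart trees. It assumes a data frame train holding only predictor columns plus a factor response y (both names are placeholders for this illustration) and shows the two steps: grow each tree on a bootstrap sample of the rows, then combine the trees by majority vote.

    library(rpart)

    # Grow each tree on its own bootstrap sample (rows drawn with replacement)
    bagged_trees <- function(train, ntrees = 100) {
      lapply(seq_len(ntrees), function(i) {
        boot <- train[sample(nrow(train), replace = TRUE), ]
        rpart(y ~ ., data = boot)
      })
    }

    # Combine the trees by majority vote over their class predictions
    bagged_predict <- function(trees, newdata) {
      votes <- sapply(trees, function(t)
        as.character(predict(t, newdata, type = "class")))
      apply(votes, 1, function(v) names(which.max(table(v))))
    }

Packages such as ipred wrap this same recipe for you, but the whole idea really is this small.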
Tin Kam Ho of Bell Labs developed Random Decision Forests as an example of a random subspace method. The idea with Random Decision Forests was to train binary decision trees on random subsets of attributes (random subsets of columns of the training data). Breiman and Cutler’s Random Forests method combined random subsampling of rows (Bagging) with random subsampling of columns. The randomForest package in R is based on Breiman and Cutler’s original Fortran code, ported to R by Andy Liaw and Matthew Wiener.
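In R, both kinds of random subsampling are handled for you; mtry, the number of columns considered at each split, is the random-subspace knob. A quick sketch on the built-in iris data:

    library(randomForest)

    fit <- randomForest(Species ~ ., data = iris,
                        ntree = 500,        # number of trees grown and voted
                        mtry  = 2,          # columns sampled at each split
                        importance = TRUE)
    print(fit)        # includes the out-of-bag error estimate
    varImpPlot(fit)   # variable importance plot

The out-of-bag error printed above is a useful by-product of the row subsampling: each tree is tested on the rows its bootstrap sample left out.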
Boosting methods grew out of work on computational learning theory. The first algorithm of this type was called AdaBoost by its authors, Freund and Schapire. In the introduction to their paper they use the example of a group of friends who go to the race track regularly and bet on the horses. One of the friends decides to bet a fraction of his money alongside each of the others, adjusting the fractions based on results so that his performance over time approaches that of his most successful friend. The goal with Boosting is maximum predictive performance.
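The same multiplicative-weighting idea is the heart of AdaBoost itself: after each round, the examples the current weak learner got wrong are up-weighted, so the next learner concentrates on them. Here is a bare-bones sketch in R using one-split rpart “stumps” as the weak learner (the function name and details are illustrative, not code from the paper):

    library(rpart)

    # Bare-bones AdaBoost sketch for labels y in {-1, +1}; X holds only
    # predictor columns. Illustrative, not the original implementation.
    adaboost_sketch <- function(X, y, rounds = 50) {
      n <- nrow(X)
      w <- rep(1 / n, n)                       # start with uniform weights
      df <- data.frame(X, y = factor(y))
      stumps <- list(); alpha <- numeric(0)
      for (m in seq_len(rounds)) {
        fit <- rpart(y ~ ., data = df, weights = w,
                     control = rpart.control(maxdepth = 1))  # a "stump"
        pred <- ifelse(predict(fit, df)[, "1"] > 0.5, 1, -1)
        err <- max(sum(w * (pred != y)), 1e-12) # weighted training error
        if (err >= 0.5) break                   # no better than chance: stop
        a <- 0.5 * log((1 - err) / err)         # this stump's say in the vote
        w <- w * exp(-a * y * pred)             # up-weight the misclassified
        w <- w / sum(w)
        stumps[[length(stumps) + 1]] <- fit
        alpha <- c(alpha, a)
      }
      list(stumps = stumps, alpha = alpha)
    }

The final classifier is the sign of the alpha-weighted vote over the stumps; the alpha values play the same role as the betting fractions in the story above.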
AdaBoost stood for a long time as the best example of a black-box algorithm: a practitioner could apply it without much parameter tweaking and it would yield superior performance while almost never overfitting. It was a little mysterious. In some of Professor Breiman’s papers on Random Forests, he compares performance with AdaBoost. Professor Jerome Friedman and his Stanford colleagues Professors Hastie and Tibshirani authored a paper in 2000 that attempted to understand why AdaBoost was so successful. The paper caused a storm of controversy; the published comments were longer than the paper itself. Most of the comments centered on whether boosting was just another way of reducing variance or was doing something different by focusing on error reduction. Professor Friedman offered several arguments and examples to demonstrate that boosting is more than just another variance-reduction technique, but the commenters did not reach a consensus.
The understanding that Professor Friedman and his colleagues developed from analyzing AdaBoost led him to formulate the boosting method more directly, which in turn produced a number of valuable extensions and improvements beyond AdaBoost: the ability to handle regression and multiclass problems, performance measures other than squared error, and so on. These features (and new ones still being developed) are all included in the excellent R package gbm by Greg Ridgeway.
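Those extensions show up directly in gbm’s interface: the distribution argument selects the loss function, covering squared-error regression, binary and multiclass classification, and more. A minimal sketch on the built-in mtcars data:

    library(gbm)

    fit <- gbm(mpg ~ ., data = mtcars,
               distribution = "gaussian",   # squared-error regression
               n.trees = 1000,              # boosting iterations
               shrinkage = 0.01,            # learning rate
               interaction.depth = 3,       # depth of each tree
               cv.folds = 5)                # cross-validate to pick n.trees

    best  <- gbm.perf(fit, method = "cv")   # iteration with lowest CV error
    preds <- predict(fit, mtcars, n.trees = best)

Swapping in distribution = "bernoulli" gives binary classification, and "adaboost" recovers AdaBoost’s exponential loss, without changing anything else in the call.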
Today, ensemble methods form the backbone of many data science applications. Random Forests has become particularly popular with modelers competing in Kaggle competitions, and according to Google Trends it has surpassed AdaBoost in popularity.
In a future post we will explore several of these algorithms in R.
References
Breiman, Leo. Bagging Predictors. Technical Report No. 421, Department of Statistics, University of California, Berkeley, September 1994.
Ho, Tin Kam. Random Decision Forests. Proceedings of the Third International Conference on Document Analysis and Recognition, pp. 278-282, 1995.
Breiman, Leo. Random Forests – Random Features. Technical Report No. 567, Department of Statistics, University of California, Berkeley, September 1999.
Freund, Yoav and Schapire, Robert E. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences, Vol. 55, No. 1, pp. 119-139, 1997.
Friedman, Jerome, Hastie, Trevor and Tibshirani, Robert. Additive Logistic Regression: A Statistical View of Boosting. Annals of Statistics, Vol. 28, No. 2, pp. 337-407, 2000.
Great post. Awaiting the exploration of these algorithms!
Posted by: Ankur | March 25, 2014 at 23:35
"The idea with Random Decision Forests was to train binary decision trees on random subsets of attributes (random subsets of columns of the training data). Breiman and Cutler’s Random Forests method combined random subsampling of rows (Bagging) with random subsampling of columns."
I think that the most important idea behind Random Forest is that by sampling columns you decorrelate the trees, producing a better committee classifier. In your text this idea isn't so clear. And after all, Random Forest is just bagging with an extra heuristic.
Posted by: Flavio | March 26, 2014 at 07:17
Octopus CIP (http://www.octopuscip.comcastbiz.net) supports the same approach to optimize data analysis. Models are the foundation of the Octopus Cloud Interactive Platform (CIP), and they can be executed in parallel. We call our approach (n + 1)-models: n models run simultaneously to perform data analysis and one (+1) monitors the performance of those n models. The final decision (result) is the result of the best-performing model, or a weighted average of the results of all n models. The weight of each model is recalculated based on its performance at each iteration of an analysis.
Posted by: Alex Mylnikov | March 26, 2014 at 12:15