by Joseph Rickert

The gradient boosting machine as developed by Friedman, Hastie, Tibshirani and others, has become an extremely successful algorithm for dealing with both classification and regression problems and is now an essential feature of any machine learning toolbox. R’s gbm() function (gbm package) is a particularly well crafted implementation of the gradient boosting machine that served as a model for the rxBTrees() function which was released last year as a feature of Revolution R Enterprise 7.3. You can think of rxBTRees(), as a scaled up version gbm() that is designed to work with massive data sets in distributed environments.

The basic idea underlying the gbm is that weak learners may be combined in an additive, iterative fashion to produce highly accurate classifiers that are resistant to overfitting. While in theory any weak learner can be incorporated into a gbm, classification and regression trees, or decision trees as they are called in RRE, are particularly convenient. As Kuhn and Johnson (p 205) point out:

- They have the flexibility to be weak learners by simply restricting their depth
- Separate trees can be easily added together
- Trees can be generated very quickly

rxBTrees() scales gbm() by building the boosting algorithm around RRE’s rxDTree() decision tree implementation. As I described in a previous post, this algorithm scales to work with large data sets by building trees on histograms, a summary of the data, and not on the data itself. rxBTrees(), like rxDTree() is also a proper parallel external memory algorithm (PEMA) This means two things:

- It uses a “chunking” strategy to work with a block of data at a time that has been read from disk. Since there is no need to keep all of the data in memory at one time, rxBtrees() can build a gbm on arbitrarily large files.
- PEMA’s are parallel algorithms built on top of RRE’s distributed computing infrastructure. This allows them to take full advantage of the opportunities for parallelism implicit in the underlying hardware by setting up a “compute context”. In the example below, the compute context is set to local parallel to take advantage of the cores on my PC. RRE also allows establishing compute contexts that enable PEMAs to run in parallel on clusters.

The following function fits a rxBTrees() model to a training data subset of the a mortgage data set. rxBTrees() contains many input parameters. Some are common to all RevoScaleR PEMAs, others are inherited from rxDTree(), and still others such as nTree and learningRate have direct counterparts in R's gbm() function. In this function call, I have explicitly called out parameters that pertain to fitting a gbm model

# Fit a Boosted Trees Mode on the training data set form <- formula(default ~ houseAge + creditScore + yearsEmploy + ccDebt) BT_model <- rxBTrees(formula = form, data= "mdTrain", lossFunction = "bernoulli", minSplit = NULL, # Min num of obs at node before a split is attempted minBucket = NULL, # Min num of obs at a terminal node maxDepth = 1, # 2^maxDepth = number of leaves on tree cp = 0, # Complexity parameter maxSurrogate = 0, # Max surrogate splits retained, 0 improves perf useSurrogate = 2, # Determines how surrogates are used surrogateStyle = 0, # Sets surrogate algorithm nTree = 100, # Num of trees = num of boosting iterations mTry = NULL, # Num vars to sample as a split candidate sampRate = NULL, # % of observations to sample for each tree # default = 0.632 importance = TRUE, # Importance of predictors to be assessed learningRate = 0.1, # Also known as shrinkage maxNumBins = NULL, # Max num bins in histograms # default: min(1001,max(sqrt(num of obs))) verbose = 1, # Amount of status sent ot console computeContext="RxLocalParallel", # use all cores available in PC overwrite = TRUE) # Allow xdf output file to be overwritten

Some especially noteworthy parameters are maxDepth, maxNumBins and computeContext. maxDepth accomplishes the same purpose as interaction.depth in the gbm() function. However, they are not the same parmeter. Setting interaction.depth = 1 produces stump trees with two leaves each because the number of leaves = interaction.depth + 1. In rxBTRees(), stumps are produced by setting maxDepth = 1. But, this is because 2^{maxDepth} gives the number of leaves.

MaxNumBins determines the number of bins to be used in making histograms of the data. If anyone ever wanted to compare gbm() directly with rxBTrees() on a small data set for example, setting MaxNumBins to the number of points in the data set would mimic gbm(). Note, however, that this would saddle rxBTrees() with all of the overhead of computing histograms while receiving none of the benefits.

The computeContext parameter is the mechanism for connecting a PEMA to the underlying hardware and that enables functions such as rxBTrees() to run in parallel on a various clusters using distributed data. Some of the details of how this happens have been described in a previous post, but I hope to be able to follow up this post with an example of rxBTrees() running on Hadoop in the not too distant future.

The data set used to build the model above is available on Revolution Anaytics' data set web page both as an xdf file and as multiple csv files. The following command shows what the data looks like.

> rxGetInfo(mortData,getVarInfo=TRUE,numRows=3)

File name: C:\DATA\Mortgage Data\mortDefault\mortDefault2.xdf Number of observations: 1e+07 Number of variables: 6 Number of blocks: 20 Compression type: none Variable information: Var 1: creditScore, Type: integer, Low/High: (432, 955) Var 2: houseAge, Type: integer, Low/High: (0, 40) Var 3: yearsEmploy, Type: integer, Low/High: (0, 15) Var 4: ccDebt, Type: integer, Low/High: (0, 15566) Var 5: year, Type: integer, Low/High: (2000, 2009) Var 6: default, Type: integer, Low/High: (0, 1) Data (3 rows starting with row 1): creditScore houseAge yearsEmploy ccDebt year default 1 615 10 5 2818 2000 0 2 780 34 5 3575 2000 0 3 735 12 1 3184 2000 0

In building the model, I divided the mortgage default data randomly into a training and test data sets. Then, I fit an rxBTrees() model on the training data and used the test data to compute the ROC curve shown below.

Along the way, the model also produced the out of bag error versus number of trees plot which shows that for this data set one could probably get by with only building 70 or 80 trees.

The code I used to build the model is available here: Download Post_model. If you do not have Revolution R Enterprise available to you at your company, you can try out this model for yourself in the AWS cloud by taking advantage of Revolution Analytics' 14 day free trial.

Take the first step towards building a gbm on big data.