by Joseph Rickert

Last July, I blogged about rxDTree(), the RevoScaleR function for building classification and regression trees on very large data sets. As I explained then, this function is an implementation of the algorithm introduced by Ben-Haim and Yom-Tov in their 2010 paper, which builds trees on histograms of the data rather than on the raw data itself. The algorithm is designed for parallel and distributed computing. Consequently, rxDTree() performs best when it runs on a cluster: either a Microsoft HPC cluster or a Linux LSF cluster.
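To make the histogram idea concrete, here is a minimal sketch in plain Python of the kind of fixed-size streaming histogram the Ben-Haim and Yom-Tov paper describes: each new value becomes its own bin, and whenever there are too many bins the two closest ones are merged into their weighted centroid. Histograms built on separate chunks of data can be combined the same way, which is what makes the approach suit parallel workers. This is not rxDTree()'s code; the function names, the bin limit, and the toy credit scores are all assumptions made for illustration.

```python
from bisect import insort

def shrink(hist, max_bins):
    """Merge the two closest neighbouring bins until only max_bins remain."""
    while len(hist) > max_bins:
        gaps = [hist[i + 1][0] - hist[i][0] for i in range(len(hist) - 1)]
        i = gaps.index(min(gaps))
        (p1, m1), (p2, m2) = hist[i], hist[i + 1]
        # Replace the pair with one bin at the count-weighted centroid.
        hist[i:i + 2] = [[(p1 * m1 + p2 * m2) / (m1 + m2), m1 + m2]]

def update(hist, point, max_bins):
    """Add one value to a histogram kept as a sorted list of [centroid, count]."""
    insort(hist, [float(point), 1])
    shrink(hist, max_bins)

def merge(h1, h2, max_bins):
    """Combine two histograms that were built on separate chunks of data."""
    combined = sorted([list(b) for b in h1 + h2])
    shrink(combined, max_bins)
    return combined

# Two hypothetical workers each summarise their own chunk (toy values).
chunk_a = [615, 780, 735, 690, 702, 810]
chunk_b = [640, 655, 799, 720, 560, 745]
hist_a, hist_b = [], []
for x in chunk_a:
    update(hist_a, x, max_bins=4)
for x in chunk_b:
    update(hist_b, x, max_bins=4)

merged = merge(hist_a, hist_b, max_bins=4)
total = sum(count for _, count in merged)
print(merged, total)
```

The merged summary never holds more than four bins regardless of how many rows were seen, yet it keeps the total count and the overall mean of the data, so candidate split points can be estimated from it without revisiting the raw rows.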

rxDForest() (new with Revolution R Enterprise 7.0) uses rxDTree() to take the next logical step and implement a random forest type algorithm for building both classification and regression forests. Each tree of the ensemble constructed by rxDForest() is built with a bootstrap sample that uses about 2/3 of the original data. The data not used in building a particular tree is used to make predictions with that tree: each point of the original data set is fed through all of the trees that were built without it. The decision forest prediction for that data point is an aggregate of the individual tree predictions. (For classification problems the prediction is a majority vote; for regression problems it is the mean of the predictions.)
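The "about 2/3" figure and the two aggregation rules can be illustrated with a short sketch in plain Python. It assumes the bootstrap sample is drawn with replacement and has the same size as the data, in which case roughly 1 - 1/e (about 63%) of the rows are expected to appear in it and the rest are left out for that tree. The function names and the per-tree predictions are made up for the example; this is not how rxDForest() is implemented internally.

```python
import random
from collections import Counter
from statistics import mean

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement; rows never drawn are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag

def aggregate(predictions, kind):
    """Combine the per-tree predictions for a single data point."""
    if kind == "classification":
        return Counter(predictions).most_common(1)[0][0]  # majority vote
    return mean(predictions)                              # regression: average

rng = random.Random(42)
n = 10_000
in_bag, out_of_bag = bootstrap_indices(n, rng)
unique_fraction = len(set(in_bag)) / n  # share of distinct rows this tree sees

# Hypothetical predictions from the trees that did not see a given row.
vote = aggregate([0, 1, 0, 0, 1], "classification")
avg = aggregate([0.10, 0.30, 0.20], "regression")
print(unique_fraction, len(out_of_bag), vote, avg)
```

Because every row is out-of-bag for some of the trees, each row can be scored using only trees that never saw it, which gives an honest error estimate without a separate hold-out set.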

Only a couple of parameters need to be set to fit a decision forest: nTree specifies the number of trees to grow, and mTry specifies the number of variables to sample as split candidates at each tree node. Of course, many more parameters can be set to control the algorithm, including the parameters that control the underlying rxDTree() algorithm.
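As a rough picture of what these two parameters control, the sketch below (plain Python, not RevoScaleR) draws a random subset of the predictor names for the root node of each of a few trees. The names n_tree and m_try are stand-ins for nTree and mTry, and the values 3 and 2 are chosen only for illustration.

```python
import random

def split_candidates(variables, m_try, rng):
    """Pick m_try distinct variables to evaluate as split candidates at one node."""
    return rng.sample(variables, m_try)

variables = ["creditScore", "houseAge", "yearsEmploy", "ccDebt", "year"]
rng = random.Random(0)
n_tree, m_try = 3, 2  # illustrative stand-ins for nTree and mTry

root_candidates = []
for tree in range(n_tree):
    # In a real forest this draw is repeated at every node of every tree.
    candidates = split_candidates(variables, m_try, rng)
    root_candidates.append(candidates)
    print(f"tree {tree}: root split considers {candidates}")
```

Restricting each node to a random subset of variables makes the trees less alike, which is what lets averaging over many of them reduce variance.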

The following is a small example of the rxDForest() function using the mortgage default data set that can be downloaded from Revolution Analytics' website. Here are the first three lines of data.

  creditScore houseAge yearsEmploy ccDebt year default
1         615       10           5   2818 2000       0
2         780       34           5   3575 2000       0
3         735       12           1   3184 2000       0

The idea is to see if the variables creditScore, houseAge, etc. are useful in predicting a default. The RevoScaleR R code in the downloadable file RxDForest reads in the mortgage data, splits the data into a training file and a test file, uses rxDTree() to build a single tree (just to see what one looks like for this file) and plots the tree. Then rxDForest() is run against the training file to build an ensemble model, and this model is run against the test file to make predictions. Finally, the code plots the ROC curve for the decision forest ensemble model.

Here is what the first few nodes of the tree look like. (The full tree is printed at the bottom of the code in the file above.)

Call:
rxDTree(formula = form1, data = "mdTrain", maxDepth = 5)
File: C:\Users\Joe.Rickert\Documents\Revolution\RevoScaleR\mdTrain.xdf
Number of valid observations: 8000290
Number of missing observations: 0

Tree representation:
n= 8000290

node), split, n, deviance, yval
      * denotes terminal node

 1) root 8000290 39472.30000 4.958445e-03
   2) ccDebt< 9085.5 7840182 21402.25000 2.737309e-03
     4) ccDebt< 7844 7384170 8809.46500 1.194447e-03
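One way to read these columns: yval is the mean of default among the rows in a node, that is, the fraction of mortgages that defaulted, and deviance looks like the sum of squared deviations from that mean. For a response coded 0/1 (as in the rows shown earlier) that sum reduces to n * p * (1 - p). The small Python check below recomputes it from the printed n and yval under that assumption; the helper name is hypothetical and the check is not part of the original code file.

```python
nodes = [
    # (node id, n, reported deviance, reported yval)
    (1, 8000290, 39472.30, 4.958445e-03),
    (2, 7840182, 21402.25, 2.737309e-03),
    (4, 7384170, 8809.465, 1.194447e-03),
]

def sse_binary(n, mean_y):
    """Sum of squared deviations from the mean for n values that are all 0 or 1."""
    return n * mean_y * (1.0 - mean_y)

recomputed = {node: sse_binary(n, yval) for node, n, _, yval in nodes}
for node, n, deviance, yval in nodes:
    print(node, round(recomputed[node], 2), deviance, f"default rate {yval:.2%}")
```

The recomputed values agree with the printed deviances to within rounding, and the yval column shows the default rate falling from about 0.5% at the root to about 0.1% for mortgages with ccDebt below 7844.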

Here is a plot of the right part of the tree drawn with RevoScaleR's createTreeView() function, which enables plot() to put the graph in your browser.

And, finally, here is the ROC curve for the decision forest model. (The text output describing the model is also in the file containing the code.)
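For readers who want to see how such a curve is put together, the following plain Python sketch computes ROC points and the area under the curve from a handful of labels and predicted scores. The data are invented for the example and the function names are hypothetical; the original analysis uses RevoScaleR's own tools on the test file rather than this code.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last of a run of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Toy test set: 1 = default, 0 = no default, with made-up predicted scores.
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.05, 0.10, 0.35, 0.40, 0.60, 0.70, 0.20, 0.90]
curve = roc_points(labels, scores)
area = auc(curve)
print(curve, area)
```

Each point corresponds to one cutoff on the predicted score: lowering the cutoff catches more true defaults but also flags more good loans, and the area summarises how well the scores rank defaults above non-defaults.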

I plan to try rxDForest() out on a cluster with a bigger data set. When I do, I will let you know.