by Błażej Moska, computer science student and data science intern
One of the most important things in predictive modelling is how our algorithm copes with various datasets, both training and testing (previously unseen). This is closely connected with the concept of the bias-variance tradeoff.
Roughly speaking, the variance of an estimator describes how much the estimator's value varies from dataset to dataset. It is defined as follows:
\[ \textrm{Var}[ \widehat{f} (x) ]=E[(\widehat{f} (x)-E[\widehat{f} (x)])^{2} ] \]
\[ \textrm{Var}[ \widehat{f} (x)]=E[\widehat{f} (x)^2]-E[\widehat{f} (x)]^2 \]
Bias is defined as follows:
\[ \textrm{Bias}[ \widehat{f} (x)]=E[\widehat{f}(x)-f(x)]=E[\widehat{f}(x)]-f(x) \]
Bias can be thought of as a measure of how well the model is able to approximate the true function: the higher the bias, the worse the approximation. Typically, reducing bias results in increased variance and vice versa.
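The tradeoff is often made precise through the decomposition of the expected squared error. The following is a sketch under an assumption not stated in this article, namely that observations are generated as \(y=f(x)+\varepsilon\), where \(\varepsilon\) is zero-mean noise with variance \(\sigma^{2}\), independent of the estimator:

\[ E[(y-\widehat{f}(x))^{2}]=\textrm{Bias}[\widehat{f}(x)]^{2}+\textrm{Var}[\widehat{f}(x)]+\sigma^{2} \]

Under that assumption, the noise term \(\sigma^{2}\) cannot be reduced by the choice of model, so lowering the expected error means trading bias against variance.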
\(E[X]\) denotes an expected value, which can be estimated with a sample mean, since the mean is an unbiased estimator of the expected value.
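To make the two variance formulas concrete, here is a minimal sketch in Python (standard library only, not the R code used for this article). It replaces each expectation with a sample mean and checks that both forms give the same number. The predictions are made-up values drawn from a normal distribution purely for illustration.

```python
import random
from statistics import fmean

random.seed(0)

# Hypothetical predictions of one model at a single point x,
# one value per training dataset (illustrative numbers only).
preds = [random.gauss(2.0, 0.5) for _ in range(1000)]

# Sample mean as an estimate of E[f_hat(x)]
mean_pred = fmean(preds)

# Form 1: E[(f_hat(x) - E[f_hat(x)])^2]
var_centered = fmean((p - mean_pred) ** 2 for p in preds)

# Form 2: E[f_hat(x)^2] - E[f_hat(x)]^2
var_moments = fmean(p * p for p in preds) - mean_pred ** 2

print(var_centered, var_moments)
```

Both estimates divide by the number of values rather than by that number minus one, so they agree with each other up to floating-point rounding but differ slightly from the usual unbiased sample variance.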
We can estimate variance and bias by bootstrapping the original training dataset, that is, by sampling the row indexes of the original data frame with replacement, then drawing the rows that correspond to these indexes to obtain new data frames. This operation was repeated nsampl times, where nsampl is the parameter describing the number of bootstrap samples.
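The resampling step can be sketched as follows, again in plain Python rather than R. The helper names bootstrap_indexes and bootstrap_samples and the small rows table are hypothetical, introduced only for this illustration. The name nsampl follows the text.

```python
import random

def bootstrap_indexes(n_rows, rng):
    """Draw n_rows row indexes with replacement from 0..n_rows-1."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def bootstrap_samples(rows, nsampl, seed=0):
    """Build nsampl bootstrap datasets, each the same size as rows."""
    rng = random.Random(seed)
    samples = []
    for _ in range(nsampl):
        idx = bootstrap_indexes(len(rows), rng)
        # Pick the rows that correspond to the sampled indexes
        samples.append([rows[i] for i in idx])
    return samples

# Illustrative dataset: (x, y) pairs
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1), (5.0, 9.8)]
samples = bootstrap_samples(rows, nsampl=3)
for s in samples:
    print(s)
```

Because sampling is done with replacement, a given row may appear several times in one bootstrap dataset and not at all in another.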
Variance and bias are estimated for one value at a time, that is to say, for one observation/row of the original dataset (we calculate variance and bias over the rows of predictions made on the bootstrap samples). We then obtain a vector of variances and a vector of biases, each of the same length as the number of observations in the original dataset. For the purpose of this article, a mean value was calculated for each of these two vectors. We will treat these two means as our estimates of mean bias and mean variance. If we don't want to measure the direction of the bias, we can take absolute values of the bias.
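As a rough illustration of this aggregation, the Python sketch below takes a matrix of predictions (one row per bootstrap model, one column per observation) and computes the per-observation bias and variance, then their means. The function name and the small numbers are hypothetical. As in the text, the observed values of the original dataset stand in for the true function values.

```python
from statistics import fmean

def bias_variance_per_row(pred_matrix, y_true):
    """pred_matrix[b][i] is the prediction for observation i
    from the model fitted on bootstrap sample b."""
    biases, variances = [], []
    for i, y in enumerate(y_true):
        col = [preds[i] for preds in pred_matrix]
        m = fmean(col)                      # estimate of E[f_hat(x_i)]
        biases.append(m - y)                # bias for observation i
        variances.append(fmean((p - m) ** 2 for p in col))
    return biases, variances

# Three bootstrap models, three observations (made-up numbers)
pred_matrix = [
    [1.0, 2.0, 3.0],
    [1.2, 1.8, 3.4],
    [0.8, 2.2, 3.2],
]
y_true = [1.0, 2.5, 3.0]

biases, variances = bias_variance_per_row(pred_matrix, y_true)
mean_bias = fmean(biases)
mean_abs_bias = fmean(abs(b) for b in biases)
mean_variance = fmean(variances)
print(mean_bias, mean_abs_bias, mean_variance)
```

Note that positive and negative biases partly cancel in mean_bias but not in mean_abs_bias, which is why the absolute version is useful when the direction of the bias does not matter.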
Because bias and variance can be controlled by parameters passed to the rpart function, we can also examine how these parameters affect tree variance. The most commonly used parameters are cp (complexity parameter), which describes how much each split must decrease the overall variance of the decision variable in order to be attempted, and minsplit, which defines the minimum number of observations needed to attempt a split.
The operations described above are computationally expensive: we need to create nsampl bootstrap samples, grow nsampl trees, calculate nsampl predictions, nrow variances and nrow biases, and repeat those operations for each parameter value (the length of the vector cp or minsplit). For that reason the foreach package was used to take advantage of parallelism. The procedure still can't be considered fast, but it was much faster than without the foreach package.
So, summing up, the procedure looks as follows:
- Create bootstrap samples (by bootstrapping original dataset)
- Train model on each of these bootstrap datasets
- Calculate mean of predictions of these trees (for each observation) and compare these predictions with values of the original dataset (in other words, calculate bias for each row)
- Calculate variance of predictions for each row (estimate variance of the estimator, here a regression tree)
- Calculate mean bias/absolute bias and mean variance
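The steps above can be sketched end to end in a few dozen lines. The code below is an illustration in plain Python, not the rpart and foreach code used for this article. It fits a tiny one-dimensional regression tree written from scratch, whose minsplit argument only loosely mimics the rpart parameter of the same name. It uses synthetic data (a sine curve plus Gaussian noise, an assumption made for the example) and runs sequentially rather than in parallel.

```python
import math
import random
from statistics import fmean

def sse(points):
    """Sum of squared deviations of y from its mean."""
    m = fmean(y for _, y in points)
    return sum((y - m) ** 2 for _, y in points)

def fit_tree(points, minsplit):
    """Tiny 1-D regression tree. A node is split only if it holds
    at least minsplit points and more than one distinct x."""
    xs = sorted({x for x, _ in points})
    if len(points) < minsplit or len(xs) < 2:
        return ("leaf", fmean(y for _, y in points))
    best = None
    for lo, hi in zip(xs, xs[1:]):
        cut = (lo + hi) / 2
        left = [p for p in points if p[0] <= cut]
        right = [p for p in points if p[0] > cut]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, cut, left, right)
    _, cut, left, right = best
    return ("split", cut, fit_tree(left, minsplit), fit_tree(right, minsplit))

def predict(tree, x):
    while tree[0] == "split":
        _, cut, left, right = tree
        tree = left if x <= cut else right
    return tree[1]

def bias_variance(data, minsplit, nsampl, seed=0):
    """Return (mean absolute bias, mean variance) over all rows."""
    rng = random.Random(seed)
    n = len(data)
    preds = []  # preds[b][i]: prediction for row i from bootstrap tree b
    for _ in range(nsampl):
        sample = [data[rng.randrange(n)] for _ in range(n)]  # bootstrap
        tree = fit_tree(sample, minsplit)                    # train
        preds.append([predict(tree, x) for x, _ in data])    # predict
    biases, variances = [], []
    for i, (_, y) in enumerate(data):
        col = [p[i] for p in preds]
        m = fmean(col)
        biases.append(m - y)
        variances.append(fmean((v - m) ** 2 for v in col))
    return fmean(abs(b) for b in biases), fmean(variances)

# Synthetic data (assumption for illustration): y = sin(x) + noise
noise = random.Random(1)
data = [(i / 5, math.sin(i / 5) + noise.gauss(0, 0.3)) for i in range(30)]

results = {}
for minsplit in (2, 8, 30):
    results[minsplit] = bias_variance(data, minsplit, nsampl=30)
    print(minsplit, results[minsplit])
```

With a small minsplit the tree can follow individual observations closely, so one would typically expect a lower absolute bias and a higher variance than with a large minsplit, although the exact numbers depend on the data and the random seed.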
Hi, I'm David, a data scientist from Spain. I'm starting to test lightgbm in my models at work.
I modified that script to use Microsoft's lightgbm and to test three different training parameters against the bias-variance tradeoff: nrounds, num_leaves and feature_fraction. I also added cv to choose the best nrounds with the specific parameters of each sample. Finally, I added an mse calculation too.
Please, feel free to tell me if you see any bug or any kind of problem about this script.
I'll continue to improve this code in the future if I have time.
Regards,
David.
https://gist.github.com/demillan/0b7754edcacf5020b400b5c8e0ed10a3
Posted by: David E. Millán | November 16, 2017 at 02:52
Thanks for sharing the code, David!
Posted by: David Smith | November 16, 2017 at 08:55