by Thomas Dinsmore
This is the third in a series of posts highlighting new features in Revolution R Enterprise Release 6.2, which is scheduled for General Availability on April 22. This week's post features our new Stepwise Regression capability.
The Stepwise process starts with a specified model, then sequentially adds to it, or removes from it, the variable that most improves the fit according to a selection criterion, stopping when no further improvement is possible or a specified model boundary is reached. By automating the selection of candidate features for a predictive model, Stepwise Regression significantly accelerates model building.
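To make the process concrete, here is a minimal sketch of bidirectional stepwise selection in open source R, using the base stats::step() function with the AIC criterion on the built-in mtcars data. This is an illustration of the general technique only; it does not show Revolution R Enterprise's own Stepwise API.

```r
# Start from an intercept-only model and search between it and the
# full model, adding or dropping one variable per step to minimize AIC.
null <- lm(mpg ~ 1, data = mtcars)   # lower bound: intercept only
full <- lm(mpg ~ ., data = mtcars)   # upper bound: all candidate variables

fit <- step(null,
            scope     = list(lower = formula(null), upper = formula(full)),
            direction = "both",      # bidirectional search
            trace     = 0)           # suppress the per-step log

print(formula(fit))  # the selected model
```

Swapping `direction = "forward"` or `"backward"` gives the other two search strategies listed below.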
One of our customers, for example, builds more than a thousand models every week for targeted marketing. At that scale of activity, traditional model-fitting techniques are simply too slow. Starting with a feature set of more than 500 candidate variables, this customer runs fast feature selection techniques to reduce the number of variables, then runs Stepwise Regression to finalize the model.
In designing the Stepwise Regression capability, we relied on customer feedback and also reviewed similar capabilities in open source R, such as stepAIC() in the MASS package. Because many of our customers are actively converting from SAS, we looked at the Stepwise capabilities in SAS as well.
In Release 6.2, we support the following Stepwise methods for Linear Regression:
- Forward selection
- Backward elimination
- Bidirectional search
We support three different user-specifiable selection criteria:
- AIC
- BIC
- Mallows' Cp
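For reference, the three criteria trade off goodness of fit against model size in different ways. Using the standard textbook definitions (with $k$ estimated parameters, $n$ observations, maximized likelihood $\hat{L}$, and, for Mallows' $C_p$, a candidate model with $p$ parameters and an estimate $\hat{\sigma}^2$ of the error variance from the full model — the exact scaling used in the product may differ):

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
\qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}
\qquad
C_p = \frac{\mathrm{SSE}_p}{\hat{\sigma}^2} - n + 2p
```

BIC's $\ln n$ penalty grows with the sample size, so it tends to select smaller models than AIC on large data.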
Coming up later this year in Release 7.0, we plan to extend the Stepwise capabilities to Logistic Regression and Generalized Linear Models.
Your comments and suggestions are welcome. If you use Stepwise Regression and would like a feature you don't see mentioned in this post, let us know in the Comments section below.
Is this stepwise regression available for the RevoScaleR class of datasets (Big Data)? Will it be available for parallel computation?
An example/demonstration, as and when one is available, would be nice. Stepwise regression has computational advantages, but some theorists discourage it on grounds of reliability.
How about other ensemble methods and data mining techniques for Big Data?
Posted by: Ajay Ohri | March 26, 2013 at 09:07
Thanks for the comment! Stepwise Regression will be a feature of ScaleR, our scalable package for Big Data.
Good idea about a demonstration. Agree with you that Stepwise is not universally accepted (among theorists or working analysts). In light of the demand for it from our existing customers, we think it best to support it and let users decide.
We have ensemble methods and other advanced techniques (such as random forests) in our roadmap for future releases. I'll write about them in this blog as we refine our release schedule.
Posted by: Thomas W Dinsmore | March 26, 2013 at 10:09
Yeah, in my experience, stepwise tends to overfit massively. It's OK if you use a hold-out set, but even then I find that lasso or ridge will give much more stable and useful solutions. You tend not to get a p-value, but that's part of the charm for me, at least.
Posted by: disgruntledphd | March 26, 2013 at 11:37