
December 03, 2015



Hi Joseph,

Thank you very much for this interesting article. It is useful on several levels, both as an introduction to GAs and as a demonstration of caret's feature selection methods.
While reading I wondered a) whether the number of generations is high enough and b) whether a subset size of 20 might be too big.
Random forests, for example, default to a subset size of sqrt(variables) for classification, so using ~50% of all variables might be a bit too much. Perhaps the variable importances from the internal random forests would shed more light here. Is anything known about the 'optimal' feature subset for this data set?
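One way to sanity-check the subset size is to fit a single random forest on the full predictor set and look at the sorted variable importances, as suggested above. A minimal R sketch, assuming a predictor matrix `X` and outcome factor `y` (both placeholders, not from the post):

```r
library(randomForest)

set.seed(42)
rf <- randomForest(x = X, y = y, importance = TRUE)

## Permutation importance (mean decrease in accuracy), sorted;
## a sharp drop-off suggests far fewer than 20 predictors may suffice.
imp <- importance(rf, type = 1)
head(imp[order(imp, decreasing = TRUE), , drop = FALSE], 10)

## For comparison, randomForest's default mtry for classification:
floor(sqrt(ncol(X)))
```

Note that `mtry = sqrt(p)` governs how many candidate variables are tried at each split, which is related to but not the same thing as the size of a selected feature subset.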

Holger Fröhlich did nearly the same in his thesis, which might be interesting for you:



Thanks for the GA part, but I am afraid that you are applying the whole pipeline incorrectly: feature selection (using all data), followed by model assessment with the selected features (again using all data). This is well known to lead to selection bias; please see more details here: http://www.nodalpoint.com/not-perform-feature-selection/
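The standard remedy for the selection bias described above is to run the feature search inside each resampling fold rather than once on the full data. caret's `gafs()` does exactly this with an external resampling loop around the GA. A hedged sketch, again assuming placeholder `X` and `y`; the iteration and fold counts are illustrative, not tuned values:

```r
library(caret)

## External 10-fold CV wrapped around the GA search,
## with random forests as the fitness/model functions.
ctrl <- gafsControl(functions = rfGA,
                    method    = "cv",
                    number    = 10)

set.seed(42)
ga_fit <- gafs(x = X, y = y,
               iters = 20,          # GA generations per resample
               gafsControl = ctrl)

## Performance is estimated on the held-out external folds,
## so it is not inflated by the feature search itself.
ga_fit
```

Because the GA is repeated within every fold, this is much more expensive than a single search, but the resulting performance estimate is honest.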

