December 03, 2015


Hi Joseph,

Thank you very much for this interesting article. It seems useful on multiple levels, for example as an introduction to GAs or to the use of caret's feature selection methods.
While reading, I was wondering a) whether the number of generations is high enough and b) whether a subset size of 20 might be too big.
Random forest, for example, starts by default with a subset size of sqrt(number of variables) for classification. Using ~50% of all variables might be a bit too much. Maybe the variable importance from the internal random forests would shed more light here. Is anything known about the 'optimal' subset of features for this data set?

Holger Fröhlich did nearly the same in his thesis, which might be interesting for you:



Thanks for the GA part, but I am afraid that you are applying the whole pipeline in an incorrect way - namely "feature selection (using all data)" => "model assessment using the selected features (again using all data)". This is well known to lead to selection bias - please see more details here: http://www.nodalpoint.com/not-perform-feature-selection/
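
To make the concern concrete, here is a minimal sketch of the effect in Python (standard library only). It is not the article's caret/GA pipeline: the data are pure noise, the selector is a simple class-mean-difference filter, the classifier is a nearest-centroid rule, and all function names and sizes (40 samples, 1000 features, 10 selected) are my own assumptions for illustration. Because no feature is truly informative, an honest estimate should sit near chance level, yet selecting features on all data before cross-validation typically reports a much higher accuracy.

```python
import random
import statistics


def make_noise_data(n_samples, n_features, rng):
    """Pure-noise features with balanced labels: no feature is truly informative."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    rng.shuffle(y)
    return X, y


def select_top_k(X, y, rows, k):
    """Rank features by absolute class-mean difference, using only the given rows."""
    scores = []
    for j in range(len(X[0])):
        pos = [X[i][j] for i in rows if y[i] == 1]
        neg = [X[i][j] for i in rows if y[i] == 0]
        scores.append((abs(statistics.fmean(pos) - statistics.fmean(neg)), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


def nearest_centroid_predict(X, y, train_rows, test_rows, features):
    """Assign each test row to the class whose training centroid is closest."""
    centroids = {}
    for label in (0, 1):
        members = [i for i in train_rows if y[i] == label]
        centroids[label] = [statistics.fmean([X[i][j] for i in members]) for j in features]
    preds = []
    for i in test_rows:
        dist = {
            label: sum((X[i][j] - c) ** 2 for j, c in zip(features, centroids[label]))
            for label in (0, 1)
        }
        preds.append(min(dist, key=dist.get))
    return preds


def cv_accuracy(X, y, k_features, n_folds, select_inside_folds):
    """K-fold CV accuracy; feature selection is done per fold or once on all data."""
    all_rows = list(range(len(X)))
    folds = [all_rows[f::n_folds] for f in range(n_folds)]
    # The biased variant: this selection step has already seen every test row.
    global_features = select_top_k(X, y, all_rows, k_features)
    correct = 0
    for test_rows in folds:
        held_out = set(test_rows)
        train_rows = [i for i in all_rows if i not in held_out]
        if select_inside_folds:
            features = select_top_k(X, y, train_rows, k_features)
        else:
            features = global_features
        preds = nearest_centroid_predict(X, y, train_rows, test_rows, features)
        correct += sum(p == y[i] for p, i in zip(preds, test_rows))
    return correct / len(X)


rng = random.Random(0)
X, y = make_noise_data(n_samples=40, n_features=1000, rng=rng)
biased = cv_accuracy(X, y, k_features=10, n_folds=5, select_inside_folds=False)
unbiased = cv_accuracy(X, y, k_features=10, n_folds=5, select_inside_folds=True)
print(f"selection on all data, then CV: {biased:.2f}")
print(f"selection inside each CV fold:  {unbiased:.2f}")
```

The only difference between the two numbers is where the selection step runs. The same reasoning applies to a GA-based search: the whole search would have to be repeated inside each resampling iteration for the performance estimate to be trustworthy.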

