As anyone who's tried to analyze real-world data knows, any number of problems may be lurking in the data that can prevent you from fitting a useful predictive model:
- Categorical variables can include infrequently-used levels, which will cause problems if sampling leaves them unrepresented in the training set.
- Numerical variables can be on wildly different scales, which can cause instability when fitting models.
- The data set may include several highly correlated columns, some of which could be pruned without sacrificing predictive power.
- The data set may include missing values that need to be dealt with before analysis can begin.
- ... and many others
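To make a few of these problems concrete, here is a minimal Python sketch of the kind of checks you might run on a column before modeling. The helper names (rare_levels, missing_fraction, value_range) and the toy data are hypothetical and chosen only for illustration; this is not how vtreat itself detects these issues.

```python
from collections import Counter

def rare_levels(values, min_count=2):
    """Return the levels that appear fewer than min_count times (None is ignored)."""
    counts = Counter(v for v in values if v is not None)
    return sorted(level for level, n in counts.items() if n < min_count)

def missing_fraction(values):
    """Return the fraction of entries that are missing (represented here as None)."""
    return sum(v is None for v in values) / len(values)

def value_range(values):
    """Return max - min over the non-missing entries, as a rough indicator of scale."""
    present = [v for v in values if v is not None]
    return max(present) - min(present)

# Toy columns, invented for this example.
color = ["red", "red", "blue", "blue", "red", "teal"]
income = [52000.0, None, 61000.0, 48000.0, 75000.0, 58000.0]
ratio = [0.12, 0.40, 0.33, None, 0.25, 0.18]

rare = rare_levels(color)                # levels at risk of being absent from a training sample
income_missing = missing_fraction(income)
income_range = value_range(income)       # tens of thousands
ratio_range = value_range(ratio)         # well below 1
```

Here "teal" shows up only once, so a random train/test split could easily leave it out of the training set, and the two numeric columns differ in scale by several orders of magnitude.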
The vtreat package is designed to counter common data problems like these in a statistically sound manner. It's a data frame preprocessor that applies a number of data cleaning processes to the input data before analysis, using techniques such as impact coding and categorical variable encoding (the methods are described in detail in this paper). Further details can be found on the vtreat GitHub page, where authors John Mount and Nina Zumel note:
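As a rough illustration of the idea behind impact coding, the Python sketch below replaces each level of a categorical variable with the difference between the mean outcome for that level and the overall mean outcome. This is a simplified sketch under stated assumptions, not vtreat's implementation: the function names and the smoothing parameter are hypothetical, and a production version would also need to guard against overfitting (for example by estimating the codes on held-out data).

```python
from collections import defaultdict

def fit_impact_code(levels, outcomes, smoothing=0.0):
    """Map each level to (mean outcome for that level) - (grand mean).

    A positive smoothing value shrinks the codes of rarely seen levels toward 0.
    """
    grand_mean = sum(outcomes) / len(outcomes)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(levels, outcomes):
        sums[level] += y
        counts[level] += 1
    return {
        level: (sums[level] - counts[level] * grand_mean) / (counts[level] + smoothing)
        for level in counts
    }

def apply_impact_code(code, levels):
    """Encode levels as numbers; levels not seen during fitting get 0 (no known impact)."""
    return [code.get(level, 0.0) for level in levels]

# Toy training data, invented for this example.
train_levels = ["a", "a", "b", "b", "c"]
train_outcomes = [1.0, 3.0, 5.0, 7.0, 4.0]

code = fit_impact_code(train_levels, train_outcomes)
encoded = apply_impact_code(code, ["a", "c", "z"])  # "z" was never seen in training
```

The result is a single numeric column however many levels the original variable has, which is what makes this kind of encoding attractive for high-cardinality categoricals.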
Even with modern machine learning techniques (random forests, support vector machines, neural nets, gradient boosted trees, and so on) or standard statistical methods (regression, generalized regression, generalized additive models) there are common data issues that can cause modeling to fail. vtreat deals with a number of these in a principled and automated fashion.
One final note: the main function in the package, prepare, is a little like model.matrix in that categorical variables are converted into numeric variables using contrast codings. This means that the output is suitable for many machine-learning functions (like xgboost) that don't accept categorical variables.
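The general idea of turning a categorical column into numeric columns can be sketched in a few lines of Python. This is only an illustration of indicator (one-hot) encoding, with hypothetical function and column names. It is not the output format of prepare, and it differs from model.matrix, whose default treatment contrasts drop a reference level.

```python
def fit_indicator_levels(train_values):
    """Record the distinct levels seen in training, in a stable order."""
    return sorted(set(train_values))

def encode_indicators(levels, values, name):
    """Return one dict of 0/1 columns per row; unseen levels encode as all zeros."""
    return [
        {f"{name}={level}": 1.0 if value == level else 0.0 for level in levels}
        for value in values
    ]

# Toy data, invented for this example.
levels = fit_indicator_levels(["red", "blue", "red"])
rows = encode_indicators(levels, ["blue", "green"], name="color")
```

Every output column is numeric, so the rows can be passed to a learner that only accepts numeric input. Fixing the levels at training time means that a new level such as "green" produces a well-defined row instead of an error.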
The vtreat package is available on CRAN now, and you can find a worked example using vtreat in the blog post linked below.
Win-Vector Blog: vtreat: prepare data