by Fang Zhou, Data Scientist, and Graham Williams, Director of Data Science, both at Microsoft
Rattle (the R Analytical Tool To Learn Easily) is a popular open-source GUI for data mining using R. It presents statistical and visual summaries of data, transforms data so that it can be readily modelled, builds both unsupervised and supervised models from the data, presents the performance of models graphically, and scores new datasets. All of the underlying R code is presented as a script, both for learning R and for running independently of Rattle.
Collaborating with the IGD Data Insight Team and under the guidance of the author of Rattle, Graham Williams, we took on the challenging task of understanding the existing code base and re-engineering it to support the latest machine learning algorithms in Open Source R and Microsoft R Server for model development and evaluation.
The Extreme Gradient Boosting algorithm, from the R package xgboost, is one of the newly added features and provides an alternative option for building a boosting model. The main effort in integrating xgboost into Rattle lay in three areas:
- Define generic functions to provide a formula interface to streamline the process of fitting an extreme gradient boosting model.
- Define the main R script to build, display and evaluate the model.
- Update the Rattle GUI to support the choice of xgboost using Glade Interface Designer and interactive R commands.
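To give a feel for the first item, here is a minimal sketch of what a formula interface to xgboost might look like. It is not Rattle's actual code: the helper names `formula_to_xy` and `xgb_formula`, the default parameters, and the toy data are all made up for illustration. The only xgboost functions assumed are `xgb.DMatrix()` and `xgb.train()`, and the fit is skipped if the package is not installed.

```r
# Hypothetical helper (not Rattle's code): turn a formula and a data frame
# into the numeric matrix and label vector that xgboost expects.
formula_to_xy <- function(formula, data) {
  mf <- model.frame(formula, data)           # keeps response and predictors aligned
  y <- model.response(mf)
  if (is.factor(y)) y <- as.integer(y) - 1L  # two-level factor -> 0/1 labels
  x <- model.matrix(attr(mf, "terms"), mf)   # factors become dummy columns
  x <- x[, colnames(x) != "(Intercept)", drop = FALSE]
  list(x = x, y = y)
}

# Hypothetical wrapper: fit a boosted model directly from a formula.
xgb_formula <- function(formula, data, nrounds = 50,
                        params = list(objective = "binary:logistic")) {
  xy <- formula_to_xy(formula, data)
  dtrain <- xgboost::xgb.DMatrix(data = xy$x, label = xy$y)
  xgboost::xgb.train(params = params, data = dtrain, nrounds = nrounds)
}

# Made-up toy data, for illustration only.
set.seed(42)
toy <- data.frame(
  amount  = runif(200, 1, 500),
  channel = factor(sample(c("web", "pos"), 200, replace = TRUE)),
  fraud   = rbinom(200, 1, 0.2)
)
xy <- formula_to_xy(fraud ~ amount + channel, toy)

# Only attempt the fit when xgboost is available.
if (requireNamespace("xgboost", quietly = TRUE)) {
  fit <- xgb_formula(fraud ~ amount + channel, toy, nrounds = 10)
}
```

The point of such a wrapper is that the caller writes `fraud ~ amount + channel` as they would for other R modelling functions, while the conversion of factors to a numeric matrix happens in one place.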
We now demonstrate Rattle’s xgboost support on the credit card data set from the Kaggle competition Credit Card Fraud Detection.
After loading the credit card data from a CSV file in Rattle’s Data Tab, we can click on the Model Tab to navigate to Boosting Model. By choosing the Model Builder xgb and a set of hyper-parameters, we can easily build an xgboost model without coding.
Measures and visualizations of feature importance and of training error can be generated by clicking the Importance and Errors buttons.
Performance evaluation is also supported. By navigating to the Evaluate Tab, we can calculate the confusion matrix and draw various statistical plots for model evaluation, such as the ROC curve, Risk chart and Lift chart.
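For readers curious about what sits behind two of these outputs, the following base R sketch computes a confusion matrix and the area under the ROC curve by hand. The labels and predicted probabilities are made up for illustration (they are not results from the credit card data), the 0.5 threshold is an assumption, and this is not necessarily how Rattle itself computes them.

```r
# Made-up labels and predicted probabilities, for illustration only.
actual <- c(0, 0, 1, 1, 0, 1, 0, 0, 1, 0)
prob   <- c(0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.05, 0.60, 0.70, 0.30)

# Confusion matrix at an assumed 0.5 decision threshold.
predicted <- as.integer(prob >= 0.5)
cm <- table(Actual = actual, Predicted = predicted)
print(cm)

# Area under the ROC curve via the rank-sum (Mann-Whitney) identity:
# the share of (positive, negative) pairs in which the positive case
# receives the higher score.
n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)
auc <- (sum(rank(prob)[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
print(auc)
```

The confusion matrix depends on the chosen threshold, whereas the AUC summarises ranking quality across all thresholds, which is why the two are often reported together.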
Do check the Log Tab to review the commands that were executed underneath.
Inspired by the work of the IGD Data Insight Team (see the blog post Microsoft R Server support for Rattle) and the latest releases of LightGBM, mxnet, MicrosoftML, etc., we could extend Rattle to expose much more functionality in the near future.
The latest release of Rattle (Version 5.0.18) is available on Bitbucket.
You can try this new version out either by using Microsoft R Client on Windows or by firing up an Azure Linux Data Science Virtual Machine, which comes with the developer version of Microsoft R Server installed. Then upgrade the pre-installed Rattle to this new release.
togaware: Rattle: A Graphical User Interface for Data Mining using R
This looks interesting.
Can you give me any hints as to what qualities (numeric, size, number of variables) my dataset must conform to for the boost choice to be usable? My Rattle version matches yours. I am choosing only numeric variables and a numeric target, but unfortunately the boost choice is still greyed out on the Model Tab.
Thanks for the help,
E
Posted by: Eason Jostad | July 08, 2017 at 15:09