by Fang Zhou, Data Scientist at Microsoft and Wee Hyong Tok, Senior Data Scientist Manager at Microsoft
In today’s fast-paced world, mobile phone customers have many choices and can easily switch between service providers. Reducing customer attrition and enhancing the customer experience are valuable ways to lower customer acquisition costs and maintain high-quality service.
Many industries, including mobile phone service providers, use churn models to predict which customers are most likely to leave, and to understand which factors cause customers to stop using their service. In this blog post, we sketch a solution to help providers, especially telecommunication companies, predict customer churn. The solution provides guidance on how to harness advanced analytics in SQL Server 2016, leveraging its new R Services feature alongside familiar SQL tools and technologies. Specifically, it covers the following phases:
- How a data scientist can make use of SQL Server R Services to do data science, and
- How to operationalize the telco churn models via T-SQL stored procedures and build an intelligent application.
We also highlight the usage of, and the comparison between, CRAN R packages and the RevoScaleR package.
Data Science with R Services
In the case of telco customer churn, we collected a combination of call detail record data and customer profile data from a mobile carrier, and then followed the data science process -- data exploration and visualization, data pre-processing and feature engineering, model training, scoring and evaluation -- to achieve the churn prediction. With a churn indicator in the dataset taking the value 1 when the customer has churned and 0 otherwise, we treated the problem as a binary classification problem and tried various tree-based models along with methods like bagging, random forests and boosting. Because the number of churned customers is much smaller than the number of non-churned customers (making the data set quite unbalanced), SMOTE (Synthetic Minority Oversampling Technique) was applied to adjust the proportion of the majority class over the minority class in the training data set, further improving model performance, especially precision and recall.
All the above data science procedures could be implemented with base R. Rather than moving the data out from the database to an external machine running R, we instead run R scripts directly on SQL Server data by leveraging the in-database analytics capability provided by SQL Server R Services, taking advantage of the rich and powerful CRAN R packages plus the parallel external memory algorithms in the RevoScaleR library. In what follows, we will describe the specific R packages and algorithms that we used to implement the data science solution for predicting telco customer churn.
In order to explore and visualize the data, we could leverage a range of CRAN R packages for manipulation and visualization, such as dplyr, ggplot2 and d3heatmap. To avoid the memory limitations of CRAN R and avoid moving the raw data out of the database, we instead used RevoScaleR functions for data aggregation. The rxCube and rxCrossTabs functions, used under the in-SQL compute context, are all that's needed for the visualization work. Below is a sample interactive visualization dashboard created by adding the power of leaflet and shiny.
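A minimal sketch of this in-database aggregation step follows. It is illustrative only: the connection string, table name (`edw_cdr`) and column names (`state`, `contractType`, `churnIndicator`) are hypothetical stand-ins for the actual schema.

```r
library(RevoScaleR)

# Point the compute context at SQL Server so aggregation runs in-database
sqlConnString <- "Driver=SQL Server;Server=.;Database=TelcoChurnDB;Trusted_Connection=True"
rxSetComputeContext(RxInSqlServer(connectionString = sqlConnString))

# Reference the table without pulling raw rows to the client
churnData <- RxSqlServerData(table = "edw_cdr", connectionString = sqlConnString)

# Aggregate churn counts by state for mapping with leaflet
byState <- rxCube(~ state : churnIndicator, data = churnData)

# Cross-tabulate churn against contract type for a heatmap
xtab <- rxCrossTabs(~ contractType : churnIndicator, data = churnData)
```

Only the small aggregated results (`byState`, `xtab`) travel back to the R client, where they can feed ggplot2, d3heatmap or a shiny dashboard.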
|                     | Non-churn | Churn |
|---------------------|-----------|-------|
| Predicted non-churn | 5138      | 405   |
| Predicted churn     | 300       | 279   |

Confusion Table for XGBoost. Note: Accuracy=0.885; Precision=0.407; Recall=0.482; F-Score=0.441; AUC=0.85
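As a rough illustration of the rebalancing and boosting steps described above (a sketch only: the data frame, column names, SMOTE ratios and xgboost parameters here are hypothetical, and we assume the DMwR and xgboost CRAN packages are installed):

```r
library(DMwR)     # provides SMOTE()
library(xgboost)

# 'train' is assumed to be a data frame with a 0/1 churn indicator column
train$churn <- as.factor(train$churn)

# Oversample the minority (churned) class and undersample the majority class
train_bal <- SMOTE(churn ~ ., data = train, perc.over = 200, perc.under = 200)

# Train a gradient-boosted tree classifier on the rebalanced data
features <- setdiff(names(train_bal), "churn")
dtrain <- xgb.DMatrix(
  data  = data.matrix(train_bal[, features]),
  label = as.numeric(as.character(train_bal$churn))
)
model <- xgboost(data = dtrain, objective = "binary:logistic",
                 max_depth = 6, eta = 0.3, nrounds = 50)
```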
The comparison among different tree-based algorithms is summarized in the following table.
| Package | Algorithm    | Accuracy | Precision | Recall | F-score | AUC  | Computation Location |
|---------|--------------|----------|-----------|--------|---------|------|----------------------|
| CRAN R  | randomForest | 0.907    | 0.552     | 0.117  | 0.194   | 0.76 | Memory               |
| CRAN R  | xgboost      | 0.885    | 0.407     | 0.482  | 0.441   | 0.85 | Memory               |
| ScaleR  | rxDForest    | 0.903    | 0.475     | 0.145  | 0.222   | 0.77 | SQL Server           |
| ScaleR  | rxBTrees     | 0.912    | 0.616     | 0.210  | 0.314   | 0.85 | SQL Server           |

Comparison of Different Tree-based Classification Models. Note: the parameter settings for each algorithm are as shown in the R code.
Operationalizing R Models via T-SQL
After the R script or model is ready for production use, the data scientist or a database developer can embed the code or model in system stored procedures and invoke the saved code from an application.
When using R with SQL Server, we can leverage the familiar Transact-SQL interface, and make sure that all computations take place in the database, avoiding unnecessary data movement. When the R code is deployed to production, SQL Server R Services provides the best of the R and SQL worlds.
We can use R for statistical computations that are difficult to implement using SQL, and also leverage the power of SQL Server to achieve maximum performance. For example, you can now leverage the power of the in-memory database engine and column store indexes.
In this telco churn example, we used Transact-SQL to train the churn model, generate scores from the predictive model in production, and return plots generated by R for presentation in an application such as Reporting Services. The stored procedure ‘generate_cdr_rx_forest’ demonstrates how we harness the power of R to deliver a churn model to application developers via T-SQL.
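A hedged sketch of what such a training procedure might look like follows (the procedure name comes from the post, but the table names, column names and model parameters are hypothetical; the full version is in the GitHub repository linked under Resources):

```sql
-- Sketch: train an rxDForest churn model inside SQL Server
-- and persist the serialized model for later scoring.
CREATE PROCEDURE generate_cdr_rx_forest
AS
BEGIN
    INSERT INTO cdr_models (model)
    EXECUTE sp_execute_external_script
        @language = N'R',
        @script = N'
            trainData$churn <- as.factor(trainData$churn)
            forest <- rxDForest(churn ~ ., data = trainData,
                                nTree = 8, maxDepth = 32)
            model <- data.frame(
                payload = as.raw(serialize(forest, connection = NULL)))
        ',
        @input_data_1 = N'SELECT * FROM edw_cdr_train',
        @input_data_1_name = N'trainData',
        @output_data_1_name = N'model';
END;
```

A companion scoring procedure would read the serialized model back, deserialize it in R, and call rxPredict on new rows.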
Putting together all the stored procedures created for training, evaluation and prediction, we can then leverage them in an intelligent application.
Here, we built an example application by using SQL Server Report Builder to display the telco customer churn prediction results.
Resources
- To try out the telco churn example, don’t hesitate to download all the materials you need from this GitHub repository.
- To get a general understanding of how to do data science with SQL Server 2016, you could refer to the webinar Data Science for Database Professionals.
- If a cloud-based implementation is what you need, the webinar Harness Predictive Customer Churn Models with Cortana Analytics Suite offers an implementation guide.