by Fang Zhou, Data Scientist at Microsoft and Wee Hyong Tok, Senior Data Scientist Manager at Microsoft
In today’s fast-paced world, mobile phone customers have many choices and can easily switch between service providers. Improving customer attrition rates and enhancing a customer’s experience are valuable ways to reduce customer acquisition costs and maintain a high-quality service.
Many industries, including mobile phone service providers, use churn models to predict which customers are most likely to leave, and to understand which factors cause customers to stop using their service. In this blog post, we sketch a solution to help providers, especially telecommunication companies, predict customer churn. This solution provides guidance on how to harness advanced analytics in SQL Server 2016, leveraging its new feature R Services and familiar SQL tools and technologies. To be specific, it covers the following phases:
- How a data scientist can make use of SQL Server R Services to do data science, and
- How to operationalize the telco churn models via T-SQL stored procedures and build an intelligent application.
Data Science with R Services
In the case of telco customer churn, we collected a combination of the call detail record data and customer profile data from a mobile carrier, and then followed the data science process -- data exploration and visualization, data pre-processing and feature engineering, model training, scoring and evaluation -- in order to achieve the churn prediction. With a churn indicator in the dataset taking value 1 when the customer is churned and taking value 0 when the customer is non-churned, we addressed the problem as a binary classification problem and tried various tree-based models along with methods like bagging, random forests and boosting. Because the number of churned customers is much less than that of non-churned customers (making the data set quite unbalanced), SMOTE (Synthetic Minority Oversampling Technique) was applied to adjust the proportion of majority class over minority class in the training data set, thus further improving model performance, especially precision and recall.
All the above data science procedures could be implemented with base R. Rather than moving the data out from the database to an external machine running R, we instead run R scripts directly on SQL Server data by leveraging the in-database analytics capability provided by SQL Server R Services, taking advantage of the rich and powerful CRAN R packages plus the parallel external memory algorithms in the RevoScaleR library. In what follows, we will describe the specific R packages and algorithms that we used to implement the data science solution for predicting telco customer churn.
In order to explore and visualize the data, we could leverage a branch of CRAN R packages for manipulation and visualization, such as dplyr, ggplot and d3heatmap. To avoid the memory limitations of CRAN R and avoid moving the raw data out from the database, we instead used RevoScaleR functions for data aggregation. The rxCube and rxCrossTabs functions, used under the in-sql compute context, is all that's needed fir the visualization work. Below is a sample interactive visualization dashboard created by adding the power of leaflet and shiny.