by Yuzhou Song, Microsoft Data Scientist
R is an open source statistical programming language with millions of users in its community. However, a well-known weakness of R is that it is both single-threaded and memory-bound, which limits its ability to process big data. With Microsoft R Server (MRS), the enterprise-grade distribution of R for advanced analytics, users can continue to work in their preferred R environment with the following benefits: the ability to scale to data of any size, and potential speed increases of up to one hundred times over open source R.
In this article, we walk through how to build a gradient boosted tree using MRS. We use a simple fraud data set with approximately 10 million records and 9 columns. The last column, “fraudRisk”, is the label: 0 stands for non-fraud and 1 stands for fraud. The following is a snapshot of the data.
Step 1: Import the Data
To begin, load the “RevoScaleR” package and specify the directory and name of the data file. Note that this demo runs in a local R session, so the “RevoScaleR” package must be loaded explicitly.
library(RevoScaleR)
data.path <- "./"
file.name <- "ccFraud.csv"
fraud_csv_path <- file.path(data.path,file.name)
Next, we create a data source using the RxTextData function. The output of RxTextData is a data source that declares the type and name of each field and the location of the data.
colClasses <- c("integer","factor","factor","integer",
"numeric","integer","integer","numeric","factor")
names(colClasses)<-c("custID","gender","state","cardholder",
"balance","numTrans","numIntlTrans",
"creditLine","fraudRisk")
fraud_csv_source <- RxTextData(file=fraud_csv_path,colClasses=colClasses)
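The idea of a named colClasses vector can be tried without MRS. The following is a minimal base R sketch, not RevoScaleR code: it uses read.csv on a small inline sample (the rows and the reduced column set are made up for illustration) to show that names, rather than positions, tie each type to its column.

```r
# Base R sketch (not RevoScaleR): a named colClasses vector maps
# column names to types, independent of the order of the entries.
# The sample rows below are invented for illustration only.
csv_text <- "custID,gender,state,balance,fraudRisk
1,1,35,3000,0
2,2,2,0,1
3,2,2,41,0"

# Deliberately listed in a different order than the file's columns
col_classes <- c(fraudRisk = "factor", custID = "integer",
                 gender = "factor", state = "factor",
                 balance = "numeric")

toy <- read.csv(text = csv_text, colClasses = col_classes)

# Class of each column after import
col_types <- sapply(toy, class)
print(col_types)
```

Declaring fraudRisk as a factor up front is what later lets the label be treated as a class rather than a number.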
With the data source (“fraud_csv_source” above), we can pass it as input to other functions, just like passing a data frame to regular R functions. For example, we can pass it to the rxGetInfo function to inspect the data:
rxGetInfo(fraud_csv_source, getVarInfo = TRUE, numRows = 5)
and you will get the following:
Step 2: Process Data
Next, we demonstrate how to use an important function, rxDataStep, to process data before training. In general, rxDataStep transforms data from an input data set to an output data set. It has three main arguments: inData, outFile and transformFunc. inData takes either a data source (such as the one created by RxTextData in step 1) or a data frame. outFile takes a data source that specifies the output file name, schema and location; if outFile is omitted, rxDataStep returns a data frame. transformFunc takes a function that rxDataStep applies to the data. If that function needs objects beyond the input data itself, you can pass them through the transformObjects argument.
Here, we make a training flag using rxDataStep function as an example. The data source fraud_csv_source made in step 1 will be used for inData. We create an output data source specifying output file name “ccFraudFlag.csv”:
fraud_flag_csv_source <- RxTextData(file=file.path(data.path,
"ccFraudFlag.csv"))
We also create a simple transformation function called “make_train_flag”. It creates a training flag that will be used to split the data into training and testing sets:
make_train_flag <- function(data){
data <- as.data.frame(data)
set.seed(34)
data$trainFlag <- sample(c(0,1),size=nrow(data),
replace=TRUE,prob=c(0.3,0.7))
return(data)
}
Then, use rxDataStep to complete the transformation:
rxDataStep(inData=fraud_csv_source,
outFile=fraud_flag_csv_source,
transformFunc=make_train_flag,overwrite = TRUE)
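To build intuition for what a transformation function does on data too large for memory, here is a minimal base R sketch. It assumes the function is called once per chunk of rows; process_in_chunks is a hypothetical helper written for this illustration and is not part of RevoScaleR.

```r
# Base R sketch of chunk-wise processing. process_in_chunks is a
# hypothetical stand-in for illustration, not a RevoScaleR function.
process_in_chunks <- function(data, chunk_size, transform_func) {
  starts <- seq(1, nrow(data), by = chunk_size)
  chunks <- lapply(starts, function(s) {
    rows <- s:min(s + chunk_size - 1, nrow(data))
    # The transformation only ever sees one chunk at a time
    transform_func(data[rows, , drop = FALSE])
  })
  do.call(rbind, chunks)
}

# Same transformation function as in the article
make_train_flag <- function(data) {
  data <- as.data.frame(data)
  set.seed(34)
  data$trainFlag <- sample(c(0, 1), size = nrow(data),
                           replace = TRUE, prob = c(0.3, 0.7))
  data
}

toy <- data.frame(custID = 1:1000)
flagged <- process_in_chunks(toy, chunk_size = 250,
                             transform_func = make_train_flag)

train_share <- mean(flagged$trainFlag)
first_chunk <- flagged$trainFlag[1:250]
second_chunk <- flagged$trainFlag[251:500]
print(train_share)
print(identical(first_chunk, second_chunk))
```

Under this assumption, calling set.seed inside the function restarts the random stream for every chunk, so equally sized chunks receive the same sequence of flags. The overall split still lands near 70/30, but the flags are not independent across chunks.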
Again, we can check the output file information using the rxGetInfo function:
rxGetInfo(fraud_flag_csv_source, getVarInfo = TRUE, numRows = 5)
We can see that the trainFlag column has been appended as the last column:
Based on trainFlag, we split the data into training and testing sets. To do so, we need to specify the data sources for the output:
train_csv_source <- RxTextData(
file=file.path(data.path,"train.csv"))
test_csv_source <- RxTextData(
file=file.path(data.path,"test.csv"))
Instead of creating a transformation function, we can simply specify the rowSelection argument in rxDataStep function to select rows satisfying certain conditions:
rxDataStep(inData=fraud_flag_csv_source,
outFile=train_csv_source,
reportProgress = 0,
rowSelection = (trainFlag == 1),
overwrite = TRUE)
rxDataStep(inData=fraud_flag_csv_source,
outFile=test_csv_source,
reportProgress = 0,
rowSelection = (trainFlag == 0),
overwrite = TRUE)
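For readers more familiar with data frames, the same row selection can be sketched in base R with subset. This is only an in-memory analogy on made-up rows, not what rxDataStep does internally.

```r
# Base R analogy of rowSelection on a small in-memory data frame.
# The rows are invented for illustration.
toy <- data.frame(custID = 1:6,
                  trainFlag = c(1, 0, 1, 1, 0, 1))

train <- subset(toy, trainFlag == 1)  # like rowSelection = (trainFlag == 1)
test  <- subset(toy, trainFlag == 0)  # like rowSelection = (trainFlag == 0)

print(nrow(train))
print(nrow(test))
```

Every row goes to exactly one of the two sets, because the two conditions are mutually exclusive and together cover both flag values.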
A well-known problem with fraud data is the extremely skewed distribution of labels: most transactions are legitimate, and only a very small proportion are fraudulent. In the original data, the good/bad ratio is about 15:1. Training a model directly on the original data results in poor performance, since the model is unable to find the proper boundary between “good” and “bad”. A simple but effective solution is to randomly down-sample the majority class. The following down_sample transformation function down-samples the majority class to a good/bad ratio of 4:1. The down-sampling ratio is pre-selected based on prior knowledge, but it can also be tuned using cross-validation.
down_sample <- function(data){
data <- as.data.frame(data)
data_bad <- subset(data,fraudRisk == 1)
data_good <- subset(data,fraudRisk == 0)
# good to bad ratio 4:1
rate <- nrow(data_bad)*4/nrow(data_good)
set.seed(34)
data_good$keepTag <- sample(c(0,1),replace=TRUE,
size=nrow(data_good),prob=c(1-rate,rate))
data_good_down <- subset(data_good,keepTag == 1)
data_good_down$keepTag <- NULL
data_down <- rbind(data_bad,data_good_down)
data_down$trainFlag <- NULL
return(data_down)
}
Then, we specify the down sampled training data source and use rxDataStep function again to complete the down sampling process:
train_downsample_csv_source <- RxTextData(
file=file.path(data.path,
"train_downsample.csv"),
colClasses = colClasses)
rxDataStep(inData=train_csv_source,
outFile=train_downsample_csv_source,
transformFunc = down_sample,
reportProgress = 0,
overwrite = TRUE)
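The arithmetic behind the keep rate can be checked on synthetic data. The sketch below is plain base R with made-up counts at a 15:1 ratio; the min(1, ...) guard is an addition for this sketch (it keeps the probability valid if a subset happens to contain few majority rows) and is not in the function above.

```r
# Base R sketch: keep each "good" row with probability
# rate = 4 * n_bad / n_good, so about 4 good rows remain per bad row.
set.seed(34)
n_bad <- 1000
n_good <- 15000   # synthetic 15:1 good/bad ratio
toy <- data.frame(fraudRisk = c(rep(1, n_bad), rep(0, n_good)))

# Guard added for this sketch: never let the probability exceed 1
rate <- min(1, 4 * n_bad / n_good)   # 4/15, about 0.267

# Keep all bad rows; keep good rows at random with probability `rate`
keep <- toy$fraudRisk == 1 | runif(nrow(toy)) < rate
toy_down <- toy[keep, , drop = FALSE]

ratio <- sum(toy_down$fraudRisk == 0) / sum(toy_down$fraudRisk == 1)
print(rate)
print(ratio)
```

Because the sampling is random, the resulting ratio is close to 4:1 rather than exactly 4:1.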
Step 3: Training
In this step, we take the down sampled data to train a gradient boosted tree. We first use rxGetVarNames function to get all variable names in training set. The input is still the data source of down sampled training data. Then we use it to create a formula which will be used later:
training_vars <- rxGetVarNames(train_downsample_csv_source)
training_vars <- training_vars[!(training_vars %in%
c("fraudRisk","custID"))]
formula <- as.formula(paste("fraudRisk~",
paste(training_vars, collapse = "+")))
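The formula construction itself needs only base R, so it can be tried without a data source. In this sketch the variable names are typed in by hand instead of coming from rxGetVarNames, and reformulate from the stats package is shown as an alternative to pasting strings.

```r
# Base R sketch: build the model formula from a vector of names.
# The names are hard-coded here rather than read from a data source.
all_vars <- c("custID", "gender", "state", "cardholder",
              "balance", "numTrans", "numIntlTrans",
              "creditLine", "fraudRisk")

# Drop the label and the ID column from the predictors
predictors <- setdiff(all_vars, c("fraudRisk", "custID"))

# reformulate() joins the predictors with "+" and adds the response
model_formula <- reformulate(predictors, response = "fraudRisk")
print(model_formula)

# Variables referenced by the formula: the response first, then predictors
formula_vars <- all.vars(model_formula)
```

Excluding custID matters because an identifier carries no predictive signal and would only invite overfitting.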
The rxBTrees function builds a gradient boosted tree model. The formula argument specifies the label column and the predictor columns. The data argument takes a data source as training input. The lossFunction argument specifies the distribution of the label column: “bernoulli” for regression on a numeric 0/1 label, “gaussian” for numeric regression, and “multinomial” for classification with two or more classes. Here we choose “multinomial” because this is a 0/1 classification problem. The other parameters are pre-selected, not finely tuned:
boosted_fit <- rxBTrees(formula = formula,
data = train_downsample_csv_source,
learningRate = 0.2,
minSplit = 10,
minBucket = 10,
# small number of tree for testing purpose
nTree = 20,
seed = 5,
lossFunction ="multinomial",
reportProgress = 0)
Step 4: Prediction and Evaluation
We use rxPredict function to predict on testing data set, but first use rxImport function to import testing data set:
test_data <- rxImport(test_csv_source)
Then, we take the imported testing set and fitted model object as input for rxPredict function:
predictions <- rxPredict(modelObject = boosted_fit,
data = test_data,
type = "response",
overwrite = TRUE,
reportProgress = 0)
In the rxPredict function, type = "response" outputs predicted probabilities. Finally, we pick 0.5 as the threshold and evaluate the performance:
threshold <- 0.5
predictions <- data.frame(predictions$X1_prob)
names(predictions) <- c("Boosted_Probability")
predictions$Boosted_Prediction <- ifelse(
predictions$Boosted_Probability > threshold, 1, 0)
predictions$Boosted_Prediction <- factor(
predictions$Boosted_Prediction,
levels = c(1, 0))
scored_test_data <- cbind(test_data, predictions)
evaluate_model <- function(data, observed, predicted) {
confusion <- table(data[[observed]],
data[[predicted]])
print(confusion)
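# Index cells by label rather than by position, so the counts do not
# depend on the order of the factor levels (rows: observed, columns: predicted)
tp <- confusion["1", "1"]
fn <- confusion["1", "0"]
fp <- confusion["0", "1"]
tn <- confusion["0", "0"]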
accuracy <- (tp+tn)/(tp+fn+fp+tn)
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
fscore <- 2*(precision*recall)/(precision+recall)
metrics <- c("Accuracy" = accuracy,
"Precision" = precision,
"Recall" = recall,
"F-Score" = fscore)
return(metrics)
}
roc_curve <- function(data, observed, predicted) {
data <- data[, c(observed, predicted)]
data[[observed]] <- as.numeric(
as.character(data[[observed]]))
rxRocCurve(actualVarName = observed,
predVarNames = predicted,
data = data)
}
boosted_metrics <- evaluate_model(data = scored_test_data,
observed = "fraudRisk",
predicted = "Boosted_Prediction")
roc_curve(data = scored_test_data,
observed = "fraudRisk",
predicted = "Boosted_Probability")
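The area under a ROC curve has a simple interpretation: it is the probability that a randomly chosen fraud case receives a higher score than a randomly chosen non-fraud case. The base R sketch below computes it from ranks on a tiny made-up example; auc_rank is a hypothetical helper for illustration and is not how rxRocCurve is implemented.

```r
# Base R sketch: AUC via the rank-sum (Mann-Whitney) identity.
# auc_rank is a hypothetical helper, not a RevoScaleR function.
auc_rank <- function(labels, scores) {
  pos <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  ranks <- rank(scores)  # tied scores receive average ranks
  # Rank sum of the positives, minus its smallest possible value,
  # divided by the number of positive/negative pairs
  (sum(ranks[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up labels and scores for illustration
labels <- c(0, 0, 1, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8, 0.7, 0.9)

auc_toy <- auc_rank(labels, scores)
print(auc_toy)
```

In this toy example, 6 of the 9 positive/negative pairs are ordered correctly, so the AUC is 2/3. Unlike accuracy or precision, this number does not depend on the 0.5 threshold.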
The confusion matrix (rows: observed fraudRisk, columns: predicted):
        0       1
0 2752288   67659
1   77117  101796
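As a worked check, the metrics can be recomputed directly from these four counts. The sketch below assumes rows are observed values and columns are predicted values, which is what table(observed, predicted) would produce.

```r
# Base R sketch: recompute the metrics from the counts shown above.
# Assumption: rows are observed labels, columns are predicted labels.
confusion <- matrix(c(2752288,  67659,
                        77117, 101796),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(observed = c("0", "1"),
                                    predicted = c("0", "1")))

tp <- confusion["1", "1"]  # fraud predicted as fraud
fn <- confusion["1", "0"]  # fraud predicted as non-fraud
fp <- confusion["0", "1"]  # non-fraud predicted as fraud
tn <- confusion["0", "0"]  # non-fraud predicted as non-fraud

accuracy  <- (tp + tn) / (tp + fn + fp + tn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
fscore    <- 2 * precision * recall / (precision + recall)

metrics <- c(Accuracy = accuracy, Precision = precision,
             Recall = recall, "F-Score" = fscore)
print(round(metrics, 3))
```

Under that assumption, accuracy comes out at roughly 0.95 while precision and recall are both close to 0.6. The gap shows why accuracy alone is a weak yardstick on skewed fraud data: predicting "non-fraud" for every row would already score about 0.94.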
The ROC curve (AUC = 0.95):
Summary:
In this article, we demonstrated how to use MRS on a fraud data set: how to create a data source with the RxTextData function, how to transform data with the rxDataStep function, how to import data with the rxImport function, how to train a gradient boosted tree model with the rxBTrees function, and how to predict with the rxPredict function.