by Lourdes O. Montenegro

*Lourdes O. Montenegro is a PhD candidate at the Lee Kuan Yew School of Public Policy, National University of Singapore. Her research interests cover the intersection of applied data science, technology, economics and public policy.*

Many of us now find it hard to live without a good quality internet connection. As a result, there is growing interest in characterizing and comparing internet performance metrics. For example, when planning to switch internet service providers or considering a move to a new city or country, internet users may want to research in advance what to expect in terms of download speed or latency. Cloud companies may want to provision adequately for different markets with varying levels of internet quality. And governments may want to benchmark their communications infrastructure and invest accordingly. Whatever the purpose, a consortium of research, industry and public interest organizations called Measurement Lab has made available the largest open and verifiable internet performance dataset on the planet. With the help of a combination of packages, R users can easily query, explore and visualize this large dataset at no cost.

In the example that follows, we use the *bigrquery* package to query and download results from the Network Diagnostic Tool (NDT) used by the U.S. FCC. The *bigrquery* package provides an interface to Google BigQuery, which hosts NDT results along with several other Measurement Lab (M-Lab) datasets. However, R users will need to first set up a BigQuery account and join the M-Lab mailing list to authenticate. Detailed instructions are provided on the M-Lab website. Once done, SQL-like queries can be run from within R. The results are saved as a dataframe on which further analysis can be performed. Aside from the convenience of working within the R environment, the *bigrquery* package has another advantage: the only limit on the size of the query results that can be saved for further exploration is the amount of available RAM. In contrast, the BigQuery web interface only allows query results of 16,000 rows or fewer to be exported to .csv format.

The following R script gives us the average download speed (in Mbps) per country in 2015.[1] The SQL-like query can be modified to return other internet performance metrics that may be of interest to the R user such as upload speed, round-trip time (latency) and packet re-transmission rates.
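The `downloadThroughput` expression in the query below converts web100 counters to Mbps: acknowledged octets times 8 gives bits, and the three `SndLimTime` fields (in microseconds) sum to the elapsed send time; since bits per microsecond equal megabits per second, no further scaling is needed. A minimal sketch of that arithmetic in base R:

```r
# Download throughput (Mbps) from web100 counters: octets -> bits, divided by
# total send-limited time in microseconds. Bits per microsecond == Mbps.
ndt_download_mbps <- function(HCThruOctetsAcked,
                              SndLimTimeRwin, SndLimTimeCwnd, SndLimTimeSnd) {
  8 * HCThruOctetsAcked /
    (SndLimTimeRwin + SndLimTimeCwnd + SndLimTimeSnd)
}

# 12,500,000 octets acknowledged over 10 seconds (1e7 microseconds) -> 10 Mbps
ndt_download_mbps(12500000, 4e6, 5e6, 1e6)
```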

```
# Querying average download speed per country in 2015
require(bigrquery)

downquery_template <- "SELECT
  connection_spec.client_geolocation.country_code AS country,
  AVG(8 * web100_log_entry.snap.HCThruOctetsAcked /
      (web100_log_entry.snap.SndLimTimeRwin +
       web100_log_entry.snap.SndLimTimeCwnd +
       web100_log_entry.snap.SndLimTimeSnd)) AS downloadThroughput,
  COUNT(DISTINCT test_id) AS tests,
FROM plx.google:m_lab.ndt.all
WHERE IS_EXPLICITLY_DEFINED(web100_log_entry.connection_spec.remote_ip)
  AND IS_EXPLICITLY_DEFINED(web100_log_entry.connection_spec.local_ip)
  AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.HCThruOctetsAcked)
  AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeRwin)
  AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeCwnd)
  AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.SndLimTimeSnd)
  AND project = 0
  AND IS_EXPLICITLY_DEFINED(connection_spec.data_direction)
  AND connection_spec.data_direction = 1
  AND web100_log_entry.snap.HCThruOctetsAcked >= 8192
  AND (web100_log_entry.snap.SndLimTimeRwin +
       web100_log_entry.snap.SndLimTimeCwnd +
       web100_log_entry.snap.SndLimTimeSnd) >= 9000000
  AND (web100_log_entry.snap.SndLimTimeRwin +
       web100_log_entry.snap.SndLimTimeCwnd +
       web100_log_entry.snap.SndLimTimeSnd) < 3600000000
  AND IS_EXPLICITLY_DEFINED(web100_log_entry.snap.CongSignals)
  AND web100_log_entry.snap.CongSignals > 0
  AND (web100_log_entry.snap.State == 1
       OR (web100_log_entry.snap.State >= 5
           AND web100_log_entry.snap.State <= 11))
  AND web100_log_entry.log_time >= PARSE_UTC_USEC('2015-01-01 00:00:00') / POW(10, 6)
  AND web100_log_entry.log_time < PARSE_UTC_USEC('2016-01-01 00:00:00') / POW(10, 6)
GROUP BY country
ORDER BY country ASC;"

downresult <- query_exec(downquery_template, project = "measurement-lab",
                         max_pages = Inf)
```

Once we have the query results in a dataframe, we can proceed to visualize and map average download speeds for each country. To do this, we can use the *rworldmap* package, which offers a relatively simple way to map country-level and gridded user datasets. Mapping is done mainly through two functions: (1) *joinCountryData2Map* joins the query results with shapefiles of country boundaries; (2) *mapCountryData* plots the choropleth map. Note that the join is best effected using either two- or three-letter ISO country codes, although the *rworldmap* package also allows join columns filled with country names.

To make the choropleth map prettier and more comprehensible, we can add a combination of the *classInt* package, to calculate natural breaks in the range of download speed results, and the *RColorBrewer* package, for a wider selection of color schemes. In the following R script, we specify the Jenks method to cluster download speed results in a way that minimizes deviation from the class mean within each class while maximizing deviation across class means. Compared to other methods for clustering download speed results, the Jenks method draws a sharper picture of countries clocking greater than 25 Mbps on average.

```
require(rworldmap)
require(classInt)
require(RColorBrewer)

downloadmap <- joinCountryData2Map(downresult,
                                   joinCode = 'ISO2',
                                   nameJoinColumn = 'country',
                                   verbose = 'TRUE')

par(mai = c(0, 0, 0.2, 0), xaxs = "i", yaxs = "i")

# Get class intervals using a 'jenks' classification from the classInt package
classInt <- classInt::classIntervals(downloadmap[["downloadThroughput"]],
                                     n = 5, style = "jenks")
catMethod <- classInt[["brks"]]

# Get a colour scheme from the RColorBrewer package
colourPalette <- RColorBrewer::brewer.pal(5, 'RdPu')

mapParams <- mapCountryData(downloadmap,
                            nameColumnToPlot = "downloadThroughput",
                            mapTitle = "Download Speed (Mbps)",
                            catMethod = catMethod,
                            colourPalette = colourPalette,
                            addLegend = FALSE)

do.call(addMapLegend, c(mapParams,
                        legendWidth = 0.5,
                        legendLabels = "all",
                        legendIntervals = "data",
                        legendMar = 2))
```

Looking at the map, we see that the UK, Japan, Romania, Sweden, Taiwan, The Netherlands, Denmark and Singapore (if we squint!) are the best places to be for internet speed addicts. Until further investigation, we can safely discount the suspiciously high results for North Korea, since the number of observations is too low. In contrast, average download speeds in South Korea might be grossly underestimated when measured from foreign servers, as may be the case with NDT results, since most Koreans access locally hosted content. There are, of course, a number of caveats worth mentioning before drawing any conclusions about the causes of varying internet performance between countries. Confounding factors such as the distance from the client to the test server, the client's operating system, and the proportion of fixed broadband to wireless connections will need to be controlled for. Despite these caveats, this tentative exploration already reveals interesting patterns in global internet performance that are worth a closer look.

[1] Thanks to Stephen McInerney and Chris Ritzo for code advice.

While there are many admirable efforts to increase participation by women in STEM fields, in many programming teams men still outnumber women, often by a significant margin. Specifically by how much is a fraught question, and accurate statistics are hard to come by. Another interesting question is whether the gender disparity varies by language, and how to define a "typical programmer" for a given language.

Jeff Allen from Trestle Tech recently took an interesting approach using R to gather data on gender ratios for programmers: get a list of the top coders for each programming language, and then count the number of men and women in each list. Neither task is trivial. For a list of coders, Jeff scraped GitHub's list of trending repositories over the past month by programming language, and then extracted the avatars of the listed contributors. He then used the Microsoft Cognitive Services Face API on each avatar to determine the apparent gender of each contributor and tallied up the results. You can find the R code he used on GitHub.

I used Jeff's code to re-run his results based on GitHub's latest monthly rankings. The first thing I needed to do was to request an API key; a trial key is free with a Microsoft account. (The number of requests per second is limited, but the R code is written to throttle the rate of requests accordingly.) I limited my search to the languages C++, C#, Java, Javascript, Python, R and Ruby. The percentages of contributors identified as female, within each language, are shown below:

According to this analysis, none of the contributors to top C++ projects on GitHub are female; by contrast, almost 10% of contributors to R projects are female.

Now, these data need to be taken with a grain of salt. The main issue is numbers: fewer than 100 programmers per language are identified as "top programmers" via this method, and sometimes significantly fewer (just 45 top C++ contributors were identified). Part of the reason for this is that not all programmers use their face as an avatar; those that used a symbol, logo or cartoon were not counted. Furthermore, it's reasonable to assume that there's a disparity in the rate at which women use their own face as an avatar compared to men, which would add bias to the above results in addition to the variability from the small numbers. Finally, the gender determination is based on an algorithm which classifies faces as only male or female, and isn't guaranteed to match the gender identity of the programmer (or their avatar).

Nonetheless, it's an interesting example of using social network data in conjunction with cognitive APIs to conduct demographic studies. You can find examples of using other data from the facial analysis, including apparent happiness by language, at the link below.

(**Update** June 15: re-ran the analysis and updated the chart above to actually display percentages, not ratios, on the y-axis. The numbers changed slightly as the GitHub data changed. The old chart is here.)

Trestle Tech: EigenCoder: Programming Stereotypes

by Dmitry Pechyoni, Microsoft Data Scientist

The New York City taxi dataset is one of the largest publicly available datasets. It has about 1.1 billion taxi rides in New York City. Previously this dataset was explored and visualized in a number of blog posts, where the authors used various technologies (e.g., PostgreSQL and Elasticsearch). Moreover, in a recent blog post our colleagues showed how to build machine learning models over one year of this dataset using Microsoft R Server (MRS) running in a 4-node Hadoop cluster. In this blog post we will use a single commodity machine to show an end-to-end process of downloading and cleaning 4 years of data, as well as building and evaluating a machine learning model. This entire process takes around 12 hours.

We used the Data Science Virtual Machine in the Azure cloud. Our virtual machine runs Windows and has the Standard A4 configuration (8 cores and 14Gb of memory). This machine comes with 126Gb of hard disk, which is not enough to store the NYC taxi dataset. To store the dataset locally, we followed these instructions and attached a 1Tb hard disk to our virtual machine.

The New York City taxi dataset has data from 2009-2015 for yellow taxis and from 2013-2015 for green taxis. Yellow and green taxis have different data schemas. Also, we found that the yellow taxi rides from 2010-2013 have exactly the same schema, whereas the schema of the yellow taxi rides from 2009, 2014 and 2015 is different. To simplify our code, we focused on yellow rides from 2010-2013 and discarded the other ones. There are more than 600 million yellow taxi rides from 2010-2013, taking up 115Gb. Our goal is to build a binary classification model that will predict whether a passenger will pay a tip.

We start by downloading the data, using the httr package. This code downloads 115Gb of data to our virtual machine; it takes 1 hour and 10 minutes.
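The download script itself is linked rather than shown; a minimal httr sketch of the idea follows. The URL template and destination directory here are assumptions for illustration, so adjust them to wherever the monthly TLC files are currently hosted.

```r
# Hypothetical URL template for the monthly yellow-taxi csv files
# (an assumption for illustration; check the current hosting location).
taxi_url <- function(year, month) {
  sprintf("https://storage.googleapis.com/nyc-tlc/trip+data/yellow_tripdata_%d-%02d.csv",
          year, month)
}

# httr::write_disk() streams the response body straight to disk, so a
# multi-gigabyte file never has to fit in memory.
download_month <- function(year, month, dir = "E:/taxi") {
  dest <- file.path(dir, basename(taxi_url(year, month)))
  httr::GET(taxi_url(year, month), httr::write_disk(dest, overwrite = TRUE))
}

# All 48 months: for (y in 2010:2013) for (m in 1:12) download_month(y, m)
```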

To preprocess a big volume of data in a short time, we used the parallel programming capabilities of Microsoft R Server (MRS), in particular the rxExec function. This function allows us to distribute data processing over multiple cores. Since our machine has 8 cores, using rxExec let us speed up the data processing significantly.
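rxExec is specific to MRS, but the same fan-out pattern can be sketched with base R's parallel package; `clean_file` below is just a placeholder for the per-file cleaning work described next.

```r
library(parallel)

# Stand-in for the real per-file worker (read one monthly csv, clean it,
# write one xdf file); here it only returns the name it would produce.
clean_file <- function(i) paste0("month_", i, ".xdf")

cl <- makeCluster(2)               # use one worker per core (8 on the VM above)
out <- parLapply(cl, 1:8, clean_file)
stopCluster(cl)
unlist(out)
```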

The original dataset has one csv file per month. Initially we convert each csv file to the xdf file format. Xdf files store data in a compressed, optimized format and are the input to all subsequent Microsoft R Server functions. This code does the above conversion, removes unnecessary columns, removes rows with noisy data, computes several new columns (duration of the trip, weekday, hour and the label column) and finally converts several columns to factors.

By leveraging the rxExec function and 8 cores, we are able to process 8 csv files in parallel. The data cleaning step takes 3 hours and 50 minutes. The resulting xdf files occupy 26 Gb of disk space, a significant reduction from the original 115 Gb.

In the next step we combine multiple xdf files into a single training file and a single test file. We used 2010-2012 data for training and 2013 data for testing. When merging the files, we again leverage multiple cores and the rxExec function. The merge is done using a binary-tree scheme; merges at the same level are executed in parallel. The following picture shows an example of merging 6 files:

In this example initially 3 pairs of files (1+2, 3+4 and 5+6) are merged in parallel. Then 1+2 is merged with 3+4, and finally 1+2+3+4 is merged with 5+6. This code implements binary tree merge.
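The binary-tree scheme can be sketched in a few lines of base R, with `c()` standing in for the pairwise xdf merge (in the real pipeline each level's merges run in parallel via rxExec):

```r
# Merge a list of "files" pairwise, level by level, until one remains.
tree_merge <- function(files, merge2 = c) {
  while (length(files) > 1) {
    idx <- seq(1, length(files), by = 2)
    files <- lapply(idx, function(i) {
      if (i < length(files)) merge2(files[[i]], files[[i + 1]]) else files[[i]]
    })
  }
  files[[1]]
}

# Six files merge as (1+2), (3+4), (5+6), then (1+2)+(3+4), then +(5+6)
tree_merge(list(1, 2, 3, 4, 5, 6))
```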

Merging the small files and creating the train.xdf and test.xdf files takes 5.5 hours. The resulting train.xdf file has 506M rows and takes 18 Gb. The test file test.xdf has 148M rows and takes 6 Gb.

*An alternative way of preparing the data for modeling would be to convert the csv files to a single composite xdf file and then use it as the input to the machine learning algorithm. In this case we could skip the merging step. However, the conversion from csv to composite xdf is done by a single call to the rxImport function. This function processes csv files sequentially and is much slower than parallel creation of xdf files and their subsequent merge.*

At this point we are ready to do training and scoring. Notice that the training file is 18 Gb and does not fit into memory (which is 14 Gb). Fortunately, the implementations of machine learning algorithms in MRS can handle such big datasets. We ran a single iteration of logistic regression over the training set, which took 1 hour and 10 minutes. Then we ran scoring over the test set, which took 10 minutes.

We performed experiments with 3 feature sets. In the first feature set we used passenger type, trip distance, pickup latitude, pickup longitude, drop-off latitude, drop-off longitude, fare amount, MTA tax, tolls amount, surcharge, duration and hour as numeric features. We also used payment type, rate code and weekday as categorical features. Surprisingly, our model has an AUC of 0.984 over the test set. Such a high AUC looks too good to be true, and after additional data exploration we found that the payment type is an excellent predictor of whether a passenger will leave a tip:

```
> rxSummary(formula = label ~ payment_type, data = ".\\xdf/train.xdf",
+           summaryStats = c("Mean", "ValidObs"), reportProgress = 0)

Statistics by category (2 categories):

 Category                           payment_type Means        ValidObs
 label for payment_type=Cash        Cash         0.0005490915 289816908
 label for payment_type=Credit Card Credit Card  0.9720573755 216974429
```

As we see, in the training set 97% of passengers who pay by credit card leave a tip, but less than 0.1% of passengers who pay by cash do. The test set has similar statistics. Hence predicting the tip at the time of payment is a very easy task. A more interesting task would be to predict whether a passenger will leave a tip before the actual payment. We excluded the payment type feature, reran the experiment and got an AUC of only 0.553. This is better than the baseline AUC of 0.5, but not that impressive. In the next step we defined the hour feature as a categorical variable and slightly improved the test set AUC to 0.564. This relatively low AUC indicates that there is lots of room for improving our model.

This code runs training and scoring with our latest feature set.

The following table summarizes the running times of all steps of our experiment:

By leveraging Microsoft R Server, we were able to perform the entire process of building and evaluating a machine learning model over hundreds of millions of examples on a single commodity machine within a matter of hours. Our model is a simple logistic regression with a small set of features. In the future, it will be interesting to develop a model with a larger feature set that more accurately predicts tips before the actual payment.

The code used to create this experiment is in a GitHub repository.

by Max Kuhn: Director, Nonclinical Statistics, Pfizer

Many predictive and machine learning models have structural or *tuning* parameters that cannot be directly estimated from the data. For example, when using a *K*-nearest neighbors model, there is no analytical estimator for *K* (the number of neighbors). Typically, resampling is used to get good performance estimates of the model for a given set of values for *K*, and the value associated with the best results is used. This is basically a grid search procedure. However, there are other approaches that can be used. I'll demonstrate how Bayesian optimization and Gaussian process models can be used as an alternative.

To demonstrate, I'll use the regression simulation system of Sapp et al. (2014), where the predictors (i.e. the `x`'s) are independent Gaussian random variables with mean zero and a variance of 9. The prediction equation is:

```
x_1 + sin(x_2) + log(abs(x_3)) + x_4^2 + x_5*x_6 + I(x_7*x_8*x_9 < 0) +
  I(x_10 > 0) + x_11*I(x_11 > 0) + sqrt(abs(x_12)) + cos(x_13) + 2*x_14 +
  abs(x_15) + I(x_16 < -1) + x_17*I(x_17 < -1) - 2*x_18 - x_19*x_20
```

The random error here is also Gaussian with mean zero and a variance of 9. This simulation is available in the `caret` package via a function called `SLC14_1`. First, we'll simulate a training set of 250 data points and also a larger set that we will use to elucidate the true parameter surface:

```
> library(caret)
> set.seed(7210)
> train_dat <- SLC14_1(250)
> large_dat <- SLC14_1(10000)
```
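For reference, the mean function of the simulation can be written out directly in base R. This is just a sketch of the equation above; `SLC14_1` in caret is the canonical implementation and also adds the Gaussian error.

```r
# Mean function of the Sapp et al. simulation, transcribed from the equation
# above; logical comparisons play the role of the I() indicator terms.
slc14_mean <- function(x) {
  with(as.list(x),
       x1 + sin(x2) + log(abs(x3)) + x4^2 + x5 * x6 +
         (x7 * x8 * x9 < 0) + (x10 > 0) + x11 * (x11 > 0) +
         sqrt(abs(x12)) + cos(x13) + 2 * x14 + abs(x15) +
         (x16 < -1) + x17 * (x17 < -1) - 2 * x18 - x19 * x20)
}

# With every predictor equal to 1, the indicators for x7*x8*x9 < 0,
# x16 < -1 and x17 < -1 drop out and the mean is 6 + sin(1) + cos(1)
x <- setNames(rep(1, 20), paste0("x", 1:20))
slc14_mean(x)
```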

We will use a radial basis function support vector machine to model these data. For a fixed epsilon, the model will be tuned over the cost value and the radial basis kernel parameter, commonly denoted as `sigma`. Since we are simulating the data, we can figure out a good approximation to the relationship between these parameters and the root mean squared error (RMSE) of the model. Given our specific training set and the larger simulated sample, here is the RMSE surface for a wide range of values:

There is a wide range of parameter values that are associated with very low RMSE values in the northwest.

A simple way to get an initial assessment is to use random search, where a set of random tuning parameter values is generated across a "wide range". For an RBF SVM, `caret`'s `train` function defines wide as cost values between `2^c(-5, 10)` and `sigma` values inside the range produced by the `sigest` function in the `kernlab` package. This code will fit 20 random sub-models in this range:

```
> rand_ctrl <- trainControl(method = "repeatedcv", repeats = 5,
+ search = "random")
>
> set.seed(308)
> rand_search <- train(y ~ ., data = train_dat,
+ method = "svmRadial",
+ ## Create 20 random parameter values
+ tuneLength = 20,
+ metric = "RMSE",
+ preProc = c("center", "scale"),
+ trControl = rand_ctrl)
```

`> rand_search`

```
Support Vector Machines with Radial Basis Function Kernel
250 samples
20 predictor
Pre-processing: centered (20), scaled (20)
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 226, 224, 224, 225, 226, 224, ...
Resampling results across tuning parameters:
sigma C RMSE Rsquared
0.01161955 42.75789360 10.50838 0.7299837
0.01357777 67.97672171 10.71276 0.7212605
0.01392676 828.08072944 10.75235 0.7195869
0.01394119 0.18386619 18.56921 0.2109284
0.01538656 0.05224914 19.33310 0.1890599
0.01711920 228.59215128 11.09522 0.7047713
0.01790202 0.78835920 16.78597 0.3217203
0.01936110 0.91401289 16.45485 0.3492278
0.02023763 0.07658831 19.03987 0.2081059
0.02690269 0.04128731 19.33974 0.2126950
0.02780880 0.64865483 16.52497 0.3545042
0.02920113 974.08943821 12.22906 0.6508754
0.02963586 1.19350198 15.46690 0.4407725
0.03370625 31.45179445 12.60653 0.6314384
0.03561750 0.04970422 19.23564 0.2306298
0.03752561 0.06592800 19.07130 0.2375616
0.03783570 398.44599747 12.92958 0.6143790
0.04534046 3.91017571 13.56612 0.5798001
0.05171719 296.65916049 13.88865 0.5622445
0.06482201 47.31716568 14.66904 0.5192667
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.01161955 and C
= 42.75789.
```

`> ggplot(rand_search) + scale_x_log10() + scale_y_log10()`

`> getTrainPerf(rand_search)`

```
TrainRMSE TrainRsquared method
1 10.50838 0.7299837 svmRadial
```

There are other approaches that we can take, including a more comprehensive grid search or using a nonlinear optimizer to find better values of cost and `sigma`. Another approach is to use Bayesian optimization to find good values for these parameters. This is an optimization scheme that uses Bayesian models based on Gaussian processes to predict good tuning parameters.

Gaussian process (GP) regression is used to facilitate the Bayesian analysis. It creates a regression model to formalize the relationship between the outcome (RMSE, in this application) and the SVM tuning parameters. The standard assumption regarding normality of the residuals is used and, this being a Bayesian model, the regression parameters also gain a prior distribution that is multivariate normal. The GP regression model uses a kernel basis expansion (much like the SVM model does) to allow the model to be nonlinear in the SVM tuning parameters. To do this, a radial basis function kernel is used for the covariance function of the multivariate normal prior, and maximum likelihood is used to estimate the kernel parameters of the GP.
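To make the mechanics concrete, here is a minimal noise-free GP regression sketch in base R with a fixed RBF length-scale. (GPfit estimates the kernel parameters by maximum likelihood; the length-scale here is fixed by hand and the nugget is only for numerical stability.)

```r
# RBF covariance between two sets of 1-D inputs
rbf <- function(a, b, l = 1) exp(-outer(a, b, "-")^2 / (2 * l^2))

# Posterior mean and sd of a GP at new points, given observations (x, y)
gp_predict <- function(x, y, xnew, l = 1, nugget = 1e-8) {
  K   <- rbf(x, x, l) + diag(nugget, length(x))
  Ks  <- rbf(xnew, x, l)
  Kss <- rbf(xnew, xnew, l)
  mu  <- as.vector(Ks %*% solve(K, y))
  v   <- Kss - Ks %*% solve(K, t(Ks))
  list(mean = mu, sd = sqrt(pmax(diag(v), 0)))
}

x   <- c(-2, -1, 0, 1, 2)
fit <- gp_predict(x, sin(x), xnew = c(0, 0.5))
# At the observed point 0 the mean reproduces sin(0) with near-zero sd;
# at 0.5 the sd is larger, reflecting distance from the data
```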

In the end, the GP regression model can take the current set of resampled RMSE values and make predictions over the entire space of potential cost and `sigma` parameters. The Bayesian machinery allows this prediction to have a *distribution*; for a given set of tuning parameters, we can obtain the estimated mean RMSE value as well as an estimate of the corresponding prediction variance. For example, if we were to use our data from the random search to build a GP model, the predicted mean RMSE would look like:

The darker regions indicate smaller RMSE values given the current resampling results. The predicted standard deviation of the RMSE is:

The prediction noise becomes larger (e.g. darker) as we move away from the current set of observed values.

(The `GPfit` package was used to create these models.)

To find good parameters to test, there are several approaches. This paper (pdf) outlines several, but we will use the *confidence bound* approach. For any combination of cost and `sigma`, we can compute the lower confidence bound of the predicted RMSE. Since this takes the uncertainty of prediction into account, it has the potential to produce better directions to take the optimization. Here is a plot of the confidence bound using a single standard deviation of the predicted mean:

Darker values indicate better conditions to explore. Since we know the true RMSE surface, we can see that the best region (the northwest) is estimated to be an interesting location to take the optimization. The optimizer would pick a good location based on this model and evaluate this as the next parameter value. This most recent configuration is added to the GP’s training set and the process continues for a pre-specified number of iterations.
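The acquisition step just described is simple arithmetic on the GP predictions: subtract a multiple of the predicted standard deviation from the predicted mean RMSE and explore wherever the result is lowest. A sketch, with `kappa = 1` matching the single-standard-deviation bound plotted above (the candidate values are made up for illustration):

```r
# Lower confidence bound of the predicted RMSE at each candidate setting
lower_conf_bound <- function(mu, sigma, kappa = 1) mu - kappa * sigma

mu    <- c(10.2, 9.1, 9.4)  # GP-predicted mean RMSE at three candidate settings
sigma <- c(0.1, 0.8, 0.2)   # GP-predicted sd at the same settings
which.min(lower_conf_bound(mu, sigma))  # candidate 2: good mean, high uncertainty
```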

Yachen Yan created an R package for Bayesian optimization. He also made a modification so that we can use our initial random search as the substrate for the first GP. To search a much wider parameter space, our code looks like:

```
> ## Define the resampling method
> ctrl <- trainControl(method = "repeatedcv", repeats = 5)
>
> ## Use this function to optimize the model. The two parameters are
> ## evaluated on the log scale given their range and scope.
> svm_fit_bayes <- function(logC, logSigma) {
+ ## Use the same model code but for a single (C, sigma) pair.
+ txt <- capture.output(
+ mod <- train(y ~ ., data = train_dat,
+ method = "svmRadial",
+ preProc = c("center", "scale"),
+ metric = "RMSE",
+ trControl = ctrl,
+ tuneGrid = data.frame(C = exp(logC), sigma = exp(logSigma)))
+ )
+ ## The function wants to _maximize_ the outcome so we return
+ ## the negative of the resampled RMSE value. `Pred` can be used
+ ## to return predicted values but we'll avoid that and use zero
+ list(Score = -getTrainPerf(mod)[, "TrainRMSE"], Pred = 0)
+ }
>
> ## Define the bounds of the search.
> lower_bounds <- c(logC = -5, logSigma = -9)
> upper_bounds <- c(logC = 20, logSigma = -0.75)
> bounds <- list(logC = c(lower_bounds[1], upper_bounds[1]),
+ logSigma = c(lower_bounds[2], upper_bounds[2]))
>
> ## Create a grid of values as the input into the BO code
> initial_grid <- rand_search$results[, c("C", "sigma", "RMSE")]
> initial_grid$C <- log(initial_grid$C)
> initial_grid$sigma <- log(initial_grid$sigma)
> initial_grid$RMSE <- -initial_grid$RMSE
> names(initial_grid) <- c("logC", "logSigma", "Value")
>
> ## Run the optimization with the initial grid and do
> ## 30 iterations. We will choose new parameter values
> ## using the upper confidence bound using 1 std. dev.
>
> library(rBayesianOptimization)
>
> set.seed(8606)
> ba_search <- BayesianOptimization(svm_fit_bayes,
+ bounds = bounds,
+ init_grid_dt = initial_grid,
+ init_points = 0,
+ n_iter = 30,
+ acq = "ucb",
+ kappa = 1,
+ eps = 0.0,
+ verbose = TRUE)
```

```
20 points in hyperparameter space were pre-sampled
elapsed = 1.53 Round = 21 logC = 5.4014 logSigma = -5.8974 Value = -10.8148
elapsed = 1.54 Round = 22 logC = 4.9757 logSigma = -5.0449 Value = -9.7936
elapsed = 1.42 Round = 23 logC = 5.7551 logSigma = -5.0244 Value = -9.8128
elapsed = 1.30 Round = 24 logC = 5.2754 logSigma = -4.9678 Value = -9.7530
elapsed = 1.39 Round = 25 logC = 5.3009 logSigma = -5.0921 Value = -9.5516
elapsed = 1.48 Round = 26 logC = 5.3240 logSigma = -5.2313 Value = -9.6571
elapsed = 1.39 Round = 27 logC = 5.3750 logSigma = -5.1152 Value = -9.6619
elapsed = 1.44 Round = 28 logC = 5.2356 logSigma = -5.0969 Value = -9.4167
elapsed = 1.38 Round = 29 logC = 11.8347 logSigma = -5.1074 Value = -9.6351
elapsed = 1.42 Round = 30 logC = 15.7494 logSigma = -5.1232 Value = -9.4243
elapsed = 25.24 Round = 31 logC = 14.6657 logSigma = -7.9164 Value = -8.8410
elapsed = 32.60 Round = 32 logC = 18.3793 logSigma = -8.1083 Value = -8.7139
elapsed = 1.86 Round = 33 logC = 20.0000 logSigma = -5.6297 Value = -9.0580
elapsed = 0.97 Round = 34 logC = 20.0000 logSigma = -1.5768 Value = -19.2183
elapsed = 5.92 Round = 35 logC = 17.3827 logSigma = -6.6880 Value = -9.0224
elapsed = 18.01 Round = 36 logC = 20.0000 logSigma = -7.6071 Value = -8.5728
elapsed = 114.49 Round = 37 logC = 16.0079 logSigma = -9.0000 Value = -8.7058
elapsed = 89.31 Round = 38 logC = 12.8319 logSigma = -9.0000 Value = -8.6799
elapsed = 99.29 Round = 39 logC = 20.0000 logSigma = -9.0000 Value = -8.5596
elapsed = 106.88 Round = 40 logC = 14.1190 logSigma = -9.0000 Value = -8.5150
elapsed = 4.84 Round = 41 logC = 13.4694 logSigma = -6.5271 Value = -8.9728
elapsed = 108.37 Round = 42 logC = 19.0216 logSigma = -9.0000 Value = -8.7461
elapsed = 52.43 Round = 43 logC = 13.5273 logSigma = -8.5130 Value = -8.8728
elapsed = 39.69 Round = 44 logC = 20.0000 logSigma = -8.3288 Value = -8.4956
elapsed = 5.99 Round = 45 logC = 20.0000 logSigma = -6.7208 Value = -8.9455
elapsed = 113.01 Round = 46 logC = 14.9611 logSigma = -9.0000 Value = -8.7576
elapsed = 27.45 Round = 47 logC = 19.6181 logSigma = -7.9872 Value = -8.6186
elapsed = 116.00 Round = 48 logC = 17.3060 logSigma = -9.0000 Value = -8.6820
elapsed = 2.26 Round = 49 logC = 14.2698 logSigma = -5.8297 Value = -9.1837
elapsed = 64.50 Round = 50 logC = 20.0000 logSigma = -8.6438 Value = -8.6914
Best Parameters Found:
Round = 44 logC = 20.0000 logSigma = -8.3288 Value = -8.4956
```

Animate the search!

The final settings were found at iteration 44, with a cost setting of 485,165,195 and `sigma` = 0.0002043. I would never have thought to evaluate a cost parameter so large, and the algorithm wants to make it even larger. Does it really work?

We can fit a model based on the new configuration and compare it to random search in terms of the resampled RMSE and the RMSE on the test set:

```
> set.seed(308)
> final_search <- train(y ~ ., data = train_dat,
+ method = "svmRadial",
+ tuneGrid = data.frame(C = exp(ba_search$Best_Par["logC"]),
+ sigma = exp(ba_search$Best_Par["logSigma"])),
+ metric = "RMSE",
+ preProc = c("center", "scale"),
+ trControl = ctrl)
> compare_models(final_search, rand_search)
```

```
One Sample t-test
data: x
t = -9.0833, df = 49, p-value = 4.431e-12
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-2.34640 -1.49626
sample estimates:
mean of x
-1.92133
```

`> postResample(predict(rand_search, large_dat), large_dat$y)`

```
RMSE Rsquared
10.1112280 0.7648765
```

`> postResample(predict(final_search, large_dat), large_dat$y)`

```
RMSE Rsquared
8.2668843 0.8343405
```

Much Better!

Thanks to Yachen Yan for making the `rBayesianOptimization` package.

by Joseph Rickert

The model table on the caret package website lists more than 200 variations of predictive analytics models that are available within the caret framework. All of these models may be prepared, tuned, fit and evaluated with a common set of caret functions. All on its own, the table is an impressive testament to the utility and scope of the R language as a data science tool.

For the past year or so, xgboost, the extreme gradient boosting algorithm, has been getting a lot of attention. The code below compares gbm with xgboost using the segmentationData set that comes with caret. The analysis presented here is far from the last word on comparing these models, but it does show how one might go about setting up a serious comparison: using caret's functions to sweep through parameter space with parallel programming, and then using synchronized bootstrap samples to make a detailed comparison.

After reading in the data and dividing it into training and test data sets, caret's trainControl() and expand.grid() functions are used to set up training of the gbm model on all of the parameter combinations represented in the data frame built by expand.grid(). The train() function then does the actual training and fitting of the model. Notice that all of this happens in parallel: the Task Manager on my Windows 10 laptop shows all four cores maxed out at 100%.

After model fitting, predictions on the test data are computed and an ROC curve is drawn in the usual way. The AUC for gbm was computed to be 0.8731. Here is the ROC curve.

Next, a similar process for xgboost computes the AUC to be 0.8857, a fair improvement. The following plot shows how the ROC measure behaves with increasing tree depth for the two different values of the shrinkage parameter.

The final section of code shows how caret can be used to compare the two models using the bootstrap samples that were created in the process of constructing them. The boxplots show that xgboost has the edge, although gbm has a tighter distribution.

The next step, which I hope to take soon, is to rerun the analysis with more complete grids of tuning parameters. For a very accessible introduction to caret, have a look at Max Kuhn's 2013 useR! tutorial.

```r
# COMPARE XGBOOST with GBM

### Packages Required
library(caret)
library(corrplot)    # plot correlations
library(doParallel)  # parallel processing
library(dplyr)       # used by caret
library(gbm)         # GBM models
library(pROC)        # plot the ROC curve
library(xgboost)     # extreme gradient boosting

### Get the Data
# Load the data and construct indices to divide it into training and test data sets.
data(segmentationData)   # load the segmentation data set
dim(segmentationData)
head(segmentationData, 2)

trainIndex <- createDataPartition(segmentationData$Case, p = .5, list = FALSE)
trainData <- segmentationData[trainIndex, -c(1, 2)]
testData  <- segmentationData[-trainIndex, -c(1, 2)]

trainX <- trainData[, -1]  # pull out the dependent variable
testX  <- testData[, -1]
sapply(trainX, summary)    # look at a summary of the training data

## GENERALIZED BOOSTED REGRESSION MODEL (GBM)
# Set up training control
ctrl <- trainControl(method = "repeatedcv",  # repeated cross-validation
                     number = 5,             # 5 folds
                     summaryFunction = twoClassSummary,  # use AUC to pick the best model
                     classProbs = TRUE,
                     allowParallel = TRUE)

# Use expand.grid to specify the search space.
# Note that the default search grid selects multiple values of each tuning parameter.
grid <- expand.grid(interaction.depth = c(1, 2),  # depth of variable interactions
                    n.trees = c(10, 20),          # number of trees to fit
                    shrinkage = c(0.01, 0.1),     # try 2 values of the learning rate
                    n.minobsinnode = 20)

set.seed(1951)  # set the seed

# Set up to do parallel processing
registerDoParallel(4)  # register a parallel backend for train
getDoParWorkers()

gbm.tune <- train(x = trainX, y = trainData$Class,
                  method = "gbm",
                  metric = "ROC",
                  trControl = ctrl,
                  tuneGrid = grid,
                  verbose = FALSE)

# Look at the tuning results.
# Note that ROC was the performance criterion used to select the optimal model.
gbm.tune$bestTune
plot(gbm.tune)          # plot the performance of the training models
res <- gbm.tune$results
res

### GBM Model Predictions and Performance
# Make predictions using the test data set
gbm.pred <- predict(gbm.tune, testX)

# Look at the confusion matrix
confusionMatrix(gbm.pred, testData$Class)

# Draw the ROC curve
gbm.probs <- predict(gbm.tune, testX, type = "prob")
head(gbm.probs)

gbm.ROC <- roc(predictor = gbm.probs$PS,
               response = testData$Class,
               levels = rev(levels(testData$Class)))
gbm.ROC$auc  # Area under the curve: 0.8731
plot(gbm.ROC, main = "GBM ROC")

# Plot the probability of poor segmentation
histogram(~gbm.probs$PS | testData$Class, xlab = "Probability of Poor Segmentation")

## ----------------------------------------------
## XGBOOST
# Some stackexchange guidance for xgboost:
# http://stats.stackexchange.com/questions/171043/how-to-tune-hyperparameters-of-xgboost-trees

# Set up for parallel processing
set.seed(1951)
registerDoParallel(4, cores = 4)
getDoParWorkers()

# Train xgboost
xgb.grid <- expand.grid(nrounds = 500,       # the maximum number of iterations
                        eta = c(0.01, 0.1),  # shrinkage
                        max_depth = c(2, 6, 10))

xgb.tune <- train(x = trainX, y = trainData$Class,
                  method = "xgbTree",
                  metric = "ROC",
                  trControl = ctrl,
                  tuneGrid = xgb.grid)

xgb.tune$bestTune
plot(xgb.tune)          # plot the performance of the training models
res <- xgb.tune$results
res

### xgboost Model Predictions and Performance
# Make predictions using the test data set
xgb.pred <- predict(xgb.tune, testX)

# Look at the confusion matrix
confusionMatrix(xgb.pred, testData$Class)

# Draw the ROC curve
xgb.probs <- predict(xgb.tune, testX, type = "prob")
# head(xgb.probs)

xgb.ROC <- roc(predictor = xgb.probs$PS,
               response = testData$Class,
               levels = rev(levels(testData$Class)))
xgb.ROC$auc  # Area under the curve: 0.8857
plot(xgb.ROC, main = "xgboost ROC")

# Plot the probability of poor segmentation
histogram(~xgb.probs$PS | testData$Class, xlab = "Probability of Poor Segmentation")

# Comparing Multiple Models
# Having set the same seed before running gbm.tune and xgb.tune,
# we have generated paired samples and are in a position to compare models
# using a resampling technique.
# (See Hothorn et al., "The design and analysis of benchmark experiments",
# Journal of Computational and Graphical Statistics (2005), vol 14 (3), pp 675-699.)
rValues <- resamples(list(xgb = xgb.tune, gbm = gbm.tune))
rValues$values
summary(rValues)

bwplot(rValues, metric = "ROC", main = "GBM vs xgboost")   # boxplot
dotplot(rValues, metric = "ROC", main = "GBM vs xgboost")  # dotplot
# splom(rValues, metric = "ROC")
```

Apparently, people have strong feelings about how pasta carbonara should be made. A 45-second French video showing a one-pot preparation of the dish with farfalle instead of spaghetti and substituting crème fraîche for most of the cheese — and not even stirring the egg into the pasta to cook it! and it was just the yolk! — caused outrage in Italy: “It’s as if we had made bourguignon with lamb and white wine!”, said one French-Italian chef.

But while some may prefer the original Italian preparation (yours truly included — writing this is making me hungry), there are clearly plenty of variations on the theme in the culinary world. Salvino A. Salvaggio (who clearly falls into the traditionalist camp) scoured YouTube for all of the videos explaining how to make pasta carbonara. After eliminating those with fewer than 10,000 views and those straying too far from traditional carbonara (e.g. vegan preparations), Salvaggio used the R language to tally up the ingredients used in the 165 videos that remained. Here are the results:

Salvaggio forgives the use of bacon plus some olive oil instead of the traditional guanciale (pork cheeks), and parmesan is a close substitute for pecorino. But he has some strong words for the other interlopers:

However, using cream, garlic, egg yolks (instead of full eggs), onion, butter or white wine would be considered as a culinary heresy by a vast majority of cuisine amateurs. And it is shocking to notice that almost 50% of all recipes suggest to use some kind of cream or milk, while this is by no means part of the traditional recipe.

You can find Salvaggio's preferred recipe for pasta carbonara, along with the R code and data supporting the analysis, in his complete post linked below.

Salvino A. Salvaggio: It’s Not Rocket Science, Just Pasta Carbonara

*by Katherine Zhao, Hong Lu, Zhongmou Li, Data Scientists at Microsoft*

Bicycle rental has become popular as a convenient and environmentally friendly transportation option. Accurate estimation of bike demand at different locations and different times would help bicycle-sharing systems better meet rental demand and allocate bikes to locations.

In this blog post, we walk through how to use Microsoft R Server (MRS) to build a regression model to predict bike rental demand. In the example below, we demonstrate an end-to-end machine learning solution development process in MRS, including data importing, data cleaning, feature engineering, parameter sweeping, and model training and evaluation.

The Bike Rental UCI dataset is used as the input raw data for this sample. This dataset is based on real-world data from the Capital Bikeshare company, which operates a bike rental network in Washington DC in the United States.

The dataset contains 17,379 rows and 17 columns, with each row representing the number of bike rentals within a specific hour of a day in the years 2011 or 2012. Weather conditions (such as temperature, humidity, and wind speed) are included in the raw data, and the dates are categorized as holiday vs. weekday, etc.

The field to predict is **cnt**, which contains a count value ranging from 1 to 977, representing the number of bike rentals within a specific hour.

In this example, we use historical bike rental counts as well as the weather condition data to predict the number of bike rentals within a specific hour in the future. We approach this problem as a regression problem, since the label column (number of rentals) contains continuous real numbers.

Along these lines, we split the raw data into two parts: data records in year 2011 to learn the regression model, and data records in year 2012 to score and evaluate the model. Specifically, we employ the Decision Forest Regression algorithm as the regression model and build two models on different feature sets. Finally, we evaluate their prediction performance. We elaborate on the details in the following sections.

We build the models using the `RevoScaleR` library in MRS. The `RevoScaleR` library provides extremely fast statistical analysis on terabyte-class datasets without needing specialized hardware. `RevoScaleR`'s distributed computing capabilities can use a different (possibly remote) computing context with the same `RevoScaleR` commands used to manage and analyze local data. A wide range of `rx`-prefixed functions provide functionality for:

- Accessing external data sets (SAS, SPSS, ODBC, Teradata, and delimited and fixed format text) for analysis in R.
- Efficiently storing and retrieving data in a high-performance data file.
- Cleaning, exploring, and manipulating data.
- Fast, basic statistical analysis.
- Training and scoring advanced machine learning models.
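As a flavor of how these `rx` functions fit together, here is a minimal sketch (the file names `bikes.csv` and `bikes.xdf` are hypothetical stand-ins): import a delimited text file into the high-performance `.xdf` format, then summarize a column directly from the `.xdf` file.

```r
library(RevoScaleR)

# Import a CSV into an .xdf file (processed chunk by chunk, so the
# data never has to fit in memory all at once).
rxImport(inData = "bikes.csv", outFile = "bikes.xdf", overwrite = TRUE)

# Run a basic statistical summary against the .xdf data source.
rxSummary(~ cnt, data = RxXdfData("bikes.xdf"))
```

The same `rxSummary()` call works unchanged if the compute context is later switched to a remote cluster, which is the pattern the rest of this example relies on.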

Overall, there are five major steps of building this example using Microsoft R Server:

- Step 1: Import and Clean Data
- Step 2: Perform Feature Engineering
- Step 3: Prepare Training, Test and Score Datasets
- Step 4: Sweep Parameters and Train Regression Models
- Step 5: Test, Evaluate, and Compare Models

First, we import the Bike Rental UCI dataset. Since a small portion of the records in the dataset are missing values, we use `rxDataStep()` to replace them with the latest non-missing observations. `rxDataStep()` is a commonly used function for data manipulation: it transforms the input dataset chunk by chunk and saves the results to the output dataset.

```r
# Define the transformation function for rxDataStep.
xform <- function(dataList) {
    # Identify the features with missing values.
    featureNames <- c("weathersit", "temp", "atemp", "hum", "windspeed", "cnt")
    # Use the "na.locf" function to carry forward the last observation.
    dataList[featureNames] <- lapply(dataList[featureNames], zoo::na.locf)
    # Return the data list.
    return(dataList)
}

# Use rxDataStep to replace missings with the latest non-missing observations.
cleanXdf <- rxDataStep(inData = mergeXdf, outFile = outFileClean, overwrite = TRUE,
                       # Apply the "last observation carried forward" operation.
                       transformFunc = xform,
                       # Identify the features to apply the transformation to.
                       transformVars = c("weathersit", "temp", "atemp",
                                         "hum", "windspeed", "cnt"),
                       # Drop the "dteday" feature.
                       varsToDrop = "dteday")
```

**Step 2: Perform Feature Engineering**

In addition to the original features in the raw data, we add the number of bikes rented in each of the previous 12 hours as features to provide better predictive power. We create a `computeLagFeatures()` helper function to compute the 12 lag features and use it as the transformation function in `rxDataStep()`.

Note that `rxDataStep()` processes data chunk by chunk, and lag feature computation requires data from previous rows. In `computeLagFeatures()`, we use the internal function `.rxSet()` to save the last *n* rows of a chunk to a variable **lagData**. When processing the next chunk, we use another internal function, `.rxGet()`, to retrieve the values stored in **lagData**.

```r
# Add the number of bikes that were rented in each of
# the previous 12 hours as 12 lag features.
computeLagFeatures <- function(dataList) {
    # Total number of lags that need to be added.
    numLags <- length(nLagsVector)
    # Lag feature names, as lagN.
    varLagNameVector <- paste("cnt_", nLagsVector, "hour", sep = "")
    # Set the value of an object "storeLagData" in the transform environment.
    if (!exists("storeLagData")) {
        lagData <- mapply(rep, dataList[[varName]][1], times = nLagsVector)
        names(lagData) <- varLagNameVector
        .rxSet("storeLagData", lagData)
    }
    if (!.rxIsTestChunk) {
        for (iL in 1:numLags) {
            # Number of rows in the current chunk.
            numRowsInChunk <- length(dataList[[varName]])
            nlags <- nLagsVector[iL]
            varLagName <- paste("cnt_", nlags, "hour", sep = "")
            # Retrieve lag data from the previous chunk.
            lagData <- .rxGet("storeLagData")
            # Concatenate lagData and the "cnt" feature.
            allData <- c(lagData[[varLagName]], dataList[[varName]])
            # Take the first N rows of allData, where N is
            # the total number of rows in the original dataList.
            dataList[[varLagName]] <- allData[1:numRowsInChunk]
            # Save the last nlags rows as the new lagData to be used
            # when processing the next chunk.
            lagData[[varLagName]] <- tail(allData, nlags)
            .rxSet("storeLagData", lagData)
        }
    }
    return(dataList)
}

# Apply "computeLagFeatures" to the bike data.
lagXdf <- rxDataStep(inData = cleanXdf, outFile = outFileLag,
                     transformFunc = computeLagFeatures,
                     transformObjects = list(varName = "cnt",
                                             nLagsVector = seq(12)),
                     transformVars = "cnt", overwrite = TRUE)
```

Before training the regression model, we split data into two parts: data records in year 2011 to learn the regression model, and data records in year 2012 to score and evaluate the model. In order to obtain the best combination of parameters for regression models, we further divide year 2011 data into training and test datasets: 80% of the records are randomly selected to train regression models with various combinations of parameters, and the remaining 20% are used to evaluate the models obtained and determine the optimal combination.

```r
# Split data by "yr" so that the training and test data contain records
# for the year 2011 and the score data contains records for 2012.
rxSplit(inData = lagXdf, outFilesBase = paste0(td, "/modelData"),
        splitByFactor = "yr", overwrite = TRUE,
        reportProgress = 0, verbose = 0)

# Point to the .xdf files for the training & test set and the score set.
trainTest <- RxXdfData(paste0(td, "/modelData.yr.0.xdf"))
score <- RxXdfData(paste0(td, "/modelData.yr.1.xdf"))

# Randomly split records for the year 2011 into training and test sets
# for sweeping parameters: 80% of the data as training and 20% as test.
rxSplit(inData = trainTest, outFilesBase = paste0(td, "/sweepData"),
        outFileSuffixes = c("Train", "Test"), splitByFactor = "splitVar",
        overwrite = TRUE,
        transforms = list(splitVar = factor(
            sample(c("Train", "Test"), size = .rxNumRows,
                   replace = TRUE, prob = c(.80, .20)),
            levels = c("Train", "Test"))),
        rngSeed = 17, consoleOutput = TRUE)

# Point to the .xdf files for the training and test sets.
train <- RxXdfData(paste0(td, "/sweepData.splitVar.Train.xdf"))
test <- RxXdfData(paste0(td, "/sweepData.splitVar.Test.xdf"))
```

In this step, we construct two training datasets based on the same raw input data, but with different sets of features:

- Set A = weather + holiday + weekday + weekend features for the predicted day
- Set B = Set A + number of bikes rented in each of the previous 12 hours, which captures very recent demand for the bikes

In order to perform parameter sweeping, we create a helper function to evaluate the performance of a model trained with a given combination of number of trees and maximum depth. We use *Root Mean Squared Error (RMSE)* as the evaluation metric.

```r
# Define a function to train and test models with given parameters
# and then return Root Mean Squared Error (RMSE) as the performance metric.
TrainTestDForestfunction <- function(trainData, testData, form, numTrees, maxD) {
    # Build a decision forest regression model with the given parameters.
    dForest <- rxDForest(form, data = trainData, method = "anova",
                         maxDepth = maxD, nTree = numTrees, seed = 123)
    # Predict the number of bike rentals on the test data.
    rxPredict(dForest, data = testData, predVarNames = "cnt_Pred",
              residVarNames = "cnt_Resid", overwrite = TRUE,
              computeResiduals = TRUE)
    # Calculate the RMSE.
    result <- rxSummary(~ cnt_Resid, data = testData,
                        summaryStats = "Mean",
                        transforms = list(cnt_Resid = cnt_Resid^2)
                        )$sDataFrame
    # Return the number of trees, the maximum depth and the RMSE.
    return(c(numTrees, maxD, sqrt(result[1, 2])))
}
```

The following is another helper function to sweep and select the optimal parameter combination. Under a local parallel compute context (`rxSetComputeContext(RxLocalParallel())`), `rxExec()` executes multiple runs of model training and evaluation with different parameters in parallel, which significantly speeds up parameter sweeping. When used in a compute context with multiple nodes, e.g. high-performance computing clusters and Hadoop, `rxExec()` can be used to distribute a large number of tasks to the nodes and run them in parallel.

```r
# Define a function to sweep and select the optimal parameter combination.
findOptimal <- function(DFfunction, train, test, form, nTreeArg, maxDepthArg) {
    # Sweep different combinations of parameters.
    sweepResults <- rxExec(DFfunction, train, test, form,
                           rxElemArg(nTreeArg), rxElemArg(maxDepthArg))
    # Sort the nested list by the third element (RMSE) in ascending order.
    sortResults <- sweepResults[order(unlist(lapply(sweepResults, `[[`, 3)))]
    # Select the optimal parameter combination.
    nTreeOptimal <- sortResults[[1]][1]
    maxDepthOptimal <- sortResults[[1]][2]
    # Return the optimal values.
    return(c(nTreeOptimal, maxDepthOptimal))
}
```

A large number of parameter combinations are usually swept through in the modeling process. For demonstration purposes, we use 9 combinations of parameters in this example.

Next, we find the best parameter combination and get the optimal regression model for each training dataset. For simplicity, we only present the process for Set A.

```r
# Set A = weather + holiday + weekday + weekend features for the predicted day.
# Build a formula for the regression model, dropping "splitVar"
# (used to split the training and test data) and the hourly lag features.
newHourFeatures <- paste("cnt_", seq(12), "hour", sep = "")  # the hourly lags
formA <- formula(train, depVars = "cnt",
                 varsToDrop = c("splitVar", newHourFeatures))

# Find the optimal parameters for Set A.
optimalResultsA <- findOptimal(TrainTestDForestfunction,
                               train, test, formA,
                               numTreesToSweep, maxDepthToSweep)

# Use the optimal parameters to fit a model for feature Set A.
nTreeOptimalA <- optimalResultsA[[1]]
maxDepthOptimalA <- optimalResultsA[[2]]
dForestA <- rxDForest(formA, data = trainTest, method = "anova",
                      maxDepth = maxDepthOptimalA, nTree = nTreeOptimalA,
                      importance = TRUE, seed = 123)
```

Finally, we plot the dot charts of the variable importance and the out-of-bag error rates for the two optimal decision forest models.

In this step, we use the `rxPredict()` function to predict the bike rental demand on the score dataset, and compare the two regression models over three performance metrics: *Mean Absolute Error (MAE)*, *Root Mean Squared Error (RMSE)*, and *Relative Absolute Error (RAE)*.
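The three metrics reduce to simple formulas. As a quick base-R cross-check of what the `rxSummary()` transforms compute (the vectors `obs` and `pred` are hypothetical stand-ins for observed and predicted counts; RAE here follows the per-row `abs(residual)/cnt` definition used in this example rather than the textbook one):

```r
# Base-R versions of the three evaluation metrics.
mae  <- function(obs, pred) mean(abs(obs - pred))      # Mean Absolute Error
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2)) # Root Mean Squared Error
rae  <- function(obs, pred) mean(abs(obs - pred) / obs) # Relative Absolute Error

obs  <- c(100, 200, 400)
pred <- c( 90, 210, 360)
mae(obs, pred)   # 20
rmse(obs, pred)  # sqrt(600), about 24.5
```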

```r
# Set A: predict the number of rentals on the score dataset.
rxPredict(dForestA, data = score, predVarNames = "cnt_Pred_A",
          residVarNames = "cnt_Resid_A", overwrite = TRUE,
          computeResiduals = TRUE)

# Set B: predict the number of rentals on the score dataset.
rxPredict(dForestB, data = score, predVarNames = "cnt_Pred_B",
          residVarNames = "cnt_Resid_B", overwrite = TRUE,
          computeResiduals = TRUE)

# Calculate three statistical metrics:
# Mean Absolute Error (MAE), Root Mean Squared Error (RMSE),
# and Relative Absolute Error (RAE).
sumResults <- rxSummary(~ cnt_Resid_A_abs + cnt_Resid_A_2 + cnt_rel_A +
                          cnt_Resid_B_abs + cnt_Resid_B_2 + cnt_rel_B,
                        data = score, summaryStats = "Mean",
                        transforms = list(cnt_Resid_A_abs = abs(cnt_Resid_A),
                                          cnt_Resid_A_2 = cnt_Resid_A^2,
                                          cnt_rel_A = abs(cnt_Resid_A)/cnt,
                                          cnt_Resid_B_abs = abs(cnt_Resid_B),
                                          cnt_Resid_B_2 = cnt_Resid_B^2,
                                          cnt_rel_B = abs(cnt_Resid_B)/cnt)
                        )$sDataFrame

# Add row names.
features <- c("baseline: weather + holiday + weekday + weekend features for the predicted day",
              "baseline + previous 12 hours demand")

# List all metrics in a data frame.
metrics <- data.frame(Features = features,
                      MAE = c(sumResults[1, 2], sumResults[4, 2]),
                      RMSE = c(sqrt(sumResults[2, 2]), sqrt(sumResults[5, 2])),
                      RAE = c(sumResults[3, 2], sumResults[6, 2]))
```

Based on all three metrics listed below, the regression model built on feature set B outperforms the one built on feature set A. This result is not surprising: as the variable importance chart shows, the lag features play a critical part in the regression model, and adding them leads to better performance.

Feature Set | MAE | RMSE | RAE |
---|---|---|---|
A | 101.34848 | 146.9973 | 0.9454142 |
B | 62.48245 | 105.6198 | 0.3737669 |

Follow this link for source code and datasets: Bike Rental Demand Estimation with Microsoft R Server

*by Yuzhou Song, Microsoft Data Scientist*

R is an open source statistical programming language with millions of users in its community. However, a well-known weakness of R is that it is both single-threaded and memory-bound, which limits its ability to process big data. With Microsoft R Server (MRS), the enterprise-grade distribution of R for advanced analytics, users can continue to work in their preferred R environment with the following benefits: the ability to scale to data of any size, and potential speed increases of up to one hundred times over open source R.

In this article, we give a walk-through on how to build a gradient boosted tree using MRS. We use a simple fraud data set with approximately 10 million records and 9 columns. The last column, "fraudRisk", is the tag: 0 stands for non-fraud and 1 stands for fraud. The following is a snapshot of the data.

**Step 1: Import the Data**

At the very beginning, load the "RevoScaleR" package and specify the directory and name of the data file. Note that this demo is run in local R, so we need to load the "RevoScaleR" package explicitly.

```r
library(RevoScaleR)

data.path <- "./"
file.name <- "ccFraud.csv"
fraud_csv_path <- file.path(data.path, file.name)
```

Next, we make a data source using the RxTextData function. The output of RxTextData is a data source declaring the type of each field, the name of each field and the location of the data.

```r
colClasses <- c("integer", "factor", "factor", "integer",
                "numeric", "integer", "integer", "numeric", "factor")
names(colClasses) <- c("custID", "gender", "state", "cardholder",
                       "balance", "numTrans", "numIntlTrans",
                       "creditLine", "fraudRisk")
fraud_csv_source <- RxTextData(file = fraud_csv_path, colClasses = colClasses)
```

With the data source ("fraud_csv_source" above), we are able to pass it as input to other functions, just like passing a data frame into regular R functions. For example, we can put it into the rxGetInfo function to check the information of the data:

```r
rxGetInfo(fraud_csv_source, getVarInfo = TRUE, numRows = 5)
```

and you will get the following:

**Step 2: Process Data**

Next, we demonstrate how to use an important function, rxDataStep, to process data before training. Generally, the rxDataStep function transforms data from an input data set to an output data set. There are three main arguments: inData, outFile and transformFunc. inData can take either a data source (made by RxTextData as shown in step 1) or a data frame. outFile takes a data source specifying the output file name, schema and location; if outFile is empty, rxDataStep returns a data frame. transformFunc takes a function that rxDataStep will use to perform the transformation. If that function needs objects beyond the input data source or frame, you may specify them in the transformObjects argument.

Here, we make a training flag using the rxDataStep function as an example. The data source fraud_csv_source made in step 1 will be used for inData. We create an output data source specifying the output file name "ccFraudFlag.csv":

```r
fraud_flag_csv_source <- RxTextData(file = file.path(data.path,
                                                     "ccFraudFlag.csv"))
```

Also, we create a simple transformation function called "make_train_flag". It creates a training flag which will be used to split the data into training and testing sets:

```r
make_train_flag <- function(data) {
    data <- as.data.frame(data)
    set.seed(34)
    data$trainFlag <- sample(c(0, 1), size = nrow(data),
                             replace = TRUE, prob = c(0.3, 0.7))
    return(data)
}
```

Then, use rxDataStep to complete the transformation:

```r
rxDataStep(inData = fraud_csv_source,
           outFile = fraud_flag_csv_source,
           transformFunc = make_train_flag, overwrite = TRUE)
```

Again, we can check the output file information by using the rxGetInfo function:

```r
rxGetInfo(fraud_flag_csv_source, getVarInfo = TRUE, numRows = 5)
```

We can see that the trainFlag column has been appended as the last column:

Based on the trainFlag, we split the data into training and testing sets. First, we specify the data sources for output:

```r
train_csv_source <- RxTextData(file = file.path(data.path, "train.csv"))
test_csv_source  <- RxTextData(file = file.path(data.path, "test.csv"))
```

Instead of creating a transformation function, we can simply specify the rowSelection argument of rxDataStep to select rows satisfying certain conditions:

```r
rxDataStep(inData = fraud_flag_csv_source,
           outFile = train_csv_source,
           reportProgress = 0,
           rowSelection = (trainFlag == 1),
           overwrite = TRUE)

rxDataStep(inData = fraud_flag_csv_source,
           outFile = test_csv_source,
           reportProgress = 0,
           rowSelection = (trainFlag == 0),
           overwrite = TRUE)
```

A well-known problem with fraud data is the extremely skewed distribution of labels: most transactions are legitimate while only a very small proportion are fraudulent. In the original data, the good/bad ratio is about 15:1. Directly using the original data to train a model results in poor performance, since the model is unable to find a proper boundary between "good" and "bad". A simple but effective solution is to randomly down-sample the majority class. The following down_sample transformation function down-samples the majority class to a good/bad ratio of 4:1. The down-sample ratio is pre-selected based on prior knowledge, but it can be fine-tuned based on cross validation as well.

```r
down_sample <- function(data) {
    data <- as.data.frame(data)
    data_bad <- subset(data, fraudRisk == 1)
    data_good <- subset(data, fraudRisk == 0)
    # good to bad ratio 4:1
    rate <- nrow(data_bad) * 4 / nrow(data_good)
    set.seed(34)
    data_good$keepTag <- sample(c(0, 1), replace = TRUE,
                                size = nrow(data_good),
                                prob = c(1 - rate, rate))
    data_good_down <- subset(data_good, keepTag == 1)
    data_good_down$keepTag <- NULL
    data_down <- rbind(data_bad, data_good_down)
    data_down$trainFlag <- NULL
    return(data_down)
}
```

Then, we specify the down-sampled training data source and use the rxDataStep function again to complete the down-sampling process:

```r
train_downsample_csv_source <- RxTextData(
    file = file.path(data.path, "train_downsample.csv"),
    colClasses = colClasses)

rxDataStep(inData = train_csv_source,
           outFile = train_downsample_csv_source,
           transformFunc = down_sample,
           reportProgress = 0,
           overwrite = TRUE)
```

**Step 3: Training**

In this step, we take the down-sampled data to train a gradient boosted tree. We first use the rxGetVarNames function to get all variable names in the training set; the input is still the data source of the down-sampled training data. Then we use it to create a formula which will be used later:

```r
training_vars <- rxGetVarNames(train_downsample_csv_source)
training_vars <- training_vars[!(training_vars %in% c("fraudRisk", "custID"))]
formula <- as.formula(paste("fraudRisk~",
                            paste(training_vars, collapse = "+")))
```

The rxBTrees function is used to build the gradient boosted tree model. The formula argument specifies the label column and the predictor columns. The data argument takes a data source as input for training. The lossFunction argument specifies the distribution of the label column: "bernoulli" for numerical 0/1 regression, "gaussian" for numerical regression, and "multinomial" for classification with two or more classes. Here we choose "multinomial" for this 0/1 classification problem. The other parameters are pre-selected, not finely tuned:

```r
boosted_fit <- rxBTrees(formula = formula,
                        data = train_downsample_csv_source,
                        learningRate = 0.2,
                        minSplit = 10,
                        minBucket = 10,
                        # small number of trees for testing purposes
                        nTree = 20,
                        seed = 5,
                        lossFunction = "multinomial",
                        reportProgress = 0)
```

**Step 4: Prediction and Evaluation**

We use the rxPredict function to predict on the testing data set, but first use the rxImport function to import it:

```r
test_data <- rxImport(test_csv_source)
```

Then, we take the imported testing set and the fitted model object as input for the rxPredict function:

```r
predictions <- rxPredict(modelObject = boosted_fit,
                         data = test_data,
                         type = "response",
                         overwrite = TRUE,
                         reportProgress = 0)
```

In the rxPredict function, type = "response" outputs predicted probabilities. Finally, we pick 0.5 as the threshold and evaluate the performance:

```r
threshold <- 0.5
predictions <- data.frame(predictions$X1_prob)
names(predictions) <- c("Boosted_Probability")
predictions$Boosted_Prediction <- ifelse(
    predictions$Boosted_Probability > threshold, 1, 0)
predictions$Boosted_Prediction <- factor(
    predictions$Boosted_Prediction,
    levels = c(1, 0))
scored_test_data <- cbind(test_data, predictions)

evaluate_model <- function(data, observed, predicted) {
    confusion <- table(data[[observed]], data[[predicted]])
    print(confusion)
    tp <- confusion[1, 1]
    fn <- confusion[1, 2]
    fp <- confusion[2, 1]
    tn <- confusion[2, 2]
    accuracy <- (tp + tn) / (tp + fn + fp + tn)
    precision <- tp / (tp + fp)
    recall <- tp / (tp + fn)
    fscore <- 2 * (precision * recall) / (precision + recall)
    metrics <- c("Accuracy" = accuracy,
                 "Precision" = precision,
                 "Recall" = recall,
                 "F-Score" = fscore)
    return(metrics)
}

roc_curve <- function(data, observed, predicted) {
    data <- data[, c(observed, predicted)]
    data[[observed]] <- as.numeric(as.character(data[[observed]]))
    rxRocCurve(actualVarName = observed,
               predVarNames = predicted,
               data = data)
}

boosted_metrics <- evaluate_model(data = scored_test_data,
                                  observed = "fraudRisk",
                                  predicted = "Boosted_Prediction")

roc_curve(data = scored_test_data,
          observed = "fraudRisk",
          predicted = "Boosted_Probability")
```

The confusion matrix:

```
          0        1
0   2752288    67659
1     77117   101796
```
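As a quick sanity check, plugging the printed cell counts into the same arithmetic as evaluate_model() above (the tp/fn/fp/tn labels follow that function's indexing):

```r
# Cell values copied from the printed confusion matrix,
# indexed the same way evaluate_model() reads them.
tp <- 2752288; fn <- 67659; fp <- 77117; tn <- 101796

(tp + tn) / (tp + fn + fp + tn)  # accuracy,  about 0.952
tp / (tp + fp)                   # precision, about 0.973
tp / (tp + fn)                   # recall,    about 0.976
```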

The ROC curve (AUC = 0.95):

**Summary:**

In this article, we demonstrated how to use MRS on fraud data: how to create a data source with the RxTextData function, make transformations using the rxDataStep function, import data using the rxImport function, train a gradient boosted tree model using the rxBTrees function, and predict using the rxPredict function.

You might think literary criticism is no place for statistical analysis, but given digital versions of the text you can, for example, use sentiment analysis to infer the dramatic arc of an Oscar Wilde novel. Now you can apply similar techniques to the works of Jane Austen thanks to Julia Silge's R package janeaustenr (available on CRAN). The package includes the full text of the 6 Austen novels, including *Pride and Prejudice* and *Sense and Sensibility*.

With the novels' text in hand, Julia then applied Bing sentiment analysis (as implemented in R's syuzhet package), shown here with annotations marking the major dramatic turns in the book:

There's quite a lot of noise in that chart, so Julia took the elegant step of using a low-pass Fourier transform to smooth the sentiment for all six novels, which allows for a comparison of the dramatic arcs:
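The smoothing idea can be sketched in a few lines of base R: transform the sentiment series with the FFT, zero out all but the lowest-frequency components, and invert the transform. This is a simplified stand-in for what the syuzhet package provides, and the "sentiment" series here is a made-up example, not Austen data:

```r
# Keep only the `keep` lowest-frequency Fourier components of a series
# (plus their conjugate mirrors at the end of the spectrum), then invert.
low_pass <- function(x, keep = 3) {
  f <- fft(x)
  n <- length(f)
  keep_idx <- c(seq_len(keep), if (keep > 1) n - seq_len(keep - 1) + 1)
  f[setdiff(seq_len(n), keep_idx)] <- 0
  # R's inverse fft is unnormalized, so divide by n.
  Re(fft(f, inverse = TRUE)) / n
}

# A noisy arc-shaped "sentiment" series and its smoothed version.
raw <- sin(seq(0, pi, length.out = 100)) + rnorm(100, sd = 0.3)
smoothed <- low_pass(raw, keep = 3)
```

With only a handful of components kept, the smoothed curve retains the broad dramatic arc while the chapter-to-chapter jitter disappears.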

An apparent Austen aficionada, Julia interprets the analysis:

This is super interesting to me.

*Emma* and *Northanger Abbey* have the most similar plot trajectories, with their tales of immature women who come to understand their own folly and grow up a bit. *Mansfield Park* and *Persuasion* also have quite similar shapes, which also is absolutely reasonable; both of these are more serious, darker stories with main characters who are a little melancholic. *Persuasion* also appears unique in starting out with near-zero sentiment and then moving to more dramatic shifts in plot trajectory; it is a markedly different story from Austen’s other works.

For more on the techniques of the analysis, including all the R code (plus some clever Austen-based puns), check out Julia's complete post linked below.

data science ish: If I Loved Natural Language Processing Less, I Might Be Able to Talk About It More

As I mentioned yesterday, Microsoft R Server is now available for HDInsight, which means that you can now run R code (including the big-data algorithms of Microsoft R Server) on a managed, cloud-based Hadoop instance.

Debraj GuhaThakurta, Senior Data Scientist, and Shauheen Zahirazami, Senior Machine Learning Engineer at Microsoft, demonstrate some of these capabilities in their analysis of 170M taxi trips in New York City in 2013 (about 40 GB). Their goal was to show the use of Microsoft R Server on an HDInsight Hadoop cluster, and to that end, they created machine learning models using distributed R functions to predict (1) whether a tip was given for a taxi ride (binary classification problem), and (2) the amount of tip given (regression problem). The analyses involved building and testing different kinds of predictive models. Debraj and Shauheen uploaded the NYC Taxi data to HDFS on Azure blob storage, provisioned an HDInsight Hadoop Cluster with 2 head nodes (D12), 4 worker nodes (D12), and 1 R Server node (D4), and installed RStudio Server on the HDInsight cluster to conveniently communicate with the cluster and drive the computations from R.

To predict the tip amount, Debraj and Shauheen used linear regression on the training set (75% of the full dataset, about 127M rows). Boosted Decision Trees were used to predict whether or not a tip was paid. On the held-out test data, both models did fairly well. The linear regression model was able to predict the actual tip amount with a correlation of 0.78 (see figure below). Also, the boosted decision tree performed well on the test data with an AUC of 0.98.
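The shape of that modeling step can be sketched with RevoScaleR's distributed functions. The paths, column names, and formulas below are hypothetical placeholders for the NYC taxi schema (the authors' actual code is on GitHub, linked below); the point is how the Hadoop compute context lets the same `rx*` calls run across the cluster.

```r
library(RevoScaleR)

# Send subsequent rx* computations to the Hadoop cluster via MapReduce
rxSetComputeContext(RxHadoopMR())

# Hypothetical train/test splits stored as .xdf on HDFS
trainXdf <- RxXdfData("/nyctaxi/train")
testXdf  <- RxXdfData("/nyctaxi/test")

# (1) Regression: predict the tip amount with a linear model
linFit  <- rxLinMod(tip_amount ~ fare_amount + trip_distance + passenger_count,
                    data = trainXdf)
tipPred <- rxPredict(linFit, data = testXdf, writeModelVars = TRUE)

# (2) Binary classification: was any tip paid at all?
btFit <- rxBTrees(tipped ~ fare_amount + trip_distance + passenger_count,
                  data = trainXdf, lossFunction = "bernoulli")
```

Switching `rxSetComputeContext` back to `"local"` would run the identical modeling code on a single machine, which is what makes it convenient to prototype locally before scaling out.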

The data behind the analysis is public, so if you'd like to try it out yourself the Microsoft R Server code for the analysis is available on Github, and you can read more details about the analysis in the detailed writeup, linked below. The link also contains details about data exploration and modeling, including references to additional distributed machine learning functions in R, which may be explored to improve model performance.

Scalable Data Analysis using Microsoft R Server (MRS) on Hadoop MapReduce: Using MRS on Azure HDInsight (Premium) for Exploring and Modeling the 2013 New York City Taxi Trip and Fare Data