Apparently, people have strong feelings about how pasta carbonara should be made. A 45-second French video showing a one-pot preparation of the dish with farfalle instead of spaghetti and crème fraîche substituted for most of the cheese — the egg (just the yolk, at that!) never even stirred into the pasta to cook it — caused outrage in Italy. “It’s as if we had made bourguignon with lamb and white wine!”, said one French-Italian chef.

But while some may prefer the original Italian preparation (yours truly included — writing this is making me hungry), there are clearly plenty of variations on the theme in the culinary world. Salvino A. Salvaggio (who clearly falls into the traditionalist camp) scoured YouTube for all of the videos explaining how to make pasta carbonara. After eliminating those with fewer than 10,000 views and those straying too far from traditional carbonara (e.g. vegan preparations), Salvaggio used the R language to tally up the ingredients used in the 165 videos that remained. Here are the results:

Salvaggio forgives the use of bacon plus some olive oil instead of the traditional guanciale (pork cheeks), and parmesan is a close substitute for pecorino. But he has some strong words for the other interlopers:

However, using cream, garlic, egg yolks (instead of full eggs), onion, butter or white wine would be considered as a culinary heresy by a vast majority of cuisine amateurs. And it is shocking to notice that almost 50% of all recipes suggest to use some kind of cream or milk, while this is by no means part of the traditional recipe.

You can find Salvaggio's preferred recipe for pasta carbonara, along with the R code and data supporting the analysis, in his complete post linked below.

Salvino A. Salvaggio: It’s Not Rocket Science, Just Pasta Carbonara

*by Katherine Zhao, Hong Lu, Zhongmou Li, Data Scientists at Microsoft*

Bicycle rental has become popular as a convenient and environmentally friendly transportation option. Accurate estimation of bike demand at different locations and different times would help bicycle-sharing systems better meet rental demand and allocate bikes to locations.

In this blog post, we walk through how to use Microsoft R Server (MRS) to build a regression model to predict bike rental demand. In the example below, we demonstrate an end-to-end machine learning solution development process in MRS, including data importing, data cleaning, feature engineering, parameter sweeping, and model training and evaluation.

The Bike Rental UCI dataset is used as the input raw data for this sample. This dataset is based on real-world data from the Capital Bikeshare company, which operates a bike rental network in Washington DC in the United States.

The dataset contains 17,379 rows and 17 columns, with each row representing the number of bike rentals within a specific hour of a day in the years 2011 or 2012. Weather conditions (such as temperature, humidity, and wind speed) are included in the raw data, and the dates are categorized as holiday vs. weekday, etc.

The field to predict is **cnt**, which contains a count value ranging from 1 to 977, representing the number of bike rentals within a specific hour.

In this example, we use historical bike rental counts as well as the weather condition data to predict the number of bike rentals within a specific hour in the future. We approach this problem as a regression problem, since the label column (number of rentals) contains continuous real numbers.

Accordingly, we split the raw data into two parts: data records in year 2011 to learn the regression model, and data records in year 2012 to score and evaluate the model. Specifically, we employ the Decision Forest Regression algorithm as the regression model and build two models on different feature sets. Finally, we evaluate their prediction performance. We elaborate on the details in the following sections.

We build the models using the `RevoScaleR` library in MRS. The `RevoScaleR` library provides extremely fast statistical analysis on terabyte-class datasets without needing specialized hardware. `RevoScaleR`'s distributed computing capabilities can use a different (possibly remote) computing context with the same `RevoScaleR` commands to manage and analyze local data. A wide range of `rx`-prefixed functions provide functionality for:

- Accessing external data sets (SAS, SPSS, ODBC, Teradata, and delimited and fixed format text) for analysis in R.
- Efficiently storing and retrieving data in a high-performance data file.
- Cleaning, exploring, and manipulating data.
- Performing fast, basic statistical analysis.
- Training and scoring advanced machine learning models.
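As a brief, hedged sketch of how these `rx` functions fit together (`RevoScaleR` ships only with Microsoft R Server, so this is illustrative rather than runnable in open-source R, and the file names here are hypothetical):

```r
library(RevoScaleR)

# Import a delimited text file into the high-performance .xdf format
# (file names here are hypothetical).
bikeXdf <- rxImport(inData = "bikes.csv", outFile = "bikes.xdf",
                    overwrite = TRUE)

# Inspect variable metadata and a few rows.
rxGetInfo(bikeXdf, getVarInfo = TRUE, numRows = 5)

# Fast, basic statistical summaries, computed chunk by chunk.
rxSummary(~ cnt + temp + hum, data = bikeXdf)
```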

Overall, there are five major steps of building this example using Microsoft R Server:

- Step 1: Import and Clean Data
- Step 2: Perform Feature Engineering
- Step 3: Prepare Training, Test and Score Datasets
- Step 4: Sweep Parameters and Train Regression Models
- Step 5: Test, Evaluate, and Compare Models

**Step 1: Import and Clean Data**

First, we import the Bike Rental UCI dataset. Since a small portion of records within the dataset are missing, we use `rxDataStep()` to replace the missing records with the latest non-missing observations. `rxDataStep()` is a commonly used function for data manipulation: it transforms the input dataset chunk by chunk and saves the results to the output dataset.
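As a tiny illustration of the carry-forward behavior used here (`zoo` is a CRAN package; the vector is made up):

```r
library(zoo)

# na.locf() replaces each NA with the last observation carried forward.
x <- c(1, NA, NA, 4, NA)
na.locf(x)  # 1 1 1 4 4
```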

```r
# Define the transformation function for rxDataStep.
xform <- function(dataList) {
  # Identify the features with missing values.
  featureNames <- c("weathersit", "temp", "atemp", "hum", "windspeed", "cnt")
  # Use the "na.locf" function to carry forward the last observation.
  dataList[featureNames] <- lapply(dataList[featureNames], zoo::na.locf)
  # Return the data list.
  return(dataList)
}

# Use rxDataStep to replace missings with the latest non-missing observations.
cleanXdf <- rxDataStep(inData = mergeXdf, outFile = outFileClean,
                       overwrite = TRUE,
                       # Apply the "last observation carried forward" operation.
                       transformFunc = xform,
                       # Identify the features to apply the transformation to.
                       transformVars = c("weathersit", "temp", "atemp",
                                         "hum", "windspeed", "cnt"),
                       # Drop the "dteday" feature.
                       varsToDrop = "dteday")
```

**Step 2: Perform Feature Engineering**

In addition to the original features in the raw data, we add the number of bikes rented in each of the previous 12 hours as features, to provide better predictive power. We create a `computeLagFeatures()` helper function to compute the 12 lag features and use it as the transformation function in `rxDataStep()`.

Note that `rxDataStep()` processes data chunk by chunk, and lag feature computation requires data from previous rows. In `computeLagFeatures()`, we use the internal function `.rxSet()` to save the last *n* rows of a chunk to a variable **lagData**. When processing the next chunk, we use another internal function, `.rxGet()`, to retrieve the values stored in **lagData**.

```r
# Add the number of bikes that were rented in each of
# the previous 12 hours as 12 lag features.
computeLagFeatures <- function(dataList) {
  # Total number of lags that need to be added.
  numLags <- length(nLagsVector)
  # Lag feature names as lagN.
  varLagNameVector <- paste("cnt_", nLagsVector, "hour", sep = "")
  # Set the value of an object "storeLagData" in the transform environment.
  if (!exists("storeLagData")) {
    lagData <- mapply(rep, dataList[[varName]][1], times = nLagsVector)
    names(lagData) <- varLagNameVector
    .rxSet("storeLagData", lagData)
  }
  if (!.rxIsTestChunk) {
    for (iL in 1:numLags) {
      # Number of rows in the current chunk.
      numRowsInChunk <- length(dataList[[varName]])
      nlags <- nLagsVector[iL]
      varLagName <- paste("cnt_", nlags, "hour", sep = "")
      # Retrieve lag data from the previous chunk.
      lagData <- .rxGet("storeLagData")
      # Concatenate lagData and the "cnt" feature.
      allData <- c(lagData[[varLagName]], dataList[[varName]])
      # Take the first N rows of allData, where N is
      # the total number of rows in the original dataList.
      dataList[[varLagName]] <- allData[1:numRowsInChunk]
      # Save the last nlags rows as the new lagData, to be used
      # when processing the next chunk.
      lagData[[varLagName]] <- tail(allData, nlags)
      .rxSet("storeLagData", lagData)
    }
  }
  return(dataList)
}

# Apply "computeLagFeatures" to the bike data.
lagXdf <- rxDataStep(inData = cleanXdf, outFile = outFileLag,
                     transformFunc = computeLagFeatures,
                     transformObjects = list(varName = "cnt",
                                             nLagsVector = seq(12)),
                     transformVars = "cnt", overwrite = TRUE)
```

**Step 3: Prepare Training, Test and Score Datasets**

Before training the regression model, we split the data into two parts: data records in year 2011 to learn the regression model, and data records in year 2012 to score and evaluate the model. In order to obtain the best combination of parameters for the regression models, we further divide the year 2011 data into training and test datasets: 80% of the records are randomly selected to train regression models with various combinations of parameters, and the remaining 20% are used to evaluate the models obtained and determine the optimal combination.

```r
# Split data by "yr" so that the training and test data contain records
# for the year 2011 and the score data contains records for 2012.
rxSplit(inData = lagXdf, outFilesBase = paste0(td, "/modelData"),
        splitByFactor = "yr", overwrite = TRUE,
        reportProgress = 0, verbose = 0)

# Point to the .xdf files for the training & test set and the score set.
trainTest <- RxXdfData(paste0(td, "/modelData.yr.0.xdf"))
score <- RxXdfData(paste0(td, "/modelData.yr.1.xdf"))

# Randomly split records for the year 2011 into training and test sets
# for sweeping parameters: 80% of the data as training and 20% as test.
rxSplit(inData = trainTest, outFilesBase = paste0(td, "/sweepData"),
        outFileSuffixes = c("Train", "Test"), splitByFactor = "splitVar",
        overwrite = TRUE,
        transforms = list(splitVar = factor(
          sample(c("Train", "Test"), size = .rxNumRows,
                 replace = TRUE, prob = c(.80, .20)),
          levels = c("Train", "Test"))),
        rngSeed = 17, consoleOutput = TRUE)

# Point to the .xdf files for the training and test sets.
train <- RxXdfData(paste0(td, "/sweepData.splitVar.Train.xdf"))
test <- RxXdfData(paste0(td, "/sweepData.splitVar.Test.xdf"))
```

**Step 4: Sweep Parameters and Train Regression Models**

In this step, we construct two training datasets based on the same raw input data, but with different sets of features:

- Set A = weather + holiday + weekday + weekend features for the predicted day
- Set B = Set A + number of bikes rented in each of the previous 12 hours, which captures very recent demand for the bikes

In order to perform parameter sweeping, we create a helper function to evaluate the performance of a model trained with a given combination of number of trees and maximum depth. We use *Root Mean Squared Error (RMSE)* as the evaluation metric.

```r
# Define a function to train and test models with given parameters,
# returning Root Mean Squared Error (RMSE) as the performance metric.
TrainTestDForestfunction <- function(trainData, testData, form,
                                     numTrees, maxD) {
  # Build a decision forest regression model with the given parameters.
  dForest <- rxDForest(form, data = trainData, method = "anova",
                       maxDepth = maxD, nTree = numTrees, seed = 123)
  # Predict the number of bike rentals on the test data.
  rxPredict(dForest, data = testData,
            predVarNames = "cnt_Pred",
            residVarNames = "cnt_Resid",
            overwrite = TRUE, computeResiduals = TRUE)
  # Calculate the RMSE.
  result <- rxSummary(~ cnt_Resid, data = testData,
                      summaryStats = "Mean",
                      transforms = list(cnt_Resid = cnt_Resid^2)
                      )$sDataFrame
  # Return the number of trees, the maximum depth and the RMSE.
  return(c(numTrees, maxD, sqrt(result[1, 2])))
}
```

The following is another helper function to sweep and select the optimal parameter combination. Under the local parallel compute context (`rxSetComputeContext(RxLocalParallel())`), `rxExec()` executes multiple runs of model training and evaluation with different parameters in parallel, which significantly speeds up parameter sweeping. When used in a compute context with multiple nodes, e.g. high-performance computing clusters and Hadoop, `rxExec()` can be used to distribute a large number of tasks to the nodes and run them in parallel.

```r
# Define a function to sweep and select the optimal parameter combination.
findOptimal <- function(DFfunction, train, test, form,
                        nTreeArg, maxDepthArg) {
  # Sweep different combinations of parameters.
  sweepResults <- rxExec(DFfunction, train, test, form,
                         rxElemArg(nTreeArg), rxElemArg(maxDepthArg))
  # Sort the nested list by the third element (RMSE) in ascending order.
  sortResults <- sweepResults[order(unlist(lapply(sweepResults, `[[`, 3)))]
  # Select the optimal parameter combination.
  nTreeOptimal <- sortResults[[1]][1]
  maxDepthOptimal <- sortResults[[1]][2]
  # Return the optimal values.
  return(c(nTreeOptimal, maxDepthOptimal))
}
```

A large number of parameter combinations is usually swept through in the modeling process. For demonstration purposes, we use 9 combinations of parameters in this example.
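The post does not show how the 9 combinations, or the sweep vectors `numTreesToSweep` and `maxDepthToSweep` used in the training code, are defined. One plausible sketch (the specific values are assumptions, not from the original analysis) builds them from a 3 × 3 grid, since `rxElemArg()` expects one argument value per run:

```r
# A hypothetical 3 x 3 grid of parameters (9 combinations);
# the actual values used in the post are not shown.
paramGrid <- expand.grid(nTree = c(20, 40, 60),
                         maxDepth = c(10, 20, 30))

# rxElemArg() consumes parallel vectors, one element per run.
numTreesToSweep <- paramGrid$nTree
maxDepthToSweep <- paramGrid$maxDepth
```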

Next, we find the best parameter combination and get the optimal regression model for each training dataset. For simplicity, we only present the process for Set A.

```r
# Set A = weather + holiday + weekday + weekend features for the predicted day.
# Build a formula for the regression model, removing "yr"
# (used to split the training and test data) and the hourly lag features.
newHourFeatures <- paste("cnt_", seq(12), "hour", sep = "")
formA <- formula(train, depVars = "cnt",
                 varsToDrop = c("splitVar", newHourFeatures))

# Find the optimal parameters for Set A.
optimalResultsA <- findOptimal(TrainTestDForestfunction,
                               train, test, formA,
                               numTreesToSweep, maxDepthToSweep)

# Use the optimal parameters to fit a model for feature Set A.
nTreeOptimalA <- optimalResultsA[[1]]
maxDepthOptimalA <- optimalResultsA[[2]]
dForestA <- rxDForest(formA, data = trainTest, method = "anova",
                      maxDepth = maxDepthOptimalA, nTree = nTreeOptimalA,
                      importance = TRUE, seed = 123)
```

Finally, we plot the dot charts of the variable importance and the out-of-bag error rates for the two optimal decision forest models.

**Step 5: Test, Evaluate, and Compare Models**

In this step, we use the `rxPredict()` function to predict the bike rental demand on the score dataset, and compare the two regression models on three performance metrics: *Mean Absolute Error (MAE)*, *Root Mean Squared Error (RMSE)*, and *Relative Absolute Error (RAE)*.
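For reference, the three metrics can be sketched in a few lines of plain base R (illustrative definitions only, mirroring how the `rxSummary()` transforms compute them; the vectors are made-up, not from the bike dataset):

```r
# Hedged, illustrative definitions of the three evaluation metrics.
mae  <- function(actual, predicted) mean(abs(actual - predicted))
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rae  <- function(actual, predicted) mean(abs(actual - predicted) / actual)

# Tiny made-up example.
actual    <- c(10, 20, 30)
predicted <- c(12, 18, 33)
mae(actual, predicted)
rmse(actual, predicted)
rae(actual, predicted)
```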

```r
# Set A: Predict the number of bike rentals on the score dataset.
rxPredict(dForestA, data = score,
          predVarNames = "cnt_Pred_A",
          residVarNames = "cnt_Resid_A",
          overwrite = TRUE, computeResiduals = TRUE)

# Set B: Predict the number of bike rentals on the score dataset.
rxPredict(dForestB, data = score,
          predVarNames = "cnt_Pred_B",
          residVarNames = "cnt_Resid_B",
          overwrite = TRUE, computeResiduals = TRUE)

# Calculate three statistical metrics:
# Mean Absolute Error (MAE),
# Root Mean Squared Error (RMSE), and
# Relative Absolute Error (RAE).
sumResults <- rxSummary(~ cnt_Resid_A_abs + cnt_Resid_A_2 + cnt_rel_A +
                          cnt_Resid_B_abs + cnt_Resid_B_2 + cnt_rel_B,
                        data = score, summaryStats = "Mean",
                        transforms = list(cnt_Resid_A_abs = abs(cnt_Resid_A),
                                          cnt_Resid_A_2 = cnt_Resid_A^2,
                                          cnt_rel_A = abs(cnt_Resid_A)/cnt,
                                          cnt_Resid_B_abs = abs(cnt_Resid_B),
                                          cnt_Resid_B_2 = cnt_Resid_B^2,
                                          cnt_rel_B = abs(cnt_Resid_B)/cnt)
                        )$sDataFrame

# Add row names.
features <- c("baseline: weather + holiday + weekday + weekend features for the predicted day",
              "baseline + previous 12 hours demand")

# List all metrics in a data frame.
metrics <- data.frame(Features = features,
                      MAE = c(sumResults[1, 2], sumResults[4, 2]),
                      RMSE = c(sqrt(sumResults[2, 2]), sqrt(sumResults[5, 2])),
                      RAE = c(sumResults[3, 2], sumResults[6, 2]))
```

Based on all three metrics listed below, the regression model built on feature set B outperforms the one built on feature set A. This result is not surprising since, as the variable importance chart shows, the lag features play a critical part in the regression model. Adding this set of features leads to better performance.

Feature Set | MAE | RMSE | RAE
---|---|---|---
A | 101.34848 | 146.9973 | 0.9454142
B | 62.48245 | 105.6198 | 0.3737669

Follow this link for source code and datasets: Bike Rental Demand Estimation with Microsoft R Server

by Yuzhou Song, Microsoft Data Scientist

R is an open source, statistical programming language with millions of users in its community. However, a well-known weakness of R is that it is both single-threaded and memory-bound, which limits its ability to process big data. With Microsoft R Server (MRS), the enterprise-grade distribution of R for advanced analytics, users can continue to work in their preferred R environment with the following benefits: the ability to scale to data of any size, and potential speed increases of up to one hundred times over open source R.

In this article, we give a walk-through of how to build a gradient boosted tree using MRS. We use a simple fraud data set with approximately 1 million records and 9 columns. The last column, “fraudRisk”, is the label: 0 stands for non-fraud and 1 stands for fraud. The following is a snapshot of the data.

**Step 1: Import the Data**

First, load the “RevoScaleR” package and specify the directory and name of the data file. Note that this demo is done in local R, so I need to load the “RevoScaleR” package explicitly.

```r
library(RevoScaleR)

data.path <- "./"
file.name <- "ccFraud.csv"
fraud_csv_path <- file.path(data.path, file.name)
```

Next, we make a data source using the RxTextData function. The output of RxTextData is a data source declaring the type of each field, the name of each field, and the location of the data.

```r
colClasses <- c("integer", "factor", "factor", "integer",
                "numeric", "integer", "integer", "numeric", "factor")
names(colClasses) <- c("custID", "gender", "state", "cardholder",
                       "balance", "numTrans", "numIntlTrans",
                       "creditLine", "fraudRisk")
fraud_csv_source <- RxTextData(file = fraud_csv_path, colClasses = colClasses)
```

With the data source (“fraud_csv_source” above), we can pass it as input to other functions, just as we would pass a data frame to regular R functions. For example, we can pass it to the rxGetInfo function to check the information of the data:

```r
rxGetInfo(fraud_csv_source, getVarInfo = TRUE, numRows = 5)
```

and you will get the following:

**Step 2: Process Data**

Next, we demonstrate how to use an important function, rxDataStep, to process data before training. In general, the rxDataStep function transforms data from an input data set to an output data set. It has three main arguments: inData, outFile and transformFunc. inData takes either a data source (such as one made by RxTextData in Step 1) or a data frame. outFile takes a data source specifying the output file name, schema and location; if outFile is empty, rxDataStep returns a data frame. transformFunc takes a function that rxDataStep uses to perform the transformation. If that function needs objects beyond the input data source/frame, you can supply them through the transformObjects argument.
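As a minimal sketch of passing an extra object into a transformation via transformObjects (the data frame and the scaling factor here are hypothetical, not part of the fraud example):

```r
# A toy input data frame (hypothetical).
df <- data.frame(balance = c(100, 250, 400))

# A transform that uses an extra object, "scale_by",
# supplied via transformObjects rather than the global environment.
scale_balance <- function(data) {
  data$balance_scaled <- data$balance / scale_by
  return(data)
}

# With outFile omitted, rxDataStep returns a data frame.
scaled <- rxDataStep(inData = df,
                     transformFunc = scale_balance,
                     transformObjects = list(scale_by = 100))
```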

Here, we make a training flag using rxDataStep function as an example. The data source fraud_csv_source made in step 1 will be used for inData. We create an output data source specifying output file name “ccFraudFlag.csv”:

```r
fraud_flag_csv_source <- RxTextData(file = file.path(data.path,
                                                     "ccFraudFlag.csv"))
```

Also, we create a simple transformation function called “make_train_flag”. It creates a training flag that will be used to split the data into training and testing sets:

```r
make_train_flag <- function(data) {
  data <- as.data.frame(data)
  set.seed(34)
  data$trainFlag <- sample(c(0, 1), size = nrow(data),
                           replace = TRUE, prob = c(0.3, 0.7))
  return(data)
}
```

Then, we use rxDataStep to complete the transformation:

```r
rxDataStep(inData = fraud_csv_source,
           outFile = fraud_flag_csv_source,
           transformFunc = make_train_flag, overwrite = TRUE)
```

Again, we can check the output file information using the rxGetInfo function:

```r
rxGetInfo(fraud_flag_csv_source, getVarInfo = TRUE, numRows = 5)
```

We can see that the trainFlag column has been appended as the last column:

Based on the trainFlag, we split the data into training and testing sets, so we need to specify the data sources for the output:

```r
train_csv_source <- RxTextData(file = file.path(data.path, "train.csv"))
test_csv_source <- RxTextData(file = file.path(data.path, "test.csv"))
```

Instead of creating a transformation function, we can simply specify the rowSelection argument of the rxDataStep function to select rows satisfying certain conditions:

```r
rxDataStep(inData = fraud_flag_csv_source,
           outFile = train_csv_source,
           reportProgress = 0,
           rowSelection = (trainFlag == 1),
           overwrite = TRUE)

rxDataStep(inData = fraud_flag_csv_source,
           outFile = test_csv_source,
           reportProgress = 0,
           rowSelection = (trainFlag == 0),
           overwrite = TRUE)
```

A well-known problem with fraud data is the extremely skewed label distribution: most transactions are legitimate, while only a very small proportion are fraudulent. In the original data, the good/bad ratio is about 15:1. Training a model directly on the original data results in poor performance, since the model cannot find a proper boundary between “good” and “bad”. A simple but effective solution is to randomly down-sample the majority class. The following down_sample transformation function down-samples the majority to a good/bad ratio of 4:1. The down-sampling ratio is pre-selected based on prior knowledge, but it can also be fine-tuned with cross-validation.

```r
down_sample <- function(data) {
  data <- as.data.frame(data)
  data_bad <- subset(data, fraudRisk == 1)
  data_good <- subset(data, fraudRisk == 0)
  # Good-to-bad ratio 4:1.
  rate <- nrow(data_bad) * 4 / nrow(data_good)
  set.seed(34)
  data_good$keepTag <- sample(c(0, 1), replace = TRUE,
                              size = nrow(data_good),
                              prob = c(1 - rate, rate))
  data_good_down <- subset(data_good, keepTag == 1)
  data_good_down$keepTag <- NULL
  data_down <- rbind(data_bad, data_good_down)
  data_down$trainFlag <- NULL
  return(data_down)
}
```

Then, we specify the down-sampled training data source and use the rxDataStep function again to complete the down-sampling process:

```r
train_downsample_csv_source <- RxTextData(
  file = file.path(data.path, "train_downsample.csv"),
  colClasses = colClasses)

rxDataStep(inData = train_csv_source,
           outFile = train_downsample_csv_source,
           transformFunc = down_sample,
           reportProgress = 0,
           overwrite = TRUE)
```

**Step 3: Training**

In this step, we take the down-sampled data to train a gradient boosted tree. We first use the rxGetVarNames function to get all variable names in the training set (the input is still the data source of the down-sampled training data), then use them to create a formula for later use:

```r
training_vars <- rxGetVarNames(train_downsample_csv_source)
training_vars <- training_vars[!(training_vars %in%
                                   c("fraudRisk", "custID"))]
formula <- as.formula(paste("fraudRisk~",
                            paste(training_vars, collapse = "+")))
```

The rxBTrees function builds the gradient boosted tree model. The formula argument specifies the label column and the predictor columns. The data argument takes a data source as the training input. The lossFunction argument specifies the distribution of the label column: “bernoulli” for numerical 0/1 regression, “gaussian” for numerical regression, and “multinomial” for classification with two or more classes. Here we choose “multinomial”, treating this as a 0/1 classification problem. The other parameters are pre-selected, not fine-tuned:

```r
boosted_fit <- rxBTrees(formula = formula,
                        data = train_downsample_csv_source,
                        learningRate = 0.2,
                        minSplit = 10,
                        minBucket = 10,
                        # Small number of trees, for testing purposes.
                        nTree = 20,
                        seed = 5,
                        lossFunction = "multinomial",
                        reportProgress = 0)
```

**Step 4: Prediction and Evaluation**

We use the rxPredict function to predict on the testing data set, but first use the rxImport function to import it:

```r
test_data <- rxImport(test_csv_source)
```

Then, we take the imported testing set and the fitted model object as inputs for the rxPredict function:

```r
predictions <- rxPredict(modelObject = boosted_fit,
                         data = test_data,
                         type = "response",
                         overwrite = TRUE,
                         reportProgress = 0)
```

In the rxPredict function, type = "response" outputs predicted probabilities. Finally, we pick 0.5 as the threshold and evaluate the performance:

```r
threshold <- 0.5
predictions <- data.frame(predictions$X1_prob)
names(predictions) <- c("Boosted_Probability")
predictions$Boosted_Prediction <- ifelse(
  predictions$Boosted_Probability > threshold, 1, 0)
predictions$Boosted_Prediction <- factor(
  predictions$Boosted_Prediction,
  levels = c(1, 0))
scored_test_data <- cbind(test_data, predictions)
```

```r
evaluate_model <- function(data, observed, predicted) {
  confusion <- table(data[[observed]],
                     data[[predicted]])
  print(confusion)
  tp <- confusion[1, 1]
  fn <- confusion[1, 2]
  fp <- confusion[2, 1]
  tn <- confusion[2, 2]
  accuracy <- (tp + tn) / (tp + fn + fp + tn)
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  fscore <- 2 * (precision * recall) / (precision + recall)
  metrics <- c("Accuracy" = accuracy,
               "Precision" = precision,
               "Recall" = recall,
               "F-Score" = fscore)
  return(metrics)
}
```


```r
roc_curve <- function(data, observed, predicted) {
  data <- data[, c(observed, predicted)]
  data[[observed]] <- as.numeric(
    as.character(data[[observed]]))
  rxRocCurve(actualVarName = observed,
             predVarNames = predicted,
             data = data)
}
```


```r
boosted_metrics <- evaluate_model(data = scored_test_data,
                                  observed = "fraudRisk",
                                  predicted = "Boosted_Prediction")

roc_curve(data = scored_test_data,
          observed = "fraudRisk",
          predicted = "Boosted_Probability")
```

The confusion matrix:

```
          0        1
0   2752288    67659
1     77117   101796
```

The ROC curve (AUC = 0.95):

**Summary:**

In this article, we demonstrated how to use MRS on a fraud data set: how to create a data source with the RxTextData function, transform data with the rxDataStep function, import data with the rxImport function, train a gradient boosted tree model with the rxBTrees function, and predict with the rxPredict function.

You might think literary criticism is no place for statistical analysis, but given digital versions of the text you can, for example, use sentiment analysis to infer the dramatic arc of an Oscar Wilde novel. Now you can apply similar techniques to the works of Jane Austen thanks to Julia Silge's R package janeaustenr (available on CRAN). The package includes the full text of the six Austen novels, including *Pride and Prejudice* and *Sense and Sensibility*.

With the novels' text in hand, Julia then applied Bing sentiment analysis (as implemented in R's syuzhet package), shown here with annotations marking the major dramatic turns in the book:

There's quite a lot of noise in that chart, so Julia took the elegant step of using a low-pass Fourier transform to smooth the sentiment for all six novels, which allows for a comparison of the dramatic arcs:
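As a hedged sketch of the idea (not Julia's actual code), a low-pass filter keeps only the lowest-frequency Fourier components of the sentiment series, preserving the broad dramatic shape while discarding the noise. Here is a base-R version using `fft()` on a made-up signal:

```r
# Keep only the lowest `keep` Fourier components of a series:
# a simple low-pass filter that preserves the broad shape.
low_pass <- function(x, keep = 3) {
  n <- length(x)
  f <- fft(x)
  # Zero out everything except the `keep` lowest frequencies
  # and their mirrored conjugates.
  mask <- rep(0, n)
  mask[1:keep] <- 1
  mask[(n - keep + 2):n] <- 1
  Re(fft(f * mask, inverse = TRUE)) / n
}

# Made-up noisy "sentiment" series for illustration.
set.seed(1)
x <- sin(seq(0, 2 * pi, length.out = 100)) + rnorm(100, sd = 0.3)
smoothed <- low_pass(x, keep = 3)
```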

An apparent Austen aficionada, Julia interprets the analysis:

This is super interesting to me.

*Emma* and *Northanger Abbey* have the most similar plot trajectories, with their tales of immature women who come to understand their own folly and grow up a bit. *Mansfield Park* and *Persuasion* also have quite similar shapes, which also is absolutely reasonable; both of these are more serious, darker stories with main characters who are a little melancholic. *Persuasion* also appears unique in starting out with near-zero sentiment and then moving to more dramatic shifts in plot trajectory; it is a markedly different story from Austen’s other works.

For more on the techniques of the analysis, including all the R code (plus some clever Austen-based puns), check out Julia's complete post linked below.

data science ish: If I Loved Natural Language Processing Less, I Might Be Able to Talk About It More

As I mentioned yesterday, Microsoft R Server is now available for HDInsight, which means that you can now run R code (including the big-data algorithms of Microsoft R Server) on a managed, cloud-based Hadoop instance.

Debraj GuhaThakurta, Senior Data Scientist, and Shauheen Zahirazami, Senior Machine Learning Engineer at Microsoft, demonstrate some of these capabilities in their analysis of 170M taxi trips in New York City in 2013 (about 40 GB). Their goal was to show the use of Microsoft R Server on an HDInsight Hadoop cluster, and to that end, they created machine learning models using distributed R functions to predict (1) whether a tip was given for a taxi ride (binary classification problem), and (2) the amount of tip given (regression problem). The analyses involved building and testing different kinds of predictive models. Debraj and Shauheen uploaded the NYC Taxi data to HDFS on Azure blob storage, provisioned an HDInsight Hadoop Cluster with 2 head nodes (D12), 4 worker nodes (D12), and 1 R-server node (D4), and installed RStudio Server on the HDInsight cluster to conveniently communicate with the cluster and drive the computations from R.

To predict the tip amount, Debraj and Shauheen used linear regression on the training set (75% of the full dataset, about 127M rows). Boosted Decision Trees were used to predict whether or not a tip was paid. On the held-out test data, both models did fairly well. The linear regression model was able to predict the actual tip amount with a correlation of 0.78 (see figure below). Also, the boosted decision tree performed well on the test data with an AUC of 0.98.

The data behind the analysis is public, so if you'd like to try it out yourself the Microsoft R Server code for the analysis is available on Github, and you can read more details about the analysis in the detailed writeup, linked below. The link also contains details about data exploration and modeling, including references to additional distributed machine learning functions in R, which may be explored to improve model performance.

Scalable Data Analysis using Microsoft R Server (MRS) on Hadoop MapReduce: Using MRS on Azure HDInsight (Premium) for Exploring and Modeling the 2013 New York City Taxi Trip and Fare Data

If you want to train a statistical model on very large amounts of data, you'll need three things: a storage platform capable of holding all of the training data, a computational platform capable of efficiently performing the heavy-duty mathematical computations required, and a statistical computing language with algorithms that can take advantage of the storage and computation power. Microsoft R Server, running on HDInsight with Apache Spark, provides all three.

As Mario Inchiosa and Roni Burd demonstrate in this recorded webinar, Microsoft R Server can now run within HDInsight Hadoop nodes running on Microsoft Azure. Better yet, the big-data-capable algorithms of ScaleR (pdf) take advantage of the in-memory architecture of Spark, dramatically reducing the time needed to train models on large data. And if your data grows or you just need more power, you can dynamically add nodes to the HDInsight cluster using the Azure portal.

Many of the details are in the slides embedded above, but to see a demonstration of Microsoft R Server running on Spark with HDInsight, click on the link below for access to the recorded webinar.

Microsoft Azure On-Demand Webinar: Building A Scalable Data Science Platform with R and Hadoop

Buzzfeed's Peter Aldhous and Charles Seife broke a major news story last week: the US Federal Bureau of Investigation and Department of Homeland Security operate more than 200 small aircraft (mainly Cessnas and some helicopters) which routinely circle various sites near US cities, presumably to gather data with onboard cameras and electronic equipment. The data behind the story weren't provided by the government; rather, the data was culled from the public flight-tracking service FlightRadar24 and cross-referenced against FAA registrations for planes with transponders registered to those departments. The story includes an interactive map where you can navigate to a location of your choice and see all of the surveillance flights during the survey period of August 17 to December 31, 2015. (DHS flights are shown in blue; FBI flights are in red.)

All of the data gathering, analysis and visualization was done with the R language. The authors have provided a complete writeup of the analysis, including data and code, so that anyone can replicate the results. The analysis also includes details not included in the Buzzfeed story, such as this analysis of flights by hour of day. (This appears to be mostly a nine-to-five operation; there were also very few flights on weekends.)

For those interested in the details of the R code, much of the spatial data analysis was handled by the rgdal and rgeos packages, and the data preparation by the dplyr package (with extensive use of the magrittr `%>%` pipe). Data graphs (other than maps) were mostly handled by the ggplot2 package, and the interactive data tables were generated with the DT package. And as noted by story author Peter Aldhous in the comments below, the maps were made with CartoDB, after the data was processed in R.

All in all, it's a magnificent piece of data journalism. I highly recommend reading the full story and the accompanying detailed analysis at the links below.

*Updated April 11 to correct the color assignments and to note use of CartoDB.*

Buzzfeed: Spies in the Sky (feature); Spies in the Sky (detailed analysis)

By Srini Kumar, Director of Data Science at Microsoft

Who does not hate being stopped and given a traffic ticket? Invariably, we feel it's unfair that we got one while everyone else did not. I am no different, and living in the SF Bay Area, I have often wondered if I could get the data about traffic tickets, particularly since there may be some unusual patterns.

While I could not find any such data for the SF Bay Area, I figured I could get some vicarious pleasure out of analyzing data from some place that does publish it. As luck would have it, I did find something: it turns out that Montgomery County in Maryland puts all its traffic violation information online. Of course, they take out personally identifiable information, but they do publish a lot of other detail. If any official from that county is reading this, thanks for putting it online! Hopefully, other counties will follow suit.

By the standards of the data sizes I have worked with, this is rather small; it is a little over 800K records, though it has about 35 columns, including latitude and longitude. I find it convenient to park the data in a relational database, analyze it using SQL first, and then get slices of it into R.

In this post, I'll be using Microsoft SQL Server to import and prepare the traffic violations data, and then use R to visualize the data.

If you are unfamiliar with SQL, this SQL tutorial has been around for a long time, and this is where I got started with SQL. Another thing I find convenient is to use an ETL tool. ETL, if you are not already familiar with it, stands for Extract-Transform-Load, and is as pervasive a utility in IT infrastructures in companies as plumbing and electrical wiring are in homes, and often just as invisible, and noticed only when unavailable. If you are determined to stay open source, you can use PostgreSQL for a database (I prefer PostgreSQL over MySQL, but you can choose whatever you become comfortable with) and say, Talend for the ETL. For this exercise, I had access to Microsoft SQL Server's 2016 preview, and ended up using it. An interesting thing is that it already has an ETL tool called SQL Server Integration Services, and a wizard that makes it extremely convenient to do it all in one place.

Now, why do I spend so much time suggesting all this, when you could very easily just download a file, read it into R, and do whatever you want with it? In a nutshell: because of all the goodies you get with it. For example, you can easily detect problems with the data as part of the load, and run SQL against it. You can handle data sizes that won't fit in memory. And with MS SQL it gets better still: a wizard makes loading the data easy and surfaces problems even before you load it, and you can also run R from within the database, at scale.

Here are the fields in the table. The query is quite simple, and accesses the metadata that is available in every relational database.

```
select column_name, data_type
from INFORMATION_SCHEMA.columns
where table_name = 'Montgomery_County_MD_Traffic_Violations'
order by ORDINAL_POSITION asc
```

```
Date Of Stop             date
Time Of Stop             time
Agency                   varchar
SubAgency                varchar
Description              varchar
Location                 varchar
latitude                 float
longitude                float
Accident                 varchar
Belts                    varchar
Personal Injury          varchar
Property Damage          varchar
Fatal                    varchar
Commercial License       varchar
HAZMAT                   varchar
Commercial Vehicle       varchar
Alcohol                  varchar
Work Zone                varchar
State                    varchar
VehicleType              varchar
Year                     bigint
Make                     varchar
Model                    varchar
Color                    varchar
Violation Type           varchar
Charge                   varchar
Article                  varchar
Contributed To Accident  varchar
Race                     varchar
Gender                   varchar
Driver City              varchar
Driver State             varchar
DL State                 varchar
Arrest Type              varchar
Geolocation              varchar
month_of_year            nvarchar
day_of_week              nvarchar
hour_of_day              int
num_day_of_week          int
num_month_of_year        int
```

The last five columns were created by me. It is trivial to create new columns in SQL Server (or any other relational database) and have them computed each time a new row is added.

With RODBC/RJDBC, it is easy to get data from an RDBMS and do what we want with it. Here is an interesting plot of violations by race and if they concentrate in certain areas.

The code to create the above chart uses, as you may have guessed, the ggmap library. Installing ggmap automatically includes the ggplot2 library as well. To produce the plot, we first get the geocode for Montgomery County, and then use the get_map function to retrieve the map. The get_map function also takes a zoom level, which we can adjust if we need to zoom in or out instead of using the default.
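A minimal sketch of that approach (my reconstruction, not the author's original code; it assumes the violations sit in a data frame `alldata` with `longitude`, `latitude`, and `Race` columns pulled from the database):

```
library(ggmap)

# Geocode the county, then fetch a base map centered on it
center <- geocode("Montgomery County, MD")
basemap <- get_map(location = c(lon = center$lon, lat = center$lat), zoom = 10)

# Overlay the violation locations, colored by race
ggmap(basemap) +
  geom_point(data = alldata,
             aes(x = longitude, y = latitude, color = Race),
             alpha = 0.3, size = 0.8)
```

The alpha transparency helps show density where many points overlap.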

Here is another one on violation concentrations related to the time of day (the scale is hours from midnight: 0 is midnight and 12 is noon):

The code to do this is as follows:
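The original snippet did not survive the formatting of this post; a plausible sketch (again assuming the `alldata` data frame, with the `hour_of_day` column computed in SQL) might look like:

```
# A sketch, not the original code: map the violations again,
# this time coloring points by hour of day
library(ggmap)
basemap <- get_map(location = "Montgomery County, MD", zoom = 10)
ggmap(basemap) +
  geom_point(data = alldata,
             aes(x = longitude, y = latitude, color = hour_of_day),
             alpha = 0.3, size = 0.8) +
  scale_color_gradient(low = "yellow", high = "red")
```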

It looks like violations drop to very low numbers in the wee hours of the morning. How about violations by month by vehicle?

Looks like it is mostly cars, but there are some light duty trucks as well. But how did we get the numerical month of the year? Again, in SQL, it becomes easy to do:

`ALTER TABLE <table name> ADD num_month_of_year AS datepart(month, [Date Of Stop])`

After that, it is a routine matter of plotting a bar chart, after using RODBC/RJDBC to get the data from the database.

`ggplot(data=alldata, aes(num_month_of_year)) + geom_bar(aes(fill=VehicleType))`

When we look at the Make of the cars, we see how messed up that attribute is. For example, I have seen HUYAND, HUYND, HUYNDAI, HYAN, HYANDAI, HYN, HYND, HYNDAI. I am pretty sure I even saw a GORRILLA, and have no idea what it might mean. That does not make for a very good plot, or for that matter any analysis. Can we do better? Not really, unless we have reference data. We could build one using the car makes and models that we know of, but that is a nontrivial exercise.

A little SQL goes a long way, particularly if the data is very large and needs to be sampled selectively. For example, here is a query:
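(The query itself was lost in formatting; here is a hypothetical example in its spirit, tallying violations by month and vehicle type. The bracketed column name is the part the next paragraph refers to.)

```
select num_month_of_year, VehicleType, [Violation Type],
       count(*) as frequency
from Montgomery_County_MD_Traffic_Violations
group by num_month_of_year, VehicleType, [Violation Type]
```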

The square brackets [] are MS SQL syntax for working with field names that contain spaces and other special characters. To practice SQL itself, you don't have to get an RDBMS first: in R, the sqldf library lets you run SQL queries against data frames if you don't have an actual database. For example:

`bymonth <- sqldf("select num_month_of_year, sum(frequency) as frequency from res group by num_month_of_year")`

Since I have recently been introduced to what I call the "rx" functions from Microsoft R Server (the new name for Revolution Analytics' R), I decided to check them against what I normally use, which is `ctree` from the party package and, of course, glm. My intent was not to get insights from the data here; it was simply to see whether the rx functions would run better. So I simply tried to relate the number of violations to as many variables as I could use.

The rx functions (in my limited testing) worked when the CRAN R functions did not. Note that I did not go looking for places where one outperformed the other. This was simply to figure out what was doable with the rx functions.

Finally, we can call R from within SQL Server too. Here is an example:
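(The example did not survive formatting; here is a sketch of what it likely resembled, using SQL Server 2016's sp_execute_external_script, which requires R Services to be installed and enabled. The column choices are mine.)

```
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'OutputDataSet <- as.data.frame(table(InputDataSet$hour_of_day))',
    @input_data_1 = N'select hour_of_day
                      from Montgomery_County_MD_Traffic_Violations'
WITH RESULT SETS ((hour_of_day varchar(8), violations int));
```

The R script runs inside the database: `InputDataSet` receives the query results, and whatever data frame is assigned to `OutputDataSet` is returned to the SQL caller.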

It is no surprise that the fewest violations occur during the wee hours of the morning, but one surprise for me was the vehicle brand names. Of course, one can create a reference set of names and use it to clean this data, so that one can see which brands have a higher chance of violations, but that is not an elegant solution. Chances are that as we get more such data, the errors in human inputs will be a serious issue to contend with.

Hopefully, this article also illustrates how powerful SQL can be in manipulating data, though this article barely scratched the surface on SQL usage.

Airbnb, the property-rental marketplace that helps you find a place to stay when you're travelling, uses R to scale data science. Airbnb is a famously data-driven company, and has recently gone through a period of rapid growth. To accommodate the influx of data scientists (80% of whom are proficient in R, and 64% use R as their primary data analysis language), Airbnb organizes monthly week-long data bootcamps for new hires and current team members.

But just as important as the training program is the engineering process Airbnb uses to scale data science with R. Rather than just have data scientists write R functions independently (which not only is a likely duplication of work, but inhibits transparency and slows down productivity), Airbnb has invested in building an internal R package called Rbnb that implements collaborative solutions to common problems, standardizes visual presentations, and avoids reinventing the wheel. (Incidentally, the development and use of internal R packages is a common pattern I've seen at many companies with large data science teams.)

The Rbnb package used at Airbnb includes more than 60 functions and is still growing under the guidance of several active developers. It's actively used by Airbnb's engineering, data science, analytics and user experience teams, to do things like move aggregated or filtered data from a Hadoop or SQL environment into R, impute missing values, compute year-over-year trends, and perform common data aggregations. It has been used to create more than 500 research reports and to solve problems like automating the detection of host preferences and using guest ratings to predict rebooking rates.

The package is also widely used to visualize data using a standard Airbnb "look". The package includes custom themes, scales, and geoms for ggplot2; CSS templates for htmlwidgets and Shiny; and custom R Markdown templates for different types of reports. You can see several examples in the blog post by Ricardo Bion linked below, including this gorgeous visualization (created with R) of the 500,000 top Airbnb trips.

Medium (AirbnbEng): Using R packages and education to scale Data Science at Airbnb

*by Bob Horton, Senior Data Scientist, Microsoft*

This is a follow-up to my earlier post on learning curves. A learning curve is a plot of predictive error for training and validation sets over a range of training set sizes. Here we’re using simulated data to explore some fundamental relationships between training set size, model complexity, and prediction error.

Start by simulating a dataset:

```
sim_data <- function(N, num_inputs=8, input_cardinality=10){
  inputs <- rep(input_cardinality, num_inputs)
  names(inputs) <- paste0("X", seq_along(inputs))
  as.data.frame(lapply(inputs, function(cardinality)
    sample(LETTERS[1:cardinality], N, replace=TRUE)))
}
```

The input columns are named X1, X2, etc.; these are all categorical variables, with single capital letters representing the different categories. Cardinality is the number of possible values in the column; our default cardinality of 10 means we sample from the capital letters `A` through `J`.

Next we'll add an outcome variable (`y`); it has a base level of 100, but if the values in the first two `X` variables are equal, this is increased by 10. On top of this we add some normally distributed noise.

```
set.seed(123)
data <- sim_data(3e4, input_cardinality=10)
noise <- 2
data <- transform(data, y = ifelse(X1 == X2, 110, 100) +
                            rnorm(nrow(data), sd=noise))
```

With linear models, we handle an interaction between two categorical variables by adding an interaction term; the number of possibilities in this interaction term is basically the product of the cardinalities. In this simulated data set, only the first two columns affect the outcome, and the other input columns don’t contain any useful information. We’ll use it to demonstrate how adding non-informative variables affects overfitting and training set size requirements.
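To make the cardinality arithmetic concrete, here is a small illustration of my own (not part of the original analysis): two 10-level factors interacting yield 1 intercept, 9 + 9 main-effect coefficients, and 9 × 9 = 81 interaction coefficients, 100 in all.

```
# Two 10-level factors: y ~ X1 * X2 expands to
# 1 (intercept) + 9 + 9 (main effects) + 81 (interaction) = 100 coefficients
d <- expand.grid(X1 = LETTERS[1:10], X2 = LETTERS[1:10])
d$y <- rnorm(nrow(d))
length(coef(lm(y ~ X1 * X2, data = d)))  # 100
```

Each of those 100 coefficients needs data to estimate it, which is why interactions inflate training-set requirements so quickly.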

As in the earlier post, I’ll use the root mean squared error of the predictions as the error function because RMSE is essentially the same as standard deviation. No model should be able to make predictions with a root mean squared error less than the standard deviation of the random noise we added.

`rmse <- function(actual, predicted) sqrt( mean( (actual - predicted)^2 ))`

The cross-validation function trains a model using the supplied formula and modeling function, then tests its performance on a held-out test set. The training set is sampled from the data available for training; to use approximately a 10% sample of the training data, set `prob_train` to `0.1`.

```
cross_validate <- function(model_formula, fit_function, error_function,
                           validation_set, training_data, prob_train=1){
  training_set <- training_data[runif(nrow(training_data)) < prob_train,]
  tss <- nrow(training_set)
  outcome_var <- as.character(model_formula[[2]])
  fit <- fit_function(model_formula, training_set)
  training_error <- error_function(training_set[[outcome_var]],
                                   predict(fit, training_set))
  validation_error <- error_function(validation_set[[outcome_var]],
                                     predict(fit, validation_set))
  data.frame(tss=tss,
             formula=deparse(model_formula),
             training=training_error,
             validation=validation_error,
             stringsAsFactors=FALSE)
}
```

Construct a family of formulas, then use `expand.grid` to make a data frame with all the combinations of formulas and sampling probabilities:

```
generate_formula <- function(num_inputs, degree=2, outcome="y"){
  inputs <- paste0("X", 1:num_inputs)
  rhs <- paste0("(", paste(inputs, collapse=" + "), ") ^ ", degree)
  paste(outcome, rhs, sep=" ~ ")
}
formulae <- lapply(2:(ncol(data) - 1), generate_formula)
prob <- 2^(seq(0, -6, by=-0.5))
parameter_table <- expand.grid(formula=formulae,
                               sampling_probability=prob,
                               stringsAsFactors=FALSE)
```

Separate the training and validation data:

```
validation_fraction <- 0.25
in_validation_set <- runif(nrow(data)) < validation_fraction
vset <- data[in_validation_set,]
tdata <- data[!in_validation_set,]
run_param_row <- function(i){
  param <- parameter_table[i,]
  cross_validate(formula(param$formula[[1]]), lm, rmse,
                 vset, tdata, param$sampling_probability[[1]])
}
```

Now call the cross-validate function on each row of the parameter table. The `foreach` package makes it easy to process these jobs in parallel:

```
library(foreach)
library(doParallel)
```

```
registerDoParallel() # automatically manages cluster
learning_curve_results <- foreach(i=1:nrow(parameter_table)) %dopar% run_param_row(i)
learning_curve_table <- data.table::rbindlist(learning_curve_results)
```

The `rbindlist()` function from the `data.table` package puts the results together into a single data frame; this is both cleaner and dramatically faster than the old `do.call("rbind", ...)` approach (though we're just combining a small number of rows, so speed is not an issue here).

Now plot the results. Since we’ll do another plot later, I’ll wrap the plotting code in a function to make it more reusable.

```
plot_learning_curve <- function(lct, title, base_error,
                                plot_training_error=TRUE, ...){
  library(dplyr)
  library(tidyr)
  library(ggplot2)
  lct_long <- lct %>% gather(type, error, -tss, -formula)
  lct_long$type <- relevel(lct_long$type, "validation")
  plot_me <- if (plot_training_error) lct_long else lct_long[lct_long$type=="validation",]
  ggplot(plot_me, aes(x=log10(tss), y=error, col=formula, linetype=type)) +
    ggtitle(title) + geom_hline(yintercept=base_error, linetype=2) +
    geom_line(size=1) + xlab("log10(training set size)") + coord_cartesian(...)
}
```

```
plot_learning_curve(learning_curve_table, title="Extraneous variables are distracting",
base_error=noise, ylim=c(0,4))
```

This illustrates the phenomenon that adding more inputs to a model increases the requirements for training data. This is true even if the extra inputs do not contain any information. The cases where the training error is zero are actually rank-deficient (like having fewer equations than unknowns), and if you try this at home you will get warnings to that effect; this is an extreme kind of overfitting. Other learning algorithms might handle this better than `lm`, but the general idea is that those extra columns are distracting, and it takes more examples to falsify all the spurious correlations that get dredged up from those distractors.

But what if the additional columns considered by the more complex formulas actually did contain predictive information? Keeping the same `X`-values, we can modify `y` so that these other columns matter:

```
data <- transform(data, y = 100 + (X1==X2) * 10 +
                                  (X2==X3) * 3 +
                                  (X3==X4) * 3 +
                                  (X4==X5) * 3 +
                                  (X5==X6) * 3 +
                                  (X6==X7) * 3 +
                                  (X7==X8) * 3 +
                                  rnorm(nrow(data), sd=noise))
validation_fraction <- 0.25
in_validation_set <- runif(nrow(data)) < validation_fraction
vset <- data[in_validation_set,]
tdata <- data[!in_validation_set,]
run_param_row <- function(i){
  param <- parameter_table[i,]
  formula_string <- param$formula[[1]]
  prob <- param$sampling_probability[[1]]
  cross_validate(formula(formula_string), lm, rmse, vset, tdata, prob)
}
learning_curve_results <-
  foreach(i=1:nrow(parameter_table)) %dopar% run_param_row(i)
lct <- data.table::rbindlist(learning_curve_results)
```

This time we’ll leave the training errors off the plot to focus on the validation error; this is what really matters when you are trying to generalize predictions.

```
plot_learning_curve(lct, title="Crossover Point", base_error=noise,
plot_training_error=FALSE, ylim=c(1.5, 5))
```

Now we see another important phenomenon: the simple models that work best with small training sets are outperformed by more complex models on larger training sets. But these complex models are only usable if they are given sufficient data; plotting a learning curve makes it clear whether you have used sufficient data or not.

Learning curves give valuable insights into the model training process. In some cases this can help you decide to expend effort or expense on gathering more data. In other cases you may discover that your models have learned all they can from just a fraction of the data that is already available. This might encourage you to investigate more complex models that may be capable of learning the finer details of the dataset, possibly leading to better predictions.

These curves are computationally intensive to produce, as is fitting even a single model on a large dataset in R. Parallelization helped here, but in a future post I'll show similar patterns in learning curves for much bigger data sets (using real data, rather than synthetic) by taking advantage of the scalable tools of Microsoft R Server.