With the current focus on deep learning, neural networks are all the rage again. (Neural networks have been described for more than 60 years, but it wasn't until the power of modern computing systems became available that they were successfully applied to tasks like image recognition.) Neural networks are the fundamental predictive engine in deep learning systems, but it can be difficult to understand exactly what they do. To help with that, Brandon Rohrer has created this from-the-basics guide to how neural networks work:

In R, you can train a simple neural network with just a single hidden layer with the nnet package, which comes pre-installed with every R distribution. It's a great place to start if you're new to neural networks, but deep learning applications call for more complex neural networks. R has several packages to check out here, including MXNet, darch, deepnet, and h2o: see this post for a comparison. The tensorflow package can also be used to implement various kinds of neural networks. And the rxNeuralNet function (found in the MicrosoftML package included with Microsoft R Server and Microsoft R Client) provides high-performance training of complex neural networks using CPUs and GPUs.
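As a quick taste (a minimal sketch, not from the guide linked below; the hidden-layer size and seed are arbitrary choices), here is a single-hidden-layer network fit with nnet on the built-in iris data:

```r
# Fit a one-hidden-layer neural network (4 hidden units) to classify iris species
library(nnet)
set.seed(42)
fit <- nnet(Species ~ ., data = iris, size = 4, trace = FALSE)

# Predict class labels on the training data and measure resubstitution accuracy
preds <- predict(fit, iris, type = "class")
accuracy <- mean(preds == iris$Species)
```

For anything deeper than one hidden layer, you'll want one of the packages mentioned above.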

Data Science and Robots Blog: How neural networks work

The MicrosoftML package introduced with Microsoft R Server 9.0 added several new functions for high-performance machine learning, including rxNeuralNet. Tomaz Kastrun recently applied rxNeuralNet to the MNIST database of handwritten digits to compare its performance with two other machine learning packages, h2o and xgboost. The results are summarized in the chart below:

In addition to having the best performance (for both the CPU-enabled and GPU-enabled modes), rxNeuralNet did not have to sacrifice accuracy. In fact, rxNeuralNet had the best accuracy of the three algorithms: 97.8%, compared to 95.3% for h2o and 94.9% for xgBoost. The same training and validation set were used for each case, and the R code is available here. (If you're looking for other uses of MicrosoftML, this script also applies algorithms like rxFastForest and rxFastLinear to various other datasets.)

The MicrosoftML package can be used to classify other kinds of images, too. This post from the Microsoft R Server Tiger Team demonstrates using the rxNeuralNet function to classify images from the UCI Image Segmentation Data Set. But for more on the OCR application, follow the link below.

TomaztSQL: rxNeuralNet vs. xgBoost vs. H2O

One of the major "wow!" moments in the keynote where SQL Server 2016 was first introduced was a demo that automated the process of classifying images of galaxies in a huge database of astronomical images.

The SQL Server Blog has since published a step-by-step tutorial on implementing the galaxy classifier in SQL Server (and the code is also available on GitHub). This updated version of the demo uses the new MicrosoftML package in Microsoft R Server 9, and specifically the `rxNeuralNet` function for deep neural networks. The tutorial recommends using the Azure NC class of virtual machines, to take advantage of the GPU-accelerated capabilities of the function, and provides details on using the SQL Server interfaces to train the neural network and run predictions (classifications) on the image database. For the details, follow the link below.

SQL Server Blog: How six lines of code + SQL Server can bring Deep Learning to ANY App

Oksana Kutina and Stefan Feuerriegel from University of Freiburg recently published an in-depth comparison of four R packages for deep learning. The packages reviewed were:

- MXNet: The R interface to the MXNet deep learning library. (The blog post refers to an older name for the package, MXNetR.)
- darch: An R package for deep architectures and restricted Boltzmann machines.
- deepnet: An R package implementing feed-forward neural networks, restricted Boltzmann machines, deep belief networks, and stacked autoencoders.
- h2o: The R interface to the H2O deep-learning framework.

The blog post goes into detail about the capabilities of the packages, and compares them in terms of flexibility, ease-of-use, parallelization frameworks supported (GPUs, clusters) and performance -- follow the link below for details. I include the conclusion from the paper here:

The current version of deepnet might represent the most differentiated package in terms of available architectures. However, due to its implementation, it might not be the fastest nor the most user-friendly option. Furthermore, it might not offer as many tuning parameters as some of the other packages.

H2O and MXNetR, on the contrary, offer a highly user-friendly experience. Both also provide output of additional information, perform training quickly and achieve decent results. H2O might be more suited for cluster environments, where data scientists can use it for data mining and exploration within a straightforward pipeline. When flexibility and prototyping is more of a concern, then MXNetR might be the most suitable choice. It provides an intuitive symbolic tool that is used to build custom network architectures from scratch. Additionally, it is well optimized to run on a personal computer by exploiting multi CPU/GPU capabilities.

darch offers a limited but targeted functionality focusing on deep belief networks.

Information Systems Research R Blog: Deep Learning in R

Microsoft R Server 9 includes a new R package for machine learning: MicrosoftML. (So do the Data Science Virtual Machine and the free Microsoft R Client edition, incidentally.) This package includes a suite of fast predictive modeling functions implemented by Microsoft Research, including:

- Linear (`rxFastLinear`) and logistic (`rxLogisticRegression`) model functions based on the Stochastic Dual Coordinate Ascent method
- Classification/regression trees (`rxFastTrees`) and random forests (`rxFastForests`) based on FastRank, an efficient implementation of the MART gradient boosting algorithm
- A neural network algorithm (`rxNeuralNet`) with support for custom, multilayer network topologies
- One-class anomaly detection (`rxOneClassSvm`) based on support vector machines

As the function names suggest, the implementations are tuned for speed: most use multiple CPUs, and some will even use the GPU (if available). Not all of the implementations scale to unlimited data sizes, however; all but the linear and logistic regression routines are bound by available RAM.

If you want to give these routines a try, the Microsoft R Server Tiger Team has prepared a walkthrough analyzing the famous NYC Taxi data set. Once you have access to Microsoft R Server (or Client), this R script walks you through the process of:

- Loading the MicrosoftML package
- Importing the NYC Taxi Data from SQL Server (it comes preinstalled on the Data Science Virtual Machine)
- Splitting the data into a test set and a training set, with the binary value "tipped" (whether or not the driver was tipped) as the response
- Fitting several predictive models: logistic regression, linear model, fast forest, and neural network
- Making predictions on the test data
- Evaluating model performance by comparing AUC (area under the ROC curve)
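The split-train-predict pattern the walkthrough follows can be sketched in plain base R (this uses `glm` on `mtcars` as a stand-in; the MicrosoftML functions themselves require Microsoft R Server or Client):

```r
# Split the data 70/30 into training and test sets
set.seed(123)
idx <- sample(nrow(mtcars), floor(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# Fit a logistic regression on a binary response (here: transmission type)
fit <- glm(am ~ mpg + wt, data = train, family = binomial)

# Score the held-out test set with predicted probabilities
probs <- predict(fit, newdata = test, type = "response")
```

With MicrosoftML you would swap `glm` for `rxLogisticRegression`, `rxFastForests`, or `rxNeuralNet` and compare the resulting AUCs.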

The ROC curves are shown below. As you'd expect, the linear model performs poorly compared to the others, since it's being applied here to a binary variable.

To try it out yourself, follow the walkthrough linked below, which also provides instructions for running the logistic regression model in SQL Server Management Studio.

Microsoft R Server Tiger Team: Predicting NYC Taxi Tips using MicrosoftML

*by Bob Horton, Microsoft Senior Data Scientist*

Receiver Operating Characteristic (ROC) curves are a popular way to visualize the tradeoffs between sensitivity and specificity in a binary classifier. In an earlier post, I described a simple “turtle’s eye view” of these plots: a classifier is used to sort cases in order from most to least likely to be positive, and a Logo-like turtle marches along this string of cases. The turtle considers all the cases it has passed as having tested positive. Depending on their actual class they are either false positives (FP) or true positives (TP); this is equivalent to adjusting a score threshold. When the turtle passes a TP it takes a step upward on the y-axis, and when it passes a FP it takes a step rightward on the x-axis. The step sizes are inversely proportional to the number of actual positives (in the y-direction) or negatives (in the x-direction), so the path always ends at coordinates (1, 1). The result is a plot of true positive rate (TPR, or sensitivity) against false positive rate (FPR, or 1 - specificity), which is all an ROC curve is.
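The turtle's walk translates directly into cumulative sums over the score-sorted labels (a small sketch; `roc_points` is a helper defined here, not from the earlier post):

```r
# Trace the ROC path: sort cases by score, then accumulate
# up-steps (true positives) and right-steps (false positives)
roc_points <- function(labels, scores) {
  ord <- order(scores, decreasing = TRUE)   # most to least likely positive
  labels <- labels[ord]
  data.frame(
    TPR = c(0, cumsum(labels)  / sum(labels)),   # step up on each TP
    FPR = c(0, cumsum(!labels) / sum(!labels)))  # step right on each FP
}

pts <- roc_points(c(TRUE, TRUE, FALSE, TRUE, FALSE),
                  c(0.9, 0.8, 0.7, 0.6, 0.4))
tail(pts, 1)  # the path always ends at (1, 1)
```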

Computing the area under the curve is one way to summarize it in a single value; this metric is so common that if data scientists say “area under the curve” or “AUC”, you can generally assume they mean an ROC curve unless otherwise specified.

Probably the most straightforward and intuitive metric for classifier performance is accuracy. Unfortunately, there are circumstances where simple accuracy does not work well. For example, with a disease that only affects 1 in a million people a completely bogus screening test that always reports “negative” will be 99.9999% accurate. Unlike accuracy, ROC curves are insensitive to class imbalance; the bogus screening test would have an AUC of 0.5, which is like not having a test at all.
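To see this numerically (a toy illustration with a 1-in-10,000 prevalence for speed; the numbers are made up), compare the accuracy and AUC of the always-negative screen:

```r
# Rare condition: one positive case in ten thousand
labels <- c(rep(FALSE, 9999), TRUE)
scores <- rep(0, 10000)   # bogus test: everyone scored "negative"

# Accuracy looks superb because the classifier matches the majority class
accuracy <- mean((scores > 0.5) == labels)   # 0.9999

# AUC compares every positive-negative pair; all ties score 0.5
pos <- scores[labels]
neg <- scores[!labels]
auc <- mean(outer(pos, neg, function(p, n) (1 + sign(p - n)) / 2))  # 0.5
```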

In this post I’ll work through the geometry exercise of computing the area, and develop a concise vectorized function that uses this approach. Then we’ll look at another way of viewing AUC which leads to a probabilistic interpretation.

Let’s start with a simple artificial data set:

```
category <- c(1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0)
prediction <- rev(seq_along(category))
prediction[9:10] <- mean(prediction[9:10])
```

Here the vector `prediction` holds ersatz scores; these normally would be assigned by a classifier, but here we’ve just assigned numbers so that the decreasing order of the scores matches the given order of the category labels. Scores 9 and 10, one representing a positive case and the other a negative case, are replaced by their average so that the data will contain ties without otherwise disturbing the order.

To plot an ROC curve, we’ll need to compute the true positive and false positive rates. In the earlier article we did this using cumulative sums of positives (or negatives) along the sorted binary labels. But here we’ll use the `pROC` package to make it official:

```
library(pROC)
roc_obj <- roc(category, prediction)
auc(roc_obj)
```

`## Area under the curve: 0.825`

```
roc_df <- data.frame(
  TPR=rev(roc_obj$sensitivities),
  FPR=rev(1 - roc_obj$specificities),
  labels=roc_obj$response,
  scores=roc_obj$predictor)
```

The `roc` function returns an object with plot methods and other conveniences, but for our purposes all we want from it is vectors of TPR and FPR values. TPR is the same as sensitivity, and FPR is 1 - specificity (see “confusion matrix” in Wikipedia). Unfortunately, the `roc` function reports these values sorted in order of ascending score; we want to start in the lower left hand corner, so I reverse the order. According to the `auc` function from the pROC package, our simulated category and prediction data gives an AUC of 0.825; we’ll compare other attempts at computing AUC to this value.

If the ROC curve were a perfect step function, we could find the area under it by adding a set of vertical bars with widths equal to the spaces between points on the FPR axis, and heights equal to the step height on the TPR axis. Since actual ROC curves can also include portions representing sets of values with tied scores which are not square steps, we need to adjust the area for these segments. In the figure below we use green bars to represent the areas under the steps. Adjustments for sets of tied values will be shown as blue rectangles; half the area of each of these blue rectangles is below a sloped segment of the curve.

The function for drawing polygons in base R takes vectors of x and y values; we’ll start by defining a `rectangle` function that uses a simpler and more specialized syntax: it takes x and y coordinates for the lower left corner of the rectangle, plus a height and width. It sets some default display options, and passes along any other parameters we might specify (like color) to the `polygon` function.

```
rectangle <- function(x, y, width, height, density=12, angle=-45, ...)
  polygon(c(x, x, x+width, x+width), c(y, y+height, y+height, y),
          density=density, angle=angle, ...)
```

The spaces between TPR (or FPR) values can be calculated by `diff`. Since this results in a vector one position shorter than the original data, we pad each difference vector with a zero at the end:

```
roc_df <- transform(roc_df,
  dFPR = c(diff(FPR), 0),
  dTPR = c(diff(TPR), 0))
```

For this figure, we’ll draw the ROC curve last to place it on top of the other elements, so we start by drawing an empty graph (`type='n'`) spanning from 0 to 1 on each axis. Since the data set has exactly ten positive and ten negative cases, the TPR and FPR values will all be multiples of 1/10, and the points of the ROC curve will all fall on a regularly spaced grid. We draw the grid using light blue horizontal and vertical lines spaced one tenth of a unit apart. Now we can pass the values we calculated above to the `rectangle` function, using `mapply` (the multi-variate version of `sapply`) to iterate over all the cases and draw all the green and blue rectangles. Finally we plot the ROC curve (that is, we plot TPR against FPR) on top of everything in red.

```
plot(0:10/10, 0:10/10, type='n', xlab="FPR", ylab="TPR")
abline(h=0:10/10, col="lightblue")
abline(v=0:10/10, col="lightblue")
with(roc_df, {
  mapply(rectangle, x=FPR, y=0,
         width=dFPR, height=TPR, col="green", lwd=2)
  mapply(rectangle, x=FPR, y=TPR,
         width=dFPR, height=dTPR, col="blue", lwd=2)
  lines(FPR, TPR, type='b', lwd=3, col="red")
})
```

The area under the red curve is all of the green area plus half of the blue area. For adding areas we only care about the height and width of each rectangle, not its (x,y) position. The heights of the green rectangles, which all start from 0, are in the TPR column and widths are in the dFPR column, so the total area of all the green rectangles is the dot product of TPR and dFPR. Note that the vectorized approach computes a rectangle for each data point, even when the height or width is zero (in which case it doesn’t hurt to add them). Similarly, the heights and widths of the blue rectangles (if there are any) are in columns dTPR and dFPR, so their total area is the dot product of these vectors. For regions of the graph that form square steps, one or the other of these values will be zero, so you only get blue rectangles (of non-zero area) if both TPR and FPR change in the same step. Only half the area of each blue rectangle is below its segment of the ROC curve (which is a diagonal of a blue rectangle). Remember the ‘real’ `auc` function gave us an AUC of 0.825, so that is the answer we’re looking for.

```
simple_auc <- function(TPR, FPR){
  # inputs already sorted, best scores first
  dFPR <- c(diff(FPR), 0)
  dTPR <- c(diff(TPR), 0)
  sum(TPR * dFPR) + sum(dTPR * dFPR)/2
}
with(roc_df, simple_auc(TPR, FPR))
```

`## [1] 0.825`

Now let’s try a completely different approach. Here we generate a matrix representing all possible combinations of a positive case with a negative case. Each row represents a positive case, in order from the highest-scoring positive case at the bottom to the lowest-scoring positive case at the top. Similarly, the columns represent the negative cases, sorted with the highest scores at the left. Each cell represents a comparison between a particular positive case and a particular negative case, and we mark the cell by whether its positive case has a higher score (or higher overall rank) than its negative case. If your classifier is any good, most of the positive cases will outrank most of the negative cases, and any exceptions will be in the upper left corner, where low-ranking positives are being compared to high-ranking negatives.

```
rank_comparison_auc <- function(labels, scores, plot_image=TRUE, ...){
  score_order <- order(scores, decreasing=TRUE)
  labels <- as.logical(labels[score_order])
  scores <- scores[score_order]
  pos_scores <- scores[labels]
  neg_scores <- scores[!labels]
  n_pos <- sum(labels)
  n_neg <- sum(!labels)
  M <- outer(n_pos:1, 1:n_neg,
             function(i, j) (1 + sign(pos_scores[i] - neg_scores[j]))/2)
  AUC <- mean(M)
  if (plot_image){
    image(t(M[nrow(M):1,]), ...)
    library(pROC)
    with(roc(labels, scores),
         lines((1 + 1/n_neg)*((1 - specificities) - 0.5/n_neg),
               (1 + 1/n_pos)*sensitivities - 0.5/n_pos,
               col="blue", lwd=2, type='b'))
    text(0.5, 0.5, sprintf("AUC = %0.4f", AUC))
  }
  return(AUC)
}
rank_comparison_auc(labels=as.logical(category), scores=prediction)
```

`## [1] 0.825`

The blue line is an ROC curve computed in the conventional manner (slid and stretched a bit to get the coordinates to line up with the corners of the matrix cells). This makes it evident that the ROC curve marks the boundary of the area where the positive cases outrank the negative cases. The AUC can be computed by adjusting the values in the matrix so that cells where the positive case outranks the negative case receive a `1`, cells where the negative case has higher rank receive a `0`, and cells with ties get `0.5` (since applying the `sign` function to the difference in scores gives values of 1, -1, and 0 to these cases, we put them in the range we want by adding one and dividing by two). We find the AUC by averaging these values.

The probabilistic interpretation is that if you randomly choose a positive case and a negative case, the probability that the positive case outranks the negative case according to the classifier is given by the AUC. This is evident from the figure, where the total area of the plot is normalized to one, the cells of the matrix enumerate all possible combinations of positive and negative cases, and the fraction under the curve comprises the cells where the positive case outranks the negative one.

We can use this observation to approximate AUC:

```
auc_probability <- function(labels, scores, N=1e7){
  pos <- sample(scores[labels], N, replace=TRUE)
  neg <- sample(scores[!labels], N, replace=TRUE)
  # sum( (1 + sign(pos - neg))/2 ) / N  # does the same thing
  (sum(pos > neg) + sum(pos == neg)/2) / N  # give partial credit for ties
}
auc_probability(as.logical(category), prediction)
```

`## [1] 0.8249989`

Now let’s try our new AUC functions on a bigger dataset. I’ll use the simulated dataset from the earlier blog post, where the labels are in the `bad_widget` column of the test set dataframe, and the scores are in a vector called `glm_response_scores`.

This data has no tied scores, so for testing let’s make a modified version that has ties. We’ll plot a black line representing the original data; since each point has a unique score, the ROC curve is a step function. Then we’ll generate tied scores by rounding the score values, and plot the rounded ROC in red. Note that we are using “response” scores from a `glm` model, so they all fall in the range from 0 to 1. When we round these scores to one decimal place, there are 11 possible rounded scores, from 0.0 to 1.0. The AUC values calculated with the `pROC` package are indicated on the figure.

```
roc_full_resolution <- roc(test_set$bad_widget, glm_response_scores)
rounded_scores <- round(glm_response_scores, digits=1)
roc_rounded <- roc(test_set$bad_widget, rounded_scores)
plot(roc_full_resolution, print.auc=TRUE)
```

```
##
## Call:
## roc.default(response = test_set$bad_widget, predictor = glm_response_scores)
##
## Data: glm_response_scores in 59 controls (test_set$bad_widget FALSE) < 66 cases (test_set$bad_widget TRUE).
## Area under the curve: 0.9037
```

```
lines(roc_rounded, col="red", type='b')
text(0.4, 0.43, labels=sprintf("AUC: %0.3f", auc(roc_rounded)), col="red")
```

Now we can try our AUC functions on both sets to check that they can handle both step functions and segments with intermediate slopes.

```
options(digits=22)
set.seed(1234)
results <- data.frame(
  `Full Resolution` = c(
    auc = as.numeric(auc(roc_full_resolution)),
    simple_auc = simple_auc(rev(roc_full_resolution$sensitivities),
                            rev(1 - roc_full_resolution$specificities)),
    rank_comparison_auc = rank_comparison_auc(test_set$bad_widget, glm_response_scores,
                                              main="Full-resolution scores (no ties)"),
    auc_probability = auc_probability(test_set$bad_widget, glm_response_scores)
  ),
  `Rounded Scores` = c(
    auc = as.numeric(auc(roc_rounded)),
    simple_auc = simple_auc(rev(roc_rounded$sensitivities),
                            rev(1 - roc_rounded$specificities)),
    rank_comparison_auc = rank_comparison_auc(test_set$bad_widget, rounded_scores,
                                              main="Rounded scores (ties in all segments)"),
    auc_probability = auc_probability(test_set$bad_widget, rounded_scores)
  )
)
```

| | Full.Resolution | Rounded.Scores |
|---|---|---|
| auc | 0.90369799691833586 | 0.89727786337955828 |
| simple_auc | 0.90369799691833586 | 0.89727786337955828 |
| rank_comparison_auc | 0.90369799691833586 | 0.89727786337955828 |
| auc_probability | 0.90371970000000001 | 0.89716879999999999 |

So we have two new functions that give exactly the same results as the function from the `pROC` package, and our probabilistic function is pretty close. Of course, these functions are intended as demonstrations; you should normally use standard packages such as `pROC` or `ROCR` for actual work.

Here we’ve focused on calculating AUC and understanding the probabilistic interpretation. The probability associated with AUC is somewhat arcane, and is not likely to be exactly what you are looking for in practice (unless you actually will be randomly selecting a positive and a negative case, and you really want to know the probability that the classifier will score the positive case higher.) While AUC gives a single-number summary of classifier performance that is suitable in some circumstances, other metrics are often more appropriate. In many applications, overall behavior of a classifier across all possible score thresholds is of less interest than the behavior in a specific range. For example, in marketing the goal is often to identify a highly enriched target group with a low false positive rate. In other applications it may be more important to clearly identify a group of cases likely to be negative. For example, when pre-screening for a disease or defect you may want to rule out as many cases as you can before you start running expensive confirmatory tests. More generally, evaluation metrics that take into account the actual costs of false positive and false negative errors may be much more appropriate than AUC. If you know these costs, you should probably use them. A good introduction relating ROC curves to economic utility functions, complete with story and characters, is given in the excellent blog post “ML Meets Economics.”
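When costs are known, picking a score threshold by expected cost is straightforward (a toy sketch; the labels, scores, and the 10x false-negative cost are all made-up numbers):

```r
# Choose the score threshold that minimizes total misclassification cost
labels <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
scores <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2)
cost_fp <- 1     # cost of a false positive
cost_fn <- 10    # false negatives assumed 10x more expensive

thresholds <- sort(unique(scores))
cost <- sapply(thresholds, function(t) {
  pred <- scores >= t
  sum(pred & !labels) * cost_fp +   # false positives
  sum(!pred & labels) * cost_fn     # false negatives
})
best_threshold <- thresholds[which.min(cost)]
```

Unlike AUC, this evaluates the classifier only at the operating point you would actually deploy.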

About a decade or so ago, photomosaics were all the rage: a near-recreation of a famous image by using many smaller images as elements. Here, for example, is the Mona Lisa, created using the Metapixel program by overlaying 32x32 images of vehicles and animals.

An image like this presents an interesting computer vision challenge: can you use deep learning techniques to find the pictures of boats and cars embedded in the image, amongst all the noise and clutter of the other images around and often on top of them? This is the challenge that Max Kaznady and his colleagues in the data science team took upon themselves, using the power of an Azure N-Series virtual machine with 24 cores and 4 K80 GPUs. The model was trained using the mxnet package running on Microsoft R Server, which takes advantage of the powerful GPUs to train a Residual Network (ResNet) DNN with 21 convolutional blocks. (You can read more about Resnet in this Microsoft Research paper.) Once the model was trained, an HDInsight Spark cluster running Microsoft R Server was used to parallelize the problem of finding boat and car images within the photomosaic. Here's the architecture of the system, with the steps marked in order in yellow; another blog post explains how to set up such an architecture yourself. You can also just use the Deep Learning Toolkit with the Data Science Virtual Machine.

To learn more about this application, check out this recorded presentation from the Data Science Summit presented by Max Kaznady and Tao Wu, or the blog post linked below.

Cortana Intelligence and Machine Learning Blog: Applying Deep Learning at Cloud Scale, with Microsoft R Server & Azure Data Lake

There's a new e-book available to download free from Microsoft Academy: Data Science with Microsoft SQL Server 2016.

This 90-page e-book is aimed at data scientists who already have some experience in R, but want to learn how to use R with SQL Server. The book was written by some of my most experienced colleagues on the data science team at Microsoft: Buck Woody, Danielle Dean, Debraj GuhaThakurta, Gagan Bansal, Matt Conners, and Wee-Hyong Tok, and begins with an introduction by Joseph Sirosh. It includes everything you need to know to use R and SQL Server:

- How to install and configure your data science tool-set: SQL Server 2016, Microsoft R Client, RStudio/RTVS, etc.
- How to download data from SQL Server into a local R client
- How to create and update tables in SQL Server from R
- How to use a remote SQL Server instance as an R compute engine, driven from your local R client
- How to write SQL Server stored procedures that run R code on the server, and share them with others

(If you're new to R or data science, there are also links to learning resources in Chapter 1.) The book also includes several fully-worked examples following the data science process, with links to data and code so you can try them out yourself:

- Creating a model to predict whether a tip is given for a taxi ride in New York City
- Building a customer churn model to find customers likely to switch to a competing service provider
- Analyzing "Internet of Things" data as part of a predictive maintenance program

**Data Science with Microsoft SQL Server 2016** is available now as a download from Microsoft Academy. (Desktop and mobile PDF formats available. Free registration required.)

If you'd like to manipulate and analyze very large data sets with the R language, one option is to use R and Apache Spark together. R provides the simple, data-oriented language for specifying transformations and models; Spark provides the storage and computation engine to handle data much larger than R alone can handle.

At the KDD 2016 conference last October, a team from Microsoft presented a tutorial on Scalable R on Spark, and made all of the materials available on Github. The materials include an 80-slide presentation covering several tutorials (you can download the 13Mb PowerPoint file here).

Slides 1-29 form an introduction which covers:

- Scaling R scripts on a single machine with the bigmemory and ff packages
- Interfacing to Spark from R with SparkR 1.6
- Installing and using the sparklyr package
- Using Microsoft R Server and the "RevoScaleR" package to offload its computations to Spark
- Comparisons and benchmarks of the techniques to scale R described above

Slides 32-44 form a hands-on tutorial working with the airline arrival data to predict flight delays. In the tutorial, you use SparkR to clean and join the data, R Server's "rxDTree" function to fit a random forest model to predict delays, and then publish a prediction function to Azure with the AzureML package to create a cloud-based flight-delay prediction service. The Microsoft R scripts are available here.

Slides 46-50 form another tutorial, this time working with the NYC Taxi dataset. The first tutorial script uses the sparklyr package to visualize the data and create models to predict the tip amount. This second tutorial script goes further with models, fitting Elastic Net, Random Forest and Gradient Boosted Tree models with both SparkR and sparklyr. In addition this script uses SparkR and SparkSQL to create a map of the trips.

Slides 51-59 demonstrate optimizing the performance of a time series forecasting model, by searching over a large parameter space with the hts package. By running the models in parallel to optimize the MAPE (mean absolute percent error), the total execution time was reduced to 1 day compared to the 40 days to complete the computations serially. The parallelization was achieved with the Microsoft R Server "rxExec" function, which you can replicate with the script available here.
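The fan-out pattern behind that speedup can be sketched with base R's parallel package (rxExec itself requires Microsoft R Server; the toy "forecast", grid, and data below are assumptions for illustration):

```r
# Evaluate a parameter grid across workers and keep the lowest-MAPE setting
library(parallel)

mape <- function(actual, pred) 100 * mean(abs((actual - pred) / actual))

actual <- c(100, 120, 90, 110)                     # toy series
grid <- expand.grid(damping = c(0.90, 0.95, 1.00)) # toy parameter grid

cl <- makeCluster(2)                               # two local workers
clusterExport(cl, c("mape", "actual", "grid"))
scores <- parSapply(cl, seq_len(nrow(grid)), function(i) {
  mape(actual, actual * grid$damping[i])           # toy "forecast": scaled copy
})
stopCluster(cl)

best <- grid$damping[which.min(scores)]
```

With rxExec the loop body stays the same; only the scheduling layer changes.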

Slides 60-72 includes background on calculating learning curves for various predictive models (see here and here for more information). There's also a tutorial: the R scripts are available here.

To work through the materials from the tutorial, you'll need access to a Spark cluster configured with Microsoft R Server and the necessary scripts and data files. You can easily create an HDInsight Premium cluster including Microsoft R Server on Microsoft Azure: these instructions provide the details. Once the cluster is ready, you can remotely access it from your desktop using `ssh` as described. The clusters are charged by the hour (according to their size and power), so be sure to shut down the cluster when you're done with the tutorial.

These tutorials are hopefully useful to anyone who is trying to learn to use R with Spark. The full collection of materials and slides is available at the Github repository below.

Github (Azure): KDD 2016 tutorial: Scalable Data Science with R, from Single Nodes to Spark Clusters

In Joseph Sirosh's keynote presentation at the Data Science Summit on Monday, Wee Hyong Tok demonstrated using R in SQL Server 2016 to detect fraud in real-time credit card transactions at a rate of 1 million transactions per second. The demo (which starts at the 17:00 minute mark) used a gradient-boosted tree model to predict the probability of a credit card transaction being fraudulent, based on attributes like the charge amount and the country of origin.

Then, a stored procedure in SQL Server 2016 was used to score transactions streaming into the database at a rate of 3.6 billion per hour. If you'd like to try this yourself, a step-by-step tutorial with code to implement the model and scoring is available here.

Later in the keynote (starting at 25:00), John Salch, VP of Technology and Platforms at PROS describes using R to determine prices for airline tickets, hotel rooms, and laptops. PROS has been using R for a while in development, but found running R within SQL Server 2016 to be 100 **times** (not 100%, 100x!) faster for price optimization. "This really woke us up that we can use R in a production setting ... it's truly amazing," he says.

It's great to see these global-scale applications of R, driving the intelligence of businesses behind the scenes. As Joseph said in the opening, "If there's one language you should learn today ... it's R."

Channel 9: Microsoft Machine Learning & Data Science Summit 2016