I had a great time in Budapest last week for the eRum 2018 conference. The organizers have already made all of the videos available online. Here's my presentation: Speeding up R with Parallel Programming in the cloud.

You can find (and download) my presentation slides here. And if you just want the references from the last slide, here are the links:

*by JS Tan (Program Manager, Microsoft)*

The R language is by far the most popular statistical language, and has seen massive adoption in both academia and industry. In our new data-centric economy, the models and algorithms that data scientists build in R are not just being used for research and experimentation. They are now also being deployed into production environments, and directly into products themselves.

However, taking your workload in R and deploying it at production capacity, and at scale, is no trivial matter. Because of R's rich and robust package ecosystem, and the many versions of R, reproducing the environment of your local machine in a production setting can be challenging, let alone ensuring your model's reproducibility!

This is why using containers is extremely important when it comes to operationalizing your R workloads. I'm happy to announce that the doAzureParallel package, powered by Azure Batch, now supports fully containerized deployments. With this migration, doAzureParallel will not only help you scale out your workloads, but will also do it in a completely containerized fashion, letting you bypass the complexities of dealing with inconsistent environments. Now that doAzureParallel runs on containers, we can ensure a consistent immutable runtime while handling custom R versions, environments, and packages.

By default, the container used in doAzureParallel is the 'rocker/tidyverse:latest' container that is developed and maintained as part of the rocker project. For most cases, and especially for beginners, this image will contain most of what is needed. However, as users become more experienced or have more complex deployment requirements, they may want to change the Docker image that is used, or even build their own. doAzureParallel supports both those options, giving you flexibility (without any compromise on reliability). Configuring the Docker image is easy. Once you know which Docker image you want to use, you can simply specify its location in the cluster configuration and doAzureParallel will just know to use it when provisioning subsequent clusters. More details on configuring your Docker container settings with doAzureParallel are included in the documentation.

With this release, we hope to unblock many users who are looking to take their R models and scale them up in the cloud. To get started with doAzureParallel, visit our Github page. Please give it a try and let us know if you have questions, feedback, or suggestions, either there or via email at razurebatch@microsoft.com.

Github (Azure): doAzureParallel

*by Błażej Moska, computer science student and data science intern*

One of the most important things in predictive modelling is how our algorithm will cope with various datasets, both training and testing (previously unseen). This is closely connected with the concept of the bias-variance tradeoff.

Roughly speaking, the variance of an estimator describes how the value of the estimator varies from dataset to dataset. It's defined as follows:

\[ \textrm{Var}[ \widehat{f} (x) ]=E[(\widehat{f} (x)-E[\widehat{f} (x)])^{2} ] \]

\[ \textrm{Var}[ \widehat{f} (x)]=E[\widehat{f} (x)^2]-E[\widehat{f} (x)]^2 \]
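The two expressions for the variance are equivalent. As a short worked step, writing \(\mu = E[\widehat{f}(x)]\) (a shorthand introduced here only for this derivation) and expanding the square gives:

\[ E[(\widehat{f}(x)-\mu)^{2}] = E[\widehat{f}(x)^{2}] - 2\mu E[\widehat{f}(x)] + \mu^{2} = E[\widehat{f}(x)^{2}] - \mu^{2} \]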

Bias is defined as follows:

\[ \textrm{Bias}[ \widehat{f} (x)]=E[\widehat{f}(x)-f(x)]=E[\widehat{f}(x)]-f(x) \]

One can think of bias as measuring how well the estimator is able to approximate the true function \(f\). Typically, reducing bias results in increased variance and vice versa.
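The tradeoff can be made concrete with the standard decomposition of the expected squared prediction error. It rests on an assumption that is not stated in this article: that observations are generated as \(y = f(x) + \varepsilon\), with noise \(\varepsilon\) of mean zero and variance \(\sigma^{2}\), independent of \(\widehat{f}\). Under that assumption:

\[ E[(y-\widehat{f}(x))^{2}] = \textrm{Bias}[\widehat{f}(x)]^{2} + \textrm{Var}[\widehat{f}(x)] + \sigma^{2} \]

The noise term \(\sigma^{2}\) does not depend on the model, so for a given error level a decrease in one of the first two terms has to be paid for by the other.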

\(E[X]\) is the expected value, which can be estimated using the sample mean, since the mean is an unbiased estimator of the expected value.

We can estimate variance and bias by bootstrapping the original training dataset, that is, by sampling with replacement the indexes of the original dataframe, then drawing the rows which correspond to these indexes to obtain new dataframes. This operation was repeated `nsampl` times, where `nsampl` is the parameter describing the number of bootstrap samples.

Variance and bias are estimated for one value at a time, that is to say, for one observation/row of the original dataset (we calculate variance and bias over the rows of predictions made on bootstrap samples). We then obtain two vectors, one containing variances and one containing biases. Each vector is of the same length as the number of observations in the original dataset. For the purpose of this article, a mean value was calculated for each of these two vectors. We will treat these two means as our estimates of mean bias and mean variance. If we don't want to measure the direction of the bias, we can take the absolute values of the bias.

Because bias and variance can be controlled by parameters passed to the `rpart` function, we can also survey how these parameters affect tree variance. The most commonly used parameters are `cp` (complexity parameter), which describes how much each split must decrease the overall lack of fit (for a regression tree, the variance of the response variable) in order to be attempted, and `minsplit`, which defines the minimum number of observations needed in a node to attempt a split.

The operations mentioned above are rather expensive in computational terms: we need to create `nsampl` bootstrap samples, grow `nsampl` trees, calculate `nsampl` predictions, `nrow` variances and `nrow` biases, and repeat those operations for each parameter value (the length of the vector `cp` or `minsplit`). For that reason the foreach package was used, to take advantage of parallelism. The above procedure still can't be considered fast, but it was much faster than without the foreach package.

So, summing up, the procedure looks as follows:

- Create bootstrap samples (by bootstrapping original dataset)
- Train model on each of these bootstrap datasets
- Calculate mean of predictions of these trees (for each observation) and compare these predictions with values of the original datasets (in other words, calculate bias for each row)
- Calculate variance of predictions for each row (estimate the variance of the estimator, here a regression tree)
- Calculate mean bias/absolute bias and mean variance
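The steps above can be sketched in a few lines of R. This is a minimal illustration rather than the code used for this article: the function name `tree_bias_variance`, the use of `mtcars` with `mpg` as the response, and the chosen `cp` values are all assumptions made for the example. It also uses a plain `sapply` instead of foreach so that it runs without extra packages, and, as in the procedure described above, the observed values stand in for the unknown \(f(x)\).

```r
library(rpart)  # recommended package bundled with standard R installations

# Bootstrap estimate of mean bias, mean absolute bias and mean variance of a
# regression tree, for one setting of cp and minsplit.
# (Hypothetical helper written for this sketch.)
tree_bias_variance <- function(data, target, nsampl = 50, cp = 0.01, minsplit = 20) {
  model_formula <- reformulate(".", response = target)
  # One column of predictions (on the original rows) per bootstrap tree
  preds <- sapply(seq_len(nsampl), function(b) {
    idx <- sample(nrow(data), replace = TRUE)          # bootstrap row indexes
    fit <- rpart(model_formula, data = data[idx, ],
                 control = rpart.control(cp = cp, minsplit = minsplit))
    predict(fit, newdata = data)
  })
  row_bias <- rowMeans(preds) - data[[target]]         # bias for each row
  row_var  <- apply(preds, 1, var)                     # variance for each row
  c(mean_bias     = mean(row_bias),
    mean_abs_bias = mean(abs(row_bias)),
    mean_variance = mean(row_var))
}

set.seed(123)
cp_values <- c(0.001, 0.01, 0.1)
results <- sapply(cp_values, function(cp) {
  tree_bias_variance(mtcars, target = "mpg", cp = cp, minsplit = 10)
})
colnames(results) <- paste0("cp=", cp_values)
print(round(results, 3))
```

Each column of `results` holds the three summary values for one `cp` setting. Replacing the outer `sapply` with a `foreach` loop is one way to parallelize over parameter values.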

*by Błażej Moska, computer science student and data science intern*

Suppose that we have performed K-means clustering in R and are satisfied with our results, but later we realize that it would also be useful to have a membership matrix. Of course it would be easier to repeat the clustering using one of the fuzzy clustering functions available in R (like `fanny`, for example), but since that is a slightly different implementation the results could also be different, and for some reason we don't want them to change. Knowing the equation, we can construct this matrix on our own after using the `kmeans` function. The equation is defined as follows (source: Wikipedia):

$$w_{ij} = \frac{1}{ \sum_ {k=1}^c ( \frac{ \| x_{i} - c_{j} \| }{ \| x_{i} - c_{k} \| }) ^{ \frac{2}{m-1} } } $$

\(w_{ij}\) denotes to what extent the \(i\)th object belongs to the \(j\)th cluster, where \(c_{j}\) is the center of the \(j\)th cluster and \(c\) is the number of clusters. So the total number of rows of this matrix equals the number of observations and the total number of columns equals the number of clusters. \(m\) is a parameter, typically set to \(m=2\). \(w_{ij}\) values range between 0 and 1 (and sum to 1 across each row), so they are easy and convenient to compare. In this example I will use \(m = 2\), so the exponent \(2/(m-1)\) equals 2; the distances \(\| x_{i} - c_{j} \|\) are Euclidean.

To make computations faster I also used the Rcpp package, then compared the execution speed of the function written in R with that of the one written in C++.

In both implementations `for` loops were used. This is not a commonly used method in R (see this blog post for more information and alternatives), but in this case I find it more convenient.

    #include <Rcpp.h>
    #include <math.h>
    using namespace Rcpp;

    // [[Rcpp::export]]
    NumericMatrix fuzzyClustering(NumericMatrix data, NumericMatrix centers, int m) {
      /* data is a matrix with observations (rows) and variables (columns),
         centers is a matrix with cluster center coordinates,
         m is the parameter of the equation,
         c is the number of clusters */
      int c = centers.rows();
      int rows = data.rows();
      int cols = data.cols(); /* number of columns equals number of variables,
                                 the same as in the centers matrix */
      double tempDist = 0;    /* dist and tempDist store temporary Euclidean distances */
      double dist = 0;
      double denominator = 0;        // denominator of the "main" equation
      NumericMatrix result(rows, c); // matrix of results

      for (int i = 0; i < rows; i++) {
        for (int j = 0; j < c; j++) {
          for (int k = 0; k < c; k++) {
            for (int p = 0; p < cols; p++) {
              // In the innermost loop the squared Euclidean distances are accumulated:
              // tempDist is the numerator inside the sum operator in the equation,
              // dist is the denominator inside the sum operator in the equation.
              tempDist = tempDist + pow(centers(j, p) - data(i, p), 2);
              dist = dist + pow(centers(k, p) - data(i, p), 2);
            }
            tempDist = sqrt(tempDist);
            dist = sqrt(dist);
            // 2.0 forces floating-point division; the integer expression
            // 2/(m-1) would truncate to 0 for m > 3
            denominator = denominator + pow(tempDist / dist, 2.0 / (m - 1));
            tempDist = 0;
            dist = 0;
          }
          result(i, j) = 1 / denominator; // numerator/denominator in the main equation
          denominator = 0;
        }
      }
      return result;
    }

We can save this in a file with .cpp extension. To compile it from R we can write:

    library(Rcpp)
    sourceCpp("path_to_cpp_file")

If everything goes right our function `fuzzyClustering` will then be available from R.

The equivalent function written in pure R looks as follows:

    fuzzyClustering <- function(data, centers, m) {
      c <- nrow(centers)
      rows <- nrow(data)
      cols <- ncol(data)
      result <- matrix(0, nrow = rows, ncol = c) # defining membership matrix
      denominator <- 0
      for (i in 1:rows) {
        for (j in 1:c) {
          # Euclidean distance, numerator inside the sum operator
          tempDist <- sqrt(sum((centers[j, ] - data[i, ])^2))
          for (k in 1:c) {
            # Euclidean distance, denominator inside the sum operator
            Dist <- sqrt(sum((centers[k, ] - data[i, ])^2))
            # denominator of the equation
            denominator <- denominator + ((tempDist / Dist)^(2 / (m - 1)))
          }
          result[i, j] <- 1 / denominator # inserting value into membership matrix
          denominator <- 0
        }
      }
      return(result)
    }

The result looks as follows. Columns are cluster numbers (in this case 10 clusters were created), rows are our objects (observations). Values were rounded to the third decimal place, so the row sums can differ slightly from 1:

              1     2     3     4     5     6     7     8     9    10
     [1,] 0.063 0.038 0.304 0.116 0.098 0.039 0.025 0.104 0.025 0.188
     [2,] 0.109 0.028 0.116 0.221 0.229 0.080 0.035 0.116 0.017 0.051
     [3,] 0.067 0.037 0.348 0.173 0.104 0.066 0.031 0.095 0.018 0.062
     [4,] 0.016 0.015 0.811 0.049 0.022 0.017 0.009 0.023 0.007 0.031
     [5,] 0.063 0.048 0.328 0.169 0.083 0.126 0.041 0.079 0.018 0.045
     [6,] 0.069 0.039 0.266 0.226 0.102 0.111 0.037 0.084 0.017 0.048
     [7,] 0.045 0.039 0.569 0.083 0.060 0.046 0.025 0.071 0.015 0.046
     [8,] 0.070 0.052 0.399 0.091 0.093 0.054 0.034 0.125 0.022 0.062
     [9,] 0.095 0.037 0.198 0.192 0.157 0.088 0.038 0.121 0.019 0.055
    [10,] 0.072 0.024 0.132 0.375 0.148 0.059 0.025 0.081 0.015 0.067

Shown below is the output of `system.time` for the C++ and R versions, running against a simulated matrix with 30000 observations, 3 variables and 10 clusters.

The hardware I used was a low-cost Asus R556L notebook with an Intel Core i3-5010 2.1 GHz processor and 8 GB of DDR3 1600 MHz RAM.

C++ version:

       user  system elapsed
       0.32    0.00    0.33

R version:

       user  system elapsed
      15.75    0.02   15.94

In this example, the function written in C++ executed about 50 times faster than the equivalent function written in pure R.
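For comparison, the same membership matrix can also be computed without explicit loops over observations. The sketch below is an illustration added here, not one of the two implementations timed above. The function name `membership_matrix` and the use of the `iris` measurements are assumptions made for the example. It relies on the identity \(w_{ij} = d_{ij}^{-p} / \sum_{k} d_{ik}^{-p}\) with \(p = 2/(m-1)\), which follows from the equation above by factoring \(d_{ij}^{p}\) out of the sum, where \(d_{ij} = \| x_{i} - c_{j} \|\).

```r
# Vectorized sketch of the membership matrix. `membership_matrix` is a
# hypothetical name chosen for this example, not a function from a package.
membership_matrix <- function(data, centers, m = 2) {
  data <- as.matrix(data)
  centers <- as.matrix(centers)
  # d[i, j]: Euclidean distance from observation i to cluster center j
  d <- sapply(seq_len(nrow(centers)), function(j) {
    sqrt(rowSums(sweep(data, 2, centers[j, ])^2))
  })
  # w_ij = d_ij^(-p) / sum_k d_ik^(-p), with p = 2 / (m - 1)
  inv <- d^(-2 / (m - 1))
  inv / rowSums(inv)
}

set.seed(1)
x <- as.matrix(iris[, 1:4])   # example data, an assumption of this sketch
km <- kmeans(x, centers = 3)
w <- membership_matrix(x, km$centers, m = 2)

dim(w)              # one row per observation, one column per cluster
range(rowSums(w))   # each row should sum to 1 up to floating-point error
```

As with the loop versions, an observation that coincides exactly with a cluster center produces a zero distance and therefore an undefined membership; that case is not handled here.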

Making your code run faster is often the primary goal when using parallel programming techniques in R, but sometimes the effort of converting your code to use a parallel framework leads only to disappointment, at least initially. Norman Matloff, author of *Parallel Computing for Data Science: With Examples in R, C++ and CUDA*, has shared chapter 2 of that book online, and it describes some of the issues that can lead to poor performance. They include:

- Communications overhead, particularly an issue with fine-grained parallelism consisting of a very large number of relatively small tasks;
- Load balance, where the computing resources aren't contributing equally to the problem;
- Impacts from use of RAM and virtual memory, such as cache misses and page faults;
- Network effects, such as latency and bandwidth, that impact performance and communication overhead;
- Interprocess conflicts and thread scheduling;
- Data access and other I/O considerations.
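The first item, communications overhead with fine-grained tasks, is easy to observe on a small scale. The sketch below is an illustration added here and is not taken from the book: it uses the `parallel` package that ships with R, a two-worker local cluster, and a deliberately trivial task, all of which are choices made for this example. It sends the same work once as 2000 tiny tasks and once as one chunk per worker.

```r
library(parallel)  # ships with base R

n_workers <- 2
cl <- makeCluster(n_workers)   # local worker processes

x <- 1:2000

# Fine-grained: clusterApply sends one task per element (2000 round trips)
t_fine <- system.time(
  res_fine <- unlist(clusterApply(cl, x, function(v) sqrt(v)))
)

# Coarse-grained: one task per worker, each handling a contiguous chunk
chunks <- splitIndices(length(x), n_workers)
t_coarse <- system.time(
  res_coarse <- unlist(clusterApply(cl, chunks, function(idx, v) sqrt(v[idx]), v = x))
)

stopCluster(cl)

# Both strategies compute the same values; only the task granularity differs
same_result <- identical(res_fine, res_coarse)
print(same_result)
print(c(fine = t_fine[["elapsed"]], coarse = t_coarse[["elapsed"]]))
```

The exact timings depend on the machine, so none are quoted here; the point of the comparison is that the per-task messaging cost is paid 2000 times in the first case and only twice in the second.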

The chapter is well worth a read for anyone writing parallel code in R (or indeed any programming language). It's also worth checking out Norm Matloff's keynote from the useR!2017 conference, embedded below.

Norm Matloff: Understanding overhead issues in parallel computation

At the R/Finance conference last month, I demonstrated how to operationalize models developed in Microsoft R Server as web services using the mrsdeploy package. Then, I used that deployed model to generate predictions for loan delinquency, using a Python script as the client. (You can see slides here, and a video of the presentation below.)

With Microsoft R Server 9.1, there are now two ways to operationalize models as a Web service or as a SQL Server stored procedure:

- **Flexible Operationalization**: Deploy any R script or function.
- **Real-Time Operationalization**: Deploy model objects generated by specific functions in Microsoft R, generating predictions much more quickly by bypassing the R interpreter.

In the demo, which begins at the 10:00 mark in the video below, you can see a comparison of using the two types of deployment. Ultimately, I was able to generate predictions from a random forest at a rate of 1M predictions per second, with three Python clients simultaneously drawing responses from the server (an Azure GS5 instance running the Windows Data Science VM).

If you'd like to try out this capability yourself, you can find the R and Python scripts used in the demo at this Github repository. The lending club data is available here, and the script used to featurize the data is here.

At the EARL conference in San Francisco this week, JS Tan from Microsoft gave an update (PDF slides here) on the doAzureParallel package. As we've noted here before, this package allows you to easily distribute parallel R computations to an Azure cluster. The package was recently updated to support using automatically-scaling Azure Batch clusters with low-priority nodes, which can be used at a discount of up to 80% compared to the price of regular high-availability VMs.

JS Tan using doAzureParallel #rstats package to run simulation on a cluster of 20 low-priority Azure VMs. Total cost: $0.02 #EARLConf2017 pic.twitter.com/Mpl3IUa9zY

— David Smith (@revodavid) June 7, 2017

Using the doAzureParallel package is simple. First, you need to define the cluster you're going to use as a JSON file. (You can see an example on the right.) Here, you'll specify your Azure credentials, the size of the cluster, and the type of nodes (CPUs and memory) to use in the cluster. You can also specify R packages (from CRAN and/or Github) to be pre-loaded onto each node, and the maximum number of simultaneous tasks to run on each node (for within-node parallelism).

New to this update, the poolSize option allows you to specify the number of dedicated (standard) VM nodes to use, in addition to a number of low-priority nodes to use. Low-priority nodes can be pre-empted by the Azure system at any time, but are much cheaper to use. (Even if a node is pre-empted your parallel computation will continue; it will just take a little longer with the reduced capacity.) You can even specify a minimum and maximum number of nodes of each class to use, in which case the cluster will **automatically scale** up and down according to either (your choice) the workload or the time of day (e.g. only expand the low-priority part of the cluster on weekends, when pre-emption is less likely).

Once you've defined the parameters of your cluster, all you need to do is declare the cluster as a backend for the foreach package. The body of the `foreach` loop runs just like a `for` loop, except that multiple iterations run in parallel on the remote cluster. Here are the key parts of the option price simulation example JS presented at the conference.
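The snippet from the talk is not included in this text. As a rough sketch of the same pattern, and explicitly not JS's code, the example below runs a toy price simulation inside a `foreach` loop. The function `simulate_closing_price`, its parameter values, and the loop sizes are all assumptions made for illustration. So that it runs on a local machine it registers foreach's sequential backend; the commented lines indicate where a doAzureParallel cluster would be registered instead, with placeholder file names.

```r
library(foreach)  # assumed to be installed from CRAN

# With doAzureParallel the backend would instead be registered along these
# lines (not run here; the JSON file names are placeholders):
#   setCredentials("credentials.json")
#   cluster <- makeCluster("cluster.json")
#   registerDoAzureParallel(cluster)
registerDoSEQ()  # sequential stand-in so the sketch runs locally

set.seed(1)
opening_price <- 100

# Toy model (an assumption of this sketch): the closing price is the opening
# price multiplied by a sequence of random daily movements.
simulate_closing_price <- function(days = 250, mean_change = 1.001, volatility = 0.01) {
  movement <- rnorm(days, mean = mean_change, sd = volatility)
  opening_price * prod(movement)
}

# Each iteration is independent, so iterations can be farmed out to workers
closing_prices <- foreach(i = 1:10, .combine = c) %dopar% {
  replicate(100, simulate_closing_price())
}

length(closing_prices)   # 10 iterations x 100 replicates each
summary(closing_prices)
```

Only the backend registration changes between a local run and a cluster run; the body of the loop stays the same.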

This same approach can be used for any "embarrassingly parallel" iteration in R, and you can use any R function or package within the body of the loop. For example, you could use a cluster to reduce the time required for parameter tuning and cross-validation with the caret package, or speed up data preparation tasks when using the dplyr package.

In addition to support for auto-scaling clusters, this update to doAzureParallel also includes a few other new features. You'll also find new utility functions for managing multiple long-running R jobs, functions to read data from and write data to Azure Blob storage, and the ability to pre-load data into the cluster by specifying resource files.

The doAzureParallel package is available for download now from Github, under the open-source MIT license. For details on how to use the package, check out the README and the doAzureParallel guide.

Github (Azure): doAzureParallel