The future package is a powerful and elegant cross-platform framework for orchestrating asynchronous computations in R. It's ideal for working with computations that take a long time to complete; that would benefit from using distributed, parallel frameworks to make them complete faster; and that you'd rather not have locking up your interactive R session. You can get a good sense of the future package from its introductory vignette or from this eRum 2018 presentation by its author, Henrik Bengtsson (video embedded below), but at its simplest it allows constructs in R like this:

```r
a %<-% slow_calculation(1:50)
b %<-% slow_calculation(51:100)
a + b
```

The idea here is that `slow_calculation` is an R function that takes a long time to run, but with the special `%<-%` assignment operator the computation begins *and the R interpreter is ready immediately*. The first two lines of R code above take essentially zero time to execute. The future package farms those computations out to another process, or even a remote system (you specify which with a preceding `plan` call), and R halts only when the *result* is needed, as in the third line above. This is beneficial in Bengtsson's own work, where he uses the future package to parallelize cancer research on DNA sequences on high-performance computing (HPC) clusters.
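As a concrete, runnable sketch (assuming only that your machine can host a couple of background R sessions; `slow_calculation` here is a stand-in for any long-running function):

```r
library(future)
plan(multisession, workers = 2)  # evaluate futures in parallel background R sessions

slow_calculation <- function(x) { Sys.sleep(1); sum(x) }  # stand-in for real work

a %<-% slow_calculation(1:50)    # returns immediately
b %<-% slow_calculation(51:100)  # returns immediately
a + b                            # blocks here until both results are ready; 5050
```

Because both futures run concurrently, the whole script takes roughly one second rather than two.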

The future package supports a wide variety of computation frameworks including parallel local R sessions, remote R sessions, and cluster computing frameworks. (If you can't use any of these, it falls back to evaluating the expressions locally, in sequence.) The future package also works in concert with other parallel programming systems already available in R. For example, the future ecosystem provides `future_lapply` (now in the companion future.apply package) as a futurized analog of `lapply`, which will use whatever computation plan you have defined to run the computations in parallel.
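For example (a small sketch; note that in current releases `future_lapply` lives in the companion future.apply package):

```r
library(future.apply)  # provides future_lapply
plan(multisession, workers = 2)

# runs the four calls in parallel according to the current plan
res <- future_lapply(1:4, function(x) x^2)
unlist(res)  # 1 4 9 16
```

Swapping `plan(multisession)` for another plan changes where the work runs without touching the `future_lapply` call itself.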

The future package also extends the foreach package thanks to the updated doFuture package. By using `registerDoFuture` as the foreach backend, your loops can use any computation plan provided by the future package to run the iterations in parallel. (The same applies to R packages that use foreach internally, notably the caret package.) This means you can now use foreach with any of the HPC schedulers supported by future, including TORQUE, Slurm, and OpenLava. So if you share a Slurm HPC cluster with colleagues in your department, you can queue up a parallel simulation on the cluster using code like this:

```r
library("doFuture")
registerDoFuture()
library("future.batchtools")
plan(batchtools_slurm)

mu <- 1.0
sigma <- 2.0
x <- foreach(i = 1:3, .export = c("mu", "sigma")) %dopar% {
  rnorm(i, mean = mu, sd = sigma)
}
```

The future package is available on CRAN now, and works consistently on Windows, Mac and Linux systems. You can learn more in the video at the end of this post, or in the recent blog update linked below.

I had a great time in Budapest last week for the eRum 2018 conference. The organizers have already made all of the videos available online. Here's my presentation: Speeding up R with Parallel Programming in the cloud.

You can find (and download) my presentation slides here. And if you just want the references from the last slide, here are the links:

*by JS Tan (Program Manager, Microsoft)*

The R language is by far the most popular statistical language, and has seen massive adoption in both academia and industry. In our new data-centric economy, the models and algorithms that data scientists build in R are not just being used for research and experimentation. They are now also being deployed into production environments, and directly into products themselves.

However, taking your R workload and deploying it at production capacity, and at scale, is no trivial matter. Because of R's rich and robust package ecosystem, and the many versions of R, reproducing your local machine's environment in a production setting can be challenging, to say nothing of ensuring your model's reproducibility!

This is why using containers is extremely important when it comes to operationalizing your R workloads. I'm happy to announce that the doAzureParallel package, powered by Azure Batch, now supports fully containerized deployments. With this migration, doAzureParallel will not only help you scale out your workloads, but will also do it in a completely containerized fashion, letting you bypass the complexities of dealing with inconsistent environments. Now that doAzureParallel runs on containers, we can ensure a consistent immutable runtime while handling custom R versions, environments, and packages.

By default, the container used in doAzureParallel is the 'rocker/tidyverse:latest' container that is developed and maintained as part of the rocker project. For most cases, and especially for beginners, this image will contain most of what is needed. However, as users become more experienced or have more complex deployment requirements, they may want to change the Docker image that is used, or even build their own. doAzureParallel supports both those options, giving you flexibility (without any compromise on reliability). Configuring the Docker image is easy. Once you know which Docker image you want to use, you can simply specify its location in the cluster configuration and doAzureParallel will just know to use it when provisioning subsequent clusters. More details on configuring your Docker container settings with doAzureParallel are included in the documentation.
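For instance, a cluster configuration that swaps in a different rocker image might look roughly like this (a sketch only: the exact key names and structure come from the doAzureParallel documentation and may differ between versions, and the image name here is just an example):

```json
{
  "name": "my-r-cluster",
  "vmSize": "Standard_D2_v2",
  "maxTasksPerNode": 1,
  "poolSize": {
    "dedicatedNodes": { "min": 2, "max": 2 },
    "lowPriorityNodes": { "min": 0, "max": 0 }
  },
  "containerImage": "rocker/geospatial:latest"
}
```

Subsequent clusters provisioned from this configuration will then run all tasks inside the specified image.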

With this release, we hope to unblock many users who are looking to take their R models and scale them up in the cloud. To get started with doAzureParallel, visit our Github page. Please give it a try and let us know if you have questions, feedback, or suggestions, either on Github or via email at razurebatch@microsoft.com.

Github (Azure): doAzureParallel

*by Błażej Moska, computer science student and data science intern*

One of the most important things in predictive modelling is how our algorithm copes with various datasets, both training and testing (previously unseen). This is closely connected with the concept of the bias-variance tradeoff.

Roughly speaking, the variance of an estimator describes how the estimator's value varies from dataset to dataset. It's defined as follows:

\[ \textrm{Var}[ \widehat{f} (x) ]=E[(\widehat{f} (x)-E[\widehat{f} (x)])^{2} ] \]

\[ \textrm{Var}[ \widehat{f} (x)]=E[\widehat{f} (x)^2]-E[\widehat{f} (x)]^2 \]

Bias is defined as follows:

\[ \textrm{Bias}[ \widehat{f} (x)]=E[\widehat{f}(x)-f(x)]=E[\widehat{f}(x)]-f(x) \]

One could think of the bias as a measure of the estimator's ability to approximate the true function. Typically, reducing bias results in increased variance and vice versa.

\(E[X]\) denotes the expected value; it can be estimated using a sample mean, since the mean is an unbiased estimator of the expected value.
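The tradeoff between the two can be made precise: for a fixed point \(x\), the expected squared error of the estimator decomposes exactly into these two quantities:

\[ E[(\widehat{f}(x)-f(x))^{2}] = \textrm{Bias}[\widehat{f}(x)]^{2}+\textrm{Var}[\widehat{f}(x)] \]

so any decrease in one term must be weighed against an increase in the other.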

We can estimate variance and bias by bootstrapping the original training dataset: sampling row indexes of the original data frame with replacement, then drawing the rows corresponding to those indexes to obtain new data frames. This operation is repeated `nsampl` times, where `nsampl` is a parameter giving the number of bootstrap samples.

Variance and bias are estimated per observation, that is, for each row of the original dataset (we calculate variance and bias over the predictions made on the bootstrap samples). We then obtain a vector of variances and a vector of biases, each of the same length as the number of observations in the original dataset. For the purposes of this article, a mean value was calculated for each of these two vectors; we will treat these two means as our estimates of mean bias and mean variance. If we don't want to measure the direction of the bias, we can take absolute values of the bias.

Because bias and variance can be controlled by parameters passed to the `rpart` function, we can also survey how these parameters affect tree variance. The most commonly used parameters are `cp` (the complexity parameter), which describes how much each split must decrease the overall lack of fit in order to be attempted, and `minsplit`, which defines the minimum number of observations in a node needed to attempt a split.

The operations described above are computationally expensive: we need to create `nsampl` bootstrap samples, grow `nsampl` trees, calculate `nsampl` predictions plus `nrow` variances and `nrow` biases, and repeat all of this for each parameter value (the length of the vector of `cp` or `minsplit` values). For that reason the foreach package was used, to take advantage of parallelism. The procedure still can't be considered fast, but it was much faster than without the foreach package.

So, summing up, the procedure looks as follows:

- Create bootstrap samples (by resampling the original dataset with replacement)
- Train a model on each of these bootstrap datasets
- Calculate the mean of the trees' predictions (for each observation) and compare these means with the values in the original dataset (in other words, calculate the bias for each row)
- Calculate the variance of the predictions for each row (estimating the variance of the estimator, the regression tree)
- Calculate the mean bias (or mean absolute bias) and the mean variance
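A minimal sketch of this procedure in R, using a toy dataset and illustrative parameter values (not the author's exact code):

```r
library(rpart)

set.seed(1)
df <- data.frame(x = runif(200))
df$y <- sin(2 * pi * df$x) + rnorm(200, sd = 0.3)

nsampl <- 50
preds <- matrix(NA, nrow = nrow(df), ncol = nsampl)

for (s in 1:nsampl) {
  idx  <- sample(nrow(df), replace = TRUE)          # bootstrap indexes
  tree <- rpart(y ~ x, data = df[idx, ],
                control = rpart.control(cp = 0.01, minsplit = 20))
  preds[, s] <- predict(tree, newdata = df)         # predict on the original rows
}

bias_per_row <- rowMeans(preds) - df$y              # bias estimate per observation
var_per_row  <- apply(preds, 1, var)                # variance estimate per observation

mean(abs(bias_per_row))  # mean absolute bias
mean(var_per_row)        # mean variance
```

Wrapping the loop over `cp` or `minsplit` values in a `foreach() %dopar%` construct, as the article describes, parallelizes the expensive outer repetition.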

*by Błażej Moska, computer science student and data science intern*

Suppose that we have performed K-means clustering in R and are satisfied with our results, but later realize that it would also be useful to have a membership matrix. Of course it would be easier to repeat the clustering using one of the fuzzy k-means functions available in R (such as `fanny`), but since that is a slightly different implementation, the results could also differ, and for some reason we don't want them to change. Knowing the equation, we can construct this matrix ourselves after using the `kmeans` function. The equation is defined as follows (source: Wikipedia):

$$w_{ij} = \frac{1}{ \sum_ {k=1}^c ( \frac{ \| x_{i} - c_{j} \| }{ \| x_{i} - c_{k} \| }) ^{ \frac{2}{m-1} } } $$

\(w_{ij}\) denotes to what extent the \(i\)th object belongs to the \(j\)th cluster. So the number of rows of this matrix equals the number of observations, and the number of columns equals the number of clusters. \(m\) is the fuzziness parameter, typically set to \(m=2\). The \(w_{ij}\) values range between 0 and 1, so they are easy and convenient to compare. In this example I will use \(m = 2\), with \( \| \cdot \| \) the Euclidean distance.

To make the computations faster I also used the Rcpp package, and then compared the speed of the function written in C++ with that of the function written in pure R.

In both implementations `for` loops were used. Although this is not a commonly used approach in R (see this blog post for more information and alternatives), in this case I find it more convenient.

```cpp
#include <Rcpp.h>
#include <math.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix fuzzyClustering(NumericMatrix data, NumericMatrix centers, int m) {
  /* data: a matrix with observations (rows) and variables (columns);
     centers: a matrix of cluster center coordinates;
     m: the fuzziness parameter of the equation;
     c: the number of clusters */
  int c = centers.rows();
  int rows = data.rows();
  int cols = data.cols();  // number of variables, the same as in the centers matrix
  double tempDist = 0;     // temporary Euclidean distance for the numerator
  double dist = 0;         // temporary Euclidean distance for the denominator
  double denominator = 0;  // denominator of the membership equation
  NumericMatrix result(rows, c);  // the membership matrix

  for (int i = 0; i < rows; i++) {
    for (int j = 0; j < c; j++) {
      for (int k = 0; k < c; k++) {
        for (int p = 0; p < cols; p++) {
          // accumulate squared coordinate differences;
          // tempDist feeds the numerator inside the sum operator,
          // dist feeds the denominator inside the sum operator
          tempDist = tempDist + pow(centers(j, p) - data(i, p), 2);
          dist = dist + pow(centers(k, p) - data(i, p), 2);
        }
        tempDist = sqrt(tempDist);
        dist = sqrt(dist);
        // 2.0 keeps the exponent from being truncated by integer division
        denominator = denominator + pow(tempDist / dist, 2.0 / (m - 1));
        tempDist = 0;
        dist = 0;
      }
      result(i, j) = 1 / denominator;  // numerator/denominator of the main equation
      denominator = 0;
    }
  }
  return result;
}
```

We can save this in a file with .cpp extension. To compile it from R we can write:

```r
library(Rcpp)
sourceCpp("path_to_cpp_file")
```

If everything goes right, our function `fuzzyClustering` will then be available from R.

For comparison, the same function written in pure R:

```r
fuzzyClustering <- function(data, centers, m) {
  c <- nrow(centers)
  rows <- nrow(data)
  cols <- ncol(data)
  result <- matrix(0, nrow = rows, ncol = c)  # the membership matrix
  denominator <- 0
  for (i in 1:rows) {
    for (j in 1:c) {
      # Euclidean distance, the numerator inside the sum operator
      tempDist <- sqrt(sum((centers[j, ] - data[i, ])^2))
      for (k in 1:c) {
        # Euclidean distance, the denominator inside the sum operator
        Dist <- sqrt(sum((centers[k, ] - data[i, ])^2))
        denominator <- denominator + (tempDist / Dist)^(2 / (m - 1))
      }
      result[i, j] <- 1 / denominator  # insert the membership value
      denominator <- 0
    }
  }
  return(result)
}
```

The result looks as follows. Columns are cluster numbers (in this case 10 clusters were created); rows are our objects (observations). Values were rounded to the third decimal place, so the row sums may differ slightly from 1:

```
          1     2     3     4     5     6     7     8     9    10
 [1,] 0.063 0.038 0.304 0.116 0.098 0.039 0.025 0.104 0.025 0.188
 [2,] 0.109 0.028 0.116 0.221 0.229 0.080 0.035 0.116 0.017 0.051
 [3,] 0.067 0.037 0.348 0.173 0.104 0.066 0.031 0.095 0.018 0.062
 [4,] 0.016 0.015 0.811 0.049 0.022 0.017 0.009 0.023 0.007 0.031
 [5,] 0.063 0.048 0.328 0.169 0.083 0.126 0.041 0.079 0.018 0.045
 [6,] 0.069 0.039 0.266 0.226 0.102 0.111 0.037 0.084 0.017 0.048
 [7,] 0.045 0.039 0.569 0.083 0.060 0.046 0.025 0.071 0.015 0.046
 [8,] 0.070 0.052 0.399 0.091 0.093 0.054 0.034 0.125 0.022 0.062
 [9,] 0.095 0.037 0.198 0.192 0.157 0.088 0.038 0.121 0.019 0.055
[10,] 0.072 0.024 0.132 0.375 0.148 0.059 0.025 0.081 0.015 0.067
```
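As a quick sanity check, the same membership equation can also be computed in vectorized form after a `kmeans` run; each row of the resulting matrix should sum to 1 before rounding (a standalone sketch with simulated data and 3 clusters rather than 10):

```r
set.seed(42)
dat <- matrix(rnorm(300), ncol = 3)   # 100 observations, 3 variables
fit <- kmeans(dat, centers = 3)

n <- nrow(dat); k <- nrow(fit$centers); m <- 2

# Euclidean distances between each observation and each cluster center
d <- as.matrix(dist(rbind(dat, fit$centers)))[1:n, (n + 1):(n + k)]

# the membership equation, vectorized:
# w_ij = d_ij^(-2/(m-1)) / sum_k d_ik^(-2/(m-1))
w <- d^(-2 / (m - 1)) / rowSums(d^(-2 / (m - 1)))

range(rowSums(w))  # both ends should be numerically equal to 1
```

This version relies on R recycling the row-sum vector down the columns of the distance matrix, so it gives the same matrix as the loop-based implementations above.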

Shown below is the output of `system.time` for the C++ and R versions, run against a simulated matrix with 30,000 observations, 3 variables and 10 clusters.

The hardware I used was a low-cost notebook, an Asus R556L with an Intel Core i3-5010 2.1 GHz processor and 8 GB of DDR3 1600 MHz RAM.

C++ version:

```
   user  system elapsed
   0.32    0.00    0.33
```

R version:

```
   user  system elapsed
  15.75    0.02   15.94
```

In this example, the function written in C++ executed about 50 times faster than the equivalent function written in pure R.

Making your code run faster is often the primary goal when using parallel programming techniques in R, but sometimes the effort of converting your code to use a parallel framework leads only to disappointment, at least initially. Norman Matloff, author of *Parallel Computing for Data Science: With Examples in R, C++ and CUDA*, has shared chapter 2 of that book online, and it describes some of the issues that can lead to poor performance. They include:

- Communications overhead, particularly an issue with fine-grained parallelism consisting of a very large number of relatively small tasks;
- Load balance, where the computing resources aren't contributing equally to the problem;
- Impacts from use of RAM and virtual memory, such as cache misses and page faults;
- Network effects, such as latency and bandwidth, that impact performance and communication overhead;
- Interprocess conflicts and thread scheduling;
- Data access and other I/O considerations.
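The first of these is easy to demonstrate with the base parallel package: when each task is tiny, shipping data to and from the workers can cost more than the computation itself (a sketch; the timings, not the results, will vary by machine):

```r
library(parallel)

cl <- makeCluster(2)

x <- as.list(1:10000)

# fine-grained parallelism: 10,000 tiny tasks, so communication overhead dominates
t_par <- system.time(r1 <- parLapply(cl, x, function(v) v + 1))

# plain sequential version, for comparison
t_seq <- system.time(r2 <- lapply(x, function(v) v + 1))

stopCluster(cl)

identical(r1, r2)  # TRUE: same results, often very different timings
```

On a typical machine the parallel version of this trivial loop is slower than the sequential one, which is exactly the fine-grained-overhead effect Matloff describes; the cure is to batch the work into fewer, larger tasks.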

The chapter is well worth a read for anyone writing parallel code in R (or indeed any programming language). It's also worth checking out Norm Matloff's keynote from the useR!2017 conference, embedded below.

Norm Matloff: Understanding overhead issues in parallel computation

At the R/Finance conference last month, I demonstrated how to operationalize models developed in Microsoft R Server as web services using the mrsdeploy package. Then, I used that deployed model to generate predictions for loan delinquency, using a Python script as the client. (You can see slides here, and a video of the presentation below.)

With Microsoft R Server 9.1, there are now two ways to operationalize models as a Web service or as a SQL Server stored procedure:

- **Flexible Operationalization**: Deploy any R script or function.
- **Real-Time Operationalization**: Deploy model objects generated by specific functions in Microsoft R; this generates predictions much more quickly by bypassing the R interpreter.

In the demo, which begins at the 10:00 mark in the video below, you can see a comparison of using the two types of deployment. Ultimately, I was able to generate predictions from a random forest at a rate of 1M predictions per second, with three Python clients simultaneously drawing responses from the server (an Azure GS5 instance running the Windows Data Science VM).

If you'd like to try out this capability yourself, you can find the R and Python scripts used in the demo at this Github repository. The lending club data is available here, and the script used to featurize the data is here.