This post is to announce that version 1.5.0 of the foreach package is now on CRAN. Foreach is an idiom that allows for iterating over elements in a collection, without the use of an explicit loop counter. The foreach package is now more than 10 years old, and is used by nearly 700 packages across CRAN and Bioconductor. (If you're interested in an overview of the foreach package and its history, this RStudio::conf talk by Bryan Lewis is worth watching.)

The main change in 1.5.0 is intended to help reduce errors associated with modifying global variables. Now, `%dopar%` loops with a sequential backend will evaluate the loop body inside a local environment. Why make that change? Let's illustrate with an example:

```
library(foreach)
registerDoSEQ()
a <- 0
foreach(i=1:10) %dopar% {a <- a + 1}
a
## [1] 0
```

Here, the assignment inside the `%dopar%` loop is to a *local* copy of `a`, so the global variable `a` remains unchanged. The reason for this change is that `%dopar%` is intended for use in a parallel context, where modifying the global environment doesn’t make sense: the work will be taking place in different R processes, sometimes on different physical machines, possibly in the cloud. In this context there is no shared global environment to manipulate, unlike the case of a sequential backend.

Because of this, it’s almost always a mistake to modify global variables from `%dopar%`, even if it used to succeed. This change will hopefully reduce the incidence of programming errors where people prototype their code with a sequential backend, only to have it fail when they use it with a real (parallel) backend.

Note that the behaviour of the `%do%` operator, which is intended for a sequential backend, remains the same. It matches that of a regular `for` loop:

```
a <- 0
foreach(i=1:10) %do% {a <- a + 1}
a
## [1] 10
```
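If the goal is to accumulate a result across iterations, one approach that works with any backend is to have each iteration return a value and let foreach combine them. The following is a minimal sketch using the `.combine` argument; the sequential backend is registered only so the example runs anywhere.

```r
library(foreach)
registerDoSEQ()

# Each iteration returns a value; .combine folds the results together,
# so no global variable needs to be modified inside the loop body.
total <- foreach(i = 1:10, .combine = `+`) %dopar% {
  1
}
total
## [1] 10
```

Because the loop body has no side effects, the same code should give the same answer after registering a parallel backend.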

If you have any questions or comments, please email me or open an issue at the GitHub repo.

*by Hong Ooi, Senior Data Scientist at Microsoft and maintainer of the foreach package*

This post is to announce some new and upcoming changes in the foreach package.

First, foreach can now be found on GitHub! The repository is at https://github.com/RevolutionAnalytics/foreach, replacing its old home on R-Forge. Right now the repo hosts both the foreach and iterators packages, but that may change later.

The latest 1.4.8 version of foreach, which is now live on CRAN, adds preliminary support for evaluating `%dopar%` expressions in a local environment when a sequential backend is used. This addresses a long-standing inconsistency in the behaviour of `%dopar%` with parallel and sequential backends, where the latter would evaluate the loop body in the global environment by default. This is a common source of bugs: code that works when prototyped with a sequential backend mysteriously fails with a “real” parallel backend.

From version 1.4.8, the behaviour of `%dopar%` can be controlled with

`options(foreachDoparLocal=TRUE|FALSE)`

or equivalently via the system environment variable

`R_FOREACH_DOPAR_LOCAL=TRUE|FALSE`

with the R option taking its value from the environment variable. The current default value is FALSE, which retains the pre-existing behaviour. It is intended that over time this will be changed to TRUE.
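As a rough illustration, opting in to the new behaviour might look like the sketch below. It assumes foreach 1.4.8 or later, and the variable `old` is just a name chosen here to hold the previous option value.

```r
library(foreach)
registerDoSEQ()

# Opt in to local evaluation of %dopar% loop bodies
old <- options(foreachDoparLocal = TRUE)

a <- 0
invisible(foreach(i = 1:10) %dopar% {a <- a + 1})
a  # the assignment happened in a local environment, so this `a` is unchanged

options(old)  # restore the previous setting
```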

A side-effect of this change is that `%do%` and `%dopar%` will (eventually) behave differently for a sequential backend. See this GitHub issue for more discussion on this topic.

In the background, the repo has also been updated to use modern tooling such as Roxygen, RMarkdown and testthat. None of these should affect how the package works, although there are some minor changes to documentation formats (in particular, the vignettes are now in HTML format rather than PDF).

Some further changes are also planned down the road, to better integrate foreach with the future package by Henrik Bengtsson. See this GitHub issue for further details.

Please feel free to leave comments, bug reports and pull requests at the foreach repo, or you can contact me directly at hongooi@microsoft.com.

The future package is a powerful and elegant cross-platform framework for orchestrating asynchronous computations in R. It's ideal for working with computations that take a long time to complete; that would benefit from using distributed, parallel frameworks to make them complete faster; and that you'd rather not have locking up your interactive R session. You can get a good sense of the future package from its introductory vignette or from this eRum 2018 presentation by its author, Henrik Bengtsson (video embedded below), but at its simplest it allows constructs in R like this:

```
a %<-% slow_calculation(1:50)
b %<-% slow_calculation(51:100)
a+b
```

The idea here is that `slow_calculation` is an R function that takes a lot of time, but with the special `%<-%` assignment operator the computation begins *and the R interpreter is ready immediately*. The first two lines of R code above take essentially zero time to execute. The future package farms off those computations to another process or even a remote system (you specify which with a preceding `plan` call), and R will only halt when the *result* is needed, as in the third line above. This is beneficial in Bengtsson's own work, where he uses the future package to parallelize cancer research on DNA sequences on high-performance computing (HPC) clusters.
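To make the role of `plan` concrete, here is a small self-contained sketch. The `slow_calculation` function is a hypothetical stand-in (it just sleeps for a second and sums its input), and `multisession` is one of several plans you could choose.

```r
library(future)
plan(multisession)  # evaluate futures in background R sessions

# Hypothetical stand-in for a long-running function
slow_calculation <- function(x) {
  Sys.sleep(1)
  sum(x)
}

a %<-% slow_calculation(1:50)    # returns immediately
b %<-% slow_calculation(51:100)  # returns immediately
result <- a + b                  # blocks here until both values are available
result
## [1] 5050

plan(sequential)  # switch back to ordinary sequential evaluation
```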

The future package supports a wide variety of computation frameworks including parallel local R sessions, remote R sessions, and cluster computing frameworks. (If you can't use any of these, it falls back to evaluating the expressions locally, in sequence.) The future package also works in concert with other parallel programming systems already available in R. For example, it provides `future_lapply` as a futurized analog of `lapply`, which will use whatever computation plan you have defined to run the computations in parallel.
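A short sketch of `future_lapply` follows. One assumption to flag: in current releases the function is, as far as I know, provided by the companion future.apply package rather than by future itself, so that is what the example loads.

```r
library(future.apply)  # assumption: future_lapply now lives in this companion package
plan(multisession)

# Same call shape as lapply(), but the elements are processed
# according to whatever plan is currently set
squares <- future_lapply(1:4, function(i) i^2)
unlist(squares)
## [1]  1  4  9 16

plan(sequential)
```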

The future package also extends the foreach package thanks to the updated doFuture package. By using `registerDoFuture` as the foreach backend, your loops can use any computation plan provided by the future package to run the iterations in parallel. (The same applies to R packages that use foreach internally, notably the caret package.) This means you can now use foreach with any of the HPC schedulers supported by future, which includes TORQUE, Slurm, and OpenLava. So if you share a Slurm HPC cluster with colleagues in your department, you can queue up a parallel simulation on the cluster using code like this:

```
library("doFuture")
registerDoFuture()

library("future.batchtools")
plan(batchtools_slurm)  # the Slurm backend exported by future.batchtools

mu <- 1.0
sigma <- 2.0
x <- foreach(i = 1:3, .export = c("mu", "sigma")) %dopar% {
  rnorm(i, mean = mu, sd = sigma)
}
```
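If you don't have a scheduler to hand, the same loop can be tried on a single machine by swapping the plan. This is only a sketch: `multisession` is used here as a local substitute for the Slurm plan, and doFuture may warn that the random numbers are not generated in a parallel-safe way.

```r
library(doFuture)
registerDoFuture()
plan(multisession)  # local background R sessions instead of a cluster

mu <- 1.0
sigma <- 2.0
x <- foreach(i = 1:3) %dopar% {
  rnorm(i, mean = mu, sd = sigma)
}
lengths(x)  # iteration i returns i random draws
## [1] 1 2 3

plan(sequential)
```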

The future package is available on CRAN now, and works consistently on Windows, Mac and Linux systems. You can learn more in the video at the end of this post, or in the recent blog update linked below.

I had a great time in Budapest last week for the eRum 2018 conference. The organizers have already made all of the videos available online. Here's my presentation: Speeding up R with Parallel Programming in the cloud.

You can find (and download) my presentation slides here. And if you just want the references from the last slide, here are the links:

*by JS Tan (Program Manager, Microsoft)*

The R language is by far the most popular statistical language, and has seen massive adoption in both academia and industry. In our new data-centric economy, the models and algorithms that data scientists build in R are not just being used for research and experimentation. They are now also being deployed into production environments, and directly into products themselves.

However, taking your workload in R and deploying it at production capacity, and at scale, is no trivial matter. Because of R's rich and robust package ecosystem, and the many versions of R, reproducing the environment of your local machine in a production setting can be challenging. Let alone ensuring your model's reproducibility!

This is why using containers is extremely important when it comes to operationalizing your R workloads. I'm happy to announce that the doAzureParallel package, powered by Azure Batch, now supports fully containerized deployments. With this migration, doAzureParallel will not only help you scale out your workloads, but will also do it in a completely containerized fashion, letting you bypass the complexities of dealing with inconsistent environments. Now that doAzureParallel runs on containers, we can ensure a consistent immutable runtime while handling custom R versions, environments, and packages.

By default, the container used in doAzureParallel is the 'rocker/tidyverse:latest' container that is developed and maintained as part of the rocker project. For most cases, and especially for beginners, this image will contain most of what is needed. However, as users become more experienced or have more complex deployment requirements, they may want to change the Docker image that is used, or even build their own. doAzureParallel supports both those options, giving you flexibility (without any compromise on reliability). Configuring the Docker image is easy. Once you know which Docker image you want to use, you can simply specify its location in the cluster configuration and doAzureParallel will just know to use it when provisioning subsequent clusters. More details on configuring your Docker container settings with doAzureParallel are included in the documentation.

With this release, we hope to unblock many users who are looking to take their R models and scale them up in the cloud. To get started with doAzureParallel, visit our GitHub page. Please give it a try and let us know if you have questions, feedback, or suggestions, either on GitHub or via email at razurebatch@microsoft.com.

Github (Azure): doAzureParallel

*by Błażej Moska, computer science student and data science intern*

One of the most important things in predictive modelling is how our algorithm will cope with various datasets, both training and testing (previously unseen). This is strictly connected with the concept of the bias-variance tradeoff.

Roughly speaking, the variance of an estimator describes how much the estimator's value varies from dataset to dataset. It's defined as follows:

\[ \textrm{Var}[ \widehat{f} (x) ]=E[(\widehat{f} (x)-E[\widehat{f} (x)])^{2} ] \]

\[ \textrm{Var}[ \widehat{f} (x)]=E[\widehat{f} (x)^2]-E[\widehat{f} (x)]^2 \]
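The second form follows from the first by expanding the square and using the linearity of expectation (note that \(E[\widehat{f}(x)]\) is a constant):

\[ E[(\widehat{f}(x)-E[\widehat{f}(x)])^{2}] = E[\widehat{f}(x)^2] - 2E[\widehat{f}(x)]\,E[\widehat{f}(x)] + E[\widehat{f}(x)]^2 = E[\widehat{f}(x)^2]-E[\widehat{f}(x)]^2 \]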

Bias is defined as follows:

\[ \textrm{Bias}[ \widehat{f} (x)]=E[\widehat{f}(x)-f(x)]=E[\widehat{f}(x)]-f(x) \]

One could think of bias as a measure of how well the estimator is able to approximate the true function. Typically, reducing bias results in increased variance and vice versa.

\(E[X]\) is an expected value. It can be estimated using a mean, since the mean is an unbiased estimator of the expected value.

We can estimate variance and bias by bootstrapping the original training dataset, that is, by sampling with replacement the indexes of the original dataframe, then drawing the rows which correspond to these indexes to obtain new dataframes. This operation was repeated `nsampl` times, where `nsampl` is the parameter describing the number of bootstrap samples.

Variance and bias are estimated for one value at a time, that is to say, for one observation/row of the original dataset (we calculate variance and bias over rows of predictions made on bootstrap samples). We then obtain a vector of variances and a vector of biases. Each vector is of the same length as the number of observations in the original dataset. For the purpose of this article, a mean value was calculated for each of these two vectors. We will treat these two means as our estimates of mean bias and mean variance. If we don't want to measure the direction of the bias, we can take absolute values of the bias.

Because bias and variance can be controlled by parameters sent to the `rpart` function, we can also survey how these parameters affect tree variance. The most commonly used parameters are `cp` (complexity parameter), which describes how much each split must decrease the overall variance of the decision variable in order to be attempted, and `minsplit`, which defines the minimum number of observations needed to attempt a split.
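As a quick illustration of how these two parameters are passed, the sketch below fits two trees that differ only in `cp` and counts their leaves. The `mtcars` dataset and the particular parameter values are stand-ins chosen for the example, not the data or settings used in this article.

```r
library(rpart)

# mtcars is only a stand-in dataset; mpg is the response
loose <- rpart(mpg ~ ., data = mtcars,
               control = rpart.control(cp = 0.1, minsplit = 10))
tight <- rpart(mpg ~ ., data = mtcars,
               control = rpart.control(cp = 0.001, minsplit = 10))

# Count terminal nodes (leaves) in each fitted tree
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")
c(cp_0.1 = n_leaves(loose), cp_0.001 = n_leaves(tight))
```

A smaller `cp` lets more splits through, so the second tree is at least as large as the first; larger trees tend to have lower bias and higher variance.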

The operations mentioned above are rather expensive in computational terms: we need to create `nsampl` bootstrap samples, grow `nsampl` trees, calculate `nsampl` predictions, `nrow` variances and `nrow` biases, and repeat those operations for each parameter value (the length of the vector `cp` or `minsplit`). For that reason the foreach package was used, to take advantage of parallelism. The above procedure still can't be considered fast, but it was much faster than without the foreach package.

So, summing up, the procedure looks as follows:

- Create bootstrap samples (by bootstrapping original dataset)
- Train model on each of these bootstrap datasets
- Calculate mean of predictions of these trees (for each observation) and compare these predictions with values of the original datasets (in other words, calculate bias for each row)
- Calculate variance of predictions for each row (estimate variance of an estimator-regression tree)
- Calculate mean bias/absolute bias and mean variance
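The steps above can be sketched roughly as follows. This is not the code used for the article: `mtcars` is a stand-in dataset, the `cp` and `minsplit` values are arbitrary, and `%do%` is used so the example runs without a parallel backend (with a backend registered, `%dopar%` could be used instead).

```r
library(rpart)
library(foreach)

set.seed(42)
nsampl <- 50     # number of bootstrap samples
dat <- mtcars    # stand-in data; mpg is the response

# One column of predictions (for the original rows) per bootstrap tree
preds <- foreach(b = seq_len(nsampl), .combine = cbind) %do% {
  idx <- sample(nrow(dat), replace = TRUE)        # bootstrap row indexes
  fit <- rpart(mpg ~ ., data = dat[idx, ],
               control = rpart.control(cp = 0.01, minsplit = 10))
  predict(fit, newdata = dat)
}

# Bias per row: mean prediction minus the value in the original dataset
bias_per_row <- rowMeans(preds) - dat$mpg
# Variance per row: variance of the predictions across bootstrap trees
var_per_row <- apply(preds, 1, var)

c(mean_bias     = mean(bias_per_row),
  mean_abs_bias = mean(abs(bias_per_row)),
  mean_variance = mean(var_per_row))
```

To survey the effect of `cp` or `minsplit`, the whole block would be repeated for each parameter value, which is where most of the computation time goes.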

*by Błażej Moska, computer science student and data science intern*

Suppose that we have performed K-means clustering in R and are satisfied with our results, but later we realize that it would also be useful to have a membership matrix. Of course it would be easier to repeat the clustering using one of the fuzzy k-means functions available in R (like `fanny`, for example), but since that is a slightly different implementation the results could also be different, and for some reason we don’t want them to change. Knowing the equation, we can construct this matrix on our own after using the `kmeans` function. The equation is defined as follows (source: Wikipedia):

$$w_{ij} = \frac{1}{ \sum_ {k=1}^c ( \frac{ \| x_{i} - c_{j} \| }{ \| x_{i} - c_{k} \| }) ^{ \frac{2}{m-1} } } $$

\(w_{ij}\) denotes to what extent the \(i\)th object belongs to the \(j\)th cluster. So the total number of rows of this matrix equals the number of observations, and the total number of columns equals the number of clusters. \(m\) is a parameter, typically set to \(m=2\). \(w_{ij}\) values range between 0 and 1, so they are easy and convenient to compare. In this example I will use \(m = 2\), so the exponent \(2/(m-1)\) equals 2 and the ratios of Euclidean distances are squared.

To make computations faster I also used the Rcpp package, then I compared the execution speed of the function written in R with that of the one written in C++.

In both implementations `for` loops were used. Although this is not a commonly used method in R (see this blog post for more information and alternatives), in this case I find it more convenient.

```
#include <Rcpp.h>
#include <math.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix fuzzyClustering(NumericMatrix data, NumericMatrix centers, int m) {
  /* data is a matrix with observations (rows) and variables,
     centers is a matrix with cluster center coordinates,
     m is a parameter of the equation,
     c is the number of clusters */
  int c = centers.rows();
  int rows = data.rows();
  int cols = data.cols(); /* number of columns equals number of variables, the same as in the centers matrix */
  double tempDist = 0;    /* dist and tempDist are variables storing temporary Euclidean distances */
  double dist = 0;
  double denominator = 0; // denominator of the "main" equation
  NumericMatrix result(rows, c); // declaration of the matrix of results

  for (int i = 0; i < rows; i++) {
    for (int j = 0; j < c; j++) {
      for (int k = 0; k < c; k++) {
        for (int p = 0; p < cols; p++) {
          // in the innermost loop the Euclidean distances are calculated
          tempDist = tempDist + pow(centers(j, p) - data(i, p), 2);
          dist = dist + pow(centers(k, p) - data(i, p), 2);
          /* tempDist is the numerator inside the sum operator in the equation,
             dist is the denominator inside the sum operator in the equation */
        }
        tempDist = sqrt(tempDist);
        dist = sqrt(dist);
        // 2.0 forces floating-point division; the integer expression 2/(m-1) truncates to 0 for m > 3
        denominator = denominator + pow((tempDist / dist), (2.0 / (m - 1)));
        tempDist = 0;
        dist = 0;
      }
      result(i, j) = 1 / denominator; // numerator/denominator in the main equation
      denominator = 0;
    }
  }
  return result;
}
```

We can save this in a file with .cpp extension. To compile it from R we can write:

```
library(Rcpp)
sourceCpp("path_to_cpp_file")
```

If everything goes right, our function `fuzzyClustering` will then be available from R. For comparison, here is the same function written in pure R:

```
fuzzyClustering <- function(data, centers, m) {
  c <- nrow(centers)
  rows <- nrow(data)
  cols <- ncol(data)
  result <- matrix(0, nrow = rows, ncol = c)  # defining membership matrix
  denominator <- 0
  for (i in 1:rows) {
    for (j in 1:c) {
      # Euclidean distance, numerator inside the sum operator
      tempDist <- sqrt(sum((centers[j, ] - data[i, ])^2))
      for (k in 1:c) {
        # Euclidean distance, denominator inside the sum operator
        Dist <- sqrt(sum((centers[k, ] - data[i, ])^2))
        # denominator of the equation
        denominator <- denominator + ((tempDist / Dist)^(2 / (m - 1)))
      }
      result[i, j] <- 1 / denominator  # inserting value into membership matrix
      denominator <- 0
    }
  }
  return(result)
}
```
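For readers who prefer to avoid explicit loops, the same membership matrix can be computed in a vectorised way by noting that \(w_{ij}\) is proportional to \(\| x_i - c_j \|^{-2/(m-1)}\), normalised over clusters. The sketch below is an alternative formulation, not the code that was timed here; the function name `fuzzy_membership` and the use of `iris` are assumptions made for the example.

```r
# Vectorised sketch of the membership formula (hypothetical helper name)
fuzzy_membership <- function(data, centers, m = 2) {
  data <- as.matrix(data)
  # d[i, j] = Euclidean distance from observation i to center j
  d <- sapply(seq_len(nrow(centers)), function(j) {
    sqrt(rowSums(sweep(data, 2, centers[j, ])^2))
  })
  # w_ij = d_ij^(-2/(m-1)) / sum_k d_ik^(-2/(m-1));
  # this breaks down if an observation coincides exactly with a center (d = 0)
  u <- d^(-2 / (m - 1))
  u / rowSums(u)
}

set.seed(1)
x <- as.matrix(iris[, 1:4])   # stand-in data
km <- kmeans(x, centers = 3)
w <- fuzzy_membership(x, km$centers, m = 2)

dim(w)             # one row per observation, one column per cluster
range(rowSums(w))  # each row should sum to 1, up to floating-point error
```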

The result looks as follows. Columns are cluster numbers (in this case 10 clusters were created), rows are our objects (observations). Values were rounded to the third decimal place, so the row sums may differ slightly from 1:

```
          1     2     3     4     5     6     7     8     9    10
 [1,] 0.063 0.038 0.304 0.116 0.098 0.039 0.025 0.104 0.025 0.188
 [2,] 0.109 0.028 0.116 0.221 0.229 0.080 0.035 0.116 0.017 0.051
 [3,] 0.067 0.037 0.348 0.173 0.104 0.066 0.031 0.095 0.018 0.062
 [4,] 0.016 0.015 0.811 0.049 0.022 0.017 0.009 0.023 0.007 0.031
 [5,] 0.063 0.048 0.328 0.169 0.083 0.126 0.041 0.079 0.018 0.045
 [6,] 0.069 0.039 0.266 0.226 0.102 0.111 0.037 0.084 0.017 0.048
 [7,] 0.045 0.039 0.569 0.083 0.060 0.046 0.025 0.071 0.015 0.046
 [8,] 0.070 0.052 0.399 0.091 0.093 0.054 0.034 0.125 0.022 0.062
 [9,] 0.095 0.037 0.198 0.192 0.157 0.088 0.038 0.121 0.019 0.055
[10,] 0.072 0.024 0.132 0.375 0.148 0.059 0.025 0.081 0.015 0.067
```

Shown below is the output of `system.time` for the C++ and R versions, running against a simulated matrix with 30000 observations, 3 variables and 10 clusters.

The hardware I used was a low-cost Asus R556L notebook with an Intel Core i3-5010 2.1 GHz processor and 8 GB of DDR3 1600 MHz RAM.

C++ version:

```
   user  system elapsed
   0.32    0.00    0.33
```

R version:

```
   user  system elapsed
  15.75    0.02   15.94
```

In this example, the function written in C++ executed about 50 times faster than the equivalent function written in pure R.