by John Mount Ph.D.

Data Scientist at Win-Vector LLC

Let's talk about the use and benefits of parallel computation in R.

IBM's Blue Gene/P massively parallel supercomputer (Wikipedia).

"Parallel computing is a type of computation in which many calculations are carried out simultaneously."

Wikipedia quoting: Gottlieb, Allan; Almasi, George S. (1989). Highly parallel computing

The reason we care is: by making the computer work harder (perform many calculations simultaneously) we wait less time for our experiments and can run more experiments. This is especially important when doing data science (as we often do using the R analysis platform) as we often need to repeat variations of large analyses to learn things, infer parameters, and estimate model stability. Typically, to get the computer to work harder, the analyst, programmer, or library designer must themselves work a bit harder to arrange calculations in a parallel-friendly manner. In the best circumstances somebody has already done this for you:

- Good parallel libraries, such as the multi-threaded BLAS/LAPACK libraries included in Revolution R Open (RRO, now Microsoft R Open) (see here).
- Specialized parallel extensions that supply their own high performance implementations of important procedures such as rx methods from RevoScaleR or h2o methods from h2o.ai.
- Parallelization abstraction frameworks such as Thrust/Rth (see here).
- R application libraries that deal with parallelism on their own (examples include gbm, boot and our own vtreat). (Some of these libraries do not attempt parallel operation until you specify a parallel execution environment.)

In addition to having a task ready to "parallelize" you need a facility willing to work on it in a parallel manner. Examples include:

- Your own machine. Even a laptop computer usually now has four or more cores. Potentially running four times faster, or equivalently waiting only one fourth the time, is big.
- Graphics processing units (GPUs). Many machines have one or more powerful graphics cards already installed. For some numerical tasks these cards are 10 to 100 times faster than the basic Central Processing Unit (CPU) you normally use for computation (see here).
- Clusters of computers (such as Amazon EC2, Hadoop backends and more).

Obviously parallel computation with R is a vast and specialized topic. It can seem impossible to quickly learn how to use all this magic to run your own calculation more quickly. In this tutorial we will demonstrate how to speed up a calculation of your own choosing using basic R. To read on please click here.
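As a minimal sketch of the idea (not the tutorial's own code), the base `parallel` package can spread independent model fits over local cores. The bootstrap task, the `fit_one` helper and the worker count are illustrative assumptions.

```r
library(parallel)  # ships with base R

# Hypothetical task: refit a small regression on bootstrap resamples of mtcars.
fit_one <- function(seed) {
  set.seed(seed)
  rows <- sample(nrow(mtcars), replace = TRUE)
  coef(lm(mpg ~ wt, data = mtcars[rows, ]))[["wt"]]
}

# Forking is unavailable on Windows, so fall back to one worker there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

seeds <- 1:20
serial_fit   <- lapply(seeds, fit_one)                       # one core
parallel_fit <- mclapply(seeds, fit_one, mc.cores = n_cores) # forked workers

# Both routes should give the same estimates; only the waiting time differs.
identical(serial_fit, parallel_fit)
```

On Windows, where forking is unavailable, `parLapply()` with a cluster from `makeCluster()` in the same package plays the equivalent role.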

*by Edward Ma and Vishrut Gupta (Hewlett Packard Enterprise)*

A few weeks ago, we revealed **ddR** (Distributed Data-structures in R), an exciting new project started by R-Core, Hewlett Packard Enterprise, and others that provides a fresh new set of computational primitives for distributed and parallel computing in R. The package sets the seed for what may become a standardized and easy way to write parallel algorithms in R, regardless of the computational engine of choice.

In designing **ddR**, we wanted to keep things simple and familiar. We expose only a small number of new user functions that are very close in semantics and API to their R counterparts. You can read the introductory material about the package here. In this post, we show how to use **ddR** functions.

**Classes dlist, darray, and dframe**: These classes are the distributed equivalents of list, matrix, and data.frame, respectively. Keeping their APIs similar to those for the vanilla R classes, we implemented operators and functions that work on these classes in the same way. The example below creates two distributed lists -- one of five 3s and one of the elements 1 through 5.

a <- dmapply(function(x) { x }, rep(3,5))

b <- dlist(1,2,3,4,5,nparts=1L)

The argument *nparts* specifies the number of partitions to split the resulting *dlist* **b** into. For *darrays* and *dframes*, which are two-dimensional, *nparts* also permits a two-element vector, which specifies the two-dimensional partitioning of the output.

**Functions dmapply and dlapply:** Following R’s functional-programming paradigm, we have created these two functions as the distributed equivalents of R’s *mapply* and *lapply*. One can supply any combination of distributed objects and regular R args into *dmapply*:

addThenSubtract <- function(x,y,z) { x + y - z }

c <- dmapply(addThenSubtract,a,b,MoreArgs=list(z=5))
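Because *dmapply* is modelled on *mapply*, the same computation can be checked locally without ddR. The sketch below is a base-R analogue only; the `_local` names are made up for illustration and nothing here is distributed.

```r
# Base-R analogue of the ddR snippet (no ddR required, nothing is distributed).
a_local <- as.list(rep(3, 5))      # five 3s, like the dlist `a`
b_local <- list(1, 2, 3, 4, 5)     # the elements 1 through 5, like `b`

addThenSubtract <- function(x, y, z) { x + y - z }

# MoreArgs supplies z once to every call; SIMPLIFY = FALSE keeps a list result,
# which is the closest local match to a dlist.
c_local <- mapply(addThenSubtract, a_local, b_local,
                  MoreArgs = list(z = 5), SIMPLIFY = FALSE)
str(c_local)
```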

**Functions parts and collect:** The *parts* construct gives users the ability to partition data in a manner that is very explicit. *parts* is often used in conjunction with *dmapply* to achieve partition-level parallelism. To fetch data, the *collect* function is used. So, if we wanted to check our result in **c** from our previous example, we may do:

    collect(c)
    ## [[1]]
    ## [1] -1
    ##
    ## [[2]]
    ## [1] 0
    ##
    ## [[3]]
    ## [1] 1
    ##
    ## [[4]]
    ## [1] 2
    ##
    ## [[5]]
    ## [1] 3

Backends can easily provide custom implementations of *dlist*, *darray*, and *dframe*, as well as for *dmapply*. At a minimum, backends define only a couple of new custom classes (extending ddR’s classes), as well as the definitions for a couple of generic functions, including *dmapply*.

With these definitions in place, ddR knows how to properly dispatch work to backends where behaviors differ, whilst taking care of the rest of the work -- since most of these other operations can be defined using just *dmapply*. For example, *colSums* should automatically work on any *darray* created by a backend that has defined *dmapply*!

**Putting it to Work: RandomForest written in ddR**

In addition to adding new backend drivers for **ddR** (e.g., for Spark), part of this initiative is to develop an initial suite of algorithms written in ddR, such that they are portable to all ddR backends. **randomForest.ddR** is one such algorithm that we have completed, now available on CRAN. ddR packages for K-Means and GLM (generalized linear models) are now also available.

Random Forest is an algorithm that can be parallelized in a very simple way by asking each worker to create a subset of the trees:

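    library(ddR)  # provides dmapply() and collect()

    simple_RF <- function(formula, data, ntree = 500, ..., nparts = 2)
    {
      # Worker function: fit one forest of `ntree` trees with the shared arguments.
      execute_randomForest_parallel <- function(ntree, formula, data, inputArgs)
      {
        inputArgs$formula <- formula
        inputArgs$data <- data
        inputArgs$ntree <- ntree
        suppressMessages(requireNamespace("randomForest"))
        model <- do.call(randomForest::randomForest, inputArgs)
      }

      # Give each of the nparts workers an equal share of the requested trees
      # (uses the ntree argument rather than a hard-coded 500).
      dmodel <- dmapply(execute_randomForest_parallel,
                        ntree = rep(ceiling(ntree/nparts), nparts),
                        MoreArgs = list(formula = formula,
                                        data = data, inputArgs = list(...)),
                        output.type = "dlist", nparts = nparts)

      # Merge the per-worker forests into a single model.
      model <- do.call(randomForest::combine, collect(dmodel))
    }

    model <- simple_RF(Species ~ ., iris)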

The main **dmapply** in the above code snippet simply broadcasts all the objects passed to the function to the workers and calls *randomForest* with the same parameters. An important point here is that even if ‘data’ is a distributed object, it will still be broadcast because it is listed in **MoreArgs**, which accepts a key-value list of either distributed objects or normal R objects.
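One detail of the `rep(ceiling(... / nparts), nparts)` pattern is that rounding up can request slightly more trees than were asked for when the total does not divide evenly. A small sketch of an exact split follows; `split_trees` is a hypothetical helper, not part of ddR or randomForest.ddR.

```r
# Hypothetical helper: divide `ntree` trees among `nparts` workers so that the
# per-worker counts differ by at most one and sum exactly to `ntree`.
split_trees <- function(ntree, nparts) {
  base  <- ntree %/% nparts          # trees every worker gets
  extra <- ntree %% nparts           # leftover trees to hand out one by one
  base + as.integer(seq_len(nparts) <= extra)
}

split_trees(500, 2)             # 250 250
split_trees(500, 3)             # 167 167 166
sum(rep(ceiling(500 / 3), 3))   # 501: the rounding-up scheme overshoots by one
```

For a forest of hundreds of trees the overshoot is harmless, so the simpler rounding-up version is a reasonable choice; the exact split matters mainly when the tree count has to be reproducible across different values of `nparts`.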

Here is a sample performance plot of running randomForest:

We tested the **randomForest.ddR** package on a medium sized dataset to measure speedup when increasing the number of cores. From the graph, it is clear that there is great improvement up to 4 cores, after which it starts to reach the point of diminishing returns. Since most computers these days have several cores, the **randomForest.ddR** package should be helpful for most people. On a single node you can use the **parallel** backend, which stops at 24 cores, corresponding to all the cores of the test machine. You can use

To read up a bit more on ddR and its semantics, visit our GitHub page here or read the user guide on CRAN.

by Andrie de Vries

Recently we had a question on the public mailing list for Revolution R Open (RRO), on the topic of "MKL multithreaded library and mclapply do not play well together".

If you're not familiar with these topics, here is a quick primer:

- The Intel MKL is a fast, multi-threaded math library. We bundle the MKL with RRO.
- The primary benefit of the MKL is that matrix algebra operations are much faster than using the math library that is bundled with R, e.g. more than 40x faster for matrix multiply.
- The function mclapply() in the parallel package is similar to lapply() but runs in parallel on operating systems that support forking (e.g. Linux, but not Windows).

Now, the question was posed as follows:

*After some testing, I have discovered that using mclapply on multiple cores with MKLthreads set to greater than 1 results in the threads sleeping and basically never finishing. Obviously, the temporary solution is to set MKLthreads to 1. But it would be nice if these functions worked together, because you cannot always guarantee that a package in R will not use mclapply while calling a MKL threaded math function, and there are situations where I would like to just use MKLthreads > 1 and not worry about it.*

Unpacking the question:

- The user is correctly using mclapply()
- He also knows how to control the number of threads used by the MKL, i.e. passing the desired number to setMKLthreads()
- The problem only occurs when setMKLthreads() specifies more than 1 thread, e.g. setMKLthreads(4).

To answer the question, I am going to refer to two sources: some information about the MKL benchmarks at MRAN, and a vignette of the doParallel package.

To illustrate this, take a look at some of the performance characteristics we publish at MRAN:

From this plot you can see:

- A big performance boost when using the MKL with just one thread
- A marginal increase when using 4 threads, most notable in matrix multiply, and no benefit for singular value decomposition

**Implication: if you want to only set a single value for the number of MKL threads, and never worry about code that does not run, use setMKLthreads(1).**

When you attempt to do parallel programming in R, you must be aware of the potential problems and pitfalls. These pitfalls extend to much more than this example of using the MKL.

The vignette of the doParallel package makes this explicit warning in paragraph 2, "A word of caution":

*Because the parallel package in multicore mode starts its workers using fork without doing a subsequent exec, it has some limitations. Some operations cannot be performed properly by forked processes. For example, connection objects very likely won’t work. In some cases, this could cause an object to become corrupted, and the R session to crash.*

**Implication: Unfortunately there are no silver bullets in parallel programming. Take care when setting up your code, in particular if you make use of parallel paradigms that include forking, e.g. mclapply().**

I reproduce the code used in the original question below. Notice that the last snippet will cause R to become unresponsive. To avoid this, use setMKLthreads(1).
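A minimal sketch of the safe pattern (this is not the code from the original question): pin the MKL to a single thread before any forked worker touches the math library. It assumes the RevoUtilsMath package that ships with RRO; on plain CRAN R that package is absent and the guard simply skips the call. The `cross_sum` task is a made-up stand-in for real work.

```r
library(parallel)

# Assumption: RevoUtilsMath (bundled with RRO) provides setMKLthreads().
# Without it there is no MKL thread pool to pin, so the step is skipped.
if (requireNamespace("RevoUtilsMath", quietly = TRUE)) {
  RevoUtilsMath::setMKLthreads(1)
}

# Forking is unavailable on Windows, so use a single worker there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Hypothetical task: a matrix product, i.e. work that the math library handles.
cross_sum <- function(i) {
  set.seed(i)
  m <- matrix(rnorm(100 * 100), 100, 100)
  sum(m %*% m)
}

results <- mclapply(1:4, cross_sum, mc.cores = n_cores)
length(results)
```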

By Andrie de Vries

Note by the editor after publication:

In the original post we neglected to give a shout out to Steve Weston, who continues to be the prime driver of new functionality for foreach, iterators and their backends.

The new progress bar functionality as described in this post is all the work of Steve Weston (StackOverflow profile).

Earlier this month Rich Calaway, programme manager at Microsoft and maintainer of the foreach package, published some updates to the foreach suite of packages, including:

Most of the changes were cosmetic, or to conform to CRAN policy. However, the last two packages (doParallel and doSNOW) had some functional changes.

The doSNOW package is a foreach parallel adaptor for the 'snow' package. Thus it provides a parallel backend for the %dopar% function using Luke Tierney's snow package. (The snow package itself enables a "Simple Network of Workstations", i.e. support for simple parallel computing in R.)

The functional change to doSNOW was the **addition of support for user-defined progress bars**.

This means that you can easily enable progress bars when setting up a parallel job with doSNOW.

You can try it out with this code:
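A minimal sketch of such a job, assuming doSNOW and its dependencies (foreach, snow) are installed. The `progress` entry passed through `.options.snow` is the hook described above; the worker count, iteration count and loop body are placeholders.

```r
# Assumes the doSNOW package (which attaches foreach and snow) is installed.
library(doSNOW)

cl <- makeCluster(2)          # two local worker processes
registerDoSNOW(cl)

iterations <- 50
pb <- txtProgressBar(max = iterations, style = 3)
progress <- function(n) setTxtProgressBar(pb, n)   # called as tasks complete
opts <- list(progress = progress)

result <- foreach(i = seq_len(iterations), .combine = c,
                  .options.snow = opts) %dopar% {
  Sys.sleep(0.05)             # stand-in for real work
  sqrt(i)
}

close(pb)
stopCluster(cl)
```

Any function of one argument can be used in place of `progress`, so the same hook works for other progress-bar styles.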

You can get other examples here.

The doParallel package provides a parallel backend for the %dopar% function using the parallel package (part of base R).

In doParallel, the most important change (change log) was a bug fix to stopImplicitCluster functionality, courtesy of Dan Tenenbaum.

We have previously written about the foreach package and parallel processing:

- Tutorial: Parallel programming with foreach
- Creating progress bars with foreach parallel processing
- Monitoring progress of a foreach parallel job

To get started with foreach, take a look at the vignette at https://cran.r-project.org/web/packages/foreach/vignettes/foreach.pdf

To get started with parallel programming with foreach and doParallel, the vignette is a great resource: https://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf

by Bill Jacobs, Director Technical Sales, Microsoft Advanced Analytics

In the course of working with our Hadoop users, we are often asked, what's the best way to integrate R with Hadoop?

The answer, in nearly all cases, is: it depends.

Alternatives range from open source R on workstations to parallelized commercial products like Revolution R Enterprise, with many steps in between. Between these extremes lies a range of options with unique abilities to scale data, performance, capability and ease of use.

And so, the right choice or choices depend on your data size, budget, skill, patience and governance limitations.

In this post, I’ll summarize the alternatives using pure open source R and some of their advantages. In a subsequent post, I’ll describe the options for achieving even greater scale, speed, stability and ease of development by combining open source and commercial technologies.

These two posts are written to help current R users who are novices at Hadoop understand and select solutions to evaluate.

As with most things open source, the first consideration is of course monetary. Isn’t it always? The good news is that there are multiple alternatives that are free, and additional capabilities under development in various open source projects.

We generally see four options for building R to Hadoop integration using entirely open source stacks.

This baseline approach’s greatest advantage is simplicity and cost. It’s free. End to end free. What else in life is?

Through packages Revolution contributed to open source including rhdfs and rhbase, R users can directly ingest data from both the hdfs file system and the hbase database subsystems in Hadoop. Both connectors are part of the RHadoop package created and maintained by Revolution and are a go-to choice.

Additional options exist as well. The RHive package executes Hive’s HQL SQL-like query language directly from R, and provides functions for retrieving metadata from Hive such as database names, table names, column names, etc.

The rhive package, in particular, has the advantage that its data operations allow some work to be pushed down into Hadoop, avoiding data movement and parallelizing operations for big speed increases. Similar “push-down” can be achieved with rhbase as well. However, neither is a particularly rich environment, and invariably, complex analytical problems will reveal some gaps in capability.

Beyond the somewhat limited push-down capabilities, R is best at working on modest data sampled from hdfs, hbase or hive, and in this way, current R users can get going with Hadoop quickly.

Once you tire of R’s memory barriers on your laptop the obvious next path is a shared server. With today’s technologies, you can equip a powerful server for only a few thousand dollars, and easily share it between a few users. Using Windows or Linux with 256GB or 512GB of RAM, R can be used to analyze files into the hundreds of gigabytes, albeit not as fast as perhaps you’d like.

Like option 1, R on a shared server can also leverage push-down capabilities of the rhbase and rhive packages to achieve parallelism and avoid data movement. However, as with workstations, the pushdown capabilities of rhive and rhbase are limited.

And of course, while lots of RAM keeps the dreaded out-of-memory exhaustion at bay, it does little for compute performance, and depends on sharing skills learned [or perhaps not learned] in kindergarten. For these reasons, consider a shared server to be a great add-on to R on workstations but not a complete substitute.

Replacing the CRAN download of R with the Revolution R Open (RRO) distribution enhances performance further. RRO is, like R itself, open source, 100% R, and free to download. It accelerates math computations using the Intel Math Kernel Libraries and is 100% compatible with the algorithms in CRAN and other repositories like BioConductor. No changes are required to R scripts, and the acceleration the MKL libraries offer varies from negligible to an order of magnitude for scripts making intensive use of certain math and linear algebra primitives. You can anticipate that RRO can double your average performance if you’re doing math operations in the language.

As with options 1 and 2, Revolution R Open can be used with connectors like rhdfs, and can connect and push work down into Hadoop through rhbase and rhive.

Once you find that your problem set is too big, or your patience is being taxed on a workstation or server and the limitations of rhbase and rhive push down are impeding progress, you’re ready for running R inside of Hadoop.

The open source RHadoop project that includes rhdfs, rhbase and plyrmr also includes a package rmr2 that enables R users to build Hadoop map and reduce operations using R functions. Using mappers, R functions are applied to all of the data blocks that compose an hdfs file, an hbase table or other data sets, and the results can be sent to a reducer, also an R function, for aggregation or analysis. All work is conducted inside of Hadoop but is built in R.
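The mapper and reducer roles can be illustrated in plain R without Hadoop. The sketch below is only a local analogue of the pattern, not rmr2 code: the `blocks`, `mapper` and `reducer` names are made up, and `lapply` stands in for the work Hadoop would run on each node.

```r
# Pretend each list element is one block of a larger file (hypothetical data).
blocks <- split(mtcars$mpg, rep(1:4, length.out = nrow(mtcars)))

# Mapper: runs on one block and emits a small partial summary.
mapper <- function(block) c(sum = sum(block), n = length(block))

# Reducer: combines partial summaries; here simple element-wise addition.
reducer <- function(acc, part) acc + part

partials <- lapply(blocks, mapper)   # in Hadoop these would run where the blocks live
totals   <- Reduce(reducer, partials)

totals[["sum"]] / totals[["n"]]      # overall mean, computed from the summaries only
```

Only the two-number summaries travel to the reducer, which is the point of the design: the raw blocks never have to be moved.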

Let’s be clear. Applying R functions on each hdfs file segment is a great way to accelerate computation. But for most, it is the avoidance of moving data that really accentuates performance. To do this, rmr2 applies R functions to the data residing on Hadoop nodes rather than moving the data to where R resides.

While rmr2 gives essentially unlimited capabilities, as a data scientist or statistician, your thoughts will soon turn to computing entire algorithms in R on large data sets. To use rmr2 in this way complicates development for the R programmer, because he or she must write the entire logic of the desired algorithm or adapt existing CRAN algorithms. She or he must then validate that the algorithm is accurate and reflects the expected mathematical result, and write code for the myriad corner cases such as missing data.

rmr2 requires coding on your part to manage parallelization. This may be trivial for data transformation operations, aggregates, etc., or quite tedious if you’re trying to train predictive models or build classifiers on large data.

While rmr2 can be more tedious than other approaches, it is not untenable, and most R programmers will find rmr2 much easier than resorting to Java-based development of Hadoop mappers and reducers. While somewhat tedious, it a) is fully open source, b) helps to parallelize computation to address larger data sets, c) skips painful data movement, d) is broadly used so you’ll find help available, and e) is free. Not bad.

rmr2 is not the only option in this category: a similar package called rhipe is also available and provides similar capabilities. rhipe is described here and here and is downloadable from GitHub.

The range of open source-based options for using R with Hadoop is expanding. The Apache Spark community, for example, is rapidly improving R integration via the predictably named SparkR. Today, SparkR provides access to Spark from R much as rmr2 and rhipe do for Hadoop MapReduce.

We expect that, in the future, the SparkR team will add support for Spark’s MLlib machine learning algorithm library, providing execution directly from R. Availability dates haven’t been widely published.

Perhaps the most exciting observation is that R has become “table stakes” for platform vendors. Our partners at Cloudera, Hortonworks, MapR and others, along with database vendors and others, are all keenly aware of the dominance of R among the large and growing data science community, and R’s importance as a means to extract insights and value from the burgeoning data repositories built atop Hadoop.

In a subsequent post, I’ll review the options for creating even greater performance, simplicity, portability and scale available to R users by expanding the scope from open source only solutions to those like Revolution R Enterprise for Hadoop.

On Monday, we compared the performance of several different ways of calculating a distance matrix in R. Now there's another method to add to the list: using GPU acceleration in R.

A GPU is a dedicated, high-performance chip available on many computers today. Unlike the CPU, it's not used for general computations, but rather for specialized tasks that benefit from a massively multi-threaded architecture. Video-game graphics is the usual target for GPUs, but in recent years they've been used for certain high-performance computing tasks as well. The problem is that GPUs require specialized programming, and because they have limited access to RAM, they're generally not well suited to tasks that require a lot of data throughput. But for simulations and other tasks that require a lot of computing on limited data, they can offer huge performance benefits.

The rpud package for R implements a few algorithms in R that will use a CUDA-compatible NVIDIA GPU for the computations. The algorithms include support vector machines, Bayesian classification, and hierarchical linear models. On the NVIDIA CUDA Zone blog, Gord Sissons tested the rpud package for hierarchical clustering, which involves calculating a distance matrix. Here's a comparison of the performance using regular R functions (blue) and with GPU-accelerated functions (orange):

Note the Y axis is on a log-10 scale: in most cases the GPU-based functions ran 10x faster than the standard CPU-based functions.

GPU programming doesn't help with everything, but if your problem happens to be one that has a GPU-based implementation, and you have the appropriate GPU hardware, the results can be dramatic. Check the link below for details of the tests, and how you can spin up a cloud-based GPU server to run them on.

Parallel Forall: GPU-Accelerated R in the Cloud with Teraproc Cluster-as-a-Service

When it comes to speeding up "embarrassingly parallel" computations (like for loops with many iterations), the R language offers a number of options:

- An R looping operator, like mapply (which runs in a single thread)
- A parallelized version of a looping operator, like mcmapply (which can use multiple cores)
- Explicit parallelization, via the parallel package or the ParallelR suite (which can use multiple cores, or distribute the problem across nodes in a cluster)
- Translating the loop to C++ using Rcpp (which runs as compiled and optimized machine code)
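The first two options can be compared on a toy pairwise-distance problem. This is a sketch under stated assumptions: the four coordinates are invented, and the `haversine` and `pair_dist` helpers are illustrative rather than taken from any package.

```r
library(parallel)

# Hypothetical coordinates (degrees) standing in for a table of locations.
airports <- data.frame(
  code = c("AAA", "BBB", "CCC", "DDD"),
  lat  = c(40.6, 33.9, 41.9, 29.9),
  lon  = c(-73.8, -118.4, -87.9, -95.3)
)

# Great-circle distance in km via the haversine formula (Earth radius ~6371 km).
haversine <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  h <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * 6371 * asin(sqrt(h))
}

pairs <- combn(nrow(airports), 2)   # every unordered pair of row indices
from <- pairs[1, ]
to   <- pairs[2, ]
pair_dist <- function(i, j) {
  haversine(airports$lat[i], airports$lon[i], airports$lat[j], airports$lon[j])
}

n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
d_serial   <- mapply(pair_dist, from, to)                        # single thread
d_parallel <- mcmapply(pair_dist, from, to, mc.cores = n_cores)  # forked workers

all.equal(d_serial, d_parallel)
```

With only six pairs the forked version will not be faster; the overhead of starting workers pays off only when each task, or the number of tasks, is large.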

Data scientist Tony Fischetti tried all of these methods and more attempting to find the distance between every pair of airports (a problem that grows polynomially in time as the number of airports increases, but which is embarrassingly parallel). Here's a chart comparing the time taken via various methods as the number of airports grows:

The clear winner is Rcpp, the orange line at the bottom of the chart. The line *looks* like it's flat, but while the time does increase as the problem gets larger, it's much *much* faster than all the other methods tested. Ironically, Rcpp doesn't use any parallelization at all and so doesn't benefit from the quad-processor system used for testing, but again: it's just that much faster.

Check out the blog post linked below for a detailed comparison of the methods used, and some good advice for using Rcpp effectively (pro-tip: code the whole loop, not just the body, with Rcpp).

On the lambda: Lessons learned in high-performance R

*by Andrew Ekstrom*

*Recovering physicist, applied mathematician and graduate student in applied Stats and systems engineering*

We know that R is a great system for performing statistical analysis. The price is quite nice too ;-) . As a graduate student, I need a cheap replacement for Matlab and/or Maple. Well, R can do that too. I’m running a large program that benefits from parallel processing. RRO 8.0.2 with the MKL works exceedingly well.

For a project I am working on, I need to generate a really large matrix (10,000x10,000) and raise it to really high powers (like 10^17). This is part of my effort to model chemical kinetics reactions, specifically polymers. I’m using a Markov Matrix of 5,000x5,000 and now 10,000x10,000 to simulate polymer chain growth at femtosecond timescales.

At the beginning of this winter semester, I used Maple 18 originally. I was running my program on a Windows 7 Pro computer using an Intel I7 – 3700K (3.5GHz) quad core processor with 32GB of DDR3 RAM. My full program took, well, WWWWWAAAAAAAYYYYYYYY TTTTTTTTOOOOOOOO LLLLLOOOONNNNGGGGGG!!!!!!!!

After a week, my computer would still be running. I also noticed that my computer would use 12% -13% of the processor power. With that in mind, I went to the local computer parts superstore and consulted with the sales staff. I ended up getting a “Gamer” rig when I purchased a new AMD FX9590 processor (4.7GHz on 8 cores) and dropped it into a new mobo. This new computer ran the same Maple program with slightly better results. It took 4-5 days to complete... assuming no one else used the computer and turned it off.

After searching for a better method (meaning better software) for running my program, I decided to try R. After looking around for a few hours, I was able to rewrite my program using R. YEAH! Using the basic R (version 3.1.2), my new program only took a few days (2-3). A nice feature of R is an improved BLAS and LAPACK and their implementation in R over Maple 18. Even though R 3.1.2 is faster than Maple 18, R only used 12%-13% of my processor.

Why do I keep bringing up the 12%-13% CPU usage? Well, it means that on my 8 core processor, only 1 core is doing all the work. (1/8 = 0.125) Imagine you go out and buy a new car. This car has a big V8 engine but, only 1 cylinder runs at a time. Even though you have 7 other cylinders in the car, they are NOT used. If that was your car, you would be furious. For a computer program, this is standard protocol. A cure for this type of silliness is to use parallel programming.

Unfortunately, I AM NOT A PROGRAMMER! I make things happen with a minimal amount of typing. I’m very likely to use “default settings” because I’m likely to mistype something and spend an hour trying to figure out, “Is that a colon or a semi colon?” So when I looked around at other websites discussing how to compile and/or install different BLAS and LAPACK for R, I started thinking, “I wish I was taking QED right now. (QED = Quantum Electro-Dynamics)” I also use Windows, and most of the websites I saw discussed doing this in Linux.

That led me to Revolution Analytics RRO. I installed RRO version 8.0.2 and the MKL available from here: http://mran.revolutionanalytics.com/download/#download

RRO uses Intel’s Math Kernel Library, which is updated and upgraded to run certain types of calculations in parallel. Yes, parallel processing in Windows, which is step one of HPC (High Performance Computing) and something many of my comp sci friends and faculty said was difficult to do.

A big part of my project is raising a matrix to a power. This is a highly parallelizable process. By that I mean, calculating element A(n,n) in the new matrix does not depend upon the value of A(x,x) in the new matrix. They only care about what is in the old matrix. Using the old style (series) computing, you calculate A(1,1), then A(1,2), A(1,3) … A(n,n). With parallel programming, on my 8 core AMD processor, I can calculate A(1,1), A(1,2), A(1,3) … A(1,8) at the same time. If these calculations were “perfectly parallel” I would get my results 8 times faster. For those of us that have read other blog posts on RevolutionAnalytics.com, you know that the speed boost for parallel programming is great, but not perfect. (Almost like it follows the laws of thermodynamics.) By using RRO, I was able to run my program in R and get results for all of my calculations in 6-8 hours. That got me thinking.
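The independence argument can be checked on a small matrix. In this sketch each row of the squared matrix is computed by a separate task from the old matrix alone; the 6x6 size and the `new_row` helper are illustrative, and this is not how the MKL schedules the work internally.

```r
library(parallel)

set.seed(1)
n <- 6
A <- matrix(runif(n * n), n, n)

# Row i of A %*% A depends only on the *old* matrix A, never on other new rows.
new_row <- function(i) as.vector(A[i, , drop = FALSE] %*% A)

# Forking is unavailable on Windows, so use a single worker there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
rows <- mclapply(seq_len(n), new_row, mc.cores = n_cores)

A2_parallel <- do.call(rbind, rows)   # stack the independently computed rows
all.equal(A2_parallel, A %*% A)
```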

If parallel processing on 8 cores instead of series processing on 1 core is a major step up, can I boost the parallel processing possibility? Yes. GPU processors like the Tesla and FirePro are nice and all but:

1) Using them with R requires programming and using Linux. Two things I don’t have time to do.

2) Entry level Tesla and good FirePro GPUs cost a lot of money. Something I don’t have a lot of right now.

The other option is using an Intel Phi coprocessor, or two. Fortunately, when I started looking, I could pick up a Phi coprocessor for cheap. Like $155 cheap for a brand new coprocessor from an authorized retailer. The video card in my computer cost more than my 2 Phi’s. The big issue is getting a motherboard that has the ability to handle the Phi’s. Phi coprocessors have 6+GB of RAM. Most mobo’s can’t handle more than 4GB of RAM through a PCI-E 3.0 slot. So, I bought a second mobo as a “hobby” project computer. This new mobo is intended for “workstations” and has 4 PCI-E 3.0 slots. That gives me enough room for a good video card and 2 Phi’s. This new Workstation PC has an Intel Xeon E5-2620V3 (2.4GHz 6-core, 12-thread) processor, 2 Intel Xeon Phi coprocessors 31S1P (57 cores with 4 threads per core at 1.1GHz per thread for a total of 456 threads) and 48GB DDR4 RAM.

The Intel Phi coprocessors work well with the Intel MKL. The same MKL RRO uses. Which means, if I use RRO with my Phi’s, after they are properly set up, I should be good to go….. Intel doesn’t make this easy. (I cobbled together the information from 6-7 different sources. Each source had a small piece of the puzzle.) The Phi’s are definitely not “Plug and Play”. I used MPSS version 3.4 for Windows 7. I downloaded the drivers from here:

https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss#wn34rel

I had to go into the command prompt and follow some of the directions available here. (Helpful hint, use micinfo to check your Phi coprocessors after step 9 in section 2.2.3 “Updating the Flash”.)

http://registrationcenter.intel.com/irc_nas/6252/readme-windows.pdf

After many emails to Revolution Analytics staff, I was able to get the Phi’s up and running! Now, my Phi’s work harmoniously with MKL. Most of the information I needed is available here. https://software.intel.com/sites/default/files/11MIC42_How_to_Use_MKL_Automatic_Offload_0.pdf

In the paper and website above, I needed to create some environment variables. The generic ones are:

MKL_MIC_ENABLE=1

OFFLOAD_DEVICES=*<list>*

MKL_MIC_MAX_MEMORY=2GB

MIC_ENV_PREFIX=MIC

MIC_OMP_NUM_THREADS=###

MIC_KMP_AFFINITY=balanced

Since I have 2 Phi coprocessors, my <list> is 0, 1. (At least this is the list that worked.) I set MKL_MIC_MAX_MEMORY to 8GB. (I have the RAM to do it, so why not.) MIC_OMP_NUM_THREADS = 456.
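A quick way to confirm from inside R that these settings are visible to the session is to read them back. This sketch only inspects the environment; it does not enable offload itself, and whether the MKL honours values changed after R has started is not something it checks.

```r
# Variable names taken from the list above; values are whatever the system has set.
mic_vars <- c("MKL_MIC_ENABLE", "OFFLOAD_DEVICES", "MKL_MIC_MAX_MEMORY",
              "MIC_ENV_PREFIX", "MIC_OMP_NUM_THREADS", "MIC_KMP_AFFINITY")

settings <- Sys.getenv(mic_vars, unset = NA)  # NA marks a variable that is not set
print(settings)

missing_vars <- names(settings)[is.na(settings)]
if (length(missing_vars) > 0) {
  message("Not set in this session: ", paste(missing_vars, collapse = ", "))
}
```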

Below is a sample program I used to benchmark Maple 2015, R and RRO on my Gamer computer and my Workstation. Between the time I started this project and now, Maple upgraded their program to Maple 2015. The big breakthrough is that Maple now does parallel processing. So, I ran the program below using Maple 2015 to see how it compares to R and RRO. (I uninstalled Maple 18 in anger.) I also ran the same program on my Workstation PC to see how well the Phi coprocessors worked. Once I had everything enabled, I didn’t want to disable anything. So, I just have the one, VERY IMPRESSIVE, time for my workstation.

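library(expm)                        # provides the %^% matrix power operator
options(digits=22)
a=10000                              # matrix dimension
b=0.000000001
c=matrix(0,a,a)
for ( i in 1:a){c[i,i] = 1-1.75*b}   # main diagonal
for ( i in 2:a){c[i-1,i] = b}        # superdiagonal
for ( i in 2:a){c[i,i-1] = 0.75*b}   # subdiagonal
c[1,1]=1-b
c[a,a]=1
c[a,a-1]=0
system.time(e <- c%^%100)            # time the 100th power and keep the result in e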
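For readers who want to try the idea without the expm package or a 10,000x10,000 matrix, here is a scaled-down sketch in base R. The function names build_transition and mat_pow are hypothetical, the matrix size of 200 is an assumption chosen so it runs in moments, and the repeated-squaring loop is a generic technique, not necessarily what %^% does internally.

```r
# Build the same banded matrix as in the benchmark, without explicit loops.
build_transition <- function(a, b) {
  m <- matrix(0, a, a)
  diag(m) <- 1 - 1.75 * b
  idx <- seq_len(a - 1)
  m[cbind(idx, idx + 1)] <- b          # superdiagonal
  m[cbind(idx + 1, idx)] <- 0.75 * b   # subdiagonal
  m[1, 1] <- 1 - b
  m[a, a] <- 1
  m[a, a - 1] <- 0
  m
}

# Integer matrix power by repeated squaring: about log2(k) squarings
# instead of k - 1 plain multiplications.
mat_pow <- function(m, k) {
  result <- diag(nrow(m))
  while (k > 0) {
    if (k %% 2 == 1) result <- result %*% m
    m <- m %*% m
    k <- k %/% 2
  }
  result
}

small <- build_transition(200, 1e-9)
timing <- system.time(p100 <- mat_pow(small, 100))
print(timing)
```

Every row of this matrix sums to 1, and that property survives taking powers, which gives a cheap sanity check on the result. Each %*% call is exactly the kind of operation a multi-threaded BLAS can spread across cores.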

By using RRO instead of R, I got my results 3.12 hours faster. Considering the fact that I have several dozen more calcs like this one, saving 3hrs per calc is wonderful ;-) By using RRO instead of Maple 2015, I saved about 41 mins. By using RRO with the Phi’s on my Workstation PC, I was done in 187.3s. I saved an additional 39 mins over my Gamer Computer! When I ran my full program, it took under an hour. Compared to the days/weeks for my smaller calculations, an hour is awesome!

An interesting note on the Intel MKL: it only uses physical cores, not hardware threads, on the main processor. I’m not sure how it handles the threads on the Phi coprocessors. So, my Intel Xeon processor only showed 50% usage.

Now, your big question is, “Why should I care?” I took a 10,000x10,000 matrix and raised it to unbelievably high powers. I used a brute force method to do it. Suppose that you are doing “Big Data” analysis and you have 30 columns by 2,000,000 rows. If you run a linear regression on that data, your software will use a pseudoinverse to calculate the coefficients of your regression. A part of the pseudoinverse involves multiplying your 30x2,000,000 matrix by a 2,000,000x30 matrix, and it’s all parallelizable! Squaring my matrix uses about 1.00x10^12 operations (assuming I have my Big O calculation correct). The matrix product in your pseudoinverse uses a mere 1.80x10^9 operations.
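The operation counts above can be checked with a few lines of arithmetic, and the matrix product in question can be seen in a small normal-equations fit. This is a sketch on simulated data (the sizes and coefficients are made up for illustration); note that R's own lm() solves the problem with a QR decomposition, so the normal-equations route here is only a way to show where the large product appears.

```r
# Rough multiply-add counts for the two problems discussed above.
ops_square <- 1e4^3           # squaring a 10,000 x 10,000 matrix: 1e12
ops_xtx    <- 30 * 30 * 2e6   # (30 x 2,000,000) %*% (2,000,000 x 30): 1.8e9

# Small simulated regression solved through the normal equations.
set.seed(1)
n <- 1000
p <- 5
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
beta_true <- c(2, -1, 0.5, 0, 3)
y <- drop(X %*% beta_true) + rnorm(n, sd = 0.1)

XtX <- crossprod(X)       # t(X) %*% X, the product a parallel BLAS can speed up
Xty <- crossprod(X, y)    # t(X) %*% y
beta_hat <- drop(solve(XtX, Xty))

# Reference fit using the QR-based routine behind lm().
beta_qr <- unname(lm.fit(X, y)$coefficients)
print(cbind(beta_hat, beta_qr))
```

The two coefficient columns agree to numerical precision on this well-conditioned example.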

Some of my friends who do these sorts of “Big Data” calculations using the series method built into basic R or SAS tell me that they take hours (1-2) to complete. With my workstation, I have the computational power of 17 servers that use my same Xeon processor. That calculation would take me way less than a minute.

Behold, the power of parallel processing!

Bay Area engineer Vineet Abraham recently ran some benchmarks for Revolution R Open (RRO) running on Mac OS X and on Ubuntu. Thanks to the multi-threaded processing capabilities of RRO, several operations ran much faster than R downloaded from CRAN, without having to change any code:

For the most part, RRO performs significantly faster than standard R both locally and on the server. RRO performs really well on the matrix operations as seen in column group mm (over 90% faster than standard R); this is probably due to the addition of the Intel Math Kernel library.

(In fact, while the Intel MKL is used on Ubuntu, on OS X the standard Accelerate Framework provides the multi-threading capability, with similar results.) As Vineet's benchmarks show, RRO doesn't improve things for every benchmark, but with some mathematically-intensive operations the difference can be dramatic.

On a related note, I've been doing some benchmarks on RRO 8.0.3 (based on R 3.1.3), due to be released very soon. On my 2-core Surface Pro (yes, it runs fine on a Surface), using the multi-threading reduced the computation for the Urbanek benchmarks from 32 seconds to 8 seconds.

Numbr Crunch: Benchmarking R/RRO in OSX and Ubuntu on the cloud

by Joseph Rickert

I have written several posts about the Parallel External Memory Algorithms (PEMAs) in Revolution Analytics’ RevoScaleR package, most recently about rxBTrees(), but I haven’t said much about rxExec(). rxExec() is not itself a PEMA, but it can be used to write parallel algorithms. Pre-built PEMAs such as rxBTrees(), rxLinMod(), etc. are inherently parallel algorithms designed for distributed computing on various kinds of clusters: HPC Server, Platform LSF and Hadoop, for example. rxExec()’s job, however, is to help ordinary, non-parallel functions run in parallel computing or distributed computing environments.

To get a handle on this, I think the best place to start is with R’s foreach() function, which enables an R programmer to write “coarse grain” parallel code. To be concrete, suppose we want to fit a logistic regression model to two different data sets and, to speed things up, we would like to do this in parallel. Since my laptop has two multi-threaded cores, this is a straightforward use case to prototype. The following code points to two of the multiple csv files that comprise the mortgageDefault data set available at Revolution Analytics’ data set download site.

#----------------------------------------------------------
# load needed libraries
#----------------------------------------------------------
library(foreach)
#----------------------------------------------------------
# Point to the Data
#----------------------------------------------------------
dataDir <- "C:\\DATA\\Mortgage Data\\mortDefault"
fileName1 <- "mortDefault2000.csv"
path1 <- file.path(dataDir,fileName1)
fileName2 <- "mortDefault2001.csv"
path2 <- file.path(dataDir,fileName2)
#----------------------------------------------------------
# Look at the first data file
system.time(data1 <- read.csv(path1))
#user system elapsed
#2.52 0.02 2.55
dim(data1)
head(data1,3)
#  creditScore houseAge yearsEmploy ccDebt year default
#1         615       10           5   2818 2000       0
#2         780       34           5   3575 2000       0
#3         735       12           1   3184 2000       0

Note that it takes about 2.5 seconds to read one of these files into a data frame.

The following function constructs the name and path of a data set from the parameters supplied to it, reads the data into a data frame and then uses R’s glm() function to fit a logistic regression model.

#-----------------------------------------------------------
# Function to read data and fit a logistic regression
#-----------------------------------------------------------
glmEx <- function(directory,fileStem,fileNum,formula){
  fileName <- paste(fileStem,fileNum,".csv",sep="")
  path <- file.path(directory,fileName)
  data <- read.csv(path)
  model <- glm(formula=formula,data=data,family=binomial(link="logit"))
  return(summary(model))}
form <- formula(default ~ creditScore + houseAge + yearsEmploy + ccDebt)

Something like this might be reasonable if you had a whole bunch of data sets in a directory. To process the two data sets in parallel we set up an internal cluster with 2 workers, register the parallel backend and run foreach() with the %dopar% operator.

#----------------------------------------------------------
# Coarse grain parallelism with foreach
#----------------------------------------------------------
library(doParallel) # provides registerDoParallel(); also attaches foreach and parallel
cl <- makePSOCKcluster(2) # Create copies of R running in parallel and communicating over sockets.
# My laptop has 2 multi threaded cores
registerDoParallel(cl) #register parallel backend
system.time(res <- foreach(num = c(2000,2001)) %dopar%
  glmEx(directory=dataDir,fileStem="mortDefault",fileNum=num,formula=form))
#user system elapsed
#5.34 1.99 43.54
stopCluster(cl)

The basic idea is that my two-core PC processes the two data sets in parallel. The whole thing runs pretty quickly: two logit models are fit on a million rows each in about 44 seconds.
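If you do not have the mortgage files at hand, the same coarse-grain pattern can be tried on simulated data. The sketch below swaps in parLapply() from the base parallel package in place of foreach, so it runs without extra packages; fitOne is a hypothetical stand-in for glmEx() that simulates a small data set instead of reading a csv file, and the sample size and coefficients are made up.

```r
library(parallel)

# Hypothetical stand-in for glmEx(): simulate data for a given seed,
# fit a logistic regression and return the coefficients.
fitOne <- function(seed, n = 2000) {
  set.seed(seed)
  x <- rnorm(n)
  y <- rbinom(n, 1, plogis(-1 + 2 * x))
  coef(glm(y ~ x, family = binomial(link = "logit")))
}

cl <- makePSOCKcluster(2)   # two worker R processes communicating over sockets
res <- tryCatch(
  parLapply(cl, c(2000, 2001), fitOne),   # one task per worker
  finally = stopCluster(cl)               # always shut the workers down
)
print(res)
```

Because each task sets its own seed, the parallel results match what lapply(c(2000, 2001), fitOne) would return serially. On tasks this small the cost of starting the workers outweighs the fitting time, so no speedup should be expected here.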

Now, the same process can be accomplished with rxExec() as follows:

#-----------------------------------------------------------
# Coarse grain parallelism with rxExec
#-----------------------------------------------------------
rxOptions(numCoresToUse=2)
rxSetComputeContext("localpar") # use the local parallel compute context
rxGetComputeContext()
argList2 <- list(list(fileNum=2000),list(fileNum=2001))
system.time(res <- rxExec(glmEx,directory=dataDir,fileStem="mortDefault",formula=form,elemArgs=argList2))
#user system elapsed
#4.85 2.01 45.54

First notice that rxExec() took about the same amount of time to run. This is not surprising since, under the hood, rxExec() looks a lot like foreach() (while providing additional functionality). Indeed, the same Revolution Analytics team worked on both functions.

You can also see that rxExec() looks a bit like an apply() family function in that it takes a function, in this case my sample function glmEx(), as one of its arguments. The elemArgs parameter takes a list of the arguments that differ between calls (here, the values used to construct the two file names), while the other arguments separated by commas in the call statement are parameters that are the same for both. With this tidy syntax we could direct the function to fit models to data sets that are located in very different locations and also set different parameters for each glm() call.
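To make the split between shared and per-call arguments concrete, here is a serial base R sketch of the same calling pattern. It is only an illustration of the argument semantics described above, not how rxExec() is implemented; the helper names run_with_elem_args and makeName are hypothetical.

```r
# Hypothetical serial analogue: call fun once per element of elemArgs,
# combining the shared arguments in ... with that element's own arguments.
run_with_elem_args <- function(fun, elemArgs, ...) {
  shared <- list(...)
  lapply(elemArgs, function(perCall) do.call(fun, c(shared, perCall)))
}

# Toy function that builds a file name the same way glmEx() does.
makeName <- function(fileStem, fileNum) paste(fileStem, fileNum, ".csv", sep = "")

argList2 <- list(list(fileNum = 2000), list(fileNum = 2001))
out <- run_with_elem_args(makeName, argList2, fileStem = "mortDefault")
print(out)
```

Here fileStem is passed once and reused, while fileNum changes from call to call, giving "mortDefault2000.csv" and "mortDefault2001.csv".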

The really big difference between foreach() and rxExec(), however, is the line

rxSetComputeContext("localpar")

which sets the compute context. This is the mechanism that links rxExec() and pre-built PEMAs to RevoScaleR’s underlying distributed computing architecture. Changing the compute context allows you to run the R function in the rxExec() call on a cluster. For example, in the simplest case where you can log into an edge node on a Hadoop cluster, the following code would enable rxExec() to run the glmEx() function on each node of the cluster.

myHadoopContext <- RxHadoopMR()

rxSetComputeContext(myHadoopContext)

In a more complicated scenario, for example where you are remotely connecting to the cluster, it will be necessary to include your credentials and some other parameters in the statement that specifies the compute context.

Finally, we can ratchet things up to a higher level of performance by using a PEMA in the rxExec() call. This would make sense in a scenario where you want to fit a different model on each node of a cluster while making sure that you are getting the maximum amount of parallel computation from all of the cores on each node. The following new version of the custom glm function uses the RevoScaleR PEMA rxLogit() to fit the logistic regressions:

#----------------------------------------------------------
# Finer parallelism with rxLogit
#----------------------------------------------------------
glmExRx <- function(directory,fileStem,fileNum,formula){
  fileName <- paste(fileStem,fileNum,".csv",sep="")
  path <- file.path(directory,fileName)
  data <- read.csv(path)
  model <- rxLogit(formula=formula,data=data)
  return(summary(model))}
argList2 <- list(list(fileNum=2000),list(fileNum=2001))
system.time(res <- rxExec(glmExRx,directory=dataDir,fileStem="mortDefault",formula=form,elemArgs=argList2))
#user system elapsed
#0.01 0.00 8.33

Here, still running just locally on my laptop, we see quite an improvement in performance. The computation runs in about 8.3 seconds. (Remember that over two seconds of this elapsed time is devoted to reading the data.) Some of this performance improvement comes from the additional, “finer grain” parallelism of the rxLogit() function. Most of the speedup, however, is likely due to careful handling of the underlying matrix computations.

In summary, rxExec() can be thought of as an extension of foreach() that is capable of leveraging all kinds of R functions in distributed computing environments.