by Joseph Rickert
In the last week of January, HP Labs in Palo Alto hosted a workshop on distributed computing in R, organized by Indrajit Roy (Principal Researcher, HP) and Michael Lawrence (Genentech and R-core member). The goal was to bring together a small group of R developers with significant experience in parallel and distributed computing to discuss the feasibility of designing a standardized R API for distributed computing. The workshop (which would have been noteworthy if only because five R-core members attended) took the necessary, level-setting first step toward progress by getting this influential group to review most of the R distributed computing schemes currently in play.
The technical presentations were organized into three sections: (1) Experience with MPI-like backends, (2) Beyond embarrassingly parallel computations, and (3) Embrace disks, thread parallelism, and more. They spanned an impressive range of topics, from speeding up computations with parallel C++ code on garden-variety hardware to writing parallel R code that will run on different supercomputer architectures.
Luke Tierney, who has probably been doing parallel computing with R longer than anyone, led off the technical presentations by providing some historical context and describing a number of the trade-offs between “explicit” parallelization (as in snow and parallel) and the “implicit” parallelization of vectorized operations (as in pnmath). The last slide in Luke’s presentation is a short list of what might be useful additions to R.
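As a point of reference for the explicit style (this is not code from Luke’s slides), a minimal sketch with the parallel package, which absorbed snow, looks like the following; the cluster size and the toy computation are arbitrary choices for illustration.

```r
# Explicit parallelism: the programmer creates the workers and decides
# what runs where. Implicit schemes such as pnmath instead parallelize
# vectorized math with no change to user code.
library(parallel)

cl <- makeCluster(4)                          # four local worker processes
squares <- parLapply(cl, 1:8, function(i) i^2)
stopCluster(cl)

unlist(squares)
```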
The next three presentations in the first section described the use of MPI and snow-like parallelism in high performance computing applications. Martin Morgan’s presentation explains and critiques the use of the Bioconductor BiocParallel package in processing high-throughput genomic data. Junji Nakano and George Ostrouchov both describe supercomputer applications. Junji’s talk shows examples of using the functions in the Rhpc package and presents some NUMA distance measurements. George’s talk, an overview of high performance computing with R and the pbdR package, also provides historical context before working through some examples. The following slide depicts his super-condensed take on parallel computing.
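To give a flavor of the snow-like interface that BiocParallel exposes (Martin’s genomic examples are not reproduced here), a call might look like the sketch below; the choice of backend and the toy function are illustrative assumptions.

```r
# BiocParallel: the same bplapply() call can be dispatched to different
# backends (SnowParam, MulticoreParam, ...) by swapping the BPPARAM argument.
library(BiocParallel)

param <- SnowParam(workers = 2)
res <- bplapply(1:4, function(i) sqrt(i), BPPARAM = param)
unlist(res)
```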
All of these first session talks deal with some pretty low level details of parallel computing. Dirk Eddelbuettel’s talk from the third session continues in this vein, discussing thread management issues in the context of C++ and Rcpp. During the course of his talk Dirk compares thread management using “raw” threads, OpenMP and Intel’s TBB library.
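To make the thread-management discussion concrete, here is a hedged sketch of the OpenMP route from R via Rcpp; it assumes a toolchain with OpenMP support and Rcpp’s openmp plugin, and the function is just a toy (raw threads or Intel’s TBB, e.g. through the RcppParallel package, are the alternatives Dirk compares).

```r
# Sketch: a parallel reduction over a numeric vector using an OpenMP
# pragma compiled on the fly with Rcpp.
library(Rcpp)

cppFunction(plugins = "openmp", code = '
  double parSum(NumericVector x) {
    int n = x.size();
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; ++i) total += x[i];
    return total;
  }
')

parSum(as.numeric(1:1e6))
```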
The remaining talks delivered at the workshop dealt mostly with different approaches to implementing parallel external memory algorithms. These algorithms apply to applications where datasets are too large to be processed in memory all at one time. Parallelism comes into play either to improve performance or because the algorithms must be implemented across distributed systems such as Hadoop.
Michael Kane’s talk provides an overview of bigmemory, one of the first “chunk” computing schemes to be implemented in R. He describes the role played by mmap, shows how later packages have built on it and offers some insight into its design and features that continue to make it useful.
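As a hedged illustration of the style of chunk computing that bigmemory enables (not code from Mike’s slides), a file-backed matrix can be shared across R processes and processed a block of rows at a time; the file names and dimensions here are arbitrary.

```r
# A file-backed big.matrix lives on disk via mmap, so it can be larger
# than RAM and can be attached by several R processes at once.
library(bigmemory)

x <- filebacked.big.matrix(nrow = 1e6, ncol = 3, type = "double",
                           backingfile = "x.bin",
                           descriptorfile = "x.desc")
x[, 1] <- rnorm(1e6)

# Chunk-wise processing: touch only a block of rows at a time.
chunk_means <- sapply(seq(1, nrow(x), by = 1e5), function(start) {
  rows <- start:min(start + 1e5 - 1, nrow(x))
  mean(x[rows, 1])
})
```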
RHIPE was probably the first practical system for using R with Hadoop. Saptarshi Guha’s talk presents several programming paradigms popular with RHIPE users. Ryan Hafen’s talk describes how MapReduce supports the divide and recombine paradigm underlying the Tessera project and shows how all of this builds on the RHIPE infrastructure. He outlines the architecture of distributed data objects and distributed data frames and provides examples of functions from the datadr package.
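A small, hedged sketch of the divide-and-recombine style that datadr exposes, run locally on a toy data frame (on a Tessera cluster the same code would be backed by RHIPE and Hadoop); the call pattern follows the package tutorials and should be treated as an approximation.

```r
# Divide and recombine with datadr: split a data frame into per-group
# subsets, transform each subset, then recombine the pieces.
library(datadr)

bySpecies <- divide(iris, by = "Species")                # divide
sepalMeans <- addTransform(bySpecies, function(x)        # per-subset work
  data.frame(meanSepalLength = mean(x$Sepal.Length)))
recombine(sepalMeans, combine = combRbind)               # recombine
```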
Mario Inchiosa’s presentation describes RevoPemaR, an experimental R package for writing parallel external memory algorithms based on R’s system of Reference Classes. The class methods initialize(), processData(), updateResults() and processResults() provide a standardized template for developing external memory algorithms. Although RevoPemaR has been released under the Apache 2.0 license, it is currently only available with Revolution R Enterprise and depends on the RevoScaleR package for chunk processing and distributed computing.
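The general shape of such a class, sketched here with base R Reference Classes rather than RevoPemaR’s own class generator, might look like the following; the fields and method bodies are illustrative assumptions that compute a mean over chunks.

```r
# The parallel external memory template: each worker calls processData()
# on its chunks, partial results are merged with updateResults(), and
# processResults() turns the accumulated state into the final answer.
ChunkMean <- setRefClass("ChunkMean",
  fields = list(total = "numeric", count = "numeric"),
  methods = list(
    initialize = function(...) {
      total <<- 0; count <<- 0
      callSuper(...)
    },
    processData = function(chunk) {
      total <<- total + sum(chunk)
      count <<- count + length(chunk)
    },
    updateResults = function(other) {
      total <<- total + other$total
      count <<- count + other$count
    },
    processResults = function() total / count
  )
)

m <- ChunkMean$new()
for (chunk in split(rnorm(1e4), rep(1:10, each = 1000))) m$processData(chunk)
m$processResults()
```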
Indrajit Roy’s presentation describes how the distributed data structures in the distributedR package extend existing R data structures and enable R programmers to write code capable of running on any number of underlying parallel and distributed hardware platforms. They can do this without having to worry about the location of the data or about performing task and data scheduling operations.
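A hedged sketch of what this looks like to the programmer, modeled on the darray/foreach examples in the distributedR documentation; the argument names below are recalled from that documentation and should be treated as assumptions.

```r
# Sketch: create a distributed array partitioned into blocks, run a
# function on each partition in parallel, and pull one partition back.
library(distributedR)

distributedR_start()                              # launch local workers

A <- darray(dim = c(9, 9), blocks = c(3, 3), data = 0)  # 3x3 grid of partitions
foreach(i, 1:npartitions(A), function(a = splits(A, i)) {
  a <- matrix(i, nrow = nrow(a), ncol = ncol(a))
  update(a)                                       # write the partition back
})

getpartition(A, 1)                                # fetch one partition locally

distributedR_shutdown()
```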
Simon Urbanek’s presentation covers two relatively new projects that grew out of a need to work with very large data sets: iotools, a collection of highly efficient, chunk-wise functions for processing I/O streams, and ROctopus, a way to use R containers as in-memory compute elements. iotools achieves parallelism through a split, compute and combine strategy, with every job going through at least these three stages. One notable advantage is that the iotools functions are very efficient while using native R syntax. Here is the sample code for aggregating point locations by ZIP code:
r <- read.table(open(hmr(
    hinput("/data/2014/08"),
    function(x) table(zcta2010.db()[inside(zcta2010.shp(), x[, 4], x[, 5]), 1]),
    function(x) ctapply(as.numeric(x), names(x), sum)
)))
ROctopus is still at the experimental stage. When completed, it will allow running sophisticated algorithms such as GLMs and the LASSO on R objects containing massive data sets, with no movement of data and no need for data conversion after loading.
Simon finishes up with lessons learned from both projects, which should prove influential in guiding future work.
Michael Sannella takes a slightly different approach than the others in his presentation. After an exceptionally quick introduction to Hadoop and Spark, he examines some features of the SparkR interface with regard to the impact they would have on any distributed R system and makes some concrete suggestions. The issues Michael identifies, illustrated with a short sketch after the list, include:
- Hiding vs exposing distributed operations
- Sending auxiliary data to workers
- Loading packages on workers
- Developing/testing code on distributed R processes
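These concerns are not unique to SparkR. As a point of comparison, the sketch below shows how the base parallel package handles two of them, shipping auxiliary data to workers and loading a package on each worker; any standardized distributed API would need some answer to the same questions.

```r
# Two of the issues above, as handled by the parallel package.
library(parallel)

cl <- makeCluster(2)
threshold <- 2.5                               # auxiliary data on the master

clusterExport(cl, "threshold")                 # send auxiliary data to workers
clusterEvalQ(cl, library(MASS))                # load a package on every worker

res <- parSapply(cl, 1:5, function(i) i > threshold)
stopCluster(cl)
res
```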
The various presentations make it clear that there are at least three kinds of parallel / distributed computing problems that need to be addressed for R:
- Massive, high-performance computing that may involve little or no data
- Single machine or small cluster parallelism where the major problem is that the data are too large to fit into the memory of a single machine. Here, parallel code is a kind of bonus. As long as the data is going to be broken up into chunks it is natural to think about processing the chunks in parallel. Chunk-wise computing, however, can work perfectly well without parallelism on relatively small data.
- Distributed computing on massive data sets that are distributed across a particular underlying architecture, such as Hadoop. Here parallelism is inherent in the structure of the platform, and the kinds of parallel computing that can be done may be constrained by the architecture.
The next step in this ambitious project is to undertake the difficult work of evaluation, selection and synthesis. It is hoped that, at the very least, the work will lead to a consensus on the requirements for distributed computing. The working group expects to produce a white paper in six months or so that will be suitable for circulation. Stay tuned, and please feel free to provide constructive feedback.