by Joseph Rickert
One of the most interesting R related presentations at last week’s Strata Hadoop World Conference in New York City was the session on Distributed R by Sunil Venkayala and Indrajit Roy, both of HP Labs. In short, Distributed R is an open source project with the end goal of running R code in parallel on data that is distributed across multiple machines. The following figure conveys the general idea.
A master node controls multiple worker nodes each of which runs multiple R processes in parallel.
As I understand it, the primary use case for the Distributed R software is to move data quickly from a database into distributed data structures that can be accessed by multiple, independent R instances for coordinated, parallel computation. The Distributed R infrastructure automatically takes care of the extraction of the data and the coordination of the calculations, including the occasional movement of data from a worker node to the master node when required by the calculation. The user interface to the Distributed R mechanism is through R functions that have been designed and optimized to work with the distributed data structures, and through a special “Distributed R aware” foreach() function that allow users to write their own distributed functions using ordinary R functions.
To make all of this happen, Distributed R platform contains several components that may be briefly described as follows:
The distributed R package contains:
- functions to set up the infrastructure for the distributed platform
- distributed data structures that are the analogues of R’s data frames, arrays and lists, and
- the functions foreach() and splits() to let users write their own parallel algorithms.
A really nice feature of the distributed data structures is that they can be populated and accessed by rows, columns and blocks making it possible to write efficient algorithms tuned to the structure of particular data sets. For example, data cleaning for wide data sets (many more columns than rows) can be facilitated by preprocessing individual features.
vRODBC is an ODBC client that provides R with database connectivity. This is the connection mechanism that permits the parallel loading of data from various sources data including HP’s Vertica database.
The HPdata package contains the functions that allow you to actually load distributed data structures from various data sources
The HPDGLM package implements a parallel, distributed GLM models (Presently only linear regression, logistic regression and Poisson regression models are available), The package also contains functions for cross validation and split-sample validation.
The HPdclassifier package is intended to contain several distributed classification algorithms. It currently contains a parallel distributed implementation of the random forests algorithm.
The HPdcluster package contains a parallel, distributed kmeans algorithm.
The HPdgraph package is intended to contain distributed algorithms for graph analytics. It currently contains a parallel, distributed implementation of the pagerank algorithm for directed graphs.
The following sample code, taken directly from the HPdclassifier User Guide, but modified slightly for presentation here, is similar to the examples that Venkayala and Roy showed in their presentation. Note, that after the distributed arrays are set up they are loaded in parallel with data using the the foreach function from the distributedR package.
library(HPdclassifier) # loading the library
Loading required package: distributedR
Loading required package: Rcpp
Loading required package: RInside
Loading required package: randomForest
distributedR_start() # starting the distributed environment
Workers registered - 1/1.
All 1 workers are registered.
[1] TRUE
nparts <- sum(ds$Inst) # number of available distributed instances
# Describe the data
nSamples <- 100 # number of samples
nAttributes <- 5 # number of attributes of each sample
nSpllits <- 1 # number of splits in each darray
# Create the distributed arrays
dax <- darray(c(nSamples,nAttributes),c(round(nSamples/nSpllits),nAttributes))
day <- darray(c(nSamples,1), c(round(nSamples/nSpllits),1))
# Load the distributed arrays
foreach(i, 1:npartitions(dax),
function(x=splits(dax,i),y=splits(day,i),id=i){
x <- matrix(runif(nrow(x)*ncol(x)), nrow(x),ncol(x))
y <- matrix(runif(nrow(y)), nrow(y), 1)
update(x)
update(y)
})
# Fit the Random Forest Model
myrf <- hpdrandomForest(dax, day, nExecutor=nparts)
# prediction
dp <- predictHPdRF(myrf, dax)
Notwithstanding all of its capabilities, Distributed R is still clearly work in progress. It is only available on Linux platforms. Algorithms and data must be resident in memory. Distributed R is not available on CRAN, and even with an excellent Installation Guide, installing the platform is a bit of an involved process.
Nevertheless, Distributed R is impressive, and I think a valuable contribution to open source R. I expect that users with distributed data will find the platform to be a viable way to begin high performance computing with R.
Note that the Distributed R project discussed in this post is an HP initiative and is not in anyway related to http://www.revolutionanalytics.com/revolution-r-enterprise-distributedr.
How does this package relate/compare to packages such as "RHIPE" or "RHadoop"? Does distributedR operate in a fundamentally different way?
Thanks
Posted by: Dan | October 24, 2014 at 13:03