The RHadoop packages make it easy to connect R to Hadoop data (rhdfs), and write map-reduce operations in the R language (rmr2) to process that data using the power of the nodes in a Hadoop cluster. But getting the Hadoop cluster configured, with R and all the necessary packages installed on each node, hasn't always been so easy.
But now with HDInsight, Microsoft's Apache Hadoop-in-the-cloud service, it's much easier. As you configure your Hadoop cluster, you now have the option of installing R and RHadoop as part of the setup process. It's simply a matter of setting an option to run a pre-prepared script on the cluster nodes, and complete instructions are provided for Linux-based and Windows-based Hadoop clusters.
With the cluster thus configured, you can then use simple R commands to create data in HDFS, and use the mapreduce function from rmr2 to perform calculations on that data using any R function.
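Here's a minimal sketch of what such a toy example might look like. It assumes rmr2 is installed and that the cluster setup exports the HADOOP_CMD environment variable; the fallback to rmr2's "local" backend is my own addition so the code can also be tried off-cluster, and the variable names are purely illustrative.

```r
library(rmr2)

# Assumption: if no Hadoop command is configured, use rmr2's in-process
# "local" backend so the sketch still runs without a cluster.
if (Sys.getenv("HADOOP_CMD") == "") rmr.options(backend = "local")

# Write the integers 1 to 100 to the distributed file system.
ints <- to.dfs(1:100)

# Map-only job: emit each value alongside its square.
squares <- mapreduce(
  input = ints,
  map = function(k, v) cbind(v, v^2)
)

# Read the result back into the R session and look at the first rows.
result <- from.dfs(squares)
head(result$val)
```

The map function here is ordinary R code, so you could swap in any other vectorized calculation in its place.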
The script also installs a collection of R packages that will be useful for your mapreduce calls: rJava, Rcpp, RJSONIO, bitops, digest, functional, reshape2, stringr, plyr, caTools, and stringdist. And of course you can modify the setup script to install any other packages or tools you need on the nodes.
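If you'd like to confirm that a node ended up with those packages, a quick check along the following lines might help. This is only a sketch using base R; the package list is copied from the paragraph above, and the variable names are made up for illustration.

```r
# Packages the setup script is described as installing alongside RHadoop.
expected <- c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional",
              "reshape2", "stringr", "plyr", "caTools", "stringdist")

# Compare against what this R installation can actually see.
installed <- rownames(installed.packages())
missing_pkgs <- setdiff(expected, installed)

if (length(missing_pkgs) == 0) {
  message("All expected packages are installed.")
} else {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}
```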
HDInsight is available with your Microsoft Azure subscription, or you can try HDInsight for free with a one-month trial of Azure. If you're new to HDInsight, you might also want to check out these tutorials on getting started with Linux and Windows Hadoop clusters.
Microsoft HDInsight: Install and use R on Linux and Windows HDInsight Hadoop clusters
The r-installer-v01.sh script linked to in the article incorrectly points to RStudio's CRAN mirror instead of Revolution Analytics' on line 13 of the script.
Posted by: Jeremy Jackson | September 25, 2015 at 13:08