If you'd like to manipulate and analyze very large data sets with the R language, one option is to use R and Apache Spark together. R provides the simple, data-oriented language for specifying transformations and models; Spark provides the storage and computation engine to process data sets much larger than R alone can handle.
At the KDD 2016 conference last August, a team from Microsoft presented a tutorial on Scalable R on Spark and made all of the materials available on GitHub. The materials include an 80-slide presentation covering several tutorials (you can download the 13MB PowerPoint file here).
Slides 1-29 form an introduction which covers:
- Scaling R scripts on a single machine with the bigmemory and ff packages
- Interfacing to Spark from R with SparkR 1.6
- Installing and using the sparklyr package
- Using Microsoft R Server and the "RevoScaleR" package to offload computations to Spark
- Comparisons and benchmarks of the techniques to scale R described above
Slides 32-44 form a hands-on tutorial working with the airline arrival data to predict flight delays. In the tutorial, you use SparkR to clean and join the data, fit a decision-tree model to predict delays with R Server's "rxDTree" function, and then publish a prediction function to Azure with the AzureML package to create a cloud-based flight-delay prediction service. The Microsoft R scripts are available here.
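The modeling step of that workflow can be sketched on a single machine. Below is a minimal, hypothetical stand-in using base R's glm() on synthetic data — the column names and data are invented for illustration, not the tutorial's actual schema, and logistic regression stands in for the cluster-scale rxDTree call:

```r
# Hedged sketch: single-machine stand-in for the flight-delay model.
# All data and column names below are synthetic, not the tutorial's.
set.seed(1)
n <- 500
flights <- data.frame(
  dep_hour = sample(0:23, n, replace = TRUE),
  distance = runif(n, 100, 2500),
  carrier  = factor(sample(c("AA", "DL", "UA"), n, replace = TRUE))
)
# Synthetic label: later departures are more often delayed.
flights$delayed <- rbinom(n, 1, plogis(-2 + 0.1 * flights$dep_hour))

# Logistic regression as a stand-in for the tutorial's tree model.
fit <- glm(delayed ~ dep_hour + distance + carrier,
           data = flights, family = binomial)

# Predicted delay probability for a hypothetical late-evening flight.
newdata <- data.frame(dep_hour = 22, distance = 800, carrier = "DL")
p <- predict(fit, newdata, type = "response")
```

In the tutorial itself the same fit-then-predict pattern runs against cluster-resident data, and the prediction function is what gets published to Azure.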
Slides 46-50 form another tutorial, this time working with the NYC Taxi dataset. The first tutorial script uses the sparklyr package to visualize the data and create models to predict the tip amount. The second tutorial script goes further with models, fitting Elastic Net, Random Forest, and Gradient Boosted Tree models with both SparkR and sparklyr. In addition, this script uses SparkR and SparkSQL to create a map of the trips.
Slides 51-59 demonstrate tuning a time series forecasting model by searching over a large parameter space with the hts package. By running the models in parallel to minimize the MAPE (mean absolute percent error), the total execution time was reduced to 1 day, compared to the 40 days the computations would take serially. The parallelization was achieved with the Microsoft R Server "rxExec" function, which you can replicate with the script available here.
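The idea behind that speedup can be sketched in base R with the parallel package standing in for rxExec. Everything below — the toy series, the one-parameter smoothing "model", and the grid — is invented for illustration; rxExec plays the same fan-out role at cluster scale:

```r
# Hedged sketch: a parallel parameter sweep minimizing MAPE,
# with base R's parallel package standing in for rxExec.
library(parallel)

# Mean absolute percent error.
mape <- function(actual, predicted) {
  mean(abs((actual - predicted) / actual)) * 100
}

# Toy monthly series; the "model" is simple exponential smoothing
# with a single smoothing parameter alpha.
y <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118)

fit_and_score <- function(alpha) {
  # One-step-ahead forecasts: level after y[1..t] predicts y[t+1].
  levels <- Reduce(function(prev, obs) alpha * obs + (1 - alpha) * prev,
                   y[2:(length(y) - 1)], init = y[1], accumulate = TRUE)
  mape(y[-1], levels)
}

grid <- seq(0.1, 0.9, by = 0.1)

# Score every candidate in parallel across worker processes.
cl <- makeCluster(2)
clusterExport(cl, c("y", "mape"))
scores <- parSapply(cl, grid, fit_and_score)
stopCluster(cl)

best <- grid[which.min(scores)]
```

With a grid of thousands of candidate models rather than nine, fanning the same function out across a Spark cluster is what turns 40 serial days into one.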
Slides 60-72 include background on calculating learning curves for various predictive models (see here and here for more information). There's also a tutorial: the R scripts are available here.
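As a minimal sketch of the learning-curve idea, independent of the tutorial's own scripts: fit the same model on increasing training-set sizes and track held-out error (synthetic data, base R only):

```r
# Hedged sketch: a learning curve on synthetic data.
# Train the same model on growing subsets; record test RMSE.
set.seed(42)
n <- 2000
x <- runif(n)
y <- 3 * x + rnorm(n, sd = 0.5)

train_idx <- 1:1000
test <- data.frame(x = x[-train_idx], y = y[-train_idx])

sizes <- c(50, 100, 200, 400, 800)
test_rmse <- sapply(sizes, function(m) {
  train <- data.frame(x = x[train_idx[1:m]], y = y[train_idx[1:m]])
  fit <- lm(y ~ x, data = train)
  sqrt(mean((test$y - predict(fit, test))^2))
})
# Plotting test_rmse against sizes shows the curve flattening as
# the model stops benefiting from additional training data.
```

On a cluster, the per-size fits are independent, so they parallelize the same way as the parameter sweep above.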
To work through the materials from the tutorial, you'll need access to a Spark cluster configured with Microsoft R Server and the necessary scripts and data files. You can easily create an HDInsight Premium cluster including Microsoft R Server on Microsoft Azure: these instructions provide the details. Once the cluster is ready, you can remotely access it from your desktop using ssh as described. The clusters are charged by the hour (according to their size and power), so be sure to shut down the cluster when you're done with the tutorial.
Hopefully these tutorials are useful to anyone learning to use R with Spark. The full collection of materials and slides is available at the GitHub repository below.
GitHub (Azure): KDD 2016 tutorial: Scalable Data Science with R, from Single Nodes to Spark Clusters
Why does it list that SparkR/sparklyr doesn't connect to Hadoop? It most certainly does.
Posted by: vincent | October 14, 2016 at 23:21
Good morning,
I am currently using R server on HDinsight in Azure.
I'm asking for your help since I cannot install sparklyr. When executing devtools::install_github("rstudio/sparklyr") I get the following error message:
ERROR: dependencies ‘tibble’, ‘rprojroot’ are not available for package ‘sparklyr’
* removing ‘/home/etignone/R/x86_64-pc-linux-gnu-library/3.2/sparklyr’
Unfortunately I cannot install ‘tibble’ and ‘rprojroot’. Do you have any suggestions?
Thank you very much,
Edoardo Tignone
Posted by: Edoardo Tignone | October 19, 2016 at 04:22
Hi Edoardo,
Did you try the installation R script that we published for this purpose? Sparklyr does have some dependencies (including the ones you mention), which we address in our script. The link to the script, which is to be run on the edge node of an HDInsight (Premium) cluster with Microsoft R Server, is given below.
Thank you.
Debraj GuhaThakurta
https://github.com/Azure/Azure-MachineLearning-DataScience/blob/master/Misc/KDDCup2016/Scripts/RunningScriptActions/github_installs.R
Posted by: Debraj GuhaThakurta | October 19, 2016 at 08:51
Vincent;
The "Hadoop" column on slide 29 would have been better labelled "Hadoop MapReduce"; it does not mean the whole Hadoop ecosystem. SparkR and sparklyr run on Hadoop clusters via Spark. RevoScaleR supports both Spark and the earlier Hadoop MapReduce.
Posted by: Robert Horton | October 19, 2016 at 13:10