by Joseph Rickert
Apache Spark, the open-source cluster computing framework originally developed in the AMPLab at UC Berkeley and now championed by Databricks, is rapidly moving from the bleeding edge of data science to the mainstream. Interest in Spark, demand for training and overall hype are on a trajectory to match the frenzy surrounding Hadoop in recent years. Next month's Strata + Hadoop World conference, for example, will offer three serious Spark training sessions (Apache Spark Advanced Training, SparkCamp and Spark developer certification), with additional Spark-related talks on the schedule. It is only a matter of time before Spark becomes a big deal in the R world as well.
If you don't know much about Spark but want to learn more, a good place to start is the recently posted video of Reza Zadeh's keynote talk at the ACM Data Science Camp, held last October at eBay in San Jose.
Reza is a gifted speaker, an expert on the subject matter and adept at selecting and articulating the key points that can carry an audience towards comprehension. He starts slowly, beginning with the block diagram of the Spark architecture, and spends some time emphasizing RDDs (Resilient Distributed Datasets) as the key feature that enables Spark's impressive performance and that both defines and circumscribes its capabilities.
After the preliminaries, Reza takes the audience on a deep dive into three algorithms in MLlib, Spark's machine learning library: logistic regression by gradient descent, PageRank and singular value decomposition (SVD). He then moves on to discuss some of the new features in Spark release 1.2.0, including All Pairs Similarity.
Reza's discussion of Spark's SVD implementation is a gem of a tutorial on computational linear algebra. The SVD algorithm considers two cases: the "Tall and Skinny" situation, where there are fewer than about 1,000 columns, and the "roughly square" case, where the numbers of rows and columns are about the same. I found it comforting to learn that the code for this latter case is based on highly reliable and "immensely optimized" Fortran 77 code. (Some computational problems get solved and stay solved.)
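To make the "Tall and Skinny" idea concrete, here is a minimal plain-Python sketch. It assumes (this is my reading, not something spelled out above) that the trick is to reduce the big matrix A to the small Gramian A^T A by summing one outer product per row, after which the remaining work fits on a single machine. The function names are hypothetical, and the power iteration used to find the largest singular value is only a stand-in for whatever local eigensolver a real implementation would use.

```python
import math


def gramian(rows, n_cols):
    """Accumulate A^T A as a sum of per-row outer products.

    Each row contributes independently, so on a cluster this sum
    could be computed with a map over rows followed by a reduce.
    """
    g = [[0.0] * n_cols for _ in range(n_cols)]
    for row in rows:
        for i in range(n_cols):
            for j in range(n_cols):
                g[i][j] += row[i] * row[j]
    return g


def top_singular_value(rows, n_cols, iterations=200):
    """Largest singular value of A, from power iteration on A^T A.

    Power iteration is an illustrative stand-in for a local eigensolver.
    """
    g = gramian(rows, n_cols)
    v = [1.0] * n_cols
    for _ in range(iterations):
        w = [sum(g[i][j] * v[j] for j in range(n_cols)) for i in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    # Rayleigh quotient v^T G v estimates the top eigenvalue of A^T A,
    # whose square root is the top singular value of A.
    gv = [sum(g[i][j] * v[j] for j in range(n_cols)) for i in range(n_cols)]
    eigenvalue = sum(v[i] * gv[i] for i in range(n_cols))
    return math.sqrt(eigenvalue)


if __name__ == "__main__":
    tall_skinny = [[3.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
    print(top_singular_value(tall_skinny, n_cols=2))
```

The point of the sketch is the shape of the computation: the only pass over the (possibly enormous) set of rows is the Gramian sum, and everything after that operates on an n_cols by n_cols matrix.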
Reza's discussion of All Pairs Similarity, which is based on the DIMSUM (Dimension Independent Matrix Square Using MapReduce) algorithm and a non-intuitive sampling procedure in which frequently occurring pairs are sampled less often, is also illuminating.
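The sampling idea can be sketched in a few lines of plain Python. This is a single-machine illustration under my own assumptions, not Spark's code: rows are sparse dicts mapping column index to value, the threshold parameter is called gamma, and entries from columns with large norms are kept with lower probability and rescaled so that each summed product is still an unbiased estimate of the cosine similarity between two columns. All function and variable names are hypothetical.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def column_norms(rows, n_cols):
    """Euclidean norm of each column; rows are dicts {column: value}."""
    squares = [0.0] * n_cols
    for row in rows:
        for j, value in row.items():
            squares[j] += value * value
    return [math.sqrt(s) for s in squares]


def sampled_cosine_similarities(rows, n_cols, gamma, rng):
    """Estimate cosine similarity for column pairs by sampling.

    gamma is an assumed tuning knob: larger values keep more entries
    and give more accurate estimates at higher cost.
    """
    norms = column_norms(rows, n_cols)
    root = math.sqrt(gamma)
    # Heavy columns (large norm) are kept less often ...
    keep_prob = [min(1.0, root / c) if c > 0 else 0.0 for c in norms]
    # ... and kept entries are rescaled so the expectation is unchanged.
    scale = [min(root, c) for c in norms]

    sims = defaultdict(float)
    for row in rows:  # the "map" step: each row is handled independently
        kept = [(j, value / scale[j])
                for j, value in sorted(row.items())
                if rng.random() < keep_prob[j]]
        for (j, vj), (k, vk) in combinations(kept, 2):
            sims[(j, k)] += vj * vk  # the "reduce" step: sum per pair
    return dict(sims)


if __name__ == "__main__":
    data = [{0: 1.0, 1: 1.0}, {0: 1.0, 2: 2.0}, {1: 3.0, 2: 1.0}]
    print(sampled_cosine_similarities(data, 3, gamma=4.0, rng=random.Random(0)))
```

With a very large gamma every entry is kept and the result is the exact cosine similarity; shrinking gamma trades accuracy for fewer emitted pairs, and the savings come mostly from the heaviest columns.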
To get some hands-on experience with Spark, your next step might be to watch the three-hour Databricks video Intro to Apache Spark Training - Part 1.
From here, the next obvious question is: "How do I use Spark with R?" Spark itself is written in Scala and has bindings for Java, Python and R. Searching for a Spark demo online, however, will most likely turn up either a Scala or a Python example. SparkR, the open source project to produce an R binding, is not as far along as the bindings for the other languages. Indeed, a Cloudera web page refers to SparkR as "promising work". The SparkR GitHub page shows it to be a moderately active project, with 410 commits to date from 15 contributors.
In the talk "SparkR: Enabling Interactive Data Science at Scale", Zongheng Yang (only a third-year Berkeley undergraduate when he delivered this talk last July) lucidly works through a word count demo and a live presentation using SparkR with RStudio and a number of R packages and functions. Here is the code for his word count example.
[Figure: SparkR word count example]
Note the SparkR lapply() function, which is an alias for the Spark map and mapPartitions functions.
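For readers who have not seen the pattern before, the shape of a word count is easy to show without a cluster. The sketch below is plain Python, not SparkR and not the code from the talk: flat_map and reduce_by_key are hypothetical local helpers that mimic the split, map and reduce-by-key stages, with Python's built-in map playing the role that lapply() plays in the R version.

```python
from itertools import chain


def flat_map(func, items):
    """Apply func to each item and flatten the resulting sequences."""
    return chain.from_iterable(map(func, items))


def reduce_by_key(func, pairs):
    """Combine all values that share a key using func."""
    combined = {}
    for key, value in pairs:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined


def word_count(lines):
    """Count words with the split -> map -> reduce-by-key pattern."""
    words = flat_map(lambda line: line.split(), lines)  # one word per element
    pairs = map(lambda word: (word, 1), words)          # (word, 1) pairs
    return reduce_by_key(lambda a, b: a + b, pairs)     # sum counts per word


if __name__ == "__main__":
    print(word_count(["spark and r", "r over spark"]))
```

In a distributed setting each stage would run in parallel over partitions of the input, but the three-step structure stays the same.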
These are still early times for Spark and R. We would very much like to hear about your experiences with SparkR or any other effort to run R over Spark.