by Bill Jacobs, Microsoft Advanced Analytics Product Marketing
They say that time is infinite. Seems to me data is fast becoming the same. Or perhaps it's becoming true that our thirst for speed is providing eternal job security to computer scientists who can deliver it.
Apache Spark, one of the Apache Foundation's fastest-growing open source projects, delivers new levels of speed to computing clusters by combining in-memory computing with efficient parallelization. With Spark, Hadoop clusters and data lakes can achieve speeds far greater than those available with Hadoop's MapReduce framework.
We're happy to announce this month that Microsoft has integrated support for Apache Spark into Microsoft R Server for Hadoop, bringing Spark's speed advantages within the reach of R users.
But how much faster is it? <drum-roll please...>
To measure performance, we first pitted R Server for Hadoop running on MapReduce against R Server for Hadoop running R algorithms in Apache Spark on Hadoop. In our testing, using R Server to move work into Spark boosted GLM speeds by roughly 6x over R Server on Hadoop MapReduce, on the same cluster and the same data. That's quite a testament to the in-memory architecture of Spark's new distributed compute engine.
Second, and more pertinent to users of open source R and CRAN algorithms (who are perhaps struggling with scale limits), we compared R Server on a five-node Spark cluster to open source R with CRAN algorithms, which can only run on a single server. We expected faster, and we got it. R Server ran GLM 125 times faster on five times the hardware, showing the combined speed of R Server's parallelized algorithms and Spark's in-memory architecture.
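The "125 times faster on five times the hardware" figure mixes two effects: more machines, and faster software on each machine. A rough way to separate them is to divide the speedup by the hardware ratio. The Python sketch below does only that arithmetic. The function name is illustrative, and it assumes each cluster node is comparable to the single server used for open source R, which the benchmark description does not state.

```python
def per_node_speedup(total_speedup: float, node_ratio: float) -> float:
    """Return the speedup left over after dividing out the extra hardware."""
    if node_ratio <= 0:
        raise ValueError("node_ratio must be positive")
    return total_speedup / node_ratio


# Figures quoted above: 125x faster on a five-node cluster vs. one server.
# Assumption: every node matches the single server's capacity.
normalized = per_node_speedup(125, 5)
print(f"{normalized:.0f}x per node")  # prints "25x per node"
```

Under that assumption, roughly 25x of the gain would be attributable to the parallelized algorithms and in-memory execution rather than to the added nodes.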
Hadoop and Spark have a reputation for being "testy" to develop for, requiring fluency in programming languages such as Java, Python or Scala. For R users involved in things like risk analytics, insurance underwriting, ad optimization or fraud detection, speed is only useful if it's easily achievable. We built R Server version 8 for Hadoop and Spark for exactly this case.
For most R users, however, it's what you learn from the data that matters, which often makes the effort to learn a new language a bridge too far. And so, it is the intersection of speed and ease of use that matters the most.
Ease of use was treated as a top engineering priority for Spark integration. As a result, using Spark requires only minimal script changes. CRAN R users can add R Server's remote execution context settings and substitute ScaleR algorithms in order to run parallelized computation on large data sets in Spark. For existing users of R Server for Hadoop, it's easier still: they only need to change the context setting from MapReduce to Spark.
The combined speed and simplicity of R Server on Spark are available to preview today in the Azure cloud and will be generally available in Azure later this summer. For use on premises, MapR, Hortonworks and Cloudera users can install Microsoft R Server for Hadoop, version 8.0.5, which will become available in the last week of June 2016.
If you're interested in learning more, here's a short video on Microsoft's Channel9 blog site describing our Spark support in Microsoft R Server version 8.0.5.
If you'd like to test drive R Server on Spark and Hadoop in the Azure cloud, you can start for free at https://azure.microsoft.com. With your Azure account, you can spin up a cluster with R Server and Spark on HDInsight Hadoop in about the time it takes to make a decent lunch.