by Ali Zaidi, Data Scientist at Microsoft
In a previous post, we showcased the use of the sparklyr package for manipulating large datasets using a familiar dplyr syntax on top of Spark HDInsight clusters.
In this post, we will take a look at the RxSpark API for R, part of the RevoScaleR package and the Microsoft R Server distribution of R on HDInsight. We'll use RxSpark to visualize a dataset of 140M taxi rides between boroughs in New York City.
Dealing with data in distributed storage and programming with concurrent systems often requires learning complicated new paradigms and techniques. Statisticians and data scientists familiar with R are unlikely to have much experience with such systems. Fortunately, the RevoScaleR package abstracts away the difficult portions of distributed computation and allows the user to focus on building R code that can be automatically deployed in distributed environments.
WODA - Write Once, Deploy Anywhere
In a similar spirit to how sparklyr allowed us to reuse our functions from the dplyr package to manipulate Spark DataFrames, the RxSpark API allows a data scientist to develop code that can be deployed in a multitude of environments. This lets the developer shift their focus from writing environment-specific code to the complex analysis of their data science problem. We call this flexibility Write Once, Deploy Anywhere, or WODA for the acronym lovers.
For a deeper dive into the RevoScaleR package, I recommend you take a look at the online course, Analyzing Big Data with Microsoft R Server. Much of this blog post follows the last section of the course, on deployment to Spark.
NYC Taxi Data
In this section, we will examine the ubiquitous NYC Taxi Dataset, and showcase how we can develop data analysis pipelines that are platform-invariant.