by Andrie de Vries
Today is the first day of the UseR!2015 conference in Aalborg, in northern Denmark. Yesterday was packed with 16 tutorials on a range of interesting topics. Many months ago I submitted a proposal to run a session on using R in Hadoop, and I was very happy to be selected to run a session in the morning.
When we first started planning the session, we set a Big Hairy Audacious Goal to run the session using a HortonWorks Hadoop cluster hosted in the Microsoft Azure cloud.
We trialled the session at the Birmingham R user group during May, and then again last week during a Microsoft internal webinar. In both cases, the cluster performed great.
Then I asked the organisers how many people had expressed an interest, and I heard that 66 people had said they might come and that the room would seat 144 people.
At this point I started having cold sweats!
Just as luck would have it, minutes before start time something went wrong. The cluster would not start and participants were unable to access the server for the first hour of the tutorial. Fortunately, we had a back-up plan.
You can take a vicarious tour through the analysis of the NYC Taxi Cab data with the rmarkdown slides built with the RStudio presentation suite.
Although I did not cover all the material during the session, the full set of presentations is:
2. Analysing New York taxi data with Hadoop
3. Computing on distributed matrices
4. Using RHive to connect to the hive database
(These presentations, the sample data, and all the scripts and examples live on GitHub at https://github.com/andrie/RHadoop-tutorial. Use the tag https://github.com/andrie/RHadoop-tutorial/tree/2015-06-30-UseR!2015 to find the state of the repository at the time of the tutorial.)
The central example throughout is to use mapreduce() in the package rmr2 to summarize New York taxi journeys, grouped by hour of the day. This is a dataset of ~200GB of uncompressed CSV files.
You can read more about how this data came into the public domain at http://chriswhong.com/open-data/foil_nyc_taxi/. Here is a simple plot of the results, based on a 1-in-1,000 sample of the data:
The full code to do this is available in a GitHub repository at https://github.com/andrie/RHadoop-tutorial. Here is an extract:
# Load the RHadoop packages and initialise the connection to HDFS
library(rmr2)
library(rhdfs)
hdfs.init()
rmr.options(backend = "hadoop")

# List the taxi files stored in HDFS
hdfs.ls("taxi")$file

# Location of the taxi data in the user's HDFS home folder
homeFolder <- file.path("/user", Sys.getenv("USER"))
taxi.hdp <- file.path(homeFolder, "taxi")

# Build the csv input format from the data dictionary
headerInfo <- read.csv("data/dictionary_trip_data.csv", stringsAsFactors = FALSE)
colClasses <- as.character(as.vector(headerInfo[1, ]))
taxi.format <- make.input.format(format = "csv", sep = ",",
                                 col.names = names(headerInfo),
                                 colClasses = colClasses,
                                 stringsAsFactors = FALSE
)

# Map: count the trips per hour within each chunk of input
taxi.map <- function(k, v){
  original <- v[[6]]                           # timestamp column
  date <- as.Date(original, origin = "1970-01-01")
  wkday <- weekdays(date)                      # not used in the aggregation below
  hour <- format(as.POSIXct(original), "%H")   # hour of day as "00" to "23"
  dat <- data.frame(date, hour)
  z <- aggregate(date ~ hour, dat, FUN = length)
  keyval(z[[1]], z[[2]])                       # key = hour, value = trip count
}

# Reduce: add up the partial counts for each hour
taxi.reduce <- function(k, v){
  data.frame(hour = k, trips = sum(v), row.names = k)
}

# Run the job on the cluster
m <- mapreduce(taxi.hdp, input.format = taxi.format,
               map = taxi.map,
               reduce = taxi.reduce
)

# Retrieve the results from HDFS and plot them
dat <- values(from.dfs(m))
library("ggplot2")
p <- ggplot(dat, aes(x = hour, y = trips, group = 1)) +
  geom_smooth(method = loess, span = 0.5,
              col = "grey50", fill = "yellow") +
  geom_line(col = "blue") +
  expand_limits(y = 0) +
  ggtitle("Sample of taxi trips in New York")
p
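To see what the map and reduce steps in the extract above are doing without a cluster, here is a minimal sketch in base R only. It does not use rmr2 or Hadoop. The timestamps are made up for illustration, the split into two chunks stands in for the blocks that Hadoop would hand to separate mappers, and the names pickups, chunks and map_chunk are hypothetical.

```r
# Made-up pickup timestamps standing in for the timestamp column of the taxi data
pickups <- c("2013-01-01 00:15:00", "2013-01-01 00:45:00",
             "2013-01-01 08:05:00", "2013-01-01 08:30:00",
             "2013-01-01 08:55:00", "2013-01-01 17:20:00")

# Split the input into two chunks, mimicking input blocks sent to separate mappers
chunks <- split(pickups, rep(1:2, each = 3))

# Map step: count the trips per hour within a single chunk
map_chunk <- function(v) {
  hour <- format(as.POSIXct(v, tz = "UTC"), "%H")
  counts <- table(hour)
  data.frame(hour = names(counts), trips = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Apply the map step to every chunk and stack the partial counts
mapped <- do.call(rbind, lapply(chunks, map_chunk))

# Reduce step: sum the partial counts for each hour
reduced <- aggregate(trips ~ hour, data = mapped, FUN = sum)
reduced
```

Hour 08 appears in both chunks, so the map step emits two partial counts for it and the reduce step adds them together. This is the same pattern as keyval() in taxi.map followed by sum(v) in taxi.reduce.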
In reading through the examples and following the presentation information, it looks like the R-related components are installed on a node in the cluster?
Is it possible to perform hdfs commands and submit Hive queries from a shared Linux server that is not in the cluster?
For example, in corporate environments, there are Linux servers for non-Hadoop related workflows. Is there a technique to connect from these servers to the Hadoop cluster to execute hdfs commands and Hive queries?
Posted by: Phillip Burger | July 21, 2015 at 09:23