(Update Oct 7 2011: This post was written tongue-in-cheek, well before any integrations between R and Hadoop had been released. If you're interested in actual packages for running R in Hadoop, check out the RHadoop Project.)
There's been a lot of buzz recently around the MapReduce algorithm and its famous open-source implementation, Hadoop. It's the go-to algorithm for performing any kind of analytical computation on very large data sets. But what is the MapReduce algorithm, exactly? Well, if you're an R programmer, you've probably been using it routinely without even knowing it. As a functional language, R has a whole class of functions -- the "apply" functions -- designed to evaluate a function over a series of data values (the "map" step) and to collate and condense the results (the "reduce" step). In fact, you can almost boil it down to a single line of R code:
sapply(map(data), reduce)
where map is a function that, when applied to a data set data, splits the data into a list, with each list element collecting the values that share a common key, and reduce is a function that processes each element of that list, condensing all the values mapped to each key into a single result.
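To make the one-liner concrete, here's a minimal sketch in R that counts word frequencies. The sample data and the definitions of map and reduce are purely illustrative (they aren't from any package): split() plays the map role by grouping values under a common key, and length() plays the reduce role by condensing each group to a single value.

```r
# Hypothetical sample data: a vector of words to be counted
words <- c("apple", "banana", "apple", "cherry", "banana", "apple")

# Map step: group the values by key (here, the word itself)
map <- function(data) split(data, data)

# Reduce step: condense each group to a single value (its count)
reduce <- length

sapply(map(words), reduce)
#  apple banana cherry
#      3      2      1
```

The same pattern generalizes: swap in a different keying rule for map (say, splitting records by customer ID) and a different summary for reduce (say, sum or mean), and the sapply() call stays unchanged.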
It's not quite that simple, of course: one of the strengths of Hadoop is that it provides the infrastructure for distributing these map and reduce computations across a vast cluster of networked machines. But R has parallel programming tools too, and the Open Data Group has created a package to implement the MapReduce algorithm in parallel in R. The mapReduce package is available from any CRAN mirror.
Open Data: mapReduce Reduced (& Ported to R)
Also look at the HadoopStreaming package: http://cran.r-project.org/web/packages/HadoopStreaming/
Posted by: Shane | November 16, 2009 at 09:27
To me the MapReduce part of Hadoop is the trivial part. It's the job and task trackers of Hadoop that make me want the whole thing integrated with R, and not just this "port".
Posted by: anon | November 16, 2009 at 12:43
Indeed -- the "and it's trivial" headline was decidedly tongue-in-cheek. If you're interested in using R as the reduce operator in Hadoop, check out the Rhipe project.
Posted by: David Smith | November 16, 2009 at 13:20
Take a look at this: in Ruby you can implement all of Unix in just 14 lines (and it's trivial).
http://pwpwp.blogspot.com/2009/11/unix-in-14-lines-of-ruby-its-trivial.html
Posted by: Manuel Simoni | November 17, 2009 at 11:54