(Update Oct 7 2011: This post was written tongue-in-cheek, well before any integrations between R and Hadoop had been released. If you're interested in actual packages for running R in Hadoop, check out the RHadoop Project.)
There's been a lot of buzz recently around the MapReduce algorithm and its famous open-source implementation, Hadoop. It's the go-to framework for performing analytical computations on very large data sets. But what is the MapReduce algorithm, exactly? Well, if you're an R programmer, you've probably been using it routinely without even knowing it. As a functional language, R has a whole class of functions -- the "apply" functions -- designed to evaluate a function over a series of data values (the "map" step) and then collate and condense the results (the "reduce" step). In fact, you can almost boil it down to a single line of R code:
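Something along these lines -- a minimal sketch, where the example `map` and `reduce` definitions and the toy word-count-style data frame are illustrative assumptions, not part of any package:

```r
# A minimal MapReduce sketch in base R.
# 'map' groups values by key; 'reduce' condenses one group to a single value.
map    <- function(data) split(data$value, data$key)  # list of values per key
reduce <- function(values) sum(values)                # condense one key's values

# Toy data: keys with associated numeric values
data <- data.frame(
  key   = c("a", "b", "a", "b", "a"),
  value = c(1, 2, 3, 4, 5)
)

# The one-liner: apply reduce to each element of the mapped list
result <- sapply(map(data), reduce)
# result: a = 1 + 3 + 5 = 9, b = 2 + 4 = 6
```

Here `split` plays the role of the shuffle: it gathers every value sharing a key into one list element, and `sapply` runs the reducer over each element.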
where map is a function that, when applied to a data set data, splits the data into a list, with each list element collecting the values that share a common key, and reduce is a function that processes each element of that list, condensing all the data mapped to a given key into a single value.
It's not quite that simple, of course: one of the strengths of Hadoop is that it provides the infrastructure for distributing these map and reduce computations across a vast cluster of networked machines. But R has parallel programming tools too, and the Open Data Group has created a package that implements the MapReduce algorithm in R, in parallel. The mapReduce package is available from any CRAN mirror.
Open Data: mapReduce Reduced (& Ported to R)