Hadley Wickham has just released updates to his data-manipulation packages for R, plyr and reshape (now called reshape2), that are much faster and more memory-efficient than the previous incarnations. The reshape2 package lets you flexibly restructure and aggregate data using just three functions (melt, acast and dcast), whereas the plyr package is kind of like a supercharged SQL "GROUP BY" statement for R data frames.
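To make the description above concrete, here is a minimal sketch of both packages on a small made-up data frame. The `scores` data and its column names are assumptions for illustration only. It shows `melt` and `dcast` reshaping the data and `ddply` doing a GROUP BY-style aggregation.

```r
library(reshape2)
library(plyr)

# Hypothetical toy data: two trials per subject
scores <- data.frame(
  subject = c("a", "a", "b", "b"),
  trial   = c(1, 2, 1, 2),
  height  = c(10, 12, 20, 22),
  weight  = c(1, 3, 5, 7)
)

# melt: wide -> long, one row per (subject, trial, variable)
long <- melt(scores, id.vars = c("subject", "trial"))

# dcast: long -> wide data frame, averaging over trials
wide <- dcast(long, subject ~ variable, fun.aggregate = mean)

# ddply: roughly SELECT subject, AVG(height) FROM scores GROUP BY subject
by_subject <- ddply(scores, "subject", summarise, mean_height = mean(height))
```

`acast` works like `dcast` but returns a matrix or array rather than a data frame.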
One of the most interesting aspects of this update is that plyr can now parallelize its operations and make use of multiple processors simultaneously to speed up really big data-munging jobs. It makes use of Revolution's contributed foreach package, so whatever platform you're on (Windows, Linux, or Mac) you can specify a suitable parallel backend and take advantage of significant speedups on multiprocessor machines.
For example, on a 2-core Windows box you can use the doSMP package from Revolution R to speed up a plyr call as follows:
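require(plyr)
require(doSMP)
workers <- startWorkers(2) # My computer has 2 cores
registerDoSMP(workers) # Register the workers as the foreach backend
result <- llply(my_data, aggr_function, .parallel=TRUE) # my_data and aggr_function are placeholders
stopWorkers(workers) # Release the worker processes when done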
On Unix-like platforms (including Linux and Mac) you can use the doMC package for similar ends. Find more information about plyr at Hadley's website, below.
Had.co.nz: plyr
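For the Unix-like case mentioned above, a rough equivalent with doMC might look like the sketch below. The core count and the toy squaring function are assumptions chosen only to keep the example self-contained. Substitute your own data and function.

```r
library(plyr)
library(doMC)

# Register a multicore foreach backend (assuming a 2-core machine)
registerDoMC(cores = 2)

# Apply a toy function to each element in parallel
squares <- llply(1:4, function(x) x^2, .parallel = TRUE)
```

Unlike doSMP, doMC forks worker processes on demand, so there are no workers to start or stop explicitly.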
plyr and foreach seemed like a perfect match, so I'm excited to hear they're finally hitched! By the way, it seems that plyr v1.2 had a bug that broke ggplot2, but Hadley quickly fixed it and plyr v1.2.1 should be making its way through CRAN now.
Posted by: Mike Lawrence | September 10, 2010 at 09:25