In a two-part series at the Los Angeles R User Group[*], Ryan Rosario took a look at the many ways you can take the R language to the limits of high-performance computing.
In Part I (see video at this link; slides and code also available), Ryan focuses on the various methods of parallel computing in R. There's some great material here on explicit parallelism, especially if you're looking to get into the nuts and bolts of the Rmpi package. Ryan also gives several examples of using the snow and snowfall packages for fine-grained parallel computing. If you don't want to think too hard about the details of parallel programming, but just want to use the power of your hardware to speed up "embarrassingly parallel jobs", Ryan also covers implicit parallelism with the multicore package, and shows how to simplify things even further with foreach[**]. Part I wraps up with a brief look at high-performance computing with GPUs: computations can be very fast, but the tools available still aren't very user-friendly. If you're thinking about getting into parallel computing with R, Part I of Ryan's talk gives a great overview of the possibilities available. It also includes some advice about when not to try parallel computing:
“Each iteration should execute computationally-intensive work. Scheduling tasks has overhead, and can exceed the time to complete the work itself for small jobs.”
This sage advice is worth taking to heart. My personal (but unscientific) rule of thumb is that it's worth trying parallelism only when each iteration takes longer than the time it takes to get up and pour a cup of coffee. (Then again, the coffee pot is less than 5m from my desk.)
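To give a flavor of the implicit-parallelism style Ryan covers, here's a minimal sketch using foreach with the doMC backend (which runs on top of multicore). It's an illustration that assumes a multicore machine with doMC installed, not code from Ryan's slides.

# 100 bootstrap resamples of a mean: an "embarrassingly parallel" job
library(foreach)
library(doMC)
registerDoMC(cores = 2)   # register two worker processes

x <- rnorm(1e6)
boot_means <- foreach(i = 1:100, .combine = c) %dopar% {
  mean(sample(x, length(x), replace = TRUE))
}

Swapping %dopar% for %do% runs the identical loop sequentially, which makes it easy to time both versions and apply the pour-a-coffee test to your own problem.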
In Part II (video coming soon, slides and code available now), Ryan looks at the various tools available to break the constraint that R stores all data in memory, and to analyze very large data sets from within the R environment. Much of the presentation focuses on the bigmemory and ff packages, which use different techniques to store data on disk instead of in memory. On the bigmemory side, there's an interesting example of combining foreach and bigmemory to speed up processing of the airline delay data set, along with an example of running a linear regression on the data. (Revolution's Joseph Rickert does a similar analysis using the forthcoming RevoScaleR package in this white paper, where the computation is automatically parallelized and runs somewhat faster. I'll be talking more about RevoScaleR and showing a demonstration of this analysis in a webinar on Wednesday.) Ryan compares ff and bigmemory and finds that, performance-wise, they're much the same, but he notes one interesting advantage of ff: it helps when you need to create extremely long vectors. The goal of ff is to get rid of the following message:
> x <- rep(0, 2^31 - 1)
Error: cannot allocate vector of length 2147483647
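For concreteness, here's roughly what the ff alternative looks like: a minimal sketch assuming the ff package is installed, not code taken from Ryan's slides. The vector lives in a backing file on disk, and only the chunks you touch are pulled into RAM.

library(ff)
# Create a disk-backed double vector with the same length that failed above;
# this writes a roughly 16 GB backing file, so make sure you have the space
x <- ff(0, length = 2^31 - 1, vmode = "double")
x[1:5] <- 1:5   # read and write with ordinary subscripting
x[1:5]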
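And to give a flavor of the foreach-plus-bigmemory combination Ryan demonstrates on the airline delay data, here's a condensed sketch of the general pattern. It's an illustration under assumptions, not Ryan's actual code: the file names are placeholders, and the column names ("Year", "ArrDelay") are those of the airline delay data set.

# The data live in a file-backed big.matrix; each parallel worker
# re-attaches to it via its descriptor instead of copying the data
library(bigmemory)
library(foreach)
library(doMC)
registerDoMC(cores = 2)

x <- read.big.matrix("airline.csv", header = TRUE, type = "integer",
                     backingfile = "airline.bin",
                     descriptorfile = "airline.desc")
desc <- describe(x)

# Mean arrival delay by year, with the years processed in parallel
delay_by_year <- foreach(yr = 1987:2008, .combine = c) %dopar% {
  m <- attach.big.matrix(desc)
  rows <- which(m[, "Year"] == yr)
  mean(m[rows, "ArrDelay"], na.rm = TRUE)
}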
If you've been thinking about getting into MapReduce and/or Hadoop, Ryan has some great introductory materials beginning at slide 49. He gives several examples of using parallel programming tools to speed up map/reduce processing with the mapReduce package, and if you want to play with Hadoop but don't program in Java, Ryan also shows how to use the HadoopStreaming package to drive Hadoop directly from R. If you want more power in controlling Hadoop, Ryan also touches briefly on the Rhipe package.
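To give a sense of what Hadoop Streaming from R involves, here's a bare-bones word-count mapper written as a plain Rscript. It's a sketch of the general stdin/stdout protocol rather than code from Ryan's slides; the HadoopStreaming package wraps this plumbing in a friendlier interface. Hadoop pipes raw input lines to the script's standard input and collects tab-separated key/value pairs from its standard output, and a matching reducer then sums the counts for each word.

#!/usr/bin/env Rscript
# mapper.R: emit one "word<TAB>1" pair per word on standard output
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  words <- strsplit(tolower(line), "[^a-z]+")[[1]]
  words <- words[words != ""]
  for (w in words) cat(w, "\t1\n", sep = "")
}
close(con)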
Thanks go to Ryan for making these useful materials available!
Byte Mining: Taking R to the Limit, Part I & Part II
[*] Revolution Analytics is a proud sponsor of the Los Angeles Area R User Group.
[**] foreach is an open-source package developed by Revolution Analytics.
Great post. I've been learning a lot about R's memory limitations lately thanks to Revolutions Analytics and R bloggers. I really appreciate how you are posting about these topics even though Rev Analytics is selling a product that does this very thing. Thank you for allowing the community to share their ideas. I believe it's only going to help make your company, Revolution Analytics, even stronger.
Posted by: Larry (IEOR Tools) | August 24, 2010 at 06:03
I'd really like to hear more about advice for when to consider parallel computing. It makes sense that we should only consider parallel computing for big jobs (for a small job, the time it takes to get it running might exceed the time saved). However, I don't understand your "each iteration takes longer than the time it takes to get up and pour a cup of coffee..." rule of thumb. Why does it matter if there are 100 iterations that each take 5 minutes or 5 iterations that take 100 minutes? Thanks.
Posted by: Aaron | August 24, 2010 at 06:33
It's not the total time of the job that really matters. Converting an iterative process from sequential to parallel incurs overhead when you run the job (I'm not even counting the time to set it up, here). Each iteration will actually take a little *longer* than in the sequential case, while the system handles the details of running iterations in parallel. The idea is to save elapsed (wall-clock) time by running these slightly-longer iterations in parallel, rather than sequentially.
Both your examples satisfy my "pour-coffee" rule, so let's try a different, contrived example. Suppose I'm running a job with 100 iterations that each take 0.5 seconds. Total time for a sequential run: 50 seconds. Now let's run that in parallel on a dual-core machine. It'll take 25 seconds, right? But that's not counting overhead: let's say it's 1 second per iteration. Now the total elapsed time is 75 seconds (25 seconds parallel computation, 50 seconds parallel overhead). That's a contrived example, but I've seen plenty of real-life examples where adding parallelism can make things run slower. On the other hand, in this example if the iteration time was 10 seconds rather than 0.5 seconds, you'd be seeing a speedup.
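If it helps, here's that back-of-the-envelope arithmetic written out in R; the numbers are the made-up ones above, not measurements:

iters    <- 100
work     <- 0.5   # seconds of real computation per iteration
overhead <- 1     # assumed parallel overhead per iteration, in seconds
cores    <- 2

iters * work                         # sequential:   50 seconds
iters * (work + overhead) / cores    # parallel:     75 seconds

# With 10-second iterations the overhead is worth paying:
iters * 10                           # sequential: 1000 seconds
iters * (10 + overhead) / cores      # parallel:    550 seconds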
Posted by: David Smith | August 24, 2010 at 08:00
@Larry -- Thanks!
Posted by: David Smith | August 24, 2010 at 10:11
Thanks for the explanation.
Posted by: Aaron | August 24, 2010 at 10:18