
August 23, 2010

TrackBack


Listed below are links to weblogs that reference Taking R to the Limit: Parallelism and Big Data:

Comments


Great post. I've been learning a lot about R's memory limitations lately thanks to Revolution Analytics and R-bloggers. I really appreciate how you are posting about these topics even though Revolution Analytics is selling a product that does this very thing. Thank you for allowing the community to share their ideas. I believe it's only going to help make your company, Revolution Analytics, even stronger.

I'd really like to hear more advice about when to consider parallel computing. It makes sense that we should only consider parallel computing for big jobs (the time it takes to get a parallel job running might exceed the time saved on a small job). However, I don't understand your "each iteration takes longer than the time it takes to get up and pour a cup of coffee..." rule of thumb. Why does it matter whether there are 100 iterations that each take 5 minutes or 5 iterations that each take 100 minutes? Thanks.

It's not the total time of the job that really matters. Converting an iterative process from sequential to parallel incurs overhead when you run the job (I'm not even counting the time to set it up, here). Each iteration will actually take a little *longer* than in the sequential case, while the system handles the details of running iterations in parallel. The idea is to save elapsed (wall-clock) time by running these slightly-longer iterations in parallel, rather than sequentially.
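That sequential-to-parallel conversion can be sketched in a few lines of R. This is a minimal illustration, not code from the post: `slow_task` and the timings are assumptions of mine, and I'm using the `parallel` package that now ships with R (in 2010 one would have reached for the multicore or snow packages instead).

```r
library(parallel)

# A stand-in for one iteration of real work (illustrative only)
slow_task <- function(i) { Sys.sleep(0.01); i^2 }

# Sequential run
res_seq <- lapply(1:8, slow_task)

# Parallel run on forked workers (mclapply's forking is Unix-only;
# on Windows, use parLapply with a cluster instead)
res_par <- mclapply(1:8, slow_task, mc.cores = 2)

# Same results either way; the only question is wall-clock time
identical(res_seq, res_par)
```

The point of the coffee rule is that each parallel iteration also pays the overhead of dispatching work to a worker and collecting the result, which is why the per-iteration time, not the job total, is what matters.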

Both of your examples satisfy my "pour-coffee" rule, so let's try a different, contrived example. Suppose I'm running a job with 100 iterations that each take 0.5 seconds. Total time for a sequential run: 50 seconds. Now let's run that in parallel on a dual-core machine. It'll take 25 seconds, right? But that's not counting overhead: let's say it's 1 second per iteration. Now the total elapsed time is 75 seconds (25 seconds of parallel computation plus 50 seconds of parallel overhead). It's a contrived example, but I've seen plenty of real-life cases where adding parallelism makes things run slower. On the other hand, if the iteration time in this example were 10 seconds rather than 0.5 seconds, you'd see a speedup.
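The arithmetic above can be written out as a tiny cost model in R. The function names are mine, and the 1-second per-iteration overhead is the assumed figure from the example, not a measured value:

```r
# Elapsed time for n iterations of t seconds each, run sequentially
elapsed_seq <- function(n, t) n * t

# Elapsed time on k cores, with a fixed per-iteration parallel overhead
elapsed_par <- function(n, t, k, overhead) n * (t + overhead) / k

elapsed_seq(100, 0.5)        # 50 seconds sequentially
elapsed_par(100, 0.5, 2, 1)  # 75 seconds on two cores: parallelism made it slower
elapsed_par(100, 10, 2, 1)   # 550 seconds vs. 1000 sequential: a real speedup
```

Under this model, parallelism on k cores pays off only when (t + overhead) / k < t, i.e. when the per-iteration time t is large relative to the overhead.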

@Larry -- Thanks!

Thanks for the explanation.

The comments to this entry are closed.



Got comments or suggestions for the blog editor?
Email David Smith.
Follow David on Twitter: @revodavid
