August 23, 2010

Listed below are links to weblogs that reference Taking R to the Limit: Parallelism and Big Data:

Comments

Great post. I've been learning a lot about R's memory limitations lately thanks to Revolution Analytics and R bloggers. I really appreciate that you post about these topics even though Revolution Analytics sells a product that does this very thing. Thank you for allowing the community to share their ideas. I believe it's only going to help make your company even stronger.

I'd really like to hear more advice on when to consider parallel computing. It makes sense that we should only consider parallel computing for big jobs (for a small job, the time it takes to get it running might exceed the time saved). However, I don't understand your "each iteration takes longer than the time it takes to get up and pour a cup of coffee..." rule of thumb. Why does it matter whether there are 100 iterations that each take 5 minutes or 5 iterations that each take 100 minutes? Thanks.

It's not the total time of the job that really matters. Converting an iterative process from sequential to parallel incurs overhead when you run the job (I'm not even counting the time to set it up, here). Each iteration will actually take a little *longer* than in the sequential case, while the system handles the details of running iterations in parallel. The idea is to save elapsed (wall-clock) time by running these slightly-longer iterations in parallel, rather than sequentially.

Both your examples satisfy my "pour-coffee" rule, so let's try a different, contrived example. Suppose I'm running a job with 100 iterations that each take 0.5 seconds. Total time for a sequential run: 50 seconds. Now let's run that in parallel on a dual-core machine. It'll take 25 seconds, right? But that's not counting overhead: let's say it's 1 second per iteration. Each core now handles 50 iterations at 1.5 seconds apiece, so the total elapsed time is 75 seconds (25 seconds of parallel computation plus 50 seconds of parallel overhead). That's a contrived example, but I've seen plenty of real-life examples where adding parallelism makes things run slower. On the other hand, if the iteration time in this example were 10 seconds rather than 0.5 seconds, you'd see a speedup: about 550 seconds in parallel versus 1,000 seconds sequentially.
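
For anyone who wants to try other numbers, the arithmetic above can be sketched in a few lines. This is only a toy model, written in Python for brevity rather than R; the function name `elapsed_seconds` and its parameters are made up for illustration, and it assumes the iterations split evenly across cores with a fixed overhead per iteration, which real schedulers only approximate.

```python
import math


def elapsed_seconds(iterations, seconds_per_iteration,
                    workers=1, overhead_per_iteration=0.0):
    """Toy wall-clock estimate (hypothetical helper, not a library call).

    Assumes iterations are divided evenly across workers and that each
    iteration pays the same fixed parallel overhead.
    """
    # The slowest worker handles ceil(iterations / workers) iterations.
    per_worker = math.ceil(iterations / workers)
    return per_worker * (seconds_per_iteration + overhead_per_iteration)


# Short iterations: 0.5 s each, 1 s of overhead when run in parallel.
sequential_fast = elapsed_seconds(100, 0.5)
parallel_fast = elapsed_seconds(100, 0.5, workers=2,
                                overhead_per_iteration=1.0)

# Long iterations: 10 s each, same 1 s of overhead.
sequential_slow = elapsed_seconds(100, 10.0)
parallel_slow = elapsed_seconds(100, 10.0, workers=2,
                                overhead_per_iteration=1.0)

print(f"0.5 s iterations: sequential {sequential_fast} s, parallel {parallel_fast} s")
print(f"10 s iterations:  sequential {sequential_slow} s, parallel {parallel_slow} s")
```

In this model, parallelism on two cores only pays off once the per-iteration work exceeds the per-iteration overhead, which is the point of the coffee rule: the longer each iteration, the less the overhead matters.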

@Larry -- Thanks!

Thanks for the explanation.

Got comments or suggestions for the blog editor?
Email David Smith.