Comments on 'Taking R to the Limit: Parallelism and Big Data'
TypePad | 2010-08-23T17:40:37Z | Blog Administrator | http://blog.revolutionanalytics.com/
tag:typepad.com,2003:http://blog.revolutionanalytics.com/2010/08/taking-r-to-the-limit-parallelism-and-big-data/comments/atom.xml/

Aaron commented on 'Taking R to the Limit: Parallelism and Big Data' (2010-08-24T17:18:48Z)
<p>Thanks for the explanation.</p>

David Smith (http://www.revolutionanalytics.com) commented on 'Taking R to the Limit: Parallelism and Big Data' (2010-08-24T17:11:16Z)
<p>@Larry -- Thanks!</p>

David Smith (http://www.revolutionanalytics.com) commented on 'Taking R to the Limit: Parallelism and Big Data' (2010-08-24T15:00:48Z)
<p>It's not the total time of the job that really matters. Converting an iterative process from sequential to parallel incurs overhead when you run the job (I'm not even counting the time to set it up, here). Each iteration will actually take a little *longer* than in the sequential case, while the system handles the details of running iterations in parallel. The idea is to save elapsed (wall-clock) time by running these slightly-longer iterations in parallel, rather than sequentially.</p>
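<p>The trade-off described above can be sketched with a rough cost model. This is only an illustration, assuming every parallel iteration pays the same fixed overhead and that iterations divide evenly across cores. The function names, parameters, and input values are hypothetical and do not come from any R package.</p>

```python
import math


def elapsed_times(n_iter, iter_secs, overhead_secs, cores):
    """Return (sequential, parallel) wall-clock estimates in seconds.

    Assumed model (an illustration, not a measurement): each parallel
    iteration costs iter_secs + overhead_secs, and the iterations are
    shared evenly among the cores.
    """
    sequential = n_iter * iter_secs
    # Each core works through its share of the iterations one at a time.
    iters_per_core = math.ceil(n_iter / cores)
    parallel = iters_per_core * (iter_secs + overhead_secs)
    return sequential, parallel


def speedup(n_iter, iter_secs, overhead_secs, cores):
    """Ratio of sequential to parallel time; values below 1 mean a slowdown."""
    sequential, parallel = elapsed_times(n_iter, iter_secs, overhead_secs, cores)
    return sequential / parallel


if __name__ == "__main__":
    # Illustrative inputs: 100 iterations, 1 second of overhead each, 2 cores.
    for iter_secs in (0.5, 10.0):
        seq, par = elapsed_times(100, iter_secs, 1.0, 2)
        print(f"iteration={iter_secs}s sequential={seq}s parallel={par}s "
              f"speedup={seq / par:.2f}")
```

<p>Under this model, with two cores, the break-even point is where the overhead per iteration equals the iteration time itself. Shorter iterations run slower in parallel, and longer ones run faster.</p>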
<p>Both your examples satisfy my "pour-coffee" rule, so let's try a different, contrived example. Suppose I'm running a job with 100 iterations that each take 0.5 seconds. Total time for a sequential run: 50 seconds. Now let's run that in parallel on a dual-core machine. It'll take 25 seconds, right? But that's not counting overhead: let's say it's 1 second per iteration. Each core now handles 50 iterations at 1.5 seconds apiece, so the total elapsed time is 75 seconds (25 seconds parallel computation, 50 seconds parallel overhead). That's a contrived example, but I've seen plenty of real-life examples where adding parallelism makes things run slower. On the other hand, if the iteration time in this example were 10 seconds rather than 0.5 seconds, you'd be seeing a speedup.</p>

Aaron commented on 'Taking R to the Limit: Parallelism and Big Data' (2010-08-24T13:33:56Z)
<p>I'd really like to hear more advice about when to consider parallel computing. It makes sense that we should only consider parallel computing for big jobs (for a small job, the time it takes to get it running might exceed the time saved). However, I don't understand your "each iteration takes longer than the time it takes to get up and pour a cup of coffee..." rule of thumb. Why does it matter if there are 100 iterations that each take 5 minutes or 5 iterations that each take 100 minutes? Thanks.</p>

Larry (IEOR Tools) (http://industrialengineertools.blogspot.com) commented on 'Taking R to the Limit: Parallelism and Big Data' (2010-08-24T13:03:04Z)
<p>Great post.
I've been learning a lot about R's memory limitations lately thanks to Revolution Analytics and R bloggers. I really appreciate that you post about these topics even though Rev Analytics is selling a product that does this very thing. Thank you for allowing the community to share their ideas. I believe it's only going to help make your company, Revolution Analytics, even stronger.</p>