
February 24, 2011

Comments on "Packages for By-Group Processing in R"

Nice post -- I was happy to find data.table.

Re between-machine parallelism: any idea if anyone is working along the lines of snow? I spent 4 or 5 hours manhandling snow last week, and had a general sense of discomfort (not the good kind, either). The code base is old -- no options for timeout, and repeated segfaults on failures to makeSOCKcluster. A quick Google search shows 3+ years of confused users and few success stories. Its namespace handling is quirky (reliance on .GlobalEnv makes it hard to use snow clusters within functions), and the socket mechanism is both insecure (creating a public telnet on a published port) and inaccessible to many users (requires firewall reconfig, which requires root).

OpenSSH is now available on all platforms. A sensible solution is to have *one* cluster type -- ssh. Let ssh handle the inter-computer connection and proper forwarding of ports, and have each machine connect to a local port that's been forwarded.

Much of the work in snow has been superseded to some extent, e.g. clusterApply by foreach and doSNOW. Still, the basic idea is sound. It seems like most of the heavy lifting is done by foreach, and that engineering a simple makeSSHcluster and doSSH to forward ports, start slaves, and open socket connections should be tractable.
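To make the port-forwarding idea concrete, here is a minimal sketch in R of the command a makeSSHcluster-style function might build for each worker. It is not code from snow or from any existing package: `ssh_worker_command`, the host names, the port number, and `worker.R` are all hypothetical placeholders. The only real piece is OpenSSH's `-R` option, which asks the remote machine to listen on a port and forward connections back through the tunnel. The sketch only builds the command strings; it does not run ssh.

```r
# Hypothetical helper (not part of snow): build, but do not run, the ssh
# command that would start one worker on `host` and tunnel `port` back to
# the master.
ssh_worker_command <- function(host, port, worker_script = "worker.R") {
  stopifnot(is.character(host), length(host) == 1, port > 0, port < 65536)
  # -R port:localhost:port makes sshd on `host` listen on `port` and forward
  # anything that connects there to the same port on the master.
  tunnel <- sprintf("-R %d:localhost:%d", port, port)
  # The worker script (a placeholder) is handed the port it should connect
  # to on its own localhost.
  paste("ssh", tunnel, shQuote(host), "Rscript", shQuote(worker_script), port)
}

hosts <- c("node1.example.edu", "node2.example.edu")  # placeholder host names
port  <- 10187                                        # placeholder port
cmds  <- vapply(hosts, ssh_worker_command, character(1), port = port,
                USE.NAMES = FALSE)
print(cmds)
```

Each worker would then open its socket to localhost on the forwarded port, so nothing but ssh has to pass the firewall.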

Personally, I sit on a fast campus network and have at least 10 remote cores available that I could farm out for big jobs. The SSHcluster method would require minimal invasion on those machines -- just the ability to execute ssh and Rscript on the remote machines -- not even login privileges are required!

Maybe I should put this in my own blog :)
Anyway, food for thought. Any comments appreciated.

Ok, I did just that, and it's working. If you're interested, see:
http://helmingstay.blogspot.com/2011/02/snow-and-ssh-secure-inter-machine.html

Does doSMP use a scalable memory manager? Or do the different threads have private memory managers? Or do typical workloads not result in contention on the memory manager?

Christian, thanks for sharing that post -- I'm sure it will be useful to many others, and I also featured it in a post today.

David, doSMP doesn't do anything that clever; it just uses the companion revoIPC package to spawn additional instances of R to do the computations (much like the multicore/doMC package combo does on Linux systems). You just get to control the number of simultaneous instances (which shouldn't be greater than the number of available cores, of course).
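As a rough illustration of that pattern (separate R processes, capped at the number of cores), here is a small sketch. It uses the `parallel` package that ships with current versions of R, not doSMP or revoIPC; that substitution is an assumption made so the example runs anywhere, and `worker_count` is a hypothetical helper.

```r
library(parallel)  # stand-in for doSMP/revoIPC; an assumption of this sketch

# Hypothetical helper: never ask for more workers than there are cores.
worker_count <- function(requested) {
  cores <- detectCores()
  if (is.na(cores)) cores <- 1L      # detectCores() can return NA
  max(1L, min(as.integer(requested), as.integer(cores)))
}

n  <- worker_count(4)
cl <- makeCluster(n)                 # starts n separate R processes
squares <- parLapply(cl, 1:8, function(x) x^2)
stopCluster(cl)                      # shut the worker processes down
print(unlist(squares))
```

Because each worker is its own process with its own memory, the workers do not share an allocator; the cost is paid in process start-up and in copying data to and from the workers.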

@David So doSMP is basically fork? I suppose that gets rid of the possibility of contention within a single address space, but you must pay a price to spawn processes on Windows. That costs big time.


