Analyst and BI expert Steve Miller takes a look at the facilities in R for doing "by-group" processing of data. The task consisted of:
... read several text files, merge the results, reshape the intermediate data, calculate some new variables, take care of missing values, attend to meta data, execute a few predictive models and graph the results.
Then repeat the models and graphs for groups or sub-populations marked by distinct values of one or more dimension variables of interest.
The latter step is commonly referred to as “by-group processing.” SAS programmers will recognize by-group processing from syntax that invokes a procedure on a sorted data set and looks something like:
proc reg data = dblahblah; by vblahblah;
Check out Steve's post for how he addressed this in R using the high-performance data.table package by Matthew Dowle (and as Steve suggests, a good place to get started is the example vignettes).
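For readers who haven't seen the idiom, here is a minimal sketch of by-group modelling with data.table. It is not Steve's code: the built-in mtcars data set, the choice of cyl as the grouping variable, and the mpg ~ wt model are all assumptions made purely for illustration.

```r
library(data.table)

# mtcars ships with R; cyl (number of cylinders) stands in for the "by" variable.
# This data set and model are illustrative assumptions, not taken from Steve's post.
dt <- as.data.table(mtcars)

# Fit the same regression within each cyl group and keep the coefficients.
# .SD is the subset of rows for the current group; .N is its row count.
fits <- dt[, {
  m <- lm(mpg ~ wt, data = .SD)
  list(n = .N, intercept = coef(m)[[1]], slope = coef(m)[[2]])
}, by = cyl]

print(fits)
```

Unlike the SAS version, no prior sort is needed; the `by = cyl` argument does the grouping, and the result has one row per distinct value of cyl.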
I'd also recommend the plyr package, which offers tools to split data sets by various criteria and then process each group. Here, the plyr: divide and conquer guide is a good place to start. As an added bonus, you can also divide and conquer the computations themselves, spreading them across multiple nodes in parallel by registering a parallel backend for the foreach function. (Note for Windows users: the doSMP backend from Revolution R is also available now on R-Forge and will be on CRAN soon, too.)
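As a rough sketch of the split-apply-combine style in plyr, the example below fits one regression per group with ddply. The mtcars data set, the cyl grouping variable, and the mpg ~ wt model are assumptions chosen only to keep the example self-contained.

```r
library(plyr)

# Split mtcars by cyl, fit a model on each piece, and combine the results
# into one data frame. Data set and model are illustrative assumptions.
by_cyl <- ddply(mtcars, "cyl", function(d) {
  m <- lm(mpg ~ wt, data = d)
  data.frame(n = nrow(d), intercept = coef(m)[[1]], slope = coef(m)[[2]])
})

print(by_cyl)

# With a foreach backend already registered, the same call can run the groups
# in parallel by adding .parallel = TRUE. Which backend to register depends on
# your platform, so that step is left out here.
# by_cyl <- ddply(mtcars, "cyl", fit_fun, .parallel = TRUE)  # fit_fun: hypothetical name for the function above
```

For three small groups the parallel version would not pay for itself; it is worth trying only when each group's computation is expensive.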
Information Management: By-Group Processing, the R data.table and the Power of Open Source
Nice post -- I was happy to find data.table.
Re between-machine parallelism, any idea if anyone is working along the lines of snow? I spent 4 or 5 hours manhandling snow last week, and had a general sense of discomfort (not the good kind, either). The code base is old -- no options for timeout, and repeated segfaults when makeSOCKcluster fails. A quick Google search shows 3+ years of confused users and few success stories. Its namespace handling is quirky (reliance on .GlobalEnv makes it hard to use snow clusters within functions), and the socket mechanism is both insecure (creating a public telnet on a published port) and inaccessible to many users (requires firewall reconfig, which requires root).
OpenSSH is now available on all platforms. A sensible solution is to have *one* cluster type -- ssh. Let ssh handle inter-computer connection and proper forwarding of ports, and for each machine to connect to a local port that's been forwarded.
Much of the work in snow has been superseded to some extent, e.g. clusterApply by foreach and doSNOW. Still, the basic idea is sound. It seems like most of the heavy lifting is done by foreach, and that engineering a simple makeSSHcluster and doSSH to forward ports, start slaves, and open socket connections should be tractable.
Personally, I sit on a fast campus network and have at least 10 remote cores available that I could farm out for big jobs. The SSHcluster method would require minimal invasion on those machines -- just the ability to execute ssh and Rscript on the remote machines -- not even login privileges are required!
Maybe I should put this in my own blog :)
Anyway, food for thought. Any comments appreciated.
Posted by: Christian Gunning | February 24, 2011 at 16:35
Ok, I did just that, and it's working. If you're interested, see:
http://helmingstay.blogspot.com/2011/02/snow-and-ssh-secure-inter-machine.html
Posted by: Christian Gunning | February 24, 2011 at 20:43
Does doSMP use a scalable memory manager? Or do the different threads have private memory managers? Or do typical workloads not result in contention on the memory manager?
Posted by: David Heffernan | February 25, 2011 at 01:12
Christian, thanks for sharing that post -- I'm sure it will be useful to many others, and I also featured it in a post today.
Posted by: David Smith | February 25, 2011 at 09:26
David, doSMP doesn't do anything that clever; it just uses the companion revoIPC package to spawn additional instances of R to do the computations (much like the multicore/doMC package combo does on Linux systems). You just get to control the number of simultaneous instances (which shouldn't be greater than the number of available cores, of course).
Posted by: David Smith | February 25, 2011 at 09:29
@David So doSMP is basically fork? I suppose that gets rid of the possibility of within address space contention, but you must pay a price to spawn processes on Windows. That costs big time.
Posted by: David Heffernan | February 25, 2011 at 12:46