I had a really great time at the BioC2009 conference in Seattle this week. The conference is dedicated to the analysis of gene expression data using the BioConductor packages for R. It's an area I haven't really looked at for quite a while, but boy, has everything changed. Three years ago the hot topic was microarray analysis; now, that's old hat.
The big thing this year is sequence analysis: scientists can now look inside a cell and identify which segments of DNA are being expressed (and at what rate). This provides a direct link between experimental factors like drugs, diseases, and risk factors and the actual behavior of specific parts of the body in terms of the proteins our cells create via genes in our DNA. Not only does this create a vast volume of data about known genes, but it has also spawned a whole new data-identification problem: finding previously unknown (but observed, via expression) genes in the reference genome. Suddenly, the ability to find a short sequence (a gene) in a VERY long string (the genome) using fuzzy-matching techniques (genes aren't always the exact same ACTG sequence) is an important computational problem.
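To make the fuzzy-matching problem concrete, here is a minimal base-R sketch, not how any BioConductor package actually does it. It slides a window along the long string and keeps every start position where the short sequence differs by at most a given number of substitutions. The function name find_approx, its max_mismatch argument, and the toy "genome" string are all made up for illustration; real genomes need far more efficient algorithms, and real matching often has to allow insertions and deletions too.

```r
# Hypothetical helper: return the 1-based start positions in `subject` where
# `pattern` matches with at most `max_mismatch` substitutions (Hamming distance).
find_approx <- function(pattern, subject, max_mismatch = 1) {
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(subject, "")[[1]]
  n_windows <- length(s) - length(p) + 1
  if (n_windows < 1) return(integer(0))
  starts <- seq_len(n_windows)
  # Count mismatching characters in each window of the subject
  mismatches <- vapply(starts, function(i) {
    sum(s[i:(i + length(p) - 1)] != p)
  }, integer(1))
  starts[mismatches <= max_mismatch]
}

genome <- "TTGACGTACCGTAGG"   # toy stand-in for a reference genome
gene   <- "ACGTTCCG"          # differs from genome[4:11] by one base

find_approx(gene, genome, max_mismatch = 0)   # no exact match
find_approx(gene, genome, max_mismatch = 1)   # finds the window starting at 4
```

This brute-force scan does work proportional to the genome length times the pattern length for each pattern, which is exactly why the problem gets interesting at genome scale.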
I think that may be why there was such interest in my Parallel Programming in R lab session. It was a 2-hour lecture and practical session. Much of the material I've covered here before, such as using iterators and foreach to parallelize loops in R on multi-core systems or on clusters. A lot of typical calculations in BioConductor do seem to be embarrassingly parallel, and it's easy to convert a for loop or an lapply loop to foreach and get some significant speedups. (Others including Michael Knudsen and William Webber have found this to be the case, too.) There was also some new material on using video cards (GPUs) for high-performance multi-threaded calculations, which I'll probably write up as a blog post some day. You can download the slides (2 MB PDF) and the R script file I used for the session.
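For readers who haven't seen the conversion, here is a minimal sketch of turning an lapply loop into a foreach loop. It is not taken from the session script: the expression matrix is random stand-in data, and the per-row mean is a placeholder for a real per-gene calculation. The sketch registers the sequential backend that ships with foreach (registerDoSEQ) so it runs anywhere; to actually get a speedup you would register a parallel backend instead.

```r
library(foreach)

# Sequential fallback backend from the foreach package. Replace this with a
# parallel backend (for example doMC's registerDoMC() on MacOS or Linux).
registerDoSEQ()

# Made-up data: 20 "genes" (rows) by 10 "samples" (columns)
set.seed(1)
expr <- matrix(rnorm(200), nrow = 20)

# Original form: an lapply loop over rows
res_lapply <- unlist(lapply(seq_len(nrow(expr)), function(i) mean(expr[i, ])))

# Same calculation as a foreach loop; .combine = c collects results in a vector
res_foreach <- foreach(i = seq_len(nrow(expr)), .combine = c) %dopar% {
  mean(expr[i, ])
}

# Both forms should give the same answer
stopifnot(isTRUE(all.equal(res_lapply, res_foreach)))
```

Because each iteration touches only its own row, the loop body needs no changes when the backend changes; that independence between iterations is what "embarrassingly parallel" means in practice.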
All in all, it was a great event (despite the record heat in Seattle) and it's clear that BioConductor has a thriving community around it. Robert Gentleman announced at the conference that he's stepping down as the head of the project to take up a new role in the Bay Area, but with Martin Morgan taking the reins of the team in Seattle and with the development community worldwide, BioConductor will continue to be at the leading edge of gene expression data analysis for the foreseeable future.
Thank you for sharing! I read this blog every day and it is always worthwhile.
The slides and script are very helpful. I do have one question. I tried an "nws" backend in R 2.9.1 (Windows), but I can't register it, and you alluded to this in your slides. Is my only option to wait for an official release of R 2.10?
Posted by: Wade Davis | July 29, 2009 at 15:08
To use registerDoNWS() you'll need REvolution R Enterprise. Right now it's only available via subscription.
If you're on MacOS or Linux, you can install the "doMC" package from CRAN and use registerDoMC(), but unfortunately the multicore package it depends on isn't supported on Windows.
Posted by: David Smith | July 29, 2009 at 15:39