I had a really great time at the BioC2009 conference in Seattle this week. The conference is dedicated to the analysis of gene expression data using the BioConductor packages for R. It's an area I haven't really looked at for quite a while, but boy, has everything changed. Three years ago the hot topic was microarray analysis; now, that's old hat.
The big thing this year is sequence analysis: scientists can now look inside a cell and identify which segments of DNA are being expressed (and at what rate). This provides a direct link between experimental factors like drugs, diseases, and risk factors and the actual behavior of specific parts of the body, in terms of the proteins our cells create via genes in our DNA. Not only does this create a vast volume of data about known genes, but it has also spawned a whole new data-identification problem: finding previously unknown (but observed, via expression) genes in the reference genome. Suddenly, the ability to find a short sequence (a gene) in a VERY long string (the genome) using fuzzy-matching techniques (genes aren't always the exact same ACTG sequence) is an important computational problem.
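To make the fuzzy-matching idea concrete, here is a minimal sketch in Python. It is a naive illustration, not the method used by any BioConductor package: it slides the pattern along the sequence and keeps every position where the two differ in at most a given number of letters. It handles substitutions only (no insertions or deletions), and the names `count_mismatches` and `fuzzy_find`, along with the toy sequences, are made up for this example.

```python
def count_mismatches(a, b, max_mismatches):
    """Count differing positions between two equal-length strings.

    Returns None as soon as the count exceeds max_mismatches, so a
    hopeless window is abandoned early.
    """
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_mismatches:
                return None
    return mismatches


def fuzzy_find(pattern, genome, max_mismatches=1):
    """Return (position, mismatches) for each window of genome that is
    within max_mismatches substitutions of pattern."""
    m = len(pattern)
    hits = []
    for start in range(len(genome) - m + 1):
        d = count_mismatches(pattern, genome[start:start + m], max_mismatches)
        if d is not None:
            hits.append((start, d))
    return hits


# Toy data: one exact copy of the pattern and one copy with a single change.
genome = "TTGACGTACCGATTACGTTCCA"
pattern = "ACGTAC"
print(fuzzy_find(pattern, genome, max_mismatches=1))  # [(3, 0), (14, 1)]
```

This scan does work proportional to the genome length times the pattern length, which is fine for a toy string but hopeless for a real genome and millions of short reads. That gap is a big part of why this counts as a serious computational problem.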
I think that computational demand may be why there was such interest in my Parallel Programming in R lab session. It was a 2-hour lecture and practical session. Much of the material I've covered here before, such as using iterators and foreach to parallelize loops in R on multi-core systems or on clusters. It does seem to be the case that a lot of typical calculations in BioConductor are embarrassingly parallel, and it's easy to convert a for loop or an lapply call to foreach and get some significant speedups. (Others, including Michael Knudsen and William Webber, have found this to be the case, too.) There was also some new material on using video cards (GPUs) for high-performance multi-threaded calculations, which I'll probably write up as a blog post some day. You can download the slides (2 MB PDF) and the R script file I used for the session.
All in all, it was a great event (despite the record heat in Seattle), and it's clear that BioConductor has a thriving community around it. Robert Gentleman announced at the conference that he's stepping down as the head of the project to take up a new role in the Bay Area, but with Martin Morgan taking the reins of the team in Seattle and with the development community worldwide, BioConductor will continue to be at the leading edge of gene expression data analysis for the foreseeable future.