Bay Area UseR Group: April 13 2010

Tuesday's meeting of the Bay Area R UseR Group at the LinkedIn offices was a great event. The headline speaker was Joe Adler, author of the excellent R reference manual, R in a Nutshell. Joe's presentation was an in-depth look at the relative speed of the various options R offers for looking up values by key. One of the simplest ways of doing this is to assign names to a vector with the names function; then, if (say) you'd named your vector population with country names, you can look up the value associated with the name "Australia" with the statement population["Australia"] or population[["Australia"]]. With simulations, Joe showed that the latter version is a bit faster than the former, but both have lookup times proportional to the length of the vector (i.e. lookups get slower for longer vectors). A faster (but more complicated) option is to use environments, where lookups can be done via a hash table in constant time. Joe provides all the details in this blog post. He also offered some good advice: even though the simpler constructs are slower, in general it's best to program for clarity, and only optimize for speed when really necessary.
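For anyone who wants to experiment, here's a minimal sketch of the comparison (illustrative code and made-up values, not Joe's actual benchmarks):

    # Named-vector lookup: simple, but lookup time grows with vector length
    population <- c(Australia = 21515754, "New Zealand" = 4367800)  # example values
    population["Australia"]    # returns a named element
    population[["Australia"]]  # returns the bare value; a bit faster, still a linear scan

    # Environment lookup: backed by a hash table, so roughly constant time
    pop <- new.env(hash = TRUE)
    assign("Australia", 21515754, envir = pop)
    get("Australia", envir = pop)

    # The difference only becomes visible for long vectors:
    n <- 100000
    v <- seq_len(n)
    names(v) <- paste0("key", seq_len(n))
    e <- new.env(hash = TRUE)
    for (nm in names(v)) assign(nm, v[[nm]], envir = e)
    system.time(for (i in 1:1000) v[["key99999"]])             # linear scan each time
    system.time(for (i in 1:1000) get("key99999", envir = e))  # hash lookup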
A surprise addition to the program -- and a very pleasant surprise, at that -- was a lightning talk from Megan Price at Benetech. Benetech is a non-profit organization contracted by the likes of Amnesty International and Human Rights Watch to answer thorny geopolitical questions through the use of data and science. For example: "Were acts of genocide committed against the Mayan people in Guatemala?" (The answer, sadly, is yes.) Megan opened her talk by saying that she "uses R to save the world" -- and I think she was only half-joking. As Megan explained in a fascinating presentation, using statistical techniques to address these questions can "transform emotional political debates into debates about methodology and science". Specifically, she uses Multiple Systems Estimation techniques (from the Rcapture package) to count things that are otherwise difficult to quantify. The process is related to the capture-recapture methods used, for example, to count the number of animals in the wild. Megan also offered some great advice for creating scientific reports with R: her process of building reports dynamically with xtable and Sweave means that when something changes at the last minute -- and "something always changes at the last minute!", says Megan -- it's simple to regenerate the report to incorporate the changes, without having to reformat and cut-and-paste everything together again.
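To give a flavor of the technique, here's a minimal sketch of a multiple systems estimate with the Rcapture package (the data below are simulated purely for illustration; this is not Megan's analysis):

    library(Rcapture)
    set.seed(42)
    # Simulated capture histories: each row is a case, each column one of
    # three documentation sources; 1 means the case appears on that list
    all_cases <- matrix(rbinom(3 * 1000, 1, 0.3), ncol = 3)
    observed  <- all_cases[rowSums(all_cases) > 0, ]  # undocumented cases are never seen
    closedp(observed)  # fits closed-population models and estimates the true total

And the report-building workflow amounts to embedding R chunks in a LaTeX document. A hypothetical report.Rnw fragment (again a sketch, not Megan's actual code) might look like:

    \documentclass{article}
    \begin{document}
    <<echo=FALSE, results=tex>>=
    library(xtable)
    # 'estimates' is a hypothetical data frame of results; when the data
    # change, re-running Sweave regenerates the whole table automatically
    print(xtable(estimates, caption = "Estimated counts"))
    @
    \end{document}

Running Sweave("report.Rnw") in R produces a .tex file ready for LaTeX, which is why a last-minute change means a quick re-run rather than manual reformatting.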
It'd be great to hear more about Megan's report building process. When I put together a post on Sweave, it turned out to be my most popular post to date.
R facilitates some truly beautiful data analytic workflows, but my sense is that many people new to R are looking for direction on this. It's certainly something that I've been thinking about and refining for a couple of years now, and I still don't think I've got it quite right yet.
Posted by: Jeromy Anglim | April 16, 2010 at 02:10