Here's a new use for R: designing T-shirts for a band:
Despite having a cool name, "The Probable Error of a Mean" isn't a real band. Taking inspiration from the title of the 1908 paper where Student (the pen name of William Sealy Gosset) introduced the concept of the t distribution, Thomas Levine used R to design this T-shirt for the Shirt.Woot competition. The typeface is even the same as that used in the original paper!
How is it that a company like Apple can produce wonderful user interfaces, but open-source UIs are generally, well, substandard? Matthew Thomas suggested back in 2002 that too many cooks can spoil the broth:
Every contributor to the [open-source] project tries to take part in the interface design, regardless of how little they know about the subject. And once you have more than one designer, you get inconsistency, both in vision and in detail. The quality of an interface design is inversely proportional to the number of designers.
The solution? According to Matt Asay, the key is to get a commercial company involved. He's seen an upcoming release of Ubuntu from Canonical, Ltd. that may rival MacOS for usability. He says:
You need a little more than open source, it seems, to make products usable. You need control, and control doesn't always jibe well with open-source development. This is one reason that we're seeing the emergence of the Open Core licensing model for open source.
A coalition of four teams called BellKor's Pragmatic Chaos has qualified for the million-dollar Netflix Prize by creating a predictive model that generates movie recommendations with 10% greater accuracy than Netflix's current system. The New York Times has more, but at this point the details are still sketchy as competitors have 30 days to beat BellKor's performance:
Mr. Piotte, a founder of Pragmatic Theory, explained why he recently joined the larger team. “Because of the nature of the competition, making a coalition of teams is a quick way to improve results,” he said in an e-mail Friday night. “We felt that we had little chance to keep the lead against such a coalition unless we were part of one, too.”
But he declined to say just how the team nudged their work over the 10 percent threshold. “Since the competition remains open for 30 days, we are reluctant to disclose any secret at this time,” Mr. Piotte said. “All I can say is that we all worked very hard to achieve this mark, and that the final solution contains many original ideas.”
However, in an email today, Chris Volinsky of the AT&T Labs team in the coalition gave me a few details. He said that R was used sparingly for some visualization and a little bit of model development, and that most of the heavy lifting was done in C++ with a little bit of SAS. It seems the teams worked quite independently: Chris isn't sure what the other collaborators used.
The map was created in R using the maps and maptools packages, exported as PDF, and then touched up with Adobe Illustrator (for annotations and typography). As we saw on Friday, the New York Times has a similar practice to make R graphics publication-ready.
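For the curious, a rough sketch of that workflow might look like the following. (This assumes the maps package is installed; the actual script behind the published map isn't public, and the file name is just an example.)

```r
# Draw a base map with the maps package and export it as a PDF,
# ready for annotation and typography work in Illustrator.
library(maps)

pdf("us-map.pdf")                        # open a PDF graphics device
map("state", col = "grey80", fill = TRUE) # lower-48 state boundaries
title("Base map drawn in R")
dev.off()                                 # close the device, writing the file
```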
The New York Times has an interesting graphic today charting the late Michael Jackson's musical career. The chart superimposes the Billboard rankings of each of his hits over time as sparklines, and compares his output to that of The Beatles, U2, Mariah Carey, Usher and, perplexingly, Boyz II Men.
It's a nicely done chart, but what impresses me the most was the speed at which they put it out. (According to the dateline, it was published yesterday, hours after Jackson's death.) Either the NYT graphics department has applied the obituary model to quantitative graphics and has a chart like this cued up for every pop star, or they have some very powerful software and rich data stores. (I guess when Madonna gets hit by a bus we'll know the answer.)
Update: Well, we have an answer. The NYT does have powerful software to create its graphics: R! Amanda Cox, a graphics editor at the Times (and collaborator on the MJ graphic), emailed me with the news:
I saw your blog post on the Michael Jackson chart in the New York Times today.
I thought it might amuse you to know that the charts were made in R. (Then cleaned up in Illustrator and moved into Flash, but they started life in R.)
Thanks, Amanda! I've admired the NYT's amazing graphics for some time, but never knew R was involved. Count this as another example where R's power and rapid development lend themselves to analysis of breaking news events.
As announced by Peter Dalgaard this morning, the latest update to R, R 2.9.1, has been released on schedule. If you roll your own builds of R, the sources are available now on CRAN. As of this writing none of the binary releases (for Windows and MacOS) have been updated yet: it will probably be a few days before those builds are completed by the R Core Team and propagated to the CRAN mirrors.
This is a maintenance release and fixes a number of mostly minor issues. The complete list of changes appears after the jump.
It's amazing how a simple change in axes can make a chart much easier to interpret. (And as an added bonus, Jon's chart addresses one of my pet peeves: it's in PNG format. The JPG format of the original is almost always a poor choice for charts.)
Last week, I posted a link to a primer on running R in batch mode, by redirecting the input and output of the R shell command. (By the way, it works the same way in REvolution R too: just replace the command R with Revo). In the comments, Doug Bates reminded me that there's a better way: using R CMD BATCH.
Let's illustrate by example. Suppose I have a script file containing the commands for a statistical analysis. I want to run it in batch mode (perhaps because it will run overnight), and save the output of the commands to a file. Here's my script file, saved as myscript.R:
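The post's original script isn't reproduced here; as a minimal stand-in, here's a toy analysis using R's built-in cars dataset:

```r
# myscript.R -- a hypothetical example script: fit a simple linear
# regression of stopping distance on speed and print the summary.
fit <- lm(dist ~ speed, data = cars)
print(summary(fit))
```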
I can run it in batch mode from the shell or Windows command line, like this:
$ Revo CMD BATCH myscript.R myscript.Rout
(Note: I'm using REvolution R here, so replace Revo with R if you're using CRAN R. $ is the shell prompt, so don't type that.) This has a similar effect to the file redirection method:
$ Revo < myscript.R > myscript.Rout
with a few differences:
CMD BATCH automatically captures warnings and errors, which the file redirection method as shown above would lose: they'd be displayed at the shell prompt, not captured in the file. (It is achievable with some tricky shell syntax, but I can never remember how to do it.)
CMD BATCH automatically adds a call to proc.time() at the end of the session, to document how long the script took to run.
The CMD BATCH syntax is easier to remember (at least for me).
By default, both methods show the prompts, the input commands, and their outputs in the output file myscript.Rout, just as if you'd typed them at the interactive prompt. But if you only want the outputs (say, you're building a text report from R commands), a very handy option is --slave: it suppresses the prompts, the echoed commands, and R's startup blurb, leaving you with only the output. Use it like this:

$ Revo CMD BATCH --slave myscript.R myscript.Rout
One last point: you can actually pass your own command-line arguments into R to affect the behaviour of your script. Maybe we'll look at that in detail in some other post, but for now the help for the function commandArgs will get you started. You might also want to use Rscript instead, which we've discussed before.
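As a taste of what that looks like, here's a hedged sketch of a script that reads a custom argument. (The n= argument and the parse_n helper are hypothetical, invented for this example.)

```r
# Hypothetical script showing how to read a custom argument passed
# after --args. Invoke as:
#   R CMD BATCH "--args n=100" myscript.R myscript.Rout
parse_n <- function(args, default = 10) {
  # look for an argument of the form n=<number>
  hit <- grep("^n=", args, value = TRUE)
  if (length(hit) == 0) return(default)
  as.numeric(sub("^n=", "", hit[1]))
}

n <- parse_n(commandArgs(trailingOnly = TRUE))  # args after --args only
print(mean(rnorm(n)))                           # use n in the analysis
```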
The latest in the O'Reilly "Short Cuts" series, and the first devoted to R, is Data Mashups in R. Written by Jeremy Leipzig and Xiao-Yi Li, this 30-page article is an excellent and very practical example of integrating messy data from varied sources, using R or REvolution R.
It's not designed as a manual-style introduction to R. But by working through the fully detailed example it presents, even programmers unfamiliar with R will get a good sense of R's practical capabilities when working with real-life data sources. "Learning by doing" is a great way to bootstrap your knowledge of any language, and a concrete, practical example like this really helps with that process.
The example used is a particularly timely one: how to automate the process of downloading foreclosure data from a public website, and presenting it in graphical form, like this:
In practical terms, creating this graphic necessitates integration of data from various sources:
Downloading an HTML file of foreclosed addresses from a public web-site (with download.file) and extracting addresses in messy formats (with grep);
Downloading geolocation data from a Yahoo web-service and parsing the XML result (with xmlTreeParse);
Downloading an ESRI shape file of Philadelphia and its census tracts, and plotting a map (with the maptools package);
Matching the individual addresses to census tracts and counting the number of foreclosures in each (with plotPolys).
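To give a flavor of the first step, here's a minimal sketch of extracting addresses with grep, using a small in-line HTML fragment in place of a real download (the actual page format, and the regular expression you'd need, will differ):

```r
# Stand-in for the downloaded HTML: a few table cells, some of which
# contain street addresses in a messy format.
html <- c("<td>1234 MARKET ST</td>",
          "<td>Other content</td>",
          "<td>56 N BROAD ST</td>")

# Keep only the lines that look like street addresses...
addresses <- grep("[0-9]+ .* ST", html, value = TRUE)

# ...and strip the surrounding HTML tags.
addresses <- gsub("</?td>", "", addresses)
print(addresses)
```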
One of the great things about the article is that it's not presented as a theoretical exercise where everything just works the first time. It takes pains to cover not just how to accomplish the task itself, but also how to deal with problems you're likely to encounter along the way (especially if you adapt the methods to your own needs). Practical advice is given on handling connection problems caused by firewalls and proxies, recovering gracefully when servers fail to respond (using tryCatch), and detecting error messages from the Yahoo web service when it fails to recognize addresses. The authors also tell you where they found the information on dealing with such problems, so the article doubles as a guide to finding help resources in R.
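The tryCatch pattern is worth illustrating. Here's a sketch using a stand-in for the web-service call (lookup_address is hypothetical, with dummy coordinates; the book's actual code differs):

```r
# Stand-in for a geocoding web-service call that sometimes fails.
lookup_address <- function(addr) {
  if (addr == "bad address") stop("server did not respond")
  c(lat = 39.95, lon = -75.16)  # dummy Philadelphia coordinates
}

# Wrap the call in tryCatch so one failed lookup doesn't abort the
# whole batch: on error, log a message and return NULL instead.
safe_lookup <- function(addr) {
  tryCatch(lookup_address(addr),
           error = function(e) {
             message("Skipping '", addr, "': ", conditionMessage(e))
             NULL  # callers can filter NULLs out afterwards
           })
}

print(safe_lookup("bad address"))  # NULL, with a warning message
print(safe_lookup("1234 MARKET ST"))
```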
My only criticisms are minor ones. The article leads the user through downloading Census data, matching it to the tracts, and performing some simple summary statistics and exploratory graphics, but stops short of any actual statistical analysis. And cutting-and-pasting R code from a PDF document is a pain: there's no associated script file (as far as I know).
But for experienced R users looking for tips on integrating messy data sources from the Web, or for programmers new to R who (in the words of the authors) "want to incorporate statistical analysis into their data pipelines" and want a practical example to work through as an introduction, Data Mashups in R is well worth the $4.99 download.