Packages for R are being added and updated so frequently now that it's tough to keep up with them all (the @CRANberriesFeed Twitter feed helps, though). But here are a few recent package updates that caught my eye:
The Rcpp package for seamless integration between R and C++ has been updated. While most of the changes are under the hood, new users will appreciate the updated FAQ and Quick Reference Guide.
The powerful data-wrangling packages reshape2 and plyr have had several updates recently (see the short reshaping sketch after this list).
The rms package of regression tools has been updated with new capabilities for Cox regression.
The formatR package has been updated with a new function to reformat R source code for easier reading.
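For those new to reshape2, here's a minimal sketch of the wide-to-long-and-back reshaping idiom it supports. The data frame and column names below are invented purely for illustration:

    library(reshape2)

    # Toy data: one row per subject, with scores at two time points
    wide <- data.frame(subject = c("a", "b"),
                       pre  = c(10, 12),
                       post = c(14, 13))

    # melt() turns wide data into long ("molten") form
    long <- melt(wide, id.vars = "subject",
                 variable.name = "time", value.name = "score")

    # dcast() casts it back into wide form (or any other aggregation)
    dcast(long, subject ~ time, value.var = "score")

The same melt/cast idiom scales up to much more intricate aggregations than this round trip.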
There's been a lot of news recently (here in the US at least) of the triumph of the supercomputer "Watson" over two human competitors in the TV game show "Jeopardy!". It's an absolute triumph in computer science and natural language processing, which you can read about here, here, and especially here in the reflections of one of the human competitors, Ken Jennings. An episode of NOVA on Watson's design and creation is also available on-line, but I particularly liked the intersection of art and analytics that went into the design of Watson's avatar in this clip from IBM Research:
Watson's avatar, though beautiful, is clearly non-human. And Watson would most likely fail the standard Turing Test: reading a transcript of Watson's Jeopardy! answers, you might well suspect that the player wasn't human, because when it (occasionally) failed, it failed in very non-human ways.
So how far away are we from fielding a computer that can pass a Turing Test, and in five minutes of text messaging convince a panel of experts that it's a more convincing human than an actual, human rival? Not that far off, it turns out. In the Atlantic, journalist and poet Brian Christian recounts how he competed in a real, live Turing Test and discovered that the challenge isn't so much for the machines to imitate humans, but for the humans to resemble, well, themselves. When our computer overlords inevitably take over, will we find ourselves ultimately trying to resemble them?
(Incidentally, Alan Turing, after whom the Turing Test is named, is a hero of mine and so I'm very excited that a documentary about his life is in the works.)
For students planning to attend the annual worldwide R user conference, useR! 2011, travel grants are available to help defray the cost of attending the conference in the UK. CRISM is offering bursaries for accommodation and conference fees, and Revolution Analytics is offering $1,000 travel grants. Applications for funding, which must be accompanied by an abstract for a contributed talk or poster, must be submitted before March 11. Details at the useR! 2011 conference website, linked below.
The author of the ggplot2 graphics package for R, Hadley Wickham, is looking for feedback from ggplot2 users. If you've used ggplot2, fill out his short survey at the link below.
(Updated Feb 18) Note: These data actually relate to a gathering of Twitter users in Iran, not Egypt. Apologies for the error, and thanks to the commenters for the correction.
Twitter played a significant role in the recent uprising in Egypt, with protesters communicating via tweets marked with the #25bahman hashtag (February 14 in the Persian calendar) to plan and rally for the demonstration. Michael Bommarito downloaded all such tweets and plotted their frequency over time using R's ggplot2 library:
Not surprisingly, the activity peaked on February 14. The complete set of tweets would be an interesting focus for further analysis; they can be downloaded from Bommarito's post linked below.
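Bommarito's post has the actual data and code; as a rough sketch of the general approach, here's how one might build such a plot with ggplot2, assuming (purely for illustration) a data frame with one POSIXct timestamp per tweet:

    library(ggplot2)

    # Invented example data: 1,000 tweet timestamps over three days
    tweets <- data.frame(
      timestamp = as.POSIXct("2011-02-13 00:00", tz = "UTC") +
                  runif(1000, 0, 3 * 24 * 3600)
    )

    # Count tweets in hourly bins, then plot the frequency over time
    counts <- as.data.frame(table(cut(tweets$timestamp, breaks = "hour")))
    names(counts) <- c("hour", "n")
    counts$hour <- as.POSIXct(as.character(counts$hour), tz = "UTC")
    ggplot(counts, aes(x = hour, y = n)) +
      geom_line() +
      xlab("Time") + ylab("Tweets per hour")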
The next version of R is scheduled for release on February 25, and R 2.12.2 will likely be the final bug-fix release of the 2.12 series before R 2.13 is released in April. According to the NEWS file in the latest daily build, 2.12.2 improves complex-arithmetic support on some rare platforms that don't support complex types in C99, and upgrades the PCRE library behind regular-expression functions like regexpr. There will also be a number of minor bug-fixes, mainly addressing rare edge-cases. The corresponding section of the current NEWS file is included after the jump, but of course there may be additional changes or fixes documented before the February 25 release.
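(As a reminder of where PCRE fits in: R's regular-expression functions use the PCRE engine when called with perl = TRUE, which enables Perl-style constructs such as the lookahead in this small example.)

    # perl = TRUE switches R's regex functions to the PCRE engine,
    # enabling Perl-style patterns such as the lookahead below
    x <- c("foo1", "bar2", "baz")
    grepl("[a-z]+(?=[0-9])", x, perl = TRUE)
    ## [1]  TRUE  TRUE FALSE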
Free dating site OkCupid (which was recently acquired by match.com) collects a lot of data. With over 3 million members, many of whom have provided extensive personal details -- preferences, lifestyle, sexuality and hobbies -- via their dating profiles, the site has a wealth of data from which to identify trends in the love lives of the typical OkCupid member.
On their informative, entertaining and sometimes controversial blog OkTrends, co-founder Christian Rudder (with the assistance of data scientist Max Shron) analyzes the data to report aggregate trends and insights, such as the differences in preferences between white and black people, how the behaviors of gay members are at odds with some pernicious gay stereotypes, and how religion relates to reading and writing levels. With its blend of data analysis and humor, the OkTrends blog addresses interesting facts and topics that not many others have the will -- or the data -- to write openly about.
Rudder tells me that when the blog first launched, the data analyses were run manually in Microsoft Excel. Six months later, Max Shron introduced the OkCupid team to R, enabling more interesting analyses of the data, and the use of more of it. According to co-founder Sam Yagan, once they moved the data to R everything got a lot "better and faster", and they were able to produce posts more quickly and write about more intricate data with better visualizations.
Today (appropriately, on St Valentine's Day), GigaOM has published an in-depth article about OkCupid's use of R, by Revolution Analytics' Mike Minelli. Read the article for more details about the data, analyses and reporting OkCupid does with R to reveal hidden facts about our love lives and, ultimately, to find our Valentine.
xkcd is a very funny web comic. But I saved this particular episode not for the humour content, but as a really interesting example of information design and data visualization. It illustrates the optimal move to win (or at least draw) a game of Noughts and Crosses (or Tic-Tac-Toe, for the non-Commonwealth readers) at every stage, anticipating every possible move of the opponent:
It takes a few moments to figure out how to read it, but it's so rewarding. The red X's mark the optimal move - here, X moves first, and the optimal move is the top-left square. O may make any of 8 possible moves in response; drill down into one of the 8 largest "boxes" in the grid to find the smaller red X that anticipates that next move, and so on. As more moves are played without a win, you'll drill deeper and deeper into the chart until you reach the smallest sub-games (you'll need to view the larger version by then), many of which end in a draw (or tie).
It's a lovely, and unique, example of hierarchical information design, where each layer serves as the framework that puts the next layer in context, all the while encapsulating the results from all 102 possible games. I counted that number by hand -- each small box represents a completed game in the chart -- so I might be off by a couple. It's clearly smaller than the number of possible games when not playing optimally, and it doesn't separately count the openings equivalent by symmetry to X playing in a corner. The chart for O is more complex (it's much harder to win when you don't get to go first) and is also included in the full chart linked below.
Like many academics, Arthur Charpentier thinks a lot about publishing papers in journals. Specifically, he wondered if there was a way to figure out which journal would be the best place to publish his next paper and have it accepted:
I was wondering if there were clusters of journals, i.e. journals that publish almost the same kind of articles (so that next time one of my papers is rejected by the editor, I can just go for some journal in the same cluster).
By statistically analyzing the titles of articles published in various journals, he was able to figure out which journals are most alike and group them into clusters, as displayed in this tree graph:
Check out the full post in the Freakonometrics blog for the details of the analyses and the R code that created them.
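Charpentier's post has the real data and code; as a minimal sketch of the general technique -- term frequencies computed from titles, a distance matrix, then hierarchical clustering -- here's a toy version with invented journal names and titles:

    # Invented journals and article titles, for illustration only
    titles <- list(
      JournalA = c("bayesian inference for insurance claims",
                   "credibility and bayesian premium estimation"),
      JournalB = c("option pricing under stochastic volatility",
                   "hedging strategies in incomplete markets"),
      JournalC = c("bayesian models for claim reserving",
                   "insurance risk and ruin probabilities")
    )

    # Build a term-frequency vector for each journal over a common vocabulary
    words <- lapply(titles, function(s) unlist(strsplit(s, "\\W+")))
    vocab <- sort(unique(unlist(words)))
    tf <- t(sapply(words, function(w) table(factor(w, levels = vocab))))

    # Cluster the journals on their title-word profiles and draw the tree
    hc <- hclust(dist(tf))
    plot(hc)

Here plot(hc) draws the familiar cluster dendrogram, with journals that share title vocabulary joined low in the tree.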
Tiobe Software ranks the popularity of programming languages based on references in search engines. While the methodology might be debated in terms of the absolute rankings it produces, it is quite interesting to see how the rankings fluctuate over time: Tiobe has produced a monthly report of rankings based on this methodology since 2001.
In the Tiobe Programming Community Index for February 2011, the top three slots are held by the general-purpose languages Java, C and C++. Domain-specific languages naturally fall farther down the list: in this month's report, R is ranked at #25, with Matlab at #29 and SAS at #30. What's interesting is the movement: Matlab is down from #19 a month ago (and #20 a year ago), whereas R is up from #26. But look at SAS, mentioned in the report's summary as having "lost much ground": it's down from #16 a month ago and #14 a year ago.