So when the world is taken over by a Zombie horde, you're going to want to figure out a way to get the human population to safety. This R script by econometrician Francis Smart won't help you do that exactly, but given a list of waypoints to navigate through zombie-infested lands to a safe house, it will tell you how many how many members of your human party survive:
While zombie-escaping might not seem to be a terribly useful application of R (until the outbreak comes, anyway), Francis's script comes with detailed comments that make it a useful example of simulation and animations for budding R programmers. (Try switching up the seed used in the set.seed call to get a new horde of zombies to play with, and experiment with new values of the waypoints variable to guide your hapless humans to safety.) It's also a great example of using the animation package to crate animated GIFs with R.
If you're looking for a programming challenge, an interesting experiment would be to write a function to automatically select waypoints on the route to safety. You could then run the simulation over and over (using a random seed, this time) and count the number of survivors to evaluate its effectiveness. Let us know what you come up with in the comments.
Even a casual glance at the R Community Calendar shows an impressive amount of R user group activity throughout the world: 45 events in April and 31 scheduled so far for May. New groups formed last month in Knoxville, Tennessee (The Knoxville R User Group: KRUG) and Sheffield in the UK (The Sheffield R Users). An this activity seems to be cumulative. This month, the Bay Area R User’s Group (BARUG) expects to hold its 52nd and 53rd meet ups while the Sydney Users of R Forum (SURF) will hold its 50th. Everywhere R user groups are sponsoring high quality presentations and making them available online, but the Orange County R User Group is pushing the envelope with respect to sophistication and reach. Last Friday, I attended a webinar organized by this group where Professor Trevor Hastie of Stanford University presented Sparse Linear Models with demonstrations using GLMNET. This was a world-class presentation and quite a coup for Orange County to have Professor Hastie present.
The glmnet package written Jerome Friedman, Trevor Hastie and Rob Tibshirani contains very efficient procedures for fitting lasso or elastic-net regularization paths for generalized linear models. So far the glmnet function can fit gaussian and multiresponse gaussian models, logistic regression, poisson regression, multinomial and grouped multinomial models and the Cox model. The efficiency of the glmnet algorithm comes from using cyclical coordinate descent in the optimization process and from Jerome Friedman's underlying Fortran code.
Although Professor Hastie’s presentation was primarily concerned with fitting models for the wide problem (the number of explanatory variables is much larger than the number of observations) the lasso and elastic-net algorithms are just as applicable to data sets with large numbers of observations. It is likely that in the future we will see glmnet implementations for variable selection on datasets with thousands of variables and hundreds of millions of observations. The following graph shows the regularization paths for the coefficients of a model fit the HIV data from one Professor Hastie’s examples.
Each curve represents a coefficient in the model. The x axis is a function of lambda, the regularization penalty parameter. The y axis gives the value of the coefficient. The graph shows how the coefficients “enter the model” (become non-zero) as lambda changes. The following code, based on an example from the webinar, produces the plot and also shows how easy it is to perform cross-validation.
library(glmnet)# load the package load("hiv.rda")# HIV dataclass(hiv.train) # The data are stored as a list names(hiv.train) # The names of the list elements are x and y dim(hiv.train$x) # The explanatory data consists of 704 observations of# 208 binary mutation variableshead(hiv.train[[1]])# Look at the explanatory datahead(hiv.train[[2]])# Look at the response data: changes in susceptibility to antiviral drugs
fit=glmnet(hiv.train$x,hiv.train$y)# fit the modelplot(fit,xvar="lambda", main="HIV model coefficient paths")# Plot the paths for the fit
fit # look at the fit for each coefficient#
cv.fit=cv.glmnet(hiv.train$x,hiv.train$y)# Perform cross validation on the fited modelplot(cv.fit)# Plot the mean sq error for the cross validated fit as a function# of lambda the shrinkage parameter# First vertical line indicates minimal mse# Second vertical line is one sd from mse: indicates a smaller model# is "almost as good" as the minimal mse model
tpred=predict(fit,hiv.test$x)# Predictions on the test data
mte=apply((tpred-hiv.test$y)^2,2,mean)# Compute mse for the predictionspoints(log(fit$lambda),mte,col="blue",pch="*")# overlay the mse predictions on the plotlegend("topleft",legend=c("10 fold CV","Test"),pch="*",col=c("red","blue"))
Don’t be content with this partial example. Professor Hastie and The Orange County R User Group have graciously made the slides, code and data available at this link. The webinar is well worth watching in its entirety.
As you might expect, Professor Hastie gives a masterful presentation: lucid, clear and succinct. This is inspite of the fact that Professor Hastie begins the presentation by commenting that it was his first webinar ever and that he was a little uncomfortable talking to his screen. (I think anyone who has ever given a webinar can relate to this: you talk to the screen and no energy from the audience comes back. Nothing is more disruptive to efforts to be enthusiastic than silence.) Nevertheless, Professor Hastie presents a difficult topic with a clarity that carries his audience along, and he is completely unphased by the inevitable glitch. Watch how he handles the upside down slide. You can download his slides, R scripts and data from the link below.
Saturday morning I was drinking my coffee wondering how much effort goes into R worldwide. (It’s my job.) I noticed that there were 4469 packages on CRAN, and it occurred to me that tabulating the packages by publication date would give some indication of how much effort is being expended to improve packags and keep them up to date. With very little work at all I was able to read the table on the Available CRAN packages by date of publication page and produce this plot.
Maybe I should not have been, but I was surprised to see that most CRAN packages were either created or updated in the last year or so. Apparently, only 264 packages haven’t been touched since 2010 or before. (If you are ever worried about whether an older package is going to work for you, go to the CRAN checks page for the package and look at the notes. For example, the CRAN check for vioplot, the current longevity record holder, look just fine to me .)
This is astounding! Like most people, I suppose, I tend to use only a small number of packages on a regular basis. I’m clueless about what most of the other packages do, and don’t think much about them. But they are all meaningful to somebody, probably thousands of somebodies, and a tremendous number of hours are being spent to keep them current and improve them. So the next time a colleague refers to R itself as a “statistical package” find a way to make this point: “Package” sounds so small, “wrapped up”, and done, but R is never done, it is a language that is constantly changing, improving and increasing its expressive power through the ongoing efforts a global, large scale engineering effort. The success of R is in no small part due to the mechanism that the R core group implemented to enhance the language through the parallel, asynchronous efforts of the package developers.
A little munging (download packages-post.r) on the CRAN Package Check Results page shows that there are at least 2,596 package maintainers active now. Since many packages have multiple authors this number is probably way less than the number of package developers. Nevertheless, it gives some idea, a lower bound, on the number of people that are actively involved in creating and maintaining R packages. Also, I might be wrong, but I am guessing that people who volunteer to maintain a package are also involved in improving it. So the following plot that lists the developers who are maintaining at least 10 packages must be the trace of untold hours of “after hours”, solitary work. I think we all owe a tremendous debt of gratitude to these R superstars and all the other package developers. After all, it’s not their job.
Baseball fans have been serious about statistics since Carl Pearson was a young man (although I doubt that Carl followed the game). It is not clear, though, exactly when baseball statisticians moved from doing descriptive stats into predictive analytics. In his book Super Crunchers, Ian Ayers credits Bill James of Baseball Abstract fame for getting this particular ball into play. And of course, Michael Lewis' Moneyball brought the power of predictive statistics to reshape the game of baseball itself to the public’s attention in a very dramatic way.
Moreover, baseball statistics are likely to become even more dramatic in the future. For a look at how companies like Sportvision are building the infrastructure to overlay a TV image of a batter with a heat map of the sweet spots of his strike zone and the like, take a look at the presentation Graham Goldbeck made to the Bay Area User Group last October.
So, today, it is widely recognized that a deep understanding of the venerable American pastime requires a fairly high level of statistical play. Those of us blogging at Revolution Analytics have encouraged testing assumptions, looking for patterns and making predictions with several baseball related posts including:
But now, Professor Michael Friendly of Canada’s York University has taken the analysis of the game to a new, more approachable level by wrapping up Sean Lahman’s Database into the R package Lahman. Version 2.0-1 is available on R-Forge and should be up on CRAN by he end of the week. Michael recently wrote me that "the original motivation was to provide a comprehensive, R version of data on baseball statistics for an annual project I run in a graduate course on multivariate data analysis". (This project is actually a pretty cool thing itself. Students hone their data analysis and conference presentation skills by preparing papers on topics related to baseball statistics for presentation to the prestigious but fictional "Hotelling Society".)
The file Lahman Data Sets (Download Lahman Data Sets) lists the 25 data sets that are available in the Lahman package. Note that some of these are of a pretty good size: Batting contains 96,600 rows and 24 columns while Fielding is a 164,898 by 18 table. The file 400 Hitters Plot (Download 400 Hitters Plot) contains Michaels code for the following plot which is big league (geekier) version of the graph that appeared in the New York Times two years ago.
Looking at the way this curve breaks I find myself vacillating between a sense of awe at the accomplishments of Ty Cobb and Rogers Hornsby and wondering how much money could be made by explaining the dip and rise of the curve. In any event, I'd like to think that, the Lahman package will forever link Spring, baseball and lazy afternoons of doing some stats with R.
Michael also indicated that he would welcome contributions to the Lahman package project. So, if you have some examples that you would like to share, or you would like to write some code please sign up on R-Forge.
The Washington Post reports that by analyzing more than 10 million emails sent through the Yahoo! Mail service in 2012, a team of researchers used the R language to create a map of countries whose citizens email each other most frequently:
The chart above shows the top 1000 country-country pairs by email frequency, arranged in a clustered network using the R igraph package. While it's not surprising that countries that share proximity, language and culture might tend to email each other more often, it's interesting that the clusters tend to resemble those of Huntington's Clash of Civilizations, whose "major civilizations" are used to color-code the map. (A formal test of similarity was done using R's SNA package, with country-specific factors corrected for using a linear mixed-effects model.) On this point, the authors write:
In this respect we cautiously assign a level of validity to Huntington’s contentions, with a few caveats. The first issue was already mentioned – overlap between civilizations and other factors contributing to countries’ level of association. Huntington’s thesis is clearly reflected in the graph [above], but some of these civilizational clusters are found to be explained by other factors ... The second limitation concerns the fact that we investigated a communication network. There is no necessary “clash” between countries that do not communicate, and Huntington’s thesis was concerned primarily with ethnic conflict.
Indeed, the validity of Huntington’s ideas with respect to ethnic conflict has come into controversy, and we limit ourselves to showing the validity — at least partial — of this division for communication networks.
The Washington Post article (linked below) has more discussion about the research, and you can find details of the methodology in the arXiv paper.
In a guest post here on February 20, Tammer Kamel introduced us to Quandl, a kind of "wikipedia" of time series data. In the post, Tammer (the founder of Quandl) noted that they were working on an R package to give R users access to Quandl as a data source.
That package is now available. It includes the Quandl function, which R users can give a time series identifier, a start date and and end date and get an R time series object for plotting or data analysis. Since all data on Quandl is free and open, R users can use this function to access and use time series data without restriction. Enjoy!
Anyone interested in playing around with the data generated by the PITCHf/x cameras at major league baseball games should definitely check out the pitchRx package from Carson Sievert. Major League Baseball Advanced Media makes the data available for download, and this package provides an interface from R to the speed, position and pitcher data for just about every MLB pitch. This introduction to the pitchRx package provides details on how to grab data, subset data by pitcher or team, and visualize strike zones. But one of the coolest things you can do is create animations, like the one below of every 'four-seam' and 'cutting' fastballs thrown by Mariano Rivera and Phil Hughes during the 2011 season:
If you're a baseball fan, this package will be a blast to play with. Download it from CRAN now (or just type install.packages("pitchRx") in R), and try some of the examples from the link below.
If you're writing any significant amount of R code, you might want to start think about bundling it up into packages. An R package combines functions, data, documentation and unit tests, and is a convenient and reliable system to manage and version collections of R content that could otherwise become unwieldy. And if you want to share your code with colleagues or even the entire world via github or CRAN, an R package is the best way to do it.
The canonical documentation for building an R package is the Writing R Extensions Guide that comes with the R distribution. Following all of these instructions by hand can be a little complicated, but fortunately Hadley Wickham and his colleagues at RStudio have created the devtools package to make the process simpler.
The devtools package makes it much easier to create a package following the structure defined in the Extensions Guide. It also makes it easier to test your package, using the unit tests you've been creating. (You have been writing unit tests, right? Including unit tests in a package is an excellent idea: not only does it make your package more reliable in the long run, it also simplifies the development process in surprising ways. If you're not writing unit tests, definitely give it a go.)
Another good practice is to manage your package code in a source-code control system. (Packages have their own version numbers, which you can use to keep track of updates to your code.) If you're using github to manage your code, you can also install your package directly from github with the install_github command (this is also a handy way to share your package with others). And if you want to share your package to the entire world, devtools can also test if your package is ready to submit to CRAN (though be sure to read the CRAN submission guidelines first).
You can find more information about devtools at the package homepage linked below.
Arc diagrams are an alternate way of representing two-dimensional graphs. Rather than scattering the nodes across the page connected by straight edges, you can instead arrange the nodes along a one-dimensional axis, and replace the straight edges with arcs between the nodes. While an arc diagram might not give as good a sense of the connections between the nodes as a traditional graph layout, judicious ordering of the nodes can help identify clusters. It's also a useful format (especially in the vertical orientation) when you want to label each of the notes with other quantities in a table-like format.
Thanks to Gaston Sanchez, you can now create arc diagrams in R with the arcdiagram package. Gaston created the arc diagram below to visualize the characters in the Victor Hugo classic (and now a major motion picture) Les Miserables. Each character is connected by an arc if they appear together in the same chapter; the wider the arc, the more the characters appeared in chapters together. The ordering (and colour) of the nodes identify groups of characters that appear in the novel together.
You can find the complete code to create the arc diagram above, along with details on how to install the arcdiagream package, at Gaston's blog, Data Analysis Visually Enforced.
An extension to arc diagrams is the hive plot, where instead of the nodes being laid out along a single one-dimensional axis they are laid out along multiple axes. This can help reveal more complex clusters (if the nodes represent connected people, imagine for example laying out nodes along axes of both "income" and "enthicity"), and is a particularly useful way of visualizing graphs with many nodes and edges that look like a dense "hairball" using traditional graph layouts. Here's an example of a hive plot:
The above plot comes from the Hive Plots homepage, and shows the connections between similar genes (nodes) in three related genomes (SL, BA and SN). You can create hive plots in R using the hiveR package.