This is Part 2 of a five-part article series, with new parts published each Thursday. You can download the complete article from the Revolution Analytics website.
Critical Mass and Going Viral
R was created in 1993 by Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand. It’s called R for the simple reason that both of its creators have first names beginning with the letter “R.” Some believe that R’s single-letter name represents a sort of homage to the S language, since R is an open-source descendant of S and much of the code written for S runs unaltered in R. The S language was developed by a Bell Labs team that included John Chambers, who won the prestigious ACM Software System Award in 1998 for that work.
Ihaka and Gentleman had set out to create a language that would make it easier for them to teach their introductory data analysis courses. But news of the new language spread quickly, and in 1995 they were convinced to make the R source code available under the terms of the Free Software Foundation's GNU General Public License. Their decision to share R freely was a seminal moment in the annals of analytic software development.
As interest in R surged, a core group of dedicated volunteers coalesced around the project. This core group of leading statisticians and computer scientists from around the world is now the project’s official leadership team. They are the guardians of the R language, and oversee changes and implementations of new features to the R source code on a six-month cycle. They also provide guidance, support and advice to R users through a very active mailing list.
“R got lucky because the core group is an extremely talented collection of statisticians who have a great vision and who really think about what they’re doing,” says Abhijit Dasgupta, a consulting biostatistician at the National Institutes of Health. “It also built up a very active user community around this core group. As a result, the customer support for R through various online forums is amazingly good. That’s one of the truly great benefits of R – it has a fantastic user community that keeps growing. It’s infectious.”
According to Muenchen, who has actually measured the popularity of data analysis software through the careful analysis of Internet traffic, “R is the most discussed software by roughly a two-to-one margin, followed by Stata then SAS.”
R’s skyrocketing popularity translates into more than just bragging rights – the R user community is now so large that it generates new R programs (called “packages”) at an astonishing pace. It’s almost as if the R community has achieved critical mass and been transformed into one gigantic, self-organizing virtual factory that produces new R software with clockwork regularity.
How does this self-organizing virtual software factory work? Let’s look at one more or less typical member of the R user community, Glenn Meyers.
Meyers is a vice president of research at ISO Innovative Analytics. He holds a bachelor’s degree in mathematics and physics from Alma College in Alma, Michigan, a master’s degree in mathematics from Oakland University, and a Ph.D. in mathematics from the State University of New York at Albany. He is a Fellow of the Casualty Actuarial Society and a member of the American Academy of Actuaries. If you’re a casualty actuary, you are probably already familiar with Meyers. He’s won many of the top actuarial prizes and awards, he gives speeches and he sits on international committees.
Meyers also writes a regular column for Actuarial Review. When he writes about a new technique for analyzing data, he often includes the code for the new method in his column. Most of his code is written in R, because R has become the common language – the lingua franca – of statistical analysis.
“You put out your R code and it becomes immediately usable,” says Meyers. “People download the code and just start using it.”
Large commercial software vendors will rarely develop new programs unless there’s a large enough market to justify their development costs. And it can take years for large vendors to bring new programs to market.
The R community, on the other hand, develops and releases new software continuously, thanks to the contributions of thousands of people like Meyers. “The most powerful reason for using R is the community,” he says.
Or as Robert Sudol of AllianceBernstein puts it, “the more people who use R, the more powerful it becomes.”
Sudol uses R to predict future economic trends by analyzing seasonal data and looking for patterns or anomalies. It’s a complicated job that requires creative thinking and improvisation. “If you’re trying to do something that’s not in the code set, you go out and find an R package ... and then you snap it right in and start using it,” Sudol explains. “There are so many people out there making modifications and enhancements, you’re going to find something you can use.”
Sudol says that he sees “a lot of parallels between R and Linux. R is becoming ubiquitous, so if you’re starting a huge project and you don’t have a lot of programmers ... you go to colleges and hire people who can be productive when they walk in the door. That’s a huge benefit.”
R also offers benefits to companies that are trying to reduce the amount of money they spend on renewing licenses with traditional enterprise software vendors such as SAS and SPSS. “The nicest thing about R is that it’s free,” says Sudol. Choosing an R package over a traditional software product can literally save you “hundreds of thousands of dollars,” he adds.