In a presentation to the Chicago R User Group last night, Drew Conway used his new Infochimps package in R to assess the relative popularity of programming languages. Drew used the word.stats function in the Infochimps package to count the frequency of common computer languages mentioned in Twitter messages, and displayed the results in this bar chart:
It's not perfect: languages like C and C++ are excluded because they're impossible to search for, "ada" is excluded because it's ambiguous (and otherwise that niche language would be ranked most popular), and R is measured by the frequency of its community Twitter hashtag #rstats and not the letter R. But it's interesting nonetheless. There's lots more info about Infochimps in general, and how this chart in particular was created, in the slides downloadable from Drew's blog.
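To make those caveats concrete, here is a minimal sketch of this kind of mention counting. It is not the Infochimps word.stats function and does not call any Infochimps API; the pattern table, the count_mentions helper, and the sample messages are all hypothetical, written in Python purely for illustration. It counts how many messages mention each language, uses the #rstats hashtag as a stand-in for R, and leaves out terms such as C, C++, and "ada" altogether.

```python
import re
from collections import Counter

# Hypothetical search patterns, one per language. R is matched via its
# community hashtag, and C, C++ and "ada" are deliberately left out.
PATTERNS = {
    "python": r"\bpython\b",
    "ruby": r"\bruby\b",
    "java": r"\bjava\b",
    "javascript": r"\bjavascript\b",
    "r": r"#rstats\b",
}


def count_mentions(messages, patterns=PATTERNS):
    """Count how many messages mention each language at least once."""
    compiled = {lang: re.compile(p, re.IGNORECASE) for lang, p in patterns.items()}
    counts = Counter({lang: 0 for lang in patterns})
    for message in messages:
        for lang, regex in compiled.items():
            if regex.search(message):
                counts[lang] += 1
    return counts


# Made-up sample messages, not real Twitter data.
sample_messages = [
    "Learning Python this weekend",
    "I love R",  # the bare letter R is not counted
    "New #rstats package released",
    "Java vs JavaScript is a classic confusion",
    "ruby and python both have nice REPLs",
]

counts = count_mentions(sample_messages)
for lang, n in counts.most_common():
    print(lang, n)
```

The word-boundary anchors keep "java" from matching inside "javascript", which is the same kind of disambiguation problem that rules out a single-letter search for C or R.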
Another way to look at programming language popularity is the frequency of mentions on two popular programmers' resource sites. In a post at the Dataists blog, Drew Conway (again) and John Myles White used R and the XML package to extract the number of questions on stackoverflow.com and number of projects on github.com for about 50 programming languages, and plotted the results in this scatterplot:
As you can see, R ranks higher than the median for GitHub projects and quite a lot higher for Stack Overflow questions.
So R is doing quite well amongst programming languages in general. As a specialized statistics language, a more relevant comparison may come from looking at tags at the statistical question-and-answer site stats.stackexchange.com, where R currently has 260 questions compared to 6 for SAS and 22 for SPSS.
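As a quick illustration of that comparison, the sketch below turns the three tag counts quoted above into shares of their combined total. The tag_shares helper is a hypothetical name, and the shares are relative to these three tags only, not to all questions on the site.

```python
# Tag counts as quoted in the post for stats.stackexchange.com.
tag_counts = {"R": 260, "SAS": 6, "SPSS": 22}


def tag_shares(counts):
    """Return each tag's fraction of the combined count."""
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


shares = tag_shares(tag_counts)
for tag, share in shares.items():
    print(f"{tag}: {share:.1%}")
```

Of the 288 questions carrying one of these three tags, roughly 90% are tagged R.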
Thanks for posting. Always interesting to look at.
As a rubyist and schemer, I would love it if this were true. But the conclusions are almost preordained because of the data sources chosen.
Posted by: Kevin Taylor | December 17, 2010 at 11:45
Although to be fair, I know SAS has a lot of white papers and discussion boards within its own website. Given that it's proprietary software, there's no reason to think that it will have the same traffic on StackOverflow. Back when I used SAS on a daily basis, a Google search for SUGI plus my question usually brought up dozens of excellent papers.
But anyways: Go R and #rstats !!!
Posted by: Matt Moehr | December 17, 2010 at 11:47
This is an interesting post, given that it comes from people who are in the statistics space. Both Github and Twitter are very well respected, but neither can reasonably (let alone quantitatively) represent the population in order to derive the ranking you discuss.
Both cases have very significant bias because of several well-known factors. For instance, Twitter is a big Ruby shop, so it is not surprising that Ruby is over-represented on Twitter (nothing against Ruby, it's one of my favorites). It is also (duh!) a big Web shop, introducing bias toward any web-related languages such as JavaScript or PHP.
Similarly github (of which I am a very happy user) represents a very particular slice of the programming community with its own set of biases.
Finally, one of the longest running rankings of programming language popularity is the TIOBE index (http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html), which posts very different rankings than what you post here for similar periods.
Instead of being an interesting post dealing with data collection methodology, disambiguation, and figuring out the bias in your sample, this post takes the easy route of making pretty pictures and ignoring the real substance. This is something I would expect of the general press, but not from an R-centric company.
Posted by: Arnon Moscona | December 20, 2010 at 09:32