A recent article by Matt Asay claims that "Python is displacing R as the language for data science". Python has certainly made some great strides in recent years, evolving beyond a data processing tool (an area where Python excels) to a data analysis tool. The Pandas project, in particular, has greatly expanded Python's ability to handle statistical data sets (introducing an object akin to R's data frame), and added some time series handling tools. But Python is still a long, long way from being able to support the range of statistical procedures supported by the core R language, let alone those provided by the 5000 community-contributed packages in CRAN.
Asay's article is heavy on anecdote but light on actual data to support its claim. (ComputerWorld's Sharon Machlis does a great job pointing out the irony there.) Nonetheless, data do exist on R and Python usage; while there's no user-registration data for open-source projects, secondary sources can provide intelligence on how open source projects are being used. RStudio's Hadley Wickham uses data from the developer Q&A site StackOverflow to chart the number of open questions asked per month about R and Python, as a proxy for active usage:
As a general-purpose data processing tool, Python unsurprisingly has more activity than the domain-specific analytics language R. But it's clear that both are growing explosively (Wickham describes the growth as "very close to being exponential"). Looking closer, though, we see that the number of R questions, as a fraction of Python questions, is also growing rapidly:
This belies the claim that Python is displacing R. In fact, this chart suggests the reverse is true, and that R usage is growing at a faster rate than Python usage.
More data points come from user surveys. In the 2013 KDNuggets poll of top languages for analytics, data mining and data science, R was the most-used software for the third year running (60.9%), with Python in second place (38.8%). More tellingly, R's usage grew about three times as fast as Python's in 2013 versus 2012 (8.4 percentage points for R, compared to 2.7 percentage points for Python).
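For readers who want to check the arithmetic behind the poll comparison, here is a minimal sketch in Python. It uses only the four figures quoted above. The 2012 shares are not quoted in the post, so the code derives them on the assumption that each percentage-point gain is simply the 2013 share minus the 2012 share. The variable and function names are purely illustrative.

```python
# Figures quoted from the 2013 KDNuggets poll (percent of respondents).
share_2013 = {"R": 60.9, "Python": 38.8}
# Year-over-year gains in percentage points, as quoted in the post.
gain_pp = {"R": 8.4, "Python": 2.7}


def implied_2012_share(lang):
    """2012 share, ASSUMING gain = 2013 share - 2012 share (not stated in the poll summary)."""
    return share_2013[lang] - gain_pp[lang]


def relative_growth(lang):
    """Gain expressed as a fraction of the implied 2012 share."""
    return gain_pp[lang] / implied_2012_share(lang)


# How many times larger R's percentage-point gain is than Python's.
pp_ratio = gain_pp["R"] / gain_pp["Python"]

if __name__ == "__main__":
    print(f"Ratio of percentage-point gains (R / Python): {pp_ratio:.2f}")
    for lang in ("R", "Python"):
        print(f"{lang}: implied 2012 share {implied_2012_share(lang):.1f}%, "
              f"relative growth {relative_growth(lang):.1%}")
```

The two measures answer slightly different questions: the ratio of percentage-point gains compares absolute changes in share, while the relative growth figure scales each gain by where that language started.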
It's a similar story on the community side. R has more than 125 active user groups worldwide, and the number of user group meetings has increased by 41% in the last year. Python has around 400 user groups (I couldn't find stats on the growth rate), but RedMonk's Stephen O'Grady compares the communities devoted to data science:
At RedMonk, we typically bet on the bigger community, but that’s not as easy here. Python’s total community is obviously much larger, but it seems probable that R’s community, which is more or less strictly focused on data science, is substantially larger than the subset of the Python community specifically focused on data.
My personal take is that there's more than enough room for both Python and R. As the data science boom continues, both will continue to grow as more and more practitioners enter the world of statistical computing. Some (especially those who come from a computer-science background) will choose Python. And those who come from a statistics or data science background will choose R (or will have already learned R in their studies). And even some who come from the die-hard developer community will end up loving R. But both communities will continue to advance the art of data science, and as open-source communities will inevitably cross-pollinate each other. R has already influenced Python in the realm of data analysis, and it would be no bad thing if Python were to influence R in other areas. That, after all, is the beauty of open source software.
SO questions are a poor proxy for usage. If you employ your stats skills to normalize for sample bias, I think you'll be surprised. Python is just plain easier to use without outside assistance... more questions are answered by the docs. And the code is more comprehensible and readable (since this was the goal for Python as a language and this was not the primary goal during design of R).
Posted by: Hobson Lane | December 06, 2013 at 12:06
I'm a huge Python lover, but this is really just Matt talking about something he has no clue about. As I said in a tweet [1] in the Twitter thread, and I stand by this:
"Sorry to say this but this is a typical 'God knows everything and @mjasay knows everything better' ;)"
Cheers,
Michael
[1] https://twitter.com/mhausenblas/status/405941892187041793
Posted by: Michael Hausenblas | December 06, 2013 at 12:30
While the statement of growth close to being exponential is true, I think it is true for only one of the above ..... I have to agree with Matt Asay, the one with exponential growth will replace the other in time .... looking around the innovation ..... unhindered by "grumpy" ..... and "unspoilt by progress" (I have stolen a catch phrase from a beer company) :)
Posted by: Andy Frost | December 06, 2013 at 18:16
Dear David,
Thank you for writing this VERY important post. I greatly appreciate the passion and commitment by which you help and nurture the R community.
Yours,
Tal
Posted by: Tal Galili | December 07, 2013 at 01:19
Interesting post, David. Thanks.
The growth in R usage also matches with what we have seen in our 2007-2013 Data Miner Surveys. We recently released the latest summary report, and we have included several pages that describe the skyrocketing growth in R usage. The highlights are here: http://www.rexeranalytics.com/Data-Miner-Survey-Results-2013.html. Anyone who wants a copy of the FREE 41-page report can contact us at [email protected].
Happy Holidays, everyone!
-- Karl
Posted by: Karl Rexer | December 07, 2013 at 09:46
I use R, and not Python, for this type of analysis... but using support questions to proxy usage is not without problems. Critics would probably say that Python usage is underreported, since Python has "one obvious way to do things". This seems like a good first step, but a nice extension would be to find some other proxies for usage which aren't conflated with ease of use.
Posted by: Scott Porter | December 09, 2013 at 08:34
We can fight about R vs Python AFTER we save humanity from SAS and Matlab. For now, let's just make both amazing!
PD: I use more python :P
Posted by: Daniel Rodriguez | December 09, 2013 at 09:01
Everyone thus far has been so nice about reflecting upon Asay's article. I applaud you all. My own character flaw becomes exposed when I tend to lose my poker face as I have an extraordinary distaste for baseless, FUD'ish blogging.
The issue I take with his article isn't so much the nature of his argument as how he tries to forward it. Does it not occur to him that someone who is allegedly a 'data science' mogul, and who makes a positional statement yet provides no supporting evidence via his own craft, might face a bit of a credibility problem?
Of course, what he misses is the real context in which his argument resides. From a purely fundamental statistical and more generally scientific viewpoint, one cannot compare the outcomes of an apple and orange simply by their visual attributes (such is one particular grudge I have in the infographics world today - another kvetch for another time). Naturally he would have had to look at the *use case* as a stratum to add at least some substance to his thesis. That is, he needed to compare the intersection where pythonistas and r users (and even dual users) converge.
Programmatically (not syntactically), R and Python have several points of congruence. They are both multi-paradigm: array (of the 2, R is more suited to vector programming), object-oriented, imperative, functional, procedural, reflective (thank you Wikipedia for that nice summary so I didn't have to recall that from dusty texts). Technically, they can be used for similar things. Nothing new there.
By contrast, R does not have as strict a typing discipline as Python, which can be both a strength and a weakness depending again on your use case.
The syntax and best coding practices are indeed different between R and Py. For those coming from purely OOP studies and experience with more base languages (C++, etc), yes there will be plenty of gripes. Gee, we've never seen THAT before - yet these languages/environments persist and have their places, just as R and Python do (imagine those same folks being forced to learn SAS - I imagine the suicide rate in the world would consequently increase fourfold :) ).
However, if the objective includes time-to-model development, and a more primary focus on method, then R is far more mature in this regard.
Note I didn't say better, worse, etc (my mention of SAS unequivocally being an exception :) ).
I personally use BOTH R and Py in my work, depending on the use case. I use other programming environments as well for the same reason - I don't believe that a single technology stack is a determinant of 'better or best' in DS.
Sure, it would be interesting to see an R/Py or some other hybrid to test R's mettle, where code discipline is a bit more unified and translatable, with the addition of every scientific package imaginable, with vector based programming, and better scalability. Change can be good, and is important.
But my guess is, you'd just have a mash-up where each of R and Py, or whatever else would retain their own characteristics. Hmmm I wonder about that RPy package thingy they have out there :). Even better, ever use RevoDeployR? No, I don't see either language being supplanted by the other - or a 'better' or 'worse' overall language in general. That's very myopic thinking. I believe Ruby was one such attempt at this experiment - and it certainly has its following, but it most certainly didn't diminish the importance of any of its component language contributors.
---------------Let me diverge here ---------------
So I'm kvetching about one comparatively small issue in the universe... blame my genes on that one :). But I do believe that Asay's blog is a small contributing part to a much larger problem in the 'data science' arena. It's as if there's this rather muted 'mortal combat' between those who are good at dangling shining lights, and those who are genuinely, measurably, and meaningfully impacting their objects, and the conglomerative discipline itself.
His blog resembles very much in my mind the article "The Death of the Statistician" . You can google the title itself and find other references which seem to indistinctly draw these boundaries between 2 different disciplines trying to achieve similar final objectives.
Where does this myopia come from? That's the easy one: economic/power advantage. This is nothing new of course, either conceptually or historically. However, when this behavior is extended into the scientific research world itself, many new problems emerge - mostly in the realm of general credibility and value (consider, for example, clinical science, big pharma, and medical device histories :) ) - not a good thing.
The data science world and all related parties should be *very* concerned about this (even as small as the aforementioned blog) and should consider appropriate actions before the broad sweeping black eyes begin, affecting the credibility of the whole. We have much to do in this 'storming and norming' in each of our areas surrounding the rather infant face of data science to deal with this. Any science must maintain not only its creativity, but also its rigor, within its methods and within its ranks, if it is to be a science at all.
Many thanks to David for publishing this in a far better manner than I just did :)
Posted by: plus.google.com/118434202516116757962 | December 09, 2013 at 15:18
Yes, using Python you can technically do anything that you could do in R. You can also do anything Python could do using C. Plus, C has more users than does Python. So, C must be best for data analysis, right?
Well, sometimes C is the right choice, depending on exactly what's being done. All of these, and more, could be used in a good data analysis workflow. Even Excel, Tableau, SPSS and Stata have their place when used properly and to their comparative advantage.
Posted by: Jason | December 09, 2013 at 21:30