By David Smith
I was on a panel back in 2009 where Bo Cowgill said, "The best thing about R is that it was written by statisticians. The worst thing about R is that it was written by statisticians." R is undeniably quirky — especially to computer scientists — and yet it has attracted a huge following for a domain-specific language, with more than two million users worldwide.
So why has R become so successful, despite being outside the mainstream of programming languages? John Cook adeptly tackles that question in a 2013 lecture, "The R Language: The Good, The Bad, and The Ugly" (embedded below). His insight is that to understand a domain-specific language, you have to understand the domain, and statistical data analysis is a very different domain from systems programming.
I think R sometimes gets a bit of an unfair rap for its quirks, but in fact these design decisions — made in the interest of making R extensible rather than fast — have enabled some truly important innovations in statistical computing:
- The fact that R has lazy evaluation allowed for the development of the formula syntax, which is so useful for statistical modeling of all kinds (see the first example below).
- The fact that R supports missing values as a core data value allowed R to handle real-world, messy data sources without resorting to dangerous hacks (like using zeroes to represent missing data); see the second example below.
- R's package system — a simple method of encapsulating user-contributed functions for R — enabled the CRAN system to flourish. The pass-by-value semantics and the naming convention for function arguments also made it easy for R programmers to create functions that others could easily use (see the third example below).
- R's graphics system was designed to be extensible, which allowed the ggplot2 system to be built on top of the "grid" framework (influencing the look of statistical graphics everywhere).
- R is dynamically typed, allows functions to "reach outside" of their own scope, and treats everything as an object — including expressions in the R language itself. These language-level programming features allowed for the development of the reactive programming framework underlying Shiny (see the fourth example below).
- The fact that every operation in R is a function call — including operators — allowed for the development of new syntax models, like the %>% pipe operator in magrittr (see the fifth example below).
- R gives programmers the ability to control the REPL, which allowed for the development of IDEs like ESS and RStudio.
- The "for" loops can be slow in R which ... well, I can't really think of an upside for that one, except that it encouraged the development of high-performance extension frameworks like Rcpp.
Some languages have some of these features, but I don't know of any language that has all of them — probably with good reason. But there's no doubt that without these qualities, R would not have been able to advance the state of the art in statistical computing in so many ways, and attract such a loyal following in the process.
Great talk, and I agree on many things. I would also like to point out that Andrie de Vries and I wrote a book on R from a programming perspective.
I know, it doesn't sound like it, but "R for Dummies" is an introduction to R that I use in my Statistical Computing classes.
Posted by: Joris Meys | June 09, 2015 at 01:17
Interesting. May I propose ANKHOR FlowSheet as a powerful addition to, or replacement for, some R projects?
We have since moved completely from R to ANKHOR. Their R plugin helped us a lot during the transition.
Although some of our programmers were crying ("we want to WRITE our code"), everybody is now a convinced FlowSheeter.
The fact that it utilizes all CPU cores (and the GPU) makes it a wonderful tool for all kinds of data-centric applications.
One of the best features (and probably a unique one) is the direct inspection of the results of an operation (even the intermediate results in a loop). THAT is much more powerful than any traditional debugging method.
Just my 2 cents...
Posted by: D | June 14, 2015 at 23:16
"The "for" loops can be slow in R which ... well, I can't really think of an upside for that one, except that it encouraged the development of high-performance extension frameworks like Rcpp."
IIRC, one is not supposed to use 'for' loops. Vectorize and use tapply, lapply,...
If you are doing your own 'for' loops, you should stop and reconsider.
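For example (a minimal sketch), the same computation written three ways:

```r
# An explicit loop...
squares_loop <- numeric(10)
for (i in 1:10) squares_loop[i] <- i^2

# ...the apply-family equivalent...
squares_apply <- vapply(1:10, function(i) i^2, numeric(1))

# ...and the fully vectorized version.
squares_vec <- (1:10)^2

identical(squares_loop, squares_apply)  # TRUE
```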
Posted by: Barry | June 18, 2015 at 15:23
There are a lot of complaints about this 'slowness', but often the user can speed things up by writing better code. R is not inherently parallel, even if it has libraries and functions that are. So, if you use a loop or an apply function, it is not automatically parallelized. Another reason why we're using ANKHOR analytics now is that it spreads and executes loop calculations automatically across all cores. Approved developers can get Grid server nodes. If you have complex, long-running jobs, a special shared loop operator sends the jobs to the grid nodes and collects the results for you without any extra effort :)
Posted by: D | June 19, 2015 at 00:20
If one is wondering whether there is an upside to the slowness of for-loops in R, perhaps it is the pragmatic effect of forcing users to distinguish, almost immediately, between serial and parallel data operations. That is, as a rule of thumb, if you cannot implement a certain algorithm in R without a for-loop, it cannot be parallelized as-is across multiple cores/computers, whereas e.g. lapply operations are usually "embarrassingly parallel". Because for-loops are likely to be annoyingly slow on any problem large enough for speedups to matter, the programmer will immediately consider whether they can switch to a vectorized version or an apply-style function. Recognizing these different problem characteristics becomes a valuable habit once you regularly have to deal with big enough data.
But some would argue it's silly to say "this obstacle forces you to make the progress necessary to deal with the obstacle, therefore it's an advantage".
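A minimal base-R sketch of the serial vs. parallel distinction described above (the expensive task is just a stand-in):

```r
library(parallel)

expensive_task <- function(i) {
  # stand-in for an independent, long-running computation
  mean(rnorm(1e5))
}

# Serial: one element after another.
res_serial <- lapply(1:8, expensive_task)

# Parallel: same call shape, but the independent iterations are spread over cores.
cl <- makeCluster(2)
res_parallel <- parLapply(cl, 1:8, expensive_task)
stopCluster(cl)
```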
Posted by: Asaf | June 23, 2015 at 14:29