I had the great pleasure of sitting down for a beer with Steve O'Grady (from the open-source analyst group RedMonk) at the MySQL conference last week. It was great to get the perspective of someone who knows the tech industry so well, sees predictive analytics as a hot area, and is taking an active interest in statistics and R (Steve has been getting into R programming recently). I asked him why, among all the software tools available, he chose to learn R for predictive analytics. He answered with a great analogy to scuba diving, which he just shared on his blog:
I had the opportunity to dive in a lot of interesting places, from Key West to Cayman Brac to Bonaire to plain old Rockport, MA. One of the things I noticed was that most of the professionals, pretty much to a person, used the same BCD [scuba equipment]: workman-like, beat up Scubapro designs. Ugly, even industrial-looking, but functional. Day after day, dive after dive.
Which begged the question that so many ask themselves in so many industries: what did I know about diving that the professionals did not?
Exactly. My next BCD, which I still own today, was a Scubapro.
I relate this story here because I told it to REvolution Computing’s David Smith last week to explain our interest in the R language.
I love this analogy. R today may not be as pretty as some of the alternatives (though we do have big plans for REvolution R), but it sure is functional, reliable and powerful. And that's why the professionals are using it.
tecosystems: Watch the Professionals
Damn right it's not pretty. Even I get caught out on R quirks after 20 years of using it. Compare letters[c(12,NA)] and letters[c(NA,NA)] for the most recent thing that made me bang my head against the wall.
R gets used because of its functionality: the huge library of available packages, built up over its long history. R is the same generation as XLISPSTAT, but it seems that XLISPSTAT was too ugly even for statisticians, and it now seems to be a historical curiosity.
If I could transparently call all the functions in CRAN from Python I'd never write another line of S code again (Rpy is a start!).
Horrible ugly bodged-up historical mess of a language. But hey, Better Than FORTRAN!
Posted by: Barry | April 21, 2010 at 11:53
I wish I could find the quote or remember who said it, but it's along the lines of "Show me a language no-one has complaints about, and I'll show you a language no-one uses". Yep, R is quirky, but that's because of, not despite, the way it's used.
Posted by: David Smith | April 21, 2010 at 12:25
Isn't it considered a good argument that 'R is a free software environment' (http://www.r-project.org/)? I think that's one of the main reasons for using R.
Posted by: Locas Seltzer | April 21, 2010 at 12:46
Not sure I agree with your last sentence. The quirks surely come before the usage?
My example with 'letters' comes from a collision of three features - recycling of short subscripts, silent coercion of types (boolean NA to numeric NA), and the existence of five different NA values that all print the same.
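Two of those features, the five NA constants and the silent coercion, can be seen in isolation with a minimal base R sketch. The variable names here are illustrative and not from the original comment:

```r
# The five NA constants in base R, one per atomic type.
na_values <- list(NA, NA_integer_, NA_real_, NA_character_, NA_complex_)

# Each has a different underlying type...
na_types <- vapply(na_values, typeof, character(1))
print(na_types)

# ...yet every one of them prints as a plain NA.
for (v in na_values) print(v)

# Silent coercion: a bare NA takes on the type of its neighbours in c().
type_with_number <- typeof(c(1, NA))   # NA becomes a numeric (double) NA
type_alone       <- typeof(c(NA, NA))  # NA stays logical
print(c(type_with_number, type_alone))
```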
Have you read 'the zen of python'? (start python and type 'import this'). Many things in R violate those ideas, often in the 'Simple is better than complex' department.
For example, to really understand that letters[c(1,NA)] is different from letters[c(NA,NA)] you have to see that:
* in the first case, the NA is coerced to a numeric NA because it's in a vector with a numeric '1'.
* in the first case, you are selecting elements by supplying a vector of indexes
* in the second case, your NAs are boolean (logical) NA values
* hence your subscript is a logical vector
* logical vectors are recycled
* now your subscript is a logical vector (every element NA) of the same length as 'letters', so you get back 26 NAs.
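The steps above can be checked directly at the R prompt. This is a small sketch using only base R; the names idx_numeric, idx_logical, first and second are made up for illustration:

```r
idx_numeric <- c(1, NA)   # the NA is coerced to a numeric NA
idx_logical <- c(NA, NA)  # both NAs stay logical

first  <- letters[idx_numeric]  # positional indexing: two results
second <- letters[idx_logical]  # logical indexing: recycled to length 26

print(typeof(idx_numeric))  # "double"
print(typeof(idx_logical))  # "logical"
print(first)                # "a" NA
print(length(second))       # 26
```

The numeric subscript asks for two positions, one of them unknown, so it returns two values. The logical subscript is recycled along the whole vector, so it returns one NA per element of letters.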
Zen: "Simple is better than complex". However, subscript recycling is shooting you in the foot. Yes:
x[c(TRUE,FALSE)]
is a simpler-looking way to get the odd elements of a vector than:
x[rep(c(TRUE,FALSE),length(x)/2)]
but the simplest of all is:
oddElements(x) # to be written
Zen: "Readability counts"
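As a rough idea of what such a helper might look like, here is one possible sketch. oddElements is the hypothetical name from the comment above, not an existing base R function, and this is only one of several ways to write it:

```r
# Hypothetical helper: return the elements at odd positions (1, 3, 5, ...).
# Uses explicit positions rather than relying on logical-subscript recycling.
oddElements <- function(x) {
  x[seq_along(x) %% 2 == 1]
}

print(oddElements(letters[1:5]))  # "a" "c" "e"
print(oddElements(character(0)))  # empty input gives empty output
```

Building the index from seq_along(x) keeps the subscript the same length as x, so nothing is recycled behind the scenes, and an empty vector is handled without a special case.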
The R system is lovely, amazing, flourishing, but look at almost any R code and large chunks of it are probably quirks management :)
Posted by: Barry | April 21, 2010 at 12:55
Hi Barry,
Thanks for the very educational example.
I republished your example (with a reference back here) on my blog:
http://www.r-statistics.com/2010/04/the-difference-between-lettersc1na-and-letterscnana/
And would love to credit you with a link to your website (please contact me on the post for that)
Thanks again,
Tal
Posted by: Tal Galili | April 21, 2010 at 23:42
Thanks a lot for an interesting post. Now that I know what happens, I can avoid it.
Posted by: Bhoom | April 22, 2010 at 06:04