When R is brought up among non-R users as a possibility for doing statistics, data mining, or any sort of predictive analytics, someone will invariably point out that R has a “steep learning curve”, and the response among those gathered usually includes a significant amount of head nodding. Even those who have put in heroic efforts to help people learn R sometimes say scary things. For example, in his introduction to Rattle, a data mining GUI for R, Graham Williams writes:
R offers a breadth and depth in statistical computing beyond what is available in commercial closed source products. Yet R remains, primarily, a programming language for the highly skilled statistician, and out of the reach of many. (The R Journal Vol. 1/2, December 2009)
Are things really that bad? Is R that difficult to learn? I think not, unless you take the statement to refer to the absolute worst-case scenario you can think of: the effort involved in learning R for a person who has no background in statistics or data analysis and absolutely no interest in learning R. In this context, the statement that R has a steep learning curve conveys the same truth as the assertion that Chinese or Italian or Japanese or English or any other language is difficult to learn if you don’t know anything about the people who speak it, don’t want to know, and wouldn’t have an opportunity to practice the language anyway. There is some truth to this, but so what? Context is everything.
If you have some background in statistics and a desire to learn more, learning R is not going to be an insurmountable problem. Once you have some version of R installed on your favorite computing platform, any book that provides carefully worked examples of the kinds of statistical analyses that interest you should be all that is required to make you productive. These days, there are dozens or maybe even hundreds of statistics books that use R. Two of my personal favorites are John Fox’s classic An R and S-Plus Companion to Applied Regression (Sage Publications, 2002) and Data Analysis and Graphics Using R: An Example-Based Approach (Cambridge Series in Statistical and Probabilistic Mathematics) by John Maindonald and W. John Braun (2010). Books, of course, are only the tip of the iceberg of the resources available for the statistical cognoscenti to learn R. Have a look at the links on Inside-R as places to start on the web.
This discussion does raise the issue of what one means by learning a language. Knowing how to run the R scripts required to get some simple analysis done may not entitle you to say that you know R (significantly more is required to become an R developer), any more than understanding enough Italian to get around Rome while eating well entitles you to say that you know Italian. But again, so what? It is possible to do some pretty impressive statistics using R without any deeper understanding of the R language than what is required to run the applicable models. On the other hand, if you don’t know any statistics and you don’t want to know more than you have to, then that might be a big problem. But it’s the statistics part that has the steep learning curve, not R. Maybe this is the point of Williams’ comment: R remains the language of highly skilled statisticians because it is the language most capable of expressing what highly trained statisticians think about, and R is out of the reach of the many who just don’t care much about statistics. But if you are reading this post, this “many” probably doesn’t include you.
So suppose you are not a statistician but belong to some other data analysis culture, data mining, for example, and you want to learn R. Well, like any of the world’s great natural languages, R is approachable from many different starting points. The entry point may be different, but the path to knowledge is going to be similar. Find something you care about and start formulating simple sentences. There are fewer R language guides available for people who have some data mining skills, but that is changing. Seni and Elder have recently published a very nice little book on ensemble methods, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions (Synthesis Lectures on Data Mining and Knowledge Discovery), that I highly recommend. And, of course, the masters Hastie, Tibshirani and Friedman, authors of the classic text Elements of Statistical Learning, have made both the text and much of their code available. Moreover, if you are a data miner with a background in computer science, you can probably code rings around the highly trained statisticians. If so, you may find the books by Chambers, Software for Data Analysis, and Gentleman, R Programming for Bioinformatics, of great value.
Finally, to be perfectly fair, I should acknowledge Graham Williams’ quote in the context in which he wrote it. Like any other language, R is much easier to learn when you have access to the best learning tools, CDs, dictionaries, grammars etc. Rattle is one such high value learning tool and so is Revolution’s R Productivity Environment. But, more about these on another day.
Hello,
But who are the people who say R is hard? Were they SAS users? If so, that is to be expected: many users of SAS do not program in SAS, and the SAS programming model is hardly comparable to R's. For them, using R would be akin to learning how to program. Otherwise, if somebody has had programming experience, R is not much different from any other language. Matlab programmers should feel at ease. I was never quite fond of R's documentation, so there might be a problem there. One person (correctly) said that many of R's packages lack consistency; in SAS, the procs share many common features.
Nevertheless, in terms of programming in the language, I would like to know what difficulties beginners face.
Cheers
Saptarshi
Posted by: Saptarshi | June 17, 2010 at 08:45
I'm trying to push R in a social science context.
Social scientists are less frequently programmers than in other fields (in my experience) and so many of the programming ideas that other fields are already acquainted with (loops, control structures) have to be learned in addition to learning the syntax.
I say "steep learning curve" only to other social scientists who do not have a programming background. If they have a background in programming, I say, "You'll pick up the syntax; let me know if something is weird, or ask Stack Overflow."
Posted by: Jesse | June 17, 2010 at 09:44
I'll say "steep learning curve" for programmers --- I've programmed (C, Lisp, Perl, Python, Java) for over thirty years, and found R a bit of a challenge (primarily because I found the descriptions of the basic data-structures to be pretty opaque) when I started using it to produce plots to help me understand the network trace data I was dealing with at the time.
I would say that the new book from O'Reilly, R in a Nutshell by Joseph Adler is an excellent introduction to R that would be the first book I'd hand to a beginner.
Posted by: dm | June 17, 2010 at 11:29
Ugh. The problem beginners face is the attitude on display in this post.
"It's easy? What don't you get?"
For most people, R sits in that uncomfortable space between a program and a language. MATLAB has its own language, but very few people sit and write MATLAB code and then use their favorite compiler. Same with R. Sure, you can write it _in_ anything, but beginners frequently don't approach coding that way. They want some sort of support from the program the code is being written within. But this is how a lot of people start to learn programs, ESPECIALLY statistical packages.
The social sciences, and Stata in specific, are a great example. You can poke around and do some exercises for your stats class so long as someone shows you how to import data and type "reg", or whatever. Do that for a while, then outgrow the interactive mode, and try writing actual code. Then move on to more complicated things.
R, on the other hand, doesn't even have a decent interactive mode. Run a regression, and you see nothing. AH! We have to assign the output of the regression TO something, then ASK to see that thing! That's far from standard, and not at all user-friendly for new people. Add to this the aggressively idiosyncratic syntax, and you get something that people bang their heads over until they go back to something they can actually enjoy while they learn.
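For readers who have not seen the assign-then-inspect pattern described here, a minimal sketch follows. It uses R's built-in mtcars data set; the object name fit and the choice of model are assumptions made purely for illustration.

```r
# Fit a simple linear regression and keep the result in an object.
# `fit` is an arbitrary name; the mtcars data set ships with base R.
fit <- lm(mpg ~ wt, data = mtcars)

# Ask the stored model object for the pieces you want.
coefs  <- coef(fit)      # named vector holding the intercept and the slope
report <- summary(fit)   # standard errors, R-squared, and so on
print(report)
```

Keeping the fitted model as an object is what lets later calls such as predict() or plot() reuse it.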
This is Linux-v-Windows on a smaller scale. Yes, Linux is far more flexible. Yes, there is a wonderful base of people developing really cool stuff. Yes, it represents the cutting edge. Why doesn't it take off? Because the return from the effort is SOOOO small for SOOOO long. I've been banging my head against R for a couple of years, and I can report back that I am WAY, WAY behind where I should be, and would be for any other language. The only reason I keep doing it is the signal it sends: if you want to be seen as "current" you have to know R. Beyond that, give me Stata, Matlab, or even Python any day.
And, the tossed off comment at the end of this post is a much deeper problem: The dramatic inconsistency in R packages is not only confusing, it's "dangerous". I've come across packages that get things wrong. And because there isn't a lot of user base, it's not been noticed or corrected. So you have to be able to diagnose BOTH code AND detailed statistical calculations to figure out the problem.
Josh Angrist, a prime figure in social science methods, made that point far better than I can. Paraphrasing, he said he knows he can trust the routine in Stata, but doesn't know what yahoo came up with the random R package.
Posted by: Nash | June 17, 2010 at 13:56
You might expect that a syntax-based SPSS and SAS user of over 30 years would have an easy transition to R. Not the case. All those years with SAS or SPSS make R more difficult to learn.
The reason is that the traditional statistics packages FORCE you to use and think within the framework of a rectangular flat file. It is the only way to work with data (matrix algebra subprograms being an exception). R breaks that mold and encompasses many data types, although primarily vector based.
After 30 years of thinking of all data and data transformations as functions on cases (records) and variables (columns) it is hard to adjust to dataframes being just one special data format among many. I was ashamed of how long it took me to feel confident with R, but it was well worth the effort.
Posted by: Harolddd | June 17, 2010 at 16:04
I'm a professional statistician with a lot of experience with many statistics packages, and I worked as a programmer for many years. I have learned dozens of languages.
When I came to learn R, I was highly motivated, and I spent a lot of time working at becoming good at it.
In my experience (for a variety of reasons), R is harder to learn than other packages and languages that would do a similar job. It's harder to learn than APL or Matlab, for example.
For the very simplest problems and the earliest stages of learning, it's pretty close, especially if you're going to use a GUI (or, for that matter, if you do it in S-Plus - which I had used a number of times before I tried to learn R) - but even without a GUI, it's not much harder than other tools I'd consider for the job.
For very hard problems, once you have learned the various competing tools to similar levels of competence, I think it's actually a bit easier.
But in between, working at moderately difficult problems at a modest level of comprehension, I think it's somewhat more difficult - and there's a noticeably longer period between moving out of the "just getting started" and the "competent enough to tackle difficult problems well" stages. Or there was for me.
On the other hand, I think it repays the additional effort required. Once you get over the hump, it definitely seems worthwhile (and indeed, I often find myself wondering why I had such trouble with what in retrospect are not such difficult issues).
Posted by: efrique | June 17, 2010 at 18:07
@Nash- re: "Josh Angrist, a prime figure in social science methods, made that point far better than I can. Paraphrasing, he said he knows he can trust the routine in Stata, but doesn't know what yahoo came up with the random R package." That makes no sense. With R you always know who wrote the routine. You can look up their CV and papers, email them, and of course look at the source code. With Stata or any other proprietary language, you have no idea. Usually it is someone with a lot of CS background but no stats training. I know because I used to be one of those yahoos. Open source may be more chaotic, but at least it's transparent.
Posted by: F | June 17, 2010 at 23:51
When I help my friends learn R, there seem to be some basic ideas that consistently stump people and keep them from learning and using R effectively. One of the prime ones, as @jesse mentioned, is the explanation of the data structures and the relationship between the different types.
For instance, a dataframe is actually a type of list, so functions such as lapply can take it as an argument.
or
Why does df@variable not work for some functions, but df[,"variable"] does?
These little things cause beginners great angst, and they are not well covered or explained, even in some intro books. And R is full of such little things....
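A minimal sketch of the first point, that a data frame is a list of columns. The data frame df and its column names are made up for illustration. Note also that `@` is R's operator for reading slots of S4 objects, which is not what an ordinary data frame is, while `$` and `[` both work on it.

```r
# Hypothetical data frame, invented for this example.
df <- data.frame(height = c(160, 170, 180), weight = c(55, 68, 81))

# A data frame is stored as a list of equal-length columns...
stored_as_list <- is.list(df)

# ...so list tools such as lapply() walk over its columns.
col_means <- lapply(df, mean)

# Two ways to pull out the same column.
by_dollar  <- df$height        # list-style access by name
by_bracket <- df[, "height"]   # matrix-style access by column name
same_column <- identical(by_dollar, by_bracket)
```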
Posted by: Robert | June 21, 2010 at 06:36
I find R to be a great tool for manipulating and analyzing data sets interactively. With base R plus the RODBC, ggplot2, and XML packages, I can do simple analyses in a few lines of code and more sophisticated ones in a page. I use it all the time; it's a great tool.
However, R *is* idiosyncratic and difficult to learn. Though I've worked in a wide range of languages over several decades, I found R hard to learn, and I still make mistakes when using it. One reason for this is internal inconsistency. For example, f<-factor('sdf'); nchar(f) -> 1 (treated as number!); paste(f,f) -> 'sdf sdf' (treated as string); c(f,f) -> 1 1 (treated as number!). Atomic vectors and lists are interchangeable in some contexts but not in others. Patrick Burns' R Inferno has lots more examples of inconsistencies and idiosyncrasies like this that make the language difficult to learn and difficult to use.
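The root of the factor behaviour is that a factor is stored as integer codes plus a table of labels, and each function decides which of the two it looks at. What some functions return for a factor has changed between R versions, so the sketch below sticks to explicit conversions that make the choice visible. The variable names are illustrative only.

```r
f <- factor("sdf")

storage <- typeof(f)          # how the factor is stored internally
labels  <- levels(f)          # the label table the codes point into

# Convert explicitly to say which view you mean.
code    <- as.integer(f)      # the underlying integer code
label   <- as.character(f)    # the label as an ordinary string
n_chars <- nchar(as.character(f))  # character count of the label, not of the code
```

Converting with as.character() before applying string functions avoids depending on how a given function happens to treat factors.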
Another problem is that the terminology used in the language is old-fashioned, idiosyncratic, and inconsistent. Why do strings have the type "character" and not "string"?
All this is to be expected in a language that has had a long history of multiple contributors; it is perhaps time for a clean sweep and the creation of R2....
-s
Posted by: Stavros Macrakis | June 24, 2010 at 16:44
This is an interesting perspective on learning a language that I had never considered. Thanks for this great post.
Posted by: wally | August 27, 2010 at 14:37