
June 17, 2010

Comments


Hello,
But who are the people who say R is hard? Were they SAS users? If so, that is to be expected: many users of SAS do not program in SAS, and the SAS programming model is hardly comparable to R's. For them, using R would be akin to learning how to program. Otherwise, if somebody has had programming experience, R is not much different from any other language. Matlab programmers should feel at ease. I was never quite fond of R's documentation, so there might be a problem there. One person (correctly) said that many of R's packages lack consistency; in SAS, the procs share many common features.
Nevertheless, in terms of programming in the language, I would like to know what difficulties beginners face.

Cheers
Saptarshi

I'm trying to push R in a social science context.

Social scientists are less frequently programmers than researchers in other fields (in my experience), so many of the programming ideas that other fields are already acquainted with (loops, control structures) have to be learned on top of the syntax itself.
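For readers who haven't met those constructs, here is a toy sketch (entirely made-up scores) of what a loop and a control structure look like in R, next to the vectorized idiom R users eventually prefer:

scores <- c(72, 85, 90, 64)   # made-up exam scores
passing <- 0
for (s in scores) {           # a loop over the scores
  if (s >= 70) {              # a control structure (conditional)
    passing <- passing + 1
  }
}
passing                       # 3 passing scores

sum(scores >= 70)             # the vectorized one-liner that does the same thing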

I say "steep learning curve" only other social scientists that do not have a programming background. If they have a background in programming, I say, "You'll pick up the syntax, let me know if something is weird, or ask stack overflow."

I'll say "steep learning curve" for programmers --- I've programmed (C, Lisp, Perl, Python, Java) for over thirty years, and found R a bit of a challenge (primarily because I found the descriptions of the basic data-structures to be pretty opaque) when I started using it to produce plots to help me understand the network trace data I was dealing with at the time.

I would say that the new book from O'Reilly, R in a Nutshell by Joseph Adler, is an excellent introduction to R and the first book I'd hand to a beginner.

Ugh. The problem beginners face is the attitude on display in this post.

"It's easy? What don't you get?"

For most people, R sits in that uncomfortable space between a program and a language. MATLAB has its own language, but very few people sit and write MATLAB code and then use their favorite compiler. Same with R. Sure, you can write it _in_ anything, but beginners frequently don't approach coding that way. They want some sort of support from the program the code is being written within. But this is how a lot of people start to learn programs, ESPECIALLY statistical packages.

The social sciences, and Stata in particular, are a great example. You can poke around and do some exercises for your stats class so long as someone shows you how to import data and type "reg", or whatever. Do that for a while, then outgrow the interactive mode and try writing actual code. Then move on to more complicated things.

R, on the other hand, doesn't even have a decent interactive mode. Run a regression, and you see nothing. AH! We have to assign the output of the regression TO something, then ASK to see that thing! That's far from standard, and not at all user-friendly for new people. Add to this the aggressively idiosyncratic syntax, and you get something that people bang their heads over until they go back to something they can actually enjoy while they learn.
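For what it's worth, the assign-then-inspect workflow being complained about looks roughly like this (a minimal sketch with made-up data; the names are arbitrary):

x <- rnorm(100)          # made-up predictor
y <- 2 * x + rnorm(100)  # made-up response
fit <- lm(y ~ x)         # assign the fitted regression to an object...
summary(fit)             # ...then ask to see the details
coef(fit)                # or extract just the coefficients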

This is Linux-v-Windows on a smaller scale. Yes, Linux is far more flexible. Yes, there is a wonderful base of people developing really cool stuff. Yes, it represents the cutting edge. Why doesn't it take off? Because the return on the effort is SOOOO small for SOOOO long. I've been banging my head against R for a couple of years, and I can report back that I am WAY, WAY behind where I should be, and where I would be with any other language. The only reason I keep doing it is the signal it sends: if you want to be seen as "current", you have to know R. Beyond that, give me Stata, Matlab, or even Python any day.

And the tossed-off comment at the end of this post points to a much deeper problem: the dramatic inconsistency in R packages is not only confusing, it's "dangerous". I've come across packages that get things wrong, and because there isn't a large user base, the errors haven't been noticed or corrected. So you have to be able to diagnose BOTH the code AND the detailed statistical calculations to figure out the problem.

Josh Angrist, a prime figure in social science methods, made that point far better than I can. Paraphrasing, he said he knows he can trust the routine in Stata, but doesn't know what yahoo came up with the random R package.

As a syntax-based SPSS and SAS user for over 30 years, you might expect I would have had an easy transition to R. Not the case. All those years with SAS or SPSS make R more difficult to learn.

The reason is that the traditional statistics packages FORCE you to use and think within the framework of a rectangular flat file. It is the only way to work with data (matrix algebra subprograms being an exception). R breaks that mold and encompasses many data types, although it is primarily vector-based.

After 30 years of thinking of all data and data transformations as functions on cases (records) and variables (columns), it is hard to adjust to data frames being just one special data format among many. I was ashamed of how long it took me to feel confident with R, but it was well worth the effort.
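A rough sketch of that variety, with toy values, for anyone coming from a flat-file package:

v  <- c(1.5, 2.0, 3.5)                  # atomic vector: the basic building block
m  <- matrix(1:6, nrow = 2)             # matrix: a vector with dimensions
l  <- list(label = "a", values = v)     # list: elements of mixed types and lengths
df <- data.frame(id = 1:3, score = v)   # data frame: the familiar rectangular case,
                                        # stored as a list of equal-length columns
str(df)                                 # str() reveals the underlying structure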

I'm a professional statistician with a lot of experience with many statistics packages, and I worked as a programmer for many years. I have learned dozens of languages.

When I came to learn R, I was highly motivated, and I spent a lot of time working at becoming good at it.

In my experience (for a variety of reasons), R is harder to learn than other packages and languages that would do a similar job. It's harder to learn than APL or Matlab, for example.

For the very simplest problems and the earliest stages of learning, it's pretty close, especially if you're going to use a GUI (or, for that matter, if you do it in S-Plus - which I had used a number of times before I tried to learn R) - but even without a GUI, it's not much harder than other tools I'd consider for the job.

For very hard problems, once you have learned the various competing tools to similar levels of competence, I think it's actually a bit easier.

But in between, working on moderately difficult problems at a modest level of comprehension, I think it's somewhat more difficult - and there's a noticeably longer period between the "just getting started" stage and the "competent enough to tackle difficult problems well" stage. Or there was for me.

On the other hand, I think it repays the additional effort required. Once you get over the hump, it definitely seems worthwhile (and indeed, I often find myself wondering why I had such trouble with what in retrospect are not such difficult issues).

@Nash- re: "Josh Angrist, a prime figure in social science methods, made that point far better than I can. Paraphrasing, he said he knows he can trust the routine in Stata, but doesn't know what yahoo came up with the random R package." That makes no sense. With R you always know who wrote the routine. You can look up their CV and papers, email them, and of course look at the source code. With Stata or any other proprietary language, you have no idea. Usually it is someone with a lot of CS background but no stats training. I know because I used to be one of those yahoos. Open source may be more chaotic, but at least it's transparent.

When helping my friends learn R, there seem to be a few basic ideas that consistently stump people and keep them from effectively learning and using R. One of the prime ones, as @jesse mentioned, is the explanation of the data structures and the relationships between the different types.

For instance, a dataframe is actually a type of list, so functions such as lapply can take it as an argument.

or

Why does df@variable not work for some functions, but df[,"variable"] does?
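A brief illustration of both points, using a made-up data frame called df:

df <- data.frame(x = 1:3, y = c(10, 20, 30))
lapply(df, mean)   # works: a data frame is a list of columns, so list tools apply
df$y               # one column, by name
df[["y"]]          # the same column, list-style
df[, "y"]          # the same column, matrix-style
# df@y fails: @ extracts S4 slots, and a plain data frame has none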

These little things cause beginners great angst, and they are not well covered or explained -- even in some intro books. And R is full of such little things....

I find R to be a great tool for manipulating and analyzing data sets interactively. With base R plus the RODBC, ggplot2, and XML packages, I can do simple analyses in a few lines of code and more sophisticated ones in a page. I use it all the time; it's a great tool.

However, R *is* idiosyncratic and difficult to learn. Though I've worked in a wide range of languages over several decades, I found R hard to learn, and I still make mistakes when using it. One reason for this is internal inconsistency. For example, f<-factor('sdf'); nchar(f) -> 1 (treated as a number!); paste(f,f) -> 'sdf sdf' (treated as a string); c(f,f) -> 1 1 (treated as numbers!). Atomic vectors and lists are interchangeable in some contexts but not in others. Patrick Burns' R Inferno has lots more examples of inconsistencies and idiosyncrasies like this that make the language difficult to learn and difficult to use.
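The exact outputs of nchar() and c() on factors depend on the R version, but the duality behind those examples is easy to demonstrate:

f <- factor('sdf')
as.character(f)    # 'sdf'     -- the label
as.integer(f)      # 1         -- the underlying level code
paste(f, f)        # 'sdf sdf' -- coerced to character here
levels(f)          # 'sdf'
# Whether a given function sees the label or the code is exactly the
# inconsistency described above.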

Another problem is that the terminology used in the language is old-fashioned, idiosyncratic, and inconsistent. Why do strings have the type "character" and not "string"?
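For anyone who hasn't tripped over it:

typeof('hello')    # 'character', not 'string'
class('hello')     # also 'character'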

All this is to be expected in a language that has had a long history of multiple contributors; it is perhaps time for a clean sweep and the creation of R2....

-s

This is an interesting perspective on learning a language that I had never considered. Thanks for this great post.
