The "apply" family of functions in R (apply, sapply, lapply) is a very powerful suite of tools for iterating through structures of data and returning the combined results of each iteration. But with great power comes great responsibility (or something like that): these functions can sometimes be frustratingly difficult to get working exactly as you intended, especially for newcomers to R.
That's where the functions from Hadley Wickham's plyr package come in: they create a unified interface for iterating over data in R with a consistent syntax for specifying what the inputs are, and what the outputs should be.
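As a rough sketch of what that consistent syntax means in practice: in plyr the first letter of a function name gives the input type and the second gives the output type, so ldply takes a list and returns a data frame. The getData function and years vector below are hypothetical stand-ins, not the code from the video, and the plyr call is skipped if the package is not installed.

```r
# Hypothetical stand-in: returns a one-row data frame for a given year.
getData <- function(y) {
  data.frame(year = y, value = y * 2)
}
years <- 2005:2007

# sapply() tries to simplify the per-year data frames into an array,
# so the result is a matrix whose cells are list elements, not a data frame.
viaSapply <- sapply(years, getData)
is.data.frame(viaSapply)
is.matrix(viaSapply)

# plyr: "l" = list in, "d" = data frame out.
# Runs only if plyr is available on this machine.
if (requireNamespace("plyr", quietly = TRUE)) {
  viaPlyr <- plyr::ldply(as.list(years), getData)
  print(viaPlyr)
}
```

The other plyr functions follow the same pattern, for example llply (list in, list out) and ddply (data frame in, data frame out).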
JD Long captures both the frustration that comes with mis-applying apply and the joy of overcoming it with plyr in this 5-minute video. Be sure to check the associated blog post for the background and comments.
Cerebral Mastication: Struggling with apply() in R
To me the video looks deceptive. You have to read the documentation and understand where to get data frames in plyr, but you're supposed to just wish sapply returned them? Nowhere in any of the apply documentation does it say you can get data frames out, and if you asked someone how to get one, no one who knows anything would say to look at the apply functions (maybe aggregate, or a loop with rbind, depending on the task).
So, he starts from a faulty premise, works himself into a hole using sapply to do something its documentation says it cannot, and then goes on to show a function that can do what some vague zeitgeist in his head says sapply should do.
Furthermore, his terminology is a mess. He's not working with a list at all in the first place (though he says so repeatedly), but with a vector and a function that generates data frames.
The whole premise is misleading for the example case. Pick a real example!!
For this particular task a loop is just...
myData <- NULL
for (y in years) { myData <- rbind(myData, getData(y)) }
# don't remember exact commands from the video but is this all ldply is a wrapper for?
It's a very misleading video and I think the behaviour of the ld function looks mysterious.
(I'm in no way condemning plyr and the many who love it, just the video.)
Posted by: JC | December 17, 2009 at 12:55
Hey JC, you are spot on about me using the term list instead of vector. That's a really good point. I'm going to edit the blog post to point out that error. Good catch.
As for your comment about working my way into a hole, that is correct as well. It appears, however, that you may be failing to appreciate how a beginner with R approaches programming problems. What I was illustrating in the video is how a beginner (and I group myself in that category) approaches a problem and can end up frustrated quickly because things seem non-intuitive. We can point at my intuition and say that I have unreasonable expectations, which may be true. But all new users bring conceptual misunderstandings to the keyboard with them.
The challenge with the apply() family of functions in R is that they use rather different syntax from each other. They also often require wrapping the syntax in helper functions in order to accomplish a logical process such as split-apply-combine (which my simple example did not illustrate). What the plyr package adds is not new functionality. Plyr adds a unified abstraction and simplification to the analytical process.
Thanks for showing how to accomplish the same thing with a loop. That’s a very good illustration of how one can do things in R in many different ways. It’s always useful and educational to see equivalent methods!
Posted by: jd long | December 18, 2009 at 07:52
JC's code using the for-loop could be expressed more succinctly with Reduce:
myData <- Reduce(rbind, lapply(years, getData))
Posted by: Davor Cubranic | November 09, 2010 at 14:17
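For readers who want to try the approaches from the comments side by side, here is a small base R sketch. The getData function and years vector are hypothetical stand-ins for the ones in the video, and the do.call variant is an extra alternative not mentioned above.

```r
# Hypothetical stand-in: returns a one-row data frame for a given year.
getData <- function(y) {
  data.frame(year = y, value = y * 2)
}
years <- 2005:2007

# 1. The for-loop: grow the result one data frame at a time.
viaLoop <- NULL
for (y in years) {
  viaLoop <- rbind(viaLoop, getData(y))
}

# 2. Reduce: fold rbind pairwise over the list of data frames.
viaReduce <- Reduce(rbind, lapply(years, getData))

# 3. do.call: hand the whole list to a single rbind call.
viaDoCall <- do.call(rbind, lapply(years, getData))

# Compare the three results.
all.equal(viaLoop, viaReduce)
all.equal(viaReduce, viaDoCall)
```

The loop and Reduce both call rbind once per element, while do.call binds everything in one call, which tends to matter more as the number of pieces grows.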