
June 08, 2015



Great talk, and I agree on many things. I would also like to point out that Andrie de Vries and I wrote a book on R from a programming perspective.

I know the title doesn't sound like it, but "R for Dummies" is an introduction to R that I use in my Statistical Computing classes.

Interesting. May I propose ANKHOR FlowSheet as a powerful addition to, or replacement for, some R projects?
Meanwhile we have moved completely from R to ANKHOR. Their R plugin helped us a lot during the transition.
Although some of our programmers were crying ("we want to WRITE our code"), everybody is now a convinced FlowSheeter.
The fact that it utilizes all CPU cores (and the GPU) makes it a wonderful tool for all kinds of data-centric applications.
One of the best features (and probably a unique one) is the direct inspection of the results of an operation (even the intermediate results in a loop). THAT is much more powerful than traditional debugging methods.
Just my 2 cents...

"The "for" loops can be slow in R which ... well, I can't really think of an upside for that one, except that it encouraged the development of high-performance extension frameworks like Rcpp."

IIRC, one is not supposed to use 'for' loops. Vectorize, or use tapply, lapply, ...

If you are writing your own 'for' loops, you should stop and reconsider.
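To make the advice above concrete, here is a minimal R sketch (the data and names are invented for illustration) contrasting the slow loop pattern with an apply-style call and full vectorization:

```r
x <- runif(1e4)  # example data, just for illustration

# The slow pattern: a for-loop growing its result one element at a time,
# which copies the vector on every iteration
out_loop <- numeric(0)
for (i in seq_along(x)) out_loop <- c(out_loop, x[i]^2)

# apply-style: clearer, and avoids the repeated copying
out_apply <- sapply(x, function(v) v^2)

# Fully vectorized: R's arithmetic already operates on whole vectors
out_vec <- x^2

all.equal(out_loop, out_vec)  # TRUE
```

All three give the same result; the vectorized form is typically the fastest by a wide margin.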

There are a lot of complaints about this 'slowness', but often the user can speed things up by writing better code. R is not inherently parallel, even if it has libraries and functions that are: if you use a loop or an apply function, it is not automatically parallelized. Another reason we're using ANKHOR analytics now is that it spreads and executes loop calculations automatically across all cores. Approved developers can get Grid server nodes; if you have complex, long-running jobs, a special shared loop operator sends the jobs to the grid nodes and collects the results for you without any extra effort :)
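For comparison with base R: loops and apply calls are indeed serial by default, but the bundled parallel package lets you distribute independent iterations explicitly. A minimal sketch, where heavy() is a made-up stand-in for real per-item work:

```r
library(parallel)  # shipped with base R since 2.14

heavy <- function(i) sqrt(i)  # stand-in for an expensive, independent computation

cl <- makeCluster(2)                # start two worker processes
res <- parLapply(cl, 1:100, heavy)  # distribute iterations across the workers
stopCluster(cl)

length(res)  # 100
```

On Unix-alikes, mclapply(1:100, heavy, mc.cores = 2) achieves the same with forked processes and no explicit cluster setup.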

If one is wondering whether there is an upside to the slowness of for-loops in R, perhaps it is the pragmatic effect of forcing users to distinguish almost immediately between serial and parallel data operations. That is, as a rule of thumb, if you cannot implement a certain algorithm in R without a for-loop, it cannot be parallelized as-is across multiple cores or computers, whereas e.g. lapply operations are usually "embarrassingly parallel". Because for-loops are likely to be annoyingly slow on any problem large enough that speedups matter, the programmer will immediately think about whether they can switch to a vectorized version or an apply-style function. Recognizing these different problem characteristics becomes a valuable habit once you regularly have to deal with big enough data.
But some would argue it's silly to say "this obstacle forces you to make the progress necessary to deal with the obstacle, therefore it's an advantage".
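The serial-vs-parallel distinction above can be shown in a few lines of R; the compound-growth example here is invented purely to exhibit the dependence structure:

```r
n <- 10

# Serial: iteration i needs the result of iteration i - 1,
# so this loop cannot be parallelized as-is
serial <- numeric(n)
serial[1] <- 1
for (i in 2:n) serial[i] <- serial[i - 1] * 1.05

# Independent: each element depends only on i -- "embarrassingly parallel",
# so lapply/parLapply (or plain vectorization) applies directly
indep <- vapply(1:n, function(i) 1.05^(i - 1), numeric(1))

all.equal(serial, indep)  # TRUE
```

The values happen to coincide here because the recurrence has a closed form; many serial recurrences (e.g. ones with data-dependent updates) do not, which is exactly when the for-loop is unavoidable.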

