
June 11, 2010

Comments


There are two minor points that perhaps should be corrected in this posting. The crossprod function calculates A'*A not A*A', which is what is calculated by the tcrossprod function. And the lm function does use the BLAS but only level-1 BLAS (vector-vector operations) not level-2 (matrix-vector) or level-3 (matrix-matrix) BLAS. Most accelerated BLAS systems concentrate on boosting the performance of level-3 BLAS.
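
In R terms the distinction looks like this (a minimal sketch with an arbitrary small matrix, not from the original post):

A <- matrix(rnorm(6), nrow = 3, ncol = 2)
all.equal(crossprod(A), t(A) %*% A)    # crossprod(A) is A'A (2 x 2)
all.equal(tcrossprod(A), A %*% t(A))   # tcrossprod(A) is AA' (3 x 3)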

The reason that the lm.fit and glm.fit functions (the workhorses underneath the lm and glm functions) end up calling the older Linpack routines, which use level-1 BLAS, instead of the newer, level-3-based Lapack routines is the need to handle certain rank-deficient cases cleanly. The output from example(fastLm) in the RcppArmadillo package shows such a case.
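
To see what handling a rank-deficient case cleanly looks like in practice, here is a minimal sketch with invented data (the second predictor is an exact copy of the first, so the model matrix is rank deficient):

set.seed(42)
x1 <- rnorm(20)
x2 <- x1                   # perfectly collinear with x1
y  <- 1 + 2 * x1 + rnorm(20)
coef(lm(y ~ x1 + x2))      # the pivoted QR drops the aliased column; its coefficient is reported as NA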

The ability to handle such a case adequately is not specifically a property of level-1 BLAS versus level-3; it requires a special form of pivoting in the QR decomposition. John Chambers modified the dqrdc Linpack subroutine for this in the original S language and Ross Ihaka did similar modifications for R back in the old days. No one has done similar modifications for Lapack's QR decomposition. Every once in a while some of us in R-core discuss alternative approaches that would allow using the Lapack, level-3 BLAS code but no one has sat down and coded them up to see if they would work.
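
That pivoting is visible from R itself (a minimal sketch; the matrix is invented and its third column duplicates the second, so the true rank is 2):

set.seed(1)
x <- rnorm(10)
X <- cbind(1, x, x)
qr(X)$rank    # the default (modified-Linpack) decomposition detects the deficiency and reports rank 2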

I also commented in a thread on the R-help mailing list that a naive calculation of chol(crossprod(A)) where A is 10000 by 5000 (a combination of your first two examples) is almost certain to produce useless answers. Admittedly there are occasions where one wants the Cholesky decomposition of very large, symmetric matrices - I do this all the time in the lme4 package for mixed-effects models - but you must be very careful about the structure, and some form of regularization must be employed before the answers are of any value. Without regularization (inflating the diagonal of crossprod(A) in some way) the crossproduct matrix is virtually guaranteed to be singular. The problem with the calculation you describe is that it will complete, and one could go on to calculate nonsense coefficients, etc., without looking closely at the result.
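
One simple form of that regularization is to inflate the diagonal before factoring; a small sketch (the dimensions and the ridge value are arbitrary, chosen only for illustration):

set.seed(2)
A <- matrix(rnorm(100 * 20), 100, 20)
lambda <- 1e-3                               # arbitrary ridge value, illustration only
R <- chol(crossprod(A) + lambda * diag(20))  # regularized Cholesky factor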

The calculation of singular values or principal components is a more meaningful example.
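
For instance, something along these lines (the dimensions are arbitrary, chosen only for timing illustration):

A <- matrix(rnorm(2000 * 500), 2000, 500)
system.time(s <- svd(A, nu = 0, nv = 0))   # singular values only
system.time(p <- prcomp(A))                # principal components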

I know CRAN posts alternative Rblas.dll files for some common Intel chips (at least for Windows). Using these libraries does speed up many calculations. I wonder how this fits in with your benchmarks above, especially on the Intel Core Duo series (which seems to power a lot of laptops, at least).

Hi,

Thanks for an interesting post. I'm using R on Mac OS X, and this info is new (and rather surprising) to me:

"The MacOS port of R on CRAN is linked to ATLAS, a "tuned" BLAS that uses multiple cores for computations."

Never have I seen CPU usage above 100% (I have four cores), and I see that as an indication that R is *not* using multithreading. Am I missing something here?

@Doug -- many thanks for the insightful comments; I've updated the post accordingly. Good point about the slightly odd Cholesky example (a function of running code for benchmarking purposes rather than for actual analysis).

@Michael -- interesting, thanks. I use Revolution R on the Mac, but I had thought that the CRAN binary for R was linked to a multithreaded library by default. This warrants more investigation.
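
A quick way to check (a sketch; the matrix size is arbitrary) is to run a large matrix multiply and watch CPU usage in Activity Monitor or top while it runs:

n <- 4000
A <- matrix(rnorm(n * n), n, n)
system.time(B <- A %*% A)   # with a multithreaded BLAS, CPU usage should climb above 100%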

@David -- Sorry! I just ran some of the examples in your blog post, and my standard R 2.11.0 indeed *does* use multithreading. I probably haven't used large enough matrix calculations myself to notice it before.

A lot of information here that I can use. Keep up the blogging.

Will lmer (from the lme4 package):
(1) be faster using Revolution than with base R, since it relies on functions like chol()?
(2) produce correct answers, given the concerns raised in the above comments by lme4 architect Doug Bates?

If so, it will dramatically improve my use of R (I currently experience very long waits to compute multilevel models with a large data set).

Thanks for the questions, Matt.
1) It will probably be a little faster on Windows if you have a multi-core processor, but the difference will depend on what analysis you do. If you try it out, let us know how it goes.
2) Yes, it will give the correct answers, exactly the same as regular R. (Doug's comment was about the benchmark we used, not the accuracy of R itself.)
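
If you do try it, one simple way to compare is to time the same fit in each installation; a sketch using the sleepstudy data that ships with lme4:

library(lme4)
system.time(fm <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy))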

Hi David/all,

I ran lmer on some of my data using both Revolution R 4.3 (enterprise/academic) and base R 2.13.1. I loaded the same workspace and performed identical functions. In every case, Revolution R was faster, although how much faster varied. None were so much faster as to revolutionize my workflow, but the gains were impressive nonetheless. I am running Windows Vista on a Core i7 920 2.67GHz machine with 12GB of RAM.

RevoR 4.3    R 2.13.1
  3.33         5.26
  4.94         5.82    # only model with a random slope; all others are random intercept only
  6.08         6.45
 33.13        36.83
243.46       348.32

The first column is for Revolution R and the second is for base R; e.g., 3.33 vs. 5.26. I should specify that the times are the total elapsed time reported by system.time() for the five different models.

Great post! It's great to see that there are people out there still working on improving the technology that Windows or Mac provides; the performance improvements seem pretty noticeable from the reports shown.

Would it be possible to port this from Linux to Solaris? Solaris runs on x86, and there are versions of R for Solaris x86.
