R wasn't originally designed as a multithreaded application -- multiprocessor systems were still rare when the R Project was first conceived in the mid-1990s -- and so, by default, R will only use one processor of your dual-core laptop or quad-core desktop machine when doing calculations. For calculations that take a long time, like big simulations or modeling of large data sets, it would be nice to put those other processors to use to speed up the computations. There are several parallel processing packages for R that let you explicitly run loops in parallel (ideally, each iteration on a different processor), but using them does require you to rewrite your code accordingly.
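For example, here's a minimal sketch of explicit loop-level parallelism using the foreach and doParallel packages (my choice for illustration; the post itself doesn't name these particular packages):

```r
library(foreach)
library(doParallel)

cl <- makeCluster(4)        # launch 4 worker processes
registerDoParallel(cl)

# Each iteration runs on whichever worker is free; results are combined with c()
results <- foreach(i = 1:100, .combine = c) %dopar% {
  mean(rnorm(1e5))
}

stopCluster(cl)
```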
But there is a way to make use of all your processing power for many computations in R, without changing a line of code. That's because R is a statistical computing system, and at the heart of many of the algorithms you use on a daily basis -- data restructuring, regressions, classifications, even some graphics functions -- is linear algebra. The data are transformed into vector and matrix objects, and the internals of R have been cleverly designed to link to a standard "BLAS" API to perform calculations on vectors and matrices. The binaries provided by the R Core Group on CRAN (with one exception, see below) are linked to an "internal BLAS which is well-tested and will be adequate for most uses of R", but is not multi-threaded and so only uses one core. But the beauty of linking to the BLAS API is that you can re-compile R to link to a different, multi-threaded BLAS library and, voilà, suddenly many computations are using all cores and therefore run much faster.
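To make that concrete, here are a few everyday operations that bottom out in BLAS (and LAPACK) calls, so a multi-threaded BLAS accelerates them without any change to the R code (the dimensions are just for illustration):

```r
set.seed(42)
X <- matrix(rnorm(2000 * 500), 2000, 500)
y <- rnorm(2000)

XtX <- crossprod(X)                 # X'X: level-3 (matrix-matrix) BLAS
Z   <- X %*% t(X)                   # matrix multiply: level-3 BLAS
b   <- solve(XtX, crossprod(X, y))  # linear solve: LAPACK on top of BLAS
s   <- svd(X)                       # singular value decomposition: LAPACK
```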
The MacOS port of R on CRAN is linked to ATLAS, a "tuned" BLAS that uses multiple cores for computations. As a result, R on a multi-core Mac (as all new Macs are these days) really zooms. But the Windows binaries on CRAN are not linked to an optimized BLAS. It's possible to compile and link R yourself, but it can be tricky.
That's what we do at Revolution for our Windows and Linux distributions of Revolution R. When we compile R, we link it to the Intel Math Kernel Library (MKL), which includes a high-performance BLAS implementation tuned to multi-core Intel chips. "Tuning" here means using efficient algorithms, optimized assembly code that exploits features of the chipset, and multi-threaded algorithms that use all cores simultaneously. As a result, you get some serious speed boosts for many operations in R, especially on a multi-core system. Here are some examples:
[Benchmark table: elapsed times for several linear algebra operations in base R versus Revolution R restricted to 1 core and 4 cores.]
As you can see, using the Intel MKL library on a 4-core machine gives some dramatic speedups (times drop to about a quarter of the 1-core time, as you might expect). Perhaps more surprisingly, using the Intel MKL library on a 1-core machine is also faster than using R's standard BLAS library: that's a result of the optimized algorithms, not additional computing power. You even get improvements on non-Intel chipsets (like AMD).
[A side note: these calculations were actually all run on an 8-core machine -- specifically, an 8-core Intel Xeon CPU with 18 GB of RAM running the Windows Server 2008 operating system. The complete benchmark code is available on this page. The results for Revolution R 1-core and 4-core were calculated by restricting the Intel MKL library to 1 thread and 4 threads, using the Revolution-R-specific commands setMKLthreads(1) and setMKLthreads(4), respectively. This has the effect of using only the power of that number of cores, even when more cores are available. Note: if you're using Revolution R and doing explicit parallel programming with doSMP, it's a good idea to call setMKLthreads(1) first. Otherwise, your parallel loops and the multi-threaded linear algebra computations will compete for the same processors and actually degrade performance.]
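The thread-restriction step looks like this (a sketch; setMKLthreads() is the Revolution-R-specific function named above, and the matrix size is illustrative):

```r
n <- 4000
A <- matrix(rnorm(n * n), n, n)

setMKLthreads(1)          # restrict MKL to a single thread
system.time(A %*% A)      # "1-core" timing

setMKLthreads(4)          # allow MKL to use four threads
system.time(A %*% A)      # "4-core" timing
```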
These results are dramatic, but multi-threaded BLAS libraries aren't a panacea. Not all R commands ultimately link to BLAS code, even ones you might expect. (For example, lm for regression largely bypasses the optimized BLAS by default. Edit: as pointed out by Doug Bates, lm and glm end up calling the older Linpack routines, which use level-1 (vector-vector) BLAS, rather than the newer Lapack routines built on level-3 (matrix-matrix) BLAS, because of the need to handle certain rank-deficient cases cleanly.) And if your R code ultimately does not involve linear algebra, you can't expect any improvement at all. (For example, the "Program Control" R benchmarks by Simon Urbanek show only marginal performance gains in Revolution R.) This is where explicit parallel programming is the route to improved performance. We're also working on dedicated statistical routines for Revolution R Enterprise that are explicitly multi-threaded on a single machine and also distributable across multiple machines in a cluster or in the cloud, but that's a topic for another post.
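You can see the contrast for yourself with a quick timing comparison (a sketch; the sizes are illustrative):

```r
X <- matrix(rnorm(1e5 * 50), 1e5, 50)
y <- rnorm(1e5)

system.time(lm(y ~ X))     # QR via Linpack/level-1 BLAS: little gain from a tuned BLAS
system.time(crossprod(X))  # level-3 BLAS: large gain from a tuned, multi-threaded BLAS
```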
Revolution Analytics: Performance Benchmarks
There are two minor points that perhaps should be corrected in this posting. The crossprod function calculates A'A, not AA'; the latter is what the tcrossprod function calculates. And the lm function does use the BLAS, but only the level-1 BLAS (vector-vector operations), not the level-2 (matrix-vector) or level-3 (matrix-matrix) BLAS. Most accelerated BLAS systems concentrate on boosting the performance of the level-3 BLAS.
The reason that the lm.fit and glm.fit functions (the workhorses underneath the lm and glm functions) end up calling the older Linpack routines, which use level-1 BLAS, instead of the newer, level-3-based Lapack routines is the need to handle certain rank-deficient cases cleanly. The output from example(fastLm) in the RcppArmadillo package shows such a case.
The ability to handle such a case adequately is not specifically a property of level-1 versus level-3 BLAS; it requires a special form of pivoting in the QR decomposition. John Chambers modified the dqrdc Linpack subroutine for this in the original S language, and Ross Ihaka made similar modifications for R back in the old days. No one has made similar modifications to Lapack's QR decomposition. Every once in a while some of us in R-core discuss alternative approaches that would allow using the Lapack, level-3 BLAS code, but no one has sat down and coded them up to see if they would work.
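In R notation, both points are easy to check (a quick illustration, with small random matrices standing in for real data):

```r
A <- matrix(rnorm(12), 4, 3)
all.equal(crossprod(A),  t(A) %*% A)   # TRUE: crossprod(A) is A'A
all.equal(tcrossprod(A), A %*% t(A))   # TRUE: tcrossprod(A) is AA'

# The pivoted QR underlying lm() handles rank deficiency cleanly:
x1 <- rnorm(10)
x2 <- x1                               # exact copy, so the model matrix is rank-deficient
coef(lm(rnorm(10) ~ x1 + x2))          # the coefficient for x2 is reported as NA
```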
Posted by: Douglas Bates | June 12, 2010 at 06:00
I also commented in a thread on the R-help mailing list that a naive calculation of chol(crossprod(A)) where A is 10000 by 5000 (a combination of your first two examples) is almost certain to produce useless answers. Admittedly there are occasions where one wants the Cholesky decomposition of very large, symmetric matrices - I do this all the time in the lme4 package for mixed-effects models - but you must be very careful of the structure and there must be some form of regularization employed before the answers are of any value. Without regularization (inflating the diagonal of crossprod(A) in some way) the crossproduct matrix is virtually guaranteed to be singular. The problem with the calculation you describe is that it will complete and one could go on to calculate nonsense coefficients, etc., if you did not look closely at the result.
The calculation of singular values or principal components is a more meaningful example.
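A sketch of the kind of regularization described above (the dimensions and the ridge constant are illustrative, not from the comment):

```r
A <- matrix(rnorm(1000 * 200), 1000, 200)
CtC <- crossprod(A)

# Inflate the diagonal before decomposing so the Cholesky is well-conditioned
lambda <- 1e-8 * mean(diag(CtC))
L <- chol(CtC + diag(lambda, ncol(CtC)))
```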
Posted by: Douglas Bates | June 12, 2010 at 08:20
I know CRAN posts alternative Rblas.dll files for some common Intel chips (at least for Windows). Using these libraries does speed up many calculations. I wonder how this fits in with your benchmarks above, especially on the Intel Core Duo series (which seems to power a lot of laptops, at least).
Posted by: John Johnson | June 12, 2010 at 16:16
Hi,
Thanks for an interesting post. I'm using R on Mac OS X, and this info is new (and rather surprising) to me:
"The MacOS port of R on CRAN is linked to ATLAS, a "tuned" BLAS that uses multiple cores for computations."
Never have I seen CPU usage above 100% (I have four cores), and I see that as an indication that R is *not* using multithreading. Am I missing something here?
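One quick way to check is to time a large matrix multiply and watch Activity Monitor while it runs (a sketch; the size is illustrative):

```r
n <- 4000
A <- matrix(rnorm(n * n), n, n)
system.time(B <- A %*% A)  # sustained CPU usage above 100% implies a multi-threaded BLAS
```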
Posted by: Michael Knudsen | June 13, 2010 at 23:01
@Doug -- many thanks for the insightful comments; I've updated the post accordingly. Good point about the slightly odd Cholesky example (a function of running code for the purposes of benchmarking rather than actual analysis).
Posted by: David Smith | June 14, 2010 at 15:36
@Michael -- interesting, thanks. I use Revolution R on the Mac, but I had thought that the CRAN binary for R was linked to a multithreaded library by default. This warrants more investigation.
Posted by: David Smith | June 14, 2010 at 16:31
@David -- Sorry! I just ran some of the examples in your blog post, and my standard R 2.11.0 indeed *does* use multithreading. I probably haven't run large enough matrix calculations myself to notice it before.
Posted by: Michael Knudsen | June 15, 2010 at 04:50
A lot of information here that I can use. I'll keep following the blog.
Posted by: Peter L. | December 20, 2010 at 09:53
Will lmer (from the lme4 package):
(1) be faster using Revolution than with base R, since it relies on functions like chol()?
(2) produce correct answers, given the concerns raised in the above comments by lme4 architect Doug Bates?
If so, it will dramatically improve my use of R (I currently experience very long waits to compute multilevel models with a large data set).
Posted by: Matt | July 28, 2011 at 10:56
Thanks for the questions, Matt,
1) It will probably be a little faster on Windows if you have a multi-core processor, but the difference will depend on what analysis you do. If you try it out, let us know how it goes.
2) Yes, it will give the correct answers, exactly the same as regular R. (Doug's comment was about the benchmark we used, not the accuracy of R itself.)
Posted by: David Smith | July 28, 2011 at 11:35
Hi David/all,
I ran lmer on some of my data using both Revolution R 4.3 (enterprise/academic) and base R 2.13.1. I loaded the same workspace and performed identical functions. In every case, Revolution R was faster, although how much faster varied. None were so much faster as to revolutionize my workflow, but the gains were impressive nonetheless. I am running Windows Vista on a Core i7 920 2.67 GHz machine with 12 GB of RAM.
RevoR 4.3   R 2.13.1
    3.33       5.26
    4.94       5.82   # only model with a random slope; all others are random intercept only
    6.08       6.45
   33.13      36.83
  243.46     348.32
Posted by: Matt | July 28, 2011 at 13:43
Sorry for the poor formatting; my tabs were turned into spaces. The first number is for Revolution R and the second is for base R; e.g., 3.33 vs. 5.26. I should specify that the times are total elapsed times reported by system.time() for the five different models.
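For reference, a sketch of how such a timing is collected (the model formula and data names are placeholders, not from this thread):

```r
library(lme4)

# Hypothetical random-intercept model; y, x, subject, and dat are placeholders
t1 <- system.time(m1 <- lmer(y ~ x + (1 | subject), data = dat))
t1["elapsed"]  # the total elapsed time, as reported in the table above
```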
Posted by: Matt | July 28, 2011 at 13:48
Great post! It's great to see that there are people out there still working on improving the technology that Windows or Mac provides; the performance improvements seem pretty noticeable from the reports shown.
Posted by: VC Independent | September 01, 2011 at 11:32
Would it be possible to port this over from Linux to the Solaris OS? Solaris runs on x86, and there are versions of R for Solaris x86.
Posted by: Roger Pickering | October 18, 2011 at 08:30