by John Mount Ph.D.
Data Scientist at Win-Vector LLC
Let's talk about the use and benefits of parallel computation in R.
IBM's Blue Gene/P massively parallel supercomputer (Wikipedia).
Parallel computing is a type of computation in which many calculations are carried out simultaneously."
The reason we care is: by making the computer work harder (perform many calculations simultaneously) we wait less time for our experiments and can run more experiments. This is especially important when doing data science (as we often do using the R analysis platform) as we often need to repeat variations of large analyses to learn things, infer parameters, and estimate model stability. Typically to get the computer to work a harder the analyst, programmer, or library designer must themselves work a bit hard to arrange calculations in a parallel friendly manner. In the best circumstances somebody has already done this for you:
- Good parallel libraries, such as the multi-threaded BLAS/LAPACK libraries included in Revolution R Open (RRO, now Microsoft R Open) (see here).
- Specialized parallel extensions that supply their own high performance implementations of important procedures such as rx methods from RevoScaleR or h2o methods from h2o.ai.
- Parallelization abstraction frameworks such as Thrust/Rth (see here).
- Using R application libraries that dealt with parallelism on their own (examples include gbm, boot and our own vtreat). (Some of these libraries do not attempt parallel operation until you specify a parallel execution environment.)
In addition to having a task ready to "parallelize" you need a facility willing to work on it in a parallel manner. Examples include:
- Your own machine. Even a laptop computer usually now has four our more cores. Potentially running four times faster, or equivalently waiting only one fourth the time, is big.
- Graphics processing units (GPUs). Many machines have a one or more powerful graphics cards already installed. For some numerical task these cards are 10 to 100 times faster than the basic Central Processing Unit (CPU) you normally use for computation (see here).
- Clusters of computers (such as Amazon ec2, Hadoop backends and more).
Obviously parallel computation with R is a vast and specialized topic. It can seem impossible to quickly learn how to use all this magic to run your own calculation more quickly. In this tutorial we will demonstrate how to speed up a calculation of your own choosing using basic R. To read on please click here.