I'm a big fan of using R to simulate data. When I'm trying to understand a data set, my first step is sometimes to simulate data from a model and compare the results to the data, before I go down the path of fitting an analytical model directly. Simulations are easy to code in R, but they can sometimes take a while to run — especially if there are a bunch of parameters you want to explore, which in turn requires a bunch of simulations.
When your pc only has four core processors and your parallel processing across three of the. #DataScience #RStats #Waiting #Multitasking pic.twitter.com/iVkkr7ibox
— Patrick Williams (@unbalancedDad) January 24, 2018
In this post, I'll provide a simple example of running multiple simulations in R, and show how you can speed up the process by running the simulations in parallel: either on your own machine, or on a cluster of machines in the Azure cloud using the doAzureParallel package.
To demonstrate this, let's use a simple simulation example: the birthday problem. Simply stated, the goal is to calculate, for a room of \(N\) people, the probability that someone in the room shares a birthday with someone else in the room. Now, you can calculate this probability analytically — R even has a function for it — but this is one of those situations where it's quicker for me to write a simulation than it would be to figure out the analytical result. (Better yet, my simulation accounts for February 29 birthdays, which the standard result doesn't. Take that, distribution analysis!) Here's an R function that simulates 100,000 rooms and returns the fraction of them in which a room of n people includes a shared birthday:
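(What follows is a sketch of such a function, written to be consistent with the pbirthdaysim(n), nsims, feb29, bdays and probs names used elsewhere in this post; weighting February 29 as one day in 1,461 is one reasonable way to handle leap years.)

pbirthdaysim <- function(n, nsims = 100000, feb29 = TRUE) {
  ## Simulate nsims rooms of n people and return the fraction of rooms
  ## with at least one shared birthday. Feb 29 is day 366; the other
  ## 365 days are weighted 4x, since Feb 29 comes around every 4 years.
  bdays <- 1:366
  probs <- c(rep(4, 365), 1)
  if (!feb29) {
    bdays <- 1:365
    probs <- rep(1, 365)
  }
  anydup <- function(ignored) {
    ## one room: n random birthdays; TRUE if any two coincide
    any(duplicated(sample(bdays, n, prob = probs, replace = TRUE)))
  }
  sum(sapply(seq_len(nsims), anydup)) / nsims
}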
You can compare the results to the built-in function pbirthday to make sure it's working, though you should include the feb29=FALSE option for an apples-to-apples comparison. The more simulations (nsims) you use, the closer the results will be.
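For example (the simulated value wanders a little from run to run, while the pbirthday() value is exact):

pbirthdaysim(23, feb29 = FALSE)   ## simulation: approximately 0.507
pbirthday(23)                     ## analytic: 0.5072972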
We want to find the number of people in the room where the probability of a match is closest to 50%. We're not exactly sure what that number is, but we can find out by calculating the probability for a range of room sizes, plotting the results, and seeing where the probability crosses 0.50. Here's a simple for loop that calculates the probability for room sizes from 1 to 100 people:
bdayp <- 1:100
for (n in 1:100) bdayp[n] <- pbirthdaysim(n)
plot(bdayp, xlab="People in room", ylab="Probability of shared birthday")
abline(h=0.5)
If you look at where the horizontal line crosses the curve, you can see that with 23 people in the room there's a 50% probability of a match. With more than 60 people, a match is practically guaranteed.
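Rather than eyeballing the plot, you can also ask R which room size comes closest to 50%, using the bdayp vector computed above:

## index of the probability closest to 0.5 -- should come out to 23
which.min(abs(bdayp - 0.5))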
On my Surface Book (16GB RAM, two 2.6GHz cores) using R 3.4.3, that simulation takes about 6 minutes (361 seconds). And even though I have 2 cores, according to Task Manager the CPU utilization hovered at a little over 50%, because R — as a single-threaded application — was using just one core. (The operating system and my other apps accounted for the rest of the usage.)
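If you want to reproduce the timing on your own hardware, wrapping the loop in system.time() works fine (your numbers will of course differ from mine):

timing <- system.time({
  bdayp <- 1:100
  for (n in 1:100) bdayp[n] <- pbirthdaysim(n)
})
timing["elapsed"]   ## about 360 seconds on the laptop described above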
So here's one opportunity to speed things up: we could run two R sessions on my laptop at the same time, each doing half the simulations, and then combine the results at the end. Fortunately, we don't have to break up our simulation manually: there are R packages to streamline that process. One such package is the built-in R package parallel, which provides parallel equivalents to the "apply" family of functions. Here, though, I'm going to use the foreach function, which is similar to a for loop and — with the %dopar% operator — runs iterations in parallel. (You can find the foreach package on CRAN.) On my laptop, I can use the doParallel package to make foreach use the aforementioned parallel package in the background to run two iterations at a time.
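Here's roughly what that looks like on the laptop (a sketch: doParallel attaches foreach for you, makeCluster(2) matches my two cores, and since foreach returns a list by default I flatten it back to a vector at the end):

library(doParallel)   ## also attaches foreach and parallel

cl <- makeCluster(2)          ## one R worker per core
registerDoParallel(cl)        ## make %dopar% use those workers

## same 100 room sizes as before, but two at a time
bdayp <- foreach(n = 1:100) %dopar% pbirthdaysim(n)
bdayp <- unlist(bdayp)

stopCluster(cl)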
This time, my CPU utilization stayed above 90% across both cores (while the fans whirred madly), and the simulation took 316 seconds. Not a huge improvement, but definitely faster. (To be fair, I was also on a video call while the simulation was running, which itself consumed quite a bit of CPU.) That raises the question: can we go faster if we use more than one computer, ideally ones dedicated to R and not to other desktop stuff? The answer, as you might have guessed, is yes.
The main problem here is getting access to a cluster: not many of us have a rack of machines to ourselves. But in this age of cloud computing, we can rent one for as long as we like, and for small simulations this can be pretty cheap, too. With the doAzureParallel package I can spin up a cluster of any desired size and power, and use the same foreach code as above to run my simulation on it, all controlled from my local R installation.
To use doAzureParallel, you'll need the following:
- An Azure account (if you don't have one, a new free account includes $200 in credits), and
- An Azure Storage service and an Azure Batch service set up in your account, and their associated keys provided in a credentials.json file. (The instructions explain how to do this, and it only takes a few minutes to set up; see the sketch after this list for generating the template files.) Azure Batch is the cluster management service we'll be using to create our clusters for the R simulations.
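One small convenience worth mentioning: as best I recall, doAzureParallel can generate skeleton versions of both JSON files for you to fill in (the function names below are my recollection of the package's helpers; check the repository docs if they've changed):

library(doAzureParallel)

## write template config files to the working directory, then
## paste in your Batch and Storage account names and keys
generateCredentialsConfig("credentials.json")
generateClusterConfig("cluster.json")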
It can take a little while to boot the cluster, but once it's up and running you can use it as much as you like. And best of all, you can launch the simulation on the cluster from your local laptop using almost exactly the same code as before:
library(doAzureParallel)
setCredentials("credentials.json")
cluster <- makeCluster("cluster.json")
registerDoAzureParallel(cluster)
bdayp <- foreach(n=1:100) %dopar% pbirthdaysim(n)
(That's why I used foreach instead of the parallel package for the local run above: you can change the "backend" for the computation without changing the foreach statement itself.) Using a cluster of eight 2-core nodes (as specified in this cluster.json file), the simulation took just 54 seconds. That doesn't include spinning up the cluster itself (the first four lines above), but it does include sending the jobs to the nodes, waiting for the tasks to complete, and merging the results (which happens in line 5). Of course, once you've launched a cluster you can re-use it instantly, and you can even resize it dynamically from R to provide more (or less) computing power.
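And when you're finished with the cluster, it's worth shutting it down so you stop paying for idle VMs; if I'm remembering the package correctly, doAzureParallel provides its own stopCluster() for this:

## delete the Azure Batch pool when you're done with it
## (assumes doAzureParallel's stopCluster(); see the package docs)
stopCluster(cluster)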
In my local region, that particular virtual machine (VM) type currently costs US$0.10 an hour. (There are many VM sizes available.) We used 8 VMs for about a minute, so the total cost of that simulation was less than 2 cents. And we can make things even cheaper — up to an 80% discount! — by using pre-emptible VMs: you can add some (or all!) of them to your cluster with the lowPriorityNodes field in the cluster.json file. Using low-priority VMs doesn't affect VM performance, but Azure may shut some of them down if it runs out of surplus capacity. If this happens while you're using doAzureParallel, you might not even notice: it will automatically resubmit the affected iterations to the remaining nodes, or wait for new VMs. You'll still get the same results, but it may take longer than usual.
Now, simulating the birthday problem is a particularly simple example — deliberately so. Most importantly, it doesn't require any data to be moved to the cluster, which is an important consideration for performance. (We'll look at that issue in a later blog post.) Our simulation also doesn't use any specialized R packages that need to be installed on the cluster. As it happens, in the cluster specification we set things up to use the tidyverse Docker image from Rocker, so most packages we'd need were already there. But if you want to use a different image for the nodes, or need to ship a custom R package or startup script to each node, you can specify that in the cluster.json file as well.
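For reference, a cluster.json along these lines is what I have in mind (the field names and the vmSize shown are from my memory of the doAzureParallel configuration format and may differ between package versions or from the file used here, so treat this as a sketch and compare it against the examples in the repository):

{
  "name": "bday-cluster",
  "vmSize": "Standard_D2_v2",
  "maxTasksPerNode": 1,
  "poolSize": {
    "dedicatedNodes":   { "min": 0, "max": 0 },
    "lowPriorityNodes": { "min": 8, "max": 8 },
    "autoscaleFormula": "QUEUE"
  },
  "containerImage": "rocker/tidyverse:latest",
  "rPackages": {
    "cran": [],
    "github": [],
    "bioconductor": []
  },
  "commandLine": []
}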
If you'd like to try out the doAzureParallel package yourself, you can install it and find setup instructions and documentation at the GitHub repository below.
GitHub (Azure): doAzureParallel
Using the replacement anydup function below,
anydup <- function(ignored = NULL) {
  ## draw n birthdays by index with sample.int, and test for a repeat
  ## with anyDuplicated.default, avoiding the overhead of sample()
  ## and any(duplicated())
  s <- sample.int(length(bdays), n, replace = TRUE, prob = probs)
  a <- anyDuplicated.default(s)
  as.logical(a != 0L)
}
the simulation loop should be around 4 times faster.
Posted by: minem | January 26, 2018 at 00:50
Thanks @minem. I'm sure there are lots of ways to speed up that code -- the easiest would be to use pbirthday() directly. You can also get a good estimate with way fewer than 100k simulations. But in this case it was useful to me for it to be slower rather than faster: it made it easier to show the changes when you go parallel. :)
Posted by: David Smith | January 26, 2018 at 07:05