I'm a big fan of using R to simulate data. When I'm trying to understand a data set, my first step is sometimes to simulate data from a model and compare the results to the data, before I go down the path of fitting an analytical model directly. Simulations are easy to code in R, but they can sometimes take a while to run, especially if there are a bunch of parameters you want to explore, which in turn requires a bunch of simulations.
In this post, I'll provide a simple example of running multiple simulations in R, and show how you can speed up the process by running the simulations in parallel: either on your own machine, or on a cluster of machines in the Azure cloud using the doAzureParallel package.
To demonstrate this, let's use a simple simulation example: the birthday problem. Simply stated, the goal is to calculate, for a room of \(N\) people, the probability that someone in the room shares a birthday with someone else in the room. Now, you can calculate this probability analytically (R even has a function for it), but this is one of those situations where it's quicker for me to write a simulation than it would be to figure out the analytical result. (Better yet, my simulation accounts for February 29 birthdays, which the standard result doesn't. Take that, distribution analysis!) Here's an R function that simulates 10,000 rooms, and counts the number of times a room of n people includes a shared birthday:
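A minimal sketch of such a function might look like the following. The function name pbirthdaysim is an assumption made for illustration; the argument names nsims and feb29 follow the names used in the text. The sketch also assumes that a February 29 birthday is one quarter as likely as any other day (ignoring century rules), and it returns the proportion of simulated rooms with a match rather than the raw count.

```r
# Hypothetical name; `nsims` and `feb29` follow the argument names in the text.
pbirthdaysim <- function(n, nsims = 10000, feb29 = TRUE) {
  # Days 1..365 are ordinary days; day 366 stands in for February 29.
  bdays <- 1:366
  # Assumption: Feb 29 occurs once every four years, so every other day
  # gets four times its sampling weight.
  probs <- c(rep(4, 365), 1)

  if (!feb29) {
    # Drop Feb 29 to match the classical 365-day version of the problem.
    bdays <- bdays[-366]
    probs <- probs[-366]
  }
  probs <- probs / sum(probs)

  # Simulate one room of n people; TRUE if any two share a birthday.
  anydup <- function(i) {
    any(duplicated(sample(bdays, n, replace = TRUE, prob = probs)))
  }

  # Proportion of the nsims simulated rooms that contain a shared birthday.
  mean(vapply(seq_len(nsims), anydup, logical(1)))
}

set.seed(42)      # make the run reproducible
pbirthdaysim(23)  # estimated probability for a room of 23 people
```

Each call draws nsims independent rooms, so the estimate varies slightly from run to run unless you fix the seed.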
You can compare the results to the built-in function pbirthday to make sure it's working, though you should include the feb29=FALSE option for an apples-to-apples comparison. The more simulations (nsims) you use, the closer the results will be.
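As a rough illustration of that check, here is a sketch that compares a simulated estimate with pbirthday at two values of nsims. It uses a compact, vectorized stand-in for the simulation function (the name pbirthdaysim and the 4:1 weighting for February 29 are assumptions), defined inline so the snippet runs on its own.

```r
# Compact stand-in for the simulation function (hypothetical name).
# Each column of `rooms` is one simulated room of n birthdays.
pbirthdaysim <- function(n, nsims = 10000, feb29 = TRUE) {
  # Assumption: Feb 29 (day 366) is one quarter as likely as other days.
  probs <- if (feb29) c(rep(4, 365), 1) else rep(1, 365)
  rooms <- matrix(
    sample(length(probs), n * nsims, replace = TRUE, prob = probs),
    nrow = n
  )
  # anyDuplicated() returns 0 when a room has no repeated birthday.
  mean(apply(rooms, 2, anyDuplicated) > 0)
}

set.seed(123)
exact     <- pbirthday(23)  # analytical result for 365 equally likely days
sim_small <- pbirthdaysim(23, nsims = 1000,   feb29 = FALSE)
sim_large <- pbirthdaysim(23, nsims = 100000, feb29 = FALSE)

# Absolute error of each estimate relative to the analytical value.
print(c(exact = exact,
        err_small = abs(sim_small - exact),
        err_large = abs(sim_large - exact)))
```

The standard error of the estimate shrinks in proportion to 1/sqrt(nsims), so the larger run will typically (though not on every single run) land closer to the analytical value.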
We want to find the number of people in the room where the probability of a match is closest to 50%. We're not exactly sure what that number is, but we can find out by calculating the probability for a range of room sizes, plotting the results, and seeing where the probability crosses 0.50. Here's a simple for loop that calculates the probability for room sizes from 1 to 100 people:
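A loop along these lines might look like the sketch below. The names pbirthdaysim, bdayp, and closest are assumptions made for illustration, and the simulation function is a compact stand-in defined inline so the snippet runs on its own.

```r
# Compact stand-in for the simulation function (hypothetical name).
pbirthdaysim <- function(n, nsims = 10000, feb29 = TRUE) {
  # Assumption: Feb 29 (day 366) is one quarter as likely as other days.
  probs <- if (feb29) c(rep(4, 365), 1) else rep(1, 365)
  rooms <- matrix(
    sample(length(probs), n * nsims, replace = TRUE, prob = probs),
    nrow = n
  )
  mean(apply(rooms, 2, anyDuplicated) > 0)
}

set.seed(2024)

# Pre-allocate the result vector, then fill it one room size at a time.
bdayp <- rep(NA_real_, 100)
for (n in 1:100) {
  bdayp[n] <- pbirthdaysim(n)
}

# Plot the estimated probability against room size and mark the 50% line.
plot(1:100, bdayp,
     xlab = "People in room",
     ylab = "Probability of a shared birthday")
abline(h = 0.5, col = "red")

# Room size whose estimated probability is closest to 0.50.
closest <- which.min(abs(bdayp - 0.5))
print(closest)
```

Each iteration is independent of the others, which is what makes this loop a natural candidate for running in parallel.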