by Andrie de Vries

Way back, in 2010, David Smith referred to an article in the New York times about puzzles and other unintuitive results in statistics, including the famous Monty Hall problem.

For the past 5 years I have been training as a yoga teacher, and something curious happens every year. Every year, each of the 10 students has two practical assignments: to teach a yoga class, and to make a presentation about anatomy or physiology.

## The setup

To determine the order of events, we draw a number out of a hat for each assignment. This happens in two rounds.

- In the first round, each person draws a number from 1:10 until all numbers in the hat are exhausted. At the end of the first round, you have a number that indicates in which session you teach.
- In the second round, the numbers are put back into the hat, and every person draws a second time, this time indicating in which session you do your presentation.

This means at the end of second round each person has two numbers, the first indicating the session for teaching class, the second the session for doing the presentation.

## How unusual is this?

The curious thing was that every year, 2-3 people had to do **both assignments on the same day**.

This left me wondering *is this very unusual?*

**And if unusual, just what is the probability of 3 people out of 10 doing both assignments on the same day?**

## Answer first

You can find a discussion of the results and my methods below. I find this mildly astonishing:

- The probability of
**at least one person doing two assignments on the same day**is ~63.2% - And the probability that
**two or more people have this unfortunate fate**is ~26.4%

## Conclusions

Using the simple method of successive draws from a hat is going to lead to double duty a surprisingly large number of times.

To make the process more fair, one will have to change the process of drawing lots. For example, if you draw an identical number, you have to make another draw, then replace the original number in the hat.

## The red herring

The question, as posed, describes **two** consecutive draws from a hat. However, the first draw from the hat doesn't really matter. The reason is that the first set of draws assigns a position to each person. But we are only interested whether the second draw matches the person's position after the first round. So it doesn't really matter how the first position was determined. We only care that the positions are the same.

This means the first hat is completely irrelevant to the solution.

## My first, incorrect attempt at a solution

My first line of thought was that the probability of drawing two identical numbers must be 1/10. And if this is true, then one can model the events as a binomial distribution with a probability of 0.1. (See the code at the bottom of this post).

Now, in this analysis it seems that the probability of having one or more people unlucky enough to present on the same day is about 40%.

However, this is **incorrect thinking** because it assumes each draw from the hat is completely **independent** from the previous draw. Clearly this assumption is incorrect, since each person draws a number from the hat in succession, until all the numbers are exhausted.

Thus the numbers available to person 2 are dependent on the number already drawn by person 1, and so on down the chain.

To correct for this, we can calculate all the permutations (orderings of the outcome) before calculating the probability.

## Enumerating all possible outcomes

In classical statistics, the probability of an event is the ratio of observed outcomes to the total theoretically possible outcomes.

In our example, this means we have to enumerate (write out) all of the possible draws from the hat, then count how many times a match occurs.

To illustrate this, here is the enumerated list of all possible outcomes of three numbers from a hat. The first row contains (1, 2, 3). The last columns counts how many times there is a match between the outcome and the position. For the first permutation, all three outcomes matched, and thus the count is equal to 3. However, for the second row of outcomes, only a single outcome matched the position.

Permutation number |
Position One |
Position Two |
Position Three |
Count of (Position == draw) |

1 | 1 |
2 |
3 |
3 |

2 | 1 |
3 |
2 |
1 |

3 | 2 |
3 |
1 |
0 |

4 | 2 |
1 |
3 |
1 |

5 | 3 | 1 |
2 |
0 |

6 | 3 |
2 |
1 |
1 |

In mathematics, these enumerations are called permutations.

To correctly estimate the probability in this example, we have to generate a list of all possible outcomes of the draw. In other words, compute all permutations of the number sequence 1:10.

Essentially, this means using brute force to generate all permutations and then see how many match up with a 1 in the first place, 2 in the second place, etc.

The final step is to use the permutation table to compute the probabilities. Making a table of the counts in the final column of the previous table you get:

Count of (Position == draw) |
Number of times this happened |
Probability |

0 | 2 | 33.3% |

1 | 3 | 50.0% |

2 | 0 | 0.0% |

3 | 1 | 16.7% |

6 |
100% |

This table says that, with only three people drawing from a hat, the probability of nobody doing their assignments on the same day is 2/6, i.e. 33.3%.

So, having done this calculation by hand for a small number of participants, how can we use R to do the same calculation for larger groups?

Calculating the permutations using R

Using R, you can easily do this with the function permn() in the package combinat, available from CRAN or MRAN.

(Again, the code is at the bottom of this post).

In numbers:

0 1 2 3 4 5 6 7 8 10

0.368 0.368 0.184 0.061 0.015 0.003 0.001 0.000 0.000 0.000

## Using simulation rather than computing all permutations

It is computationally very expensive to compute **all** possible permutations. Also, depending on the implementation of the permutation function, you may run out of memory before completing the calculation. On my laptop, I can quite easily compute all permutation of 10 numbers, but I start to run into memory problems with 14 permutation.

Thus you may want to make use of **simulatation** rather than the exact calculation. In a simulation, you use random numbers to generated a large number of samples. If the number of samples is sufficiently large, your simulated distribution will be close to the exact distribution.

I used sampling to simulate 1 million draws from a hat, and created the simulation plot below.

The overall shape is very, very similar to the theoretical distribution.

However, notice that one of the differences is that the **very rare event** of all 10 numbers being equal doesn't show up at all, while it showed up in the exact permutation calculation. (Notice the x-axis doesn't contain the 9 or 10, while in the previous graphic both appear.)

The code

The code is reasonably simple, although computing all permutations will take some seconds to run:

## Comments

You can follow this conversation by subscribing to the comment feed for this post.