*by Allison Koenecke, Data Scientist, AI & Research Group at Microsoft, with acknowledgements to Amita Gajewar and John-Mark Agosta.*

*In this tutorial, Allison Koenecke demonstrates how Microsoft could recommend to customers the next set of services they should acquire as they expand their use of the Azure Cloud, by using a temporal extension to conventional Market Basket Analysis.** *

Market Basket Analysis (MBA) answers a standard business question: given a set of grocery store receipts, can we find bundles of products often purchased together (e.g., peanut butter and jelly)? Suppose we instead want to model the evolution of a customer’s services, such as determining whether buying peanut butter in the past indicates a higher likelihood of buying bread in the future. For this, we apply a sequential version of MBA, sometimes called “sequential itemset mining” or “sequential pattern mining”, to introduce a time component to the analysis [1]. Sequential itemset mining has been applied across many industries, from determining a patient’s sequence of medical prescriptions [2] to detecting misuse intrusions such as application layer attacks [3]. In this tutorial, when given a series of purchases over time, we determine whether we can (1) find product bundles that we expect to be consumed simultaneously, and also (2) examine how these bundles evolve over time.

In the following tutorial, we answer both questions using the R package arulesSequences [4], which implements the SPADE algorithm [5]. Concretely, given data in an Excel spreadsheet containing historical customer service purchase data, we produce two separate Excel sheet deliverables: a list of service bundles, and a set of temporal rules showing how service bundles evolve over time. We will focus on interpreting the latter result by showing how to use temporal rules in making predictive sales recommendations.

Our running example below is inspired by the need for Microsoft’s Azure Services salespeople to suggest which additional products to recommend to customers, given the customers’ current cloud product consumption services mix. We’d like to know, for instance, if customers who have implemented web services also purchase web analytics within the next month. Actual Azure Service names have been removed for confidentiality reasons.

`read_baskets`

function within the *Figure 1: Sample conversion of data input to transaction matrix. The stand-in letters in the “ServiceLevel” column represent specific Azure services, such as Compute, Networking, Data + Storage, Web + Mobile, etc.*

We will use the SPADE (**S**equential **Pa**ttern **D**iscovery using **E**quivalence classes) algorithm for temporal market basket analysis, which is invoked with the `cspade`

function. This method, recursive by length of sequence, is detailed in Figure 2 below. For example, in the first pass we look at sequences of length 1 (i.e., we determine which individual Azure Services appear in our transaction data). Based on the most frequent single-length sequences (e.g., “A” appears more often than “D”), we observe two types of two-element sequences. First, we observe two-element temporal sequences** (**“B → A” requires “B” to be purchased before “A”). Next, we observe two-element item groupings (“AB” requires “A” and “B” to exist at a certain time simultaneously). Then, based on the most frequent length-two outputs, we move on to finding three-element sequences (e.g., “D → B → A”) and three-element item groupings (e.g., “ABF”). This continues until we reach a user-defined maximum length, or until we reach a length at which we can no longer find frequent outputs.

Going forward, we will refer to the term “itemset” as any collection of items purchased by the customer, so an itemset can consist of item groupings (which could be a single item) or temporal sequences.

*Figure 2: SPADE Algorithm frequent sequence generation (original image from [5]).*

We’ve used the word “frequent” several times in the SPADE algorithm description. What exactly does this mean? There are three different terms that quantify frequency for MBA generally, and a best-case scenario involves high values in all three metrics: support, confidence, and lift. Here are their definitions:

**Support(****{ɑ****}****): **The fraction of transactions that contain an itemset {ɑ}. High support values correspond to commonly-found itemsets that are applicable to many transactions.

**Confidence(****{ɑ****} →** **{β****}****): **The likelihood that the sequential rule {ɑ} → {β} actually occurs among transactions containing itemset {ɑ}, under the constraint that itemset {ɑ} is transacted prior to {β}. Confidence can be interpreted as a conditional probability given a pre-existing transaction with itemset {ɑ}; high confidence implies a high likelihood that {β} occurs in a future transaction. Technically, confidence is defined as the ratio: Support({ɑ} and {b}) / Support({ɑ}).

**Lift({ɑ} → {β}****)**: The degree of improvement from applying the sequential rule {ɑ} → {β} over observing itemsets {ɑ} and {β} independently occurring in the wild. The lift ratio is defined by Support({ɑ} and {β}) / (Support({ɑ}) * Support({β})). Note that if itemsets {ɑ} and {β} were independently occurring events, then the denominator would equate to the numerator, and the lift value would be 1. High lift – greater than 1 – implies that the presence of pre-existing itemset {ɑ} has increased the probability that {β} will occur in a future transaction; this can be interpreted as a higher degree of dependence between itemsets {ɑ} and {β}. Conversely, a low lift value – less than 1 – implies a negative dependence between itemsets {ɑ} and {β}. This metric tells you not just what is popular with all customers, but what is most useful for a particular customer given their history (e.g., a low lift value could be interpreted as {β} acting as a substitute good for {ɑ}).

Returning to our example, recall that our input (`trans_matrix`

created in Step 1) is the month-over-month sequence of Azure services purchased. With just one line of code, the `cspade`

function returns frequent sequences mined in descending order of support. Further, to cut down on runtimes, we can define a minimum support value to be output – in the below code, we use a minimum of 0.3, which in this case gives us 18 different sequences to observe. We will see how confidence and lift play a role in Step 3 of this tutorial.

We can view results in the data frame `s1.df`

thus created, and look at the following statistics by calling `summary`

on `s1`

. Specifically, we see:

“(1) the list of the most frequent individual items (A,B, ..),

(2) the list of the most frequent set of items that occur in events (referred to as elements),

(3) the distribution of the sizes of the set of items,

(4) the distribution of the number of events in a sequence (referred to as sequence length),

(5) the minimum, maximum, mean and median support values, and

(6) the set of frequent sequences mined, ordered by its support value” [6].

*Figure 3: Sample output from the **cspade** algorithm shows 18 frequent itemsets, as well as statistics on these values (e.g., item A appears 11 out of 18 times overall as part of various itemsets, but appears by itself in a transaction as {A} only 8 times). Note that the comma delimiter between sets within each sequence implies a temporal sequence as seen in Figure 2.*

It takes just one more line of code to convert this set of sequences into a set of rules. Specifically, the “strong association rules” generated must satisfy minimum support and confidence values, and the left-hand side of the rule must occur at a time before the rule’s right-hand side. We have set the lower confidence bound to 0.5 in this example; a common default is a minimum of 0.8.

This returns the data frame shown in Figure 4. We can interpret the “rule” column as showing which bundles of Azure services imply consumption of additional services in coming months. Further, for all rules, we can compare the three previously-defined metrics: support, confidence, and lift.

*Figure 4: Sample output from the **ruleInduction** method.*

How do we interpret this? We can say, for example, every single time that we have seen service D purchased by a customer, the same customer will always buy both B and F simultaneously at a later time. This is what a confidence score of 1.0 implies in Row 2 of Figure 4. Going back to Figure 1, we see that this is the case for user IDs 1 and 4. Meanwhile, the lift value of 1.0 implies that the probability of D being purchased, and the probability of both B and F being purchased afterwards, are independent of each other. So, we cannot claim that buying both B and F is dependent on having previously bought item D, but also, these events do not appear negatively dependent (i.e., it is also not the case that buying D previously makes it relatively unnecessary for the user to later buy B and F, as would be the case for the rules having a lift value of 0.5). While this output is fairly straightforward, we can do some simple cleaning to parse out the “before” and “after” steps of the rule, and sort by the metrics deemed to be most important for the specific use case.

*Figure 5: Readable output in Excel stored in **TemporalRules.csv**. The values stored in **FreqItemGroupingsInRules.csv** are a list containing items: D; B; F; B,F; and A.*

Now, our two main results are stored in the csv files, `FreqItemGroupingsInRules.csv`

and `TemporalRules.csv`

. Taken together, all itemsets that we care about will be defined by either the frequent item groupings or temporal rules. But, how do we apply these to our business case?

First, the frequent item groupings themselves are useful domain knowledge. Even if the sequential ordering of B and F are ignored, knowing that items B and F are frequently purchased together raises some business opportunities: is there a reason that they are sold separately? Do customers have some need for both items that would be served better by rolling them into one product?

Second, the temporal rules allow us to propose better-targeted recommendations over time to customers. For example, assume that we see the rule stating that “<{D}> => <{B,F}>“ with high confidence and reasonable lift (both scoring 1.0 in our example). For a customer who has previously purchased item D, we can recommend that they now buy the bundle of two items, B and F. In a future timestep, suppose our customer loves this suggestion and has now purchased B and F simultaneously. Now, we can consult the temporal rules again and find the rule “<{B,F}> => <{A}>” (which we choose due to the correct starting bundle of goods, as well as the high confidence). Now, we can further recommend that our customer adopt item A at a future time. In fact, we can parse these rules more easily using the cleaned Excel output shown in Figure 5 – our recommendation path is shown in Row 4. This style of targeted personalized recommendations is helpful to both our customers and our salespeople, and future recommendations will become increasingly accurate as more data are collected to contribute to the confidence and lift calculations.

The above steps show a computationally efficient method to find temporal rules; this is far faster than a brute-force method, such as finding frequent itemsets in each timestep using the arules package, and then manually comparing them across time. However, depending on the support, confidence, and lift cut-offs desired, it is possible for even the `arulesSequences`

method to take significant amounts of computation time. Hence, if the main desired outcome is expected to be one of the highest-scoring rules, it may be prudent to start from a high threshold frequency cut-off before testing lower values.

Aside from frequency cut-off parameters, it is worth considering the product aggregation level at which results are measured. Let us assume that we have products or services being sold that are organized in a hierarchy. It will take less time to run the SPADE algorithm on the least-granular level since the algorithm will see more repeated services being sold to different customers. If there is a desired minimum cut-off that takes too long to run on the specific product levels you desire, consider running the above code on a higher, more aggregate level of the product hierarchy, and then drill down in that set to determine which products are best to recommend in future timesteps. For example, at a higher product level within Azure Services (such as Analytics, Compute, Web + Mobile, etc.), suppose we find that we should recommend an Analytics product to a specific customer. Upon further examination, perhaps we find that Analytics-based temporal rules on more granular products indicate to recommend Machine Learning tools as opposed to Data Lake products. Hence, we are still able to make a reasonable suggestion for lower-level products without spending excessive computation time on the entire dataset.

Lastly, the above sequential pattern mining code may not be directly applicable if you: (1) care about the quantity of items being bought at any given point in time (since we simply observe the presence or absence of an itemset in this tutorial), or (2) have data that are irregular over time, but aim to predict a recommendation for a specific time interval in the future. In the former case, encoding a repeated item with separate item names (i.e., one name for each unit purchased) can allow quantity to be expressed. For the latter case, it is a best practice to use regular time intervals; this can be done by bucketing purchases by, e.g., month (if predictions are to be made over a monthly timeframe). Depending on sales structure, it may be necessary to either interpolate or report zero purchases in intermediate timesteps.

In summary, given the wide usage of Excel by finance teams tracking historical purchases, this approach offers an effective tool for providing insights on product or service adoption over time. Sequential pattern mining as implemented here can potentially be harnessed to recommend deals to a company’s salespeople, find what customers are likely to consume next, and discover which product bundles remain popular over time. Let us know what use cases you have found for this method!

- Srikant, R., Agrawal, R., Apers, P., Bouzeghoub, M., Gardarin, G. (1996) "Mining sequential patterns: Generalizations and performance improvements", Advances in Database Technology — EDBT '96 vol. 1 no. 17.
- Aileen P. Wright, Adam T. Wright, Allison B. McCoy, Dean F. Sittig, The use of sequential pattern mining to predict next prescribed medications, Journal of Biomedical Informatics, Volume 53, 2015, Pages 73-80, ISSN 1532-0464, https://doi.org/10.1016/j.jbi.2014.09.003.
- Song SJ., Huang Z., Hu HP., Jin SY. (2004) A Sequential Pattern Mining Algorithm for Misuse Intrusion Detection. In: Jin H., Pan Y., Xiao N., Sun J. (eds) Grid and Cooperative Computing - GCC 2004 Workshops. GCC 2004. Lecture Notes in Computer Science, vol 3252. Springer, Berlin, Heidelberg.
- Package ‘arulesSequences’ documentation: https://cran.r-project.org/web/packages/arulesSequences/arulesSequences.pdf.
- J. Zaki. (2001). SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning Journal, 42, 31–60.
- Data Mining Algorithms in R: https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Sequence_Mining/SPADE.
- Analyzing Transaction Data like a Data Scientist: https://rpubs.com/Mahsa_A/Part4_AnalyzeTransactionData.

Anscombe's Quartet is a famous collection of four small data sets — just 11 (x,y) pairs each — that was developed in the 1970s to emphasize the fact that sometimes, numerical summaries of data aren't enough. (For a modern take on this idea, see also the Datasaurus Dozen.) In this case, it takes visualizing the data to realize that the for data sets are qualitatively very different, even though the means, variances, and regression coefficients are all the same. In the video below for Guy in a Cube, Buck Woody uses R to summarize the data (which is conveniently built into R) and visualize it using an R script in Power BI.

The video also makes for a nice introduction to R for those new to statistics, and also demonstrates using R code to generate graphics in Power BI.

One of the differences between statistical data scientists and machine learning engineers is that while the latter group are concerned primarily with the predictive performance of a model, the former group are also concerned with the *fit* of the model. A model that misses important structures in the data — for example, seasonal trends, or a poor fit to specific subgroups — is likely to be lacking important variables or features in the source data. You can try different machine learning techniques or adjust hyperparameters to your heart's content, but you're unlikely to discover problems like this without evaluating the model fit.

One of the most powerful tools for assessing model fit is the residual plot: a scatterplot of the predicted values of the model versus the residuals (the difference between the predictions and the original data). If the model fits well, there should be no obvious relation between the two. For example, visual inspection shows us that the model below first fairly well for 19 of the 20 subgroups in the data, but there may be a missing variable that would explain the apparent residual trend for group 19.

Assessing residual plots like these is something of an art, which is probably why it isn't routine in machine learning circles. But what if we could use deep learning and computer vision to assess residual plots like these automatically? That's what Professor Di Cook from Monash University proposes, in the 2018 Belz Lecture for the Statistical Society of Australia, Human vs computer: when visualising data, who wins?

(The slides for the presentation, created using R, are also available online.) The first part of the talk also includes a statistical interpretation of neural networks, and an intuitive explanation of deep learning for computer vision. The talk also references a related paper, Visualizing Statistical Models : Removing the Blindfold (by Hadley Wickham, Dianne Cook and Jeike Hofman and published in the ASA Data Science Journal in 2015) which is well worth a read as well.

President Trump is once again causing distress by downplaying the number of deaths caused by Hurricane Maria's devastation of Puerto Rico last year. Official estimates initially put the death toll at 15 before raising it to 64 months later, but it was clear even then that those numbers were absurdly low. The government of Puerto Rico commissioned an official report from the Millikan Institute of Public Health at George Washington University (GWU) to obtain a more accurate estimate, and with its interim publication official toll stands at 2,975.

Why were the initial estimates so low? I read the interim GWU report to find out. The report itself is clearly written, quite detailed, and composed by an expert team of social and medical scientists, demographers, epidemiologists and biostatisticians, and I find its analysis and conclusions compelling. (Sadly however the code and data behind the analysis have not yet been released; hopefully they will become available when the final report is published.) In short:

- In the earliest days of the hurricane, the death-recording office was closed and without power, which suppressed the official count.
- Even once death certificates were collected, it became clear that officials throughout Puerto Rico has not been trained on how to record deaths in the event of a natural disaster, and most deaths were not attributed correctly in official records.

Given these deficiencies in the usual data used to calculate death tolls (death certificates) the GWU team used a different approach to calculate the death toll. The basis of the method was to estimate **excess mortality**, in other words, how many deaths occurred in the post-Maria period compared to the number of deaths that would have been expected if it had never happened. This calculation required two quantitative studies:

- An estimate of what the population would have been if the hurricane hadn't happened. This was based on a GLM model of monthly data from the prior years, accounting for factors including recorded population, normal emigration and mortality rates.
- The total number of deaths in the post-Maria period, based on death certificates from the Puerto Rico government (irrespective of how the cause of death was coded).
- (A third study examined the communication protocols before, during and after the disaster. This study did not affect the quantiative conclusions, but formed the basis of some of the report's recommendations.)

The difference between the actual mortality, and the estimated "normal" mortality formed the basis for the estimate of excess deaths attributed to the hurricane. You can see those estimates of excess deaths one month, three months, and five months after the event in the table below; the last column represents the current official estimate.

These results are consistent in scale with another earlier study by Nishant Kishore et al. (The data and R code behind this study is available on GitHub.) This study attempted to quantify deaths attributed to the hurricane directly, by visiting 3299 randomly chosen households across Puerto Rico. At each household, inhabitants were asked about any household members who had died and their cause of death (related to or unrelated to the hurricane), and whether anyone had left Puerto Rico because of the hurricane. From this survey, the paper's authors extrapolated the number hurricane-related deaths to the entire island. The headline estimate of 4,625 at three months is somewhat larger than the middle column of the study above, but due to the small number of recorded deaths in the survey sample the 95% confidence interval is also much larger: 793 to 8498 excess deaths. (Gelman's blog has some good discussion of this earlier study, including some commentary from the authors.)

With two independent studies reporting excess deaths well into the thousands attributable directly to Hurricane Maria, it's a fair question to ask whether a more effective response before and after the storm could have reduced the scale of this human tragedy.

Milken Institute School of Public Health: Study to Estimate the Excess Deaths from Hurricane Maria in Puerto Rico

*by Antony Unwin, **University of Augsburg, Germany*

There are many different methods for identifying outliers and a lot of them are available in **R**. But are outliers a matter of opinion? Do all methods give the same results?

Articles on outlier methods use a mixture of theory and practice. Theory is all very well, but outliers are outliers because they don’t follow theory. Practice involves testing methods on data, sometimes with data simulated based on theory, better with `real’ datasets. A method can be considered successful if it finds the outliers we all agree on, but do we all agree on which cases are outliers?

The Overview Of Outliers (O3) plot is designed to help compare and understand the results of outlier methods. It is implemented in the **OutliersO3** package and was presented at last year’s useR! in Brussels. Six methods from other **R** packages are included (and, as usual, thanks are due to the authors for making their functions available in packages).

The starting point was a recent proposal of Wilkinson’s, his HDoutliers algorithm. The plot above shows the default O3 plot for this method applied to the stackloss dataset. (Detailed explanations of O3 plots are in the **OutliersO3** vignettes.) The stackloss dataset is a small example (21 cases and 4 variables) and there is an illuminating and entertaining article (Dodge, 1996) that tells you a lot about it.

Wilkinson’s algorithm finds 6 outliers for the whole dataset (the bottom row of the plot). Overall, for various combinations of variables, 14 of the cases are found to be potential outliers (out of 21!). There are no rows for 11 of the possible 15 combinations of variables because no outliers are found with them. If using a tolerance level of 0.05 seems a little bit lax, using 0.01 finds no outliers at all for any variable combination.

Trying another method with tolerance level=0.05 (*mvBACON* from **robustX**) identifies 5 outliers, all ones found for more than one variable combination by *HDoutliers*. However, no outliers are found for the whole dataset and only one of the three variable combinations where outliers are found is a combination where *HDoutliers* finds outliers. Of course, the two methods are quite different and it would be strange if they agreed completely. Is it strange that they do not agree more?

There are four other methods available in **OutliersO3** and using all six methods on stackloss a tolerance level of 0.05 identifies the following numbers of outliers:

```
## HDo PCS BAC adjOut DDC MCD
## 14 4 5 0 6 5
```

Each method uses what I have called the tolerance level in a rather different way. Sometimes it is called alpha and sometimes (1-alpha). As so often with **R**, you start wondering if more consistency would not be out of place, even at the expense of a little individuality. **OutliersO3** transforms where necessary to ensure that lower tolerance level values mean fewer outliers for all methods, but no attempt has been made to calibrate them equivalently. This is probably why `adjOutlyingness`

finds few or no outliers (results of this method are mildly random). The default value, according to `adjOutlyingness`

’s help page, is an alpha of 0.25.

The stackloss dataset is an odd dataset and small enough that each individual case can be studied in detail (cf. Dodge’s paper for just how much detail). However, similar results have been found with other datasets (milk, Election2005, diamonds, …). The main conclusion so far is that different outlier methods identify different numbers of different cases for different combinations of variables as different from the bulk of the data (i.e. as potential outliers)—or are these datasets just outlying examples?

There are other outlier methods available in **R** and they will doubtless give yet more different results. The recommendation has to be to proceed with care. Outliers may be interesting in their own right, they may be errors of some kind—and we may not agree whether they are outliers at all.

[Find the R code for generating the above plots here: OutliersUnwin.Rmd]

Bias is a major issue in machine learning. But can we develop a system to "un-bias" the results? In this keynote at NIPS 2017, Kate Crawford argues that treating this as a technical problem means ignoring the underlying social problem, and has the potential to make things worse.

You can read more about biases in AI systems in this article at the Microsoft AI blog.

Whether we're developing statistical models, training machine learning recognizers, or developing AI systems, we start with data. And while the suitability of that data set is, lamentably, sometimes measured by its size, it's always important to reflect on where those data come from. Data are not neutral: the data we choose to use has profound impacts on the resulting systems we develop. A recent article in Microsoft's AI Blog discusses the inherent biases found in many data sets:

“The people who are collecting the datasets decide that, ‘Oh this represents what men and women do, or this represents all human actions or human faces.’ These are types of decisions that are made when we create what are called datasets,” she said. “What is interesting about training datasets is that they will always bear the marks of history, that history will be human, and it will always have the same kind of frailties and biases that humans have.”

— Kate Crawford, Principal Researcher at Microsoft Research and co-founder of AI Now Institute.“When you are constructing or choosing a dataset, you have to ask, ‘Is this dataset representative of the population that I am trying to model?’”

— Hanna Wallach, Senior Researcher at Microsoft Research NYC.

The article discusses the consequences of the data sets that aren't representative of the populations they are set to analyze, and also the consequences of the lack of diversity in the fields of AI research and implementation. Read the complete article at the link below.

Microsoft AI Blog: Debugging data: Microsoft researchers look at ways to train AI systems to reflect the real world

Modern slot machines (fruit machine, pokies, or whatever those electronic gambling devices are called in your part of the world) are designed to be addictive. They're also usually quite complicated, with a bunch of features that affect the payout of a spin: multiple symbols with different pay scales, wildcards, scatter symbols, free spins, jackpots ... the list goes on. Many machines also let you play multiple combinations at the same time (20 lines, or 80, or even more with just one spin). All of this complexity is designed to make it hard for you, the player, to judge the real odds of success. But rest assured: in the long run, you always lose.

All slot machines are designed to have a "house edge" — the percentage of player bets retained by the machine in the long run — greater than zero. Some may take 1% of each bet (over a long-run average); some may take as much as 15%. But every slot machine takes *something*.

That being said, with all those complex rules and features, trying to calculate the house edge, even when you know all of the underlying probabilities and frequencies, is no easy task. Giora Simchoni demonstrates the problem with an R script to calculate the house edge of an "open source" slot machine *Atkins Diet*. Click the image below to try it out.

This virtual machine is at a typical level of complexity of modern slot machines. Even though we know the pay table (which is always public) and the relative frequency of the symbols on the reels (which usually isn't), calculating the house edge for this machine requires several pages of code. You could calculate the expected return analytically, of course, but it turns out to be a somewhat error-prone combinatorial problem. The simplest approach is to simulate playing the machine 100,000 times or so. Then we can have a look at the distribution of the payouts over all of these spins:

The *x* axis here is log(Total Wins + 1), in log-dollars, from a single spin. It's interesting to see the impact of the bet size (which increases variance but doesn't change the distribution), and the number of lines played. Playing one 20-line game isn't the same as playing 20 1-line games, because the re-use of the symbols means multi-line wins are *not* independent: a high-value symbol (like a wild) may contribute to wins on multiple lines. Conversely, losing combinations have a tendency to cluster together, too. It all balances in the end, but the possibility of more frequent wins (coupled with higher-value losses) is apparently appealing to players, since many machines encourage multi-line play.

Nonetheless, whichever method you play, the house edge is always positive. For *Atkins Diet*, it's about between 3% and 4%. (The simulations suggest 4% for single-line play and about 3.2% for multi-line play, but per-line expected returns are the same in each case.) You can see the details of the calculation, and the complete R code behind it, at the link below.

Giora Simchoni: Don't Drink and Gamble (via the author)

*by Jocelyn Barker, Data Scientist at Microsoft*

I have a confession to make. I am not just a statistics nerd; I am also a role-playing games geek. I have been playing Dungeons and Dragons (DnD) and its variants since high school. While playing with my friends the other day it occurred to me, DnD may have some lessons to share in my job as a data scientist. Hidden in its dice rolling mechanics is a perfect little experiment for demonstrating at least one reason why practitioners may resist using statistical methods even when we can demonstrate a better average performance than previous methods. It is all about distributions. While our averages may be higher, the distribution of individual data points can be disastrous.

Partially because it means I get to think about one of my hobbies at work. More practically, because consequences of probability distributions can be hard to examine in the real world. How do you quantify the impact of having your driverless car misclassify objects on the road? Games like DnD on the other hand were built around quantifying the impact of decisions. You decide to do something, add up some numbers that represent the difficulty of what you want to do, and then roll dice to add in some randomness. It also means it is a great environment to study how the distribution of the randomness impacts the outcomes.

One of the core mechanics of playing DnD and related role-playing games involve rolling a 20 sided die (often referred to as a d20). If you want your character to do something like climb a tree, there is some assigned difficulty for it (eg. 10) and if you roll higher than that number, you achieve your goal. If your character is good at that thing, they get to add a skill modifier (eg. 5) to the number they roll making it more likely that they can do what they wanted to do. If the thing you want to do involves another character, things change a little. Instead of having a set difficulty like for climbing a tree, the difficulty is an opposed roll from the other player. So if Character A wants to sneak past Character B, both players roll d20s and Character A adds their “stealth” modifier against Character B’s “perception” modifier. Whoever between them gets a higher number wins with a tie going to the “perceiver”. Ok, I promise, that is all the DnD rules you need to know for this blog post.

So here is where the stats nerd in me got excited. Some people change the rules of rolling to make different distributions. The default distribution is pretty boring, 20 numbers with equal probability:

One common way people modify this is with the idea of “critical”. The idea is that sometimes people do way better or worse than average. To reflect this, if you roll a 20, instead of adding 20 to your modifier, you add 30. If you roll a 1, you subtract 10 from your modifier.

Another stats nerd must have made up the last distribution. The idea for constructing it is weird, but the behavior is much more Gaussian. It is called 3z8 because you roll 3 eight-sided dice that are numbered 0-7 and sum them up giving a value between 0 and 21. 1-20 act as in the standard rules, but 0 and 21 are now treated like criticals (but at a much lower frequency than before).

The cool thing is these distributions have almost identical expected values (10.5 for d20, 10.45 with criticals, and 10.498 for 3z8), but very different distributions. How do these distributions affect the game? What can we learn from this as statisticians?

To examine how our distributions affects outcome, we will look at a scenario where a character, who we will call the rogue, wants to sneak past three guards. If any of the guard’s perception is greater than or equal to the rogue’s stealth, we will say the rogue loses the encounter, if they are all lower, the rogue is successful. We can already see the rogue is at a disadvantage; any one of the guards succeeding is a failure for her. We note that assuming all the guards have the same perception modifier, the actual value of the modifier for the guards doesn’t matter, just the difference between their modifier and the modifier of the rogue because the two modifiers are just a scalar adjustment of the value rolled. In other words, it doesn’t matter if the guards are average Joes with a 0 modifier and the rogue is reasonably sneaky with a +5 or if the guards are hyper alert with a +15 and the rogue is a ninja with a +20; the probability of success is the same in the two scenarios.

Lets start off getting a feeling for what the dice rolls will look like. Since the rogue is only rolling one die, her probability distribution looks the same as the distribution of the dice from the previous section. Now, lets consider the guards. In order for the rogue to fail to sneak by, she only needs to be seen by one of the guards. That means we just need to look at the probability that the maximum roll for one of the guards is \(n\) for \(n \in 1,..,20\). We will start with our default distribution. The number of ways you can have 3 dice roll a value of \(n\) or less is \(n^3\). Therefore the number of ways you can have the max value of the dice be exactly \(n\) is the number of ways you can roll \(n\) or less minus the number of ways where all the dice are \(n - 1\) or less giving us \(n^3 - (n - 1)^3\) ways to roll a max value of \(n\). Finally, we can divide by the total number of roll combinations for an 20-sided dice, \(20^3\), giving us our final probabilities of:

\[\frac{n^3 - (n-1)^3}{20^3} \textrm{for} \{n \in 1, ..., 20\}\]

The only thing that changes when we add criticals to the mix is that now the probabilities previously assigned to 1 get re-assigned to -10 and those assigned to 20 get reassigned to 30 giving us the following distribution.

This means our guards get a critical success ~14% of the time! This will have a big impact on our final distributions.

Finally, lets look at the distribution for the guards using the 3z8 system.

In the previous distributions, the maximum value became the single most likely roll. Because of the the low probability of rolling a 21 in the 3z8 distribution, this distribution still skews right, but peaks at 14. In this distribution, criticals only occur ~0.6% of the time; much less than the previous distribution.

Now that we have looked at the distributions of the rolls for the rogue and the guards, lets see what our final outcomes look like. As previously mentioned, we don’t need to worry about the specific modifiers for the two groups, just the difference between them. Below is a plot showing the relative modifier for the rogue on the x-axis and the probability of success on the y-axis for our three different probability distributions.

We see that for the entire curve, our odds of success goes down when we add criticals and for most of the curve, it goes up for 3z8. Lets think about why. We know the guards are more likely to roll a 20 and less likely to roll a 1 from the distribution we made earlier. This happens about 14% of the time, which is pretty common, and when it happens, the rogue has to have a very high modifier and still roll well to overcome it unless they also roll a 20. On the other hand, with 3z8 system, criticals are far less common and everyone rolls close to average more of the time. The expected value for the rogue is ~10.5, where as it is ~14 for the guards, so when everyone performs close to average, the rogue only needs a small modifier to have a reasonable chance of success.

To illustrate how much of a difference there is between the two, lets consider what would be the minimum modifier needed to have a certain probability of success.

Probability | Roll 1d20 | With Criticals | Roll 3z8 |
---|---|---|---|

50% | 6 | 7 | 4 |

75% | 11 | 13 | 8 |

90% | 15 | 22 | 11 |

95% | 17 | 27 | 13 |

We see from the table that reasonably small modifiers make a big difference in the 3z8 system, where as very large modifiers are needed to have a reasonable chance of success when criticals are added. To give context on just how large this is, when a someone is invisible, this only adds +20 to their stealth checks when they are moving. In other words, in the system with criticals, our rogue could literally be invisible sneaking past a group of not particularly observant gaurds and have a reasonable chance of failing.

The next thing to consider is our rogue may have to make multiple checks to sneak into a place (eg. one to sneak into the courtyard, one to sneak from bush to bush, and then a final one to sneak over the wall). If we look at the results of our rogue making three successful checks in a row, our probabilities change even more.

Probability | Roll 1d20 | With Criticals | Roll 3z8 |
---|---|---|---|

50% | 12 | 15 | 9 |

75% | 16 | 23 | 11 |

90% | 18 | 28 | 14 |

95% | 19 | 29 | 15 |

Making multiple checks exaggerates the differences we saw previously. Part of the reason for the poor performance with the addition of criticals (and for the funny shape of the critical curve) is there is a different cost associated with criticals for the rogue compared to the guards. If the guards roll a 20 or the rogue rolls a 1, when criticals are in play, the guards will almost certainly win, even if the rogue has a much higher modifier. On the other hand, if the guard rolls a 1 or the rogue rolls a 20, there isn’t much difference in outcome between getting that critical and any other low/high roll; play continues to the next round.

Many times as data scientists, we think of the predictions we make as discrete data points and when we evaluate our models we use aggregate metrics. It is easy to lose sight that our predictions are samples from a probability distribution, and that aggregate measures can obscure how well our model is really performing. We saw in the example with criticals where big hits and misses can make a huge impact on outcomes, even if the average performance is largely the same. We also saw with the 3z8 system where decreasing the expected value of the roll can actually increase performance by making the “average” outcome more likely.

Does all of this sound contrived to you, like I am trying to force an analogy? Let me make a concrete example from my real life data science job. I am responsible for making the machine learning revenue forecasts for Microsoft. Twice a quarter, I forecast the revenue for all of the products at Microsoft world wide. While these product forecasts do need to be accurate for internal use, the forecasts are also summed up to create segment level forecasts. Microsoft’s segment level forecasts go to Wall Street and having our forecasts fail to meet actuals can be a big problem for the company. We can think about our rogue sneaking past our guards as being an analogy for nailing the segment level forecast. If I succeed for most of the products (our individual guards) but have a critical miss of $1 billion error on one of them, then I have a $1 billion error for the segment and I failed. Also like our rogue, one success doesn’t mean we have won. There is always another quarter and doing well one quarter doesn’t mean Wall Street will cut you some slack the next. Finally, a critical success is less valuable than a critical failure is problematic. Getting the forecasts perfect one quarter will just get your a “good job” and a pat on the back, but a big miss costs the company. In this context, it is easy to see why the finance team doesn’t take the machine learning forecasts as gospel, even with our track record of high accuracy.

So as you evaluate your models, keep our sneaky friend in mind. Rather than just thinking about your average metrics, think about your distribution of errors. Are your errors clustered nicely around the mean or are they scattershot of low and high? What does that mean for your application? Are those really low errors valuable enough to be worth getting the really high ones from time to time? Many times having a reliable model may be more valuable than a less reliable one with higher average performance, so when you evaluate, think distributions, not means.

*The charts in this post were all produced using the R language. To see the code behind the charts, take a look at this R Markdown file.*