Anscombe's Quartet is a famous collection of four small data sets — just 11 (x,y) pairs each — that was developed in the 1970s to emphasize the fact that sometimes, numerical summaries of data aren't enough. (For a modern take on this idea, see also the Datasaurus Dozen.) In this case, it takes visualizing the data to realize that the for data sets are qualitatively very different, even though the means, variances, and regression coefficients are all the same. In the video below for Guy in a Cube, Buck Woody uses R to summarize the data (which is conveniently built into R) and visualize it using an R script in Power BI.

The video also makes for a nice introduction to R for those new to statistics, and also demonstrates using R code to generate graphics in Power BI.

One of the differences between statistical data scientists and machine learning engineers is that while the latter group are concerned primarily with the predictive performance of a model, the former group are also concerned with the *fit* of the model. A model that misses important structures in the data — for example, seasonal trends, or a poor fit to specific subgroups — is likely to be lacking important variables or features in the source data. You can try different machine learning techniques or adjust hyperparameters to your heart's content, but you're unlikely to discover problems like this without evaluating the model fit.

One of the most powerful tools for assessing model fit is the residual plot: a scatterplot of the predicted values of the model versus the residuals (the difference between the predictions and the original data). If the model fits well, there should be no obvious relation between the two. For example, visual inspection shows us that the model below first fairly well for 19 of the 20 subgroups in the data, but there may be a missing variable that would explain the apparent residual trend for group 19.

Assessing residual plots like these is something of an art, which is probably why it isn't routine in machine learning circles. But what if we could use deep learning and computer vision to assess residual plots like these automatically? That's what Professor Di Cook from Monash University proposes, in the 2018 Belz Lecture for the Statistical Society of Australia, Human vs computer: when visualising data, who wins?

(The slides for the presentation, created using R, are also available online.) The first part of the talk also includes a statistical interpretation of neural networks, and an intuitive explanation of deep learning for computer vision. The talk also references a related paper, Visualizing Statistical Models : Removing the Blindfold (by Hadley Wickham, Dianne Cook and Jeike Hofman and published in the ASA Data Science Journal in 2015) which is well worth a read as well.

President Trump is once again causing distress by downplaying the number of deaths caused by Hurricane Maria's devastation of Puerto Rico last year. Official estimates initially put the death toll at 15 before raising it to 64 months later, but it was clear even then that those numbers were absurdly low. The government of Puerto Rico commissioned an official report from the Millikan Institute of Public Health at George Washington University (GWU) to obtain a more accurate estimate, and with its interim publication official toll stands at 2,975.

Why were the initial estimates so low? I read the interim GWU report to find out. The report itself is clearly written, quite detailed, and composed by an expert team of social and medical scientists, demographers, epidemiologists and biostatisticians, and I find its analysis and conclusions compelling. (Sadly however the code and data behind the analysis have not yet been released; hopefully they will become available when the final report is published.) In short:

- In the earliest days of the hurricane, the death-recording office was closed and without power, which suppressed the official count.
- Even once death certificates were collected, it became clear that officials throughout Puerto Rico has not been trained on how to record deaths in the event of a natural disaster, and most deaths were not attributed correctly in official records.

Given these deficiencies in the usual data used to calculate death tolls (death certificates) the GWU team used a different approach to calculate the death toll. The basis of the method was to estimate **excess mortality**, in other words, how many deaths occurred in the post-Maria period compared to the number of deaths that would have been expected if it had never happened. This calculation required two quantitative studies:

- An estimate of what the population would have been if the hurricane hadn't happened. This was based on a GLM model of monthly data from the prior years, accounting for factors including recorded population, normal emigration and mortality rates.
- The total number of deaths in the post-Maria period, based on death certificates from the Puerto Rico government (irrespective of how the cause of death was coded).
- (A third study examined the communication protocols before, during and after the disaster. This study did not affect the quantiative conclusions, but formed the basis of some of the report's recommendations.)

The difference between the actual mortality, and the estimated "normal" mortality formed the basis for the estimate of excess deaths attributed to the hurricane. You can see those estimates of excess deaths one month, three months, and five months after the event in the table below; the last column represents the current official estimate.

These results are consistent in scale with another earlier study by Nishant Kishore et al. (The data and R code behind this study is available on GitHub.) This study attempted to quantify deaths attributed to the hurricane directly, by visiting 3299 randomly chosen households across Puerto Rico. At each household, inhabitants were asked about any household members who had died and their cause of death (related to or unrelated to the hurricane), and whether anyone had left Puerto Rico because of the hurricane. From this survey, the paper's authors extrapolated the number hurricane-related deaths to the entire island. The headline estimate of 4,625 at three months is somewhat larger than the middle column of the study above, but due to the small number of recorded deaths in the survey sample the 95% confidence interval is also much larger: 793 to 8498 excess deaths. (Gelman's blog has some good discussion of this earlier study, including some commentary from the authors.)

With two independent studies reporting excess deaths well into the thousands attributable directly to Hurricane Maria, it's a fair question to ask whether a more effective response before and after the storm could have reduced the scale of this human tragedy.

Milken Institute School of Public Health: Study to Estimate the Excess Deaths from Hurricane Maria in Puerto Rico

*by Antony Unwin, **University of Augsburg, Germany*

There are many different methods for identifying outliers and a lot of them are available in **R**. But are outliers a matter of opinion? Do all methods give the same results?

Articles on outlier methods use a mixture of theory and practice. Theory is all very well, but outliers are outliers because they don’t follow theory. Practice involves testing methods on data, sometimes with data simulated based on theory, better with `real’ datasets. A method can be considered successful if it finds the outliers we all agree on, but do we all agree on which cases are outliers?

The Overview Of Outliers (O3) plot is designed to help compare and understand the results of outlier methods. It is implemented in the **OutliersO3** package and was presented at last year’s useR! in Brussels. Six methods from other **R** packages are included (and, as usual, thanks are due to the authors for making their functions available in packages).

The starting point was a recent proposal of Wilkinson’s, his HDoutliers algorithm. The plot above shows the default O3 plot for this method applied to the stackloss dataset. (Detailed explanations of O3 plots are in the **OutliersO3** vignettes.) The stackloss dataset is a small example (21 cases and 4 variables) and there is an illuminating and entertaining article (Dodge, 1996) that tells you a lot about it.

Wilkinson’s algorithm finds 6 outliers for the whole dataset (the bottom row of the plot). Overall, for various combinations of variables, 14 of the cases are found to be potential outliers (out of 21!). There are no rows for 11 of the possible 15 combinations of variables because no outliers are found with them. If using a tolerance level of 0.05 seems a little bit lax, using 0.01 finds no outliers at all for any variable combination.

Trying another method with tolerance level=0.05 (*mvBACON* from **robustX**) identifies 5 outliers, all ones found for more than one variable combination by *HDoutliers*. However, no outliers are found for the whole dataset and only one of the three variable combinations where outliers are found is a combination where *HDoutliers* finds outliers. Of course, the two methods are quite different and it would be strange if they agreed completely. Is it strange that they do not agree more?

There are four other methods available in **OutliersO3** and using all six methods on stackloss a tolerance level of 0.05 identifies the following numbers of outliers:

```
## HDo PCS BAC adjOut DDC MCD
## 14 4 5 0 6 5
```

Each method uses what I have called the tolerance level in a rather different way. Sometimes it is called alpha and sometimes (1-alpha). As so often with **R**, you start wondering if more consistency would not be out of place, even at the expense of a little individuality. **OutliersO3** transforms where necessary to ensure that lower tolerance level values mean fewer outliers for all methods, but no attempt has been made to calibrate them equivalently. This is probably why `adjOutlyingness`

finds few or no outliers (results of this method are mildly random). The default value, according to `adjOutlyingness`

’s help page, is an alpha of 0.25.

The stackloss dataset is an odd dataset and small enough that each individual case can be studied in detail (cf. Dodge’s paper for just how much detail). However, similar results have been found with other datasets (milk, Election2005, diamonds, …). The main conclusion so far is that different outlier methods identify different numbers of different cases for different combinations of variables as different from the bulk of the data (i.e. as potential outliers)—or are these datasets just outlying examples?

There are other outlier methods available in **R** and they will doubtless give yet more different results. The recommendation has to be to proceed with care. Outliers may be interesting in their own right, they may be errors of some kind—and we may not agree whether they are outliers at all.

[Find the R code for generating the above plots here: OutliersUnwin.Rmd]

Bias is a major issue in machine learning. But can we develop a system to "un-bias" the results? In this keynote at NIPS 2017, Kate Crawford argues that treating this as a technical problem means ignoring the underlying social problem, and has the potential to make things worse.

You can read more about biases in AI systems in this article at the Microsoft AI blog.

Whether we're developing statistical models, training machine learning recognizers, or developing AI systems, we start with data. And while the suitability of that data set is, lamentably, sometimes measured by its size, it's always important to reflect on where those data come from. Data are not neutral: the data we choose to use has profound impacts on the resulting systems we develop. A recent article in Microsoft's AI Blog discusses the inherent biases found in many data sets:

“The people who are collecting the datasets decide that, ‘Oh this represents what men and women do, or this represents all human actions or human faces.’ These are types of decisions that are made when we create what are called datasets,” she said. “What is interesting about training datasets is that they will always bear the marks of history, that history will be human, and it will always have the same kind of frailties and biases that humans have.”

— Kate Crawford, Principal Researcher at Microsoft Research and co-founder of AI Now Institute.“When you are constructing or choosing a dataset, you have to ask, ‘Is this dataset representative of the population that I am trying to model?’”

— Hanna Wallach, Senior Researcher at Microsoft Research NYC.

The article discusses the consequences of the data sets that aren't representative of the populations they are set to analyze, and also the consequences of the lack of diversity in the fields of AI research and implementation. Read the complete article at the link below.

Microsoft AI Blog: Debugging data: Microsoft researchers look at ways to train AI systems to reflect the real world

Modern slot machines (fruit machine, pokies, or whatever those electronic gambling devices are called in your part of the world) are designed to be addictive. They're also usually quite complicated, with a bunch of features that affect the payout of a spin: multiple symbols with different pay scales, wildcards, scatter symbols, free spins, jackpots ... the list goes on. Many machines also let you play multiple combinations at the same time (20 lines, or 80, or even more with just one spin). All of this complexity is designed to make it hard for you, the player, to judge the real odds of success. But rest assured: in the long run, you always lose.

All slot machines are designed to have a "house edge" — the percentage of player bets retained by the machine in the long run — greater than zero. Some may take 1% of each bet (over a long-run average); some may take as much as 15%. But every slot machine takes *something*.

That being said, with all those complex rules and features, trying to calculate the house edge, even when you know all of the underlying probabilities and frequencies, is no easy task. Giora Simchoni demonstrates the problem with an R script to calculate the house edge of an "open source" slot machine *Atkins Diet*. Click the image below to try it out.

This virtual machine is at a typical level of complexity of modern slot machines. Even though we know the pay table (which is always public) and the relative frequency of the symbols on the reels (which usually isn't), calculating the house edge for this machine requires several pages of code. You could calculate the expected return analytically, of course, but it turns out to be a somewhat error-prone combinatorial problem. The simplest approach is to simulate playing the machine 100,000 times or so. Then we can have a look at the distribution of the payouts over all of these spins:

The *x* axis here is log(Total Wins + 1), in log-dollars, from a single spin. It's interesting to see the impact of the bet size (which increases variance but doesn't change the distribution), and the number of lines played. Playing one 20-line game isn't the same as playing 20 1-line games, because the re-use of the symbols means multi-line wins are *not* independent: a high-value symbol (like a wild) may contribute to wins on multiple lines. Conversely, losing combinations have a tendency to cluster together, too. It all balances in the end, but the possibility of more frequent wins (coupled with higher-value losses) is apparently appealing to players, since many machines encourage multi-line play.

Nonetheless, whichever method you play, the house edge is always positive. For *Atkins Diet*, it's about between 3% and 4%. (The simulations suggest 4% for single-line play and about 3.2% for multi-line play, but per-line expected returns are the same in each case.) You can see the details of the calculation, and the complete R code behind it, at the link below.

Giora Simchoni: Don't Drink and Gamble (via the author)

*by Jocelyn Barker, Data Scientist at Microsoft*

I have a confession to make. I am not just a statistics nerd; I am also a role-playing games geek. I have been playing Dungeons and Dragons (DnD) and its variants since high school. While playing with my friends the other day it occurred to me, DnD may have some lessons to share in my job as a data scientist. Hidden in its dice rolling mechanics is a perfect little experiment for demonstrating at least one reason why practitioners may resist using statistical methods even when we can demonstrate a better average performance than previous methods. It is all about distributions. While our averages may be higher, the distribution of individual data points can be disastrous.

Partially because it means I get to think about one of my hobbies at work. More practically, because consequences of probability distributions can be hard to examine in the real world. How do you quantify the impact of having your driverless car misclassify objects on the road? Games like DnD on the other hand were built around quantifying the impact of decisions. You decide to do something, add up some numbers that represent the difficulty of what you want to do, and then roll dice to add in some randomness. It also means it is a great environment to study how the distribution of the randomness impacts the outcomes.

One of the core mechanics of playing DnD and related role-playing games involve rolling a 20 sided die (often referred to as a d20). If you want your character to do something like climb a tree, there is some assigned difficulty for it (eg. 10) and if you roll higher than that number, you achieve your goal. If your character is good at that thing, they get to add a skill modifier (eg. 5) to the number they roll making it more likely that they can do what they wanted to do. If the thing you want to do involves another character, things change a little. Instead of having a set difficulty like for climbing a tree, the difficulty is an opposed roll from the other player. So if Character A wants to sneak past Character B, both players roll d20s and Character A adds their “stealth” modifier against Character B’s “perception” modifier. Whoever between them gets a higher number wins with a tie going to the “perceiver”. Ok, I promise, that is all the DnD rules you need to know for this blog post.

So here is where the stats nerd in me got excited. Some people change the rules of rolling to make different distributions. The default distribution is pretty boring, 20 numbers with equal probability:

One common way people modify this is with the idea of “critical”. The idea is that sometimes people do way better or worse than average. To reflect this, if you roll a 20, instead of adding 20 to your modifier, you add 30. If you roll a 1, you subtract 10 from your modifier.

Another stats nerd must have made up the last distribution. The idea for constructing it is weird, but the behavior is much more Gaussian. It is called 3z8 because you roll 3 eight-sided dice that are numbered 0-7 and sum them up giving a value between 0 and 21. 1-20 act as in the standard rules, but 0 and 21 are now treated like criticals (but at a much lower frequency than before).

The cool thing is these distributions have almost identical expected values (10.5 for d20, 10.45 with criticals, and 10.498 for 3z8), but very different distributions. How do these distributions affect the game? What can we learn from this as statisticians?

To examine how our distributions affects outcome, we will look at a scenario where a character, who we will call the rogue, wants to sneak past three guards. If any of the guard’s perception is greater than or equal to the rogue’s stealth, we will say the rogue loses the encounter, if they are all lower, the rogue is successful. We can already see the rogue is at a disadvantage; any one of the guards succeeding is a failure for her. We note that assuming all the guards have the same perception modifier, the actual value of the modifier for the guards doesn’t matter, just the difference between their modifier and the modifier of the rogue because the two modifiers are just a scalar adjustment of the value rolled. In other words, it doesn’t matter if the guards are average Joes with a 0 modifier and the rogue is reasonably sneaky with a +5 or if the guards are hyper alert with a +15 and the rogue is a ninja with a +20; the probability of success is the same in the two scenarios.

Lets start off getting a feeling for what the dice rolls will look like. Since the rogue is only rolling one die, her probability distribution looks the same as the distribution of the dice from the previous section. Now, lets consider the guards. In order for the rogue to fail to sneak by, she only needs to be seen by one of the guards. That means we just need to look at the probability that the maximum roll for one of the guards is \(n\) for \(n \in 1,..,20\). We will start with our default distribution. The number of ways you can have 3 dice roll a value of \(n\) or less is \(n^3\). Therefore the number of ways you can have the max value of the dice be exactly \(n\) is the number of ways you can roll \(n\) or less minus the number of ways where all the dice are \(n - 1\) or less giving us \(n^3 - (n - 1)^3\) ways to roll a max value of \(n\). Finally, we can divide by the total number of roll combinations for an 20-sided dice, \(20^3\), giving us our final probabilities of:

\[\frac{n^3 - (n-1)^3}{20^3} \textrm{for} \{n \in 1, ..., 20\}\]

The only thing that changes when we add criticals to the mix is that now the probabilities previously assigned to 1 get re-assigned to -10 and those assigned to 20 get reassigned to 30 giving us the following distribution.

This means our guards get a critical success ~14% of the time! This will have a big impact on our final distributions.

Finally, lets look at the distribution for the guards using the 3z8 system.

In the previous distributions, the maximum value became the single most likely roll. Because of the the low probability of rolling a 21 in the 3z8 distribution, this distribution still skews right, but peaks at 14. In this distribution, criticals only occur ~0.6% of the time; much less than the previous distribution.

Now that we have looked at the distributions of the rolls for the rogue and the guards, lets see what our final outcomes look like. As previously mentioned, we don’t need to worry about the specific modifiers for the two groups, just the difference between them. Below is a plot showing the relative modifier for the rogue on the x-axis and the probability of success on the y-axis for our three different probability distributions.

We see that for the entire curve, our odds of success goes down when we add criticals and for most of the curve, it goes up for 3z8. Lets think about why. We know the guards are more likely to roll a 20 and less likely to roll a 1 from the distribution we made earlier. This happens about 14% of the time, which is pretty common, and when it happens, the rogue has to have a very high modifier and still roll well to overcome it unless they also roll a 20. On the other hand, with 3z8 system, criticals are far less common and everyone rolls close to average more of the time. The expected value for the rogue is ~10.5, where as it is ~14 for the guards, so when everyone performs close to average, the rogue only needs a small modifier to have a reasonable chance of success.

To illustrate how much of a difference there is between the two, lets consider what would be the minimum modifier needed to have a certain probability of success.

Probability | Roll 1d20 | With Criticals | Roll 3z8 |
---|---|---|---|

50% | 6 | 7 | 4 |

75% | 11 | 13 | 8 |

90% | 15 | 22 | 11 |

95% | 17 | 27 | 13 |

We see from the table that reasonably small modifiers make a big difference in the 3z8 system, where as very large modifiers are needed to have a reasonable chance of success when criticals are added. To give context on just how large this is, when a someone is invisible, this only adds +20 to their stealth checks when they are moving. In other words, in the system with criticals, our rogue could literally be invisible sneaking past a group of not particularly observant gaurds and have a reasonable chance of failing.

The next thing to consider is our rogue may have to make multiple checks to sneak into a place (eg. one to sneak into the courtyard, one to sneak from bush to bush, and then a final one to sneak over the wall). If we look at the results of our rogue making three successful checks in a row, our probabilities change even more.

Probability | Roll 1d20 | With Criticals | Roll 3z8 |
---|---|---|---|

50% | 12 | 15 | 9 |

75% | 16 | 23 | 11 |

90% | 18 | 28 | 14 |

95% | 19 | 29 | 15 |

Making multiple checks exaggerates the differences we saw previously. Part of the reason for the poor performance with the addition of criticals (and for the funny shape of the critical curve) is there is a different cost associated with criticals for the rogue compared to the guards. If the guards roll a 20 or the rogue rolls a 1, when criticals are in play, the guards will almost certainly win, even if the rogue has a much higher modifier. On the other hand, if the guard rolls a 1 or the rogue rolls a 20, there isn’t much difference in outcome between getting that critical and any other low/high roll; play continues to the next round.

Many times as data scientists, we think of the predictions we make as discrete data points and when we evaluate our models we use aggregate metrics. It is easy to lose sight that our predictions are samples from a probability distribution, and that aggregate measures can obscure how well our model is really performing. We saw in the example with criticals where big hits and misses can make a huge impact on outcomes, even if the average performance is largely the same. We also saw with the 3z8 system where decreasing the expected value of the roll can actually increase performance by making the “average” outcome more likely.

Does all of this sound contrived to you, like I am trying to force an analogy? Let me make a concrete example from my real life data science job. I am responsible for making the machine learning revenue forecasts for Microsoft. Twice a quarter, I forecast the revenue for all of the products at Microsoft world wide. While these product forecasts do need to be accurate for internal use, the forecasts are also summed up to create segment level forecasts. Microsoft’s segment level forecasts go to Wall Street and having our forecasts fail to meet actuals can be a big problem for the company. We can think about our rogue sneaking past our guards as being an analogy for nailing the segment level forecast. If I succeed for most of the products (our individual guards) but have a critical miss of $1 billion error on one of them, then I have a $1 billion error for the segment and I failed. Also like our rogue, one success doesn’t mean we have won. There is always another quarter and doing well one quarter doesn’t mean Wall Street will cut you some slack the next. Finally, a critical success is less valuable than a critical failure is problematic. Getting the forecasts perfect one quarter will just get your a “good job” and a pat on the back, but a big miss costs the company. In this context, it is easy to see why the finance team doesn’t take the machine learning forecasts as gospel, even with our track record of high accuracy.

So as you evaluate your models, keep our sneaky friend in mind. Rather than just thinking about your average metrics, think about your distribution of errors. Are your errors clustered nicely around the mean or are they scattershot of low and high? What does that mean for your application? Are those really low errors valuable enough to be worth getting the really high ones from time to time? Many times having a reliable model may be more valuable than a less reliable one with higher average performance, so when you evaluate, think distributions, not means.

*The charts in this post were all produced using the R language. To see the code behind the charts, take a look at this R Markdown file.*

*by Błażej Moska, computer science student and data science intern*

One of the most important thing in predictive modelling is how our algorithm will cope with various datasets, both training and testing (previously unseen). This is strictly connected with the concept of bias-variance tradeoff.

Roughly speaking, variance of an estimator describes, how do estimator value ranges from dataset to dataset. It's defined as follows:

\[ \textrm{Var}[ \widehat{f} (x) ]=E[(\widehat{f} (x)-E[\widehat{f} (x)])^{2} ] \]

\[ \textrm{Var}[ \widehat{f} (x)]=E[(\widehat{f} (x)^2]-E[\widehat{f} (x)]^2 \]

Bias is defined as follows:

\[ \textrm{Bias}[ \widehat{f} (x)]=E[\widehat{f}(x)-f(x)]=E[\widehat{f}(x)]-f(x) \]

One could think of a Bias as an ability to approximate function. Typically, reducing bias results in increased variance and vice versa.

\(E[X]\) is an expected value, this could be estimated using a mean, since mean is an unbiased estimator of the expected value.

We can estimate variance and bias by bootstrapping original training dataset, that is, by sampling with replacement indexes of an original dataframe, then drawing rows which correspond to these indexes and obtaining new dataframes. This operation was repeated over `nsampl`

times, where `nsampl`

is the parameter describing number of bootstrap samples.

Variance and Bias is estimated for one value, that is to say, for one observation/row of an original dataset (we calculate variance and bias over rows of predictions made on bootstrap samples). We then obtain a vector containing variances/biases. This vector is of the same length as the number of observations of the original dataset. For the purpose of this article, for each of these two vectors a mean value was calculated. We will treat these two means as our estimates of mean bias and mean variance. If we don't want to measure direction of the bias, we can take absolute values of bias.

Because bias and variance could be controlled by parameters sent to the `rpart`

function, we can also survey how do these parameters affect tree variance. The most commonly used parameters are `cp`

(complexity parameter), which describe how much each split must decrease overall variance of a decision variable in order to be attempted, and `minsplit`

, which defines minimum number of observations needed to attempt a split.

Operations mentioned above is rather exhaustive in computational terms: we need to create *nsampl* bootstrap samples, grow `nsampl`

trees, calculate `nsampl`

predictions, `nrow`

variances, `nrow`

biases and repeat those operations for the number of parameters (length of the vector `cp`

or `minsplit`

). For that reason the foreach package was used, to take advantage of parallelism. The above procedure still can't be considered as fast, but It was much faster than without using the foreach package.

So, summing up, the procedure looks as follows:

- Create bootstrap samples (by bootstrapping original dataset)
- Train model on each of these bootstrap datasets
- Calculate mean of predictions of these trees (for each observation) and compare these predictions with values of the original datasets (in other words, calculate bias for each row)
- Calculate variance of predictions for each row (estimate variance of an estimator-regression tree)
- Calculate mean bias/absolute bias and mean variance