*by Allison Koenecke, Data Scientist, AI & Research Group at Microsoft, with acknowledgements to Amita Gajewar and John-Mark Agosta.*

*In this tutorial, Allison Koenecke demonstrates how Microsoft could recommend to customers the next set of services they should acquire as they expand their use of the Azure Cloud, by using a temporal extension to conventional Market Basket Analysis.*

Market Basket Analysis (MBA) answers a standard business question: given a set of grocery store receipts, can we find bundles of products often purchased together (e.g., peanut butter and jelly)? Suppose we instead want to model the evolution of a customer’s services, such as determining whether buying peanut butter in the past indicates a higher likelihood of buying bread in the future. For this, we apply a sequential version of MBA, sometimes called “sequential itemset mining” or “sequential pattern mining”, to introduce a time component to the analysis [1]. Sequential itemset mining has been applied across many industries, from determining a patient’s sequence of medical prescriptions [2] to detecting misuse intrusions such as application layer attacks [3]. In this tutorial, when given a series of purchases over time, we determine whether we can (1) find product bundles that we expect to be consumed simultaneously, and also (2) examine how these bundles evolve over time.

In the following tutorial, we answer both questions using the R package arulesSequences [4], which implements the SPADE algorithm [5]. Concretely, given data in an Excel spreadsheet containing historical customer service purchase data, we produce two separate Excel sheet deliverables: a list of service bundles, and a set of temporal rules showing how service bundles evolve over time. We will focus on interpreting the latter result by showing how to use temporal rules in making predictive sales recommendations.

Our running example below is inspired by the need for Microsoft’s Azure Services salespeople to suggest which additional products to recommend to customers, given the customers’ current cloud product consumption services mix. We’d like to know, for instance, if customers who have implemented web services also purchase web analytics within the next month. Actual Azure Service names have been removed for confidentiality reasons.

To begin, we convert the input spreadsheet into a transaction matrix using the `read_baskets` function within the `arulesSequences` package.

*Figure 1: Sample conversion of data input to transaction matrix. The stand-in letters in the “ServiceLevel” column represent specific Azure services, such as Compute, Networking, Data + Storage, Web + Mobile, etc.*
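As a sketch of that conversion step (the file name and column layout below are placeholders for your own export; `read_baskets` and its `info` argument come from the arulesSequences documentation):

```r
library(arulesSequences)

# Illustrative input file: one row per (customer, month, service) purchase,
# sorted by customer ID and then by month; the sequence ID (customer) and
# event ID (month) are the leading columns.
trans_matrix <- read_baskets("purchase_history.txt",
                             info = c("sequenceID", "eventID", "SIZE"))
summary(trans_matrix)
```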

We will use the SPADE (**S**equential **Pa**ttern **D**iscovery using **E**quivalence classes) algorithm for temporal market basket analysis, which is invoked with the `cspade` function. This method, recursive by length of sequence, is detailed in Figure 2 below. For example, in the first pass we look at sequences of length 1 (i.e., we determine which individual Azure Services appear in our transaction data). Based on the most frequent single-length sequences (e.g., “A” appears more often than “D”), we observe two types of two-element sequences. First, we observe two-element temporal sequences (“B → A” requires “B” to be purchased before “A”). Next, we observe two-element item groupings (“AB” requires “A” and “B” to exist at a certain time simultaneously). Then, based on the most frequent length-two outputs, we move on to finding three-element sequences (e.g., “D → B → A”) and three-element item groupings (e.g., “ABF”). This continues until we reach a user-defined maximum length, or until we reach a length at which we can no longer find frequent outputs.

Going forward, we will use the term “itemset” to mean any collection of items purchased by the customer; an itemset can therefore consist of item groupings (which could be a single item) or temporal sequences.

*Figure 2: SPADE Algorithm frequent sequence generation (original image from [5]).*

We’ve used the word “frequent” several times in the SPADE algorithm description. What exactly does this mean? There are three different terms that quantify frequency for MBA generally, and a best-case scenario involves high values in all three metrics: support, confidence, and lift. Here are their definitions:

**Support({ɑ}):** The fraction of transactions that contain an itemset {ɑ}. High support values correspond to commonly-found itemsets that are applicable to many transactions.

**Confidence({ɑ} → {β}):** The likelihood that the sequential rule {ɑ} → {β} actually occurs among transactions containing itemset {ɑ}, under the constraint that itemset {ɑ} is transacted prior to {β}. Confidence can be interpreted as a conditional probability given a pre-existing transaction with itemset {ɑ}; high confidence implies a high likelihood that {β} occurs in a future transaction. Technically, confidence is defined as the ratio: Support({ɑ} and {β}) / Support({ɑ}).

**Lift({ɑ} → {β}):** The degree of improvement from applying the sequential rule {ɑ} → {β} over observing itemsets {ɑ} and {β} independently occurring in the wild. The lift ratio is defined by Support({ɑ} and {β}) / (Support({ɑ}) * Support({β})). Note that if itemsets {ɑ} and {β} were independently occurring events, then the denominator would equate to the numerator, and the lift value would be 1. High lift – greater than 1 – implies that the presence of pre-existing itemset {ɑ} has increased the probability that {β} will occur in a future transaction; this can be interpreted as a higher degree of dependence between itemsets {ɑ} and {β}. Conversely, a low lift value – less than 1 – implies a negative dependence between itemsets {ɑ} and {β}. This metric tells you not just what is popular with all customers, but what is most useful for a particular customer given their history (e.g., a low lift value could be interpreted as {β} acting as a substitute good for {ɑ}).
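These definitions are easy to compute by hand. Here is a toy illustration in base R over a handful of made-up transactions (note this computes plain co-occurrence support, ignoring the temporal ordering that cspade adds):

```r
# Toy transactions: each element is the set of items in one transaction.
transactions <- list(c("A"), c("A", "B"), c("B"), c("A", "B"), c("A", "B", "C"))

# Fraction of transactions containing every item in the itemset.
support <- function(itemset) {
  mean(sapply(transactions, function(t) all(itemset %in% t)))
}

confidence <- function(lhs, rhs) support(c(lhs, rhs)) / support(lhs)
lift       <- function(lhs, rhs) support(c(lhs, rhs)) / (support(lhs) * support(rhs))

support("A")           # 4 of 5 transactions contain A, so 0.8
confidence("A", "B")   # of transactions with A, the share also containing B: 0.75
lift("A", "B")         # 0.6 / (0.8 * 0.8) = 0.9375, mild negative dependence
```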

Returning to our example, recall that our input (`trans_matrix`, created in Step 1) is the month-over-month sequence of Azure services purchased. With just one line of code, the `cspade` function returns frequent sequences mined in descending order of support. Further, to cut down on runtime, we can define a minimum support value to be output – in the below code, we use a minimum of 0.3, which in this case gives us 18 different sequences to observe. We will see how confidence and lift play a role in Step 3 of this tutorial.
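The mining step might look like the following sketch (the `parameter` and `control` lists follow the arulesSequences documentation; `trans_matrix` is the object built in Step 1):

```r
library(arulesSequences)

# Mine frequent sequences, keeping only those with support of at least 0.3.
s1 <- cspade(trans_matrix, parameter = list(support = 0.3),
             control = list(verbose = TRUE))

s1.df <- as(s1, "data.frame")  # sequences alongside their support values
summary(s1)
```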

We can view results in the data frame `s1.df` thus created, and look at the following statistics by calling `summary` on `s1`. Specifically, we see:

“(1) the list of the most frequent individual items (A,B, ..),

(2) the list of the most frequent set of items that occur in events (referred to as elements),

(3) the distribution of the sizes of the set of items,

(4) the distribution of the number of events in a sequence (referred to as sequence length),

(5) the minimum, maximum, mean and median support values, and

(6) the set of frequent sequences mined, ordered by its support value” [6].

*Figure 3: Sample output from the **cspade** algorithm shows 18 frequent itemsets, as well as statistics on these values (e.g., item A appears 11 out of 18 times overall as part of various itemsets, but appears by itself in a transaction as {A} only 8 times). Note that the comma delimiter between sets within each sequence implies a temporal sequence as seen in Figure 2.*

It takes just one more line of code to convert this set of sequences into a set of rules. Specifically, the “strong association rules” generated must satisfy minimum support and confidence values, and the left-hand side of the rule must occur at a time before the rule’s right-hand side. We have set the lower confidence bound to 0.5 in this example; a common default is a minimum of 0.8.
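That one line might look like the following (a sketch using `ruleInduction` from arulesSequences, with the 0.5 confidence bound described in the text):

```r
# Induce temporal association rules from the mined sequences, keeping
# only rules with confidence of at least 0.5.
r1 <- ruleInduction(s1, confidence = 0.5, control = list(verbose = TRUE))

r1.df <- as(r1, "data.frame")  # columns: rule, support, confidence, lift
```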

This returns the data frame shown in Figure 4. We can interpret the “rule” column as showing which bundles of Azure services imply consumption of additional services in coming months. Further, for all rules, we can compare the three previously-defined metrics: support, confidence, and lift.

*Figure 4: Sample output from the **ruleInduction** method.*

How do we interpret this? We can say, for example, that every time we have seen service D purchased by a customer, the same customer has later bought both B and F simultaneously. This is what the confidence score of 1.0 implies in Row 2 of Figure 4. Going back to Figure 1, we see that this is the case for user IDs 1 and 4. Meanwhile, the lift value of 1.0 implies that the probability of D being purchased, and the probability of both B and F being purchased afterwards, are independent of each other. So, we cannot claim that buying both B and F is dependent on having previously bought item D; but neither do these events appear negatively dependent (i.e., it is not the case that buying D previously makes it relatively unnecessary for the user to later buy B and F, as would be the case for the rules with a lift value of 0.5). While this output is fairly straightforward, we can do some simple cleaning to parse out the “before” and “after” steps of each rule, and sort by the metrics deemed most important for the specific use case.
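One way to do that cleaning, sketched in base R with stand-in rules mirroring Figure 4 (the column names are illustrative):

```r
rules <- data.frame(
  rule = c("<{D}> => <{B,F}>", "<{B,F}> => <{A}>"),
  support = c(0.5, 0.5), confidence = c(1.0, 1.0), lift = c(1.0, 1.0),
  stringsAsFactors = FALSE)

# Split each rule string on its implication arrow, then drop the <{ }> wrappers.
parts <- strsplit(rules$rule, " => ", fixed = TRUE)
strip <- function(x) gsub("[<>{}]", "", x)
rules$before <- strip(sapply(parts, `[`, 1))
rules$after  <- strip(sapply(parts, `[`, 2))

rules[, c("before", "after", "support", "confidence", "lift")]
# write.csv(rules, "TemporalRules.csv", row.names = FALSE)
```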

*Figure 5: Readable output in Excel stored in **TemporalRules.csv**. The values stored in **FreqItemGroupingsInRules.csv** are a list containing items: D; B; F; B,F; and A.*

Now, our two main results are stored in the csv files `FreqItemGroupingsInRules.csv` and `TemporalRules.csv`. Taken together, all itemsets that we care about will be defined by either the frequent item groupings or temporal rules. But how do we apply these to our business case?

First, the frequent item groupings themselves are useful domain knowledge. Even if the sequential ordering of B and F is ignored, knowing that items B and F are frequently purchased together raises some business opportunities: is there a reason that they are sold separately? Do customers have some need for both items that would be served better by rolling them into one product?

Second, the temporal rules allow us to propose better-targeted recommendations to customers over time. For example, assume that we see the rule “<{D}> => <{B,F}>” with high confidence and reasonable lift (both scoring 1.0 in our example). For a customer who has previously purchased item D, we can recommend that they now buy the bundle of two items, B and F. Suppose our customer loves this suggestion and, at a future timestep, has purchased B and F simultaneously. We can then consult the temporal rules again and find the rule “<{B,F}> => <{A}>” (chosen for its matching starting bundle of goods, as well as its high confidence), and further recommend that our customer adopt item A at a future time. In fact, we can parse these rules more easily using the cleaned Excel output shown in Figure 5 – our recommendation path is shown in Row 4. This style of targeted, personalized recommendation is helpful to both our customers and our salespeople, and future recommendations will become increasingly accurate as more data are collected to contribute to the confidence and lift calculations.

The above steps show a computationally efficient method to find temporal rules; this is far faster than a brute-force approach, such as finding frequent itemsets in each timestep using the arules package and then manually comparing them across time. However, depending on the support, confidence, and lift cut-offs desired, it is possible for even the `arulesSequences` method to take a significant amount of computation time. Hence, if the main desired outcome is expected to be one of the highest-scoring rules, it may be prudent to start from a high threshold frequency cut-off before testing lower values.

Aside from frequency cut-off parameters, it is worth considering the product aggregation level at which results are measured. Let us assume that we have products or services being sold that are organized in a hierarchy. It will take less time to run the SPADE algorithm on the least-granular level since the algorithm will see more repeated services being sold to different customers. If there is a desired minimum cut-off that takes too long to run on the specific product levels you desire, consider running the above code on a higher, more aggregate level of the product hierarchy, and then drill down in that set to determine which products are best to recommend in future timesteps. For example, at a higher product level within Azure Services (such as Analytics, Compute, Web + Mobile, etc.), suppose we find that we should recommend an Analytics product to a specific customer. Upon further examination, perhaps we find that Analytics-based temporal rules on more granular products indicate to recommend Machine Learning tools as opposed to Data Lake products. Hence, we are still able to make a reasonable suggestion for lower-level products without spending excessive computation time on the entire dataset.

Lastly, the above sequential pattern mining code may not be directly applicable if you: (1) care about the quantity of items being bought at any given point in time (since we simply observe the presence or absence of an itemset in this tutorial), or (2) have data that are irregular over time, but aim to predict a recommendation for a specific time interval in the future. In the former case, encoding a repeated item with separate item names (i.e., one name for each unit purchased) can allow quantity to be expressed. For the latter case, it is a best practice to use regular time intervals; this can be done by bucketing purchases by, e.g., month (if predictions are to be made over a monthly timeframe). Depending on sales structure, it may be necessary to either interpolate or report zero purchases in intermediate timesteps.
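The monthly bucketing mentioned above can be sketched in base R (the dates here are invented):

```r
purchase_dates <- as.Date(c("2018-01-03", "2018-01-28", "2018-03-15"))

# Collapse each date to an absolute month index, then shift so the earliest
# month is event 1; a month with no purchases leaves a gap in the event IDs.
abs_month <- 12L * as.integer(format(purchase_dates, "%Y")) +
             as.integer(format(purchase_dates, "%m"))
event_id  <- abs_month - min(abs_month) + 1L
event_id  # 1 1 3: February had no purchases, so event 2 is absent
```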

In summary, given the wide usage of Excel by finance teams tracking historical purchases, this approach offers an effective tool for providing insights on product or service adoption over time. Sequential pattern mining as implemented here can potentially be harnessed to recommend deals to a company’s salespeople, find what customers are likely to consume next, and discover which product bundles remain popular over time. Let us know what use cases you have found for this method!

1. Srikant, R. and Agrawal, R. (1996). “Mining sequential patterns: Generalizations and performance improvements.” Advances in Database Technology — EDBT '96.
2. Wright, A.P., Wright, A.T., McCoy, A.B. and Sittig, D.F. (2015). “The use of sequential pattern mining to predict next prescribed medications.” Journal of Biomedical Informatics, 53, 73–80. https://doi.org/10.1016/j.jbi.2014.09.003.
3. Song, S.J., Huang, Z., Hu, H.P. and Jin, S.Y. (2004). “A Sequential Pattern Mining Algorithm for Misuse Intrusion Detection.” In Grid and Cooperative Computing — GCC 2004 Workshops, Lecture Notes in Computer Science, vol. 3252. Springer, Berlin, Heidelberg.
4. Package ‘arulesSequences’ documentation: https://cran.r-project.org/web/packages/arulesSequences/arulesSequences.pdf.
5. Zaki, M.J. (2001). “SPADE: An Efficient Algorithm for Mining Frequent Sequences.” Machine Learning, 42, 31–60.
6. Data Mining Algorithms in R: https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Sequence_Mining/SPADE.
7. Analyzing Transaction Data like a Data Scientist: https://rpubs.com/Mahsa_A/Part4_AnalyzeTransactionData.

Anscombe's Quartet is a famous collection of four small data sets — just 11 (x,y) pairs each — that was developed in the 1970s to emphasize the fact that sometimes, numerical summaries of data aren't enough. (For a modern take on this idea, see also the Datasaurus Dozen.) In this case, it takes visualizing the data to realize that the four data sets are qualitatively very different, even though the means, variances, and regression coefficients are all the same. In the video below for Guy in a Cube, Buck Woody uses R to summarize the data (which is conveniently built into R) and visualize it using an R script in Power BI.
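You can verify the matching summaries yourself in base R, since the `anscombe` data frame ships with R (columns `x1`–`x4` and `y1`–`y4` hold the four sets):

```r
# Means and variances match across the four sets...
sapply(anscombe, mean)  # x means are all 9; y means are all ~7.50
sapply(anscombe, var)   # x variances are all 11; y variances are all ~4.13

# ...and so do the fitted regression lines (intercept ~3, slope ~0.5):
for (i in 1:4) {
  fit <- lm(as.formula(paste0("y", i, " ~ x", i)), data = anscombe)
  print(round(coef(fit), 2))
}
```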

The video also makes for a nice introduction to R for those new to statistics, and also demonstrates using R code to generate graphics in Power BI.

One of the differences between statistical data scientists and machine learning engineers is that while the latter group are concerned primarily with the predictive performance of a model, the former group are also concerned with the *fit* of the model. A model that misses important structures in the data — for example, seasonal trends, or a poor fit to specific subgroups — is likely to be lacking important variables or features in the source data. You can try different machine learning techniques or adjust hyperparameters to your heart's content, but you're unlikely to discover problems like this without evaluating the model fit.

One of the most powerful tools for assessing model fit is the residual plot: a scatterplot of the predicted values of the model versus the residuals (the difference between the predictions and the original data). If the model fits well, there should be no obvious relation between the two. For example, visual inspection shows us that the model below fits fairly well for 19 of the 20 subgroups in the data, but there may be a missing variable that would explain the apparent residual trend for group 19.
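A minimal residual plot takes only a few lines of base R; here simulated data (invented for illustration) stand in for a model that misses a quadratic structure:

```r
set.seed(42)
# Simulated data with a quadratic term that the model below will miss.
x <- runif(200, 0, 10)
y <- 2 + 3 * x + 0.3 * x^2 + rnorm(200)

fit <- lm(y ~ x)  # deliberately misspecified: linear term only

# Residual plot: a well-fitting model shows no pattern here; the curvature
# left by the omitted x^2 term signals a missing variable.
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```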

Assessing residual plots like these is something of an art, which is probably why it isn't routine in machine learning circles. But what if we could use deep learning and computer vision to assess residual plots like these automatically? That's what Professor Di Cook from Monash University proposes, in the 2018 Belz Lecture for the Statistical Society of Australia, Human vs computer: when visualising data, who wins?

(The slides for the presentation, created using R, are also available online.) The first part of the talk also includes a statistical interpretation of neural networks, and an intuitive explanation of deep learning for computer vision. The talk also references a related paper, Visualizing Statistical Models: Removing the Blindfold (by Hadley Wickham, Dianne Cook and Heike Hofmann, published in the ASA Data Science Journal in 2015), which is well worth a read.

President Trump is once again causing distress by downplaying the number of deaths caused by Hurricane Maria's devastation of Puerto Rico last year. Official estimates initially put the death toll at 15, before raising it to 64 months later, but it was clear even then that those numbers were absurdly low. The government of Puerto Rico commissioned an official report from the Milken Institute School of Public Health at George Washington University (GWU) to obtain a more accurate estimate, and with its interim publication the official toll now stands at 2,975.

Why were the initial estimates so low? I read the interim GWU report to find out. The report itself is clearly written, quite detailed, and composed by an expert team of social and medical scientists, demographers, epidemiologists and biostatisticians, and I find its analysis and conclusions compelling. (Sadly however the code and data behind the analysis have not yet been released; hopefully they will become available when the final report is published.) In short:

- In the earliest days of the hurricane, the death-recording office was closed and without power, which suppressed the official count.
- Even once death certificates were collected, it became clear that officials throughout Puerto Rico had not been trained on how to record deaths in the event of a natural disaster, and most deaths were not attributed correctly in official records.

Given these deficiencies in the usual data used to calculate death tolls (death certificates), the GWU team used a different approach. The basis of the method was to estimate **excess mortality**: in other words, how many deaths occurred in the post-Maria period, compared to the number of deaths that would have been expected had the hurricane never happened. This calculation required two quantitative studies:

- An estimate of what the population would have been if the hurricane hadn't happened. This was based on a GLM model of monthly data from the prior years, accounting for factors including recorded population, normal emigration and mortality rates.
- The total number of deaths in the post-Maria period, based on death certificates from the Puerto Rico government (irrespective of how the cause of death was coded).
- (A third study examined the communication protocols before, during and after the disaster. This study did not affect the quantitative conclusions, but formed the basis of some of the report's recommendations.)

The difference between the actual mortality and the estimated "normal" mortality formed the basis for the estimate of excess deaths attributed to the hurricane. You can see those estimates of excess deaths one month, three months, and five months after the event in the table below; the last column represents the current official estimate.
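The excess-mortality arithmetic itself is straightforward once the expected counts are in hand; here is a toy version in base R (all numbers invented, standing in for the GLM's expected deaths and the certificate counts):

```r
# Invented monthly death counts for the post-storm period:
expected <- c(2350, 2300, 2400, 2380)   # from the no-hurricane demographic model
observed <- c(2900, 2700, 2550, 2500)   # from death certificates, any cause

excess <- observed - expected
cumsum(excess)   # running total of excess deaths, month by month
```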

These results are consistent in scale with another, earlier study by Nishant Kishore et al. (The data and R code behind this study are available on GitHub.) This study attempted to quantify deaths attributable to the hurricane directly, by visiting 3299 randomly chosen households across Puerto Rico. At each household, inhabitants were asked about any household members who had died and their cause of death (related to or unrelated to the hurricane), and whether anyone had left Puerto Rico because of the hurricane. From this survey, the paper's authors extrapolated the number of hurricane-related deaths to the entire island. The headline estimate of 4,645 at three months is somewhat larger than the middle column of the table above, but due to the small number of recorded deaths in the survey sample, the 95% confidence interval is also much wider: 793 to 8,498 excess deaths. (Gelman's blog has some good discussion of this earlier study, including some commentary from the authors.)

With two independent studies reporting excess deaths well into the thousands attributable directly to Hurricane Maria, it's a fair question to ask whether a more effective response before and after the storm could have reduced the scale of this human tragedy.

Milken Institute School of Public Health: Study to Estimate the Excess Deaths from Hurricane Maria in Puerto Rico

*by Antony Unwin, University of Augsburg, Germany*

There are many different methods for identifying outliers and a lot of them are available in **R**. But are outliers a matter of opinion? Do all methods give the same results?

Articles on outlier methods use a mixture of theory and practice. Theory is all very well, but outliers are outliers because they don’t follow theory. Practice involves testing methods on data, sometimes with data simulated based on theory, better with ‘real’ datasets. A method can be considered successful if it finds the outliers we all agree on, but do we all agree on which cases are outliers?

The Overview Of Outliers (O3) plot is designed to help compare and understand the results of outlier methods. It is implemented in the **OutliersO3** package and was presented at last year’s useR! in Brussels. Six methods from other **R** packages are included (and, as usual, thanks are due to the authors for making their functions available in packages).

The starting point was a recent proposal of Wilkinson’s, his HDoutliers algorithm. The plot above shows the default O3 plot for this method applied to the stackloss dataset. (Detailed explanations of O3 plots are in the **OutliersO3** vignettes.) The stackloss dataset is a small example (21 cases and 4 variables) and there is an illuminating and entertaining article (Dodge, 1996) that tells you a lot about it.

Wilkinson’s algorithm finds 6 outliers for the whole dataset (the bottom row of the plot). Overall, for various combinations of variables, 14 of the cases are found to be potential outliers (out of 21!). There are no rows for 11 of the possible 15 combinations of variables because no outliers are found with them. If using a tolerance level of 0.05 seems a little bit lax, using 0.01 finds no outliers at all for any variable combination.

Trying another method with tolerance level=0.05 (*mvBACON* from **robustX**) identifies 5 outliers, all ones found for more than one variable combination by *HDoutliers*. However, no outliers are found for the whole dataset and only one of the three variable combinations where outliers are found is a combination where *HDoutliers* finds outliers. Of course, the two methods are quite different and it would be strange if they agreed completely. Is it strange that they do not agree more?

There are four other methods available in **OutliersO3**; using all six methods on stackloss with a tolerance level of 0.05 identifies the following numbers of outliers:

```
## HDo PCS BAC adjOut DDC MCD
## 14 4 5 0 6 5
```
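A comparison along these lines can be run with the package's preparation and plotting helpers (a sketch based on my reading of the `O3prep`/`O3plotM` interface in the OutliersO3 vignettes; argument and component names may need adjusting for your installed version):

```r
library(OutliersO3)

data(stackloss)
methods <- c("HDo", "PCS", "BAC", "adjOut", "DDC", "MCD")

# Run all six outlier methods at the same tolerance level...
prep <- O3prep(stackloss, method = methods, tols = 0.05)

# ...and draw the combined O3 plot comparing their results.
plots <- O3plotM(prep)
plots$nOut   # number of outliers found by each method
plots$gO3    # the O3 plot itself
```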

Each method uses what I have called the tolerance level in a rather different way. Sometimes it is called alpha and sometimes (1-alpha). As so often with **R**, you start wondering if more consistency would not be out of place, even at the expense of a little individuality. **OutliersO3** transforms where necessary to ensure that lower tolerance level values mean fewer outliers for all methods, but no attempt has been made to calibrate them equivalently. This is probably why `adjOutlyingness` finds few or no outliers (results of this method are mildly random). The default value, according to `adjOutlyingness`’s help page, is an alpha of 0.25.

The stackloss dataset is an odd dataset and small enough that each individual case can be studied in detail (cf. Dodge’s paper for just how much detail). However, similar results have been found with other datasets (milk, Election2005, diamonds, …). The main conclusion so far is that different outlier methods identify different numbers of different cases for different combinations of variables as different from the bulk of the data (i.e. as potential outliers)—or are these datasets just outlying examples?

There are other outlier methods available in **R** and they will doubtless give yet more different results. The recommendation has to be to proceed with care. Outliers may be interesting in their own right, they may be errors of some kind—and we may not agree whether they are outliers at all.

[Find the R code for generating the above plots here: OutliersUnwin.Rmd]

Bias is a major issue in machine learning. But can we develop a system to "un-bias" the results? In this keynote at NIPS 2017, Kate Crawford argues that treating this as a technical problem means ignoring the underlying social problem, and has the potential to make things worse.

You can read more about biases in AI systems in this article at the Microsoft AI blog.

Whether we're developing statistical models, training machine learning recognizers, or developing AI systems, we start with data. And while the suitability of a data set is, lamentably, sometimes measured by its size, it's always important to reflect on where those data come from. Data are not neutral: the data we choose to use have profound impacts on the resulting systems we develop. A recent article in Microsoft's AI Blog discusses the inherent biases found in many data sets:

“The people who are collecting the datasets decide that, ‘Oh this represents what men and women do, or this represents all human actions or human faces.’ These are types of decisions that are made when we create what are called datasets,” she said. “What is interesting about training datasets is that they will always bear the marks of history, that history will be human, and it will always have the same kind of frailties and biases that humans have.”

— Kate Crawford, Principal Researcher at Microsoft Research and co-founder of the AI Now Institute

“When you are constructing or choosing a dataset, you have to ask, ‘Is this dataset representative of the population that I am trying to model?’”

— Hanna Wallach, Senior Researcher at Microsoft Research NYC

The article discusses the consequences of data sets that aren't representative of the populations they are meant to analyze, as well as the consequences of the lack of diversity in the fields of AI research and implementation. Read the complete article at the link below.

Microsoft AI Blog: Debugging data: Microsoft researchers look at ways to train AI systems to reflect the real world

Modern slot machines (fruit machines, pokies, or whatever those electronic gambling devices are called in your part of the world) are designed to be addictive. They're also usually quite complicated, with a bunch of features that affect the payout of a spin: multiple symbols with different pay scales, wildcards, scatter symbols, free spins, jackpots ... the list goes on. Many machines also let you play multiple combinations at the same time (20 lines, or 80, or even more with just one spin). All of this complexity is designed to make it hard for you, the player, to judge the real odds of success. But rest assured: in the long run, you always lose.

All slot machines are designed to have a "house edge" — the percentage of player bets retained by the machine in the long run — greater than zero. Some may take 1% of each bet (over a long-run average); some may take as much as 15%. But every slot machine takes *something*.

That being said, with all those complex rules and features, trying to calculate the house edge, even when you know all of the underlying probabilities and frequencies, is no easy task. Giora Simchoni demonstrates the problem with an R script to calculate the house edge of an "open source" slot machine *Atkins Diet*. Click the image below to try it out.

This virtual machine is at a typical level of complexity of modern slot machines. Even though we know the pay table (which is always public) and the relative frequency of the symbols on the reels (which usually isn't), calculating the house edge for this machine requires several pages of code. You could calculate the expected return analytically, of course, but it turns out to be a somewhat error-prone combinatorial problem. The simplest approach is to simulate playing the machine 100,000 times or so. Then we can have a look at the distribution of the payouts over all of these spins:
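To illustrate the simulation idea on a far simpler machine (the symbols, frequencies, and pay table below are all made up, and bear no relation to the actual Atkins Diet game):

```r
set.seed(1)
# A toy three-reel, single-line machine: symbol frequencies and a pay table
# for three-of-a-kind, per 1-unit bet. The analytic house edge here works
# out to about 2.75%; the simulation should land nearby.
symbols <- c("cherry", "bar", "seven")
probs   <- c(0.50, 0.35, 0.15)
payout  <- c(cherry = 3, bar = 10, seven = 50)

spin <- function() {
  reels <- sample(symbols, 3, replace = TRUE, prob = probs)
  if (length(unique(reels)) == 1) unname(payout[reels[1]]) else 0
}

wins <- replicate(1e5, spin())
house_edge <- 1 - mean(wins)   # expected loss per unit bet
house_edge
```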

The *x* axis here is log(Total Wins + 1), in log-dollars, from a single spin. It's interesting to see the impact of the bet size (which increases variance but doesn't change the distribution) and of the number of lines played. Playing one 20-line game isn't the same as playing 20 1-line games, because the re-use of the symbols means multi-line wins are *not* independent: a high-value symbol (like a wild) may contribute to wins on multiple lines. Conversely, losing combinations have a tendency to cluster together, too. It all balances out in the end, but the possibility of more frequent wins (coupled with higher-value losses) is apparently appealing to players, since many machines encourage multi-line play.

Nonetheless, whichever way you play, the house edge is always positive. For *Atkins Diet*, it's between 3% and 4%. (The simulations suggest 4% for single-line play and about 3.2% for multi-line play, but per-line expected returns are the same in each case.) You can see the details of the calculation, and the complete R code behind it, at the link below.

Giora Simchoni: Don't Drink and Gamble (via the author)