Whether we're developing statistical models, training machine learning recognizers, or building AI systems, we start with data. And while the suitability of a data set is, lamentably, sometimes measured only by its size, it's always important to reflect on where those data come from. Data are not neutral: the data we choose to use have profound impacts on the resulting systems we develop. A recent article in Microsoft's AI Blog discusses the inherent biases found in many data sets:

“The people who are collecting the datasets decide that, ‘Oh this represents what men and women do, or this represents all human actions or human faces.’ These are types of decisions that are made when we create what are called datasets,” she said. “What is interesting about training datasets is that they will always bear the marks of history, that history will be human, and it will always have the same kind of frailties and biases that humans have.” — Kate Crawford, Principal Researcher at Microsoft Research and co-founder of AI Now Institute.

“When you are constructing or choosing a dataset, you have to ask, ‘Is this dataset representative of the population that I am trying to model?’” — Hanna Wallach, Senior Researcher at Microsoft Research NYC.

The article discusses the consequences of data sets that aren't representative of the populations they are meant to analyze, as well as the consequences of the lack of diversity in the fields of AI research and implementation. Read the complete article at the link below.

Modern slot machines (fruit machines, pokies, or whatever those electronic gambling devices are called in your part of the world) are designed to be addictive. They're also usually quite complicated, with a bunch of features that affect the payout of a spin: multiple symbols with different pay scales, wildcards, scatter symbols, free spins, jackpots ... the list goes on. Many machines also let you play multiple combinations at the same time (20 lines, or 80, or even more, with just one spin). All of this complexity is designed to make it hard for you, the player, to judge the real odds of success. But rest assured: in the long run, you always lose.

All slot machines are designed to have a "house edge" — the percentage of player bets retained by the machine in the long run — greater than zero. Some may take 1% of each bet (over a long-run average); some may take as much as 15%. But every slot machine takes something.

That said, with all those complex rules and features, calculating the house edge, even when you know all of the underlying probabilities and frequencies, is no easy task. Giora Simchoni demonstrates the problem with an R script to calculate the house edge of the "open source" slot machine Atkins Diet. Click the image below to try it out.

This virtual machine is typical of the complexity of modern slot machines. Even though we know the pay table (which is always public) and the relative frequency of the symbols on the reels (which usually isn't), calculating the house edge for this machine requires several pages of code. You could calculate the expected return analytically, of course, but it turns out to be a somewhat error-prone combinatorial problem. The simplest approach is to simulate playing the machine 100,000 times or so, and then look at the distribution of the payouts over all of those spins:
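For intuition, here's a minimal sketch of that simulation approach. This is not the Atkins Diet machine: the symbols, frequencies, and pay table below are made up purely for illustration.

```r
# A toy 3-reel machine (made-up symbols, frequencies, and pay table;
# NOT the Atkins Diet machine) illustrating a Monte Carlo estimate
# of the house edge.
set.seed(42)
symbols <- c("cherry", "bar", "seven")
freq    <- c(0.70, 0.25, 0.05)                   # symbol frequency on each reel
pays    <- c(cherry = 2, bar = 10, seven = 100)  # payout for three of a kind

n <- 100000
payout <- replicate(n, {
  s <- sample(symbols, 3, replace = TRUE, prob = freq)
  if (length(unique(s)) == 1) unname(pays[s[1]]) else 0
})

house_edge <- 1 - mean(payout)  # fraction of each unit bet kept by the house
house_edge
```

For these made-up numbers the analytical house edge is 1 - (0.7^3*2 + 0.25^3*10 + 0.05^3*100), about 14.5%, and the simulation lands close to that. The real work in Simchoni's script is encoding the far more elaborate pay rules.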

The x axis here is log(Total Wins + 1), in log-dollars, from a single spin. It's interesting to see the impact of the bet size (which increases variance but doesn't change the distribution), and the number of lines played. Playing one 20-line game isn't the same as playing 20 1-line games, because the re-use of the symbols means multi-line wins are not independent: a high-value symbol (like a wild) may contribute to wins on multiple lines. Conversely, losing combinations have a tendency to cluster together, too. It all balances in the end, but the possibility of more frequent wins (coupled with higher-value losses) is apparently appealing to players, since many machines encourage multi-line play.

Nonetheless, however you play, the house edge is always positive. For Atkins Diet, it's between 3% and 4%. (The simulations suggest 4% for single-line play and about 3.2% for multi-line play, but the per-line expected returns are the same in each case.) You can see the details of the calculation, and the complete R code behind it, at the link below.

I have a confession to make. I am not just a statistics nerd; I am also a role-playing games geek. I have been playing Dungeons and Dragons (DnD) and its variants since high school. While playing with my friends the other day, it occurred to me that DnD may have some lessons to share for my job as a data scientist. Hidden in its dice-rolling mechanics is a perfect little experiment demonstrating at least one reason why practitioners may resist using statistical methods even when we can demonstrate better average performance than previous methods. It is all about distributions: while our averages may be higher, the distribution of individual data points can be disastrous.

Why Use Role-Playing Games as an Example?

Partially because it means I get to think about one of my hobbies at work. More practically, because the consequences of probability distributions can be hard to examine in the real world. How do you quantify the impact of your driverless car misclassifying objects on the road? Games like DnD, on the other hand, were built around quantifying the impact of decisions. You decide to do something, add up some numbers that represent the difficulty of what you want to do, and then roll dice to add in some randomness. That makes them a great environment for studying how the distribution of the randomness affects the outcomes.

A Little Background on DnD

One of the core mechanics of DnD and related role-playing games involves rolling a 20-sided die (often referred to as a d20). If you want your character to do something like climb a tree, there is some assigned difficulty (e.g., 10), and if you roll higher than that number, you achieve your goal. If your character is good at that thing, they get to add a skill modifier (e.g., 5) to the number they roll, making it more likely that they can do what they wanted to do. If the thing you want to do involves another character, things change a little. Instead of a set difficulty as for climbing a tree, the difficulty is an opposed roll from the other player. So if Character A wants to sneak past Character B, both players roll d20s, and Character A's roll plus their "stealth" modifier is compared against Character B's roll plus their "perception" modifier. Whoever gets the higher number wins, with ties going to the "perceiver". Ok, I promise, that is all the DnD rules you need to know for this blog post.
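As a quick sketch of the opposed-roll mechanic just described, here's a simulation in R (the modifier values are arbitrary choices for illustration):

```r
# Opposed roll: Character A (stealth +5) tries to sneak past
# Character B (perception +2). Ties go to the perceiver, so A
# must end up strictly higher. Modifiers are illustrative.
set.seed(1)
n <- 100000
a <- sample(1:20, n, replace = TRUE) + 5   # A's roll + stealth modifier
b <- sample(1:20, n, replace = TRUE) + 2   # B's roll + perception modifier
p_sneak <- mean(a > b)
p_sneak
```

The exact probability in this setup works out to 247/400 = 0.6175, and the simulated value should land very close to it.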

Alternative Rolling Mechanics: What’s in a Distribution?

So here is where the stats nerd in me got excited. Some people change the rules of rolling to create different distributions. The default distribution is pretty boring: 20 numbers, each with equal probability.

One common way people modify this is with the idea of “critical”. The idea is that sometimes people do way better or worse than average. To reflect this, if you roll a 20, instead of adding 20 to your modifier, you add 30. If you roll a 1, you subtract 10 from your modifier.

Another stats nerd must have made up the last distribution. The idea for constructing it is weird, but the behavior is much more Gaussian. It is called 3z8 because you roll 3 eight-sided dice that are numbered 0-7 and sum them up giving a value between 0 and 21. 1-20 act as in the standard rules, but 0 and 21 are now treated like criticals (but at a much lower frequency than before).

The cool thing is these distributions have almost identical expected values (10.5 for d20, 10.45 with criticals, and 10.498 for 3z8), but very different distributions. How do these distributions affect the game? What can we learn from this as statisticians?
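Those expected values are easy to verify in R. (For 3z8 I'm assuming the criticals map 21 to 30 and 0 to -10, mirroring the d20 criticals; that assumption reproduces the quoted 10.498.)

```r
# Expected value of the standard d20
d20 <- 1:20
mean(d20)                                   # 10.5

# d20 with criticals: a 20 counts as 30, a 1 counts as -10
crit <- d20
crit[d20 == 20] <- 30
crit[d20 == 1]  <- -10
mean(crit)                                  # 10.45

# 3z8: sum of three dice numbered 0-7, with 0 and 21 as criticals
# (assumed mapping: 21 -> 30, 0 -> -10)
rolls <- expand.grid(0:7, 0:7, 0:7)         # all 512 equally likely outcomes
s <- rowSums(rolls)
s[s == 21] <- 30
s[s == 0]  <- -10
mean(s)                                     # ~10.498
```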

by Błażej Moska, computer science student and data science intern

One of the most important things in predictive modelling is how our algorithm copes with various datasets, both training and testing (previously unseen). This is closely connected with the concept of the bias-variance tradeoff.

Roughly speaking, the variance of an estimator describes how the estimator's value varies from dataset to dataset. It is defined as follows:

\[ \operatorname{Var}(\hat{\theta}) = E\big[(\hat{\theta} - E[\hat{\theta}])^2\big] \]

while the bias is the difference between the estimator's expected value and the true value, \( \operatorname{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta \).

We can estimate variance and bias by bootstrapping the original training dataset: sampling row indexes of the original dataframe with replacement, then drawing the corresponding rows to obtain new dataframes. This operation is repeated nsampl times, where nsampl is the parameter giving the number of bootstrap samples.

Variance and bias are estimated per value, that is to say, per observation/row of the original dataset (we calculate variance and bias over the predictions made on the bootstrap samples). We then obtain a vector of variances and a vector of biases, each the same length as the number of observations in the original dataset. For the purposes of this article, the mean of each of these two vectors was calculated, and we will treat these two means as our estimates of mean bias and mean variance. If we don't want to measure the direction of the bias, we can take the absolute value of the bias instead.

Because bias and variance can be controlled by parameters passed to the rpart function, we can also survey how these parameters affect tree variance. The most commonly used parameters are cp (the complexity parameter), which describes how much each split must improve the overall fit in order to be attempted, and minsplit, which defines the minimum number of observations a node must contain for a split to be attempted.

The operations mentioned above are computationally expensive: we need to create nsampl bootstrap samples, grow nsampl trees, calculate nsampl sets of predictions, nrow variances and nrow biases, and repeat all of that for each parameter value (the length of the cp or minsplit vector). For that reason the foreach package was used, to take advantage of parallelism. The procedure still can't be considered fast, but it was much faster than without foreach.

So, summing up, the procedure looks as follows:

Create bootstrap samples (by bootstrapping original dataset)

Train model on each of these bootstrap datasets

Calculate the mean of these trees' predictions (for each observation) and compare it with the values of the original dataset (in other words, calculate the bias for each row)

Calculate the variance of the predictions for each row (estimating the variance of the estimator, the regression tree)

Calculate mean bias/absolute bias and mean variance
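The steps above can be sketched in a few lines of R. Note that mtcars and the specific cp/minsplit values here are illustrative stand-ins, not the dataset or parameter grid from the original analysis:

```r
# Bootstrap estimate of per-row bias and variance for an rpart tree.
library(rpart)
set.seed(123)
nsampl <- 200
preds <- sapply(1:nsampl, function(i) {
  idx  <- sample(nrow(mtcars), replace = TRUE)  # bootstrap row indexes
  tree <- rpart(mpg ~ ., data = mtcars[idx, ],
                cp = 0.01, minsplit = 5)        # illustrative parameters
  predict(tree, newdata = mtcars)               # predict on the original rows
})
row_bias <- rowMeans(preds) - mtcars$mpg        # bias per observation
row_var  <- apply(preds, 1, var)                # variance per observation
c(mean_abs_bias = mean(abs(row_bias)), mean_variance = mean(row_var))
```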

by Błażej Moska, computer science student and data science intern

Suppose that we have performed K-means clustering in R and are satisfied with our results, but later realize that it would also be useful to have a membership matrix. Of course, it would be easier to repeat the clustering using one of the fuzzy k-means functions available in R (fanny, for example), but since that is a slightly different implementation the results could also differ, and for some reason we don't want them to change. Knowing the equation, we can construct this matrix on our own after using the kmeans function. The equation is defined as follows (source: Wikipedia):

\[ w_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}} \]

\(w_{ij}\) denotes to what extent the \(i\)th object belongs to the \(j\)th cluster. So the total number of rows of this matrix equals the number of observations, and the total number of columns equals the number of clusters. \(m\) is the fuzzifier parameter, typically set to \(m=2\). \(w_{ij}\) values range between 0 and 1, so they are easy and convenient to compare. In this example I will use \(m = 2\), with the distances calculated as Euclidean distances.

To make the computation faster I also used the Rcpp package, and then compared the speed of the function written in R with the one written in C++.

The implementations use for loops, which are not the most idiomatic approach in R (see this blog post for more information and alternatives), but in this case I find them more convenient.
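For what it's worth, the loops can be avoided entirely. Here is a vectorized sketch of the same membership computation (my own variant, not code from the post; a point lying exactly on a center would need special handling):

```r
# Vectorized fuzzy membership matrix from data and cluster centers.
fuzzyMembership <- function(data, centers, m = 2) {
  # Squared Euclidean distance from every observation to every center
  D2 <- outer(rowSums(data^2), rep(1, nrow(centers))) -
        2 * data %*% t(centers) +
        outer(rep(1, nrow(data)), rowSums(centers^2))
  D <- sqrt(pmax(D2, 0))   # guard against tiny negative rounding errors
  U <- D^(-2 / (m - 1))    # w_ij is proportional to d_ij^(-2/(m-1))
  U / rowSums(U)           # normalize so each row sums to 1
}
```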

Rcpp (C++ version)

#include <Rcpp.h>
#include <math.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix fuzzyClustering(NumericMatrix data, NumericMatrix centers, int m) {
  /* data: a matrix with observations (rows) and variables (columns);
     centers: a matrix of cluster center coordinates;
     m: the parameter of the equation; c: the number of clusters */
  int c = centers.rows();
  int rows = data.rows();
  int cols = data.cols();  // number of variables, the same as in the centers matrix
  double tempDist = 0;     // tempDist and dist store temporary Euclidean distances
  double dist = 0;
  double denominator = 0;  // denominator of the "main" equation
  NumericMatrix result(rows, c);  // matrix of results
  for (int i = 0; i < rows; i++) {
    for (int j = 0; j < c; j++) {
      for (int k = 0; k < c; k++) {
        for (int p = 0; p < cols; p++) {
          // the innermost loop accumulates squared Euclidean distances:
          // tempDist is the numerator inside the sum operator in the equation,
          // dist is the denominator inside the sum operator
          tempDist = tempDist + pow(centers(j, p) - data(i, p), 2);
          dist = dist + pow(centers(k, p) - data(i, p), 2);
        }
        tempDist = sqrt(tempDist);
        dist = sqrt(dist);
        // 2.0 avoids integer division when m > 2
        denominator = denominator + pow(tempDist / dist, 2.0 / (m - 1));
        tempDist = 0;
        dist = 0;
      }
      result(i, j) = 1 / denominator;  // numerator/denominator in the main equation
      denominator = 0;
    }
  }
  return result;
}

We can save this in a file with .cpp extension. To compile it from R we can write:

sourceCpp("path_to_cpp_file")

If everything goes right our function fuzzyClustering will then be available from R.

R version

fuzzyClustering <- function(data, centers, m) {
  c <- nrow(centers)
  rows <- nrow(data)
  cols <- ncol(data)
  result <- matrix(0, nrow = rows, ncol = c)  # defining the membership matrix
  denominator <- 0
  for (i in 1:rows) {
    for (j in 1:c) {
      tempDist <- sqrt(sum((centers[j, ] - data[i, ])^2))  # Euclidean distance: numerator inside the sum operator
      for (k in 1:c) {
        Dist <- sqrt(sum((centers[k, ] - data[i, ])^2))  # Euclidean distance: denominator inside the sum operator
        denominator <- denominator + (tempDist / Dist)^(2 / (m - 1))  # denominator of the equation
      }
      result[i, j] <- 1 / denominator  # inserting the value into the membership matrix
      denominator <- 0
    }
  }
  return(result)
}

The result looks as follows: columns are cluster numbers (in this case, 10 clusters were created) and rows are our objects (observations). Values were rounded to the third decimal place, so the row sums may differ slightly from 1.

Shown below is the output of system.time for the C++ and R versions, run against a simulated matrix with 30,000 observations, 3 variables and 10 clusters. The hardware used was a low-cost Asus R556L notebook with an Intel Core i3-5010 2.1 GHz processor and 8 GB of DDR3 1600 MHz RAM.

C++ version:

user system elapsed
0.32 0.00 0.33

R version:

user system elapsed
15.75 0.02 15.94

In this example, the function written in C++ executed about 50 times faster than the equivalent function written in pure R.

by Juan M. Lavista Ferres, Senior Director of Data Science at Microsoft

In what was one of the most viral episodes of 2017, political science Professor Robert E. Kelly was live on BBC World News, talking about the South Korean president being forced out of office, when his two kids decided to take an easy path to fame by showing up in their dad's interview.

The video immediately went viral, and the BBC reported that within five days more than 100 million people from all over the world had watched it. Many people around the globe on Facebook and Twitter, as well as reporters from reliable sources like Time.com, thought the woman who rushed in after the children was their nanny, when in fact she was Robert's wife, Jung-a Kim, who is Korean.

We decided to embrace the uncertainty and take a data-science-based approach to estimating the chances that the person was the nanny or the mother of the children, based on the evidence people had from watching the news.


We define the following Bayesian network using the bnlearn package for R. We create the network using the model2network function and then we input the conditional probability tables (CPTs) that we know at each node.

library(bnlearn)
set.seed(3)
net <- model2network("[HusbandDemographics][HusbandIsProfessional][NannyDemographics][WifeDemographics|HusbandDemographics][StayAtHomeMom|HusbandIsProfessional:WifeDemographics][HouseholdHasNanny|StayAtHomeMom:HusbandIsProfessional][Caretaker|StayAtHomeMom:HouseholdHasNanny][CaretakerEthnicity|WifeDemographics:Caretaker:NannyDemographics]")
plot(net)

The last step is to fit the parameters of the Bayesian network conditional on its structure; the bn.fit function runs the EM algorithm to learn the CPTs for all the nodes in the graph above.

Once we have the model, we can query the network using cpquery to estimate the probability of the events and calculate the probability that the person is the nanny or the wife based on the evidence we have (husband is Caucasian and professional, caretaker is Asian). Based on this evidence the output is that the probability that she is the wife is 90% vs. 10% that she is the nanny.

probWife <- cpquery(net.disc, (Caretaker == "wife"),
                    HusbandDemographics == "caucacian" & HusbandIsProfessional == "yes" & CaretakerEthnicity == "asian",
                    n = 1000000)
probNanny <- cpquery(net.disc, (Caretaker == "nanny"),
                     HusbandDemographics == "caucacian" & HusbandIsProfessional == "yes" & CaretakerEthnicity == "asian",
                     n = 1000000)
[1] "The probability that the caretaker is his wife = 0.898718647764449"
[1] "The probability that the caretaker is the nanny = 0.110892031547457"

In conclusion, if you thought the woman in the video was the nanny, you may need to review your priors!

The Consumer Data Research Centre, the UK-based organization that works with consumer-related organisations to open up their data resources, recently published a new course online: An Introduction to Spatial Data Analysis and Visualization in R. Created by James Cheshire (whose blog Spatial.ly regularly features interesting R-based data visualizations) and Guy Lansley, both of University College London Department of Geography, this practical series is designed to provide an accessible introduction to techniques for handling, analysing and visualising spatial data in R.

In addition to a basic introduction to R, the course covers specialized topics around handling spatial and geographic data in R, including:

Making maps in R

Mapping point data in R

Using R to create, explore and interact with data maps (like the one shown below)

Performing statistical analysis on spatial data: interpolation and kriging, spatial autocorrelation, geographically weighted regression and more.

The course, tutorials and associated data are freely available (a free registration to the CDRC website is required, however). You can access the course materials at the link below.

If you get a blood test to diagnose a rare disease, and the test (which is very accurate) comes back positive, what's the chance you have the disease? Well, if "rare" means only 1 in a thousand people have the disease, and "very accurate" means the test returns the correct result 99% of the time, the answer is ... just 9%. There's less than a 1 in 10 chance you actually have the disease (which is why your doctor will likely have you tested a second time).
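That 9% figure falls straight out of Bayes' Theorem, and takes only a few lines of R to check:

```r
# P(disease | positive test) for a rare disease and an accurate test
prevalence  <- 1 / 1000
sensitivity <- 0.99   # P(positive | disease)
specificity <- 0.99   # P(negative | no disease)

p_positive <- sensitivity * prevalence +
              (1 - specificity) * (1 - prevalence)   # total probability of a positive
posterior  <- sensitivity * prevalence / p_positive  # Bayes' Theorem
posterior                                            # about 0.09
```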

Now that result might seem surprising, but it makes sense once you apply Bayes' Theorem. (A simple way to think of it: in a population of 1,000 people, about 10 of the 999 healthy people will get a false positive result, while the one person who actually has the disease will almost certainly test positive. Only one of those eleven positive results, about 9%, reflects a true disease.) The video below from Veritasium explains this Bayesian Trap quite elegantly:

That's all from us here at the blog this week. We'll be back with more on Monday. In the meantime, have a great weekend!

According to job hunting site CareerCast, the best job to have in 2017 is: Statistician. This is according to their 2017 Jobs Rated report, based on an evaluation of Bureau of Labor Statistics metrics including environment, income, employment and income growth, and stress factors.

In their rankings, Statistician took the top spot with a "work environment" score of 4 out of 199 (lower is better), a stress factors score of 39, and a projected growth score of 3. The median salary is reported at USD$80,110 and projected to grow at 34% (per annum, I assume).

Also in the top ten: Data Scientist, in fifth place. This role scored similarly on work environment (12 out of 199) and stress factors (37), but ranked lower for projected growth (37). Here, the median salary is reported at $111,267, projected to grow at 15.75%.

The top 10 list is as follows:

Statistician

Medical Services Manager

Operations Research Analyst

Information Security Analyst

Data Scientist

University Professor

Mathematician

Software Engineer

Occupational Therapist

Speech Pathologist

You can see the full list of 200 jobs ranked by CareerCast at the link below.

There's a reason why data scientists spend so much time exploring data using graphics. Relying only on data summaries like means, variances, and correlations can be dangerous, because wildly different data sets can give similar results. This principle has been demonstrated in statistics classes for decades with Anscombe's Quartet: four scatterplots which, despite being qualitatively different, all have the same x and y means and variances, and the same correlation between x and y.

(You can easily check this in R by loading the data with data(anscombe).) But what you might not realize is that it's possible to generate bivariate data with a given mean, median, and correlation in any shape you like — even a dinosaur:
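Here's that quick check in R: each of the four x/y pairs in Anscombe's Quartet has (near-)identical means, variances, and correlation.

```r
# Summary statistics for the four pairs in Anscombe's Quartet
data(anscombe)
summaries <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), var_x = var(x),
    mean_y = mean(y), var_y = var(y),
    cor_xy = cor(x, y))
})
round(summaries, 2)   # the four columns are essentially identical
```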

The paper linked below describes a method of perturbing the points in a scatterplot, moving them towards a given shape while keeping the statistical summaries close to fixed target values. The shapes include a star, a cross, and the "DataSaurus" (first created by Alberto Cairo). The authors have published a dataset they call the "DataSaurus Dozen" (also available as an R package on GitHub, with thanks to Steph Locke) of the 12 scatterplots shown. Interestingly, even the transitional frames in the animations above maintain the same summary statistics to two decimal places. Python was used to generate the data sets (and the code should be available at the link below soon).

Read the paper linked below for more details, and always remember: look at your data!