*by Juan M. Lavista Ferres , Senior Director of Data Science at Microsoft *

In what was one of the most viral episodes of 2017, political science Professor Robert E Kelly was live on BBC World News talking about the South Korean president being forced out of office when both his kids decided to take an easy path to fame by showing up in their dad’s interview.

The video immediately went viral, and the BBC reported that within five days more than 100 million people from all over the world had watched it. Many people around the globe via Facebook, Twitter and reporters from reliable sources like Time.com thought the woman that went after the children was her nanny, when in fact, the woman in the video was Robert’s wife, Jung-a Kim, who is Korean.

The confusion over this episode caused a second viral wave calling out that people that thought she was the nanny should feel bad for being stereotypical.

We decided to embrace the uncertainty and take a data science based approach to estimating the chances that the person was the nanny or the mother of the child, based on the evidence people had from watching the news.

@David_Waddell What would that mean, please? Re-broadcasting it on BBC TV, or just here on Twitter? Is this kinda thing that goes 'viral' and gets weird?

— Robert E Kelly (@Robert_E_Kelly) March 10, 2017

- the person is American Caucasian
- the person is professional
- there are two kids
- the caretaker is Asian

We then look for probability values for these statistics. (Given that Professor Kelly is American, all statistics are based on US data.)

- Probability (Asian Wife | Caucasian Husband) = 1% [Married couples in the United States in 2010]
- Probability of (Household has Nanny | husband is professional) = 3.5% [The Three Faces of Work-Family Conflict, page 9, Figure 3]
- Probability of (Asian | Nanny) = 6% [Caregiver Statistics: Demographics]
- Probability of (Stay at home mom) = 14% and Probability of (Stay at home mom | Asian Wife) = 30% [Stay-at-Home Mothers by Demographic Group ]

We define the following Bayesian network using the bnlearn package for R. We create the network using the model2network function and then we input the conditional probability tables (CPTs) that we know at each node.

library(bnlearn) set.seed(3) net <- model2network("[HusbandDemographics][HusbandIsProfessional][NannyDemographics][WifeDemographics|HusbandDemographics][StayAtHomeMom|HusbandIsProfessional:WifeDemographics][HouseholdHasNanny|StayAtHomeMom:HusbandIsProfessional][Caretaker|StayAtHomeMom:HouseholdHasNanny][CaretakerEthnicity|WifeDemographics:Caretaker:NannyDemographics]") plot(net)

The last step is to fit the parameters of the Bayesian network conditional on its structure, the `bn.fit`

function runs the EM algorithm to learn CPT for all different nodes in the above graph.

yn <- c("yes", "no") ca <- c("caucacian","other") ao <- c("asian","other") nw <- c("nanny","wife") cptHusbandDemographics <- matrix(c(0.85, 0.15), ncol=2, dimnames=list(NULL, ca)) #[1] cptHusbandIsProfessional <- matrix(c(0.81, 0.19), ncol=2, dimnames=list(NULL, yn)) #[2] cptNannyDemographics <- matrix(c(0.06, 0.94), ncol=2, dimnames=list(NULL, ao)) # [3] cptWifeDemographics <- matrix(c(0.01, 0.99, 0.33, 0.67), ncol=2, dimnames=list("WifeDemographics"=ao, "HusbandDemographics"=ca)) #[1] cptStayAtHomeMom <- c(0.3, 0.7, 0.14, 0.86, 0.125, 0.875, 0.125, 0.875) #[4] dim(cptStayAtHomeMom) <- c(2, 2, 2) dimnames(cptStayAtHomeMom) <- list("StayAtHomeMom"=yn, "WifeDemographics"=ao, "HusbandIsProfessional"=yn) cptHouseholdHasNanny <- c(0.01, 0.99, 0.035, 0.965, 0.00134, 0.99866, 0.00134, 0.99866) #[5] dim(cptHouseholdHasNanny) <- c(2, 2, 2) dimnames(cptHouseholdHasNanny) <- list("HouseholdHasNanny"=yn, "StayAtHomeMom"=yn, "HusbandIsProfessional"=yn) cptCaretaker <- c(0.5, 0.5, 0.999, 0.001, 0.01, 0.99, 0.001, 0.999) dim(cptCaretaker) <- c(2, 2, 2) dimnames(cptCaretaker) <- list("Caretaker"=nw, "StayAtHomeMom"=yn, "HouseholdHasNanny"=yn) cptCaretakerEthnicity <- c(0.99, 0.01, 0.99, 0.01, 0.99, 0.01, 0.01, 0.99, 0.01,0.99,0.99,0.01,0.01,0.99,0.01,0.99) dim(cptCaretakerEthnicity) <- c(2, 2, 2,2) dimnames(cptCaretakerEthnicity) <- list("CaretakerEthnicity"=ao,"Caretaker"=nw, "WifeDemographics"=ao ,"NannyDemographics"=ao) net.disc <- custom.fit(net, dist=list(HusbandDemographics=cptHusbandDemographics, HusbandIsProfessional=cptHusbandIsProfessional, WifeDemographics=cptWifeDemographics, StayAtHomeMom=cptStayAtHomeMom, HouseholdHasNanny=cptHouseholdHasNanny, Caretaker=cptCaretaker, NannyDemographics=cptNannyDemographics,CaretakerEthnicity=cptCaretakerEthnicity))

Once we have the model, we can query the network using `cpquery`

to estimate the probability of the events and calculate the probability that the person is the nanny or the wife based on the evidence we have (husband is Caucasian and professional, caretaker is Asian). Based on this evidence the output is that the probability that she is the wife is **90%** vs. 10% that she is the nanny.

probWife <- cpquery(net.disc, (Caretaker=="wife"),HusbandDemographics=="caucacian" & HusbandIsProfessional=="yes" & CaretakerEthnicity=="asian",n=1000000) probNanny <- cpquery(net.disc, (Caretaker=="nanny"),HusbandDemographics=="caucacian" & HusbandIsProfessional=="yes" & CaretakerEthnicity=="asian",n=1000000) [1] "The probability that the caretaker is his wife = 0.898718647764449" [1] "The probability that the caretaker is the nanny = 0.110892031547457"

In conclusion, if you thought the woman in the video was the nanny, you may need to review your priors!

The bnlearn package is available on CRAN. You can find the R code behind this post here on GitHub or here as a Jupyter Notebook.

The Consumer Data Research Centre, the UK-based organization that works with consumer-related organisations to open up their data resources, recently published a new course online: An Introduction to Spatial Data Analysis and Visualization in R. Created by James Cheshire (whose blog Spatial.ly regularly features interesting R-based data visualizations) and Guy Lansley, both of University College London Department of Geography, this practical series is designed to provide an accessible introduction to techniques for handling, analysing and visualising spatial data in R.

In addition to a basic introduction to R, the course covers specialized topics around handling spatial and geographic data in R, including:

- Making maps in R
- Mapping point data in R
- Using R to create, explore and interact with data maps (like the one shown below)
- Performing statistical analysis on spatial data: interpolation and kriging, spatial autocorrelation, geographically weighted regression and more.

The course, tutorials and associated data are freely available (a free registration to the CDRC website is required, however). You can access the course materials at the link below.

CDRC: An Introduction to Spatial Data Analysis and Visualisation in R

If you get a blood test to diagnose a rare disease, and the test (which is very accurate) comes back positive, what's the chance you have the disease? Well if "rare" means only 1 in a thousand people have the disease, and "very accurate" means the test returns the correct result 99% of the time, the answer is ... just 9%. There's less than a 1 in 10 chance you actually have the disease (which is why doctor will likely have you tested a second time).

Now that result might seem surprising, but it makes sense if you apply Bayes Theorem. (A simple way to think of it is that in a population of 1000 people, 10 people will have a positive test result, plus the one who actually has the disease. One in eleven of the positive results, or 9%, actually detect a true disease.) The video below from Veritasium explains this Bayesian Trap quite elegantly:

That's all from us here at the blog this week. We'll be back with more on Monday. In the meantime, have a great weekend!

According to job hunting site CareerCast, the best job to have in 2017 is: Statistician. This is according to their 2017 Jobs Rated report, based on an evaluation of Bureau of Labor Statistics metrics including environment, income, employment and income growth, and stress factors.

In their rankings, Statistician is the role that took the top spot with a "work environment" score of 4 out of 199 (lower is better) , a stress factors score of 39, and a projected growth score of 3. The median salary is reported at USD$80,110 and projected to grow at 34% (per annum, I assume).

Also in the top ten: Data Scientist, in fifth place. This role scored similarly in Work Environment (12 out of 199) and stress factors (37), but had slightly lower prospects for growth (37). Here, the median salary is reported at $111,267 and projected to grow at 15.75%.

The top 10 list is as follows:

- Statistician
- Medical Services Manager
- Operations Research Analyst
- Information Security Analyst
- Data Scientist
- University Professor
- Mathematician
- Software Engineer
- Occupational Therapist
- Speech Pathologist

You can see the full list of 200 jobs ranked by CareerCast at the link below.

CareerCast: Jobs Rated Report 2017: Ranking 200 Jobs (via Royal Statistical Society)

There's a reason why data scientists spend so much time exploring data using graphics. Relying only on data summaries like means, variances, and correlations can be dangerous, because wildly different data sets can give similar results. This is a principle that has been demonstrated in statistics classes for decades with Anscombe's Quartet: four scatterplots which despite being qualitatively different all have the same mean and variance and the same correlation between them.

(You can easily check this in R by loading the data with `data(anscombe)`

.) But what you might not realize is that it's possible to generate bivariate data with a given mean, median, and correlation in *any* shape you like — even a dinosaur:

The paper linked below describes a method of perturbing the points in a scatterplot, moving them towards a given shape while keeping the statistical summaries close to the fixed target value. The shapes include a star, and a cross, and the "DataSaurus" (first created by Alberto Cairo). The authors have published a dataset they call the "DataSaurus Dozen" (also available as an R package on GitHub, with thanks to Steph Locke) of the 12 scatterplots shown. Interestingly, even the transitional frames in the animations above maintain the same summary statistics to two decimal places. Python was used to generate the data sets (and the code should be available at the link below soon.)

Read the paper linked below for more details, and always remember: look at your data!

AutoDesk Research: Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing

There's an ongoing debate in the academic community about whether Calculus is a necessary pre-requisite for teaching Statistics. But in age of ubiquitous computing resources (not to mention open source programming languages like R), there's a fair argument to be made that all you *really* need is simulation. However complex the statistical proposition, you can always find useful information about its properties simply by generating a bunch of random numbers and seeing what happens.

The same applies to education: rather than focusing on probability and calculus, students can simply see what happens when you flip a coin or roll dice, and how the statistics converge in the long run. That's the premise behind Seeing Theory, a visual introduction to probability and statistics created by Daniel Kunin, a senior at Brown University. It starts with probability: for example, rolling a fair die to show the long-term average is 3.5. There's also a neat method of estimating the value of pi by counting random points falling in (or out) of a circle, or this demonstration of the Central Limit Theorem that drops random samples out of a skew distribution and showing that their mean has a Normal distribution.

There's lots more to explore (though some of the units are still being developed), and it's a great calculus-free way to get a budding statistician interested in the topic.

That's all from us for this week. Enjoy your weekend, and we'll see you back here on Monday!

As anyone who's tried to analyze real-world data knows, there are any number of problems that may be lurking in the data that can prevent you from being able to fit a useful predictive model:

- Categorical variables can include infrequently-used levels, which will cause problems if sampling leaves them unrepresented in the training set.
- Numerical variables can be in wildly different scales, which can cause instability when fitting models.
- The data set may include several highly-correlated columns, some of which could be pruned from the data without sacrificing predictive power.
- The data set may include missing values that need to be dealt with before analysis can begin.
- ... and many others

The vtreat package is designed to counter common data problems like these in a statistically sound manner. It's a data frame preprocessor which applies a number of data cleaning processes to the input data before analysis, using techniques such as impact coding and categorical variable encoding (the methods are described in detail in this paper). Further details can be found on the vtreat github page, where authors John Mount and Nina Zumel note:

Even with modern machine learning techniques (random forests, support vector machines, neural nets, gradient boosted trees, and so on) or standard statistical methods (regression, generalized regression, generalized additive models) there are

commondata issues that can cause modeling to fail. vtreat deals with a number of these in a principled and automated fashion.

One final note: the main function in the package, prepare, is a little like model.matrix in that categorical variables are converted into numeric variables using contrast codings. This means that the output is suitable for many machine-learning functions (like xgboost) that don't accept categorical variables.

The vtreat package is available on CRAN now, and you can find a worked example using vtreat in the blog post linked below.

Win-Vector Blog: vtreat: prepare data

*by Bob Horton, Senior Data Scientist, Microsoft*

The area under an ROC curve (AUC) is commonly used in machine learning to summarize the performance of a predictive model with a single value. But you might be surprised to learn that the AUC is directly connected to the Mann-Whitney U-Statistic, which is commonly used in a robust, non-parametric alternative to Student’s t-test. Here I’ll use ‘literate analysis’ to demonstrate this connection and illustrate how the two measures are related.

In previous posts on ROC curves and AUC I described some simple ways to visualize and calculate these objects. Here is the simple data we used earlier to illustrate AUC:

```
category <- c(1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0)
prediction <- rev(seq_along(category))
prediction[9:10] <- mean(prediction[9:10])
library('pROC')
(official_auc <- auc(roc(category, prediction)))
```

`## Area under the curve: 0.825`

Here `category`

is a vector of Boolean labels marking the true status of a sequence of cases. The other vector, `prediction`

, represents a set of numeric scores as would normally be generated by some type of measurement or classifier algorithm. These scores could represent, for example, the expected probability of an object being a cat. But they don’t need to be probabilities; any value indicating the relative strength of the classifier’s confidence that the object is a cat can work, as long as the scores let us sort the cases into some order. Our fake scores are designed to put the cases in the order they start with, except that the scores of two cases have been replaced with their average; this gives us some instances where the scores are tied, which is a fairly reasonable condition we should be sure to handle. For this dataset the ‘official’ value for AUC is 0.825; when we try various other ways to calculate AUC, this is the number we want to see.

From Wikipedia we learn that

\[ {AUC}_1 = {U_1 \over n_1n_2} \]

where \(U_1\) is the Mann-Whitney U statistic, also known as the Wilcoxon rank-sum test statistic, or some combination and/or permutation of those names. That seems like a strange claim, but it is easy enough to test:

```
auc_wmw <- function(labels, scores){
labels <- as.logical(labels)
pos <- scores[labels]
neg <- scores[!labels]
U <- as.numeric(wilcox.test(pos, neg)$statistic)
U/(length(pos) * length(neg))
}
auc_wmw(category, prediction)
```

```
## Warning in wilcox.test.default(pos, neg): cannot compute exact p-value with
## ties
```

`## [1] 0.825`

The `wilcox.test`

function warns us that we “cannot compute exact p-value with ties”, but for the current exercise let’s just savor the fact that it got the AUC exactly right.

To start to decipher the U statistic, let’s calculate it ourselves instead of using the `wilcox.test`

function as a black box. Going back to Wikipedia we learn that

\[ U_1 = R_1 - {n_1(n_1+1) \over 2} \]

where \(R_1\) is the sum of the ranks of the positive cases, and \(n_1\) is the number of positive cases. You can calculate a related score (\(U_2\)) with the negative cases (of which there are \(n_2\)); these two scores are complementary in that they always add up to \(n_1 n_2\), similar to the way that flipping positives and negatives when calculating an ROC curve gives you an AUC that is one minus the AUC of the unflipped curve. Anyway, now we have:

```
auc_wmw2 <- function(labels, scores){
labels <- as.logical(labels)
n1 <- sum(labels)
n2 <- sum(!labels)
R1 <- sum(rank(scores)[labels])
U1 <- R1 - n1 * (n1 + 1)/2
U1/(n1 * n2)
}
auc_wmw2(category, prediction)
```

`## [1] 0.825`

Base R’s `rank`

function assigns the lowest rank value (1, if there are no ties) to the lowest score, and by default it averages the ranks for tied scores, which is exactly what we need. Now I’ll try to convince you that it makes sense that `U1/(n1 * n2)`

is equal to AUC.

First we plot the ranks of all the scores as a stack of horizontal bars, and color them by the labels.

```
# simplify the syntax for drawing rectangles
rectangle <- function(x, y, width, height, density=12, angle=-45, ...)
polygon(c(x,x,x+width,x+width), c(y,y+height,y+height,y),
density=density, angle=angle, ...)
U_illustration_part1 <- function(labels, scores){
# put cases in order by score
sort_order <- order(scores)
labels <- labels[sort_order]
scores <- scores[sort_order]
# count the cases
n <- length(labels)
# find overall rank for each case by score
ranks <- rank(scores)
# start with an empty plot
plot(c(0, n), c(0, n), type='n',
xlab="rank", ylab="case", asp=1)
# draw a grid in the background
abline(h=0:n, col="lightblue")
abline(v=0:n, col="lightblue")
# plot bars representing ranks of all cases
mapply(rectangle, x=0, y=(n - 1):0, # starting from the top
width=ranks, height=1,
density=NA, border="black", lwd=2, col=c("red", "green")[1 + labels])
legend("topright", legend=c("negative case", "positive case"),
text.col=c("red", "green"), bty='o', box.lwd=1, inset=0.1)
}
U_illustration_part1(labels=category, scores=prediction)
```

Now consider only the green bars, representing the ranks of the positive cases. We’ll stack them on top of one another, and slide them horizontally as needed to get a nice even stairstep on the right edge:

```
U_illustration_part2 <- function(labels, scores){
# sort the cases
sort_order <- order(scores)
labels <- labels[sort_order]
scores <- scores[sort_order]
# count positive and negative cases
n1 <- sum(labels) # number of positive cases
n2 <- sum(!labels) # number of negative cases
# find the overall ranks for the positive cases
ranks <- rank(scores)
pos_ranks <- ranks[as.logical(labels)]
# how far to slide each bar to make stairsteps on the right hand edge
x_offset <- n2 + (1:n1) - pos_ranks
# start with an empty plot
plot(c(0, n2 + n1), c(0, n1), type='n', asp=1,
xlab="n2 + n1 divisions", ylab="n1 divisions")
# plot bars for ranks of positive cases
mapply(rectangle, x=x_offset, y=(n1 - 1):0,
width=pos_ranks, height=1,
density=NA, border="darkgreen", lwd=2, col="green")
# draw the grid
abline(h=0:n1, col="lightblue")
abline(v=0:(n1 + n2), col="lightblue")
# mark the area we remove, and the area we keep
rectangle(n2, 0, n1, n1, density=10, col="red", lwd=1)
rectangle(0, 0, n2, n1, density=0, col="black", lty=2, lwd=3)
# draw a scaled version of the "official" ROC curve on top
roc_obj <- roc(labels, scores)
roc_df <- with(roc_obj, data.frame(FPR=rev(1 - specificities),
```

TPR=rev(sensitivities)))
with(roc_df, lines(n2*FPR, n1*TPR, type='l', lwd=4, col="blue"))
}
U_illustration_part2(labels=category, scores=prediction)

The total area of all the green bars equals the sum of the ranks of the positive cases (\(R_1\) in the equation). The red hatched box on the right hand side represents the part we’ll subtract. Note that the total area of green inside the red hatched box is \({n_1(n_1+1) \over 2}\), which you may recognize as the sum of the integers \(1\) through \(n_1\). The dashed rectangle on the left side shows the part we keep; the green area inside this dashed rectangle represents the value of \(U_1\).

Here `n1`

, the number of steps on the y axis, is obviously just the number of green bars. It may take a bit more contemplation to see that `n2`

, the number of steps on the x axis inside the dashed rectangle, reflects the number of red bars we removed (the negative cases). This plot is basically a transliteration of the formula for \(U_1\) into a figure.

The whole grid of the dashed rectangle thus has an area of `n1 * n2`

steps squared, and `U1`

is the area under the curve (in “square steps”). `U1/(n1 * n2)`

is the area under the curve as a fraction of the total area of the rectangle. In a traditional ROC curve the number of steps along each axis is normalized to 1, so that AUC is a fraction of a 1 by 1 area; once we normalize \(U_1\) to the area of the \(n_1\) by \(n_2\) rectangle, it equals AUC.

The blue line is the ‘official’ ROC curve scaled by `n2`

on the x (FPR) axis and by `n1`

on the y (TPR) axis, and it matches the left edge of our stacked bars except where scores are tied. The half-unit area (where two scores are tied) is split vertically rather than diagonally, but it is still a half unit, so you can see that the area of stacked green bars within the dashed rectangle corresponds exactly to the area under the ROC curve.

Let’s generate similar figures using the example from the earlier post, where the `test_set`

dataframe has a Boolean column called `bad_widget`

that labels cases, and a vector called `glm_response_scores`

containing scores from a logistic regression. We’ll use the same plotting code as above.

`U_illustration_part1(test_set$bad_widget, glm_response_scores)`

Now we remove the red bars, stack the green ones together, and slide them to make even stair-steps on the right side. The left edge of these stacked bars forms an ROC curve.

`U_illustration_part2(test_set$bad_widget, glm_response_scores)`

Again, the blue line is the ‘official’ ROC curve calculated with the `pROC`

package, then scaled by `n1`

and `n2`

on the y and x axes, respectively. In this dataset there are no tied scores so the ROC curve has no diagonal segments, and it matches the jagged contour of the left edge of our stacked rank bars exactly.

Although this graphical approach does not draw the diagonal segments of an ROC curve correctly, calculation based on the U-statistic does get the AUC right even in the presence of ties. To demonstrate this more dramatically, we’ll round off the original scores to one decimal place so that many of the scores round to the same value, producing groups of tied scores (see the ROC plots for the rounded values in the earlier AUC post).

`rounded_scores <- round(glm_response_scores, digits=1)`

Now we can use our U statistic - based functions to compute the AUC on both the original and rounded versions:

```
data.frame(
auc_pROC = auc(roc(test_set$bad_widget, glm_response_scores)),
auc_wmw = auc_wmw(test_set$bad_widget == "TRUE", glm_response_scores),
auc_wmw2 = auc_wmw2(test_set$bad_widget == "TRUE", glm_response_scores),
auc_pROC_ties = auc(roc(test_set$bad_widget, rounded_scores)),
auc_wmw_ties = auc_wmw(test_set$bad_widget, rounded_scores),
auc_wmw2_ties = auc_wmw2(test_set$bad_widget, rounded_scores)
)
```

```
## auc_pROC auc_wmw auc_wmw2 auc_pROC_ties auc_wmw_ties auc_wmw2_ties
## 1 0.903698 0.903698 0.903698 0.8972779 0.8972779 0.8972779
```

Note that the “official” AUC is slightly smaller when the scores are rounded, and that both versions of our Wilcoxon-Mann-Whitney U-Statistic based functions get the AUC exactly right, with or without ties.

This post has focused on the value and interpretation of the **U statistic** itself, but we have not discussed the **U test**, which uses this statistic (along with sample sizes) to evaluate statistical significance. According to Wikipedia, the U test evauates the null hypothesis that it is “equally likely that a randomly selected value from one sample will be less than or greater than a randomly selected value from a second sample.” As illustrated earlier, AUC reflects a similar probability, that a randomly chosen positive case will receive a higher score from your model than a randomly chosen negative case. Although exploring how this test calculates p-values is beyond the scope of this post, it should not be surprising that metrics embodying these sorts of probabilities can form the basis of such tests.

With the current focus on deep learning, neural networks are all the rage again. (Neural networks have been described for more than 60 years, but it wasn't until the the power of modern computing systems became available that they have been successfully applied to tasks like image recognition.) Neural networks are the fundamental predictive engine in deep learning systems, but it can be difficult to understand exactly what they do. To help with that, Brandon Rohrer has created this from-the-basics guide to how neural networks work:

In R, you can train a simple neural network with just a single hidden layer with the nnet package, which comes pre-installed with every R distribution. It's a great place to start if you're new to neural networks, but the deep learning applications call for more complex neural networks. R has several packages to check out here, including MXNet, darch, deepnet, and h2o: see this post for a comparison. The tensorflow package can also be used to implement various kinds of neural networks. And the rxNeuralNet function (found in the MicrosoftML package included with Microsoft R Server and Microsoft R Client) provides high-performance training of complex neural networks using CPUs and GPUs.

Data Science and Robots Blog: How neural networks work