by Bob Horton, Senior Data Scientist, Microsoft
The area under an ROC curve (AUC) is commonly used in machine learning to summarize the performance of a predictive model with a single value. But you might be surprised to learn that the AUC is directly connected to the Mann-Whitney U statistic, the test statistic behind a robust, non-parametric alternative to Student’s t-test. Here I’ll use ‘literate analysis’ to demonstrate this connection and show how the two measures are related.
In previous posts on ROC curves and AUC I described some simple ways to visualize and calculate these objects. Here is the simple data we used earlier to illustrate AUC:
category <- c(1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0)
prediction <- rev(seq_along(category))
prediction[9:10] <- mean(prediction[9:10])
library('pROC')
(official_auc <- auc(roc(category, prediction)))
## Area under the curve: 0.825
Here category is a vector of Boolean labels marking the true status of a sequence of cases. The other vector, prediction, represents a set of numeric scores as would normally be generated by some type of measurement or classifier algorithm. These scores could represent, for example, the expected probability of an object being a cat. But they don’t need to be probabilities; any value indicating the relative strength of the classifier’s confidence that the object is a cat will work, as long as the scores let us sort the cases into some order. Our fake scores are designed to put the cases in the order they start with, except that the scores of two cases have been replaced with their average; this gives us some instances where the scores are tied, a common condition that any AUC calculation should be sure to handle. For this dataset the ‘official’ value for AUC is 0.825; when we try various other ways to calculate AUC, this is the number we want to see.
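One way to see why only the ordering matters: the AUC equals the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative case, with ties counting as half. Here is a base-R sketch of this pairwise interpretation (the helper auc_pairwise is just an illustrative name, not part of any package); note that a monotonic transformation of the scores preserves their order, and so leaves the AUC unchanged:

```r
category <- c(1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0)
prediction <- rev(seq_along(category))
prediction[9:10] <- mean(prediction[9:10])

# AUC as the fraction of positive/negative pairs in which the positive
# case outscores the negative one, counting ties as half a win.
auc_pairwise <- function(labels, scores) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  wins <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
  mean(wins)
}

auc_pairwise(category, prediction)    # 0.825
# Cubing the (positive) scores preserves their order,
# so the AUC is unchanged:
auc_pairwise(category, prediction^3)  # 0.825
```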
Computing AUC from the U Statistic
From Wikipedia we learn that
\[ {AUC}_1 = {U_1 \over n_1n_2} \]
where \(U_1\) is the Mann-Whitney U statistic, also known as the Wilcoxon rank-sum test statistic, or some combination and/or permutation of those names. That seems like a strange claim, but it is easy enough to test:
auc_wmw <- function(labels, scores){
  labels <- as.logical(labels)
  pos <- scores[labels]   # scores for the positive cases
  neg <- scores[!labels]  # scores for the negative cases
  U <- as.numeric(wilcox.test(pos, neg)$statistic)
  U / (length(pos) * length(neg))
}
auc_wmw(category, prediction)
## Warning in wilcox.test.default(pos, neg): cannot compute exact p-value with
## ties
## [1] 0.825
The wilcox.test function warns us that we “cannot compute exact p-value with ties”, but that warning applies only to the p-value, not to the statistic itself; for the current exercise let’s just savor the fact that it got the AUC exactly right.
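In fact, we can sidestep the warning entirely by computing U ourselves from rank sums, since the statistic (unlike the exact p-value) is perfectly well defined in the presence of ties. A base-R sketch, using the standard identity \(U_1 = R_1 - n_1(n_1+1)/2\), where \(R_1\) is the sum of the ranks of the positive cases’ scores in the combined sample (auc_rank is again just an illustrative name):

```r
category <- c(1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0)
prediction <- rev(seq_along(category))
prediction[9:10] <- mean(prediction[9:10])

# Compute U1 from rank sums, then scale by n1*n2 to get the AUC.
auc_rank <- function(labels, scores) {
  r <- rank(scores)  # average ranks handle ties automatically
  n1 <- sum(labels == 1)
  n2 <- sum(labels == 0)
  U1 <- sum(r[labels == 1]) - n1 * (n1 + 1) / 2
  U1 / (n1 * n2)
}

auc_rank(category, prediction)  # 0.825, with no warning
```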