by B. W. Lewis
This note warns about potentially misleading results when using the use="pairwise.complete.obs" and related options in R’s cor and cov functions. Pitfalls are illustrated using a very simple pathological example, followed by a brief list of alternative ways to deal with missing data and some references about them.
Known unknowns
R includes excellent facilities for handling missing values for all native data types. Perhaps counterintuitively, marking a value as missing conveys the information that the value is not known. Donald Rumsfeld might call it a “known unknown1.” Upon encountering a missing value, we can deal with it by simply omitting it, imputing it somehow, or through several other possible approaches. R does a good job of making our choice of missing value approach explicit.
The cov and cor functions in the R programming language include several options for dealing with missing data. The use="pairwise.complete.obs" option is particularly confusing and can easily lead to faulty comparisons. This note explains and warns against its use.
Consider the following tiny example:
(x = matrix(c(-2,-1,0,1,2,1.5,2,0,1,2,NA,NA,0,1,2),5))
## [,1] [,2] [,3]
## [1,] -2 1.5 NA
## [2,] -1 2.0 NA
## [3,] 0 0.0 0
## [4,] 1 1.0 1
## [5,] 2 2.0 2
The function V=cov(x) computes the symmetric covariance matrix V with entries V[i,j] defined by the covariance of columns i and j of x, where i, j = 1, 2, 3 in this example. The function cor(x) similarly computes the symmetric correlation matrix with entries defined by the pairwise correlation of the columns of x. For example:
cov(x)
## [,1] [,2] [,3]
## [1,] 2.5 0.0 NA
## [2,] 0.0 0.7 NA
## [3,] NA NA NA
cor(x)
## [,1] [,2] [,3]
## [1,] 1 0 NA
## [2,] 0 1 NA
## [3,] NA NA 1
Due to missing values in the third column of x, we know that we don’t know the covariance between x[,3] and anything else. Thanks to an arguably questionable2 choice in R’s cov2cor function, R reports that the correlation of x[,3] with itself is one, but we don’t know the correlation between x[,3] and the other columns.
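That inconsistency is easy to demonstrate directly with the example matrix x defined above:

```r
# The example matrix with missing values in the third column
x = matrix(c(-2,-1,0,1,2,1.5,2,0,1,2,NA,NA,0,1,2), 5)

# cor(x) puts a 1 on the diagonal for column 3...
cor(x)[3, 3]          # 1
# ...but correlating that column with itself directly propagates the NA:
cor(x[, 3], x[, 3])   # NA
```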
The use="complete" option is one way to deal with missing values. It simply removes rows of the matrix x with missing observations. Since the columns of the third through fifth rows of our example matrix are all identical, we expect perfect correlation across the board, and indeed:
cor(x, use="complete")
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 1 1 1
## [3,] 1 1 1
Reasonable people might question this approach. Deleting two observations has a huge effect on the correlation between x[,1] and x[,2] in this example, mostly because of the large change in x[,1]. The result really says that we should collect more observations!
Unknown knowns
The use="pairwise.complete.obs" option is an even less reasonable way to deal with missing values. When specified, R computes correlations for each pair of columns using vectors formed by omitting rows with missing values on a pairwise basis. Thus each column vector may vary depending on its pairing, resulting in correlation values that are not even comparable. Consider our simple example again:
cor(x, use="pairwise.complete.obs")
## [,1] [,2] [,3]
## [1,] 1 0 1
## [2,] 0 1 1
## [3,] 1 1 1
By this bizarre measurement, the correlation of x[,1] and x[,2] is zero (as we saw above in the first example), and yet cor claims that x[,3] is perfectly correlated with both x[,1] and x[,2]. In other words, the result is nonsense. As Rumsfeld might say, we’ve converted known unknowns into unknown knowns.
What’s going on here is that the reported correlations are not comparable because they are computed against different vectors: all of x[,1] and x[,2] are compared to each other, but only parts of x[,1] and x[,2] are compared to x[,3].
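The mismatch can be reproduced by hand; each pairwise entry is computed from a different subset of the rows:

```r
x = matrix(c(-2,-1,0,1,2,1.5,2,0,1,2,NA,NA,0,1,2), 5)

# x[,1] vs x[,2]: all five rows are complete, so all five are used
cor(x[, 1], x[, 2])          # 0
# x[,1] vs x[,3]: only rows 3-5 are complete, so only those are used
cor(x[3:5, 1], x[3:5, 3])    # 1
```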
The bad result is obvious for our small example. But the danger here is in large matrices with lots of missing values, where it may be impossible to use the pairwise option in a meaningful way.
Recommendations
If you want to run correlations on lots of vectors with missing values, consider simply using the R default of use="everything" and propagating missing values into the correlation matrix. This makes it clear what you don’t know.
If you really don’t want to do that, consider imputing the missing values. The simplest method replaces missing values in each column with the mean of the non-missing values in the respective column:
m = mean(na.omit(x[,3]))
xi = x
xi[is.na(x)] = m
cor(xi)
## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0.4472136
## [2,] 0.0000000 1.0000000 0.8451543
## [3,] 0.4472136 0.8451543 1.0000000
This can be done really efficiently when “centering” a matrix by simply replacing missing values of the centered matrix with zero.
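A sketch of that trick, assuming base R’s scale function (whose centering ignores NAs): because correlation is unchanged by shifting each column by a constant, zero-filling the centered matrix is equivalent to mean imputation.

```r
x = matrix(c(-2,-1,0,1,2,1.5,2,0,1,2,NA,NA,0,1,2), 5)

xc = scale(x, center = TRUE, scale = FALSE)  # subtract column means (NAs ignored)
xc[is.na(xc)] = 0                            # zero now stands for the column mean
cor(xc)                                      # same correlations as mean imputation
```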
Sometimes it might make more sense to use a piecewise constant interpolant, referred to as “last observation carried forward” (LOCF), especially when dealing with time series and other ordered data. In yet other cases a known default value (perhaps from a much larger population than the one under study) might be more appropriate.
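For instance, here is a minimal last-observation-carried-forward sketch in base R (the zoo package’s na.locf function does this more robustly):

```r
# Carry the last non-missing value forward; leading NAs stay missing
locf = function(v) {
  idx = cumsum(!is.na(v))   # index of the most recent non-NA value seen
  idx[idx == 0] = NA        # no previous value to carry for leading NAs
  v[!is.na(v)][idx]
}
locf(c(1, NA, NA, 4, NA))   # 1 1 1 4 4
```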
Another basic approach bootstraps the missing values from the non-missing ones:
i = is.na(x[,3])
N = sum(i)
b = replicate(500, {
  x[i,3] = sample(x[!i,3], size=N, replace=TRUE)  # resample the missing entries
  cor(x[,1:2], x[,3])
})
# Average imputed values of cor(x[,1],x[,3]) and cor(x[,2],x[,3])
apply(b,1,mean)
## [1] 0.3722048 0.6845172
# Standard deviation of imputed values of cor(x[,1],x[,3]) and cor(x[,2],x[,3])
apply(b,1,sd)
## [1] 0.3361684 0.2156523
If you have lots of observations, consider partitioning them with a basic clustering algorithm first and then imputing the missing values from their respective cluster cohorts. Or consider a matching method or a regression-based imputation method. See the references below for many more details.
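As a sketch of the regression-based idea, using only base R’s lm and predict on our tiny example (a single imputation, shown for illustration):

```r
x = matrix(c(-2,-1,0,1,2,1.5,2,0,1,2,NA,NA,0,1,2), 5)
d = data.frame(x)   # columns are named X1, X2, X3

# Fit on the complete rows (lm drops rows with NA by default)
fit = lm(X3 ~ X1, data = d)
miss = is.na(d$X3)
d$X3[miss] = predict(fit, newdata = d[miss, , drop = FALSE])
cor(d)
```

X2 is left out of the model here only because it equals X1 on the complete rows of this toy example, which would make the fit rank-deficient.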
References and packages
There are of course many excellent R packages and references on missing data. I recommend consulting the following packages and references:
- http://www.stat.columbia.edu/~gelman/arm/missing.pdf
- http://cran.r-project.org/web/packages/mi/
- http://gking.harvard.edu/amelia
- http://cran.r-project.org/web/views/OfficialStatistics.html (Look for the “Imputation” section in this task view for a list of R packages.)
2. The cov2cor function used by cor always puts ones along the diagonal of a correlation matrix; that choice is valid only if all unknowns may assume bounded and valid numeric values, which is actually pretty reasonable. (But in a rare example of inconsistency in R, cor(x[,3],x[,3]) returns NA. Yikes!)
Hi, Bryan, Norm Matloff here. I must respectfully disagree with your post.
One should never rely on correlations in small data sets (or tiny ones, as you call your example). Second-order moments are a lot harder to estimate than first-order ones, e.g. the variance of a sample variance is large.
The Available Cases method for dealing with missing values, as exemplified in the pairwise.complete.obs option you cite, is a lot more useful than many people realize. It does tacitly make strong assumptions, but so do all of the missing-value methods, including in Amelia.
I will present a paper on this topic at the JSM in August, and will release an R package in the next couple of weeks.
Posted by: D | June 16, 2015 at 16:22
Sorry, that last post got mangled. The URL for my paper is http://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=316343
Posted by: D | June 16, 2015 at 16:24
I want to join Prof. Matloff in respectfully disagreeing.
In this n=tiny example, things indeed become silly. But this is not due to the pairwise.complete.obs problem; it is because computing correlations for n=3 is in itself silly.
Furthermore, your suggestion for imputing the mean can lead to *serious* underestimation of the correlation.
When you have a lot of data, e.g. n = 100 for x[,1] and x[,2] and n = 90 for x[,3], there really is no methodological issue with computing a correlation between x[,2] and x[,3] on basis of the 90 overlapping observations. Indeed, n = 90 is lower than the n = 100 observations you have for x[,1] and x[,2], but also n = 100 is nothing more than a (random) sample from a larger population. As long as there is not a specific mechanism that decides which values are missing (Missing-Not-At-Random, MNAR), you can simply regard the n = 90 as a regular random sample and compute r for it.
Your suggestion to impute the mean is plain wrong. By imputing the mean, you make the vector of observations as flat as possible. (Suppose x[,3] would have been -2, -1, 0, 1, 2: a straight line with slope +1 and correlation +1 with x[,1]. Because -2 and -1 are unknown, you propose to impute them with +1, making the series +1, +1, 0, +1, +2; much flatter, and now with a correlation with x[,1] much closer to zero (.44).) Of course you do not know whether the two missing values are -2 and -1 or something else, but as long as you assume that the values are missing-completely-at-random (MCAR) or missing-at-random (MAR), you can easily prove that this imputation method will yield, on average, estimates of the correlation biased towards zero.
Posted by: Casper Albers | June 17, 2015 at 00:19
I will join the others in respectfully disagreeing. The two widely accepted state-of-the-art techniques for treating missing data in various settings are (1) maximum likelihood estimation and (2) multiple imputation. You provide some good advice by discouraging the use of pairwise deletion, which can lead to some issues both in terms of bias and calculation. Yet that good advice is outright cancelled out by the recommendation to use a single imputation method, which has long been discredited in the statistical literature.
Posted by: Maxim K. | June 18, 2015 at 05:28
Thanks for these thoughtful comments. The intentionally silly pathological
example, like most textbook examples, is intended to get to the heart of the
matter and clearly illustrate the problem that pairwise deletion of cases leads
to incomparable correlation values.
I very much agree that mean imputation can often be a bad idea; that's why I
suggest a number of alternatives (including multiple imputation).
But really the whole point of the article, I
hope, is to provoke the reader to *think carefully* about their problem when
missing values are involved!
Posted by: Bryan W. Lewis | June 19, 2015 at 05:06
Look at this simple simulation: your point may hold only for very simple simulations with very high missingness. In other cases this procedure is better than mean or bootstrap imputation.
The plots are here https://twitter.com/riquiapaza/status/612642731530301440
Posted by: Robert | June 21, 2015 at 08:34