## June 16, 2015

Hi, Bryan, Norm Matloff here. I must respectfully disagree with your post.

One should never rely on correlations in small data sets (or tiny ones, as you call your example). Second-order moments are a lot harder to estimate than first-order ones, e.g. the variance of a sample variance is large.
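The instability of second-order moments is easy to check by simulation. A minimal sketch in Python with NumPy (the sample size and number of replications are my own illustrative choices): for samples of size 10 from a standard normal, the sample variance has mean 1 but variance 2/(n-1) = 2/9, which is large relative to the quantity being estimated.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10, 100_000

# Draw many small samples and record each one's (unbiased) sample variance.
samples = rng.standard_normal((reps, n))
sample_vars = samples.var(axis=1, ddof=1)

# For N(0, 1) the sample variance has mean 1 and variance 2/(n-1) = 2/9.
print(round(sample_vars.mean(), 2))  # ~ 1.0
print(round(sample_vars.var(), 2))   # ~ 0.22: a noisy estimator at n = 10
```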

The Available Cases method for dealing with missing values, as exemplified in the pairwise.complete.obs option you cite, is a lot more useful than many people realize. It does tacitly make strong assumptions, but so do all of the missing-value methods, including in Amelia.

I will be presenting a paper on this topic at the JSM in August, and will release an R package in the next couple of weeks.

Sorry, that last post got mangled. The URL for my paper is http://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=316343

I want to join Prof. Matloff in respectfully disagreeing.

In this n=tiny example, things indeed become silly. But that is not due to the pairwise.complete.obs problem; computing correlations for n = 3 is silly in itself.

Furthermore, your suggestion to impute the mean can lead to *serious* underestimation of the correlation.
When you have a lot of data, e.g. n = 100 for x[,1] and x[,2] and n = 90 for x[,3], there really is no methodological issue with computing the correlation between x[,2] and x[,3] on the basis of the 90 overlapping observations. True, n = 90 is lower than the n = 100 observations you have for x[,1] and x[,2], but n = 100 is itself nothing more than a (random) sample from a larger population. As long as there is no specific mechanism deciding which values are missing, i.e. the data are not Missing-Not-At-Random (MNAR), you can simply regard the n = 90 as a regular random sample and compute r for it.
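This point can be illustrated with a short sketch (in Python with NumPy rather than R; the sample sizes match the example above, but the slope and noise level are my own assumptions): after deleting 10 of 100 values completely at random, the available-cases correlation computed on the 90 overlapping observations is an ordinary estimate of the same quantity as the full-data correlation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# A correlated pair: x3 = 0.7*x2 + noise, so a true correlation exists.
x2 = rng.standard_normal(n)
x3 = 0.7 * x2 + rng.standard_normal(n)

# Knock out 10 values of x3 completely at random (MCAR).
missing = rng.choice(n, size=10, replace=False)
x3_obs = x3.copy()
x3_obs[missing] = np.nan

# Available-cases estimate: correlation over the 90 overlapping observations,
# i.e. what R computes with use = "pairwise.complete.obs".
ok = ~np.isnan(x3_obs)
r_pairwise = np.corrcoef(x2[ok], x3_obs[ok])[0, 1]
r_full = np.corrcoef(x2, x3)[0, 1]
print(round(r_full, 3), round(r_pairwise, 3))  # the two estimates are close
```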

Your suggestion to impute the mean is plain wrong. By imputing the mean, you make the vector of observations as flat as possible. Suppose x[,3] would have been -2, -1, 0, 1, 2: a straight line with slope +1 and correlation +1 with x[,1]. Because -2 and -1 are unknown, you propose to impute them with +1 (the mean of the observed values 0, 1, 2), making the series +1, +1, 0, +1, +2: much flatter, and with a correlation with x[,1] now much closer to zero (.44). Of course you do not know whether the two missing values are -2 and -1 or something else, but as long as you assume the values are missing-completely-at-random (MCAR) or missing-at-random (MAR), you can easily prove that this imputation method will, on average, yield estimates of the correlation that are biased towards zero.
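The arithmetic above checks out; here is the same worked example as a quick check (in Python with NumPy, with the mean imputation done by hand):

```python
import numpy as np

x1 = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
x3_full = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # what x[,3] would have been

# The first two values are missing; mean-impute them with the mean of the
# observed values (0, 1, 2), which is 1.
fill = x3_full[2:].mean()
x3_imputed = np.array([fill, fill, 0.0, 1.0, 2.0])

print(round(np.corrcoef(x1, x3_full)[0, 1], 3))     # 1.0
print(round(np.corrcoef(x1, x3_imputed)[0, 1], 3))  # 0.447, shrunk toward 0
```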

I will join the others in respectfully disagreeing. The two widely accepted state-of-the-art techniques for treating missing data in various settings are (1) maximum likelihood estimation and (2) multiple imputation. You provide some good advice by discouraging the use of pairwise deletion, which can lead to issues in terms of both bias and computation. Yet that good advice is outright cancelled out by the recommendation to use a single-imputation method, which has long been discredited in the statistical literature.

Thanks for these thoughtful comments. The intentionally silly pathological
example, like most textbook examples, is meant to get to the heart of the
matter and clearly illustrate the problem: pairwise deletion of cases leads
to incomparable correlation values.

I very much agree that mean imputation can often be a bad idea; that's why I
suggest a number of alternatives (including multiple imputation).
But really the whole point of the article, I
hope, is to provoke the reader to *think carefully* about their problem when
missing values are involved!

Look at this simple simulation: your point may hold only for very simple simulations with very high rates of missingness. In other cases this procedure is better than mean or bootstrap imputation.
