by Terry M. Therneau Ph.D.

Faculty, Mayo Clinic

About a year ago there was a query about how to do "type 3" tests for a Cox model on the R help list, which someone wanted because SAS does it. The SAS addition looked suspicious to me, but as the author of the survival package I thought I should understand the issue more deeply. It took far longer than I expected but has been illuminating.

First off, what exactly is this 'type 3' computation of which SAS so deeply enamored? Imagine that we are dealing with a data set that has interactions. In my field of biomedical statistics all data relationships have interactions: an effect is never precisely the same for young vs old, fragile vs robust, long vs short duration of disease, etc. We may not have the sample size or energy to model them, but they exist nonetheless. Assume as an example that we had a treatment effect that increases with age; how then would one describe a main effect for treatment? One approach is to select an age distribution of interest and use the mean treatment effect, averaged over that age distribution.

To compute this, one can start by fitting a sufficiently rich model, get predicted values for our age distribution, and then average them. This requires almost by definition a model that includes an age by treatment interaction: we need reasonably unbiased estimates of the treatment effects at individual ages a,b,c,... before averaging, or we are just fooling ourselves with respect to this overall approach. The SAS type 3 method for linear models is exactly this. It assumes as the "reference population of interest" a uniform distribution over any categorical variables and the observed distribution of the data set for any continuous ones, followed by a computation of the average predicted value. Least squares means are also an average prediction taken over the reference population.

A primary statistical issue with type 3 is the choice of reference. Assume for instance that age had been coded as a categorical with levels of 50-59, 60-69, 70-79 and 80+. A type 3 test answers the question of what the treatment effect would be in a population of subjects in which 1/4 were aged 50-59, another 1/4 were 60-69, etc. Since I will never encounter a set of subjects with said pattern in real life, such an average is irrelevant . A nice satire of the situation can be found under the nom de plume of Guernsey McPearson (Also have a look at *Multi-Centre Trials and the Finally Decisive Argument*). To be fair there are other cases where the uniform distribution is precisely the right population, e.g., a designed experiment that lost perfect balance due to a handful of missing response values. But these are rare to non-existent in my world, and type 3 remains an answer to the question that nobody asked.

Average population prediction also highlights a serious deficiency in R. Working out the algebra, type 3 tests for a linear model turn out to be a contrast, C %*% coef(fit), for a particular contrast vector or matrix C. This fits neatly into the SAS package, which has a simple interface for user specified contrasts. (The SAS type 3 algorithm is at its heart simply an elegant way to derive C for their default reference population.) The original S package took a different view, which R has inherited, of pre instead of post-processing. Several of the common contrasts one might want to test can be obtained by clever coding of the design matrix X, before the fit, causing the contrast of interest to appear as one of the coefficients of the fitted model. This is a nice idea when it works, but there are many cases where it is insufficient, a linear trend test or all possible pair-wise comparisons for example.

*R needs a general and well thought out post-fit contrasts function. Population averaged estimates could be one option of said routine, with the SAS population one possible choice.*

Also, I need to mention a couple more things:

- The standard methods for computing type 3 that I see in the help lists are flawed, giving seriously incorrect answers unless sum-to-zero constraints were used for the fit (contr.sum). This includes both the use of drop.terms and the anova function in the car package.

- For coxph models, my original goal, the situation is even more complex. In particular, which average does one want: average log hazard ratio, average hazard ratio, ratio of average hazards, or something else? Only one of these can be rewritten as a contrast in the coefficients, and thus the clever linear models algorithms do not transfer.