Matched Case-Control Studies (Reading: Faraway (2006), section 2.12)
ˇ@
Le (1998) presented a data from a matched case-control study concerning the association between x-rays and childhood acute myeloid leukemia. The sets are matched on age, race, and county of residence. The subjects with the same ID value are in same matched set. We start by reading the data into R and take a look of it:
> amlxray <- read.table("amlxray.txt")
> amlxray
ID disease Sex downs age Mray MupRay MlowRay Fray Cray CnRay
1 7004 1 F no 0 no no no no no 1
2 7004 0 F no 0 no no no no no 1
3 7006 1 M no 6 no no no no yes 3
4 7006 0 M no 6 no no no no yes 2
5 7009 1 F no 8 no no no no no 1
6 7009 0 F no 8 no no no no no 1
ˇKdeletedˇK
238 7211 0 F no 6 no no no no no 1
Notice that:
For the most part of the data, there is only one control for each case, but there are a few instances of two controls. This is an example of sparse data.
Only the age is presented here as one of the matching variables. In the first three matched sets (the first 6 rows of data) shown here, we can see that both subjects in any matched set have the same age and the first is the case and the second is the control. The other variables, Sex, downs, Mray, MupRay, ..., CnRay, are risk factors of interest.
ˇ@
Let us first pay some attention to the
data itself before fitting a model. In the first stage of data analysis, it is
often required to perform some initial analyses to gain some basic information
of data and evaluate the quality of data. The variable
downs indicates whether the
subject has Down's syndrome. Let us take a look of the data of the subjects with
a Down's syndrome:
> amlxray[amlxray$downs=="yes",1:4]
ID disease Sex downs
7 7010 1 M yes
17 7018 1 F yes
78 7066 1 F yes
88 7077 1 M yes
173 7146 1 F yes
196 7176 1 F yes
210 7189 1 F yes
We find that:
there are only seven subjects with Down's syndrome
all the seven subjects are cases
therefore, if we include the variable downs in the binomial regression, its coefficient is infinite
given this and the prior knowledge, it is simplest to exclude all the subjects with Down's syndrome and their associated matched subjects
> (ii <- which(amlxray$downs=="yes"))
[1] 7 17 78 88 173 196 210
> ramlxray <- amlxray[-c(ii,ii+1),]
ˇ@
For the other risk factors,
the variable Fray records whether the father has ever had an x-ray
the variables Mray, MupRay, and MlowRay record whether the mother has ever had an x-ray, ever had an upper body x-ray, and ever had a lower body x-ray, respectively.
the association between the three variables: Mray=yes, if either MupRay=yes or MlowRay=yes; Maray=no, if both MupRay=no and MlowRay=no (notice the hierarchy of measurement scale between them)
Because these variables are closely associated, we will pick just Mray for now and investigate the others more closely if Mray is indicated as an important predictor
An alternative choice is to use MupRay and MlowRay to generate a new four-level factor.
the variable CnRay is a four-level ordered factor grouping the number of x-rays that the child has received and the variable Cray merely indicates whether the child has ever had an x-ray
the four levels of CnRay is: CnRay=1, if no x-ray; CnRay=2, if 1 or 2 x-rays; CnRay=3, if 3 or 4 x-rays; CnRay=4, if 5 or more x-rays (notice the hierarchy of measurement scale between them)
the association between CnRay and Cray: Cray=no, if CnRay=1; Cray=yes, if CnRay=2, 3, or 4 (notice the hierarchy of measurement scale between them)
we will use CnRay in preference to Cray
in R, we can declare a variable as ordinal and find how the ordinal factor is coded by:
> ramlxray$CnRay <- as.ordered(ramlxray$CnRay)
> contrasts(ramlxray$CnRay)
.L .Q .C
1 -0.6708204 0.5 -0.2236068
2 -0.2236068 -0.5 0.6708204
3 0.2236068 -0.5 -0.6708204
4 0.6708204 0.5 0.2236068
Notice that:
the linear (.L), quadratic (.Q), cubic (.C) terms are coded based on orthogonal polynomials
in other words, the coding actually regards CnRay as a continuous variable and treats the four levels in CnRay as equally-spaced levels
ˇ@
Now, let us fit a model to the data. Since the conditional likelihood of a logit model (see lecture notes p.3-37) for matched case-control data takes the same form as that used for the proportional hazards model (PHM) in survival analysis, we can use software developed for PHM to fit the logit model. In R, the clogit function in the survival library can do the work. Note that the matched sets must be designated by the strata function:
> library(survival)
> cmod <- clogit(disease ~ Sex+Mray+Fray+CnRay+strata(ID),ramlxray)
> summary(cmod)
Call:
coxph(formula = Surv(rep(1, 224), disease) ~ Sex + Mray + Fray +
CnRay + strata(ID), data = ramlxray, method = "exact")
n= 224
coef exp(coef) se(coef) z p
SexM 0.156 1.17 0.386 0.405 0.6900
Mrayyes 0.228 1.26 0.582 0.391 0.7000
Frayyes 0.693 2.00 0.351 1.974 0.0480
CnRay.L 1.941 6.96 0.621 3.127 0.0018
CnRay.Q -0.248 0.78 0.582 -0.426 0.6700
CnRay.C -0.580 0.56 0.591 -0.982 0.3300
exp(coef) exp(-coef) lower .95 upper .95
SexM 1.17 0.855 0.549 2.49
Mrayyes 1.26 0.796 0.401 3.93
Frayyes 2.00 0.500 1.005 3.98
CnRay.L 6.96 0.144 2.063 23.51
CnRay.Q 0.78 1.281 0.249 2.44
CnRay.C 0.56 1.786 0.176 1.78
Rsquare= 0.089 (max possible= 0.499 )
Likelihood ratio test= 20.9 on 6 df, p=0.00192
Wald test = 14.5 on 6 df, p=0.0246
Score (logrank) test = 18.6 on 6 df, p=0.0049
Notice that:
the overall tests (test statistics=20.9/14.5/18.6) for the significance of at lease one of the predictors have small p-values, which indicates that at least some of the predictors are important
the predictors Sex and Mray are not significant. There seems little point in investigating the other x-ray variables associated with the mother.
an x-ray on the father, i.e., Fray, is marginally significant because its corresponding p-value=0.0480»0.05
for Fray, coef=0.693 and exp(coef)=2.00. We can say that the father having had an x-ray increases the log-odds of the disease by 0.693 or doubles the odds of the disease
the x-ray on the child, i.e., CnRay, has the clearest effect. Among its linear, quadratic, and cubic contrasts, only the linear effect is significant.
the estimated value of the linear effect of CnRay is 1.941, which indicates that when CnRay is changed from levels 1 to 2 (or 2 to 3, or 3 to 4),
the log-odds of the disease would be increased by 0.8680416=1.941*(0.6708204-0.2236068), where 0.6708204-0.2236068 is the difference of the linear codings of two adjacent levels of CnRay
the odds of the disease would be increased by a factor of 2.382241=exp(1.941*(0.6708204-0.2236068))
ˇ@
Let us recursively eliminate
predictors/effects which are not significant to simplify the model (intermediate
steps skipped). Since we have found only a significant linear effect for
CnRay, we can convert levels
1-4 in CnRay to the
numerical values 1-4, respectively, using the unclass
function (note: the former and the latter 1-4
have different meanings). The final model is:
> cmodr <- clogit(disease ~ Fray+unclass(CnRay)+strata(ID),ramlxray)
> summary(cmodr)
Call:
coxph(formula = Surv(rep(1, 224), disease) ~ Fray + unclass(CnRay) + strata(ID), data = ramlxray, method = "exact")
n= 224
coef exp(coef) se(coef) z p
Frayyes 0.670 1.96 0.344 1.95 0.05100
unclass(CnRay) 0.814 2.26 0.237 3.44 0.00058
exp(coef) exp(-coef) lower .95 upper .95
Frayyes 1.96 0.512 0.996 3.84
unclass(CnRay) 2.26 0.443 1.419 3.59
Rsquare= 0.084 (max possible= 0.499 )
Likelihood ratio test= 19.6 on 2 df, p=5.69e-05
Wald test = 14.1 on 2 df, p=0.000859
Score (logrank) test = 17.6 on 2 df, p=0.000154
Notice that:
the odds of the disease increase by a factor of 2.26 as we move between adjacent categories of CnRay, which is about the same as the 2.382241 we got in the result of the previous fitted model
the father's x-ray variable is now just insignificant in this model underlining its borderline status
the conclusions and explanations we can draw from this model are similar to those from the previous one
although we have found an effect due to x-rays of the child, we cannot conclude the effect is causal. After all, subjects only have x-rays when something is wrong, so it is quite possible that the x-rays are linked to some unknown lurking variables.
ˇ@
An incorrect analysis of this data ignores the matching structure and simply uses a binomial GLM:
> gmod <- glm(disease ~ Fray+unclass(CnRay),family=binomial,ramlxray)
> summary(gmod)
Call:
glm(formula = disease ~ Fray + unclass(CnRay), family = binomial, data = ramlxray)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.950 -0.950 -0.950 1.204 1.423
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1623 0.3011 -3.861 0.000113 ***
Frayyes 0.5004 0.3078 1.626 0.104046
unclass(CnRay) 0.6005 0.1774 3.385 0.000711 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 309.39 on 223 degrees of freedom
Residual deviance: 293.26 on 221 degrees of freedom
AIC: 299.26
Number of Fisher Scoring iterations: 4
We can see that the results are somewhat different.