Matched Case-Control Studies (Reading: Faraway (2006), section 2.12)

Le (1998) presented a data from a matched case-control study concerning the association between x-rays and childhood acute myeloid leukemia. The sets are matched on age, race, and county of residence. The subjects with the same ID value are in same matched set. We start by reading the data into R and take a look of it:

> amlxray

ID disease Sex downs age Mray MupRay MlowRay Fray Cray CnRay

1   7004       1   F    no   0   no     no      no   no   no     1

2   7004       0   F    no   0   no     no      no   no   no     1

3   7006       1   M    no   6   no     no      no   no  yes     3

4   7006       0   M    no   6   no     no      no   no  yes     2

5   7009       1   F    no   8   no     no      no   no   no     1

6   7009       0   F    no   8   no     no      no   no   no     1

…deleted…

238 7211       0   F    no   6   no     no      no   no   no     1

Notice that:

• For the most part of the data, there is only one control for each case, but there are a few instances of two controls. This is an example of sparse data.

• Only the age is presented here as one of the matching variables. In the first three matched sets (the first 6 rows of data) shown here, we can see that both subjects in any matched set have the same age and the first is the case and the second is the control. The other variables, Sex, downs, Mray, MupRay, ..., CnRay, are risk factors of interest.

Let us first pay some attention to the data itself before fitting a model. In the first stage of data analysis, it is often required to perform some initial analyses to gain some basic information of data and evaluate the quality of data. The variable downs indicates whether the subject has Down's syndrome. Let us take a look of the data of the subjects with a Down's syndrome:
> amlxray[amlxray\$downs=="yes",1:4]

ID disease Sex downs

7   7010       1   M   yes

17  7018       1   F   yes

78  7066       1   F   yes

88  7077       1   M   yes

173 7146       1   F   yes

196 7176       1   F   yes

210 7189       1   F   yes

We find that:

• there are only seven subjects with Down's syndrome

• all the seven subjects are cases

• therefore, if we include the variable downs in the binomial regression, its coefficient is infinite

• given this and the prior knowledge, it is simplest to exclude all the subjects with Down's syndrome and their associated matched subjects

> (ii <- which(amlxray\$downs=="yes"))

   7  17  78  88 173 196 210

> ramlxray <- amlxray[-c(ii,ii+1),]

For the other risk factors,

• the variable Fray records whether the father has ever had an x-ray

• the variables Mray, MupRay, and MlowRay record whether the mother has ever had an x-ray, ever had an upper body x-ray, and ever had a lower body x-ray, respectively.

• the association between the three variables: Mray=yes, if either MupRay=yes or MlowRay=yes; Maray=no, if both MupRay=no and MlowRay=no (notice the hierarchy of measurement scale between them)

• Because these variables are closely associated, we will pick just Mray for now and investigate the others more closely if Mray is indicated as an important predictor

• An alternative choice is to use MupRay and MlowRay to generate a new four-level factor.

• the variable CnRay is a four-level ordered factor grouping the number of x-rays that the child has received and the variable Cray merely indicates whether the child has ever had an x-ray

• the four levels of CnRay is: CnRay=1, if no x-ray; CnRay=2, if 1 or 2 x-rays; CnRay=3, if 3 or 4 x-rays; CnRay=4, if 5 or more x-rays (notice the hierarchy of measurement scale between them)

• the association between CnRay and Cray: Cray=no, if CnRay=1; Cray=yes, if CnRay=2, 3, or 4 (notice the hierarchy of measurement scale between them)

• we will use CnRay in preference to Cray

• in R, we can declare a variable as ordinal and find how the ordinal factor is coded by:

> ramlxray\$CnRay <- as.ordered(ramlxray\$CnRay)

> contrasts(ramlxray\$CnRay)

.L   .Q         .C

1 -0.6708204  0.5 -0.2236068

2 -0.2236068 -0.5  0.6708204

3  0.2236068 -0.5 -0.6708204

4  0.6708204  0.5  0.2236068

Notice that:

• the linear (.L), quadratic (.Q), cubic (.C) terms are coded based on orthogonal polynomials

• in other words, the coding actually regards CnRay as a continuous variable and treats the four levels in CnRay as equally-spaced levels

Now, let us fit a model to the data. Since the conditional likelihood of a logit model (see lecture notes p.3-37) for matched case-control data takes the same form as that used for the proportional hazards model (PHM) in survival analysis, we can use software developed for PHM to fit the logit model. In R, the clogit function in the survival library can do the work. Note that the matched sets must be designated by the strata function:

> library(survival)
> cmod <- clogit(disease ~ Sex+Mray+Fray+CnRay+strata(ID),ramlxray)
> summary(cmod)

Call:

coxph(formula = Surv(rep(1, 224), disease) ~ Sex + Mray + Fray +

CnRay + strata(ID), data = ramlxray, method = "exact")

n= 224

coef exp(coef) se(coef)      z      p

SexM     0.156      1.17    0.386  0.405 0.6900

Mrayyes  0.228      1.26    0.582  0.391 0.7000

Frayyes  0.693      2.00    0.351  1.974 0.0480

CnRay.L  1.941      6.96    0.621  3.127 0.0018

CnRay.Q -0.248      0.78    0.582 -0.426 0.6700

CnRay.C -0.580      0.56    0.591 -0.982 0.3300

exp(coef) exp(-coef) lower .95 upper .95

SexM         1.17      0.855     0.549      2.49

Mrayyes      1.26      0.796     0.401      3.93

Frayyes      2.00      0.500     1.005      3.98

CnRay.L      6.96      0.144     2.063     23.51

CnRay.Q      0.78      1.281     0.249      2.44

CnRay.C      0.56      1.786     0.176      1.78

Rsquare= 0.089   (max possible= 0.499 )

Likelihood ratio test= 20.9  on 6 df,   p=0.00192

Wald test            = 14.5  on 6 df,   p=0.0246

Score (logrank) test = 18.6  on 6 df,   p=0.0049

Notice that:

• the overall tests (test statistics=20.9/14.5/18.6) for the significance of at lease one of the predictors have small p-values, which indicates that at least some of the predictors are important

• the predictors Sex and Mray are not significant. There seems little point in investigating the other x-ray variables associated with the mother.

• an x-ray on the father, i.e., Fray, is marginally significant because its corresponding p-value=0.0480»0.05

• for Fraycoef=0.693 and exp(coef)=2.00. We can say that the father having had an x-ray increases the log-odds of the disease by 0.693 or doubles the odds of the disease

• the x-ray on the child, i.e., CnRay, has the clearest effect. Among its linear, quadratic, and cubic contrasts, only the linear effect is significant.

• the estimated value of the linear effect of CnRay is 1.941, which indicates that when CnRay is changed from levels 1 to 2 (or 2 to 3, or 3 to 4),

• the log-odds of the disease would be increased by 0.8680416=1.941*(0.6708204-0.2236068), where 0.6708204-0.2236068 is the difference of the linear codings of two adjacent levels of CnRay

• the odds of the disease would be increased by a factor of 2.382241=exp(1.941*(0.6708204-0.2236068))

Let us recursively eliminate predictors/effects which are not significant to simplify the model (intermediate steps skipped). Since we have found only a significant linear effect for CnRay, we can convert levels 1-4 in CnRay to the numerical values 1-4, respectively, using the unclass function (note: the former and the latter 1-4 have different meanings). The final model is:
> cmodr <- clogit(disease ~ Fray+unclass(CnRay)+strata(ID),ramlxray)
> summary(cmodr)

Call:

coxph(formula = Surv(rep(1, 224), disease) ~ Fray + unclass(CnRay) + strata(ID), data = ramlxray, method = "exact")

n= 224

coef exp(coef) se(coef)    z       p

Frayyes        0.670      1.96    0.344 1.95 0.05100

unclass(CnRay) 0.814      2.26    0.237 3.44 0.00058

exp(coef) exp(-coef) lower .95 upper .95

Frayyes             1.96      0.512     0.996      3.84

unclass(CnRay)      2.26      0.443     1.419      3.59

Rsquare= 0.084   (max possible= 0.499 )

Likelihood ratio test= 19.6  on 2 df,   p=5.69e-05

Wald test            = 14.1  on 2 df,   p=0.000859

Score (logrank) test = 17.6  on 2 df,   p=0.000154

Notice that:

• the odds of the disease increase by a factor of 2.26 as we move between adjacent categories of CnRay, which is about the same as the 2.382241 we got in the result of the previous fitted model

• the father's x-ray variable is now just insignificant in this model underlining its borderline status

• the conclusions and explanations we can draw from this model are similar to those from the previous one

• although we have found an effect due to x-rays of the child, we cannot conclude the effect is causal. After all, subjects only have x-rays when something is wrong, so it is quite possible that the x-rays are linked to some unknown lurking variables.

An incorrect analysis of this data ignores the matching structure and simply uses a binomial GLM:

> gmod <- glm(disease ~ Fray+unclass(CnRay),family=binomial,ramlxray)
> summary(gmod)

Call:

glm(formula = disease ~ Fray + unclass(CnRay), family = binomial, data = ramlxray)

Deviance Residuals:

Min      1Q  Median      3Q     Max

-1.950  -0.950  -0.950   1.204   1.423

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept)     -1.1623     0.3011  -3.861 0.000113 ***

Frayyes          0.5004     0.3078   1.626 0.104046

unclass(CnRay)   0.6005     0.1774   3.385 0.000711 ***

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 309.39  on 223  degrees of freedom

Residual deviance: 293.26  on 221  degrees of freedom

AIC: 299.26

Number of Fisher Scoring iterations: 4

We can see that the results are somewhat different.