Three-Way Tables (Reading: Faraway (2006), section 4.4)
Appleton et al. (1996) presented a data from a 20-year follow-up study on the effects of smoking. In the period of 1972-74, a large study, which also considered other issues, categorized women into smokers and nonsmokers and according to their age group. In the follow-up, the researchers recorded whether the subjects were dead or still alive (Q: what information about life status was lost in the dataset? Hint: the continuous variable, lifetime, was transformed into a categorical life variable). Only smokers or women who had never smoked are presented here. Relatively few smokers quit and these women have been excluded from the data. The cause of death is not reported here. Let us read the data into R:
femsmoke <- read.table("femsmoke.txt")
We can combine the data over age groups to produce a contingency table:
> (ct <- xtabs(y ~ smoker+dead,femsmoke))
Let us compute the proportions of dead and alive for smokers and nonsmokers:
We see that:
76.12% of smokers have survived for 20 years while only 68.58% of nonsmokers have survived
it seems suggest that smoking appears to have a beneficial effect on longevity
we can check the significance of the difference between smoker and nonsmoker groups:
The p-value of the Pearson X2 test for independence is small, so the difference cannot be reasonably ascribed to chance variation
However, if we consider the relationship within a given age group, say 55-64:
> (cta <- xtabs(y ~ smoker+dead,femsmoke, subset=(age=="55-64")))
We can similarly compute the proportions of dead and alive for smokers and nonsmokers within the age group:
We see that:
56.65% of the smokers have survived compared to 66.94% of the nonsmokers, which is a contradictive result to that found from the marginal table where we add over the age groups
the advantage to nonsmokers holds through out almost all the age groups (for 18-24 and 25-34 age groups, the proportions are about the same):
> prop.table(ftable(xtabs(y ~ age+smoker+dead,femsmoke)),1)
thus, the marginal association where we add over the age groups is different from the conditional association observed within age groups, Data where this effect is observed are an example of Simpson's paradox
by the way, it is interesting to note that the difference between smokers and nonsmokers in the 55-64 age group is not statistically significant in the Fisher's exact test (exercise: also check the results of deviance-based or Pearson X2 goodness-of-fit tests):
this is just a subset of the data
actually, none of the age groups are statistically significant in the Fisher's exact test (exercise: perform Fisher's exact test for other age groups)
because all the age groups showed a similar pattern, i.e., the advantage to nonsmokers, it would be better that we accumulate the information from all age groups to generate a more accurate conclusion
notice that when we incorporate the information from all age groups, the sample size is increased. For a larger sample size, it would be easier to get a significant result.
Q: how to combine information from all the 2´2 tables (each for a specific age group)? We will further discuss this issue later using three different approaches: a Poisson GLM approach, a binomial GLM approach, and a hypergeometric model.
Let us see why the Simpson's paradox occurs here by examining the proportions of smokers and nonsmokers within each age groups:
> prop.table(xtabs(y ~ smoker+age, femsmoke),2)
We see that:
smokers are more concentrated in the younger age groups and younger people are more likely to live for another 20 years
this explains why the marginal table gave an apparent advantage to smokers, which is in fact illusory
once we control for age, we see that smoking has a negative on longevity
Poisson Model - Mutual Independence.
Now, let us consider a GLM approach to investigating how the three factors interact. Because the main-effect-terms-only Poisson GLM corresponds to mutual independence, we can test for mutual independence by applying the Pearson's X2 goodness-of-fit test for this model:
> ct3 <- xtabs(y ~ smoker + dead + age, femsmoke)
We can also fit this Poisson GLM and use the deviance-based goodness-of-fit test:
> modi <- glm(y ~ smoker + dead + age, femsmoke, family=poisson)
although the test statistics for the two tests are somewhat different, in either case, we see a very large value (790.6 and 735.0, respectively) for the 19 degrees of freedom
we conclude that this model does not fit the data. Therefore, we reject the hypothesis of mutual independence.
We can show that
the coefficients of the main-effect-term-only model corresponds only to the marginal proportions. For
example, consider the smoker
> (coefsmoke <- exp(c(0,coef(modi))))
We see that these are just the marginal proportions for the smokers and nonsmokers in the data:
> prop.table(xtabs(y ~ smoker, femsmoke))
This serves to emphasize the point that the coefficients of the main effects of the Poisson model just convey information that we already know and is not the main interest of the study.
Poisson Model - Joint Independence.
A Poisson GLM with all main effect terms and just one 2-factor interaction term corresponds to joint independence. The specific interaction term tells us which pair of variables is dependent. For example, we can fit a model that says age is jointly independent of smoker and dead as follows:
> modj <- glm(y ~ smoker*dead + age, femsmoke, family=poisson)
although the deviance, 725.8, represents a small improvement over the mutual independence model (with a deviance 735.0), it is still very high for the degrees of freedom 18. It is clear that the model does not fit the data. We reject the hypothesis that age is jointly independent of smoker and dead.
there are other two joint independence models:
> modj1 <- glm(y ~ smoker*age + dead, femsmoke, family=poisson)
> c(deviance(modj1), df.residual(modj1))
> modj2 <- glm(y ~ dead*age + smoker, femsmoke, family=poisson)
> c(deviance(modj2), df.residual(modj2))
Both models also fit badly.
Poisson Model - Conditional Independence.
A Poisson GLM with all main effect terms and two 2-factor interaction terms corresponds to conditional independence. The nature of the conditional independence can be determined by observing which one of the three possible 2-factor interactions does not appear in the model. The most plausible conditional independence model for our data is (exercise: check the fitting of the other two conditional independence models):
> modc <- glm(y ~ smoker*age + age*dead, femsmoke, family=poisson)
> c(deviance(modc), df.residual(modc))
the deviance, 8.326939, is only slightly larger than the degrees of freedom 7, which indicates a fairly good fit
it suggests that smoking (smoker) is independent of life status (dead) given age (age) (Q: can it imply that smoking is independent of life status?)
however, bear in mind that we do have some zeros and other small numbers in the table, so there is some doubt as to the accuracy of the chi-square approximation here
although we put smoker*age and age*dead in the model, R automatically removed one of the repetitive main effect term age
Poisson Model - Uniform Association.
A Poisson GLM with all main effect terms and all three 2-factor interaction terms corresponds to uniform association:
> modu <- glm(y ~ (smoker+age+dead)^2, femsmoke, family=poisson)
We can see that
this model has 6 degrees of freedom (the same d.f. for 3-factor interaction term) left for goodness-of-fit test
the deviance, 2.380927, is small on 6 degrees of freedom
the uniform association model fit the data very well
To understand the meaning of uniform association, let us compute the fitted values to determine the fitted odds ratios for each age group (i.e., the odds ratios based on the uniform association model, not the sample odds ratios):
ctf <- xtabs(fitted(modu) ~ smoker+dead+age,femsmoke)
> apply(ctf, 3, function(x) (x[1,1]*x[2,2])/(x[1,2]*x[2,1]))
We see that:
these fitted odds ratio is the same for every age group (note that odds ratio is a measure of association for 2´2 tables)
thus, the uniform association model asserts that for every level of one variable, we have the same association for the other two variables
the information may also be extracted from the coefficients of the fit. Consider the fitted odds ratio for smoker and dead for a given age group. It is precisely the exponential of the coefficient for the interaction term of smoker and dead:
We see that this exactly the fitted odds ratio we found above. The other interaction terms can be interpreted similarly.
Analysis Strategy for
3-Way Table using a GLM approach.
Because of the hierarchical structure in GLM, it makes sense to start with the most complex model and see how far it can be reduced. We can use difference-in-deviance tests to do the job. We now start with the saturated model:
> modsat <- glm(y ~ smoker*age*dead, femsmoke, family=poisson)
We see that the 3-factor interaction terms may be dropped. Now, we consider dropping the 2-factor interaction terms:
> drop1(modu, test="Chi")
We see that:
two of the interactions are strongly significant, but smoker:dead is only just statistically significant (note that if smoker:dead is removed, the resulting model is the modc above, which passes the goodness-of-fit test)
whether the 2-factor interaction term smoker:dead should be removed corresponds to the test for conditional independence of smoking and life status given age group
because smoker:dead is also significant, it is suggested that the uniform association model (modu) is more appropriate for the data than the conditional independence model (modc) (c.f. the difference between forward and backward selections. For this data, the former ends up with a conditional independence model while the latter a uniform association model)
For some three-way table, it is reasonable to regard one variable as the response and the other two as predictors. In this example, we could view dead as the response and smoker and age as the predictors. Since the variable dead has only two levels, we can model it using a binomial GLM. For the response variable with more than two levels, a multinomial model would be required.
We construct a binomial response matrix whose 1st column is the number of "successes" and 2nd column the number of "failures" and fit a model containing only the intercept term:
> ybin <- matrix(femsmoke$y,ncol=2)
> modbinull <- glm(ybin ~ 1, femsmoke[1:14,], family=binomial)
this binomial model is associated with the Poisson GLM containing all main effect terms and the 2-factor interaction term smoker:age (i.e., modj1 above)
the two models, modbinull and modj1, have the same deviance but different degrees of freedom (Q: why are the d.f.'s different?)
in other words, the binomial model without any predictors implicitly assumes an association between smoker and age
in any of the following binomial models, the association between smoker and age is also implicitly assumed
the fitted cell probabilities of the two models, modbinull and modj1, also have the same pattern:
> prop.table(ftable(xtabs(fitted(modj1) ~ age+smoker+dead, femsmoke)), 1)
> ftable(xtabs(c(fitted(modbinull), 1-fitted(modbinull)) ~ age + smoker + dead, femsmoke))
the Poisson model can predict the marginal proportions of smoker and age while the binomial model cannot
it is because the marginal totals of smoker and age are regarded as fixed in the binomial models and random in the Poisson models.
To investigate what binomial model is appropriate for the data, let us start with the most complex model and see how far it can be reduced. The saturated binomial model is:
> modbin <- glm(ybin ~ smoker*age, femsmoke[1:14,], family=binomial)
We see that the interaction term can be dropped. Now, we check if we may drop further terms:
> modbinr <- glm(ybin ~ smoker+age, femsmoke[1:14,], family=binomial)
We see that both main effect terms are significant, so no further simplification is possible.
The binomial model with both main effect terms and no interaction term is effectively equivalent to the uniform association model in the Poisson approach (i.e., the Poisson GLM with all main effect terms and all 2-factor interaction terms, modu, above):
We can see that the deviances of the two models are identical. We can also extract the same fitted odds ratio (as in the Poisson model modu) from the estimated coefficient in the binomial model modbinr:
Let us check which Poisson GLM is associated with a binomial GLM containing only one main effect term:
> modbina <- glm(ybin ~ age, femsmoke[1:14,], family=binomial)
We can see:
the deviance of this binomial GLM, 8.326939, is identical with that of the conditional independence Poisson model above, i.e., modc
a binomial GLM with only one main effect term is corresponding to the Poisson GLM with all main effect terms and two 2-factor interaction terms. The specific interaction term not included in the Poisson model tells us which main effect term is not contained in the binomial model.
From the discussion above, we know that a binomial GLM can be identified with a corresponding Poisson GLM and the numbers we will obtain will be identical.
A 2´2´K table can be decomposed into K 2´2 tables (which can be viewed as conditioned on age for this data). When the row and column marginal totals of every 2´2 tables are treated as fixed, we can use a hypergeometric distribution to model the data.
Recall that as shown above, the dependence between smoker and dead in every age groups are not statistically significant. Suppose that we compute the sample odds ratios in all the age groups:
> ct3 <- xtabs(y ~ smoker+dead+age,femsmoke)
> apply(ct3, 3, function(x) (x[1,1]*x[2,2])/(x[1,2]*x[2,1]))
We see that:
there is some variation in the sample odds ratios, but they are all greater than one with the exception of the 25-34 age group
we could test for independence in each 2´2 table, but it is better to use a combined test as explained before
The Mantel-Haenszel test is designed to test independence in the 2´2 tables across K strata:
specifically, it tests H0: D1=D2=...=DK=1 (conditional independence hypothesis in the Poisson approach) against H1: D1=D2=...=DK¹1 (uniform association hypothesis in the Poisson approach), where Dk is the true odds ratio of the 2´2 table of the kth stratum
it only make sense to use the Mantel-Haenszel test if the association is similar in each stratum, i.e., D1=D2=...=DK has been accepted (a formal test called Breslow-Day test can be used to test the homogeneity of the K odds ratios)
for this data, the sample odds ratios do not vary greatly, so the use of the test is justified:
we used the exact method (to use exact distribution, the option "exact=TRUE" must be added in the command) in preference to the c21 approximation for the null distribution
the p-value 0.01591 is small, which indicates that a statistically significant association is revealed once we combine the information across strata
the common odds ratio is estimated as 1.530256, with the confidence interval [1.068889, 2.203415]
it tests the same hypotheses as the test in the command "drop1(modu, test="Chi")" above. For this data, we can see that the p-value of the Mantel-Haenszel test is very similar to that for the term smoker:dead (0.01591 and 0.01475, respectively).