Three-Way Tables (Reading: Faraway (2006), section 4.4)


Appleton et al. (1996) presented data from a 20-year follow-up study on the effects of smoking. In 1972-74, a large study, which also considered other issues, categorized women as smokers or nonsmokers and by age group. At follow-up, the researchers recorded whether each subject was dead or still alive (Q: what information about life status was lost in the dataset? Hint: the continuous variable, lifetime, was transformed into a categorical life variable). Only women who smoked or had never smoked are included here; the relatively few smokers who quit have been excluded from the data, and the cause of death is not reported. Let us read the data into R:

> femsmoke

   y smoker dead   age
1  2    yes  yes 18-24
2  1     no  yes 18-24
3  3    yes  yes 25-34
...
28 0     no   no   75+


We can combine the data over age groups to produce a contingency table:

> (ct <- xtabs(y ~ smoker+dead,femsmoke))

      dead
smoker  no yes
   no  502 230
   yes 443 139

Let us compute the proportions of dead and alive for smokers and nonsmokers:

> prop.table(ct,1)

      dead
smoker        no       yes
   no  0.6857923 0.3142077
   yes 0.7611684 0.2388316

We see that:

• 76.12% of smokers have survived for 20 years while only 68.58% of nonsmokers have survived

• this seems to suggest that smoking has a beneficial effect on longevity

• we can check the significance of the difference between smoker and nonsmoker groups:

> summary(ct)

Call: xtabs(formula = y ~ smoker + dead, data = femsmoke)

Number of cases in table: 1314

Number of factors: 2

Test for independence of all factors:

Chisq = 9.121, df = 1, p-value = 0.002527

The p-value of the Pearson X² test for independence is small, so the difference cannot reasonably be ascribed to chance variation
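As a quick cross-check outside R, the Pearson X² statistic for the marginal table can be reproduced from first principles; the short Python sketch below (illustrative only, not part of the original analysis) recovers both the statistic and its p-value:

```python
import math

# Pearson X2 for the marginal 2x2 table (rows: smoker no/yes,
# columns: dead no/yes), computed from first principles.
obs = [[502, 230], [443, 139]]
n = sum(map(sum, obs))
rows = [sum(r) for r in obs]
cols = [sum(c) for c in zip(*obs)]
x2 = sum((obs[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
         for i in range(2) for j in range(2))
# For 1 d.f., the chi-square tail is a normal tail: P(X2 > x) = erfc(sqrt(x/2)).
p = math.erfc(math.sqrt(x2 / 2))
print(round(x2, 3), round(p, 4))  # ≈ 9.121 and ≈ 0.0025, as in summary(ct)
```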


However, if we consider the relationship within a given age group, say 55-64:

> (cta <- xtabs(y ~ smoker+dead,femsmoke, subset=(age=="55-64")))

      dead
smoker no yes
   no  81  40
   yes 64  51

We can similarly compute the proportions of dead and alive for smokers and nonsmokers within the age group:

> prop.table(cta,1)

      dead
smoker        no       yes
   no  0.6694215 0.3305785
   yes 0.5565217 0.4434783

We see that:

• 55.65% of the smokers have survived compared to 66.94% of the nonsmokers, which contradicts the result found from the marginal table where we sum over the age groups

• the advantage to nonsmokers holds throughout almost all the age groups (for the 18-24 and 25-34 age groups, the proportions are about the same):

             dead         no         yes
age   smoker
18-24 no          0.98387097 0.01612903
      yes         0.96363636 0.03636364
25-34 no          0.96815287 0.03184713
      yes         0.97580645 0.02419355
35-44 no          0.94214876 0.05785124
      yes         0.87155963 0.12844037
45-54 no          0.84615385 0.15384615
      yes         0.79230769 0.20769231
55-64 no          0.66942149 0.33057851
      yes         0.55652174 0.44347826
65-74 no          0.21705426 0.78294574
      yes         0.19444444 0.80555556
75+   no          0.00000000 1.00000000
      yes         0.00000000 1.00000000

• thus, the marginal association, where we sum over the age groups, differs from the conditional association observed within age groups. Data exhibiting this effect are an example of Simpson's paradox

• by the way, it is interesting to note that the difference between smokers and nonsmokers in the 55-64 age group is not statistically significant under Fisher's exact test (exercise: also check the results of the deviance-based and Pearson X² goodness-of-fit tests):

> fisher.test(cta)

Fisher's Exact Test for Count Data

data:  cta

p-value = 0.08304

alternative hypothesis: true odds ratio is not equal to 1

95 percent confidence interval:

0.9203097 2.8334038

sample estimates:

odds ratio

1.610329

however,

• this is just a subset of the data

• in fact, none of the age groups shows a statistically significant association under Fisher's exact test (exercise: perform Fisher's exact test for the other age groups)

• because all the age groups show a similar pattern, i.e., an advantage to nonsmokers, it is better to pool the information from all age groups to reach a more reliable conclusion

• notice that when we combine the information from all age groups, the sample size increases, and with a larger sample size it is easier to detect a significant effect

• Q: how can we combine information from all the 2×2 tables (one for each age group)? We will discuss this issue later using three different approaches: a Poisson GLM approach, a binomial GLM approach, and a hypergeometric model.
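The reversal itself needs nothing more than the cross-product ratios of the two tables shown above. A small Python sketch (illustrative only), using the orientation (n11·n22)/(n12·n21) with rows smoker no/yes and columns dead no/yes:

```python
# Cross-product (odds) ratios for the pooled table and for one age stratum.
def odds_ratio(t):
    # rows: smoker no/yes; columns: dead no/yes
    return (t[0][0] * t[1][1]) / (t[0][1] * t[1][0])

marginal = [[502, 230], [443, 139]]   # summed over age groups
age_55_64 = [[81, 40], [64, 51]]      # the 55-64 stratum alone

print(round(odds_ratio(marginal), 4))   # ≈ 0.6848: pooled table favors smokers
print(round(odds_ratio(age_55_64), 4))  # ≈ 1.6137: the stratum favors nonsmokers
```

The pooled odds ratio sits below 1 while the within-stratum odds ratio sits above 1: Simpson's paradox in two lines.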


Let us see why Simpson's paradox occurs here by examining the proportions of smokers and nonsmokers within each age group:

> prop.table(xtabs(y ~ smoker+age, femsmoke),2)

      age
smoker     18-24     25-34     35-44     45-54     55-64     65-74       75+
   no  0.5299145 0.5587189 0.5260870 0.3750000 0.5127119 0.7818182 0.8311688
   yes 0.4700855 0.4412811 0.4739130 0.6250000 0.4872881 0.2181818 0.1688312

We see that:

• smokers are more concentrated in the younger age groups and younger people are more likely to live for another 20 years

• this explains why the marginal table gave an apparent advantage to smokers, which is in fact illusory

• once we control for age, we see that smoking has a negative effect on longevity

Poisson Model - Mutual Independence.
Now, let us consider a GLM approach to investigating how the three factors interact. Because the main-effects-only Poisson GLM corresponds to mutual independence, we can test for mutual independence by applying Pearson's X² goodness-of-fit test to this model:

> ct3 <- xtabs(y ~ smoker + dead + age, femsmoke)

> summary(ct3)

Call: xtabs(formula = y ~ smoker + dead + age, data = femsmoke)

Number of cases in table: 1314

Number of factors: 3

Test for independence of all factors:

Chisq = 790.6, df = 19, p-value = 2.140e-155

We can also fit this Poisson GLM and use the deviance-based goodness-of-fit test:

> modi <- glm(y ~ smoker + dead + age, femsmoke, family=poisson)

> summary(modi)

Call:

glm(formula = y ~ smoker + dead + age, family = poisson, data = femsmoke)

Deviance Residuals:

Min       1Q   Median       3Q      Max

-7.9306  -5.3175  -0.5514   2.4229  11.1895

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept)  3.84748    0.09721  39.580  < 2e-16 ***

smokeryes   -0.22931    0.05554  -4.129 3.64e-05 ***

deadyes     -0.94039    0.06139 -15.319  < 2e-16 ***

age25-34     0.87618    0.11003   7.963 1.67e-15 ***

age35-44     0.67591    0.11356   5.952 2.65e-09 ***

age45-54     0.57536    0.11556   4.979 6.40e-07 ***

age55-64     0.70166    0.11307   6.206 5.45e-10 ***

age65-74     0.34377    0.12086   2.844  0.00445 **

age75+      -0.41837    0.14674  -2.851  0.00436 **

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 1193.9  on 27  degrees of freedom

Residual deviance:  735.0  on 19  degrees of freedom

AIC: 887.2

Number of Fisher Scoring iterations: 6

Notice that:

• although the test statistics of the two tests differ somewhat, both are very large (790.6 and 735.0, respectively) relative to the 19 degrees of freedom

• we conclude that this model does not fit the data. Therefore, we reject the hypothesis of mutual independence.
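The 19 degrees of freedom follow from simple bookkeeping: 28 cells, minus one parameter for the intercept and one for each non-reference level of each factor. A one-line check (Python, illustrative):

```python
# Residual d.f. for the mutual-independence Poisson model on the
# 2 x 2 x 7 table: 28 cells minus 1 (intercept) + 1 (smoker) + 1 (dead)
# + 6 (age) estimated parameters.
cells = 2 * 2 * 7
params = 1 + (2 - 1) + (2 - 1) + (7 - 1)
print(cells - params)  # 19, matching the output above
```

The same bookkeeping yields the residual degrees of freedom of every model fitted below.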


We can show that the coefficients of the main-effects-only model correspond simply to the marginal proportions. For example, consider the smoker factor:
> coef(modi)[2]

smokeryes
-0.2293101

> (coefsmoke <- exp(c(0,coef(modi)[2])))

smokeryes

1.000000  0.795082

> coefsmoke/sum(coefsmoke)

smokeryes

0.5570776 0.4429224

We see that these are just the marginal proportions for the smokers and nonsmokers in the data:

> prop.table(xtabs(y ~ smoker, femsmoke))

smoker

no       yes

0.5570776 0.4429224

This serves to emphasize that the main-effect coefficients of the Poisson model merely convey information we already know and are not of primary interest in the study.
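The arithmetic behind this correspondence is easy to replicate outside R. The Python sketch below (illustrative; the coefficient value is copied from summary(modi)) exponentiates and normalizes:

```python
import math

# Exponentiate-and-normalize the smoker main effect; the reference level
# ("no") contributes exp(0) = 1.
coef_smokeryes = -0.2293101            # from summary(modi)
w = [1.0, math.exp(coef_smokeryes)]
props = [x / sum(w) for x in w]
print([round(x, 6) for x in props])    # ≈ [0.557078, 0.442922]
print(round(582 / 1314, 6))            # marginal smoker proportion, 582 = 443 + 139
```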


Poisson Model - Joint Independence.

A Poisson GLM with all main effect terms and just one 2-factor interaction term corresponds to joint independence. The specific interaction term tells us which pair of variables is dependent. For example, we can fit a model that says age is jointly independent of smoker and dead as follows:

> modj <- glm(y ~ smoker*dead + age, femsmoke, family=poisson)

> summary(modj)

Call:

glm(formula = y ~ smoker * dead + age, family = poisson, data = femsmoke)

Deviance Residuals:

Min       1Q   Median       3Q      Max

-8.2606  -5.1564  -0.5933   2.5373  10.4236

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept)        3.79994    0.09888  38.428  < 2e-16 ***

smokeryes         -0.12503    0.06519  -1.918  0.05511 .

deadyes           -0.78052    0.07962  -9.803  < 2e-16 ***

age25-34           0.87618    0.11003   7.963 1.67e-15 ***

age35-44           0.67591    0.11356   5.952 2.65e-09 ***

age45-54           0.57536    0.11556   4.979 6.40e-07 ***

age55-64           0.70166    0.11307   6.206 5.45e-10 ***

age65-74           0.34377    0.12086   2.844  0.00445 **

age75+            -0.41837    0.14674  -2.851  0.00436 **

smokeryes:deadyes -0.37858    0.12566  -3.013  0.00259 **

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 1193.9  on 27  degrees of freedom

Residual deviance:  725.8  on 18  degrees of freedom

AIC: 880

Number of Fisher Scoring iterations: 6

Notice that:

• although the deviance, 725.8, represents a small improvement over the mutual independence model (deviance 735.0), it is still very high for 18 degrees of freedom. Clearly the model does not fit the data; we reject the hypothesis that age is jointly independent of smoker and dead.

• there are two other joint independence models:

> modj1 <- glm(y ~ smoker*age + dead, femsmoke, family=poisson)

> c(deviance(modj1), df.residual(modj1))

[1] 641.4963 13.0000

> modj2 <- glm(y ~ dead*age + smoker, femsmoke, family=poisson)

> c(deviance(modj2), df.residual(modj2))

[1] 101.8334 13.0000
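The deviance drop from 735.0 (mutual independence) to 725.8 (adding smoker:dead) is itself a recognizable quantity: it equals the deviance (G²) statistic for independence in the marginal smoker-by-dead table. A Python cross-check from the marginal counts (illustrative only):

```python
import math

# G2 (deviance) statistic for independence in the marginal smoker-by-dead
# table: twice the sum of O * log(O/E) over the four cells.
obs = [[502, 230], [443, 139]]
n = sum(map(sum, obs))
rows = [sum(r) for r in obs]
cols = [sum(c) for c in zip(*obs)]
g2 = 2 * sum(o * math.log(o / (rows[i] * cols[j] / n))
             for i, r in enumerate(obs) for j, o in enumerate(r))
print(round(g2, 1))  # 9.2, exactly the deviance drop 735.0 - 725.8
```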


Poisson Model - Conditional Independence.

A Poisson GLM with all main effect terms and two 2-factor interaction terms corresponds to conditional independence. The nature of the conditional independence can be determined by observing which one of the three possible 2-factor interactions does not appear in the model. The most plausible conditional independence model for our data is (exercise: check the fitting of the other two conditional independence models):

> modc <- glm(y ~ smoker*age + age*dead, femsmoke, family=poisson)

> c(deviance(modc), df.residual(modc))

[1] 8.326939 7.000000

Notice that:

• the deviance, 8.326939, is only slightly larger than the degrees of freedom 7, which indicates a fairly good fit

• it suggests that smoking (smoker) is independent of life status (dead) given age (age) (Q: can it imply that smoking is independent of life status?)

• however, bear in mind that we do have some zeros and other small numbers in the table, so there is some doubt as to the accuracy of the chi-square approximation here

• although we put smoker*age and age*dead in the model, R automatically includes the duplicated main effect term age only once
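The goodness-of-fit p-value for deviance 8.326939 on 7 d.f. can be checked without R. The sketch below (Python, illustrative; not the code used in the notes) implements the chi-square survival function for integer d.f. via the standard recurrence Q(x; k+2) = Q(x; k) + (x/2)^(k/2) e^(-x/2) / Γ(k/2 + 1):

```python
import math

def chi2_sf(x, df):
    """Chi-square survival function P(X > x) for integer df."""
    if df % 2 == 0:
        q, k = math.exp(-x / 2), 2            # Q(x; 2) = exp(-x/2)
    else:
        q, k = math.erfc(math.sqrt(x / 2)), 1  # Q(x; 1) via the normal tail
    while k < df:
        q += (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
        k += 2
    return q

# Deviance 8.326939 on 7 d.f.: p-value well above 0.05, so the
# conditional-independence model is not rejected.
p_gof = chi2_sf(8.326939, 7)
print(round(p_gof, 2))  # ≈ 0.30
```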


Poisson Model - Uniform Association.

A Poisson GLM with all main effect terms and all three 2-factor interaction terms corresponds to uniform association:

> modu <- glm(y ~ (smoker+age+dead)^2, femsmoke, family=poisson)

> summary(modu)

Call:

glm(formula = y ~ (smoker + age + dead)^2, family = poisson, data = femsmoke)

Deviance Residuals:

Min          1Q      Median          3Q         Max

-7.001e-01  -1.100e-01  -1.661e-05   1.225e-01   6.727e-01

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept)           4.10629    0.12759  32.184  < 2e-16 ***

smokeryes            -0.13074    0.18539  -0.705 0.480667

age25-34              0.92648    0.15071   6.147 7.87e-10 ***

age35-44              0.61204    0.15767   3.882 0.000104 ***

age45-54              0.08972    0.17335   0.518 0.604774

age55-64              0.27942    0.16550   1.688 0.091355 .

age65-74             -0.73124    0.21519  -3.398 0.000678 ***

age75+              -23.40456 8839.01148  -0.003 0.997887

deadyes              -3.86011    0.59389  -6.500 8.05e-11 ***

smokeryes:age25-34   -0.11752    0.22091  -0.532 0.594749

smokeryes:age35-44   -0.01268    0.22800  -0.056 0.955654

smokeryes:age45-54    0.56538    0.23585   2.397 0.016522 *

smokeryes:age55-64   -0.08512    0.23573  -0.361 0.718030

smokeryes:age65-74   -1.49088    0.30039  -4.963 6.93e-07 ***

smokeryes:age75+     -1.89060    0.39582  -4.776 1.78e-06 ***

smokeryes:deadyes     0.42741    0.17703   2.414 0.015762 *

age35-44:deadyes      1.34112    0.62857   2.134 0.032874 *

age45-54:deadyes      2.11336    0.61210   3.453 0.000555 ***

age55-64:deadyes      3.18077    0.60057   5.296 1.18e-07 ***

age65-74:deadyes      5.08798    0.61951   8.213  < 2e-16 ***

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 1193.9378  on 27  degrees of freedom

Residual deviance:    2.3809  on  6  degrees of freedom

AIC: 180.58

Number of Fisher Scoring iterations: 18

We can see that:

• this model has 6 degrees of freedom left for the goodness-of-fit test (the same as the d.f. of the 3-factor interaction term)

• the deviance, 2.380927, is small for 6 degrees of freedom

• the uniform association model fits the data very well


To understand the meaning of uniform association, let us compute the fitted values to determine the fitted odds ratios for each age group (i.e., the odds ratios based on the uniform association model, not the sample odds ratios):

> ctf <- xtabs(fitted(modu) ~ smoker+dead+age,femsmoke)
> apply(ctf, 3, function(x) (x[1,1]*x[2,2])/(x[1,2]*x[2,1]))

18-24    25-34    35-44    45-54    55-64    65-74      75+

1.533275 1.533275 1.533275 1.533275 1.533275 1.533275 1.533275

We see that:

• the fitted odds ratio is the same for every age group (note that the odds ratio is a measure of association for 2×2 tables)

• thus, the uniform association model asserts that for every level of one variable, we have the same association for the other two variables

• the information may also be extracted from the coefficients of the fit. Consider the fitted odds ratio for smoker and dead within a given age group; it is precisely the exponential of the coefficient of the smoker:dead interaction term:

> exp(coef(modu)["smokeryes:deadyes"])

smokeryes:deadyes
1.533275

We see that this is exactly the fitted odds ratio found above. The other interaction terms can be interpreted similarly.
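To see why the fitted odds ratio cannot vary with age under this model, note that in the log-linear predictor every term except the smoker:dead interaction cancels out of the 2×2 cross-product ratio. A small Python sketch illustrates the cancellation; the age-related effects below are made-up values, and only 0.42741 is taken from summary(modu):

```python
import math

# Fitted odds ratio implied by a log-linear predictor with indicator
# codings s, d in {0, 1}; the age terms cancel in the cross-product ratio.
def fitted_or(intercept, smoker, dead, age_eff, sm_age, age_dead, sm_dead):
    mu = {(s, d): math.exp(intercept + s * smoker + d * dead + age_eff
                           + s * sm_age + d * age_dead + s * d * sm_dead)
          for s in (0, 1) for d in (0, 1)}
    return (mu[0, 0] * mu[1, 1]) / (mu[0, 1] * mu[1, 0])

# Whatever the (hypothetical) age effects, the result is exp(sm_dead):
for age_eff, sm_age, age_dead in [(0.0, 0.0, 0.0), (0.9, -1.5, 3.2)]:
    print(fitted_or(4.106, -0.131, -3.860, age_eff, sm_age, age_dead, 0.42741))
# both lines print exp(0.42741) ≈ 1.5333
```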


Analysis Strategy for a 3-Way Table Using a GLM Approach.
Because of the hierarchical structure of GLMs, it makes sense to start with the most complex model and see how far it can be reduced, using difference-in-deviance tests. We now start with the saturated model:

> modsat <- glm(y ~ smoker*age*dead, femsmoke, family=poisson)
> drop1(modsat,test="Chi")

Single term deletions

Model:

y ~ smoker * age * dead

Df  Deviance     AIC     LRT Pr(Chi)

<none>             3.033e-10 190.195

smoker:age:dead  6     2.381 180.576   2.381  0.8815

We see that the 3-factor interaction terms may be dropped. Now, we consider dropping the 2-factor interaction terms:

> drop1(modu, test="Chi")

Single term deletions

Model:

y ~ (smoker + age + dead)^2

Df Deviance    AIC    LRT Pr(Chi)

<none>             2.38 180.58

smoker:age   6    92.63 258.83  90.25 < 2e-16 ***

smoker:dead  1     8.33 184.52   5.95 0.01475 *

age:dead     6   632.30 798.49 629.92 < 2e-16 ***

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that:

• two of the interactions are strongly significant, but smoker:dead is only just statistically significant (note that if smoker:dead is removed, the resulting model is the modc above, which passes the goodness-of-fit test)

• whether the 2-factor interaction term smoker:dead should be removed corresponds to the test for conditional independence of smoking and life status given age group

• because smoker:dead is also significant, the uniform association model (modu) appears more appropriate for the data than the conditional independence model (modc) (cf. the difference between forward and backward selection: for these data, the former ends with a conditional independence model while the latter ends with a uniform association model)
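The smoker:dead line in this drop1() output is a 1-d.f. difference-in-deviance (likelihood-ratio) test between modu and modc, and its p-value can be checked by hand. A Python sketch (illustrative; the deviances are copied from the fits above):

```python
import math

# LRT for smoker:dead: deviance(modc) - deviance(modu), referred to
# a chi-square distribution with 1 d.f.
lrt = 8.326939 - 2.380927              # = 5.946012
p = math.erfc(math.sqrt(lrt / 2))      # chi-square(1) tail probability
print(round(lrt, 2), round(p, 5))      # ≈ 5.95, p ≈ 0.0147, as in drop1()
```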


Binomial Model.

For some three-way tables, it is reasonable to regard one variable as the response and the other two as predictors. In this example, we view dead as the response and smoker and age as the predictors. Since dead has only two levels, we can model it with a binomial GLM; a response with more than two levels would require a multinomial model.


We construct a binomial response matrix whose 1st column holds the numbers of "successes" (here, the dead=yes counts, femsmoke rows 1-14) and whose 2nd column the numbers of "failures", and fit a model containing only the intercept term:

> ybin <- matrix(femsmoke$y,ncol=2)

> modbinull <- glm(ybin ~ 1, femsmoke[1:14,], family=binomial)

> summary(modbinull)

Call:

glm(formula = ybin ~ 1, family = binomial, data = femsmoke[1:14,])

Deviance Residuals:

Min      1Q  Median      3Q     Max

-8.283  -5.472  -2.281   5.238  12.750

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -0.94039    0.06139  -15.32   <2e-16 ***

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 641.5  on 13  degrees of freedom

Residual deviance: 641.5  on 13  degrees of freedom

AIC: 690.49

Number of Fisher Scoring iterations: 4

Notice that:

• this binomial model is associated with the Poisson GLM containing all main effect terms and the 2-factor interaction term smoker:age (i.e., modj1 above)

• the two models, modbinull and modj1, have the same deviance but different degrees of freedom (Q: why are the d.f.'s different?)

• in other words, the binomial model without any predictors implicitly assumes an association between smoker and age

• in any of the following binomial models, the association between smoker and age is also implicitly assumed

• the fitted cell probabilities of the two models, modbinull and modj1, also have the same pattern:

> prop.table(ftable(xtabs(fitted(modj1) ~ age+smoker+dead, femsmoke)), 1)

             dead        no        yes
age   smoker

18-24 no          0.7191781 0.2808219

yes         0.7191781 0.2808219

25-34 no          0.7191781 0.2808219

yes         0.7191781 0.2808219

35-44 no          0.7191781 0.2808219

yes         0.7191781 0.2808219

45-54 no          0.7191781 0.2808219

yes         0.7191781 0.2808219

55-64 no          0.7191781 0.2808219

yes         0.7191781 0.2808219

65-74 no          0.7191781 0.2808219

yes         0.7191781 0.2808219

75+   no          0.7191781 0.2808219

yes         0.7191781 0.2808219

> ftable(xtabs(c(fitted(modbinull), 1-fitted(modbinull)) ~ age + smoker + dead, femsmoke))

             dead        no        yes
age   smoker

18-24 no          0.7191781 0.2808219

yes         0.7191781 0.2808219

25-34 no          0.7191781 0.2808219

yes         0.7191781 0.2808219

35-44 no          0.7191781 0.2808219

yes         0.7191781 0.2808219

45-54 no          0.7191781 0.2808219

yes         0.7191781 0.2808219

55-64 no          0.7191781 0.2808219

yes         0.7191781 0.2808219

65-74 no          0.7191781 0.2808219

yes         0.7191781 0.2808219

75+   no          0.7191781 0.2808219

yes         0.7191781 0.2808219

However,

• the Poisson model can predict the marginal proportions of smoker and age while the binomial model cannot

• this is because the marginal totals of smoker and age are regarded as fixed in the binomial models but random in the Poisson models.
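The equivalence is also visible in the intercept itself: for the null binomial model it is simply the log-odds of death in the pooled data. A quick Python check (illustrative; 369 dead and 945 alive are the marginal totals from the table ct):

```python
import math

# Null binomial intercept = log-odds of death over all 1314 women.
dead, alive = 369, 945
print(round(math.log(dead / alive), 5))  # -0.94039, the reported (Intercept)
print(round(alive / (dead + alive), 7))  # 0.7191781, the constant fitted prob.
```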


To investigate what binomial model is appropriate for the data, let us start with the most complex model and see how far it can be reduced. The saturated binomial model is:

> modbin <- glm(ybin ~ smoker*age, femsmoke[1:14,], family=binomial)

> drop1(modbin,test="Chi")

Single term deletions

Model:

ybin ~ smoker * age

Df Deviance    AIC    LRT Pr(Chi)

<none>        5.27e-10 74.996

smoker:age  6    2.381 65.377  2.381  0.8815

We see that the interaction term can be dropped. Now, we check if we may drop further terms:

> modbinr <- glm(ybin ~ smoker+age, femsmoke[1:14,], family=binomial)

> drop1(modbinr,test="Chi")

Single term deletions

Model:

ybin ~ smoker + age

Df Deviance    AIC    LRT Pr(Chi)

<none>        2.38  65.38

smoker  1     8.33  69.32   5.95 0.01475 *

age     6   632.30 683.29 629.92 < 2e-16 ***

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that both main effect terms are significant, so no further simplification is possible.


The binomial model with both main effect terms and no interaction term is effectively equivalent to the uniform association model in the Poisson approach (i.e., the Poisson GLM with all main effect terms and all 2-factor interaction terms, modu, above):

> deviance(modu)

[1] 2.380927

> deviance(modbinr)

[1] 2.380927

We can see that the deviances of the two models are identical. We can also extract the same fitted odds ratio (as in the Poisson model modu) from the estimated coefficient in the binomial model modbinr:

> exp(coef(modbinr)[2])

smokeryes
1.533275


Let us check which Poisson GLM is associated with a binomial GLM containing only one main effect term:

> modbina <- glm(ybin ~ age, femsmoke[1:14,], family=binomial)

> deviance(modbina)

[1] 8.326939

We can see:

• the deviance of this binomial GLM, 8.326939, is identical with that of the conditional independence Poisson model above, i.e., modc

• a binomial GLM with only one main effect term corresponds to a Poisson GLM with all main effect terms and two 2-factor interaction terms. The interaction term absent from the Poisson model (here smoker:dead) identifies the main effect term absent from the binomial model (here smoker).


From the discussion above, we see that each binomial GLM can be identified with a corresponding Poisson GLM, and the numbers obtained are identical.


Hypergeometric Model.

A 2×2×K table can be decomposed into K 2×2 tables (for these data, conditioned on age). When the row and column marginal totals of every 2×2 table are treated as fixed, we can use a hypergeometric distribution to model the data.


Recall that, as shown above, the dependence between smoker and dead is not statistically significant in any single age group. Suppose we compute the sample odds ratios in all the age groups:

> ct3 <- xtabs(y ~ smoker+dead+age,femsmoke)
> apply(ct3, 3, function(x) (x[1,1]*x[2,2])/(x[1,2]*x[2,1]))

18-24    25-34    35-44    45-54    55-64    65-74      75+

2.301887 0.753719 2.400000 1.441748 1.613672 1.148515      NaN

We see that:

• there is some variation in the sample odds ratios, but all are greater than one except for the 25-34 age group (the 75+ odds ratio is undefined because no one in that group survived)

• we could test for independence in each 2×2 table, but it is better to use a combined test, as explained before

The Mantel-Haenszel test is designed to test independence in 2×2 tables across K strata:

• specifically, it tests H0: Δ1=Δ2=...=ΔK=1 (the conditional independence hypothesis in the Poisson approach) against H1: Δ1=Δ2=...=ΔK≠1 (the uniform association hypothesis in the Poisson approach), where Δk is the true odds ratio of the 2×2 table in the kth stratum

• it only makes sense to use the Mantel-Haenszel test if the association is similar in each stratum, i.e., if Δ1=Δ2=...=ΔK is accepted (a formal test, the Breslow-Day test, can be used to test the homogeneity of the K odds ratios)

• for these data, the sample odds ratios do not vary greatly, so the use of the test is justified:

> mantelhaen.test(ct3,exact=TRUE)

Exact conditional test of independence in 2 x 2 x k tables

data:  ct3

S = 502, p-value = 0.01591

alternative hypothesis: true common odds ratio is not equal to 1

95 percent confidence interval:

1.068889 2.203415

sample estimates:

common odds ratio

1.530256

Notice that:

• we used the exact method (the option exact=TRUE must be added in the command) in preference to the χ²(1) approximation for the null distribution

• the p-value 0.01591 is small, which indicates that a statistically significant association is revealed once we combine the information across strata

• the common odds ratio is estimated as 1.530256, with the confidence interval [1.068889, 2.203415]

• it tests the same hypotheses as the test in the command "drop1(modu, test="Chi")" above. For these data, the p-value of the Mantel-Haenszel test is very similar to that for the term smoker:dead (0.01591 and 0.01475, respectively).
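Beyond the exact test, the classical Mantel-Haenszel pooled odds-ratio estimate, Σ(aᵢdᵢ/nᵢ) / Σ(bᵢcᵢ/nᵢ), can be computed directly. The Python sketch below uses the age-stratified 2×2 counts of the femsmoke data (they reproduce the sample odds ratios listed earlier); the pooled estimate lands close to R's conditional MLE of 1.530256:

```python
# Mantel-Haenszel pooled odds-ratio estimate from the K = 7 age strata.
# Each table: rows smoker no/yes, columns dead no/yes.
strata = {
    "18-24": [[61, 1], [53, 2]],
    "25-34": [[152, 5], [121, 3]],
    "35-44": [[114, 7], [95, 14]],
    "45-54": [[66, 12], [103, 27]],
    "55-64": [[81, 40], [64, 51]],
    "65-74": [[28, 101], [7, 29]],
    "75+":   [[0, 64], [0, 13]],
}
num = sum(t[0][0] * t[1][1] / sum(map(sum, t)) for t in strata.values())
den = sum(t[0][1] * t[1][0] / sum(map(sum, t)) for t in strata.values())
print(round(num / den, 3))  # ≈ 1.53, close to the conditional MLE 1.530256
```

Note that the 75+ stratum, whose sample odds ratio was NaN, contributes nothing to either sum; the pooled estimate simply ignores uninformative strata.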