Three-Way Tables (Reading: Faraway (2006), section 4.4)


Appleton et al. (1996) presented data from a 20-year follow-up study on the effects of smoking. In 1972-74, a large study, which also considered other issues, classified women as smokers or nonsmokers and recorded their age group. At the follow-up, the researchers recorded whether each subject was dead or still alive (Q: what information about life status was lost in the dataset? Hint: the continuous variable, lifetime, was transformed into a categorical life-status variable). Only current smokers and women who had never smoked are included here; the relatively few women who had quit smoking were excluded, and the cause of death is not reported. Let us read the data into R:

> femsmoke <- read.table("femsmoke.txt")
> femsmoke

     y smoker dead   age

1    2    yes  yes 18-24

2    1     no  yes 18-24

3    3    yes  yes 25-34

... (rows 4 to 27 omitted) ...

28   0     no   no   75+


Simpson's Paradox.
We can combine the data over age groups to produce a contingency table:

> (ct <- xtabs(y ~ smoker+dead,femsmoke))

      dead

smoker  no yes

   no  502 230

   yes 443 139

Let us compute the proportions of dead and alive for smokers and nonsmokers:

> prop.table(ct,1)

      dead

smoker        no       yes

   no  0.6857923 0.3142077

   yes 0.7611684 0.2388316

We see that the overall proportion who died is lower among the smokers (0.239) than among the nonsmokers (0.314), which, taken at face value, suggests that smoking is associated with better survival.

However, if we consider the relationship within a given age group, say 55-64:

> (cta <- xtabs(y ~ smoker+dead,femsmoke, subset=(age=="55-64")))

      dead

smoker no yes

   no  81  40

   yes 64  51

We can similarly compute the proportions of dead and alive for smokers and nonsmokers within the age group:

> prop.table(cta,1)

      dead

smoker        no       yes

   no  0.6694215 0.3305785

   yes 0.5565217 0.4434783

We see that within this age group the conclusion is reversed: the proportion who died is higher among the smokers (0.443) than among the nonsmokers (0.331). This reversal of the direction of association after aggregating over age is an example of Simpson's paradox.

Let us see why Simpson's paradox occurs here by examining the proportions of smokers and nonsmokers within each age group:

> prop.table(xtabs(y ~ smoker+age, femsmoke),2)

      age

smoker     18-24     25-34     35-44     45-54     55-64     65-74       75+

   no  0.5299145 0.5587189 0.5260870 0.3750000 0.5127119 0.7818182 0.8311688

   yes 0.4700855 0.4412811 0.4739130 0.6250000 0.4872881 0.2181818 0.1688312

We see that the mix of smokers and nonsmokers varies considerably across the age groups; in particular, smokers are heavily under-represented in the two oldest groups (65-74 and 75+), which have the highest mortality. Aggregating over age therefore makes the smokers, who are younger on average, appear to survive better, and this produces the paradox.
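As a quick check of this explanation, we can compute the observed death rates by age group (a small sketch using the same xtabs/prop.table approach as above):

> prop.table(xtabs(y ~ dead + age, femsmoke), 2)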

 

Poisson Model - Mutual Independence.
Now, let us consider a GLM approach to investigating how the three factors interact. Because the Poisson GLM containing only the main-effect terms corresponds to mutual independence, we can test for mutual independence by applying Pearson's X² goodness-of-fit test to this model:

> ct3 <- xtabs(y ~ smoker + dead + age, femsmoke)

> summary(ct3)

Call: xtabs(formula = y ~ smoker + dead + age, data = femsmoke)

Number of cases in table: 1314

Number of factors: 3

Test for independence of all factors:

        Chisq = 790.6, df = 19, p-value = 2.140e-155

We can also fit this Poisson GLM and use the deviance-based goodness-of-fit test:

> modi <- glm(y ~ smoker + dead + age, femsmoke, family=poisson)

> summary(modi)

Call:

glm(formula = y ~ smoker + dead + age, family = poisson, data = femsmoke)

 

Deviance Residuals:

    Min       1Q   Median       3Q      Max 

-7.9306  -5.3175  -0.5514   2.4229  11.1895 

 

Coefficients:

            Estimate Std. Error z value Pr(>|z|)   

(Intercept)  3.84748    0.09721  39.580  < 2e-16 ***

smokeryes   -0.22931    0.05554  -4.129 3.64e-05 ***

deadyes     -0.94039    0.06139 -15.319  < 2e-16 ***

age25-34     0.87618    0.11003   7.963 1.67e-15 ***

age35-44     0.67591    0.11356   5.952 2.65e-09 ***

age45-54     0.57536    0.11556   4.979 6.40e-07 ***

age55-64     0.70166    0.11307   6.206 5.45e-10 ***

age65-74     0.34377    0.12086   2.844  0.00445 **

age75+      -0.41837    0.14674  -2.851  0.00436 **

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 

(Dispersion parameter for poisson family taken to be 1)

 

    Null deviance: 1193.9  on 27  degrees of freedom

Residual deviance:  735.0  on 19  degrees of freedom

AIC: 887.2

 

Number of Fisher Scoring iterations: 6

Notice that the residual deviance is 735.0 on 19 degrees of freedom, far too large for an adequate fit, so the mutual independence model is firmly rejected. The deviance plays the same role here as the Pearson X² statistic of 790.6 reported by summary(ct3) above.
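As a quick supplementary check (a minimal sketch, reusing the fitted object modi from above), the deviance-based goodness-of-fit p-value can be computed directly, and the Pearson X² statistic recomputed from the model should agree with the value reported by summary(ct3):

> pchisq(deviance(modi), df.residual(modi), lower.tail=FALSE)  # essentially zero here
> sum(residuals(modi, type="pearson")^2)                       # Pearson X^2 statistic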

We can show that the coefficients of the main-effect-terms-only model correspond to the marginal proportions. For example, consider the smoker factor:
> coef(modi)[2]

 smokeryes
-0.2293101

> (coefsmoke <- exp(c(0,coef(modi)[2])))

          smokeryes

 1.000000  0.795082

> coefsmoke/sum(coefsmoke)

          smokeryes

0.5570776 0.4429224

We see that these are just the marginal proportions for the smokers and nonsmokers in the data:

> prop.table(xtabs(y ~ smoker, femsmoke))

smoker

       no       yes

0.5570776 0.4429224

This emphasizes that the main-effect coefficients of the Poisson model merely convey information that we already know and are not of primary interest in the study.
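The same identity holds for the other main effects. For example (a brief check reusing modi; coefdead is a name introduced here for illustration), the deadyes coefficient reproduces the marginal proportions of alive and dead:

> (coefdead <- exp(c(0, coef(modi)["deadyes"])))
> coefdead/sum(coefdead)                  # should match the marginal proportions below
> prop.table(xtabs(y ~ dead, femsmoke))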

Poisson Model - Joint Independence.

A Poisson GLM with all main effect terms and just one 2-factor interaction term corresponds to joint independence. The specific interaction term tells us which pair of variables is dependent. For example, we can fit a model that says age is jointly independent of smoker and dead as follows:

> modj <- glm(y ~ smoker*dead + age, femsmoke, family=poisson)

> summary(modj)

Call:

glm(formula = y ~ smoker * dead + age, family = poisson, data = femsmoke)

 

Deviance Residuals:

    Min       1Q   Median       3Q      Max 

-8.2606  -5.1564  -0.5933   2.5373  10.4236 

 

Coefficients:

                  Estimate Std. Error z value Pr(>|z|)   

(Intercept)        3.79994    0.09888  38.428  < 2e-16 ***

smokeryes         -0.12503    0.06519  -1.918  0.05511 . 

deadyes           -0.78052    0.07962  -9.803  < 2e-16 ***

age25-34           0.87618    0.11003   7.963 1.67e-15 ***

age35-44           0.67591    0.11356   5.952 2.65e-09 ***

age45-54           0.57536    0.11556   4.979 6.40e-07 ***

age55-64           0.70166    0.11307   6.206 5.45e-10 ***

age65-74           0.34377    0.12086   2.844  0.00445 **

age75+            -0.41837    0.14674  -2.851  0.00436 **

smokeryes:deadyes -0.37858    0.12566  -3.013  0.00259 **

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 

(Dispersion parameter for poisson family taken to be 1)

 

    Null deviance: 1193.9  on 27  degrees of freedom

Residual deviance:  725.8  on 18  degrees of freedom

AIC: 880

 

Number of Fisher Scoring iterations: 6

Notice that the smoker:dead interaction is significant, but the residual deviance has dropped only from 735.0 to 725.8 (on 18 degrees of freedom), still far too large for an adequate fit. The model in which age is jointly independent of smoker and dead is therefore also rejected.
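The improvement over the mutual independence model can also be assessed with a difference-in-deviance test (a small sketch, assuming modi and modj as fitted above):

> anova(modi, modj, test="Chi")   # LRT for adding the smoker:dead interaction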

Poisson Model - Conditional Independence.

A Poisson GLM with all main effect terms and two 2-factor interaction terms corresponds to conditional independence. The nature of the conditional independence can be determined by observing which one of the three possible 2-factor interactions does not appear in the model. The most plausible conditional independence model for our data is the following (exercise: check the fit of the other two conditional independence models; a sketch is given after the output below):

> modc <- glm(y ~ smoker*age + age*dead, femsmoke, family=poisson)

> c(deviance(modc), df.residual(modc))

[1] 8.326939 7.000000

Notice that the residual deviance is now only 8.33 on 7 degrees of freedom, which is consistent with an adequate fit. The model in which smoker and dead are conditionally independent given age is therefore plausible for these data.
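For the exercise above, the other two conditional independence models can be fitted in the same way (a sketch; modc2 and modc3 are names introduced here for illustration). Both omit an interaction that turns out to be important, so they fit far less well:

> # dead and age conditionally independent given smoker
> modc2 <- glm(y ~ smoker*age + smoker*dead, femsmoke, family=poisson)
> c(deviance(modc2), df.residual(modc2))
> # smoker and age conditionally independent given dead
> modc3 <- glm(y ~ smoker*dead + dead*age, femsmoke, family=poisson)
> c(deviance(modc3), df.residual(modc3))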

Poisson Model - Uniform Association.

A Poisson GLM with all main effect terms and all three 2-factor interaction terms corresponds to uniform association:

> modu <- glm(y ~ (smoker+age+dead)^2, femsmoke, family=poisson)

> summary(modu)

Call:

glm(formula = y ~ (smoker + age + dead)^2, family = poisson, data = femsmoke)

 

Deviance Residuals:

       Min          1Q      Median          3Q         Max 

-7.001e-01  -1.100e-01  -1.661e-05   1.225e-01   6.727e-01 

 

Coefficients:

                     Estimate Std. Error z value Pr(>|z|)   

(Intercept)           4.10629    0.12759  32.184  < 2e-16 ***

smokeryes            -0.13074    0.18539  -0.705 0.480667   

age25-34              0.92648    0.15071   6.147 7.87e-10 ***

age35-44              0.61204    0.15767   3.882 0.000104 ***

age45-54              0.08972    0.17335   0.518 0.604774   

age55-64              0.27942    0.16550   1.688 0.091355 . 

age65-74             -0.73124    0.21519  -3.398 0.000678 ***

age75+              -23.40456 8839.01148  -0.003 0.997887   

deadyes              -3.86011    0.59389  -6.500 8.05e-11 ***

smokeryes:age25-34   -0.11752    0.22091  -0.532 0.594749   

smokeryes:age35-44   -0.01268    0.22800  -0.056 0.955654   

smokeryes:age45-54    0.56538    0.23585   2.397 0.016522 * 

smokeryes:age55-64   -0.08512    0.23573  -0.361 0.718030   

smokeryes:age65-74   -1.49088    0.30039  -4.963 6.93e-07 ***

smokeryes:age75+     -1.89060    0.39582  -4.776 1.78e-06 ***

smokeryes:deadyes     0.42741    0.17703   2.414 0.015762 * 

age25-34:deadyes      0.12006    0.68655   0.175 0.861178   

age35-44:deadyes      1.34112    0.62857   2.134 0.032874 * 

age45-54:deadyes      2.11336    0.61210   3.453 0.000555 ***

age55-64:deadyes      3.18077    0.60057   5.296 1.18e-07 ***

age65-74:deadyes      5.08798    0.61951   8.213  < 2e-16 ***

age75+:deadyes       27.31727 8839.01150   0.003 0.997534   

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 

(Dispersion parameter for poisson family taken to be 1)

 

    Null deviance: 1193.9378  on 27  degrees of freedom

Residual deviance:    2.3809  on  6  degrees of freedom

AIC: 180.58

 

Number of Fisher Scoring iterations: 18

We can see that the model fits the data very well: the residual deviance is only 2.38 on 6 degrees of freedom. (The enormous estimates and standard errors for the age75+ terms reflect the zero counts in the 75+ age group, where no survivors were recorded.)

To understand the meaning of uniform association, let us compute the fitted values to determine the fitted odds ratios for each age group (i.e., the odds ratios based on the uniform association model, not the sample odds ratios):

> ctf <- xtabs(fitted(modu) ~ smoker+dead+age,femsmoke)
> apply(ctf, 3, function(x) (x[1,1]*x[2,2])/(x[1,2]*x[2,1]))

   18-24    25-34    35-44    45-54    55-64    65-74      75+

1.533275 1.533275 1.533275 1.533275 1.533275 1.533275 1.533275

We see that, under the uniform association model, the fitted odds ratio between smoker and dead is identical (about 1.53) in every age group. The model allows smoker and dead to be associated, but constrains the strength of that association to be the same across the age strata.
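This common fitted odds ratio is just the exponential of the smoker:dead interaction coefficient in modu (a quick check):

> exp(coef(modu)["smokeryes:deadyes"])   # the common odds ratio, about 1.53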

Analysis Strategy for a 3-Way Table Using a GLM Approach.
Because these log-linear models form a hierarchy, it makes sense to start with the most complex model and see how far it can be reduced. We can use difference-in-deviance tests for this. We start with the saturated model:

> modsat <- glm(y ~ smoker*age*dead, femsmoke, family=poisson)
> drop1(modsat,test="Chi")

Single term deletions

 

Model:

y ~ smoker * age * dead

                Df  Deviance     AIC     LRT Pr(Chi)

<none>             3.033e-10 190.195               

smoker:age:dead  6     2.381 180.576   2.381  0.8815

We see that the 3-factor interaction term may be dropped (p = 0.88). Now, we consider dropping the 2-factor interaction terms:

> drop1(modu, test="Chi")

Single term deletions

 

Model:

y ~ (smoker + age + dead)^2

            Df Deviance    AIC    LRT Pr(Chi)   

<none>             2.38 180.58                  

smoker:age   6    92.63 258.83  90.25 < 2e-16 ***

smoker:dead  1     8.33 184.52   5.95 0.01475 * 

age:dead     6   632.30 798.49 629.92 < 2e-16 ***

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that none of the three 2-factor interaction terms can be dropped: smoker:dead is significant at the 5% level, and the other two are overwhelmingly significant. The uniform association model is therefore the simplest model consistent with these data.
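Note that the smoker:dead line of the drop1 table is the same difference-in-deviance test that compares the conditional independence model modc to the uniform association model modu (a quick check, using the models fitted above):

> anova(modc, modu, test="Chi")   # LRT = 8.33 - 2.38 = 5.95 on 1 df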

Binomial Model.

For some three-way tables, it is reasonable to regard one variable as the response and the other two as predictors. In this example, we can view dead as the response and smoker and age as the predictors. Since dead has only two levels, we can model it using a binomial GLM; a response with more than two levels would require a multinomial model.

We construct a binomial response matrix whose first column contains the numbers of "successes" (deaths) and whose second column contains the numbers of "failures" (survivors), and fit a model containing only the intercept term:

> ybin <- matrix(femsmoke$y,ncol=2)

> modbinull <- glm(ybin ~ 1, femsmoke[1:14,], family=binomial)

> summary(modbinull)

Call:

glm(formula = ybin ~ 1, family = binomial, data = femsmoke[1:14,])

 

Deviance Residuals:

   Min      1Q  Median      3Q     Max 

-8.283  -5.472  -2.281   5.238  12.750 

 

Coefficients:

            Estimate Std. Error z value Pr(>|z|)   

(Intercept) -0.94039    0.06139  -15.32   <2e-16 ***

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 

(Dispersion parameter for binomial family taken to be 1)

 

    Null deviance: 641.5  on 13  degrees of freedom

Residual deviance: 641.5  on 13  degrees of freedom

AIC: 690.49

 

Number of Fisher Scoring iterations: 4

Notice that the intercept, -0.94039 (s.e. 0.06139), is exactly the deadyes coefficient from the Poisson mutual independence model modi; it is the marginal log-odds of death. The residual deviance of 641.5 on 13 degrees of freedom shows that a common death probability across the 14 smoker-age combinations fits very poorly.
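The intercept is simply the log-odds of death pooled over all the women, as a direct computation from the raw counts confirms (a minimal check):

> # marginal log-odds of death: log(total dead / total alive)
> log(sum(femsmoke$y[femsmoke$dead=="yes"]) / sum(femsmoke$y[femsmoke$dead=="no"]))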

To investigate what binomial model is appropriate for the data, let us start with the most complex model and see how far it can be reduced. The saturated binomial model is:

> modbin <- glm(ybin ~ smoker*age, femsmoke[1:14,], family=binomial)

> drop1(modbin,test="Chi")

Single term deletions

 

Model:

ybin ~ smoker * age

           Df Deviance    AIC    LRT Pr(Chi)

<none>        5.27e-10 74.996              

smoker:age  6    2.381 65.377  2.381  0.8815

We see that the interaction term can be dropped. Now, we check if we may drop further terms:

> modbinr <- glm(ybin ~ smoker+age, femsmoke[1:14,], family=binomial)

> drop1(modbinr,test="Chi")

Single term deletions

 

Model:

ybin ~ smoker + age

       Df Deviance    AIC    LRT Pr(Chi)   

<none>        2.38  65.38                  

smoker  1     8.33  69.32   5.95 0.01475 * 

age     6   632.30 683.29 629.92 < 2e-16 ***

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that both main effect terms are significant, so no further simplification is possible.


The binomial model with both main effect terms and no interaction term is effectively equivalent to the uniform association model in the Poisson approach (i.e., the Poisson GLM with all main effect terms and all 2-factor interaction terms, modu, above):

> deviance(modu)

[1] 2.380927

> deviance(modbinr)

[1] 2.380927

We can see that the deviances of the two models are identical. We can also extract the same fitted odds ratio (as in the Poisson model modu) from the estimated coefficient in the binomial model modbinr:

> exp(coef(modbinr)[2])

smokeryes
 1.533275
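In fact the whole coefficient row should agree: the estimate, standard error, and z value for smokeryes in modbinr should match those of smokeryes:deadyes in modu (a quick comparison, assuming the two fits above):

> summary(modbinr)$coefficients["smokeryes",]
> summary(modu)$coefficients["smokeryes:deadyes",]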


Let us check which Poisson GLM is associated with a binomial GLM containing only one main effect term:

> modbina <- glm(ybin ~ age, femsmoke[1:14,], family=binomial)

> deviance(modbina)

[1] 8.326939

We can see that the deviance, 8.326939, is identical to that of the Poisson conditional independence model modc (smoker*age + age*dead): dropping smoker from the binomial model corresponds to asserting that smoker and dead are conditionally independent given age.

From the discussion above, we know that each binomial GLM can be identified with a corresponding Poisson GLM, and the relevant numbers obtained from the two fits will be identical.
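As a compact summary of the correspondences seen so far (a sketch comparing the deviances of the matched pairs of models fitted above):

> c(deviance(modbinr), deviance(modu))   # smoker + age  <->  all three 2-factor interactions
> c(deviance(modbina), deviance(modc))   # age only      <->  smoker*age + age*dead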

Hypergeometric Model.

A 2×2×K table can be decomposed into K 2×2 tables (which, for these data, can be viewed as conditioning on age). When the row and column marginal totals of each 2×2 table are treated as fixed, we can use a hypergeometric distribution to model the data.
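Within a single age group, this conditioning is exactly what Fisher's exact test uses. For example, for the 55-64 table cta constructed earlier (a brief illustration):

> fisher.test(cta)   # exact test of independence for one 2x2 stratum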

Recall that, as suggested above, the dependence between smoker and dead within any single age group is not statistically significant. Suppose that we compute the sample odds ratios for all the age groups:

> ct3 <- xtabs(y ~ smoker+dead+age,femsmoke)
> apply(ct3, 3, function(x) (x[1,1]*x[2,2])/(x[1,2]*x[2,1]))

   18-24    25-34    35-44    45-54    55-64    65-74      75+

2.301887 0.753719 2.400000 1.441748 1.613672 1.148515      NaN

We see that most of the sample odds ratios exceed 1, suggesting higher odds of death for smokers within a given age group. The 75+ group yields NaN because no survivors were recorded there, so its odds ratio involves division of zero by zero.

The Mantel-Haenszel test is designed to test independence in the 2×2 tables across the K strata.
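In R, this test is available through the built-in mantelhaen.test function; a sketch using the 2×2×7 table ct3 constructed above, with exact = TRUE requesting the exact conditional test:

> mantelhaen.test(ct3, exact=TRUE)

Besides a pooled test of independence, the output also includes an estimate of the common odds ratio across the strata.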