Matched Pairs (Reading: Faraway (2006), section 4.3)
In Stuart (1955), data on the vision of a sample of women are presented. The performance of the left and right eyes is graded into four categories. Let us read the data into R and take a look at it:
> eyegrade <- read.table("eyegrade.txt")
> eyegrade
y right left
1 1520 best best
2 266 best second
3 124 best third
4 66 best worst
5 234 second best
6 1512 second second
7 432 second third
8 78 second worst
9 117 third best
10 362 third second
11 1772 third third
12 205 third worst
13 36 worst best
14 82 worst second
15 179 worst third
16 492 worst worst
> (ct <- xtabs(y ~ right+left, eyegrade))
left
right best second third worst
best 1520 266 124 66
second 234 1512 432 78
third 117 362 1772 205
worst 36 82 179 492
In the table,
we see that the row and column variables have the same levels. This is because in matched pairs, the same measure is taken on two matched objects (here, the two eyes of each woman).
we observe larger numbers on the diagonal than off the diagonal, which indicates that many people do have symmetric vision (a quick check follows below)
we can also see that the entries adjacent to the diagonal are larger than those further away
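As a quick numerical check of the diagonal dominance (a sketch, not part of the original transcript), we can compute the proportion of the total count lying on the diagonal:
> sum(diag(ct))/sum(ct)   # (1520+1512+1772+492)/7477, about 0.71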
Let us check for independence:
> summary(ct)
Call: xtabs(formula = y ~ right + left, data = eyegrade)
Number of cases in table: 7477
Number of factors: 2
Test for independence of all factors:
Chisq = 8097, df = 9, p-value = 0
We are not surprised to find strong evidence of dependence in matched-pair data.
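The same Pearson statistic can also be obtained directly from the table with chisq.test (a sketch; it should reproduce the Chisq = 8097 on 9 degrees of freedom reported above):
> chisq.test(ct)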
Let us check whether the matched-pair data have the symmetry property. We can fit the symmetry model by first defining a factor whose levels represent the symmetric pairs of cells. There is only one observation for each level down the diagonal, while there are two observations for each level off the diagonal:
> (symfac <- factor(apply(eyegrade[,2:3],1,function(x) paste(sort(x),collapse="-"))))
[1] best-best best-second best-third best-worst
[5] best-second second-second second-third second-worst
[9] best-third second-third third-third third-worst
[13] best-worst second-worst third-worst worst-worst
10 Levels: best-best best-second best-third ... worst-worst
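Tabulating the new factor confirms the pairing (a quick sketch, not in the original transcript); each diagonal level appears once and each off-diagonal level twice:
> table(symfac)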
We can see that the new factor has I(I+1)/2=4(4+1)/2=10 levels. We now fit a model for the new factor:
> mods <- glm(y ~ symfac, eyegrade, family=poisson)
> summary(mods)
Call:
glm(formula = y ~ symfac, family = poisson, data = eyegrade)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.219e+00 -4.776e-01 -1.475e-07 4.700e-01 2.008e+00
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.326466 0.025649 285.638 < 2e-16 ***
symfacbest-second -1.805005 0.051555 -35.011 < 2e-16 ***
symfacbest-third -2.534816 0.069334 -36.559 < 2e-16 ***
symfacbest-worst -3.394640 0.102283 -33.189 < 2e-16 ***
symfacsecond-second -0.005277 0.036322 -0.145 0.884
symfacsecond-third -1.342529 0.043787 -30.660 < 2e-16 ***
symfacsecond-worst -2.944439 0.083114 -35.427 < 2e-16 ***
symfacthird-third 0.153399 0.034960 4.388 1.15e-05 ***
symfacthird-worst -2.068970 0.057114 -36.225 < 2e-16 ***
symfacworst-worst -1.127987 0.051869 -21.747 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 8692.334 on 15 degrees of freedom
Residual deviance: 19.249 on 6 degrees of freedom
AIC: 156.63
Number of Fisher Scoring iterations: 4
> pchisq(deviance(mods),df.residual(mods),lower=F)
[1] 0.003762852
Here, we see evidence of a lack of symmetry because a deviance of 19.249 on 6 degrees of freedom is too large. It is worth checking the residuals:
> round(xtabs(residuals(mods) ~ right+left, eyegrade),3)
left
right best second third worst
best 0.000 1.001 0.317 2.008
second -1.023 0.000 1.732 -0.225
third -0.320 -1.783 0.000 0.928
worst -2.219 0.223 -0.949 0.000
We see that:
the residuals down the diagonal are zero because there is only one observation for each level
the matrix of absolute residuals is very close to a symmetric matrix (Q: can you see why this happens? Hint: there are two observations for each level off the diagonal; see also the sketch after this list)
the residuals above the diagonal are mostly positive, while those below the diagonal are mostly negative
so, there are generally more poor-left, good-right eye combinations than good-left, poor-right combinations
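To see why the absolute residuals are nearly symmetric (a sketch reusing the mods object fitted above): under the symmetry model each off-diagonal pair of cells shares the fitted value (y_ij + y_ji)/2, so the Pearson residuals of a pair are exact negatives of each other, while the deviance residuals printed above agree only approximately in magnitude:
> presid <- xtabs(residuals(mods, type="pearson") ~ right+left, eyegrade)
> round(presid + t(presid), 3)   # essentially a matrix of zeros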
Furthermore, let us compute the marginal totals:
> margin.table(ct,1)
right
best second third worst
1976 2256 2456 789
> margin.table(ct,2)
left
best second third worst
1907 2222 2507 841
We see that:
there are somewhat more poor left eyes and good right eyes (a quick check follows after this list)
so, perhaps marginal homogeneity does not hold here
notice that symmetry implies marginal homogeneity: if p_ij = p_ji for all i and j, then each row total equals the corresponding column total
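The imbalance in the margins can be seen by differencing them (a quick sketch based on the totals shown above):
> margin.table(ct,1) - margin.table(ct,2)   # best 69, second 34, third -51, worst -52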
Let us examine whether the data have the quasi-symmetry property. To do so, we fit a model with main-effect terms for the row and column variables together with the new factor symfac:
> modq <- glm(y ~ right+left+symfac, eyegrade, family=poisson)
> summary(modq)
Call:
glm(formula = y ~ right + left + symfac, family = poisson, data = eyegrade)
Deviance Residuals:
1 2 3 4 5 6 7 8
0.000e+00 1.612e-01 -8.394e-01 8.894e-01 -1.706e-01 -1.460e-07 6.325e-01 -1.128e+00
9 10 11 12 13 14 15 16
9.114e-01 -6.760e-01 -3.501e-07 2.409e-01 -1.093e+00 1.200e+00 -2.548e-01 -1.490e-07
Coefficients: (3 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.32647 0.02565 285.638 < 2e-16 ***
rightsecond -2.43955 0.09055 -26.942 < 2e-16 ***
rightthird -1.61523 0.06955 -23.223 < 2e-16 ***
rightworst -0.72288 0.05641 -12.816 < 2e-16 ***
leftsecond -2.33241 0.09149 -25.493 < 2e-16 ***
leftthird -1.39721 0.07012 -19.927 < 2e-16 ***
leftworst -0.40510 0.05641 -7.182 6.88e-13 ***
symfacbest-second 0.57954 0.09462 6.125 9.07e-10 ***
symfacbest-third -1.03453 0.08633 -11.983 < 2e-16 ***
symfacbest-worst -2.84322 0.10266 -27.696 < 2e-16 ***
symfacsecond-second 4.76669 0.16668 28.598 < 2e-16 ***
symfacsecond-third 2.54814 0.11038 23.085 < 2e-16 ***
symfacsecond-worst NA NA NA NA
symfacthird-third 3.16584 0.11415 27.734 < 2e-16 ***
symfacthird-worst NA NA NA NA
symfacworst-worst NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 8692.3336 on 15 degrees of freedom
Residual deviance: 7.2708 on 3 degrees of freedom
AIC: 150.65
Number of Fisher Scoring iterations: 4
We can see that:
some effects got "NA" in the output. This happened because the model is over-parameterized, which makes XᵀX singular (X is the model matrix). When XᵀX is singular, R automatically drops some effects (they are aliased) to obtain a non-singular model matrix. (Q: the number of effects in the model, i.e., 16, equals the number of observations. Why is the model not saturated? A check related to this appears after the goodness-of-fit test below.)
let us check whether a deviance of 7.2708 on 3 degrees of freedom passes the goodness-of-fit test:
> pchisq(deviance(modq),df.residual(modq),lower=F)
[1] 0.06375054
We see that this model does fit, which suggests that the table has the property of quasi-symmetry.
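Note that only 13 of the 16 listed coefficients were actually estimated, which is consistent with the residual deviance being reported on 3 degrees of freedom. A quick check (a sketch):
> sum(!is.na(coef(modq)))                    # 13 estimable (non-aliased) parameters
> nrow(eyegrade) - sum(!is.na(coef(modq)))   # 16 cells - 13 parameters = 3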
Because marginal homogeneity together with quasi-symmetry implies symmetry, one can test for marginal homogeneity by comparing the symmetry and quasi-symmetry models with a difference-in-deviance test:
> anova(mods,modq,test="Chi")
Analysis of Deviance Table
Model 1: y ~ symfac
Model 2: y ~ right + left + symfac
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 6 19.2492
2 3 7.2708 3 11.9784 0.0075
We can see that:
the difference in deviances is 11.9784 on 3 degrees of freedom (reproduced by hand in the sketch after this list).
because the p-value is small, we find evidence of a lack of marginal homogeneity
note that the test is appropriate only when quasi-symmetry already holds
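The difference-in-deviance test can also be reproduced directly from the two fitted models (a quick sketch):
> (dd <- deviance(mods) - deviance(modq))                     # 19.2492 - 7.2708 = 11.9784
> pchisq(dd, df.residual(mods) - df.residual(modq), lower=F)  # about 0.0075, matching the table above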
When we examine the table here, we do see that the numbers on the diagonal are much larger than those off the diagonal. A table with this property will almost never pass the test for independence. We might instead ask whether there is independence within the off-diagonal cells, i.e., whether left and right eye performance are independent among those people whose vision is not symmetric. This is called the quasi-independence hypothesis, and we can test it by removing the data on the diagonal:
> modqi <- glm(y ~ right+left, eyegrade, family=poisson, subset=-c(1,6,11,16))
> summary(modqi)
Call:
glm(formula = y ~ right + left, family = poisson, data = eyegrade, subset = -c(1, 6, 11, 16))
Deviance Residuals:
2 3 4 5 7 8 9 10 12 13 14
4.7352 -5.2834 -0.1783 4.1642 0.5516 -6.2875 -3.5775 -1.4254 5.6736 -2.2159 -4.0338
15
4.8330
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.38841 0.07454 58.870 < 2e-16 ***
rightsecond 0.78176 0.06409 12.197 < 2e-16 ***
rightthird 0.68723 0.06470 10.622 < 2e-16 ***
rightworst -0.45698 0.07552 -6.051 1.44e-09 ***
leftsecond 0.88999 0.06789 13.110 < 2e-16 ***
leftthird 0.87160 0.06689 13.030 < 2e-16 ***
leftworst -0.17689 0.07481 -2.364 0.0181 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 900.99 on 11 degrees of freedom
Residual deviance: 199.11 on 5 degrees of freedom
AIC: 294.81
Number of Fisher Scoring iterations: 4
Notice that:
there are only 12 observations because the 4 observations on the diagonal have been removed
the deviance of 199.11 on 5 degrees of freedom is clearly too large, so the model does not fit and the table is not quasi-independent (an equivalent fit using all 16 cells is sketched after this list)
this is not surprising because in the table the entries adjacent to the diagonal are larger than those further away
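An equivalent way to fit quasi-independence (a sketch, not taken from the text) is to keep all 16 cells and give each diagonal cell its own level of an extra factor; the diagonal cells are then fitted exactly, so the residual deviance should match the 199.11 on 5 degrees of freedom obtained above:
> eyegrade$diagfac <- factor(ifelse(eyegrade$right == eyegrade$left,
+                                   as.character(eyegrade$right), "offdiag"))
> modqi2 <- glm(y ~ right + left + diagfac, eyegrade, family=poisson)
> c(deviance(modqi2), df.residual(modqi2))   # should give 199.11 and 5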