Interpreting Odds (Reading: Faraway (2006), section 2.5)
¡@
Here is a data from a study on infant respiratory disease, namely the proportions of children developing bronchitis or pneumonia in their first year of life by type of feeding and sex, which may be found in Payne (1987). Let us first read into R the data and take a look of it:
>
babyfood <- read.table("babyfood.txt")
> babyfood
disease nondisease sex food
1 77 381 Boy Bottle
2 19 128 Boy Suppl
3 47 447 Boy Breast
4 48 336 Girl Bottle
5 16 111 Girl Suppl
6 31 433 Girl Breast
The predictors "sex" and "food" are cross-classifying categorical variables so that we can express the data in a format of contingency table with y_{x}/n_{x} in each cell as follows:
¡@ | Bottle Only | Some Breast with Supplyment | Breast Only |
Boys | 77/458 | 19/147 | 47/494 |
Girls | 48/384 | 16/127 | 31/464 |
The layout above with the proportion in each cell can be obtained by:
> xtabs(disease/(disease+nondisease)~sex+food,babyfood)
food
sex Bottle Breast Suppl
Boy 0.16812227 0.09514170 0.12925170
Girl 0.12500000 0.06681034 0.12598425
The table seems indicating that
breast feeding can reduce the chance of getting respiratory disease
boys have higher probability of getting the disease than girls
¡@
To draw more concrete and accurate conclusions, let us fit and examine a binomial GLM of the data:
>
mdl <- glm(cbind(disease,nondisease) ~ sex + food, family=binomial,babyfood)
# 0-1 dummy variables are used for this case. You can use
the command "model.matrix(mdl)" to understand how sex and food are coded
>
summary(mdl)
Call:
glm(formula = cbind(disease, nondisease) ~ sex + food, family = binomial, data = babyfood)
Deviance Residuals:
1 2 3 4 5 6
0.1096 -0.5052 0.1922 -0.1342 0.5896 -0.2284
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.6127 0.1124 -14.347 < 2e-16 ***
sexGirl -0.3126 0.1410 -2.216 0.0267 *
foodBreast -0.6693 0.1530 -4.374 1.22e-05 ***
foodSuppl -0.1725 0.2056 -0.839 0.4013
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 26.37529 on 5 degrees of freedom
Residual deviance: 0.72192 on 2 degrees of freedom
AIC: 40.24
Number of Fisher Scoring iterations: 4
Notice that
this is a model containing only main effects. The predictor sex has two categories (Þ one main effect) and the predictor food has three categories (Þ two main effects) so that the model has four parameters (including an intercept term).
the chi-square approximation can be expected to be accurate here due to the large sample sizes in each of the covariate classes
the significance of the effect sexGirl indicates that there is a clear difference between boys and girls on the probabilities of getting the disease
the significance of the effect foodBreast indicates that there is a clear difference between "breast only" and "bottle only"
because the effect foodSuppl is insignificant, there is no strong evidence that supports "some breast with supplement" and "bottle only" are different
the small Residual deviance shows that this model is a good fit
Q: Is it required to add the sex-by-food interaction effects?
To answer the question, one way is to fit a model with interaction effects (as will be shown shortly) and examine whether the interaction effects are significant.
However, there is a simpler method for this case. Notice that a model with all main effects and interaction effects (there are 1´2=2 interaction effects) would require six parameters, which equals the number of covariate classes in this case. In other words, the model with main effects and interaction effects would be saturated with zero deviance whose degrees of freedom is also zero. Suppose that S=the model with main effects and L=the model with main effect and interaction effects. We can soon get D_{S}-D_{L}=0.72192 and df_{S}-df_{L}=2. Because 0.72192 is not at all large for two degrees of freedom, we would accept H_{0}: S and may conclude that there is no evidence of significant interactions.
In R, you can fit a model with interaction effects by:
> mdlfull <- glm(cbind(disease,nondisease) ~
sex * food, family=binomial,babyfood)
# an alternative is to replayce "sex * food" by "sex +
food + sex:food". You can use the command "help(formula)" to get more
information.
> summary(mdlfull)
Call:
glm(formula = cbind(disease, nondisease) ~ sex * food, family = binomial, data = babyfood)
Deviance Residuals:
[1] 0 0 0 0 0 0
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.59899 0.12495 -12.797 < 2e-16 ***
sexGirl -0.34692 0.19855 -1.747 0.080591 .
foodBreast -0.65342 0.19780 -3.303 0.000955 ***
foodSuppl -0.30860 0.27578 -1.119 0.263145
sexGirl:foodBreast -0.03742 0.31225 -0.120 0.904603
sexGirl:foodSuppl 0.31757 0.41397 0.767 0.443012
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2.6375e+01 on 5 degrees of freedom
Residual deviance: -7.9270e-14 on 0 degrees of freedom
AIC: 43.518
Number of Fisher Scoring iterations: 3
Notice that both interaction effects are insignificant, which supports our previous arguments.
¡@
Let us now go back to the model with only main effects, i.e, mdl. We can test for the significance of the predictors by:
> drop1(mdl,test="Chi")
Single term deletions
Model:
cbind(disease, nondisease) ~ sex + food
Df Deviance AIC LRT Pr(Chi)
<none> 0.722 40.240
sex 1 5.699 43.217 4.977 0.02569 *
food 2 20.899 56.417 20.177 4.155e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The drop1 function test each predictor relative to the full, i.e., mdl. Notice that
0.722 is the deviance of the model mdl, i.e., cbind(disease, nondisease) ~ sex + food
5.699 is the deviance of the model cbind(disease, nondisease) ~ food
20.899 is the deviance of the model cbind(disease, nondisease) ~ sex
the p-value, say 4.155e-05, can be obtained by the command "pchisq(20.899-0.722,2,lower=FALSE)"
We see that both predictors are significant in this sense.
¡@
Now, consider the interpretation of the coefficients. Let us start with the effect of breast feeding:
> exp(-0.6693)
[1] 0.5120669
We see that
breast feeding reduces the odds of respiratory disease to 51.2% of that for bottle feeding, or
breast feeding reduces the log-odds of respiratory disease by 0.6693 compared to that for bottle feeding
We could compute a confidence interval for it by figuring the standard error on the odds scale; however, we get better coverage properties by computing the interval on the log-odds scale and then transforming the endpoints as follows:
> c(-0.6693-1.96*0.153,-0.6693+1.96*0.153)
[1] -0.96918 -0.36942
> exp(c(-0.96918,-0.36942))
[1] 0.3793940 0.6911351
Notice that the CI is asymmetric about the estimated effect of 0.5120669.
Confidence intervals can also be computed using profile
likelihood methods:
>
library(MASS)
>
exp(confint(mdl))
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.1591976 0.2474327
sexGirl 0.5536196 0.9629212
foodBreast 0.3781890 0.6895168
foodSuppl 0.5555197 1.2464284
Notice that
the CI [0.3781890, 0.6895168] is slightly closer to zero than the previous one although it makes little difference for this data
the latter result is usually more reliable for the Hauck-Donner reason discussed before
¡@
As an aside, note that for small value of e, we have:
log(x(1+e)) = log(x) + log(1+e) » log(x) + e
This approximation is reasonable for -0.25 < e < 0.25. So, for example, given the estimated foodSuppl effect=-0.1725, we can approximate the reduction in odds as about 17% relative to bottle feeding. The exact figure is:
> 1-exp(-0.173)
[1] 0.1588624
So, the approximation is good for a quick sense of the effect, but an exact calculation is necessary for results that will be presented to others.