Interpreting Odds (Reading: Faraway (2006), section 2.5)
Here are data from a study on infant respiratory disease, namely the proportions of children developing bronchitis or pneumonia in their first year of life, by type of feeding and sex; the data may be found in Payne (1987). Let us first read the data into R and take a look at them:
babyfood <- read.table("babyfood.txt")
The predictors "sex" and "food" are cross-classifying categorical variables, so we can lay the data out as a contingency table with y_x/n_x (the number of children with the disease over the number of children) in each cell, as follows:
| ||Bottle Only||Some Breast with Supplement||Breast Only|
The layout above with the proportion in each cell can be obtained by:
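The command is not reproduced above. A minimal sketch of one way to get the layout, assuming babyfood has the columns disease, nondisease, sex and food that the model formula later in this section uses. The counts typed in below are an assumption: they are the values of the babyfood data set distributed with the faraway package, entered by hand so the block runs without babyfood.txt, and should be checked against your own copy of the file.

```r
# Assumed stand-in for read.table("babyfood.txt"); counts as in the
# faraway package's babyfood data set (verify against your file).
babyfood <- data.frame(
  disease    = c(77, 19, 47, 48, 16, 31),
  nondisease = c(381, 128, 447, 336, 111, 433),
  sex  = factor(rep(c("Boy", "Girl"), each = 3)),
  food = factor(rep(c("Bottle", "Suppl", "Breast"), times = 2))
)

# Each sex-by-food cell holds exactly one row, so summing the row-wise
# proportion within a cell simply returns that proportion, y_x / n_x.
prop_tab <- xtabs(disease / (disease + nondisease) ~ sex + food, data = babyfood)
print(round(prop_tab, 4))
```

The same table could be built with tapply; xtabs is used here only because it accepts the same formula interface as glm.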
The table seems to indicate that
breast feeding can reduce the chance of getting respiratory disease
boys have higher probability of getting the disease than girls
To draw more concrete and accurate conclusions, let us fit and examine a binomial GLM for the data:
mdl <- glm(cbind(disease, nondisease) ~ sex + food, family = binomial, data = babyfood)
# 0-1 dummy variables are used in this case. You can use the command
# "model.matrix(mdl)" to see how sex and food are coded
summary(mdl)
this is a model containing only main effects. The predictor sex has two categories (⇒ one main effect) and the predictor food has three categories (⇒ two main effects), so the model has four parameters (including an intercept term)
the chi-square approximation can be expected to be accurate here because of the large sample sizes in each of the covariate classes
the significance of the effect sexGirl indicates that there is a clear difference between boys and girls in the probability of getting the disease
the significance of the effect foodBreast indicates that there is a clear difference between "breast only" and "bottle only"
because the effect foodSuppl is insignificant, there is no strong evidence that "some breast with supplement" and "bottle only" differ
the residual deviance is small relative to its degrees of freedom (0.72192 on 2), which indicates that this model fits well
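The quantities these notes refer to can be pulled out of the fitted object directly. The sketch below is self-contained; the hand-entered counts are an assumption (they are the values of the babyfood data set in the faraway package, standing in for babyfood.txt).

```r
# Assumed stand-in for read.table("babyfood.txt"); verify against your file.
babyfood <- data.frame(
  disease    = c(77, 19, 47, 48, 16, 31),
  nondisease = c(381, 128, 447, 336, 111, 433),
  sex  = factor(rep(c("Boy", "Girl"), each = 3)),
  food = factor(rep(c("Bottle", "Suppl", "Breast"), times = 2))
)
mdl <- glm(cbind(disease, nondisease) ~ sex + food,
           family = binomial, data = babyfood)

# Columns of the design matrix: intercept, one dummy for sex, two for food.
X <- model.matrix(mdl)
print(colnames(X))

# Estimates, standard errors, z values and p-values for the four parameters.
print(round(coef(summary(mdl)), 4))

# Residual deviance and its degrees of freedom
# (6 covariate classes minus 4 parameters).
cat("residual deviance:", round(deviance(mdl), 5),
    "on", df.residual(mdl), "df\n")
```

Because factor levels are sorted alphabetically by default, "Boy" and "Bottle" are the reference categories, which is why the dummies are named sexGirl, foodBreast and foodSuppl.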
Q: Is it required to add the sex-by-food interaction effects?
To answer the question, one way is to fit a model with interaction effects (as will be shown shortly) and examine whether the interaction effects are significant.
However, there is a simpler method in this case. Notice that a model with all main effects and interaction effects (there are 1×2=2 interaction effects) would require six parameters, which equals the number of covariate classes. In other words, the model with main effects and interaction effects is saturated: its deviance is zero, with zero degrees of freedom. Let S denote the model with main effects only and L the model with main effects and interaction effects, with deviances D_S, D_L and residual degrees of freedom df_S, df_L. Then D_S - D_L = 0.72192 - 0 = 0.72192 and df_S - df_L = 2 - 0 = 2. Because 0.72192 is not at all large for a chi-square distribution with two degrees of freedom, we do not reject H0: S and conclude that there is no evidence of significant interactions.
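To make "not at all large" concrete, the tail probability of the chi-square reference distribution can be computed from the two numbers quoted above. This is only a sketch of the arithmetic; the variable names are illustrative.

```r
dev_diff <- 0.72192  # D_S - D_L, as quoted in the text
df_diff  <- 2        # df_S - df_L

# Upper-tail probability of a chi-square(2) variable at the observed difference.
p_value <- pchisq(dev_diff, df = df_diff, lower.tail = FALSE)
print(p_value)
```

The result is roughly 0.70, far above any conventional significance level, which is why the main-effects model is retained.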
In R, you can fit a model with interaction effects by:
mdlfull <- glm(cbind(disease, nondisease) ~ sex * food, family = binomial, data = babyfood)
# an alternative is to replace "sex * food" with "sex + food + sex:food".
# You can use the command "help(formula)" for details
Notice that both interaction effects are insignificant, which supports our previous arguments.
Let us now go back to the model with only main effects, i.e., mdl. We can test for the significance of the predictors by:
The drop1 function tests each predictor relative to the full model, i.e., mdl. Notice that
0.722 is the deviance of the model mdl, i.e., cbind(disease, nondisease) ~ sex + food
5.699 is the deviance of the model cbind(disease, nondisease) ~ food
20.899 is the deviance of the model cbind(disease, nondisease) ~ sex
the p-value for food, 4.155e-05, can be obtained by the command "pchisq(20.899-0.722, 2, lower.tail=FALSE)"
We see that both predictors are significant in this sense.
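Both p-values can be rebuilt from the three deviances listed above. The sketch below uses only those quoted numbers; dropping sex removes one parameter and dropping food removes two, which gives the degrees of freedom.

```r
dev_full    <- 0.722   # sex + food
dev_no_sex  <- 5.699   # food only
dev_no_food <- 20.899  # sex only

# Likelihood-ratio (deviance difference) tests against chi-square references.
p_sex  <- pchisq(dev_no_sex  - dev_full, df = 1, lower.tail = FALSE)
p_food <- pchisq(dev_no_food - dev_full, df = 2, lower.tail = FALSE)
print(c(sex = p_sex, food = p_food))
```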
Now, consider the interpretation of the coefficients. Let us start with the effect of breast feeding:
We see that
breast feeding reduces the odds of respiratory disease to 51.2% of that for bottle feeding, or
breast feeding reduces the log-odds of respiratory disease by 0.6693 compared to that for bottle feeding
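The two statements above are the same fact on different scales, as a quick check with the quoted coefficient shows. The variable names are illustrative.

```r
b_breast <- -0.6693           # foodBreast estimate quoted in the text (log-odds scale)

odds_ratio <- exp(b_breast)   # multiplicative effect on the odds
print(odds_ratio)             # odds under breast feeding relative to bottle feeding
print(100 * (1 - odds_ratio)) # percentage reduction in the odds
```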
We could compute a confidence interval for it by figuring the standard error on the odds scale; however, we get better coverage properties by computing the interval on the log-odds scale and then transforming the endpoints as follows:
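The original command is not reproduced here. A minimal sketch of the idea, assuming a 95% level and the usual normal approximation on the log-odds scale; the hand-entered counts are again an assumption (the faraway package's babyfood values standing in for babyfood.txt).

```r
# Assumed stand-in for read.table("babyfood.txt"); verify against your file.
babyfood <- data.frame(
  disease    = c(77, 19, 47, 48, 16, 31),
  nondisease = c(381, 128, 447, 336, 111, 433),
  sex  = factor(rep(c("Boy", "Girl"), each = 3)),
  food = factor(rep(c("Bottle", "Suppl", "Breast"), times = 2))
)
mdl <- glm(cbind(disease, nondisease) ~ sex + food,
           family = binomial, data = babyfood)

est <- coef(summary(mdl))["foodBreast", "Estimate"]
se  <- coef(summary(mdl))["foodBreast", "Std. Error"]

# Interval on the log-odds scale, then transform the endpoints to the odds scale.
ci_log_odds <- est + c(-1, 1) * qnorm(0.975) * se
ci_odds <- exp(ci_log_odds)
print(ci_odds)
```

Because exp is convex, an interval that is symmetric on the log-odds scale becomes longer above the point estimate than below it on the odds scale.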
Notice that the CI is asymmetric about the estimated effect of 0.5120669.
Confidence intervals can also be computed using profile likelihood methods.
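A sketch of how the profile likelihood interval might be obtained with confint, shown next to the Wald interval from confint.default for comparison. The hand-entered counts are an assumption (the faraway package's babyfood values standing in for babyfood.txt).

```r
# Assumed stand-in for read.table("babyfood.txt"); verify against your file.
babyfood <- data.frame(
  disease    = c(77, 19, 47, 48, 16, 31),
  nondisease = c(381, 128, 447, 336, 111, 433),
  sex  = factor(rep(c("Boy", "Girl"), each = 3)),
  food = factor(rep(c("Bottle", "Suppl", "Breast"), times = 2))
)
mdl <- glm(cbind(disease, nondisease) ~ sex + food,
           family = binomial, data = babyfood)

# Profile likelihood intervals for all coefficients, moved to the odds scale.
# (In R versions before 4.4.0 the glm method is supplied by the MASS package,
# which must be installed.)
ci_profile <- exp(confint(mdl))["foodBreast", ]

# Wald intervals (estimate +/- normal quantile * standard error) for comparison.
ci_wald <- exp(confint.default(mdl))["foodBreast", ]

print(rbind(profile = ci_profile, wald = ci_wald))
```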
the profile likelihood CI [0.3781890, 0.6895168] is slightly closer to zero than the previous one, although it makes little difference for these data
the latter result is usually more reliable because of the Hauck-Donner effect discussed before
As an aside, note that for small values of e, we have:
log(x(1+e)) = log(x) + log(1+e) ≈ log(x) + e
This approximation is reasonable for -0.25 < e < 0.25. So, for example, given the estimated foodSuppl effect=-0.1725, we can approximate the reduction in odds as about 17% relative to bottle feeding. The exact figure is:
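The original output is not reproduced here; the sketch below recomputes the exact figure from the coefficient quoted above and sets it beside the approximation. The variable names are illustrative.

```r
b_suppl <- -0.1725                    # foodSuppl estimate quoted in the text

approx_reduction <- -b_suppl          # reads the log-odds change as a relative change
exact_reduction  <- 1 - exp(b_suppl)  # exact relative reduction in the odds

print(c(approximate = approx_reduction, exact = exact_reduction))
```

The exact reduction is about 15.8%, a little over one percentage point below the quick approximation.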
So, the approximation is good for a quick sense of the effect, but an exact calculation is necessary for results that will be presented to others.