NTHU STAT 5230 Lab - Binomial Data: Interpreting Odds

Interpreting Odds (Reading: Faraway (2006, 1st ed.), section 2.5)

Here is a data from a study on infant respiratory disease, namely the proportions of children developing bronchitis or pneumonia in their first year of life by type of feeding and sex, which may be found in Payne (1987). Let us first read into R the data and take a look of it:

> babyfood <- read.table("babyfood.txt")
> babyfood

disease nondisease sex food

1 77 381 Boy Bottle

2 19 128 Boy Suppl

3 47 447 Boy Breast

4 48 336 Girl Bottle

5 16 111 Girl Suppl

6 31 433 Girl Breast

The predictors "sex" and "food" are cross-classifying categorical variables so that we can express the data in a format of contingency table with y_x/n_x in each cell as follows:

	Bottle Only	Some Breast with Supplyment	Breast Only
Boys	77/458	19/147	47/494
Girls	48/384	16/127	31/464

The layout above with the proportion in each cell can be obtained by:

> xtabs(disease/(disease+nondisease)~sex+food,babyfood)

food

sex Bottle Breast Suppl

Boy 0.16812227 0.09514170 0.12925170

Girl 0.12500000 0.06681034 0.12598425

The table seems indicating that

breast feeding can reduce the chance of getting respiratory disease
boys have higher probability of getting the disease than girls

To draw more concrete and accurate conclusions, let us fit and examine a binomial GLM of the data:

> mdl <- glm(cbind(disease,nondisease) ~ sex + food, family=binomial,babyfood) # 0-1 dummy variables are used for this case. You can use the command "model.matrix(mdl)" to understand how sex and food are coded
> summary(mdl)

Call:

glm(formula = cbind(disease, nondisease) ~ sex + food, family = binomial, data = babyfood)

Deviance Residuals:

1 2 3 4 5 6

0.1096 -0.5052 0.1922 -0.1342 0.5896 -0.2284

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -1.6127 0.1124 -14.347 < 2e-16 ***

sexGirl -0.3126 0.1410 -2.216 0.0267 *

foodBreast -0.6693 0.1530 -4.374 1.22e-05 ***

foodSuppl -0.1725 0.2056 -0.839 0.4013

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 26.37529 on 5 degrees of freedom

Residual deviance: 0.72192 on 2 degrees of freedom

AIC: 40.24

Number of Fisher Scoring iterations: 4

Notice that

this is a model containing only main effects. The predictor sex has two categories (Þ one main effect) and the predictor food has three categories (Þ two main effects) so that the model has four parameters (including an intercept term).
the chi-square approximation can be expected to be accurate here due to the large sample sizes in each of the covariate classes
the significance of the effect sexGirl indicates that there is a clear difference between boys and girls on the probabilities of getting the disease
the significance of the effect foodBreast indicates that there is a clear difference between "breast only" and "bottle only"
because the effect foodSuppl is insignificant, there is no strong evidence that supports "some breast with supplement" and "bottle only" are different
the small Residual deviance shows that this model is a good fit
Q: Is it required to add the sex-by-food interaction effects?
- To answer the question, one way is to fit a model with interaction effects (as will be shown shortly) and examine whether the interaction effects are significant.
- However, there is a simpler method for this case. Notice that a model with all main effects and interaction effects (there are 1´2=2 interaction effects) would require six parameters, which equals the number of covariate classes in this case. In other words, the model with main effects and interaction effects would be saturated with zero deviance whose degrees of freedom is also zero. Suppose that S=the model with main effects and L=the model with main effect and interaction effects. We can soon get D_S-D_L=0.72192 and df_S-df_L=2. Because 0.72192 is not at all large for two degrees of freedom, we would accept H₀: S and may conclude that there is no evidence of significant interactions.

In R, you can fit a model with interaction effects by:

> mdlfull <- glm(cbind(disease,nondisease) ~ sex * food, family=binomial,babyfood) # an alternative is to replayce "sex * food" by "sex + food + sex:food". You can use the command "help(formula)" to get more information.
> summary(mdlfull)

Call:

glm(formula = cbind(disease, nondisease) ~ sex * food, family = binomial, data = babyfood)

Deviance Residuals:

[1] 0 0 0 0 0 0

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -1.59899 0.12495 -12.797 < 2e-16 ***

sexGirl -0.34692 0.19855 -1.747 0.080591 .

foodBreast -0.65342 0.19780 -3.303 0.000955 ***

foodSuppl -0.30860 0.27578 -1.119 0.263145

sexGirl:foodBreast -0.03742 0.31225 -0.120 0.904603

sexGirl:foodSuppl 0.31757 0.41397 0.767 0.443012

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 2.6375e+01 on 5 degrees of freedom

Residual deviance: -7.9270e-14 on 0 degrees of freedom

AIC: 43.518

Number of Fisher Scoring iterations: 3

Notice that both interaction effects are insignificant, which supports our previous arguments.

Let us now go back to the model with only main effects, i.e, mdl. We can test for the significance of the predictors by:

> drop1(mdl,test="Chi")

Single term deletions

Model:

cbind(disease, nondisease) ~ sex + food

Df Deviance AIC LRT Pr(Chi)

<none> 0.722 40.240

sex 1 5.699 43.217 4.977 0.02569 *

food 2 20.899 56.417 20.177 4.155e-05 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The drop1 function test each predictor relative to the full, i.e., mdl. Notice that

0.722 is the deviance of the model mdl, i.e., cbind(disease, nondisease) ~ sex + food
5.699 is the deviance of the model cbind(disease, nondisease) ~ food
20.899 is the deviance of the model cbind(disease, nondisease) ~ sex
the p-value, say 4.155e-05, can be obtained by the command "pchisq(20.899-0.722,2,lower=FALSE)"
We see that both predictors are significant in this sense.

Now, consider the interpretation of the coefficients. Let us start with the effect of breast feeding:

> exp(-0.6693)

[1] 0.5120669

We see that

breast feeding reduces the odds of respiratory disease to 51.2% of that for bottle feeding, or
breast feeding reduces the log-odds of respiratory disease by 0.6693 compared to that for bottle feeding

We could compute a confidence interval for it by figuring the standard error on the odds scale; however, we get better coverage properties by computing the interval on the log-odds scale and then transforming the endpoints as follows:

> c(-0.6693-1.96*0.153,-0.6693+1.96*0.153)

[1] -0.96918 -0.36942

> exp(c(-0.96918,-0.36942))

[1] 0.3793940 0.6911351

Notice that the CI is asymmetric about the estimated effect of 0.5120669.

Confidence intervals can also be computed using profile likelihood methods:
> library(MASS)
> exp(confint(mdl))

Waiting for profiling to be done...

2.5 % 97.5 %

(Intercept) 0.1591976 0.2474327

sexGirl 0.5536196 0.9629212

foodBreast 0.3781890 0.6895168

foodSuppl 0.5555197 1.2464284

Notice that

the CI [0.3781890, 0.6895168] is slightly closer to zero than the previous one although it makes little difference for this data
the latter result is usually more reliable for the Hauck-Donner reason discussed before

As an aside, note that for small value of e, we have:

log(x(1+e)) = log(x) + log(1+e) » log(x) + e

This approximation is reasonable for -0.25 < e < 0.25. So, for example, given the estimated foodSuppl effect=-0.1725, we can approximate the reduction in odds as about 17% relative to bottle feeding. The exact figure is:

> 1-exp(-0.173)

[1] 0.1588624

So, the approximation is good for a quick sense of the effect, but an exact calculation is necessary for results that will be presented to others.