Overdispersion (Reading: Faraway (2006, 1st ed.), section 2.11)
Manly (1978) reported data from an experiment in which boxes of trout eggs were buried at five different stream locations and retrieved at four different times, specified as the number of weeks after the original placement. The number of surviving eggs in each box was recorded. Let us first read the data into R and construct a tabulation of the data:
> troutegg <- read.table("troutegg.txt")
> ftable(xtabs(cbind(survive,total) ~ location+period, troutegg))
                 survive total
location period
1        4            89    94
         7            94    98
         8            77    86
         11          141   155
2        4           106   108
         7            91   106
         8            87    96
         11          104   122
3        4           119   123
         7           100   130
         8            88   119
         11           91   125
4        4           104   104
         7            80    97
         8            67    99
         11          111   132
5        4            49    93
         7            11   113
         8            18    88
         11            0   138
Notice that
in one cell, all the eggs survive, while in another, none of the eggs survive
therefore, we would expect that within the range of the predictors in the dataset, the true survival probability p_x can run from 0 to 1, which indicates:
a Normal linear model with y_x/n_x as the response would be a poor approximation to the binomial model
the choice of link function could be an important issue for this dataset
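As a quick illustration of the link issue, here is a minimal sketch (not part of the original notes) comparing the residual deviances of the main-effects model, fitted further below, under three common links; the factor conversion anticipates the analysis that follows:
> # Compare logit, probit, and complementary log-log links by deviance
> troutegg$location <- as.factor(troutegg$location)
> troutegg$period <- as.factor(troutegg$period)
> sapply(c("logit", "probit", "cloglog"), function(lk)
+   deviance(glm(cbind(survive, total - survive) ~ location + period,
+                family = binomial(link = lk), data = troutegg)))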
We can obtain the survival proportion in each cell by:
> xtabs((survive/total) ~ location+period, troutegg)
        period
location          4          7          8         11
       1 0.94680851 0.95918367 0.89534884 0.90967742
       2 0.98148148 0.85849057 0.90625000 0.85245902
       3 0.96747967 0.76923077 0.73949580 0.72800000
       4 1.00000000 0.82474227 0.67676768 0.84090909
       5 0.52688172 0.09734513 0.20454545 0.00000000
We can draw some rough conclusions from the table:
regardless of the retrieval time, location 5 has a survival probability much lower than the other locations
the survival probabilities at the other four locations are more consistent with one another
the survival probability tends to decrease as the retrieval time lengthens
it is not clear whether there are strong interactions between location and period
Let us fit a binomial GLM with only main effects to the data. We treat both location and period as nominal variables, although period could also be regarded as quantitative:
> troutegg$location <- as.factor(troutegg$location)
> troutegg$period <- as.factor(troutegg$period)
> bmod <- glm(cbind(survive,total-survive) ~ location + period, family=binomial,troutegg)
> summary(bmod)
Call:
glm(formula = cbind(survive, total - survive) ~ location + period,
    family = binomial, data = troutegg)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-4.83046  -0.36497  -0.03034   0.61907   3.24340

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   4.6358     0.2813  16.479  < 2e-16 ***
location2    -0.4168     0.2461  -1.694   0.0903 .
location3    -1.2421     0.2194  -5.660 1.51e-08 ***
location4    -0.9509     0.2288  -4.157 3.23e-05 ***
location5    -4.6138     0.2502 -18.439  < 2e-16 ***
period7      -2.1702     0.2384  -9.103  < 2e-16 ***
period8      -2.3256     0.2429  -9.573  < 2e-16 ***
period11     -2.4500     0.2341 -10.466  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1021.469  on 19  degrees of freedom
Residual deviance:   64.495  on 12  degrees of freedom
AIC: 157.03

Number of Fisher Scoring iterations: 5
Notice that the deviance of 64.495 on 12 degrees of freedom shows that this model does not fit. Some possible causes of the large deviance are:
sparse data
outliers
a wrong Xβ structure (the current model contains only main effects)
overdispersion
Before we conclude that there is overdispersion, we need to eliminate the other potential explanations.
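For reference, the corresponding goodness-of-fit p-value can be computed with a one-line chi-square tail calculation (a standard check, not shown in the original notes):
> # Upper-tail probability of chi-square(12) beyond the residual deviance;
> # a very small value confirms the lack of fit
> pchisq(deviance(bmod), df.residual(bmod), lower.tail = FALSE)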
With about 100 eggs in each box, we have no problem with sparseness.
To check for outliers, a half-normal plot of the residuals is a useful diagnostic. The function halfnorm for drawing half-normal plots was written by Prof. Faraway and is provided in his faraway package:
> library(faraway)
> halfnorm(residuals(bmod))
From the half-normal plot, we find that:
no single outlier is apparent
perhaps one can discern a larger number of residuals that seem to follow a more dispersed distribution than the rest
For the Xβ structure, we can add the (5-1)×(4-1) = 12 interaction effects to the current main-effects model. The resulting model is saturated, so its deviance is zero. In other words, the deviance decreases by 64.495 on 12 degrees of freedom, which is of course statistically significant in a difference-in-deviance test (a sketch of this test is given after the remarks below). However, let us also draw an interaction plot, using as the response the empirical logits, which can be regarded as empirical estimates of η_x = logit(p_x) and are defined as:
log[(y_x + 0.5)/(n_x - y_x + 0.5)].
The 0.5's are added to prevent infinite values when y_x = 0 or y_x = n_x.
> elogits <- log((troutegg$survive+0.5)/(troutegg$total-troutegg$survive+0.5))
> with(troutegg,interaction.plot(period,location,elogits))
Notice that:
interaction plots are always difficult to interpret conclusively
in this case, the five lines are roughly parallel to each other if the sharp decrease at the end of the location-5 line is ignored; in other words, there is no obvious sign of large interactions
we might conclude that although the interactions are statistically significant, they are not practically very significant
so, there is no clear evidence that the Xβ structure with only main effects is inadequate
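For reference, here is a minimal sketch of the difference-in-deviance test mentioned above (imod is a hypothetical name, not part of the original notes):
> # Adding location:period saturates the model, so its deviance is 0;
> # the chi-square test assesses the drop of 64.495 on 12 df
> imod <- glm(cbind(survive, total - survive) ~ location * period,
+             family = binomial, data = troutegg)
> anova(bmod, imod, test = "Chisq")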
Having eliminated these more obvious causes of the large deviance, we may put the blame on overdispersion. Possible sources of the overdispersion include:
heterogeneity among the trout eggs
variation in the experimental procedures
unknown variables affecting the survival probability
...
The simplest approach to modeling overdispersion is to introduce an additional dispersion parameter σ². It can be estimated by σ̂² = X²/(n - p), the Pearson statistic divided by the residual degrees of freedom:
> (sigma2 <- sum(residuals(bmod,type="pearson")^2)/12)
[1] 5.330322
We see that this estimate is substantially larger than one, the value it would take under the standard binomial GLM.
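An equivalent route in R (standard usage, though not shown in the original notes) is to refit the model with the quasibinomial family, whose summary reports the same Pearson-based dispersion estimate (qmod is a hypothetical name):
> # Quasi-likelihood fit: same coefficients, dispersion estimated from data
> qmod <- glm(cbind(survive, total - survive) ~ location + period,
+             family = quasibinomial, data = troutegg)
> summary(qmod)$dispersion  # matches the Pearson-based estimate above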
We can now perform F-tests on the predictors (cf. the chi-square tests used for a standard binomial GLM):
> drop1(bmod,scale=sigma2,test="F")
Single term deletions

Model:
cbind(survive, total - survive) ~ location + period

scale:  5.330322

         Df Deviance    AIC F value     Pr(F)
<none>        64.50 157.03
location  4  913.56 308.32  39.494 8.142e-07 ***
period    3  228.57 181.81  10.176  0.001288 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Warning message:
F test assumes 'quasibinomial' family in: drop1.glm(bmod, scale = sigma2, test = "F")
Notice that:
both predictors are significant (note: from the interaction plot, some main effects might be removed to further simplify the model)
in the command, it is necessary to specify the scale argument using the estimated value of σ²
if the scale argument is omitted, the deviance, rather than the Pearson X², is used to estimate σ²
this makes very little difference for this particular dataset (see the check below)
but in some cases, using the deviance instead of the Pearson X² to estimate σ² gives inconsistent results
the warning message reminds us that the use of σ² results in a model that is no longer a standard binomial GLM
no goodness-of-fit test is possible because we now have a free dispersion parameter σ²
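A quick check of the "very little difference" claim (a one-line computation, not in the original notes):
> # Deviance-based estimate of the dispersion: 64.495/12, about 5.37,
> # close to the Pearson-based estimate 5.33
> deviance(bmod)/df.residual(bmod)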
To test the significance of individual parameters, we can use the estimated σ² to scale up the standard errors of the parameter estimates (each standard error is multiplied by σ̂):
> summary(bmod,dispersion=sigma2)
Call:
glm(formula = cbind(survive, total - survive) ~ location + period,
    family = binomial, data = troutegg)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-4.83046  -0.36497  -0.03034   0.61907   3.24340

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   4.6358     0.6495   7.138 9.49e-13 ***
location2    -0.4168     0.5682  -0.734   0.4632
location3    -1.2421     0.5066  -2.452   0.0142 *
location4    -0.9509     0.5281  -1.800   0.0718 .
location5    -4.6138     0.5777  -7.987 1.39e-15 ***
period7      -2.1702     0.5504  -3.943 8.05e-05 ***
period8      -2.3256     0.5609  -4.146 3.38e-05 ***
period11     -2.4500     0.5405  -4.533 5.82e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 5.330322)

    Null deviance: 1021.469  on 19  degrees of freedom
Residual deviance:   64.495  on 12  degrees of freedom
AIC: 157.03

Number of Fisher Scoring iterations: 5
(Exercise: compare these results with the output of the standard binomial GLM, find their differences, and explain how the differences affect the conclusions.)
(Q: should we still use Normal(0,1) to calculate the p-values, or is a t-distribution a better choice?)
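One way to explore this question (a sketch, assuming the hypothetical quasibinomial fit qmod from above): summary() of a quasibinomial GLM reports t values, with p-values referred to a t-distribution on the residual degrees of freedom (here 12) rather than to Normal(0,1):
> # t-based tests; compare the Pr(>|t|) column with the Normal-based
> # p-values in the summary above
> summary(qmod)$coefficients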