Poisson Regression (Reading: Faraway (2006), section 3.1)

   

    Here is a data which recorded the counts of the numbers of species of tortoise found on 30 Galapagos Islands and the numbers that are endemic to that island. The data also contains five geographic variables for each island. The data was presented by Johnson and Raven (1973). A few missing values that appeared in the original dataset have been filled for simplicity. We start the analysis of the data by reading the data into R and throw out the Endemics variable (which falls in the 2nd column of the data frame) since we won't be using it in this analysis.
> gala <- read.table("gala.txt")

> gala

             Species Endemics    Area Elevation Nearest Scruz Adjacent

Baltra            58       23   25.09       346     0.6   0.6     1.84

Bartolome         31       21    1.24       109     0.6  26.3   572.33

Caldwell           3        3    0.21       114     2.8  58.7     0.78

ˇKdeletedˇK

Wolf              21       12    2.85       253    34.1 254.7     2.33

> gala <- gala[,-2]

ˇ@

    Let us first fit a Normal linear model:

> modl <- lm(Species ~ . , gala)
> summary(modl)

Call:

lm(formula = Species ~ ., data = gala)

 

Residuals:

     Min       1Q   Median       3Q      Max

-111.679  -34.898   -7.862   33.460  182.584

 

Coefficients:

             Estimate Std. Error t value Pr(>|t|)   

(Intercept)  7.068221  19.154198   0.369 0.715351   

Area        -0.023938   0.022422  -1.068 0.296318   

Elevation    0.319465   0.053663   5.953 3.82e-06 ***

Nearest      0.009144   1.054136   0.009 0.993151   

Scruz       -0.240524   0.215402  -1.117 0.275208   

Adjacent    -0.074805   0.017700  -4.226 0.000297 ***

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 

Residual standard error: 60.98 on 24 degrees of freedom

Multiple R-Squared: 0.7658,     Adjusted R-squared: 0.7171

F-statistic:  15.7 on 5 and 24 DF,  p-value: 6.838e-07

We find that:

> plot(predict(modl),residuals(modl),xlab="Fitted",ylab="Residuals")

Notice that:

Let us follow the suggestion to fit a model by using the square-root of Species as the response and draw the same residual plot for the new model:

> modt <- lm(sqrt(Species) ~ . , gala)

> plot(predict(modt),residuals(modt),xlab="Fitted",ylab="Residuals")

We now see that the non-constant variance problem have been cleared up in the new model. Let us take a look at the fit:

> summary(modt)

Call:

lm(formula = sqrt(Species) ~ ., data = gala)

 

Residuals:

    Min      1Q  Median      3Q     Max

-4.5572 -1.4969 -0.3031  1.3527  5.2110

 

Coefficients:

              Estimate Std. Error t value Pr(>|t|)   

(Intercept)  3.3919243  0.8712678   3.893 0.000690 ***

Area        -0.0019718  0.0010199  -1.933 0.065080 . 

Elevation    0.0164784  0.0024410   6.751 5.55e-07 ***

Nearest      0.0249326  0.0479495   0.520 0.607844   

Scruz       -0.0134826  0.0097980  -1.376 0.181509   

Adjacent    -0.0033669  0.0008051  -4.182 0.000333 ***

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 

Residual standard error: 2.774 on 24 degrees of freedom

Multiple R-Squared: 0.7827,     Adjusted R-squared: 0.7374

F-statistic: 17.29 on 5 and 24 DF,  p-value: 2.874e-07

Notice that:

ˇ@

    For count responses as in this case, we can model the response by using Poisson distribution. Notice that the variance of Poisson increases with its mean. We now fit a Poisson GLM to the Galapagos data:
> modp <- glm(Species ~ .,family=poisson,gala)
> summary(modp)

Call:

glm(formula = Species ~ ., family = poisson, data = gala)

 

Deviance Residuals:

    Min       1Q   Median       3Q      Max 

-8.2752  -4.4966  -0.9443   1.9168  10.1849 

 

Coefficients:

              Estimate Std. Error z value Pr(>|z|)   

(Intercept)  3.155e+00  5.175e-02  60.963  < 2e-16 ***

Area        -5.799e-04  2.627e-05 -22.074  < 2e-16 ***

Elevation    3.541e-03  8.741e-05  40.507  < 2e-16 ***

Nearest      8.826e-03  1.821e-03   4.846 1.26e-06 ***

Scruz       -5.709e-03  6.256e-04  -9.126  < 2e-16 ***

Adjacent    -6.630e-04  2.933e-05 -22.608  < 2e-16 ***

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 

(Dispersion parameter for poisson family taken to be 1)

 

    Null deviance: 3510.73  on 29  degrees of freedom

Residual deviance:  716.85  on 24  degrees of freedom

AIC: 889.68

 

Number of Fisher Scoring iterations: 5

The format of the output is same as that for binomial GLM because they are members of generalized linear models. We can find from the output that:

Using the function written by Prof. Faraway, we can draw a half-normal plot for the residuals to see whether the large deviance can be explained by an outlier:

> halfnorm(residuals(modp))

We do not see any clear evidence of outlier in the half-normal plot.

ˇ@

It could be the Xb structure form of the model needs some improvement, but some experimentation with different form for the predictors will reveal that :

Let us now go to the last explanation --- over-dispersion. For a Poisson distribution, the mean is equal to the variance. Let us investigate this relationship for this model. It is difficult to estimated the variance for a given value of the mean, but (yx-(mx-hat))2 does serve as a crude approximation. We plot this estimated variance against the mean:

> plot(log(fitted(modp)),log((gala$Species-fitted(modp))^2),xlab=expression(hat(mu)),ylab=expression((y-hat(mu))^2))
> abline(0,1)

We see that the variance is proportional to, but larger than, the mean.

ˇ@

    For the cases of over-dispersion, we can generalize the Poisson GLM by allowing a dispersion parameter s2 such that:

        Var(yx)=s2E(yx).

The dispersion parameter can be estimated by:

> (dp <- sum(residuals(modp,type="pearson")^2)/modp$df.res)

[1] 31.74914

We can then adjust the standard errors of b-hat and so forth in the summary as follow:

> summary(modp,dispersion=dp)

Call:

glm(formula = Species ~ ., family = poisson, data = gala)

 

Deviance Residuals:

    Min       1Q   Median       3Q      Max 

-8.2752  -4.4966  -0.9443   1.9168  10.1849 

 

Coefficients:

              Estimate Std. Error z value Pr(>|z|)   

(Intercept)  3.1548079  0.2915897  10.819  < 2e-16 ***

Area        -0.0005799  0.0001480  -3.918 8.95e-05 ***

Elevation    0.0035406  0.0004925   7.189 6.53e-13 ***

Nearest      0.0088256  0.0102621   0.860    0.390   

Scruz       -0.0057094  0.0035251  -1.620    0.105   

Adjacent    -0.0006630  0.0001653  -4.012 6.01e-05 ***

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 

(Dispersion parameter for poisson family taken to be 31.74914)

 

    Null deviance: 3510.73  on 29  degrees of freedom

Residual deviance:  716.85  on 24  degrees of freedom

AIC: 889.68

 

Number of Fisher Scoring iterations: 5

Notice that:

To compare two nested models, we now need to use F-test, rather than chi-square test:

> drop1(modp,test="F")

Single term deletions

 

Model:

Species ~ Area + Elevation + Nearest + Scruz + Adjacent

          Df Deviance     AIC F value     Pr(F)   

<none>         716.85  889.68                     

Area       1  1204.35 1375.18 16.3217 0.0004762 ***

Elevation  1  2389.57 2560.40 56.0028 1.007e-07 ***

Nearest    1   739.41  910.24  0.7555 0.3933572    

Scruz      1   813.62  984.45  3.2400 0.0844448 . 

Adjacent   1  1341.45 1512.29 20.9119 0.0001230 ***

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Warning message:

F test assumes 'quasipoisson' family in: drop1.glm(modp, test = "F")

The F-test is more reliable than the z-statistics although it makes little difference for this data.