Diagnostics 2 (Reading: Faraway (2005, 1st edition), Sections 4.1 and 4.3)


Residual Plots

We use the savings dataset as an example again:

> savings <- read.table("savings.data") # read the data into R

> g <- lm(sav ~ p15 + p75 + inc + gro, data=savings) # fit the model with sav as the response and the remaining variables as predictors

 

Let's plot the raw, studentized, and jackknife residuals against the observation index and check whether their relative magnitudes are similar or different.

> rstud <- rstandard(g) # studentized residuals
> rjack <- rstudent(g) # jackknife residuals
> par(mfrow=c(2,2)) # split the graphic window into 2x2 subwindows

> plot(g$res,ylab="raw residuals"); plot(rstud,ylab="studentized residuals"); plot(rjack,ylab="jackknife residuals") # draw raw, studentized, jackknife residuals against index
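As a sanity check on what rstandard and rstudent compute, here is a sketch that reproduces both from the leverages using the standard formulas (the names rstud.byhand and rjack.byhand, and the counts n = 50 observations and p = 5 coefficients, are our assumptions):

> h <- lm.influence(g)$hat # leverages h_i
> sig <- summary(g)$sigma # the usual estimate of sigma
> rstud.byhand <- g$res/(sig*sqrt(1-h)) # should reproduce rstandard(g)
> n <- 50; p <- 5 # 50 countries; intercept plus 4 predictors
> rjack.byhand <- rstud.byhand*sqrt((n-p-1)/(n-p-rstud.byhand^2)) # should reproduce rstudent(g)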

Next, the residuals vs. fitted plot and the |residuals| vs. fitted plot:

> par(mfrow=c(1,2))  # split the graphic window into 1x2 subwindows

> plot(g$fit, g$res, xlab="Fitted", ylab="Residuals"); abline(h=0) # draw the plot of residuals against y-hat
> plot(g$fit,abs(g$res),xlab="Fitted",ylab="|Residuals|") # draw the plot of absolute residuals against y-hat
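A rough numeric companion to the |residuals| plot is to regress the absolute residuals on the fitted values; a clearly nonzero slope suggests non-constant variance. A minimal sketch:

> summary(lm(abs(g$res) ~ g$fit)) # look at the t-test on the slope as an informal check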

 

It's often hard to judge residual plots without prior experience, so let's generate some of the artificial variety. The following four sets of nine plots each show:

  1. Constant Variance

  2. Strong non-constant variance

  3. Mild non-constant variance

  4. Non-linearity

> par(mfrow=c(3,3)) # split the graphic window into 3x3 subwindows
> for(i in 1:9) {plot(1:50,rnorm(50)); abline(h=0)} # Constant Variance

> for(i in 1:9) {plot(1:50,(1:50)*rnorm(50)); abline(h=0)}  # Strong non-constant variance

> for(i in 1:9) {plot(1:50,sqrt((1:50))*rnorm(50)); abline(h=0)}  # Mild non-constant variance

> for(i in 1:9) {plot(1:50,cos((1:50)*pi/25)+rnorm(50)); abline(h=0)}  # Non-linearity

 

Now let's go back to the savings data and look at the plots of residuals against each predictor:

> par(mfrow=c(2,3)) # split the graphic window into 2x3 subwindows

> plot(savings$p15,g$res,xlab="pop15",ylab="residuals"); abline(h=0) # residual plot - residual against predictor p15
> plot(savings$p75,g$res,xlab="pop75",ylab="residuals"); abline(h=0) # residual plot - residual against predictor p75
> plot(savings$inc,g$res,xlab="income",ylab="residuals"); abline(h=0) # residual plot - residual against predictor inc
> plot(savings$gro,g$res,xlab="growth",ylab="residuals"); abline(h=0)  # residual plot - residual against predictor gro

A quick way to get various diagnostic plots for the regression is:

> par(mfrow=c(2,2)); plot(g) # some default diagnostic plots in R

 


Non-constant variance

Consider the residuals vs. fitted plot for the Galapagos data from the previous lab.

> gala <- read.table("gala.data") # read the data into R

> gg <- lm(Species~Area+Elevation+Scruz+Nearest+Adjacent, data=gala) # fit a model with Species as response and other variables as predictors

> plot(gg$fit,gg$res,xlab="Fitted",ylab="Residuals",main="untransformed response") # draw the plot of residual against y-hat
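When the spread of the residuals grows with the fitted values, as it does here, a variance-stabilizing transformation of the response often helps. A sketch using a square-root transformation (a natural choice for a count response such as Species):

> ggs <- lm(sqrt(Species) ~ Area + Elevation + Scruz + Nearest + Adjacent, data=gala) # refit with a square-root transformed response
> plot(ggs$fit, ggs$res, xlab="Fitted", ylab="Residuals", main="square root response") # the spread should now look more nearly constant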

 


Curvature in the mean of residuals: added variable plot and partial residual plot

We use the savings dataset as an example again.

First we construct an added variable plot for p15:

> d <- lm(sav ~ p75 + inc + gro, data=savings)$res # calculate the residual of the model with sav as response and p75, inc, gro as predictors
> m <- lm(p15 ~ p75 + inc + gro, data=savings)$res # calculate the residual of the model with p15 as response and p75, inc, gro as predictors
> plot(m, d, xlab="pop15 residuals", ylab="saving residuals", main="added variable plot") # draw the 1st residuals against the 2nd residuals
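A useful fact about the added variable plot: the least-squares slope of d on m equals the coefficient of p15 in the full model. A quick check:

> lm(d ~ m)$coef # the slope of the added variable plot...
> g$coef['p15'] # ...should match the p15 coefficient in the full model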

 

A partial residual plot is easier to do:

> plot(savings$p15, g$res+g$coef['p15']*savings$p15, xlab="pop15", ylab="saving (adjusted)", main="partial residual") # draw the partial residual plot for p15
> abline(0, g$coef['p15']) # add the fitted line to the plot

Or, more directly, you can use the prplot function from the faraway package:

> library(faraway) # load the faraway package, which provides prplot
> prplot(g, 1) # draw the partial residual plot for the 1st predictor

 

 


Q-Q plots

Q-Q plots can be used to check the assumption of normality.

 

Let's try it out on the same savings data:

> par(mfrow=c(2,2))  # split the graphic window into 2x2 subwindows

> qqnorm(g$res, ylab="raw residuals") # draw the normal probability plot for raw residuals
> qqline(g$res) # add a straight line to guide the examination

We can plot the studentized residuals:

> qqnorm(rstud, ylab="studentized residuals") # draw the normal probability plot for studentized residuals
> abline(0,1) # add the straight line with zero intercept and slope 1
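The visual check can be supplemented with a formal test of normality, for example the Shapiro-Wilk test:

> shapiro.test(g$res) # a small p-value would indicate non-normal residuals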

 

Note that histograms and box-plots are not as sensitive as Q-Q plots for checking normality:

> par(mfrow=c(1, 2)) # split the graphic window into 1x2 subwindows

> hist(g$res, 10) # a histogram plot for raw residuals
> boxplot(g$res, ylab="Boxplot of savings residuals") # a box-plot for raw residuals

 

 

We can get an idea of the variation to be expected in Q-Q plots from the following experiment, in which we generate data from four different distributions:

  1. Normal

  2. Lognormal - an example of a skewed distribution

  3. Cauchy - an example of a long-tailed (leptokurtic) distribution

  4. Uniform - an example of a short-tailed (platykurtic) distribution

> par(mfrow=c(3,3)) # split the graphic window into 3x3 subwindows
> for(i in 1:9) {x<-rnorm(50); qqnorm(x); qqline(x)} # normal

> for(i in 1:9) {x<-exp(rnorm(50)); qqnorm(x); qqline(x)} # lognormal

> for(i in 1:9) {x<-rcauchy(50); qqnorm(x); qqline(x)} # Cauchy

> for(i in 1:9) {x<-runif(50); qqnorm(x); qqline(x)} # uniform

 

 


Half-normal plot

The half-normal plot is often used to look for extreme values, which appear as points that diverge substantially from the rest of the data.

 

We demonstrate the half-normal plot on the leverages and Cook statistics for the savings data. The halfnorm function used below is provided by the faraway package.
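For reference, here is a minimal sketch of such a function, built around the half-normal quantiles qnorm((n+i)/(2n+1)); the internals are our assumption, but the package version behaves similarly:

> halfnorm <- function(x, nlab=2, labs=as.character(1:length(x)), ylab="Sorted data") {
+   x <- abs(x) # half-normal plots work with absolute values
+   labord <- order(x) # remember which observation each sorted value came from
+   x <- sort(x)
+   n <- length(x)
+   ui <- qnorm((n + 1:n)/(2*n + 1)) # half-normal quantiles
+   plot(ui, x, xlab="Half-normal quantiles", ylab=ylab)
+   text(ui[(n-nlab+1):n], x[(n-nlab+1):n], labs[labord][(n-nlab+1):n], pos=2) # label the nlab largest values
+ }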

> par(mfrow=c(1,2)); countries <- row.names(savings)

> halfnorm(lm.influence(g)$hat, labs=countries, ylab="Leverages") # draw half-normal plot for leverage

> halfnorm(cooks.distance(g), labs=countries, nlab=3, ylab="Cook statistics") # draw half-normal plot for Cook's statistics
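To read off the flagged countries numerically rather than from the plot, a quick sketch:

> rev(sort(lm.influence(g)$hat))[1:3] # the three largest leverages, labeled by country
> rev(sort(cooks.distance(g)))[1:3] # the three largest Cook's distances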

 


Correlated errors

For the example, we use data taken from an environmental study that measured four variables (ozone, radiation, temperature, and wind) for 111 consecutive days.

> air <- read.table("air.data",header=T)  # read the data into R

> air  # take a look at the data

        ozone radiation temperature wind

1    3.448217       190          67  7.4

2    3.301927       118          72  8.0

...deleted...

111  2.714418       223          68 11.5

Take a look at the scatter plots of the variables in the data set:

> pairs(air) # draw pair-wise scatter plots

 

We fit a linear model:

> ga <- lm(ozone ~ radiation + temperature + wind, data=air) # fit the model with ozone as response and other variables as predictors
> summary(ga) # take a look at the fitted model

Call:

lm(formula = ozone ~ radiation + temperature + wind, data = air)

 

Residuals:

     Min       1Q   Median       3Q      Max

-1.12160 -0.37635 -0.02535  0.33607  1.49514

 

Coefficients:

              Estimate Std. Error t value Pr(>|t|)   

(Intercept) -0.2973296  0.5552139  -0.536 0.593400   

radiation    0.0022055  0.0005585   3.949 0.000141 ***

temperature  0.0500443  0.0061062   8.196 5.85e-13 ***

wind        -0.0760219  0.0157548  -4.825 4.67e-06 ***

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 

Residual standard error: 0.5102 on 107 degrees of freedom

Multiple R-Squared: 0.6807,     Adjusted R-squared: 0.6717

F-statistic: 76.03 on 3 and 107 DF,  p-value: < 2.2e-16

Check some diagnostics:

> par(mfrow=c(2, 2)); plot(ga) # draw the default diagnostics plots

 

Suppose we are otherwise satisfied with this model and want to check for serial correlation. First make an index plot of the residuals (note index represents time in this case):

> plot(ga$res) # draw the plot of residual against index

Now plot successive residuals:

> plot(ga$res[-111],ga$res[-1])  # plot the residual at time t against the residual at time t+1
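A numeric summary of this plot is the correlation between successive residuals:

> cor(ga$res[-111], ga$res[-1]) # lag-1 correlation; a value near 0 is consistent with uncorrelated errors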

You can plot more than just successive pairs:

> pairs(cbind(ga$res[-c(109,110,111)], ga$res[-c(1,110,111)], ga$res[-c(1,2,111)], ga$res[-c(1,2,3)]),labels=c("res","lag1 res","lag2 res","lag3 res"))  # plot the residual against its lag-1, lag-2, and lag-3 values

We can compute the Durbin-Watson statistic:

> library(lmtest) # load the "lmtest" library into R

> dwtest(ozone ~ radiation + temperature + wind, data=air) # perform the D-W test

        Durbin-Watson test

 

data:  ozone ~ radiation + temperature + wind

DW = 1.8501, p-value = 0.1893

alternative hypothesis: true autocorrelation is greater than 0
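The DW statistic also has a simple closed form, DW = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2 applied to the residuals, so we can reproduce the 1.8501 above directly:

> sum(diff(ga$res)^2)/sum(ga$res^2) # the Durbin-Watson statistic computed from its definition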