Diagnostics 1 (Reading: Faraway(2005, 1st edition), 4.2)


We'll use the savings dataset (with country name) that has appeared several times in pervious lab as an example here. 


First fit the model and make an index plot of the residuals:

> savings <- read.table("savings.data") # read the data into R

> g <- lm(sav ~ p15 + p75 + inc + gro, data=savings) # fit the model with sav as the response and the rest variables as predictors

> plot(g$res, ylab="Residual", main="index plot of residuals") # draw the plot of residuals against index



Now let us look at the leverage.


We first extract the X-matrix here using "model.matrix()" and then compute and plot the leverages (also called "hat" values)

> x <- model.matrix(g) # extract X-matrix
> lev <- hat(x) # calculate leverages. The command "hat" returns the leverage, another way is to use the function "lm.influence", e.g., lev <- lm.influence(g)$hat, see example in influential observation
> sum(lev) # examine the sum of leverages

[1] 5


Let's check which countries have large leverages. 


Studentized residuals

Now get the studentized residuals:

> gsum <- summary(g) # get the summary output of the fitted model g
> gsum$sig # The σ-hat

[1] 3.802648

> stud <- g$res/(gsum$sig*sqrt(1-lev)) # studentized residuals, an easy way is to use the function "rstandard()", e.g., stud <- rstandard(g)
> plot(stud,ylab="studentized Residuals",main="studentized residuals") # draw the plot of studentized residuals against index


Outlier and jacknife residuals

Let's compute the jacknife residuals:

> jack <- stud*sqrt((50-5-1)/(50-5-stud^2)) # jacknife residuals, an easy way is to use the function "rstudent()", e.g., jack <- rstudent(g)
> plot(jack, ylab="Jacknife residuals", main="Jacknife residuals") # draw the plot of jacknife residuals against index

> jack[abs(jack)==max(abs(jack))] # find out the country with largest jacknife residual


> qt(0.05/2, 44) # compare with tn-p-1(a/2)

[1] -2.015368

Here is an example of a dataset with multiple outliers.

Read in and plot the data:

> star <- read.table("star.data",h=T) # read the data into R

> star # take a look of the data

   index temp light

1      1 4.37  5.23

2      2 4.56  5.74


47    47 4.42  4.50

> plot(star$temp,star$light) # draw the scatter plot of log light intensity against temperature

Now fit a linear regression and add the fitted line to the plot:

> gs <- lm(light ~ temp, data=star) # fit a model with light as the response and temp as the predictor
> abline(gs) # add the fitted line to the scatter plot


The four stars on the upper left of the plot are giants. See what happens if these are excluded:

> gs1 <- lm(light ~ temp, data=star, subset=(temp>3.6)) # fit a model without using the giant data
> plot(star$temp,star$light); abline(gs); abline(gs1, lty=2)  # add the fitted line of the last fitted model


Influential observation

Continuing with our study of the saving data. First we compute and plot the Cook statistics:

> cook <- stud^2*lev/(5*(1-lev)) # The Cook statistics, an easy way is to use the function "cooks.distance()", e.g., cook <- cooks.distance(g)
> plot(cook, ylab="Cooks distance") # draw the plot of Cook's statistics against index
> identify(1:50,cook,countries) # identify the three countries with largest Cook's statistics\


It would be rather tedious to do this for each country. There's is a quick way in R so that we don't have to fit 50 regression models to examine the coefficients found when each country is excluded:

> ginf <- lm.influence(g) # calculate some values useful for investigating influence
> names(ginf) # ginf has three components, the hat values (leverages), change in leave-out-one coefficients, change in leave-out-one sigmahat, and weighted residuals

[1] "hat"          "coefficients" "sigma"        "wt.res" 

> ginf # take a look of ginf


    Australia       Austria       Belgium       Bolivia        Brazil        Canada

   0.06771409    0.12039224    0.08748097    0.08947314    0.06956000    0.15840654


        Libya      Malaysia

   0.53145507    0.06523377



              (Intercept)           p15          p75           inc           gro

Australia      0.09156168 -1.525349e-03 -0.029047707  4.265293e-05 -3.166120e-05

Austria       -0.07467639  8.682630e-04  0.044733545 -3.456376e-05 -1.622880e-03


Malaysia       0.27201243 -8.876680e-03  0.035190516 -4.633154e-05 -1.423981e-02



    Australia       Austria       Belgium       Bolivia        Brazil        Canada

     3.843255      3.844341      3.829643      3.844035      3.805321      3.845264


        Libya      Malaysia

     3.794791      3.817615



    Australia       Austria       Belgium       Bolivia        Brazil        Canada

    0.8632876     0.6162526     2.2187481    -0.6982426     3.5527272    -0.3172475


        Libya      Malaysia

   -2.8294673    -2.9708394


The following is a summary of the countries we identified in previous analysis.

  Zambia Chile Libya US Japan Ireland Sweden
large leverage     ** * * *  
studentized residual * *          
jacknife residual * *          
Cook stat. *   **   *    
diff. in LO1 coef. (p15)     *   *    
diff. in LO1 coef. (p75)         * *  
diff. in LO1 coef. (inc)       *   * *
diff. in LO1 coef. (gro)     *        
change in LO1 sigma * *