We'll use the savings dataset (with country name) that has appeared several times in pervious lab as an example here.
First fit the model and make an index plot of the residuals:
> savings <- read.table("savings.data") # read the data into R
> g <- lm(sav ~ p15 + p75 + inc + gro, data=savings) # fit the model with sav as the response and the rest variables as predictors
> plot(g$res, ylab="Residual", main="index plot of residuals") # draw the plot of residuals against index
Let's find which countries correspond to the largest and smallest residuals.
> sort(g$res)[c(1,50)] # sort the residuals from small to large and get the largest and smallest residuals; the command "sort" returns a vector in ascending or descending order
or by using the "identify()" function. We first make up a character vector of the country names using "row.names()" which gets the row names from the data frame.
> countries <- row.names(savings) # extract the country names
> identify(1:50,g$res,countries) # label the points with their country names
Now let us look at the leverage.
We first extract the X-matrix here using "model.matrix()" and then compute and plot the leverages (also called "hat" values)
> x <- model.matrix(g) # extract X-matrix
> lev <- hat(x) # calculate leverages. The command "hat" returns the leverage, another way is to use the function "lm.influence", e.g., lev <- lm.influence(g)$hat, see example in influential observation
> sum(lev) # examine the sum of leverages
Why is the sum of the leverages equal to 5? (Hint: how many parameters in β?)
Let's check which countries have large leverages.
We can mark a horizontal line at 2p/n to indicate our "rule of thumb".
> plot(lev, ylab="Leverages",main="index plot of
leverages") # draw the plot of leverages against
> abline(h=2*5/50) # add a 2p/n horizontal line
> names(lev) <- countries # use the command "names" to assign the
country names to the elements of the vector lev
> lev[lev>2*5/50] # identify the four countries with the largest leverages
You can also identify large leverage from the plot by:
> identify(1:50, lev, countries)
large residuals and large leverage are distinct properties
in this case, the sets of countries having such large values do not intersect.
Now get the studentized residuals:
> gsum <- summary(g)
# get the summary output of the fitted
> gsum$sig # The σ-hat
> stud <-
# studentized residuals, an easy way is to use
the function "rstandard()", e.g., stud <- rstandard(g)
> plot(stud,ylab="studentized Residuals",main="studentized residuals") # draw the plot of studentized residuals against index
Note the range on the y-axis. How does this compare to the plain residuals?
Which studentized residuals are large?
In this case, there is not much difference between the studentized and raw residuals apart from the scale. Why? Let's take a look of the distribution of sqrt(1-hi):
> stem(sqrt(1-lev)) # draw stem plot of sqrt(1-hi)
Let's compute the jacknife residuals:
> jack <- stud*sqrt((50-5-1)/(50-5-stud^2))
# jacknife residuals, an easy way is to use the function "rstudent()", e.g.,
jack <- rstudent(g)
> plot(jack, ylab="Jacknife residuals", main="Jacknife residuals") # draw the plot of jacknife residuals against index
> jack[abs(jack)==max(abs(jack))] # find out the country with largest jacknife residual
> qt(0.05/2, 44) # compare with tn-p-1(a/2)
The largest residual of 2.85 is pretty big for a standard t-test scale
but is it an outlier under the Bonferroni critical value that takes into consideration the condition of multiple testing?
> qt(.05/(50*2), 44) # compare with tn-p-1(a/2n)
What do you conclude? No outliers are indicated.
Here is an example of a dataset with multiple outliers.
Data are available on the log of the surface temperature and the log of the light intensity of 47 stars in the star cluster CYG OB1, which is in the direction of Cygnus.
Read in and plot the data:
> star <- read.table("star.data",h=T) # read the data into R
> star # take a look of the data
> plot(star$temp,star$light) # draw the scatter plot of log light intensity against temperature
What do you think the relationship is between temperature and light intensity?
Now fit a linear regression and add the fitted line to the plot:
> gs <- lm(light ~ temp, data=star) # fit a model with light as the
response and temp as the predictor
> abline(gs) # add the fitted line to the scatter plot
Is this what you expected? Are there any outliers in the data?
Perform the outlier test.
> plot(rstudent(gs)) # plot of jacknife residuals
> sort(rstudent(gs))[c(1,2,3,45,46,47)] # get the three largest and three smallest jacknife residuals
> qt(0.05/2,44); qt(0.05/(2*47),44) # critical value under significance level a/2 and a/2n
The outlier test does not reveal any.
Let's try to find where the observations with large jacknife residuals located:
> star$temp[c(17,14,19,2,4,34)] # get the temp values of the six observations with most extreme jacknife residuals
The four stars on the upper left of the plot are giants. See what happens if these are excluded:
> gs1 <- lm(light ~ temp, data=star,
subset=(temp>3.6)) # fit a model without using the giant
> plot(star$temp,star$light); abline(gs); abline(gs1, lty=2) # add the fitted line of the last fitted model
Examine the regression summary and compare to the previous one. (exercise)
This example illustrates the problem of multiple outliers.
We can visualize the problems here, but for higher dimensional data, this is much more difficult.
Continuing with our study of the saving data. First we compute and plot the Cook statistics:
> cook <- stud^2*lev/(5*(1-lev)) # The Cook
statistics, an easy way is to use the function "cooks.distance()", e.g., cook
> plot(cook, ylab="Cooks distance") # draw the plot of Cook's statistics against index
> identify(1:50,cook,countries) # identify the three countries with largest Cook's statistics\
Why do the three countries have largest Cook's statistics?
We now exclude Libya and see how the fit changes:
> g1 <- lm(sav ~ p15+p75+inc+gro,subset=(cook <
0.2),data=savings) # fit a model without Libya data
> summary(g1) # take a look of the fitted model without Libya data
Compared to the full data fit:
Excluding Libya did have a moderate effect on the quantitative values of the coefficients.
The coefficient for gro changed by about 50%.
We don't like our estimates to be so sensitive to the presence of just one country.
It would be rather tedious to do this for each country. There's is a quick way in R so that we don't have to fit 50 regression models to examine the coefficients found when each country is excluded:
> ginf <- lm.influence(g) # calculate some values
useful for investigating influence
> names(ginf) # ginf has three components, the hat values (leverages), change in leave-out-one coefficients, change in leave-out-one sigmahat, and weighted residuals
> ginf # take a look of ginf
Let's plot the change in the second (p15) and third (p75) parameter estimates when a case is left out. (Try this for the other estimates).
> plot(ginf$coef[,2],ginf$coef[,3],xlab="change in p15 coef", ylab="change in p75 coef") # draw the plot of "changes in p15" against "changes in p75"
> identify(ginf$coef[,2],ginf$coef[,3],countries) # identify the countries whose data cause larger changes in parameter estimation
Which countries stick out?
Consider a fit without Japan:
> gj <- lm(sav ~ p15+p75+inc+gro,subset=(countries
!= "Japan")) # fit a model without Japan data
> summary(gj) # take a look of the fitted model
Compare it to the full data fit - what qualitative changes do you observe?
Finally, let's try to identify the countries that are influential on individual coefficient estimates and sigmahat.
> par(mfrow=c(2,3)) # split the graph window into 2x3 subwindows
> plot(ginf$coef[,2]);abline(h=0);identify(1:50,ginf$coef[,2],countries) # p15
> plot(ginf$coef[,3]);abline(h=0);identify(1:50,ginf$coef[,3],countries) # p75
> plot(ginf$coef[,4]);abline(h=0);identify(1:50,ginf$coef[,4],countries) # inc
> plot(ginf$coef[,5]);abline(h=0);identify(1:50,ginf$coef[,5],countries) # gro
> plot(ginf$sig);identify(1:50,ginf$sig,countries) # σ-hat
We can see the influence is not simple --- countries have influence on different components of the fit.
The following is a summary of the countries we identified in previous analysis.
|diff. in LO1 coef. (p15)||*||*|
|diff. in LO1 coef. (p75)||*||*|
|diff. in LO1 coef. (inc)||*||*||*|
|diff. in LO1 coef. (gro)||*|
|change in LO1 sigma||*||*|