Here is a data set concerning the number of species of tortoise on the various Galapagos Islands. There are 30 cases (Islands) and 7 variables in the data set. We start by reading the data into R.
>
gala <- read.table("gala.data")
# read the data into R
> gala
# take a look
Species Endemics Area Elevation Nearest Scruz Adjacent
Baltra 58 23 25.09 346 0.6 0.6 1.84
Bartolome 31 21 1.24 109 0.6 26.3 572.33
Caldwell 3 3 0.21 114 2.8 58.7 0.78
... much deleted ...
Wolf 21 12 2.85 253 34.1 254.7 2.33
The variables are
Species | the number of species of tortoise found on the island |
Endemics | the number of endemic species |
Area | the area of the island (km2) |
Elevation | the highest elevation of the island (m) |
Nearest | the distance from the nearest island (km) |
Scruz | the distance from Santa Cruz island (km) |
Adjacent | the area of the adjacent island (km2) |
In this dataset,
Some missing values are filled in for simplicity.
Q: Are these variable quantitative or qualitative? If quantitative, continuous or discrete?
Now fit a model. Let us start with a simple model, in which
the variable "Species" is treated as the response (Q: Is it reasonable to regard Species as a continuous variable?) and
the rest variables except "Endemics" as the predictors.
The model can be expressed as
Speciesi | = | β0 + β1*Areai + β2*Elevationi + β3*Nearesti + β4*Scruzi + β5*Adjacenti + εi, |
i=1, 2, ..., 30. |
The command to fit a linear model in R is:
>
gfit <- lm(Species~Area+Elevation+Nearest+Scruz+Adjacent, data=gala)
# "lm" is the command in R to fit a linear model. Use "help(lm)"
to understand the details of the command. The option "data=" specifies the
dataset that will be used. Also, use "help(formula)" to learn how to express a
model in R.
> summary(gfit)
# all the usual regression stuff
Call:
lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
data = gala)
Residuals:
Min 1Q Median 3Q Max
-111.679 -34.898 -7.862 33.460 182.584
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.068221 19.154198 0.369 0.715351
Area -0.023938 0.022422 -1.068 0.296318
Elevation 0.319465 0.053663 5.953 3.82e-06 ***
Nearest 0.009144 1.054136 0.009 0.993151
Scruz -0.240524 0.215402 -1.117 0.275208
Adjacent -0.074805 0.017700 -4.226 0.000297 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 60.98 on 24 degrees of freedom
Multiple R-Squared: 0.7658, Adjusted R-squared: 0.7171
F-statistic: 15.7 on 5 and 24 DF, p-value: 6.838e-07
Notice that
There are many quantities (statistics) and many information in the output.
Q: What quantities have caught your attention?
For different datasets, you might find their most important/useful information existing in different statistics.
Q: For this dataset, which statistics/quantities are particularly important?
The default of "summay(lm.object)" does not print the correlation matrix.
> summary(gfit, cor=T) # with the option "cor=T", the output will contain the correlation matrix of the estimated parameters
> options(digits=3) # the option "digits" controls the number of digits to print
> summary(gfit, cor=T)$cor # take a look of the correlation matrix of β's estimators
(Intercept) Area Elevation Nearest Scruz Adjacent
(Intercept) 1.0000 0.3271 -0.565 -0.0431 -0.3389 0.253
Area 0.3271 1.0000 -0.801 0.2035 -0.0378 0.433
Elevation -0.5650 -0.8014 1.000 -0.2333 0.0991 -0.642
Nearest -0.0431 0.2035 -0.233 1.0000 -0.6257 0.284
Scruz -0.3389 -0.0378 0.099 -0.6257 1.0000 -0.191
Adjacent 0.2533 0.4328 -0.642 0.2839 -0.1910 1.000
Notice that
the correlation matrix is a symmetric matrix
its diagonals are 1's
Q: How to read these numbers?
For example, cor(β1-hat,β2-hat)=-0.8014 shows that the estimators of Area and Elevation effects are highly negatively correlated. They have strong collinearity.
On the other hand, cor(β2-hat,β4-hat)=0.099 tells us the estimators of Elevation and Scruz effects have much weaker correlation.
> summary(gfit)$cov * (summary(gfit)$sigma^2) # the covariance matrix of the β's estimators, the command "summary(gfit)$cov" returns (XTX)-1
(Intercept)
Area Elevation Nearest
Scruz Adjacent
(Intercept)
366.8833 0.140474
-0.580739 -0.86964 -1.398067
0.085879
Area
0.1405 0.000503 -0.000964
0.00481 -0.000183
0.000172
Elevation
-0.5807 -0.000964
0.002880 -0.01320 0.001145
-0.000610
Nearest
-0.8696 0.004811
-0.013196 1.11120 -0.142067
0.005297
Scruz
-1.3981 -0.000183
0.001145 -0.14207 0.046398
-0.000728
Adjacent
0.0859 0.000172 -0.000610
0.00530 -0.000728
0.000313
Notice that in the covariance matrix
The values in the diagonal are the estimated variances of β-hat. Compare it with the Std. Error in the lm output. Q: What is the relationship between them?
Q: How are the covariance matrix related to the correlation matrix? In other words, how to obtain the correlation matrix from the covariance matrix given above?
Let us now examine the correlation between the columns of X.
> cor(model.matrix(gfit)) # the command "model.matrix" returns the model matrix X, which in this case is exactly the matrix formed by the data of the five predictor variables themselves plus the constant vector. This output shows the correlation between the columns of X.
(Intercept) Area Elevation Nearest Scruz Adjacent
(Intercept) 1 NA NA NA NA NA
Area NA 1.000 0.7537 -0.1111 -0.1008 0.1800
Elevation NA 0.754 1.0000 -0.0111 -0.0154 0.5365
Nearest NA -0.111 -0.0111 1.0000 0.6154 -0.1162
Scruz NA -0.101 -0.0154 0.6154 1.0000 0.0517
Adjacent NA 0.180 0.5365 -0.1162 0.0517 1.0000
Notice that
this correlation matrix is related to XTX, rather than the (XTX)-1 in the correlation matrix of β-hat.
Check how different it is from the correlation matrix of β-hat. Q: Can you identify some relationship between the two correlation matrices?
Q: Why do we get the NA's in the output?
> cov(model.matrix(gfit)) # the covariances between the five variables.
(Intercept) Area Elevation Nearest Scruz Adjacent
(Intercept) 0 0 0.0 0.0 0 0
Area 0 746687 274595.6 -1370.4 -5925 134495
Elevation 0 274596 177750.7 -66.7 -443 195531
Nearest 0 -1370 -66.7 203.8 598 -1435
Scruz 0 -5925 -442.8 597.6 4628 3038
Adjacent 0 134495 195531.1 -1434.6 3038 747393
Check how different it is from the covariance matrix of the estimated parameters.
For practice purpose, let us also construct these
quantities directly --- we define Y, X, (XTX)-1,
%*% denotes matrix multiplication, "t()" makes the transpose, and
"solve()" computes the inverse.
> y <- gala$Species
# the response
> rep(1,30)
# make a vector of ones
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> x <- cbind(rep(1,30), gala[,-c(1,2)])
# bind a column of ones to all except the first two columns of the data from gala
> x <- as.matrix(x)
# force x from being a data frame to a matrix x
> xtxi <- solve(t(x) %*% x)
# compute (XTX)-1
> xtxi
#
it is the same as the output obtained from the command "summary(gfit)$cov"
rep(1, 30) Area Elevation Nearest Scruz Adjacent
rep(1, 30) 9.87e-02 3.78e-05 -1.56e-04 -2.34e-04 -3.76e-04 2.31e-05
Area 3.78e-05 1.35e-07 -2.59e-07 1.29e-06 -4.91e-08 4.62e-08
Elevation -1.56e-04 -2.59e-07 7.75e-07 -3.55e-06 3.08e-07 -1.64e-07
Nearest -2.34e-04 1.29e-06 -3.55e-06 2.99e-04 -3.82e-05 1.42e-06
Scruz -3.76e-04 -4.91e-08 3.08e-07 -3.82e-05 1.25e-05 -1.96e-07
Adjacent 2.31e-05 4.62e-08 -1.64e-07 1.42e-06 -1.96e-07 8.43e-08
β-hat, y-hat, ε-hat may all be computed directly or extracted from the linear model object gfit. We do it both ways here --- check whether the results match.
> beta <- xtxi %*% t(x) %*% y
#
the β-hat
> beta
[,1]
rep(1, 30) 7.06822
Area -0.02394
Elevation 0.31946
Nearest 0.00914
Scruz -0.24052
Adjacent -0.07480
However,
it is a bad way to compute β-hat
it is inefficient and can be very inaccurate when the predictors are strongly correlated
An alternative is to directly solve the normal equation XTXb=XTY.
> solve(t(x)%*%x, t(x)%*%y) # solve the normal equation
[,1]
rep(1, 30) 7.06822
Area -0.02394
Elevation 0.31946
Nearest 0.00914
Scruz -0.24052
Adjacent -0.07480
We can extract the regression quantities we need from the lm object.
> names(gfit) # the "names()" command is the way to see the components of an R object
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
> gfits <- summary(gfit)
> names(gfits)
# more regression quantities in
summary of lm object.
[1] "call" "terms" "residuals" "coefficients"
[5] "aliased" "sigma" "df" "r.squared"
[9] "adj.r.squared" "fstatistic" "cov.unscaled"
> deviance(gfit)
# the "deviance" command returns RSS for a regression
fitted model.
[1] 89231
Compare the quantity and the Residual standard error in the lm output. Q: how to obtain the residual standard error from the RSS? Notice that σ-hat=sqrt(RSS/(n-p)).
> gfit$coef # get β-hat from the fitted model gfit
(Intercept) Area Elevation Nearest Scruz Adjacent
7.06822 -0.02394 0.31946 0.00914 -0.24052 -0.07480
> yhat <- x %*% beta
# y-hat=X(β-hat)
> cbind(yhat, gfit$fitted)
# compare if they are identical
[,1] [,2]
Baltra 116.726 116.726
Bartolome -7.273 -7.273
Caldwell 29.331 29.331
... deleted ...
Wolf 26.701 26.701
> res <- y - yhat
# the residuals
> cbind(res, gfit$resid)
# compare if they are identical
[,1] [,2]
Baltra -58.73 -58.73
Bartolome 38.27 38.27
Caldwell -26.33 -26.33
... deleted ...
Wolf -5.70 -5.70
Notice that
the residuals and the fitted values are orthogonal.
> sum(res*yhat) # their inner product = 0
[1] 2.73e-11
actually, the residual vector is orthogonal to any columns of X (therefore, orthogonal to any linear combinations of the columns of X
> t(x) %*% res # XT(ε-hat)
[,1]
rep(1, 30)
4.76e-12
Area 1.54e-09
Elevation
2.91e-11
Nearest
-6.96e-13
Scruz
6.98e-11
Adjacent
-5.07e-10
We now compute the residual sum of squares (RSS), σ-hat, standard error of β-hat, R2, and the correlation matrix for β-hat. Verify that the results agree with the output from before.
> sum(res*res) # the RSS
[1] 89231
> gfit$df # the degrees of freedom, it = 30-6, why?
[1] 24
> sighat <- sqrt(sum(res*res)/gfit$df)
# calculate σ-hat
> sighat
# take a look
[1] 61
> diag(xtxi) # the "diag" command extract the diagonal of a matrix. The output is the diagonal entries of (XTX)-1
rep(1, 30) Area Elevation Nearest Scruz Adjacent
9.87e-02 1.35e-07 7.75e-07 2.99e-04 1.25e-05 8.43e-08
> sqrt(diag(xtxi))*sighat # Q: what's this?
rep(1, 30) Area Elevation Nearest Scruz Adjacent
19.1542 0.0224 0.0537 1.0541 0.2154 0.0177
> summary(gfit)$coef[,2] # compare with the previous outputs
(Intercept) Area Elevation Nearest Scruz Adjacent
19.1542 0.0224 0.0537 1.0541 0.2154 0.0177
> mean(y) # calculate the mean of y
[1] 85.2
> sum((y-mean(y))^2) # the total sum of squares
[1] 381081
> 1 - sum(res*res) / sum((y-mean(y))^2) # the R2
[1] 0.766
> var(yhat)/var(y) # Q: why does it equal R2?
[,1]
[1,] 0.766
> cor(y,yhat)^2 # Q: why does it equal R2?
[,1]
[1,] 0.766
Let's calculate the correlation matrix of β-hat.
> z <- sqrt(diag(xtxi))
> xtxi / z %o% z # the correlation matrix, where %o% means outerproduct
zzT
rep(1, 30) Area Elevation Nearest Scruz Adjacent
rep(1, 30) 1.0000 0.3271 -0.565 -0.0431 -0.3389 0.253
Area 0.3271 1.0000 -0.801 0.2035 -0.0378 0.433
Elevation -0.5650 -0.8014 1.000 -0.2333 0.0991 -0.642
Nearest -0.0431 0.2035 -0.233 1.0000 -0.6257 0.284
Scruz -0.3389 -0.0378 0.099 -0.6257 1.0000 -0.191
Adjacent 0.2533 0.4328 -0.642 0.2839 -0.1910 1.000
Now we make a prediction for the number of species on an island with predictor
values:
Area=0.08, Elevation=93, Nearest=6.0, Scruz=12.0, and Adjacent=0.34.
> xp <- c(1, 0.08, 93, 6.0, 12.0, 0.34)
# don't forget the initial
one
> sum(xp*beta) # the predicted number of species
[1] 33.9