Parameter Estimation (Reading: Faraway (2005, 1st edition), section 2.8)

 

Here is a data set concerning the number of species of tortoise on the various Galapagos Islands. There are 30 cases (Islands) and 7 variables in the data set. We start by reading the data into R.

> gala <- read.table("gala.data") # read the data into R
> gala # take a look

             Species Endemics    Area Elevation Nearest Scruz Adjacent

Baltra            58       23   25.09       346     0.6   0.6     1.84

Bartolome         31       21    1.24       109     0.6  26.3   572.33

Caldwell           3        3    0.21       114     2.8  58.7     0.78

... much deleted ...

Wolf              21       12    2.85       253    34.1 254.7     2.33

 

The variables are

Species the number of species of tortoise found on the island
Endemics the number of endemic species
Area the area of the island (km2)
Elevation the highest elevation of the island (m)
Nearest the distance from the nearest island (km)
Scruz the distance from Santa Cruz island (km)
Adjacent the area of the adjacent island (km2)

In this dataset,

 

Now fit a model. Let us start with a simple model, in which

Speciesi = β0 + β1*Areai + β2*Elevationi + β3*Nearesti + β4*Scruzi + β5*Adjacenti + εi,
    i=1, 2, ..., 30.

 

The command to fit a linear model in R is:

> gfit <- lm(Species~Area+Elevation+Nearest+Scruz+Adjacent, data=gala) # "lm" is the command in R to fit a linear model. Use "help(lm)" to understand the details of the command. The option "data=" specifies the dataset that will be used. Also, use "help(formula)" to learn how to express a model in R.
> summary(gfit) # all the usual regression stuff

Call:
lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent,

    data = gala)

Residuals:

Min      1Q      Median 3Q     Max 

-111.679 -34.898 -7.862 33.460 182.584 

Coefficients:

                Estimate  Std. Error  t value  Pr(>|t|) 

(Intercept)     7.068221   19.154198    0.369  0.715351 

Area           -0.023938    0.022422   -1.068  0.296318 

Elevation       0.319465    0.053663    5.953  3.82e-06 ***

Nearest         0.009144    1.054136    0.009  0.993151 

Scruz          -0.240524    0.215402   -1.117  0.275208 

Adjacent       -0.074805    0.017700   -4.226  0.000297 ***

---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

Residual standard error: 60.98 on 24 degrees of freedom

Multiple R-Squared: 0.7658, Adjusted R-squared: 0.7171 

F-statistic: 15.7 on 5 and 24 DF, p-value: 6.838e-07

 Notice that

The default of "summay(lm.object)" does not print the correlation matrix.

> summary(gfit, cor=T) # with the option "cor=T", the output will contain the correlation matrix of the estimated parameters

> options(digits=3) # the option "digits" controls the number of digits to print

> summary(gfit, cor=T)$cor # take a look of the correlation matrix of β's estimators

            (Intercept)    Area Elevation Nearest   Scruz Adjacent

(Intercept)      1.0000  0.3271    -0.565 -0.0431 -0.3389    0.253

Area             0.3271  1.0000    -0.801  0.2035 -0.0378    0.433

Elevation       -0.5650 -0.8014     1.000 -0.2333  0.0991   -0.642

Nearest         -0.0431  0.2035    -0.233  1.0000 -0.6257    0.284

Scruz           -0.3389 -0.0378     0.099 -0.6257  1.0000   -0.191

Adjacent         0.2533  0.4328    -0.642  0.2839 -0.1910    1.000

Notice that

> summary(gfit)$cov * (summary(gfit)$sigma^2) # the covariance matrix of the β's estimators, the command "summary(gfit)$cov" returns (XTX)-1

            (Intercept)      Area Elevation  Nearest     Scruz  Adjacent

(Intercept)    366.8833  0.140474 -0.580739 -0.86964 -1.398067  0.085879

Area             0.1405  0.000503 -0.000964  0.00481 -0.000183  0.000172

Elevation       -0.5807 -0.000964  0.002880 -0.01320  0.001145 -0.000610

Nearest         -0.8696  0.004811 -0.013196  1.11120 -0.142067  0.005297

Scruz           -1.3981 -0.000183  0.001145 -0.14207  0.046398 -0.000728

Adjacent         0.0859  0.000172 -0.000610  0.00530 -0.000728  0.000313

Notice that in the covariance matrix

Let us now examine the correlation between the columns of X.

> cor(model.matrix(gfit)) # the command "model.matrix" returns the model matrix X, which in this case is exactly the matrix formed by the data of the five predictor variables themselves plus the constant vector. This output shows the correlation between the columns of X.

            (Intercept)   Area Elevation Nearest   Scruz Adjacent

(Intercept)           1     NA        NA      NA      NA       NA

Area                 NA  1.000    0.7537 -0.1111 -0.1008   0.1800

Elevation            NA  0.754    1.0000 -0.0111 -0.0154   0.5365

Nearest              NA -0.111   -0.0111  1.0000  0.6154  -0.1162

Scruz                NA -0.101   -0.0154  0.6154  1.0000   0.0517

Adjacent             NA  0.180    0.5365 -0.1162  0.0517   1.0000

Notice that

> cov(model.matrix(gfit)) # the covariances between the five variables.

            (Intercept)   Area Elevation Nearest Scruz Adjacent

(Intercept)           0      0       0.0     0.0     0        0

Area                  0 746687  274595.6 -1370.4 -5925   134495

Elevation             0 274596  177750.7   -66.7  -443   195531

Nearest               0  -1370     -66.7   203.8   598    -1435

Scruz                 0  -5925    -442.8   597.6  4628     3038

Adjacent              0 134495  195531.1 -1434.6  3038   747393

Check how different it is from the covariance matrix of the estimated parameters.

 

For practice purpose, let us also construct these quantities directly --- we define Y, X, (XTX)-1, %*% denotes matrix multiplication, "t()" makes the transpose, and "solve()" computes the inverse.
> y <- gala$Species # the response
> rep(1,30) # make a vector of ones

[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

> x <- cbind(rep(1,30), gala[,-c(1,2)]) # bind a column of ones to all except the first two columns of the data from gala
> x <- as.matrix(x) # force x from being a data frame to a matrix x
> xtxi <- solve(t(x) %*% x) # compute (XTX)-1
> xtxi # it is the same as the output obtained from the command "summary(gfit)$cov"

           rep(1, 30)      Area Elevation   Nearest     Scruz  Adjacent

rep(1, 30)   9.87e-02  3.78e-05 -1.56e-04 -2.34e-04 -3.76e-04  2.31e-05

Area         3.78e-05  1.35e-07 -2.59e-07  1.29e-06 -4.91e-08  4.62e-08

Elevation   -1.56e-04 -2.59e-07  7.75e-07 -3.55e-06  3.08e-07 -1.64e-07

Nearest     -2.34e-04  1.29e-06 -3.55e-06  2.99e-04 -3.82e-05  1.42e-06

Scruz       -3.76e-04 -4.91e-08  3.08e-07 -3.82e-05  1.25e-05 -1.96e-07

Adjacent     2.31e-05  4.62e-08 -1.64e-07  1.42e-06 -1.96e-07  8.43e-08

β-hat, y-hat, ε-hat may all be computed directly or extracted from the linear model object gfit. We do it both ways here --- check whether the results match.

> beta <- xtxi %*% t(x) %*% y # the β-hat
> beta 

               [,1]

rep(1, 30)  7.06822

Area       -0.02394

Elevation   0.31946

Nearest     0.00914

Scruz      -0.24052

Adjacent   -0.07480

However,

We can extract the regression quantities we need from the lm object.

> names(gfit) # the "names()" command is the way to see the components of an R object

[1] "coefficients"  "residuals"     "effects"       "rank"        

[5] "fitted.values" "assign"        "qr"            "df.residual" 

[9] "xlevels"       "call"          "terms"         "model"   

> gfits <- summary(gfit)
> names(gfits)
# more regression quantities in summary of lm object.

[1] "call"          "terms"         "residuals"     "coefficients"

[5] "aliased"       "sigma"         "df"            "r.squared"   

[9] "adj.r.squared" "fstatistic"    "cov.unscaled"

> deviance(gfit) # the "deviance" command returns RSS for a regression fitted model.

[1] 89231

Compare the quantity and the Residual standard error in the lm output. Q: how to obtain the residual standard error from the RSS? Notice that σ-hat=sqrt(RSS/(n-p)).

> gfit$coef # get β-hat from the fitted model gfit

(Intercept)        Area   Elevation     Nearest       Scruz    Adjacent

    7.06822    -0.02394     0.31946     0.00914    -0.24052    -0.07480

> yhat <- x %*% beta # y-hat=X(β-hat)
> cbind(yhat, gfit$fitted) # compare if they are identical

                [,1]    [,2]

Baltra       116.726 116.726

Bartolome     -7.273  -7.273

Caldwell      29.331  29.331

 ... deleted ...

Wolf          26.701  26.701

> res <- y - yhat # the residuals
> cbind(res, gfit$resid) # compare if they are identical

                [,1]    [,2]

Baltra        -58.73  -58.73

Bartolome      38.27   38.27

Caldwell      -26.33  -26.33

 ... deleted ...

Wolf           -5.70   -5.70

Notice that

 

We now compute the residual sum of squares (RSS), σ-hat, standard error of β-hat, R2, and the correlation matrix for β-hat. Verify that the results agree with the output from before.

> sum(res*res) # the RSS

[1] 89231

> gfit$df # the degrees of freedom, it = 30-6, why?

[1] 24

> sighat <- sqrt(sum(res*res)/gfit$df) # calculate σ-hat
> sighat # take a look

[1] 61

> diag(xtxi) # the "diag" command extract the diagonal of a matrix. The output is the diagonal entries of (XTX)-1

rep(1, 30)       Area  Elevation    Nearest      Scruz   Adjacent

  9.87e-02   1.35e-07   7.75e-07   2.99e-04   1.25e-05   8.43e-08

> sqrt(diag(xtxi))*sighat  # Q: what's this?

rep(1, 30)       Area  Elevation    Nearest      Scruz   Adjacent

   19.1542     0.0224     0.0537     1.0541     0.2154     0.0177

> summary(gfit)$coef[,2]  # compare with the previous outputs

(Intercept)        Area   Elevation     Nearest       Scruz    Adjacent

    19.1542      0.0224      0.0537      1.0541      0.2154      0.0177

> mean(y) # calculate the mean of y

[1] 85.2

> sum((y-mean(y))^2) # the total sum of squares

[1] 381081

> 1 - sum(res*res) / sum((y-mean(y))^2) # the R2

[1] 0.766

> var(yhat)/var(y) # Q: why does it equal R2?

      [,1]

[1,] 0.766

> cor(y,yhat)^2 # Q: why does it equal R2?

      [,1]

[1,] 0.766


Let's calculate the correlation matrix of β-hat.

> z <- sqrt(diag(xtxi))
> xtxi / z %o% z # the correlation matrix, where %o% means outerproduct zzT

           rep(1, 30)    Area Elevation Nearest   Scruz Adjacent

rep(1, 30)     1.0000  0.3271    -0.565 -0.0431 -0.3389    0.253

Area           0.3271  1.0000    -0.801  0.2035 -0.0378    0.433

Elevation     -0.5650 -0.8014     1.000 -0.2333  0.0991   -0.642

Nearest       -0.0431  0.2035    -0.233  1.0000 -0.6257    0.284

Scruz         -0.3389 -0.0378     0.099 -0.6257  1.0000   -0.191

Adjacent       0.2533  0.4328    -0.642  0.2839 -0.1910    1.000


Now we make a prediction for the number of species on an island with predictor values: Area=0.08, Elevation=93, Nearest=6.0, Scruz=12.0, and Adjacent=0.34.

> xp <- c(1, 0.08, 93, 6.0, 12.0, 0.34) # don't forget the initial one
> sum(xp*beta) # the predicted number of species

[1] 33.9