Identifiability (Reading: Faraway(2005, 1st edition), 2.9)

Now, consider the savings data we analyzed in previous lab:

> saving.x <- read.table("saving.x",header=T) # read the data into R

> p15 <- saving.x[,1]; p75 <- saving.x[,2]; inc <- saving.x[,3]; gro <- saving.x[,4]; sav <- saving.x[,5] # assign columns
　

We now create a new variable

pa=100-p15-p75.

Q: What does this variable represent in reality?

> pa <- 100-p15-p75 # calculate pa
> g <- lm(sav~pa+p15+p75+inc+gro) # fit a model with pa and all other predictors
> summary(g) # take a look of the fitted model

Call:

lm(formula = sav ~ pa + p15 + p75 + inc + gro)

Residuals:

Min 1Q Median 3Q Max

-8.2423 -2.6859 -0.2491 2.4278 9.7509

Coefficients: (1 not defined because of singularities)

Estimate Std. Error t value Pr(>|t|)

(Intercept) -1.406e+02 1.025e+02 -1.372 0.1770

pa 1.692e+00 1.084e+00 1.561 0.1255

p15 1.230e+00 9.773e-01 1.259 0.2146

p75 NA NA NA NA

inc -3.368e-04 9.311e-04 -0.362 0.7193

gro 4.097e-01 1.962e-01 2.088 0.0425 *

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.803 on 45 degrees of freedom

Multiple R-Squared: 0.3385, Adjusted R-squared: 0.2797

F-statistic: 5.756 on 4 and 45 DF, p-value: 0.0007902

Note that

we get a message about one undefined coefficient, the term p75,
It is because the rank of the X-matrix is only 5, which is less than its 6 columns.
R automatically fits a reduced model when X^TX is singular.
Let's take a look of the X-matrix.

> x <- model.matrix(g) # form the X matrix

> xtx <- t(x) %*% x # compute X^TX
> solve(xtx) # X^TX cannot be inverted

錯誤在solve.default(xtx) : 系統計算上是獨特的: 互反條件數=3.88081e-023
Check whether the β-hats and t-tests for inc and gro are affected (compare with the result in the previous lab)

If we didn't know which linear combination was causing the trouble, how would we find out? An eigen-decomposition of X^TX can help:

> e <- eigen(xtx) # compute the eigen-decomposition, the "eigen" command returns eigenvalues and eigenvectors of a matrix. Use "help(eigen)" to get more information.
> e$values # the eigenvalues of X^TX

[1] 1.095112e+08 1.096867e+05 3.190823e+03 3.739541e+02 1.368720e+01

[6] 7.288675e-13

Note that only the last eigenvalue is zero (within calculation error), indicating only one linear combination existing in the columns of X.

> signif(e$vectors,3) # the eigenvectors of X^TX, the "signif" command rounds the values to the specified number of significant digits

[,1] [,2] [,3] [,4] [,5] [,6]

[1,] -0.000506 -0.0141 0.00125 -0.000603 0.00989 1.00e+00

[2,] -0.034300 -0.7940 -0.59700 -0.098100 -0.05630 -1.00e-02

[3,] -0.014700 -0.6040 0.79500 0.031000 0.04800 -1.00e-02

[4,] -0.001610 -0.0164 -0.07310 0.006840 0.99700 -1.00e-02

[5,] -0.999000 0.0363 0.00906 0.001170 -0.00036 -2.56e-17

[6,] -0.001740 -0.0594 -0.08310 0.995000 -0.01390 -1.22e-14

The last column, which corresponds to the zero eigenvalue, shows that

1-(0.01)pa-(0.01)p15-(0.01)p75=0,

which is equivalent to

100-pa-p15-p75=0,

is the offending combination.

Notice that

lack of identifiablity is obviously a problem but it's easy to identify and work around
more problematic are cases where we are "close to unidentifiability".

To demonstrate this, suppose we add a small random perturbation from U[-0.005,0.005] to the 3rd decimal place of variable pa, where U denotes the uniform distribution, i.e.,

pa=100-p15-p75+U:

> pae <- pa+0.001*(runif(50)-0.5) # add the small perturbation to pa, the "runif" command generate Uniform[0, 1] sudo random numbers

> p15+p75+pae # take a look of the sum of p15, p75, and pae.

[1] 100.00002 100.00008 99.99992 100.00038 100.00006 99.99967 100.00046

[8] 99.99957 99.99972 100.00021 100.00036 100.00009 99.99993 100.00027

[15] 100.00021 100.00005 100.00022 100.00042 99.99995 99.99995 100.00041

[22] 99.99982 100.00033 100.00019 99.99973 100.00005 99.99997 100.00005

[29] 100.00043 99.99978 99.99965 99.99952 100.00004 99.99969 99.99956

[36] 100.00036 99.99972 100.00036 99.99970 99.99959 99.99987 100.00043

[43] 99.99993 99.99993 99.99951 99.99972 100.00021 100.00017 99.99953

[50] 100.00045

Notice that they are not 100 any more, but very very close.

> ge <- lm(sav~pae+p15+p75+inc+gro) # now, fit the same model again
> summary(ge, cor=T) # take a look of the fitted model

Call:

lm(formula = sav ~ pae + p15 + p75 + inc + gro)

Residuals:

Min 1Q Median 3Q Max

-8.0361 -2.7051 -0.2360 2.3631 9.6476

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 4.463e+04 1.934e+05 0.231 0.8186

pae -4.460e+02 1.934e+03 -0.231 0.8187

p15 -4.465e+02 1.934e+03 -0.231 0.8185

p75 -4.477e+02 1.934e+03 -0.231 0.8180

inc -3.817e-04 9.610e-04 -0.397 0.6932

gro 4.028e-01 2.005e-01 2.009 0.0507 .

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.843 on 44 degrees of freedom

Multiple R-Squared: 0.3393, Adjusted R-squared: 0.2642

F-statistic: 4.518 on 5 and 44 DF, p-value: 0.002071

Correlation of Coefficients:

(Intercept) pae p15 p75 inc

pae -1.00

p15 -1.00 1.00

p75 -1.00 1.00 1.00

inc -0.20 0.20 0.20 0.20

gro -0.15 0.15 0.15 0.15 0.28

Note that (compared with model g)

all parameters can be estimated
the standard errors of pae, p15, and p75, become very large because we cannot estimate them in a stable way
also, pay attention to the correlation of β-hat, which show the sign of strong collinearity
Please do an eigen-decomposition of X^TX for the case by yourself and interpret your result.