Now, consider the savings data we analyzed in previous lab:
> saving.x <- read.table("saving.x",header=T) # read the data into R
> p15 <- saving.x[,1]; p75 <-
saving.x[,2]; inc <- saving.x[,3]; gro <- saving.x[,4];
sav <- saving.x[,5] # assign columns
We now create a new variable
pa=100-p15-p75.
Q: What does this variable represent in reality?
> pa <- 100-p15-p75 # calculate pa
> g <-
lm(sav~pa+p15+p75+inc+gro) # fit a model with pa and all other
predictors
> summary(g) # take a look of the fitted model
Call:
lm(formula = sav ~ pa + p15 + p75 + inc + gro)
Residuals:
Min 1Q Median 3Q Max
-8.2423 -2.6859 -0.2491 2.4278 9.7509
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.406e+02 1.025e+02 -1.372 0.1770
pa 1.692e+00 1.084e+00 1.561 0.1255
p15 1.230e+00 9.773e-01 1.259 0.2146
p75 NA NA NA NA
inc -3.368e-04 9.311e-04 -0.362 0.7193
gro 4.097e-01 1.962e-01 2.088 0.0425 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.803 on 45 degrees of freedom
Multiple R-Squared: 0.3385, Adjusted R-squared: 0.2797
F-statistic: 5.756 on 4 and 45 DF, p-value: 0.0007902
Note that
we get a message about one undefined coefficient, the term p75,
It is because the rank of the X-matrix is only 5, which is less than its 6 columns.
R automatically fits a reduced model when XTX is singular.
Let's take a look of the X-matrix.
> x <- model.matrix(g) # form the X matrix
> xtx <- t(x) %*% x # compute
XTX
> solve(xtx)
# XTX cannot
be inverted
錯誤在solve.default(xtx) : 系統計算上是獨特的: 互反條件數=3.88081e-023
Check whether the β-hats and t-tests for inc and gro are affected (compare with the result in the previous lab)
If we didn't know which linear combination was causing the trouble, how would we find out? An eigen-decomposition of XTX can help:
> e <- eigen(xtx) # compute the eigen-decomposition, the
"eigen" command returns eigenvalues and eigenvectors of a matrix. Use "help(eigen)"
to get more information.
> e$values # the eigenvalues of XTX
[1] 1.095112e+08 1.096867e+05 3.190823e+03 3.739541e+02 1.368720e+01
[6] 7.288675e-13
Note that only the last eigenvalue is zero (within calculation error), indicating only one linear combination existing in the columns of X.
> signif(e$vectors,3) # the eigenvectors of XTX, the "signif" command rounds the values to the specified number of significant digits
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.000506 -0.0141 0.00125 -0.000603 0.00989 1.00e+00
[2,] -0.034300 -0.7940 -0.59700 -0.098100 -0.05630 -1.00e-02
[3,] -0.014700 -0.6040 0.79500 0.031000 0.04800 -1.00e-02
[4,] -0.001610 -0.0164 -0.07310 0.006840 0.99700 -1.00e-02
[5,] -0.999000 0.0363 0.00906 0.001170 -0.00036 -2.56e-17
[6,] -0.001740 -0.0594 -0.08310 0.995000 -0.01390 -1.22e-14
The last column, which corresponds to the zero eigenvalue, shows that
1-(0.01)pa-(0.01)p15-(0.01)p75=0,
which is equivalent to
100-pa-p15-p75=0,
is the offending combination.
Notice that
lack of identifiablity is obviously a problem but it's easy to identify and work around
more problematic are cases where we are "close to unidentifiability".
To demonstrate this, suppose we add a small random perturbation from U[-0.005,0.005] to the 3rd decimal place of variable pa, where U denotes the uniform distribution, i.e.,
pa=100-p15-p75+U:
> pae <- pa+0.001*(runif(50)-0.5) # add the small perturbation to pa, the "runif" command generate Uniform[0, 1] sudo random numbers
> p15+p75+pae # take a look of the sum of p15, p75, and pae.
[1] 100.00002 100.00008 99.99992 100.00038 100.00006 99.99967 100.00046
[8] 99.99957 99.99972 100.00021 100.00036 100.00009 99.99993 100.00027
[15] 100.00021 100.00005 100.00022 100.00042 99.99995 99.99995 100.00041
[22] 99.99982 100.00033 100.00019 99.99973 100.00005 99.99997 100.00005
[29] 100.00043 99.99978 99.99965 99.99952 100.00004 99.99969 99.99956
[36] 100.00036 99.99972 100.00036 99.99970 99.99959 99.99987 100.00043
[43] 99.99993 99.99993 99.99951 99.99972 100.00021 100.00017 99.99953
[50] 100.00045
Notice that they are not 100 any more, but very very close.
> ge <-
lm(sav~pae+p15+p75+inc+gro) # now, fit the same model again
> summary(ge, cor=T) # take a look of the fitted model
Call:
lm(formula = sav ~ pae + p15 + p75 + inc + gro)
Residuals:
Min 1Q Median 3Q Max
-8.0361 -2.7051 -0.2360 2.3631 9.6476
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.463e+04 1.934e+05 0.231 0.8186
pae -4.460e+02 1.934e+03 -0.231 0.8187
p15 -4.465e+02 1.934e+03 -0.231 0.8185
p75 -4.477e+02 1.934e+03 -0.231 0.8180
inc -3.817e-04 9.610e-04 -0.397 0.6932
gro 4.028e-01 2.005e-01 2.009 0.0507 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.843 on 44 degrees of freedom
Multiple R-Squared: 0.3393, Adjusted R-squared: 0.2642
F-statistic: 4.518 on 5 and 44 DF, p-value: 0.002071
Correlation of Coefficients:
(Intercept) pae p15 p75 inc
pae -1.00
p15 -1.00 1.00
p75 -1.00 1.00 1.00
inc -0.20 0.20 0.20 0.20
gro -0.15 0.15 0.15 0.15 0.28
Note that (compared with model g)
all parameters can be estimated
the standard errors of pae, p15, and p75, become very large because we cannot estimate them in a stable way
also, pay attention to the correlation of β-hat, which show the sign of strong collinearity
Please do an eigen-decomposition of XTX for the case by yourself and interpret your result.