NTHU STAT 5410 Lab - Collinearity
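
The regression output below appears to come from fitting the number employed on all six other variables of the built-in longley data. A minimal sketch of that fit, where the object name g is an assumption:

```r
# Full model: Employed regressed on the six remaining longley variables
data(longley)                          # built-in data set, 16 yearly observations
g <- lm(Employed ~ ., data = longley)  # g is a hypothetical name for the fit
summary(g)
```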

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  -3.482e+03  8.904e+02  -3.911 0.003560 **
GNP.deflator  1.506e-02  8.492e-02   0.177 0.863141
GNP          -3.582e-02  3.349e-02  -1.070 0.312681
Unemployed   -2.020e-02  4.884e-03  -4.136 0.002535 **
Armed.Forces -1.033e-02  2.143e-03  -4.822 0.000944 ***
Population   -5.110e-02  2.261e-01  -0.226 0.826212
Year          1.829e+00  4.555e-01   4.016 0.003037 **
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.3049 on 9 degrees of freedom
Multiple R-Squared: 0.9955, Adjusted R-squared: 0.9925
F-statistic: 330.3 on 6 and 9 DF, p-value: 4.984e-10

Recall that the response is the number of people employed. Three of the predictors have large p-values, yet all are variables that might be expected to affect the response. Why aren't they significant? Check the correlation matrix of the predictors first (rounded to 3 digits for convenience):
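
A short sketch of how this correlation matrix can be obtained, assuming the predictors are the first six columns of longley (column 7 is Employed); the name pred_cor is hypothetical:

```r
# Pairwise correlations among the six predictors, rounded to 3 digits
data(longley)
pred_cor <- round(cor(longley[, -7]), 3)  # drop column 7, the response
pred_cor
```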

             GNP.deflator   GNP Unemployed Armed.Forces Population  Year
GNP.deflator        1.000 0.992      0.621        0.465      0.979 0.991
GNP                 0.992 1.000      0.604        0.446      0.991 0.995
Unemployed          0.621 0.604      1.000       -0.177      0.687 0.668
Armed.Forces        0.465 0.446     -0.177        1.000      0.364 0.417
Population          0.979 0.991      0.687        0.364      1.000 0.994
Year                0.991 0.995      0.668        0.417      0.994 1.000

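There are several large pairwise correlations, which also show up in the pairwise scatter plots of the predictors. Next, examine the eigen-decomposition of X'X, where X is the matrix of the six predictors: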
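
The eigen output that follows can be reproduced along these lines. This is a sketch: the names x, e and cond_num are assumptions, and the condition numbers are taken as the square root of the largest eigenvalue divided by each eigenvalue.

```r
# Eigen-decomposition of X'X for the six predictors (no intercept column)
x <- as.matrix(longley[, -7])
e <- eigen(t(x) %*% x)
e
# Condition numbers: sqrt(largest eigenvalue / each eigenvalue)
cond_num <- sqrt(e$values[1] / e$values)
cond_num
```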

$values
[1] 6.665299e+07 2.090730e+05 1.053550e+05 1.803976e+04 2.455730e+01 2.015117e+00

$vectors
            [,1]        [,2]        [,3]        [,4]        [,5]          [,6]
[1,] -0.04990131  0.06979071 -0.03416853  0.04265870  0.95653127 -0.2733126381
[2,] -0.19075418  0.72496814 -0.34330489  0.55402997 -0.07487553  0.0872940138
[3,] -0.15702286  0.62152746  0.56371985 -0.52067703 -0.00716578  0.0105568115
[4,] -0.12796016  0.10434859 -0.74630465 -0.64468394 -0.01222896 -0.0001122542
[5,] -0.05758090  0.03841364 -0.01095845  0.03583083 -0.28108541 -0.9564496276
[6,] -0.95748481 -0.26625145  0.07812474  0.05679111 -0.01522131  0.0526555591

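There is a wide range in the eigenvalues, and several condition numbers (the square root of the largest eigenvalue divided by each of the others) are large; the largest is about 5751. This means that problems are being caused by more than just one linear combination. Now check out the variance inflation factors (VIF). For the first variable (GNP.deflator), the VIF is 1/(1-R^2), where R^2 comes from regressing that variable on the other five predictors. The result is large; the VIF for orthogonal predictors is 1. Here is a function to compute all the VIF's in one go: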
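
The VIF computation can be sketched as below. The helper vif_all is hypothetical, not a function from the lab or from any package. It uses the identity that the VIFs are the diagonal of the inverse of the predictors' correlation matrix.

```r
x <- as.matrix(longley[, -7])

# VIF of the first predictor: 1 / (1 - R^2) from regressing it on the rest
r2_first <- summary(lm(x[, 1] ~ x[, -1]))$r.squared
vif_first <- 1 / (1 - r2_first)
vif_first

# Hypothetical helper: all VIFs at once, as the diagonal of the
# inverse of the correlation matrix of the predictors
vif_all <- function(m) {
  v <- diag(solve(cor(m)))
  names(v) <- colnames(m)
  v
}
vifs <- vif_all(x)
vifs
```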

There's definitely a lot of variance inflation! For example, we can interpret sqrt(1788)=42.28 as telling us that the standard error for GNP is 42 times larger than it would have been without collinearity. How can we get rid of this problem? One way is to throw out some of the variables. Examine the full correlation matrix above. Notice that variables 3 and 4 (Unemployed and Armed.Forces) do not have extremely large pairwise correlations with the other variables so we should keep them and focus on the others for candidates for removal:
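
A sketch of how to look at the remaining four candidates, assuming x is the predictor matrix built from longley; the name cand_cor is hypothetical:

```r
x <- as.matrix(longley[, -7])
# Drop columns 3 and 4 (Unemployed, Armed.Forces); keep the other four
cand_cor <- cor(x[, -c(3, 4)])
round(cand_cor, 3)
```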

These four variables are strongly correlated with each other, so any one of them could do the job of representing the others. We pick Year arbitrarily:
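
The reduced fit reported next can be sketched as follows; the object names reduced and full are assumptions. The last line puts the two R-squared values side by side for the comparison asked for further on.

```r
# Reduced model: keep Armed.Forces and Unemployed, let Year stand in for the rest
reduced <- lm(Employed ~ Armed.Forces + Unemployed + Year, data = longley)
full <- lm(Employed ~ ., data = longley)
summary(reduced)
c(full = summary(full)$r.squared, reduced = summary(reduced)$r.squared)
```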

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.797e+03  6.864e+01 -26.183 5.89e-12 ***
Armed.Forces -7.723e-03  1.837e-03  -4.204  0.00122 **
Unemployed   -1.470e-02  1.671e-03  -8.793 1.41e-06 ***
Year          9.564e-01  3.553e-02  26.921 4.24e-12 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.3321 on 12 degrees of freedom
Multiple R-Squared: 0.9928, Adjusted R-squared: 0.9911
F-statistic: 555.2 on 3 and 12 DF, p-value: 3.916e-13

Compare this with the original fit - what do you find? We see that the fit is very similar but only three rather than six predictors are used.

One final point: extreme collinearity can cause problems in computing the estimates. Look at what happens when we use the direct formula for β-hat after multiplying the first predictor by 100:

> xsc <- as.matrix(cbind(1, 100*x[,1], x[,2:6]))
> solve(t(xsc) %*% xsc) %*% t(xsc) %*% longley$Employed

R, like most statistical packages, uses a more numerically stable method for computing the estimates in lm().
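
A sketch contrasting the two routes on the rescaled data. The tryCatch wrapper and the names direct and lm_fit are assumptions added so that a failure of the direct formula is recorded rather than stopping the script; lm() works from a QR decomposition of the design matrix instead of inverting X'X.

```r
x <- as.matrix(longley[, -7])
xsc <- as.matrix(cbind(1, 100 * x[, 1], x[, 2:6]))

# Direct normal-equations formula; keep the error message if solve() fails
direct <- tryCatch(
  solve(t(xsc) %*% xsc) %*% t(xsc) %*% longley$Employed,
  error = function(err) conditionMessage(err)
)
direct

# lm() fits the same model through a QR decomposition
lm_fit <- coef(lm(longley$Employed ~ xsc - 1))
lm_fit
```

The second coefficient from lm() should be the original GNP.deflator coefficient divided by 100, since that column was multiplied by 100.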

Call:
lm(formula = longley$Employed ~ xsc - 1)

Coefficients:
            xsc              xsc           xscGNP    xscUnemployed  xscArmed.Forces
     -3.482e+03        1.506e-04       -3.582e-02       -2.020e-02       -1.033e-02
  xscPopulation          xscYear
     -5.110e-02        1.829e+00

Principal Components (Reading: Faraway (2005, 1st edition), 9.1)

Look at the relative sizes of the eigenvalues of X'X found earlier: the first is big. Consider the first eigenvector (column) below:

> dimnames(e$vectors) <- list(c("GNP def","GNP","Unem","Armed","Popn","Year"),paste("EV",1:6))

               EV 1        EV 2        EV 3        EV 4        EV 5          EV 6
GNP def -0.04990131  0.06979071 -0.03416853  0.04265870  0.95653127 -0.2733126381
GNP     -0.19075418  0.72496814 -0.34330489  0.55402997 -0.07487553  0.0872940138
Unem    -0.15702286  0.62152746  0.56371985 -0.52067703 -0.00716578  0.0105568115
Armed   -0.12796016  0.10434859 -0.74630465 -0.64468394 -0.01222896 -0.0001122542
Popn    -0.05758090  0.03841364 -0.01095845  0.03583083 -0.28108541 -0.9564496276
Year    -0.95748481 -0.26625145  0.07812474  0.05679111 -0.01522131  0.0526555591

The first eigenvector is dominated by Year (the coefficient for Year in the 1st eigenvector is -0.95748481, which is much larger in absolute value than the other coefficients). Why is this? Now examine the X-matrix. What are the scales of the variables?

     GNP deflator     GNP Unemployed Armed Forces Population Year
1947         83.0 234.289      235.6        159.0    107.608 1947
1948         88.5 259.426      232.5        145.6    108.632 1948
1949         88.2 258.054      368.2        161.6    109.773 1949
1950         89.5 284.599      335.1        165.0    110.929 1950
1951         96.2 328.975      209.9        309.9    112.075 1951
1952         98.1 346.999      193.2        359.4    113.270 1952
1953         99.0 365.385      187.0        354.7    115.094 1953
1954        100.0 363.112      357.8        335.0    116.219 1954
1955        101.2 397.469      290.4        304.8    117.388 1955
1956        104.6 419.180      282.2        285.7    118.734 1956
1957        108.4 442.769      293.6        279.8    120.445 1957
1958        110.8 444.546      468.1        263.7    121.950 1958
1959        112.6 482.704      381.3        255.2    123.366 1959
1960        114.2 502.601      393.1        251.4    125.368 1960
1961        115.7 518.173      480.6        257.2    127.852 1961
1962        116.9 554.894      400.7        282.7    130.081 1962

We see that these predictors take values in very different ranges. Year stands out in the 1st principal component because its values are much larger than those of the other predictors. It might make more sense to center the predictors before trying principal components. This is equivalent to doing principal components on the covariance matrix:
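
A sketch of that equivalence: the eigenvalues of var(x) match those of the centered cross-product matrix divided by n-1. The names cx and e_centered are assumptions.

```r
x <- as.matrix(longley[, -7])
e <- eigen(var(x))                              # PCA on the covariance matrix
cx <- scale(x, center = TRUE, scale = FALSE)    # centered X-matrix
e_centered <- eigen(t(cx) %*% cx / (nrow(x) - 1))
round(e$values, 3)
round(e_centered$values, 3)
```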

> dimnames(e$vectors) <- list(c("GNP def","GNP","Unem","Armed","Popn","Year"),paste("EV",1:6))

               EV 1        EV 2        EV 3        EV 4         EV 5         EV 6
GNP def -0.08247788  0.03437017  0.04184609  0.95671439 -0.269588027  0.047800192
GNP     -0.75623528  0.31936977  0.55729097 -0.07632282  0.076772448 -0.061795997
Unem    -0.62622375 -0.58026266 -0.52050055 -0.00739462  0.008954116 -0.009131889
Armed   -0.15750866  0.74819438 -0.64438951 -0.01231829 -0.000605424 -0.002500043
Popn    -0.05439163  0.01348608  0.03425705 -0.27999345 -0.927886716  0.237310015
Year    -0.03717521  0.01184015  0.01853299  0.01642086  0.245711186  0.968241040

Can you give these P.C.'s (at least the first 3 P.C.'s) an interpretation based on the values in their eigenvectors? Now GNP, Unemployed, and Armed Forces play the most important roles in the first three principal components, whose eigenvalues are large compared with the remaining eigenvalues. Why is this? Let's check the variances of these variables (see the covariance matrix below). What do you observe, and can you now understand why it happens?

             GNP.deflator       GNP Unemployed Armed.Forces Population      Year
GNP.deflator    116.45762 1063.6041   625.8666     349.0254   73.50300  50.92333
GNP            1063.60412 9879.3537  5612.4370    3088.0428  685.24094 470.97790
Unemployed      625.86663 5612.4370  8732.2343   -1153.7876  446.27415 297.30333
Armed.Forces    349.02538 3088.0428 -1153.7876    4843.0410  176.40981 138.24333
Population       73.50300  685.2409   446.2742     176.4098   48.38735  32.91740
Year             50.92333  470.9779   297.3033     138.2433   32.91740  22.66667

We can create the orthogonalized predictors (the Z=XU operation) and look at their covariance matrix as follows:
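
A sketch of the Z=XU step, assuming e holds the eigen-decomposition of the covariance matrix of x and that the result is stored as xpc, the name used in the regression output further on:

```r
x <- as.matrix(longley[, -7])
e <- eigen(var(x))
dimnames(e$vectors) <- list(colnames(x), paste("EV", 1:6))
xpc <- x %*% e$vectors      # Z = XU: one column per principal component
round(var(xpc), 3)          # diagonal matrix; the diagonal holds the eigenvalues
```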

         EV 1     EV 2     EV 3  EV 4  EV 5  EV 6
EV 1 15358.52    0.000    0.000 0.000 0.000 0.000
EV 2     0.00 7077.406    0.000 0.000 0.000 0.000
EV 3     0.00    0.000 1204.409 0.000 0.000 0.000
EV 4     0.00    0.000    0.000 1.639 0.000 0.000
EV 5     0.00    0.000    0.000 0.000 0.139 0.000
EV 6     0.00    0.000    0.000 0.000 0.000 0.028

We can see that these new predictors, i.e. the principal components, are mutually orthogonal; note that their variances are equal to the corresponding eigenvalues. Because these P.C.'s are obtained from the non-centered X-matrix, they are not orthogonal to the constant term. You can check this by regressing the response on xpc and looking at the correlation of the coefficients; that output is shown after the centering steps below.

> xmean <- apply(x, 2, mean) # column means of x
> cx <- sweep(x, 2, xmean) # subtract its mean from each column of x

We now create the orthogonalized predictors using the centered X-matrix:
> cxpc <- cx %*% e$vectors

Try to verify whether the new predictors in cxpc are mutually orthogonal and orthogonal to the constant term in a regression analysis. For comparison, here is the fit on the non-centered xpc:

Call:
lm(formula = y ~ xpc)

Residuals:
     Min       1Q   Median       3Q      Max
-0.41011 -0.15767 -0.02816  0.10155  0.45539

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.482e+03  8.904e+02  -3.911  0.00356 **
xpcEV 1     -2.510e-02  6.351e-04 -39.511 2.12e-11 ***
xpcEV 2      1.404e-02  9.356e-04  15.004 1.13e-07 ***
xpcEV 3      2.999e-02  2.268e-03  13.223 3.36e-07 ***
xpcEV 4      6.177e-02  6.149e-02   1.005  0.34137
xpcEV 5      4.899e-01  2.115e-01   2.316  0.04577 *
xpcEV 6      1.762e+00  4.673e-01   3.770  0.00441 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3049 on 9 degrees of freedom
Multiple R-Squared: 0.9955, Adjusted R-squared: 0.9925
F-statistic: 330.3 on 6 and 9 DF, p-value: 4.984e-10

Correlation of Coefficients:
        (Intercept) xpcEV 1 xpcEV 2 xpcEV 3 xpcEV 4 xpcEV 5
xpcEV 1        0.00
xpcEV 2        0.00    0.00
xpcEV 3        0.00    0.00    0.00
xpcEV 4        0.00    0.00    0.00    0.00
xpcEV 5       -0.09    0.00    0.00    0.00    0.00
xpcEV 6       -1.00    0.00    0.00    0.00    0.00    0.00

In the fitted model, the estimates of β, their interpretation, and the t-tests are much simplified because of the orthogonality between the predictors.
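
For the centered components, one possible check is sketched below. It assumes e is the eigen-decomposition of var(x); cx and cxpc follow the names used in the text. Orthogonality to the constant term amounts to each column summing to zero.

```r
x <- as.matrix(longley[, -7])
e <- eigen(var(x))
cx <- sweep(x, 2, colMeans(x))   # centered X-matrix
cxpc <- cx %*% e$vectors
round(crossprod(cxpc), 3)        # off-diagonal entries near 0: mutually orthogonal
round(colSums(cxpc), 6)          # near 0: orthogonal to a column of ones
```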

Suppose you feel that the P.C.'s should not be influenced by the means and variances of the predictors. You may then like to normalize the predictors to a common scale before trying principal components. This is equivalent to applying the eigen-decomposition to their correlation matrix:
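
A sketch of this step, assuming x is the predictor matrix and e is reused as the name of the decomposition:

```r
x <- as.matrix(longley[, -7])
e <- eigen(cor(x))     # PCA on the correlation matrix
e$values               # the eigenvalues sum to 6, the number of predictors
```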

> dimnames(e$vectors) <- list(c("GNP def","GNP","Unem","Armed","Popn","Year"),paste("EV",1:6))

             EV 1          EV 2       EV 3         EV 4         EV 5        EV 6
GNP def 0.4618349  0.0578427677  0.1491199  0.792873559 -0.337937826  0.13518707
GNP     0.4615043  0.0532122862  0.2776823 -0.121621225  0.149573192 -0.81848082
Unem    0.3213167 -0.5955137627 -0.7283057  0.007645795 -0.009231961 -0.10745268
Armed   0.2015097  0.7981925480 -0.5616075 -0.077254979 -0.024252472 -0.01797096
Popn    0.4622794 -0.0455444698  0.1959846 -0.589744965 -0.548578173  0.31157087
Year    0.4649403  0.0006187884  0.1281157 -0.052286554  0.749542836  0.45040888

Can you give these P.C.'s (at least the first 3 P.C's) an interpretation based on the values in their eigenvectors?

One commonly used method of judging how many principal components are worth considering is to plot eigenvalues:
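
A sketch of such a plot for the correlation-matrix decomposition, together with the proportion of variation per component; the name prop is an assumption.

```r
x <- as.matrix(longley[, -7])
e <- eigen(cor(x))
plot(e$values, type = "l", xlab = "EV no.", ylab = "Eigenvalue")
prop <- e$values / sum(e$values)   # share of total variation per component
round(prop, 3)
round(cumsum(prop), 3)
```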

Often, the plot has a noticeable "elbow". Here the elbow is at about 3, which tells us that we need only consider the first two principal components. The first 2 P.C.'s explain about 96.3% (=0.767+0.196) of the variation in the normalized X-matrix; these proportions are the first two eigenvalues of the correlation matrix (about 4.60 and 1.18) divided by 6.

One advantage of principal components is that it transforms the predictors to an orthogonal basis. To obtain the orthogonalized predictors for this data based on the eigen-decomposition of the correlation matrix of X, we must first normalize the data. The function "scale()" is useful for doing this:

Let us take a look at the covariance matrix of nx to check that it is normalized:
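
A sketch of the normalization, storing the result as nx (the name used in the text). By default scale() subtracts each column mean and divides by each column standard deviation.

```r
x <- as.matrix(longley[, -7])
nx <- scale(x)                 # center and scale every column
round(colMeans(nx), 10)        # all zero
apply(nx, 2, sd)               # all one
```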

             GNP.deflator       GNP Unemployed Armed.Forces Population      Year
GNP.deflator    1.0000000 0.9915892  0.6206334    0.4647442  0.9791634 0.9911492
GNP             0.9915892 1.0000000  0.6042609    0.4464368  0.9910901 0.9952735
Unemployed      0.6206334 0.6042609  1.0000000   -0.1774206  0.6865515 0.6682566
Armed.Forces    0.4647442 0.4464368 -0.1774206    1.0000000  0.3644163 0.4172451
Population      0.9791634 0.9910901  0.6865515    0.3644163  1.0000000 0.9939528
Year            0.9911492 0.9952735  0.6682566    0.4172451  0.9939528 1.0000000
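
The regression on the normalized principal components, whose coefficient table comes next, can be sketched as below. The name nxpc is taken from the row labels of that table and pc_fit is an assumption. Eigenvector signs are arbitrary, so coefficient signs may differ between R versions.

```r
x <- as.matrix(longley[, -7])
nx <- scale(x)
e <- eigen(cor(x))
dimnames(e$vectors) <- list(colnames(x), paste("EV", 1:6))
nxpc <- nx %*% e$vectors                  # normalized principal components
pc_fit <- lm(longley$Employed ~ nxpc)
summary(pc_fit)
```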

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 65.31700    0.07621 857.026  < 2e-16 ***
nxpcEV 1     1.56511    0.03669  42.662 1.07e-11 ***
nxpcEV 2     0.39183    0.07260   5.397 0.000435 ***
nxpcEV 3    -1.86039    0.17452 -10.660 2.10e-06 ***
nxpcEV 4     0.35730    0.64423   0.555 0.592672
nxpcEV 5    -6.16983    1.55812  -3.960 0.003305 **
nxpcEV 6     6.96337    4.05550   1.717 0.120105
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.3049 on 9 degrees of freedom
Multiple R-Squared: 0.9955, Adjusted R-squared: 0.9925
F-statistic: 330.3 on 6 and 9 DF, p-value: 4.984e-10

Notice that the 4th and 6th P.C.'s are not significant while the 5th is. Because the directions of the eigenvectors are set successively in the greatest remaining direction of variation in the predictors, it is natural to expect that they are ordered by significance in predicting the response. However, there is no guarantee of this: here the 5th P.C. is significant while the 4th is not, even though there is about six times greater variation in the 4th direction than in the 5th. In this example it hardly matters, since most of the variation is explained by the earlier components, but watch for this effect among the first few components in other datasets.

            (Intercept) nxpcEV 1 nxpcEV 2 nxpcEV 3 nxpcEV 4 nxpcEV 5 nxpcEV 6
(Intercept)           1        0        0        0        0        0        0
nxpcEV 1              0        1        0        0        0        0        0
nxpcEV 2              0        0        1        0        0        0        0
nxpcEV 3              0        0        0        1        0        0        0
nxpcEV 4              0        0        0        0        1        0        0
nxpcEV 5              0        0        0        0        0        1        0
nxpcEV 6              0        0        0        0        0        0        1

Notice now the P.C.'s are not only mutually orthogonal, they are also orthogonal to the constant term.

Principal components are really only useful if we can interpret the meaning of the new linear combinations. Look back at the first eigenvector: it is roughly a linear combination of all the (normalized) variables. Now plot each of the variables as they change over time:
> par(mfrow=c(2,3))
> for(i in 1:6) plot(longley[,6], longley[,i], type="l", xlab="Year", ylab=names(longley)[i])

What do you notice? This suggests we identify the first principal component with a time trend effect. The second principal component is roughly a contrast between numbers unemployed and in the armed forces. Let's try fitting a regression with those two components:
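
A sketch of that fit, using Year for the time trend and the difference Armed.Forces - Unemployed for the contrast, as in the coefficient table below; the name trend_fit is an assumption.

```r
# Two interpretable predictors standing in for the first two components
trend_fit <- lm(Employed ~ Year + I(Armed.Forces - Unemployed), data = longley)
summary(trend_fit)
```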

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                  -1.391e+03  7.889e+01 -17.638 1.84e-10 ***
Year                          7.454e-01  4.037e-02  18.463 1.04e-10 ***
I(Armed.Forces - Unemployed)  4.119e-03  1.525e-03   2.701   0.0182 *
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.7178 on 13 degrees of freedom
Multiple R-Squared: 0.9638, Adjusted R-squared: 0.9582
F-statistic: 173 on 2 and 13 DF, p-value: 4.285e-10

This approaches the fit of the full model and is easily interpretable. We need to do more work on the other principal components.

This example illustrates a typical use of principal components for regression. Some intuition is required to form new variables as combinations of older ones. If it works, a simplified and interpretable model is obtained, but it doesn't always work out that way.

Ridge regression (Reading: Faraway (2005, 1st edition), 9.3)

We demonstrate the method on the Longley data. You will need to load the "MASS" library, which provides lm.ridge(), to help with the analysis:

> library(MASS)
> gr <- lm.ridge(Employed~., longley, lambda=seq(0,0.1,by=0.001)) # explore the ridge trace for lambda in [0, 0.1]

> matplot(gr$lambda, t(gr$coef), type="l", xlab=expression(lambda), ylab=expression(hat(gamma)), lwd=3)
> abline(h=0, lwd=3)

You can find the ridge trace plot in the graphics window. There are some automatic methods for selecting an appropriate λ:
> select(gr)

> abline(v=0.004275357, lty=10, lwd=3) # the HKB estimator
> abline(v=0.03229531, lty=11, lwd=3) # the L-W estimator

The Hoerl-Kennard choice (Hoerl and Kennard are the originators of ridge regression) and the L-W choice of λ are shown on the plot. I would prefer the value 0.03. For this choice of λ, the estimate of β is:
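
A sketch of how to pull out the coefficients at that value. The index is found with which.min rather than an exact comparison, an assumption made to avoid floating-point mismatches in the seq() grid; idx and ridge_coef are hypothetical names.

```r
library(MASS)   # provides lm.ridge()
gr <- lm.ridge(Employed ~ ., longley, lambda = seq(0, 0.1, by = 0.001))
idx <- which.min(abs(gr$lambda - 0.03))   # grid point closest to 0.03
ridge_coef <- gr$coef[, idx]              # coefficients on the scaled predictors
ridge_coef
```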

Note that these values are based on centered and scaled predictors, which explains the difference from the previous fit. Consider the change in the coefficient for GNP. For the least squares fit, the effect of GNP on the response (the number of people employed) is negative. This is counter-intuitive, since we would expect the effect to be positive. The ridge estimate is positive, which is more in line with what we'd expect.

Collinearity (Reading: Faraway (2005, 1st edition), 5.3)
