Here is an example to demonstrate the difficulty of parameter interpretation. Let us use the savings data introduced in the previous lab.
> saving.x <- read.table("saving.x",header=T) # read the data into R
> p15 <- saving.x[,1]; p75 <- saving.x[,2]; inc <- saving.x[,3]; gro <- saving.x[,4]; sav <- saving.x[,5] # assign columns
In the following, we will
consider the effect of p75 on the savings rate in the savings dataset, and
fit four different models, all including p75 but varying the inclusion of other variables.
Let us start from the model with all predictors.
> g1 <- lm(sav~p15+p75+inc+gro) # fit the model with all four predictors
> summary(g1, cor=T) # take a look at the fitted model; cor=T also prints the correlations between the coefficient estimates
p75 is not significant in this model.
p75 is negatively correlated with p15, as shown in the lab (note: their coefficient estimates are positively correlated, with a correlation of 0.77).
This makes sense: countries with proportionately more young people are likely to have relatively fewer older ones, and vice versa. The two variables both measure the nature of the age distribution in a country.
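As a quick check (using the columns assigned above), we can compute the correlation between the two variables directly:
> cor(p15, p75) # correlation between the two age-structure variables; expect a negative value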
When two variables that represent roughly the same thing are included in a regression equation, it is not unusual for one (or even both) of them to appear insignificant even though prior knowledge about the effects of these variables might lead one to expect them to be important.
Now, let's drop p15 from the model.
Q: Will you expect to see a significant result of p75 after p15 is dropped?
Let us find out.
> g2 <- lm(sav~p75+inc+gro) # fit the model with predictors p75, inc, and gro
> summary(g2, cor=T) # take a look at the fitted model
Unfortunately, p75 is still not significant, and neither is inc.
Yet, one might expect both of them to have something to do with savings rates.
Higher values of these variables are both associated with wealthier countries (draw a scatter plot to check this; see the commands below).
We find that the correlation between their coefficient estimates is -0.80 (you can also check the positive correlation (0.787) between the two variables themselves with cor(p75, inc)).
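For example, the following commands produce the scatter plot and the correlation mentioned above:
> plot(inc, p75) # scatter plot of p75 against inc
> cor(p75, inc) # correlation between the two variables (about 0.787)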
Let's see what happens when we drop inc from the model.
> g3 <- lm(sav~p75+gro) # fit the model with p75 and gro
> summary(g3, cor=T) # take a look at the fitted model
Now, p75 is statistically significant with a positive coefficient.
Note that the correlation between the estimates of p75 and gro is very small.
Roughly speaking, p75 and gro are almost orthogonal.
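We can also check the correlation between the variables themselves, which we expect to be small as well:
> cor(p75, gro) # correlation between p75 and gro; close to zero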
Let's try dropping gro.
> g4 <- lm(sav~p75) # fit the model only with p75
> summary(g4, cor=T) # take a look at the fitted model
Notice that the coefficient estimate and p-value do not change much when compared to model g3. Q: Why?
It is because of the low correlation between p75 and gro, and because the two models have similar σ-hat.
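We can verify the latter point by comparing the residual standard errors of the two fits:
> summary(g3)$sigma; summary(g4)$sigma # σ-hat for models g3 and g4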
Let's compare the coefficient estimates and p-values for p75 throughout.
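One way to collect them side by side (a sketch using the standard summary.lm output):
> sapply(list(g1,g2,g3,g4), function(g) summary(g)$coefficients["p75",c("Estimate","Pr(>|t|)")]) # p75 estimate and p-value in each of the four models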
We see that the significance and the direction of the effect of p75 change according to which other variables are also included in the model; no simple conclusion about the effect of p75 is possible.
We must find interpretations for a variety of models.
We certainly will not be able to make any strong causal conclusions.
Prediction is more stable than parameter estimation.
Consider a prediction made using each of the four models above:
> x0 <- data.frame(p15=32, p75=3, inc=700, gro=3) # assign predictor values for prediction
> predict(g1,x0,interval="prediction"); predict(g2,x0,interval="prediction"); predict(g3,x0,interval="prediction"); predict(g4,x0,interval="prediction") # calculate predicted values and prediction intervals using the four models
We can see that these values do not change much.
When the objective of a regression analysis is only prediction, the choice of fitted model is less of a concern.
Of course, the robustness depends on the values of x0.
When x0 is located near the "center" of observed data, the prediction is more robust to the choice of fitted models.
On the other hand, if x0 is an extrapolation, the results could be quite different.
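For example (the values below are hypothetical, chosen only to lie well outside the observed data), we can repeat the comparison at an extrapolated point:
> x1 <- data.frame(p15=50, p75=10, inc=5000, gro=20) # a hypothetical extrapolation point
> predict(g1,x1,interval="prediction"); predict(g2,x1,interval="prediction"); predict(g3,x1,interval="prediction"); predict(g4,x1,interval="prediction") # the four predictions may now differ substantially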