Here is an example to demonstrate the difficulty of parameter interpretation. Let us use the savings data introduced in the previous lab.
> saving.x <- read.table("saving.x",header=T) # read the data into R
> p15 <- saving.x[,1]; p75 <- saving.x[,2]; inc <- saving.x[,3]; gro <- saving.x[,4]; sav <- saving.x[,5] # assign columns
In the following, we will
consider the effect of p75 on the savings rate in the savings dataset, and
fit four different models, all including p75 but varying the inclusion of other variables.
Let us start from the model with all predictors.
> g1 <- lm(sav~p15+p75+inc+gro) # fit the model with all four predictors
> summary(g1, cor=T) # take a look at the fitted model; cor=T also prints the correlations between the coefficient estimates
p75 is not significant in this model.
p75 is negatively correlated with p15, as shown in the lab (note: their coefficient estimates are positively correlated, with a correlation of 0.77).
This makes sense: countries with proportionately more young people are likely to have relatively fewer older ones, and vice versa. The two variables both measure the nature of the age distribution in a country.
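As a quick check (using the columns assigned above), we can compute the correlation between the two variables directly:
> cor(p15, p75) # correlation between the two age-structure variables; expect a negative value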
When two variables that represent roughly the same thing are included in a regression equation, it is not unusual for one (or even both) of them to appear insignificant even though prior knowledge about the effects of these variables might lead one to expect them to be important.
Now, let's drop p15 from the model.
Q: Will you expect to see a significant result of p75 after p15 is dropped?
Let us find out.
> g2 <- lm(sav~p75+inc+gro) # fit the model with predictors p75, inc, and gro
> summary(g2, cor=T) # take a look at the fitted model
Unfortunately, p75 is still not significant, and neither is inc.
Yet, one might expect both of them to have something to do with savings rates.
Higher values of these variables are both associated with wealthier countries (draw a scatter plot to check this; see the commands below).
We find that the correlation between their coefficient estimates is -0.80 (you can also check the positive correlation (0.787) between the two variables themselves with cor(p75, inc)).
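For example, the following commands produce the scatter plot and the correlation mentioned above:
> plot(inc, p75) # scatter plot of p75 against inc
> cor(p75, inc) # correlation between the two variables (about 0.787)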
Let's see what happens when we drop inc from the model.
> g3 <- lm(sav~p75+gro) # fit the model with p75 and gro
> summary(g3, cor=T) # take a look at the fitted model
Now, p75 is statistically significant with a positive coefficient.
Note that the correlation between the estimates of p75 and gro is very small.
Roughly speaking, p75 and gro are almost orthogonal.
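We can also check the correlation between the variables themselves, which we expect to be small as well:
> cor(p75, gro) # correlation between p75 and gro; close to zero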
Let's try dropping gro.
> g4 <- lm(sav~p75) # fit the model only with p75
> summary(g4, cor=T) # take a look at the fitted model
Notice that the coefficient estimate and p-value do not change much when compared to model g3. Q: Why?
It is because of the low correlation between p75 and gro, and because the two models have similar σ-hat.
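We can verify the latter point by comparing the residual standard errors of the two fits:
> summary(g3)$sigma; summary(g4)$sigma # σ-hat for models g3 and g4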
Let's compare the coefficient estimates and p-values for p75 throughout.
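One way to collect them side by side (a sketch using the standard summary.lm output):
> sapply(list(g1,g2,g3,g4), function(g) summary(g)$coefficients["p75",c("Estimate","Pr(>|t|)")]) # p75 estimate and p-value in each of the four models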
We see that the significance and the direction of the effect of p75 change according to which other variables are also included in the model; no simple conclusion about the effect of p75 is possible.
We must find interpretations for a variety of models.
We certainly will not be able to make any strong causal conclusions.
Prediction is more stable than parameter estimation.
Consider a prediction made using each of the four models above:
> x0 <- data.frame(p15=32, p75=3, inc=700, gro=3) # assign predictor values for prediction
> predict(g1,x0,interval="prediction"); predict(g2,x0,interval="prediction"); predict(g3,x0,interval="prediction"); predict(g4,x0,interval="prediction") # calculate predicted values and prediction intervals using the four models
We can see that these values do not change much.
When the objective of a regression analysis is only prediction, the choice of fitted model is less of a concern.
Of course, the robustness depends on the values of x0.
When x0 is located near the "center" of observed data, the prediction is more robust to the choice of fitted models.
On the other hand, if x0 is an extrapolation, the results could be quite different.
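For example (the values below are hypothetical, chosen only to lie well outside the observed data), we can repeat the comparison at an extrapolated point:
> x1 <- data.frame(p15=50, p75=10, inc=5000, gro=20) # a hypothetical extrapolation point
> predict(g1,x1,interval="prediction"); predict(g2,x1,interval="prediction"); predict(g3,x1,interval="prediction"); predict(g4,x1,interval="prediction") # the four predictions may now differ substantially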