Variance known
Using the same data as in the weighted least squares lab, we are going to test whether the model fitted in the previous lab is adequate.
Because of the way the weights are defined, i.e., wi=1/var(yi), the known variance is implicitly equal to one (why?).
Let's first fit the weighted least square model.
> p <- read.table("phy.data", header=T) # read the data into R
> p # take a look of the data
momentum energy crossx sd
1 4 0.345 367 17
2 6 0.287 311 9
3 8 0.251 295 9
...deleted...
10 150 0.060 192 5
> x <- p[,2]; y <- p[,3]; w <- 1/p[,4]^2 # extract variables and calculate weights
> g <- lm(y~x,weights=w) # fit a simple regression model with w 
as weights
> summary(g) 
# take a look of the fitted model
Call:
lm(formula = y ~ x, weights = w)
Residuals:
Min 1Q Median 3Q Max
-2.323e+00 -8.842e-01 1.266e-06 1.390e+00 2.335e+00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 148.473 8.079 18.38 7.91e-08 ***
x 530.835 47.550 11.16 3.71e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.657 on 8 degrees of freedom
Multiple R-Squared: 0.9397, Adjusted R-squared: 0.9321
F-statistic: 124.6 on 1 and 8 DF, p-value: 3.710e-06
Look at the high R2 and the very small p-value for overall F-test
Q: do you think the model is a good fit?
		Now 
plot the data and the fitted regression line:
> 
par(mfrow=c(1,2)) # 
split the graphic window into two
		> plot(x, y, cex=2) # draw the scatter plot of energy and 
crossx
> abline(g$coef) # add the fitted line
		
Now, let's also plot the residuals against crossx.
> plot(x, g$res, cex=2); abline(0, 0) # draw scatter plot of residuals and crossx and add y=0 line
			
Q: What have you found from the plots?
Q: Whether or not the plots show that the fitted model is not good enough?
Let's perform the test for lack of fit. Compute the test statistic and its p-value:
> (8*1.657^2)/1 # test statistics = (n-p)*σ-hat2/σ2
[1] 21.96519
> 1-pchisq(21.96519, 8) # p-value of the test, the "pchisq" command returns the cdf value of a chi-square distribution
[1] 0.004980762
Because the p-value is small, we conclude that there is a lack of fit.
Notice that although the original linear model had an R2 of 94% and a very small overall F p-value, this in itself is not enough to conclude that this particular model fits.
Let us try to add a quadratic term, as suggested by the residual plot above, to the model and test again:
> g1 <- lm(y ~ x + I(x^2), weights=w) # fit a 2nd-order polynomial model
> 
summary(g1, cor=T) # take a look of the new fitted model
Call:
lm(formula = y ~ x + I(x^2), weights = w)
Residuals:
Min 1Q Median 3Q Max
-0.89928 -0.43508 0.01374 0.37999 1.14238
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 183.8305 6.4591 28.461 1.7e-08 ***
x 0.9709 85.3688 0.011 0.991243
I(x^2) 1597.5047 250.5869 6.375 0.000376 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6788 on 7 degrees of freedom
Multiple R-Squared: 0.9911, Adjusted R-squared: 0.9886
F-statistic: 391.4 on 2 and 7 DF, p-value: 6.554e-08
Correlation of Coefficients:
(Intercept) x
x -0.94
I(x^2) 0.86 -0.97
The quadratic term is significant.
The linear term becomes insignificant because of the strong collinearity between linear and quadratic terms.
However, it doesn't mean we should just drop the linear term. A better solution is to rescale the linear and quadratic terms so that they can be (almost) orthogonal.
Let's perform the test for lack of fit on the new model.
> (7*0.6788^2)/1 # calculate test statistic
[1] 3.225386
> 1-pchisq(3.225386, 7) # calculate p-value
[1] 0.8633989
This time we cannot detect a lack of fit.
Let's plot the fits of the two models and a residual plot under the new model:
	> plot(x, y, cex=2) # draw the scatter plot of energy and 
crossx
> abline(g$coef) # add the fitted line of model g
	
> xm <- seq(0.05, 0.35, 0.01); lines(xm, g1$coef[1]+g1$coef[2]*xm+g1$coef[3]*xm^2, lty=2) # add the fitted line of model g1
> plot(x, g1$res, cex=2); abline(0, 0) # a residual plot under model g1
		
The quadratic model seems more appropriate than the linear model.
Variance unknown
The data for this example consist of thirteen specimens of 90/10 Cu-Ni alloys with varying iron content in percent.
The specimens were submerged in sear water for 60 days and the weight loss due to corrosion was recorded in units of milligrams per square decimeter per day.
> corrosion <- read.table("corrosion.data") # read the data into R
> corrosion # take a look of the data
Fe loss
1 0.01 127.6
2 0.48 124.0
3 0.71 110.8
4 0.95 103.9
5 1.19 101.5
6 0.01 130.1
7 0.48 122.0
8 1.44 92.3
9 0.71 113.1
10 1.96 83.7
11 0.01 128.0
12 1.44 91.4
13 1.96 86.2
Notice that this is a data with some replication.
Q: What is the source of variation in the replication?
> g <- lm(loss ~ Fe, data=corrosion) # Fit 
a simple linear model with loss as response and Fe as predictor
> 
summary(g) # take a look of the fitted model
Call:
lm(formula = loss ~ Fe, data = corrosion)
Residuals:
Min 1Q Median 3Q Max
-3.7980 -1.9464 0.2971 0.9924 5.7429
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 129.787 1.403 92.52 < 2e-16 ***
Fe -24.020 1.280 -18.77 1.06e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.058 on 11 degrees of freedom
Multiple R-Squared: 0.9697, Adjusted R-squared: 0.967
F-statistic: 352.3 on 1 and 11 DF, p-value: 1.055e-09
Look at the high R2 and the very small p-value for overall F-test.
Q: Do you think this is a good fit?
Let's check the fit graphically:
		> plot(corrosion$Fe, 
corrosion$loss, ylim=c(80,140), cex=2) # draw the scatter plot of Fe and loss
		> abline(g$coef) # add the fitted line
			
Examine the plot and guess whether the test for lack of fit will be accepted or rejected.
To perform the test for lack of fit,
We now fit a saturated model that reserves a parameter for each group of data with the same value of Fe.
This is accomplished by declaring the predictor to be a factor in R.
> ga <- lm(loss ~ factor(Fe), data=corrosion) # fit the saturated model, the "factor" command treats Fe as a qualitative predictor
> summary(ga) # take a look of the fitted saturated model
Call:
lm(formula = loss ~ factor(Fe), data = corrosion)
Residuals:
Min 1Q Median 3Q Max
-1.250e+00 -9.667e-01 9.991e-17 1.000e+00 1.533e+00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 128.567 0.809 158.914 4.19e-12 ***
factor(Fe)0.48 -5.567 1.279 -4.352 0.00481 **
factor(Fe)0.71 -16.617 1.279 -12.990 1.28e-05 ***
factor(Fe)0.95 -24.667 1.618 -15.245 5.03e-06 ***
factor(Fe)1.19 -27.067 1.618 -16.728 2.91e-06 ***
factor(Fe)1.44 -36.717 1.279 -28.703 1.18e-07 ***
factor(Fe)1.96 -43.617 1.279 -34.097 4.24e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.401 on 6 degrees of freedom
Multiple R-Squared: 0.9965, Adjusted R-squared: 0.9931
F-statistic: 287.3 on 6 and 6 DF, p-value: 4.152e-07
Notice that
the saturated model has 7 parameters (including a constant term), which is equal to the number of distinct values of Fe.
The residual standard error, 1.401, is a pure error estimate of true σ.
The fitted values of the saturated model are the means in each group - put these on the plot:
> points(corrosion$Fe, ga$fit, pch=18, cex=2) # add the means in each group to the plot
		
From the plot, do you think the model can pass the test for lack of fit?
Here, let's calculate the SS and degrees of freedom for the pure error "by hand":
> sum(ga$res^2) # calculate SSp.e.
[1] 11.78167
> ga$df # the dfp.e.
[1] 6
> sum(g$res^2) # the RSS calculated from the model g
[1] 102.8502
We can put these values into the ANOVA table to calculate test statistic:
| d.f. | SS | MS | F | |
| Residual | n-p (11) | RSS (102.8502) | ||
| Lack of fit | n-p-dfp.e. (5) | RSS-SSp.e. | (RSS-SSp.e.)/(n-p-dfp.e. ) | ratio of MS | 
| Pure error | dfp.e. (6) | SSp.e. (11.78167) | (SSp.e.)/(dfp.e.) | 
And, now the test statistic:
> ((102.8502-11.78167)/(11-6))/((11.78167/6)) # test statistic for lack of fit
[1] 9.275615
> 1-pf(9.275615, 5, 6) # the p-value
[1] 0.008622849
A slightly quicker way to get the same result:
> anova(g, ga) # compute the ANOVA table for models g and ga
Analysis of Variance Table
Model 1: loss ~ Fe
Model 2: loss ~ factor(Fe)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 11 102.850
2 6 11.782 5 91.069 9.2756 0.008623 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The low p-value indicates that we must conclude that there is a lack of fit.
Let's try to add more higher order polynomial terms one-by-one until the model pass the test for lack of fit:
> g1 <- lm(loss~Fe+I(Fe^2), data=corrosion) 
# fit a 2nd-order polynomial model 
> 
anova(g1, ga) # compute the 
ANOVA table for models g1 and ga
Analysis of Variance Table
Model 1: loss ~ Fe + I(Fe^2)
Model 2: loss ~ factor(Fe)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 10 100.086
2 6 11.782 4 88.305 11.243 0.005949 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> g2 <- 
lm(loss~Fe+I(Fe^2)+I(Fe^3), data=corrosion) # fit a 
3rd-order polynomial model 
> anova(g2, ga) 
# compute the ANOVA table for models g2 
and ga
Analysis of Variance Table
Model 1: loss ~ Fe + I(Fe^2) + I(Fe^3)
Model 2: loss ~ factor(Fe)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 9 45.051
2 6 11.782 3 33.269 5.6476 0.03506 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> g3 <- 
lm(loss~Fe+I(Fe^2)+I(Fe^3)+I(Fe^4), data=corrosion) # 
fit a 4th-order polynomial model 
> 
anova(g3, ga) # compute the 
ANOVA table for models g3 and ga
Analysis of Variance Table
Model 1: loss ~ Fe + I(Fe^2) + I(Fe^3) + I(Fe^4)
Model 2: loss ~ factor(Fe)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 8 34.975
2 6 11.782 2 23.194 5.9059 0.03822 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> g4 <- 
lm(loss~Fe+I(Fe^2)+I(Fe^3)+I(Fe^4)+I(Fe^5), data=corrosion) 
# fit a 5th-order polynomial model 
> 
anova(g4, ga) # compute the 
ANOVA table for models g4 and ga
Analysis of Variance Table
Model 1: loss ~ Fe + I(Fe^2) + I(Fe^3) + I(Fe^4) + I(Fe^5)
Model 2: loss ~ factor(Fe)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 7 14.465
2 6 11.782 1 2.683 1.3663 0.2868
Finally, the 5th-order polynomial model passes the test for lack of fit.
This model is very close to the saturated model (the 6th-order polynomial model is a saturated model, why? Hint. count the number of parameters).
Let's plot the fitted line of the 5-th order polynomial model and compare it with the straight line model:
> grid <- seq(0, 2, length=50); lines(grid, predict(g4,data.frame(Fe=grid)), lty=2) # add the fitted line of the 5-th order model
		
Do you think the 5th order polynomial model is better than the straight line model?
We actually cannot explain why the corrosion loss should increase then decrease in the region 0<Fe<0.5, and decrease then increase in the region 1.5<Fe<2.0.
This is probably a consequence of over-fitting the data.
Because of the complexity of the model, the fitting on the observed Fe's is good, but the prediction on the region without observed Fe's becomes unstable.
In this case, we may want to stay with the straight line model and consider other reasons that may causes the lack of fit.
For example, you may want to find out whether the replicates are genuine.
Perhaps, the low pure error estimate can be explained by some correlation in the measurements.
The source of variation should also be examined.
For example, if the specimens with same Fe's are taken from different positions of a same unit, the pure error estimate does not include the variation between different units.
In the case, it is reasonable to get a pure error estimate lower than expected.
Other possible explanation is unmeasured predictors is causing the lack of fit.