Look at here for a description of the data that we will use in this section.
Q: Can you figure out from the description what weights should be used for the data and why?
> p <- read.table("phy.data", header=T)
# note the extra argument to cope with not having row labels
> p # take a look of the data
The sd decreases along with the reduction of energy.
Since the weights should be inversely proportional to the variance, this suggests that the weights should be set to 1/sd2.
> x <- p[, 2]
# extract the energy variable, the predictor
> y <- p[, 3] # extract the crossx variable, the response
> w <- 1/p[, 4]^2 # the weights
> g <- lm(y~x, weights=w) # fit the model with w as weights. "weights" adds as an option.
> summary(g) # take a look of the results
We can test the significance of x by looking at the p-value from the coefficients table or by looking at the overall F-stat (they are equivalent, 11.162=124.6, because this is a simple regression case).
But, if we want to do it the hard way, we must construct the analysis of variance table by hand as follows:
> sum(w*g$res*g$res) # the residual sum of squares
> wm <- sum(w*y)/sum(w)
# the weighted mean of y
> sum(w*(y-wm)*(y-wm)) # the weighted total SS
> 363.944 - 21.95265 # the regression SS
> (341.9914/1)/(21.95265/8) # the overall F statistic
Let's try fitting the regression without weights and see what the difference is:
> g1 <- lm(y~x)
# fit the model without
> summary(g1) # take a look of the fitted model
Compare the output to the summary of model g. Explain their differences.
R-square: why would g1 has a higher value than g?
σ-hat: Why are they so different? Do they estimate the same parameter?
residuals: Why are their ranges so different?
The two fits can be compared:
# scatter plot of x and y
> abline(g) # the fitted line of model g, solid line
> abline(g1, lty=2) # the fitted line of model g2, dashed line
The un-weighted fit (dashed line) appears to fit the data better overall
But remember that for lower values of energy, the variance in the response is less.
The weighted fit (solid line) tries to catch these points with lower energy better than the others.
Weighted least square can be regarded as regressing sqrt(wi)xi on sqrt(wi)yi using ordinary least square:
# calculate sqrt(wi)yi
> xm<-sqrt(w)*x # calculate sqrt(wi)xi
> g2 <- lm(ym~sqrt(w)+xm-1) # regressing sqrt(wi)xi on sqrt(wi)yi. "-1" drop the intercept.
Note that the 1 column in X-matrix is replaced by sqrt(wi)
> summary(g2) # take a look of the results
Compare the output to the summary of model g. What quantities are different? Explain why.
To get the correct overall F-test, we can use:
> anova(lm(ym~sqrt(w)-1), g2) # the correct F-statistics
You can use the two RSSs in the correct F-statistic to obtain the right R-square (How?)
Let's also check their residuals.
> print(g$res, digits=2) # the residuals of model g
> print(g2$res, digits=2) # the residuals of model g2
Why are the two groups of residuals so different?
Which residuals is yi minus yi-hat? Which residuals is sqrt(wi)yi minus sqrt(wi)yi-hat?
Which residuals can be used to perform diagnostics (i.e., residual analysis) and to calculate σ-hat by directly taking sum of their squares?
Let's find out the relationship between the two groups of residuals:
> print(sqrt(w)*g$res, digits=2) # check if it equals residuals of model g2