Here is a data set that is built-in in Splus, called "saving.x". Details about the data can be found in Splus by the command "help(saving.x)", which returns the following results:
The saving.x data set is originally from unpublished data of Arlie Sterling. It is a matrix with 50 rows representing countries and 5 columns representing variables. The data contain average saving rates over the time period of 1960-1970; the data are averaged to remove business cycle or other short-term fluctuations. Income is per-capita disposable income in U.S. dollars, growth is the percent rate of change in per capita disposable income, and savings rate is aggregate personal saving divided by disposable income.
Belsley, D.A., Kuh, E., and Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: John Wiley and Sons, Inc.
Let us read the data into R:
> saving.x <- read.table("saving.x", header=T)
# the option "header" is a logical value indicating
whether the file contains the names of the variables as its first line. It is
set to "TRUE" if and only if the first row contains one fewer field than the
number of columns.
> saving.x # take a look
of the data
Pop15 Pop75 Disp Growth Savings
1 29.35 2.87 2329.68 2.87 11.43
2 23.32 4.41 1507.99 3.93 12.07
3 23.80 4.43 2108.47 3.82 13.17
... much deleted ...
50 47.20 0.66 242.69 5.08 4.71
The dataset has 50 observations and 5 variables. The variables are:
Pop15 | the percentage population under 15 |
Pop75 | the percentage population over 75 |
Disp | per-capita disposable income in US dollars |
Growth | the percent rate of change in per capita disposable income |
Savings | aggregate personal saving divided by disposable income |
Let's assign the variables:
> p15 <- saving.x[,1]
# save the 1st column as p15
> p75 <- saving.x[,2]
# save the 2nd column as p75
> inc <- saving.x[,3]
# save the 3rd column as inc
> gro <- saving.x[,4]
# save the 4th column as gro
> sav <- saving.x[,5]
# save the 5th column as sav
Now, let us fit a simple model, with the variable Savings as the response and the rest variables as predictors, i.e.,
savi | = | β0 + βp15*p15i + βp75*p75i + βinc*inci + βgro*groi + εi, |
i=1, 2, ..., 50. |
In the following, we demonstrate how the t-statistics and overall F-test can be obtained from R, or by doing the calculations directly.
> g <- lm(sav~p15+p75+inc+gro)
# fit the model
with sav as response and the rest as predictors
> summary(g) # make a note of the
p-values for the overall F-test and
the t-test H0:
bp15=0
Call:
lm(formula = sav ~ p15 + p75 + inc + gro)
Residuals:
Min 1Q Median 3Q Max
-8.2423 -2.6859 -0.2491 2.4278 9.7509
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.5666100 7.3544986 3.884
0.000334 ***
p15 -0.4612050
0.1446425 -3.189 0.002602 **
p75 -1.6915757
1.0835862 -1.561 0.125508
inc -0.0003368
0.0009311 -0.362 0.719296
gro 0.4096998
0.1961961 2.088 0.042468 *
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 3.803 on 45 degrees of freedom
Multiple R-Squared: 0.3385, Adjusted R-squared: 0.2797
F-statistic: 5.756 on 4 and 45 DF, p-value: 0.0007902
(Q: There are many quantities and many information in the output. What have caught your attention? For different datasets, you might find their most important/useful information existing in different statistics. For this dataset, which statistics/quantities are particularly important?)
Notice that
because the overall F-test (H0: βp15=βp75=βinc=βgro=0) has a small p-value, 0.0007902, the null hypothesis that none of the predictors are useful in explaining the response is rejected.
Q: Why the degrees of freedom for the overall F-test are 4 and 45?
although the overall F-test is significant, the R2(=0.3385) is not very high. Only 33.85% of the variation in Savings is explained by the predictors.
However, it is difficult to judge whether such an R2 is much lower than expectation when we don't have further background information about the data.
Be aware of the relationship between R2 and the test statistics of the overall F-test
> (0.3385/(1-0.3385))*(45/4) # calculate the F test statistic from R2
[1] 5.756803
> (5.756*(4/45))/(1+5.756*(4/45)) # calculate R2 from the F test statistic
[1] 0.3384688
t-value = Estimate/(Std. Error) (Q: Why?)
because the p-value of the t-test H0: βp15=0 is very small, we can conclude that the null should be rejected. But, remember that
the t-test of p15 is relative to the other predictors in the model, namely, p75, inc, gro, and
it is not possible to look at the effect of p15 in isolation, i.e, simply stating the null as H0: βp15=0 is insufficient because information about what other predictors are not included
the result of the test may be different if the predictors change
Q: Is the significance for the intercept, i.e, H0: β0=0, important?
Test of all the predictors
Now, let's do the calculation directly. First, the overall F-statistic --- we will perform the test of H0: bp15=bp75=binc=bgro=0 "by hand". For the test,
W: |
sav= |
b0 + bp15*p15 + bp75*p75 + binc*inc + bgro*gro + e, and |
w: | sav= | b0 + e. |
> ip <- function(x,y) sum(x*y)
# write a function for inner product, use "help("function")"
to find out how to write a self-defined function in R
> ip(g$res, g$res) # the RSSW
[1] 650.706
> g$df # the degrees of freedom of RSSW, i.e., n-p
[1] 45
> ip(sav-mean(sav), sav-mean(sav)) # the RSSw, which is TSS in this case
[1] 983.6283
> ((983.6283-650.706)/4)/(650.706/45) # the overall F-statistics
[1] 5.755865
> 1-pf(5.755865, 4, 45) # the p-value of the overall F-test, the command "pf" returns the cdf value of an F distribution
[1] 0.0007902025
A convenient way in R to calculate the F-statistics and its p-value is as follows:
> g1 <- lm(sav~1)
# fit the null model
w
> summary(g1)
# take a look
Call:
lm(formula = sav ~ 1)
Residuals:
Min 1Q Median 3Q Max
-9.071 -2.701 0.839 2.947 11.429
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.6710 0.6336 15.26 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.48 on 49 degrees of freedom
> anova(g1, g) # compare the two models W and w, the command "anova" computes the analysis of variance (or deviance) tables for one or more fitted models. Use "help(anova)" to get more information.
Analysis of Variance Table
Model 1: sav ~ 1
Model 2: sav ~ p15 + p75 + inc + gro
Res.Df RSS Df
Sum of Sq F Pr(>F)
1 49 983.63
2 45 650.71 4
332.92 5.7559 0.0007902 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Check if it agrees with the result of the overall F-test in the summary of g.
In many textbooks, you can find the common format of ANOVA table as follows. From the above overall F-test, can you fill in its ANOVA table?
Source | Degree of Freedom | Sum of Square | Mean Square | F |
Regression | p-1 | SSreg | SSreg/(p-1) | F = [SSreg/(p-1)]/[RSS/(n-p)] |
Residual | n-p | RSS | RSS/(n-p) | |
Total | n-1=(p-1)+(n-p) | TSS=SSreg+RSS |
Testing just one predictor
Now, let's do the t-test, H0: bp15=0, by hand. For the test,
W: |
sav= |
b0 + bp15*p15 + bp75*p75 + binc*inc + bgro*gro + e, and |
w: | sav= | b0 + bp75*p75 + binc*inc + bgro*gro + e. |
> g2 <- lm(sav~p75+inc+gro)
# fit the model without p15
> ip(g2$res, g2$res) # the
RSSw
[1] 797.7234
> (797.7234-650.706)/(650.706/45) # the F-statistic
[1] 10.16708
> sqrt(10.16708) # t-statistics is square root of the F-statistics,
[1] 3.188586
Check if it agrees with the result in summary of g (except for the sign).
> 2*(1-pt(3.188586, 45)) # compute the tail probability --- the command "pt" returns the cdf value of a t distribution. (Q: why multiplied by 2? Ans: it's a two-sided test)
[1] 0.002602461
It should agree with the p-value in summary of g.
> anova(g2, g) # a convenient way to compare two models, one nested in another
Analysis of Variance Table
Model 1: sav ~ p75 + inc + gro
Model 2: sav ~ p15 + p75 + inc + gro
Res.Df RSS Df Sum of Sq F Pr(>F)
1 46 797.72
2 45 650.71 1 147.02 10.167 0.002602 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Testing a pair of predictors
We can find that the t-tests for bp75 and binc are both insignificant from the summary of g (Q: Does this mean that both p75 and inc can be directly eliminated from the model?). We may therefore be interested in testing the pair of predictors: H0: bp75=binc=0. For the test,
W: |
sav= |
b0 + bp15*p15 + bp75*p75 + binc*inc + bgro*gro + e, and |
w: | sav= | b0 + bp15*p15 + bgro*gro + e. |
> g3 <-
lm(sav~p15+gro)
# fit the model without
p75 and inc
> anova(g3, g)
# a convenient way to
compare two models, one nested in another
Analysis of Variance Table
Model 1: sav ~ p15 + gro
Model 2: sav ~ p15 + p75 + inc + gro
Res.Df RSS Df Sum of Sq F Pr(>F)
1 47 700.55
2 45 650.71 2 49.84 1.7234 0.1900
Because the p-value is not extreme, there is not enough evidence to reject H0: bp75=binc=0 when p15 and gro are in the model.
Testing a subspace w
We might be interested in testing the hypothesis that the effect of young and old people on the savings rate (with estimated value -0.4612 and -1.6915, respectively, in the summary of g) was the same, i.e., test H0: bp15=bp75. For the test,
W: |
sav= |
b0 + bp15*p15 + bp75*p75 + binc*inc + bgro*gro + e, and |
w: | sav= | b0 + bp15*(p15+p75) + binc*inc + bgro*gro + e. |
> g4 <- lm(sav~I(p15+p75)+inc+gro)
# The function "I()" ensures that the argument is evaluated rather
than interpreted as part of the model formula
> anova(g4, g)
# a convenient way to compare two models, one nested in
another
Analysis of Variance Table
Model 1: sav ~ I(p15 + p75) + inc + gro
Model 2: sav ~ p15 + p75 + inc + gro
Res.Df RSS Df Sum of Sq F Pr(>F)
1 46 673.62
2 45 650.71 1 22.92 1.5849 0.2146
We find that
the p-value of 0.2146 indicates that the null cannot be rejected here
it means that there is not evidence that young and old people need to be treated separately in the context of this particular model.
Testing a subset w
Suppose that we want to test whether one of the coefficients can be set to a particular value. For example, test H0: bgro=1 (the estimated value of bgro is 0.4097 in the summary of g). For the test,
W: |
sav= |
b0 + bp15*p15 + bp75*p75 + binc*inc + bgro*gro + e, and |
w: | sav= | b0 + bp15*p15 + bp75*p75 + binc*inc + gro + e. |
Notice that now in w there is a fixed coefficient on the gro term. Such a fixed term in the regression equation is called an offset.
> g5 <- lm(sav~p15+p75+inc+offset(gro))
# An offset is a term to be added to a linear predictor.
The command "offset" is used to include an offset into model
> anova(g5, g)
#
a convenient way to compare two models, one nested in another
Analysis of Variance Table
Model 1: sav ~ p15 + p75 + inc + offset(gro)
Model 2: sav ~ p15 + p75 + inc + gro
Res.Df RSS Df Sum of Sq F Pr(>F)
1 46 781.61
2 45 650.71 1 130.90 9.0524 0.004286 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that the p-value is large and the null hypothesis here is rejected.
An alternative is to use t-test:
> (0.4096998-1)/0.1961961 # t-statistic (Q: where in the summary of g can you find the estimate of bgro and its standard error?)
[1] -3.008725
> 2*pt(-3.008725, 45) # p-value
[1] 0.004286205
> (-3.008725)^2 # agree with the F-statistic?
[1] 9.052426
Note that in the model g5, these is an offset term (i.e., gro). To fit a linear model with offset, can we use the response minus the offset (i.e., sav-gro) as a new response? That is, fit the model
sav - gro = b0 + bp15*p15 + bp75*p75 + binc*inc + e. |
Let us check how R handle the two models.
> g6 <- lm(I(sav-gro)~p15+p75+inc) # fit a model using the same predictors and a new response which equals the original response minus the offset
> summary(g6) # take a look of the fitted model g6
Call:
lm(formula = I(sav - gro) ~ p15 + p75 + inc)
Residuals:
Min 1Q Median 3Q Max
-10.1949 -2.1483 -0.2972 2.4888 8.8658
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.4005843 7.8296970 3.116 0.00315 **
p15 -0.4166095 0.1559668 -2.671 0.01042 *
p75 -1.8699958 1.1728456 -1.594 0.11769
inc 0.0003790 0.0009759 0.388 0.69954
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.122 on 46 degrees of freedom
Multiple R-Squared: 0.2208, Adjusted R-squared: 0.17
F-statistic: 4.346 on 3 and 46 DF, p-value: 0.008888
> summary(g5) # take a look of the fitted model g5
Call:
lm(formula = sav ~ p15 + p75 + inc + offset(gro))
Residuals:
Min 1Q Median 3Q Max
-10.1949 -2.1483 -0.2972 2.4888 8.8658
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.4005843 7.8296970 3.116 0.00315 **
p15 -0.4166095 0.1559668 -2.671 0.01042 *
p75 -1.8699958 1.1728456 -1.594 0.11769
inc 0.0003790 0.0009759 0.388 0.69954
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.122 on 46 degrees of freedom
Multiple R-Squared: 0.4523, Adjusted R-squared: 0.4166
F-statistic: 12.66 on 3 and 46 DF, p-value: 3.67e-06
Compare the summary outputs of g6 and g5 and find what values are identical and what are different. Explain why they are identical or different?
> cbind(g5$fit, g6$fit, g6$fit+gro) # compare the fitted values under g5 and g6 models
[,1] [,2] [,3]
1 10.559130 7.689130 10.559130
2 10.940081 7.010081 10.940081
3 10.820283 7.000283 10.820283
... much deleted ...
50 8.674396 3.594396 8.674396
> sum(g5$fit*g5$res) # the inner product of fitted values and residuals under g5.
[1] -221.7506
Note that in g5, the fitted values and residuals are not orthogonal.
> sum(g6$fit*g6$res) # the inner product of fitted values and residuals under g6.
[1] -2.220446e-14
Note that in g6, the fitted values and residuals are orthogonal.
For your information, to test a general linear hypothesis H0:Ab=c
in R, you can use the "linear.hypothesis" command in "car" package. Take a look
of
this document for more information.