Hypothesis Test (Reading: Faraway (2005, 1st edition), section 3.2)

 

Here is a data set that is built into Splus, called "saving.x". Details about the data can be found in Splus via the command "help(saving.x)", which returns the following:

Savings Rates for Countries

SUMMARY:

The saving.x data set is originally from unpublished data of Arlie Sterling. It is a matrix with 50 rows representing countries and 5 columns representing variables. The data contain average saving rates over the time period of 1960-1970; the data are averaged to remove business cycle or other short-term fluctuations. Income is per-capita disposable income in U.S. dollars, growth is the percent rate of change in per capita disposable income, and savings rate is aggregate personal saving divided by disposable income.

SOURCE:

Belsley, D.A., Kuh, E., and Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: John Wiley and Sons, Inc.

 

 

Let us read the data into R:

> saving.x <- read.table("saving.x", header=T) # "header" is a logical value indicating whether the file contains the names of the variables as its first line. If unspecified, it defaults to TRUE if and only if the first row contains one fewer field than the number of columns.
> saving.x # take a look at the data

   Pop15 Pop75    Disp Growth Savings
1  29.35  2.87 2329.68   2.87   11.43
2  23.32  4.41 1507.99   3.93   12.07
3  23.80  4.43 2108.47   3.82   13.17
  ... much deleted ...
50 47.20  0.66  242.69   5.08    4.71

The dataset has 50 observations and 5 variables. The variables are:

Pop15    the percentage of the population under 15
Pop75    the percentage of the population over 75
Disp     per-capita disposable income in US dollars
Growth   the percent rate of change in per-capita disposable income
Savings  aggregate personal saving divided by disposable income

 

Let's assign the variables:

> p15 <- saving.x[,1] # save the 1st column as p15
> p75 <- saving.x[,2] # save the 2nd column as p75
> inc <- saving.x[,3] # save the 3rd column as inc
> gro <- saving.x[,4] # save the 4th column as gro
> sav <- saving.x[,5] # save the 5th column as sav

 

Now, let us fit a simple model, with the variable Savings as the response and the remaining variables as predictors, i.e.,

sav_i = β0 + βp15*p15_i + βp75*p75_i + βinc*inc_i + βgro*gro_i + ε_i,
    i = 1, 2, ..., 50.

In the following, we demonstrate how the t-statistics and overall F-test can be obtained from R, or by doing the calculations directly. 

> g <- lm(sav~p15+p75+inc+gro) # fit the model with sav as response and the rest as predictors
> summary(g) # make a note of the p-values for the overall F-test and the t-test H0: bp15=0

Call:
lm(formula = sav ~ p15 + p75 + inc + gro)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.2423 -2.6859 -0.2491  2.4278  9.7509 

Coefficients:
               Estimate  Std. Error  t value   Pr(>|t|) 
(Intercept)  28.5666100   7.3544986    3.884   0.000334 ***
p15          -0.4612050   0.1446425   -3.189   0.002602 ** 
p75          -1.6915757   1.0835862   -1.561   0.125508 
inc          -0.0003368   0.0009311   -0.362   0.719296 
gro           0.4096998   0.1961961    2.088   0.042468 * 
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

Residual standard error: 3.803 on 45 degrees of freedom
Multiple R-Squared: 0.3385, Adjusted R-squared: 0.2797 
F-statistic: 5.756 on 4 and 45 DF, p-value: 0.0007902 

(Q: There is a great deal of information in the output. What catches your attention? For different datasets, the most important or useful information may lie in different statistics. For this dataset, which statistics/quantities are particularly important?)

Notice that each t value in the coefficient table equals the Estimate divided by its Std. Error (e.g., -0.4612050/0.1446425 ≈ -3.189 for p15), and each Pr(>|t|) is the corresponding two-sided tail probability of a t distribution with 45 degrees of freedom.

Test of all the predictors

Now, let's do the calculation directly. First, the overall F-statistic --- we will perform the test of H0: bp15=bp75=binc=bgro=0 "by hand". For the test,

W: sav = b0 + bp15*p15 + bp75*p75 + binc*inc + bgro*gro + e, and
w: sav = b0 + e.

> ip <- function(x,y) sum(x*y) # write a function for inner product, use "help("function")" to find out how to write a self-defined function in R
> ip(g$res, g$res) # the RSSW

[1] 650.706

> g$df # the residual degrees of freedom of RSSW, i.e., n-p ("df" partially matches the component name "df.residual")

[1] 45

> ip(sav-mean(sav), sav-mean(sav)) # the RSSw, which is TSS in this case

[1] 983.6283

> ((983.6283-650.706)/4)/(650.706/45) # the overall F-statistic, [(RSSw-RSSW)/(p-1)]/[RSSW/(n-p)]

[1] 5.755865

> 1-pf(5.755865, 4, 45) # the p-value of the overall F-test, the command "pf" returns the cdf value of an F distribution

[1] 0.0007902025
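
Equivalently, pf can return the upper-tail probability directly via its lower.tail argument, which avoids the subtraction from 1 (and is numerically safer when the p-value is very small):

> pf(5.755865, 4, 45, lower.tail=FALSE) # the same p-value, computed in the upper tail

[1] 0.0007902025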

 

A convenient way in R to calculate the F-statistic and its p-value is as follows:

> g1 <- lm(sav~1) # fit the null model w
> summary(g1) # take a look

Call:
lm(formula = sav ~ 1)

Residuals:
   Min     1Q Median     3Q    Max
-9.071 -2.701  0.839  2.947 11.429

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   9.6710     0.6336   15.26   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.48 on 49 degrees of freedom

> anova(g1, g) # compare the two models W and w, the command "anova" computes the analysis of variance (or deviance) tables for one or more fitted models. Use "help(anova)" to get more information.

Analysis of Variance Table

Model 1: sav ~ 1
Model 2: sav ~ p15 + p75 + inc + gro
   Res.Df      RSS   Df   Sum of Sq       F     Pr(>F) 
1      49   983.63 
2      45   650.71    4      332.92  5.7559  0.0007902 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

Check if it agrees with the result of the overall F-test in the summary of g.

In many textbooks, you can find the common format of the ANOVA table, as follows. From the above overall F-test, can you fill in its ANOVA table?

Source      Degrees of Freedom    Sum of Squares    Mean Square    F
Regression  p-1                   SSreg             SSreg/(p-1)    [SSreg/(p-1)]/[RSS/(n-p)]
Residual    n-p                   RSS               RSS/(n-p)
Total       n-1 = (p-1)+(n-p)     TSS = SSreg+RSS
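
For this dataset, the table entries follow from quantities already computed above; a minimal sketch:

> n <- 50; p <- 5                  # sample size and number of coefficients
> TSS <- 983.6283; RSS <- 650.706  # from the computations above
> SSreg <- TSS - RSS               # regression sum of squares, 332.92
> c(SSreg/(p-1), RSS/(n-p))        # the two mean squares
> (SSreg/(p-1))/(RSS/(n-p))        # F = 5.755865, as before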

 


Testing just one predictor

Now, let's do the t-test, H0: bp15=0, by hand. For the test,

W: sav = b0 + bp15*p15 + bp75*p75 + binc*inc + bgro*gro + e, and
w: sav = b0 + bp75*p75 + binc*inc + bgro*gro + e.

> g2 <- lm(sav~p75+inc+gro) # fit the model without p15
> ip(g2$res, g2$res) # the RSSw

[1] 797.7234

> (797.7234-650.706)/(650.706/45) # the F-statistic, [(RSSw-RSSW)/1]/[RSSW/(n-p)]

[1] 10.16708

> sqrt(10.16708) # the t-statistic is the square root of the F-statistic

[1] 3.188586

Check if it agrees with the t value in the summary of g (except for the sign).

> 2*(1-pt(3.188586, 45)) # compute the tail probability --- the command "pt" returns the cdf value of a t distribution. (Q: why multiplied by 2? Ans: it's a two-sided test)

[1] 0.002602461

It should agree with the p-value in summary of g.
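
The same t-statistic and p-value can also be read off programmatically from the coefficient table of g (as printed by summary(g)); a minimal sketch:

> ctab <- coef(summary(g))   # the coefficient table as a matrix
> ctab["p15", "t value"]     # -3.188586, agreeing (up to sign) with the value above
> ctab["p15", "Pr(>|t|)"]    # 0.002602461, the two-sided p-value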

> anova(g2, g) # a convenient way to compare two models, one nested in another

Analysis of Variance Table

Model 1: sav ~ p75 + inc + gro
Model 2: sav ~ p15 + p75 + inc + gro
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)  
1     46 797.72                               
2     45 650.71  1    147.02 10.167 0.002602 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 


Testing a pair of predictors

From the summary of g, we can see that the t-tests for bp75 and binc are both insignificant (Q: does this mean that both p75 and inc can be directly eliminated from the model?). We may therefore be interested in testing the pair of predictors: H0: bp75=binc=0. For the test,

W: sav = b0 + bp15*p15 + bp75*p75 + binc*inc + bgro*gro + e, and
w: sav = b0 + bp15*p15 + bgro*gro + e.

> g3 <- lm(sav~p15+gro) # fit the model without p75 and inc
> anova(g3, g) # a convenient way to compare two models, one nested in another

Analysis of Variance Table

Model 1: sav ~ p15 + gro
Model 2: sav ~ p15 + p75 + inc + gro
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     47 700.55                          
2     45 650.71  2     49.84 1.7234 0.1900

Because the p-value is not extreme, there is not enough evidence to reject H0: bp75=binc=0 when p15 and gro are in the model.
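
The F-statistic in this table can likewise be checked by hand from the two residual sums of squares (two constraints, hence 2 degrees of freedom in the numerator):

> ((700.55-650.71)/2)/(650.71/45) # F = 1.7234, matching the anova output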

 


Testing a subspace w  

We might be interested in testing the hypothesis that the effect of young and old people on the savings rate (with estimated values -0.4612 and -1.6915, respectively, in the summary of g) is the same, i.e., test H0: bp15=bp75. For the test,

W: sav = b0 + bp15*p15 + bp75*p75 + binc*inc + bgro*gro + e, and
w: sav = b0 + bp15*(p15+p75) + binc*inc + bgro*gro + e.

> g4 <- lm(sav~I(p15+p75)+inc+gro) # The function "I()" makes p15+p75 be evaluated arithmetically, as a single summed predictor, rather than interpreted as two separate terms in the model formula
> anova(g4, g) # a convenient way to compare two models, one nested in another

Analysis of Variance Table

Model 1: sav ~ I(p15 + p75) + inc + gro
Model 2: sav ~ p15 + p75 + inc + gro
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     46 673.62                          
2     45 650.71  1     22.92 1.5849 0.2146

We find that the p-value, 0.2146, is not small, so there is not enough evidence to reject H0: bp15=bp75; the data are consistent with the two age groups having the same effect on the savings rate.
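
As before, the F-statistic can be reproduced by hand from the two residual sums of squares (one constraint, hence 1 degree of freedom in the numerator):

> ((673.62-650.71)/1)/(650.71/45) # F = 1.5843, matching the anova output up to rounding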

 


Testing a subset w   

Suppose that we want to test whether one of the coefficients can be set to a particular value. For example, test H0: bgro=1 (the estimated value of bgro is 0.4097 in the summary of g). For the test,

W: sav = b0 + bp15*p15 + bp75*p75 + binc*inc + bgro*gro + e, and
w: sav = b0 + bp15*p15 + bp75*p75 + binc*inc + gro + e.

Notice that now in w there is a fixed coefficient on the gro term. Such a fixed term in the regression equation is called an offset.

> g5 <- lm(sav~p15+p75+inc+offset(gro)) # An offset is a term added to the linear predictor with a known, fixed coefficient of 1. The function "offset" includes such a term in the model
> anova(g5, g) # a convenient way to compare two models, one nested in another

Analysis of Variance Table

Model 1: sav ~ p15 + p75 + inc + offset(gro)
Model 2: sav ~ p15 + p75 + inc + gro
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)  
1     46 781.61                               
2     45 650.71  1    130.90 9.0524 0.004286 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that the p-value is small, so the null hypothesis H0: bgro=1 is rejected.

 

An alternative is to use a t-test:

> (0.4096998-1)/0.1961961 # t-statistic (Q: where in the summary of g can you find the estimate of bgro and its standard error?)

[1] -3.008725

> 2*pt(-3.008725, 45) # p-value

[1] 0.004286205

> (-3.008725)^2 # agree with the F-statistic?

[1] 9.052426
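
Rather than typing the numbers in by hand, the estimate and standard error of bgro can be pulled from the coefficient table of g; a minimal sketch:

> est <- coef(summary(g))["gro", "Estimate"]    # 0.4096998
> se  <- coef(summary(g))["gro", "Std. Error"]  # 0.1961961
> (est - 1)/se                                  # -3.008725, as above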

 

Note that in the model g5, there is an offset term (i.e., gro). To fit a linear model with an offset, can we use the response minus the offset (i.e., sav-gro) as a new response? That is, fit the model

sav - gro = b0 + bp15*p15 + bp75*p75 + binc*inc + e.

Let us check how R handles the two models.

> g6 <- lm(I(sav-gro)~p15+p75+inc) # fit a model using the same predictors and a new response which equals the original response minus the offset 

> summary(g6) # take a look at the fitted model g6

Call:
lm(formula = I(sav - gro) ~ p15 + p75 + inc)

Residuals:
     Min       1Q   Median       3Q      Max
-10.1949  -2.1483  -0.2972   2.4888   8.8658

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept) 24.4005843  7.8296970   3.116  0.00315 **
p15         -0.4166095  0.1559668  -2.671  0.01042 *
p75         -1.8699958  1.1728456  -1.594  0.11769  
inc          0.0003790  0.0009759   0.388  0.69954  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.122 on 46 degrees of freedom
Multiple R-Squared: 0.2208,     Adjusted R-squared:  0.17
F-statistic: 4.346 on 3 and 46 DF, p-value: 0.008888

> summary(g5) # take a look at the fitted model g5

Call:
lm(formula = sav ~ p15 + p75 + inc + offset(gro))

Residuals:
     Min       1Q   Median       3Q      Max
-10.1949  -2.1483  -0.2972   2.4888   8.8658

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept) 24.4005843  7.8296970   3.116  0.00315 **
p15         -0.4166095  0.1559668  -2.671  0.01042 *
p75         -1.8699958  1.1728456  -1.594  0.11769  
inc          0.0003790  0.0009759   0.388  0.69954  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.122 on 46 degrees of freedom
Multiple R-Squared: 0.4523,     Adjusted R-squared: 0.4166
F-statistic: 12.66 on 3 and 46 DF,  p-value: 3.67e-06

Compare the summary outputs of g6 and g5: find which values are identical and which are different, and explain why.

> cbind(g5$fit, g6$fit, g6$fit+gro) # compare the fitted values under g5 and g6 models

        [,1]      [,2]      [,3]

1  10.559130  7.689130 10.559130

2  10.940081  7.010081 10.940081

3  10.820283  7.000283 10.820283

 ... much deleted ...

50  8.674396  3.594396  8.674396

> sum(g5$fit*g5$res) # the inner product of fitted values and residuals under g5.

[1] -221.7506

Note that in g5, the fitted values and residuals are not orthogonal: the fitted values include the offset gro, which does not lie in the space spanned by the intercept, p15, p75, and inc.

> sum(g6$fit*g6$res) # the inner product of fitted values and residuals under g6.

[1] -2.220446e-14

Note that in g6, the fitted values and residuals are orthogonal.
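
To see where the non-orthogonality in g5 comes from, note that its residuals are still orthogonal to the intercept and the fitted predictor columns, but not to the offset; a small check:

> sum(g5$res)       # ~0: residuals orthogonal to the intercept
> sum(p15*g5$res)   # ~0: and to each fitted predictor column
> sum(gro*g5$res)   # -221.7506: but not to the offset; since g5$fit = g6$fit + gro, this equals the inner product above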


For your information, to test a general linear hypothesis H0: Ab = c in R, you can use the "linear.hypothesis" command in the "car" package; see its documentation for more information.
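
For instance, the subspace test H0: bp15=bp75 from above can be reproduced this way; a sketch assuming a recent version of car, in which the function is spelled linearHypothesis (older versions used linear.hypothesis):

> library(car)
> linearHypothesis(g, "p15 = p75") # should reproduce F = 1.5849 on 1 and 45 DF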