The data for this example consist of x = nave height and y = total length (in feet) for English medieval cathedrals. Some are in the Romanesque (r) style and others are in the Gothic (g) style. Some cathedrals have parts in both styles and are listed twice.
We wish to investigate how the length is related to height for the two styles. Read in the data and summarize the two styles separately.
> cathedral <- read.table("cathedral.data")
> cathedral # take a look
style x y
Durham r 75 502
Canterbury r 80 522
...deleted...
Old.St.Paul g 103 611
Salisbury g 84 473
> lapply(split(cathedral, cathedral$style),summary) # "split" divides the data into groups; "lapply" applies a function (in this case, "summary") over a list of data (in this case, the two groups defined by style)
$g
style x y
g:16 Min. : 45.00 Min. :182.0
r: 0 1st Qu.: 60.75 1st Qu.:298.8
Median : 73.50 Median :412.0
Mean : 74.94 Mean :397.4
3rd Qu.: 86.50 3rd Qu.:481.3
Max. :103.00 Max. :611.0
$r
style x y
g:0 Min. :64.00 Min. :344.0
r:9 1st Qu.:70.00 1st Qu.:425.0
Median :75.00 Median :502.0
Mean :74.44 Mean :475.4
3rd Qu.:80.00 3rd Qu.:530.0
Max. :83.00 Max. :551.0
Notice that in this case the two groups have about the same average heights.
Look at the difference in the lengths.
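The split/lapply idiom generalizes to any grouped data frame. Here is a minimal sketch on made-up data (the names d, grp, and val are hypothetical, not from cathedral.data):

```r
# Hypothetical toy data: two groups, one numeric measurement
d <- data.frame(grp = c("a", "a", "b", "b"), val = c(1, 2, 10, 20))

# split() divides the rows into a list of per-group data frames;
# lapply() then applies a function to each list element.
by.grp <- split(d, d$grp)
lapply(by.grp, function(df) mean(df$val))   # $a = 1.5, $b = 15
```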
Now plot the data:
> plot(cathedral$x,cathedral$y,type="n",xlab="Nave height",ylab="Length")
> text(cathedral$x,cathedral$y,as.character(cathedral$style))
Q: What information have you gathered from the summary and the plot?
Before fitting a model, examine how R treats the data:
> is.factor(cathedral$style); is.numeric(cathedral$x); is.numeric(cathedral$y)
style is treated as a qualitative variable and x and y as quantitative variables.
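If a grouping variable ever comes in as plain character (recent versions of R no longer convert strings to factors automatically), you can convert it yourself. A small sketch with a hypothetical vector:

```r
# Hypothetical grouping vector read in as character strings
s <- c("r", "r", "g")
is.factor(s)      # FALSE
s <- factor(s)    # convert explicitly
is.factor(s)      # TRUE
levels(s)         # "g" "r" -- levels are ordered alphabetically
```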
Let's start by fitting the full model, i.e., a separate regression line for each group.
> g <- lm(y ~ x+style+x:style, data=cathedral)
To understand the meaning of "x:style", check "help(formula)". Notice that the model can be equivalently written as "y ~ x*style" in R.
> summary(g)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    37.111     85.675   0.433 0.669317
x               4.808      1.112   4.322 0.000301 ***
styler        204.722    347.207   0.590 0.561733
x:styler       -1.669      4.641  -0.360 0.722657
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 79.11 on 21 degrees of freedom
Multiple R-Squared: 0.5412, Adjusted R-squared: 0.4757
F-statistic: 8.257 on 3 and 21 DF, p-value: 0.0008072
Q: Do you know how to interpret the estimated parameters in the output?
You must first understand what coding you are using.
Because style is non-numeric, R automatically treats it as a qualitative variable and sets up a coding - but which coding?
> model.matrix(g)
             (Intercept)   x styler x:styler
Durham                 1  75      1       75
Canterbury             1  80      1       80
Gloucester             1  68      1       68
Hereford               1  64      1       64
Norwich                1  83      1       83
Peterborough           1  80      1       80
St.Albans              1  70      1       70
Winchester             1  76      1       76
Ely                    1  74      1       74
York                   1 100      0        0
Bath                   1  75      0        0
Bristol                1  52      0        0
Chichester             1  62      0        0
Exeter                 1  68      0        0
GloucesterG            1  86      0        0
Lichfield              1  57      0        0
Lincoln                1  82      0        0
NorwichG               1  72      0        0
Ripon                  1  88      0        0
Southwark              1  55      0        0
Wells                  1  67      0        0
St.Asaph               1  45      0        0
WinchesterG            1 103      0        0
Old.St.Paul            1 103      0        0
Salisbury              1  84      0        0
attr(,"assign")
[1] 0 1 2 3
attr(,"contrasts")
attr(,"contrasts")$style
[1] "contr.treatment"
The Romanesque (r) style is coded as 1 and the Gothic (g) style is coded as 0.
That is, Gothic style is treated as the reference.
This coding is assigned by the contrast function "contr.treatment" in R.
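You can display the coding matrix directly. For a two-level factor, contr.treatment produces a single dummy column, with the first (alphabetically first) level as the all-zero reference row:

```r
# Treatment coding for a factor with 2 levels: row 1 is the reference
contr.treatment(2)
#   2
# 1 0
# 2 1
```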
Suppose you would like to code the two groups as 1 and -1 instead. You can use the contrast function "contr.sum" and the following command:
> options(contrasts=c("contr.sum","contr.poly")) # use "help(options)" to see how contrasts are assigned
Let's fit the full model again under (-1, 1) coding:
> g <- lm(y ~ x+style+x:style, data=cathedral)
> summary(g)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  139.4718   173.6037   0.803    0.431
x              3.9728     2.3206   1.712    0.102
style1      -102.3610   173.6037  -0.590    0.562
x:style1       0.8347     2.3206   0.360    0.723
Residual standard error: 79.11 on 21 degrees of freedom
Multiple R-Squared: 0.5412, Adjusted R-squared: 0.4757
F-statistic: 8.257 on 3 and 21 DF, p-value: 0.0008072
Compare to the result under (0, 1) coding.
What's different and what's the same? Can you explain why?
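A key point: the two codings span the same column space, so the fitted values, residuals, R-squared, and overall F are identical; only the individual parameters (and their t-tests) change meaning. A quick check on made-up data, since cathedral.data itself is not reproduced here:

```r
# Hypothetical data: numeric x, two-level factor f, response y
set.seed(1)
x <- runif(20)
f <- factor(rep(c("g", "r"), 10))
y <- x + (f == "r") + rnorm(20)
d <- data.frame(x, f, y)

# Same model, two codings, requested per-fit via lm's contrasts argument
fit.trt <- lm(y ~ x * f, data = d, contrasts = list(f = "contr.treatment"))
fit.sum <- lm(y ~ x * f, data = d, contrasts = list(f = "contr.sum"))

all.equal(fitted(fit.trt), fitted(fit.sum))   # TRUE: identical fits
coef(fit.trt)                                 # but different parameterizations
coef(fit.sum)
```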
Let's take a look at the model matrix under (-1, 1) coding:
> model.matrix(g)
             (Intercept)   x style1 x:style1
Durham                 1  75     -1      -75
Canterbury             1  80     -1      -80
Gloucester             1  68     -1      -68
Hereford               1  64     -1      -64
Norwich                1  83     -1      -83
Peterborough           1  80     -1      -80
St.Albans              1  70     -1      -70
Winchester             1  76     -1      -76
Ely                    1  74     -1      -74
York                   1 100      1      100
Bath                   1  75      1       75
Bristol                1  52      1       52
Chichester             1  62      1       62
Exeter                 1  68      1       68
GloucesterG            1  86      1       86
Lichfield              1  57      1       57
Lincoln                1  82      1       82
NorwichG               1  72      1       72
Ripon                  1  88      1       88
Southwark              1  55      1       55
Wells                  1  67      1       67
St.Asaph               1  45      1       45
WinchesterG            1 103      1      103
Old.St.Paul            1 103      1      103
Salisbury              1  84      1       84
attr(,"assign")
[1] 0 1 2 3
attr(,"contrasts")
attr(,"contrasts")$style
[1] "contr.sum"
Now, let's go back to (0, 1) coding:
> options(contrasts=c("contr.treatment","contr.poly"))
We see that the full model under (0, 1) coding can be simplified to:
> g1 <- lm(y ~ x+style, data=cathedral)
> summary(g1)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   44.298     81.648   0.543   0.5929
x              4.712      1.058   4.452   0.0002 ***
styler        80.393     32.306   2.488   0.0209 *
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 77.53 on 22 degrees of freedom
Multiple R-Squared: 0.5384, Adjusted R-squared: 0.4964
F-statistic: 12.83 on 2 and 22 DF, p-value: 0.0002028
Put the two parallel regression lines on the plot:
> abline(44.298,4.7116)
> abline(44.298+80.393,4.7116,lty=2)
Check out the diagnostics:
> par(mfrow=c(2,2))
> plot(g)
The check on the diagnostics reveals no particular problems.
Our conclusion is that, for cathedrals of the same height, Romanesque ones are about 80 feet longer.
For each extra foot height, both types of cathedral are about 4.7 feet longer.
Suppose you want to perform an analysis of covariance. You can do it as follows:
> g2 <- lm(y~x, cathedral)
> anova(g1, g2)
Analysis of Variance Table
Model 1: y ~ x + style
Model 2: y ~ x
Res.Df RSS Df Sum of Sq F Pr(>F)
1 22 132223
2 23 169440 -1 -37217 6.1924 0.02089 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
But, of course, you know it will be the same as the t-test for style in model g1.
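This is the usual F = t^2 identity for a one-parameter comparison. A quick check on made-up data (the data frame below is hypothetical; cathedral.data is not reproduced here):

```r
# Hypothetical data with the same structure: numeric x, two-level style
set.seed(2)
d <- data.frame(x = runif(25),
                style = factor(rep(c("g", "r"), length.out = 25)))
d$y <- 2 * d$x + (d$style == "r") + rnorm(25)

g1 <- lm(y ~ x + style, data = d)
g2 <- lm(y ~ x, data = d)

t.val <- summary(g1)$coefficients["styler", "t value"]
f.val <- anova(g1, g2)$F[2]
all.equal(f.val, t.val^2)   # TRUE: the anova F equals the squared t
```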
Polytomous predictor (Reading: Faraway (2005), 13.2, 13.3)
The data for this example come from a 1966 paper by Cyril Burt entitled "The genetic determination of differences in intelligence: A study of monozygotic twins reared apart".
The data consist of IQ scores for identical twins, one raised by foster parents, the other by the natural parents.
We also know the social class of natural parents (high, middle or low).
We are interested in predicting the IQ of the twin with foster parents from the IQ of the twin with the natural parents and the social class of natural parents.
Let's read in and take a look at the data:
> twins <- read.table("twin.data",header=T)
> twins
Foster Biological Social
1 82 82 high
...deleted...
7 132 131 high
8 71 78 middle
...deleted...
13 111 107 middle
14 63 68 low
...deleted...
27 98 111 low
> plot(twins$B,twins$F,type="n",xlab="Biological IQ",ylab="Foster IQ")
> text(twins$B,twins$F,substring(as.character(twins$S),1,1))
From the plot, what model seems appropriate?
The most general model we'll consider fits a separate line for each group:
> g <- lm(Foster ~ Biological*Social, twins)
> summary(g)
Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)             -1.872044  17.808264  -0.105    0.917
Biological               0.977562   0.163192   5.990 6.04e-06 ***
Sociallow                9.076654  24.448704   0.371    0.714
Socialmiddle             2.688068  31.604178   0.085    0.933
Biological:Sociallow    -0.029140   0.244580  -0.119    0.906
Biological:Socialmiddle -0.004995   0.329525  -0.015    0.988
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 7.921 on 21 degrees of freedom
Multiple R-Squared: 0.8041, Adjusted R-squared: 0.7574
F-statistic: 17.24 on 5 and 21 DF, p-value: 8.31e-07
Q: What do those coefficients mean?
You must know the coding to determine this:
> model.matrix(g)
   (Intercept) Biological Sociallow Socialmiddle Biological:Sociallow Biological:Socialmiddle
1            1         82         0            0                    0                       0
2            1         90         0            0                    0                       0
3            1         91         0            0                    0                       0
4            1        115         0            0                    0                       0
5            1        115         0            0                    0                       0
6            1        129         0            0                    0                       0
7            1        131         0            0                    0                       0
8            1         78         0            1                    0                      78
9            1         79         0            1                    0                      79
10           1         82         0            1                    0                      82
11           1         97         0            1                    0                      97
12           1        100         0            1                    0                     100
13           1        107         0            1                    0                     107
14           1         68         1            0                   68                       0
15           1         73         1            0                   73                       0
16           1         81         1            0                   81                       0
17           1         85         1            0                   85                       0
18           1         87         1            0                   87                       0
19           1         87         1            0                   87                       0
20           1         93         1            0                   93                       0
21           1         94         1            0                   94                       0
22           1         95         1            0                   95                       0
23           1         97         1            0                   97                       0
24           1         97         1            0                   97                       0
25           1        103         1            0                  103                       0
26           1        106         1            0                  106                       0
27           1        111         1            0                  111                       0
attr(,"assign")
[1] 0 1 2 2 3 3
attr(,"contrasts")
attr(,"contrasts")$Social
[1] "contr.treatment"
We can see that the reference is "high" class, being first alphabetically.
The intercept and slope for the "low" class line would be (-1.872044 + 9.076654) and (0.977562 - 0.029140); the intercept and slope for the "middle" class line would be (-1.872044 + 2.688068) and (0.977562 - 0.004995).
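These group-specific intercepts and slopes can be computed directly from coef(). A sketch on made-up three-group data (twin.data itself is not reproduced here, so the numbers differ, but the coefficient names follow the same pattern):

```r
# Hypothetical data mirroring the twins structure: 3 groups, one predictor
set.seed(3)
d <- data.frame(x = runif(30),
                grp = factor(rep(c("high", "low", "middle"), each = 10)))
d$y <- d$x + as.numeric(d$grp) + rnorm(30, sd = 0.1)

g <- lm(y ~ x * grp, data = d)
b <- coef(g)

# Reference ("high") line: intercept b["(Intercept)"], slope b["x"].
# The "low" line adds its own offsets to both:
int.low <- unname(b["(Intercept)"] + b["grplow"])
slp.low <- unname(b["x"] + b["x:grplow"])

# In the separate-lines model these agree exactly with a per-group fit:
coef(lm(y ~ x, data = d[d$grp == "low", ]))
c(int.low, slp.low)
```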
Now see if the model can be simplified to the parallel lines model:
> gp <- lm(Foster ~ Biological+Social, twins)
> anova(gp,g)
Analysis of Variance Table
Model 1: Foster ~ Biological + Social
Model 2: Foster ~ Biological + Social + Biological:Social
Res.Df RSS Df Sum of Sq F Pr(>F)
1 23 1318.40
2 21 1317.47 2 0.93 0.0074 0.9926
Yes it can.
The sequential testing can be done in one go:
> anova(g)
Analysis of Variance Table
Response: Foster
                  Df Sum Sq Mean Sq F value   Pr(>F)
Biological         1 5231.1  5231.1 83.3823 9.28e-09 ***
Social             2  175.1    87.6  1.3958   0.2697
Biological:Social  2    0.9     0.5  0.0074   0.9926
Residuals         21 1317.5    62.7
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
We see that a further reduction to a single line model is possible:
> gs <- lm(Foster ~ Biological, twins)
> anova(gs,gp)
Analysis of Variance Table
Model 1: Foster ~ Biological
Model 2: Foster ~ Biological + Social
Res.Df RSS Df Sum of Sq F Pr(>F)
1 25 1493.53
2 23 1318.40 2 175.13 1.5276 0.2383
This is the analysis of covariance, which compares the single-line model (null hypothesis) against the parallel-lines model (alternative hypothesis).
Add the regression line to the plot:
> abline(gs$coef)
Check the diagnostics:
> par(mfrow=c(2,2))
> plot(gs)
It shows no cause for concern. The final model is:
> summary(gs)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.20760    9.29990   0.990    0.332
Biological   0.90144    0.09633   9.358 1.20e-09 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 7.729 on 25 degrees of freedom
Multiple R-Squared: 0.7779, Adjusted R-squared: 0.769
F-statistic: 87.56 on 1 and 25 DF, p-value: 1.204e-09
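With the single-line model in hand, prediction is just the fitted equation. A sketch applying the reported coefficients by hand (the helper function foster.hat is hypothetical; with the fitted object gs you would instead call predict(gs, data.frame(Biological = 100))):

```r
# Fitted equation from the summary above:
# Foster ~= 9.20760 + 0.90144 * Biological
foster.hat <- function(biological) 9.20760 + 0.90144 * biological

foster.hat(100)   # 99.3516: predicted foster IQ at biological IQ 100
```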
Burt was interested in demonstrating the importance of heredity over environment in intelligence, and these data certainly point that way (although it would be helpful to know the social class of the foster parents).
However, before jumping to any conclusions, one should note that there is now considerable evidence that Cyril Burt invented some of his data on identical twins.
In light of this, can you see anything in the above analysis that might lead one to suspect these data?
In the analysis, we adopted the treatment coding (it makes the parameters easier to interpret):
> contr.treatment(3)
2 3
1 0 0
2 1 0
3 0 1
There are various codings you can choose if you feel they are more reasonable for your data, such as the Helmert coding:
> contr.helmert(3)
  [,1] [,2]
1   -1   -1
2    1   -1
3    0    2
and the sum coding:
> contr.sum(3)
[,1] [,2]
1 1 0
2 0 1
3 -1 -1
Once you decide your coding, you can use it by changing the "contrasts" setting in "options".
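Besides the global options() setting, a contrast can be attached to a single factor, or requested for one fit only through lm's contrasts argument. A minimal sketch with hypothetical data:

```r
# Hypothetical three-level factor
f <- factor(c("a", "b", "c", "a", "b", "c"))

# Attach Helmert coding to this factor only; global options are untouched
contrasts(f) <- contr.helmert(3)
contrasts(f)

# Or request a coding for a single fit via lm's "contrasts" argument
d <- data.frame(f = f, y = rnorm(6))
g <- lm(y ~ f, data = d, contrasts = list(f = "contr.sum"))
coef(g)   # coefficients named f1, f2 under sum coding
```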