Mean Structure for Qualitative Predictors


Dichotomous Predictor and Analysis of Covariance (Reading: Faraway (2005, 1st edition), 13.1)

The data for this example consist of

We wish to investigate how the length is related to height for the two styles. Read in the data and make a summary on the two styles separately.

> cathedral <- read.table("cathedral.data")
> cathedral # take a look

             style   x   y

Durham           r  75 502

Canterbury       r  80 522

...deleted...

Old.St.Paul      g 103 611

Salisbury        g  84 473

> lapply(split(cathedral, cathedral$style),summary) # "split" divides the data into groups; "lapply" applies a function (in the case, the "summary") over a list of data (in the case, the two groups of data defined by style)

$g

 style        x                y       

 g:16   Min.   : 45.00   Min.   :182.0 

 r: 0   1st Qu.: 60.75   1st Qu.:298.8 

        Median : 73.50   Median :412.0 

        Mean   : 74.94   Mean   :397.4 

        3rd Qu.: 86.50   3rd Qu.:481.3 

        Max.   :103.00   Max.   :611.0 

 

$r

 style       x               y       

 g:0   Min.   :64.00   Min.   :344.0 

 r:9   1st Qu.:70.00   1st Qu.:425.0 

       Median :75.00   Median :502.0 

       Mean   :74.44   Mean   :475.4 

       3rd Qu.:80.00   3rd Qu.:530.0 

       Max.   :83.00   Max.   :551.0

Now plot the data:

> plot(cathedral$x,cathedral$y,type="n",xlab="Nave height",ylab="Length")
> text(cathedral$x,cathedral$y,as.character(cathedral$s))

Before fitting model, examine how R treat the data: 

> is.factor(cathedral$style); is.numeric(cathedral$x); is.numeric(cathedral$y) 

 

Let's start from fitting the full model, i.e.,  separate regression lines for each groups. 

> g <- lm(y ~ x+style+x:style, data=cathedral) # to understand the meaning of "x:style", check "help(formula)". Notice that the model can be equivalently written as "y~x*style" in R.
> summary(g)

Coefficients:

            Estimate Std. Error t value Pr(>|t|)   

(Intercept)   37.111     85.675   0.433 0.669317   

x              4.808      1.112   4.322 0.000301 ***

styler       204.722    347.207   0.590 0.561733   

x:styler      -1.669      4.641  -0.360 0.722657   

---

Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

 

Residual standard error: 79.11 on 21 degrees of freedom

Multiple R-Squared: 0.5412,     Adjusted R-squared: 0.4757

F-statistic: 8.257 on 3 and 21 DF,  p-value: 0.0008072

 

Now, let's go back to (0, 1) coding:

> options(contrasts=c("contr.treatment","contr.poly"))

We see that the full model under (0, 1) coding can be simplified to:

> g1 <- lm(y ~ x+style, data=cathedral)
> summary(g1)

Coefficients:

            Estimate Std. Error t value Pr(>|t|)   

(Intercept)   44.298     81.648   0.543   0.5929   

x              4.712      1.058   4.452   0.0002 ***

styler        80.393     32.306   2.488   0.0209 * 

---

Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

 

Residual standard error: 77.53 on 22 degrees of freedom

Multiple R-Squared: 0.5384,     Adjusted R-squared: 0.4964

F-statistic: 12.83 on 2 and 22 DF,  p-value: 0.0002028

Put the two parallel regression on the plot:

> abline(44.298,4.7116)
> abline(44.298+80.393,4.7116,lty=2)

Check out the diagnostics:

> par(mfrow=c(2,2))
> plot(g)

 

Suppose you want to perform an analysis of covariance. You can do it as follows:

> g2 <- lm(y~x, cathedral)
> anova(g1, g2)

Analysis of Variance Table

 

Model 1: y ~ x + style

Model 2: y ~ x

  Res.Df    RSS Df Sum of Sq      F  Pr(>F) 

1     22 132223                             

2     23 169440 -1    -37217 6.1924 0.02089 *

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 


Polytomous predictor (Reading: Faraway (2005), 13.2, 13.3)

 

Let's read in and take a look at the data:

> twins <- read.table("twin.data",header=T)
> twins 

   Foster Biological Social

1      82         82   high

...deleted...

7     132        131   high

8      71         78 middle

...deleted...

13    111        107 middle

14     63         68    low

...deleted...

27     98        111    low

> plot(twins$B,twins$F,type="n",xlab="Biological IQ",ylab="Foster IQ")
> text(twins$B,twins$F,substring(as.character(twins$S),1,1))

The most general model we'll consider is to fit separate lines for each groups:

> g <- lm(Foster ~ Biological*Social, twins)
> summary(g)

Coefficients:

                         Estimate Std. Error t value Pr(>|t|)   

(Intercept)             -1.872044  17.808264  -0.105    0.917   

Biological               0.977562   0.163192   5.990 6.04e-06 ***

Sociallow                9.076654  24.448704   0.371    0.714    

Socialmiddle             2.688068  31.604178   0.085    0.933   

Biological:Sociallow    -0.029140   0.244580  -0.119    0.906   

Biological:Socialmiddle -0.004995   0.329525  -0.015    0.988   

---

Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

 

Residual standard error: 7.921 on 21 degrees of freedom

Multiple R-Squared: 0.8041,     Adjusted R-squared: 0.7574

F-statistic: 17.24 on 5 and 21 DF,  p-value: 8.31e-07

 

Plot the regression line on the plot:

> abline(gs$coef)

Check the diagnostics:

> par(mfrow=c(2,2))
> plot(gr)
It shows no cause for concern. The final model is:

> summary(gs)

Coefficients:

            Estimate Std. Error t value Pr(>|t|)   

(Intercept)  9.20760    9.29990   0.990    0.332   

Biological   0.90144    0.09633   9.358 1.20e-09 ***

---

Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

 

Residual standard error: 7.729 on 25 degrees of freedom

Multiple R-Squared: 0.7779,     Adjusted R-squared: 0.769

F-statistic: 87.56 on 1 and 25 DF,  p-value: 1.204e-09

 

In the analysis, we adopt the treatment coding (it's easier for interpreting the parameters):

> contr.treatment(3)

  2 3

1 0 0

2 1 0

3 0 1

There are various codings you can choose if you fell they are more reasonable for your data, such as the Helmert coding:

> contr.helmert(3)

  [,1] [,2] 
1   -1   -1
2    1   -1
3    0    2

and the sum coding:

> contr.sum(3)

  [,1] [,2]

1    1    0

2    0    1

3   -1   -1

Once you decide your coding, you can use it by changing the "contrasts" setting in "options".