The data for this example consist of
x= nave height and
y = total length in feet for English medieval cathedrals.
Some are in the Romanesque (r) style and others are in the Gothic (g) style. Some cathedrals have parts in both styles and are listed twice.
We wish to investigate how the length is related to height for the two styles. Read in the data and make a summary on the two styles separately.
cathedral <- read.table("cathedral.data")
> cathedral # take a look
> lapply(split(cathedral, cathedral$style),summary) # "split" divides the data into groups; "lapply" applies a function (in the case, the "summary") over a list of data (in the case, the two groups of data defined by style)
Notice that in this case the two groups have about the same average heights.
Look at the difference in the lengths.
Now plot the data:
Q: What information have you gathered from the summary and the plot?
Before fitting model, examine how R treat the data:
> is.factor(cathedral$style); is.numeric(cathedral$x); is.numeric(cathedral$y)
style is treated as a qualitative variable and x and y as quantitative variables.
Let's start from fitting the full model, i.e., separate regression lines for each groups.
g <- lm(y ~ x+style+x:style, data=cathedral) # to understand the meaning of
"x:style", check "help(formula)". Notice that the model can be equivalently
written as "y~x*style" in R.
Q: Do you know how to interpret the estimated parameters in the output?
You must first understand what coding you are using.
Because style is non-numeric, R automatically treats it as a qualitative variables and sets up a coding - but which coding?
The Romanesque (r) style is coded as 1 and the Gothic (g) style is coded as 0.
That is, Gothic style is treated as the reference.
This coding is assigned by the contrast function "contr.treatment" in R.
Suppose you would like to code the two groups as 1 and -1, individually. You can use the contrast function "contr.sum" and the following command:
> options(contrasts=c("contr.sum","contr.poly")) # use "help(options)" to understand how to assign contrast
Let's fit the full model again under (-1, 1) coding:
g <- lm(y ~ x+style+x:style, data=cathedral)
Compare to the result under (0, 1) coding.
What's different and what's the same? Can you explain why?
Let's take a look of the model matrix under (-1, 1) coding:
Now, let's go back to (0, 1) coding:
We see that the full model under (0, 1) coding can be simplified to:
g1 <- lm(y ~ x+style, data=cathedral)
Put the two parallel regression on the plot:
Check out the diagnostics:
The check on the diagnostics reveals no particular problems.
Our conclusion is that for cathedrals of the same height, Romanesque ones are about 80 (=2*40.2) feet longer.
For each extra foot height, both types of cathedral are about 4.7 feet longer.
Suppose you want to perform an analysis of covariance. You can do it as follows:
g2 <- lm(y~x, cathedral)
> anova(g1, g2)
But, of course, you know it will be the same as the t-test for style in model g1.
Polytomous predictor (Reading: Faraway (2005), 13.2, 13.3)
The data for this example come from a 1966 paper by Cyril Burt entitled "The genetic determination of differences in intelligence: A study of monozygotic twins reared apart".
The data consist of IQ scores for identical twins, one raised by foster parents, the other by the natural parents.
We also know the social class of natural parents (high, middle or low).
We are interested in predicting the IQ of the twin with foster parents from the IQ of the twin with the natural parents and the social class of natural parents.
Let's read in and take a look at the data:
twins <- read.table("twin.data",header=T)
plot(twins$B,twins$F,type="n",xlab="Biological IQ",ylab="Foster IQ")
From the plot, what model seems appropriate?
The most general model we'll consider is to fit separate lines for each groups:
g <- lm(Foster ~ Biological*Social, twins)
Q: What do those coefficients mean?
Must know the coding to determine this:
We can see that the reference is "high" class, being first alphabetically.
The intercept and slope for "low" class line would be (-1.872044 + 9.076654) and (0.977562 -0.029140)
the intercept and slope for "middle" class line would be (-1.872044 + 2.688068) and (0.977562 -0.004995).
Now see if the model can be simplified to the parallel lines model:
gp<- lm(Foster ~ Biological+Social, twins)
Yes it can.
The sequential testing can be done in one go:
We see that a further reduction to a single line model is possible:
> gs <- lm(Foster ~ Biological, twins)
This is the result of analysis of covariance, which compare single line model (null hypothesis) v.s. parallel lines model (alternative hypothesis).
Plot the regression line on the plot:
Check the diagnostics:
It shows no cause for concern. The final model is:
Burt was interested in demonstrating the importance of heredity over environment in intelligence and this data certainly point that way. (Although it would be helpful to know the social class of the foster parents)
However, before jumping to any conclusions, one should note that there is now considerable evidence that Cyril Burt invented some of his data on identical twins.
In light of this, can you see anything in the above analysis that might lead one to suspect this data?
In the analysis, we adopt the treatment coding (it's easier for interpreting the parameters):
There are various codings you can choose if you fell they are more reasonable for your data, such as the Helmert coding:
and the sum coding:
Once you decide your coding, you can use it by changing the "contrasts" setting in "options".