Ordinal Variables (Reading: Faraway (2006), section 4.5)
Consider a data from the 1996 American National Election Study (Rosenstone et al., 1997). Let us first read the data into R and take a look:
> nes96 <- read.table("nes96.txt")
Consider just the data on party affiliation (PID) and level of education (educ). The default of R sorts the levels in alphabetical order:
Because both variables are ordinal in this example, we need to arrange the levels according to their natural order:
> nes96$PID <- factor(nes96$PID, levels=c("strDem", "weakDem", "indDem", "indind", "indRep", "weakRep", "strRep"))
> nes96$educ <- factor(nes96$educ, levels=c("MS", "HSdrop", "HS", "Coll", "CCdeg", "BAdeg", "MAdeg"))
Now, the orders of the levels are correctly reflected.
W can construct a two-way table for the two variables:
> xtabs( ~ PID + educ, nes96)
(Q: what information have you found from the contingency table?) But, we need to convert the table to a dataframe of R with one count per line to enable model fitting:
> (partyed <- as.data.frame.table(xtabs( ~ PID + educ, nes96)))
If we treat PID and educ as nominal variables and fit a nominal-by-nominal Poisson model containing only main effect terms:
> nomod <- glm(Freq ~ PID + educ, partyed, family= poisson)
We see that the model has a deviance of 40.742 on 36 degrees of freedom, which gives a p-value:
Therefore, the Poisson GLM with only main effect terms is a good fit and there is no evidence against independence.
However, we can take advantage of the ordinal structure of both variables and define some scores. As there seems to be no strong reason to the contrary, we assign evenly spaced scores, i.e., 1 to 7, for the levels of both PID and educ:
> partyed$oPID <-
> partyed$oeduc <- unclass(partyed$educ)
Now, let us fit the linear-by-linear association model:
glm(Freq ~ PID + educ + I(oPID*oeduc), partyed, family= poisson)
the model only use one degrees of freedom (i.e. one interaction effect) to characterize the association between party affiliation and education
if the linear-by-linear interaction effect (I(oPID * oeduc)in this case) is insignificant, the model is reduced to the independence model
we can use the difference-in-deviance test to compare this model to the independence model:
We see that:
the significance of the test suggests that there is some evidence of an association, which is a different conclusion from what obtained under the nominal-by-nominal approach
Q: what causes the difference? Note that the 36 interaction effects in the nominal-by-nominal approach reduce a deviance of 40.743. Under the ordinal-by-ordinal approach, the deviance can be further split into two parts: a deviance of 10.175 on one degree of freedom for the linear-by-linear association effect (which is of course significant) and a deviance of 30.568 for the rest 35 interaction effects. When we only look at the overall deviance (i.e., 40.743), the information of the fewer significant effects (such as I(oPID * oeduc) in this case) could be blocked by the the insignificance of other interaction effects.
the use of the ordinal information gives us more power to detect an association
the estimate of the parameter for the linear-by-linear interaction (denoted by g) is 0.028745, with a p-value 0.001513.
the p-value here can also be used to test the significance of the association although, as a Wald test, it is less reliable than the difference-in-deviance test given above.
here is a qualitative conclusion for the estimate of g : Because it is positive, it means that, given the way that we have assigned the scores, a higher level of education is associated with a greater probability of tending to the Republican end of the spectrum
here is a quantitative conclusion for the estimate of g: Note that the estimate of association parameter g can be interpreted in terms of log-odds. For our example, where the scores are spaced one apart, the fitted log-odds ratio is g (not sample log-odds ratios). To illustrate this point, consider the fitted values under this model:
> round(ormodfit <- xtabs(predict(ormod,type="response") ~ PID + educ, partyed),2)
Now, compute the log-odds ratio for, say, the right bottom two-by-two table:
We see this is, after rounding, equal to the estimate of g. (exercise: for other adjacent 2-by-2 tables, check whether they are still equal)
it is always worth examining the residuals to check if there is more structure than the linear-by-linear association model suggests. We use the raw response residuals (the unscaled difference between observed and expected) because we would like to see effects which are large in an absolute sense:
> round(xtabs(residuals(ormod,type="response") ~ PID + educ, partyed),2)
We do see some indications of remaining structure, for example:
many more weak Republicans with some college than expected (with a positive residual 11.32)
fewer Republicans with master's degree or higher (with negative residuals -6.19 and -3.85)
There may not be only a monotone relationship between party affiliation and education. We will investigate it later by considering an ordinal-by-nominal model.
Just to check the robustness of the assignment of the scores, it is worth trying some different choices. Suppose that we choose scores for PID so that there is more of a distinction between Democrats and Independents as well as Independents and Republican. Our assignment of scores for PID below achieves this:
apid <- c(1,2,5,6,7,10,11)
Another idea for educ might be that
people who complete high school or less are not different;
those who go to college, but do not get a BA degree are not different;
those who get a BA or higher are not different.
Our assignment of scores for educ below achieves this:
> aedu <- c(1,1,1,2,2,3,3)
Let us use the new scores to fit a linear-by-linear association model:
> ormoda <- glm(Freq
~ PID + educ + I(apid[oPID]*aedu[oeduc]), partyed, family= poisson)
the numerical outcomes are slightly different from what obtained under the evenly-spaced scores
but, the qualitative conclusions are quite similar, for example,
the difference-in-deviance test is still significant:
and, the sign of the linear-by-linear association term is still positive
some experimentation with other plausible choices of scores (exercise: what other scores would you want to try? check whether the qualitative conclusions are still similar under your scores) indicates that we can fairly confident about the association here
To investigate the remaining structure not explained by the linear-by-linear association term (as mentioned above), we might consider an ordinal-by-nominal model (i.e. column-effect model) where for the interaction terms we now treat educ as a nominal variable. The model is more complex than the linear-by-linear association model because the former assigns more interaction effects to characterize the association than the latter:
<- glm(Freq ~ PID + educ + educ:oPID, partyed, family= poisson)
the last coefficient, educMAdeg:oPID, is not identifiable because the interaction terms are over-parameterized (educ has 7 levels so that 6 interaction effects are enough). In this circumstance, R automatically removes some unidentifiable effects to make the model identifiable; in other words, the coefficients of these unidentifiable effects are set as zero. Because educMAdeg:oPID is removed, we can regard the level MAdeg as a reference level in the interpretation of the interaction terms. For example, the difference between the "slopes" of oPID within each of the two educational groups MS and MAdeg is about -0.312217.
now, examine the fitted coefficients, let us look at just the interaction terms as the main effects have no particular interest. We see that:
if the linear-by-linear association model using evenly-spaced scores were a good fit (i.e., if there was really a monotone trend in the effect of educational level on party affiliation), we would expect the observed column-effect coefficients to be roughly evenly spaced (i.e., these coefficients have a linear trend). However, we can see that they are not.
when we look at these coefficients we observe that for high school and above, the coefficients are not significantly different from zero (i.e., for the five educational groups HS, Coll, CCdeq, BAdeq, and MAdeq, the "slopes" of oPID are quite similar) while for the lowest two categories, these is so difference (i.e., the "slopes" of oPID in the two groups MS and HSdrop are quite different from that in the other five groups).
we can compare this ordinal-by-nominal model to the independence model (nomod):
We find that the column-effects model is preferred.
if we compare the column-effects model to the linear-by-linear association model using evenly-spaced scores (ormod)
We see that the simpler linear-by-linear association is preferred to the more complex column-effects model.
The above finding suggests an alternative assignment of scores for educ (Q: how to interpret the scores?):
> aedu <- c(1,1,2,2,2,2,2)
> ormodb <- glm(Freq ~ PID + educ + I(oPID*aedu[oeduc]), partyed, family= poisson)
We see that:
the deviance of this model is even lower than the linear-by-linear association model using evenly-spaced scores:
and this model also offers similar qualitative conclusions as the previous linear-by-linear association models.
the model gives credence to the view that whether a person finished high school or not is the determining factor in party affiliation. However, since we used what we found from the data itself to assign the scores and came up with this hypothesis, we would be temping fate to then use the data again to test this hypothesis.