¡@
Suppose some of the values in the Chicago Insurance dataset were missing. I randomly declared some the observations in this modified dataset. Read it in and take a look:
>
chm <- read.table("chmiss.txt")
> chm # take a look
race fire theft age involact income
60626 10.0 6.2 29 60.4 NA 11.744
60640 22.2 9.5 44 76.5 0.1 9.323
60613 19.6 10.5 36 NA 1.2 9.948
60657 17.3 7.7 37 NA 0.5 10.656
60614 24.5 8.6 53 81.4 0.7 9.730
60610 54.0 34.1 68 52.6 0.3 8.231
60611 4.9 11.0 75 42.6 0.0 21.480
60625 7.1 6.9 18 78.5 0.0 11.104
60618 5.3 7.3 31 90.1 NA 10.694
60647 21.5 15.1 NA 89.8 1.1 9.631
60622 43.1 29.1 34 82.7 1.9 7.995
60631 1.1 2.2 14 40.2 0.0 13.722
60646 NA 5.7 11 27.9 0.0 16.250
60656 1.7 2.0 11 7.7 0.0 13.686
60630 1.6 2.5 22 63.8 0.0 12.405
60634 1.5 3.0 NA 51.2 0.0 12.198
60641 1.8 5.4 27 85.1 0.0 11.600
60635 1.0 2.2 9 44.4 0.0 12.765
60639 2.5 7.2 29 84.2 0.2 11.084
60651 NA 15.1 30 89.8 0.8 10.510
60644 59.8 16.5 40 NA 0.8 9.784
60624 94.4 18.4 32 72.9 1.8 7.342
60612 86.2 36.2 41 63.1 1.8 6.565
60607 50.2 NA 147 83.0 0.9 7.459
60623 74.2 18.5 22 78.3 1.9 8.014
60608 55.5 NA 29 79.0 1.5 8.177
60616 NA 12.2 46 48.0 0.6 8.212
60632 4.4 5.6 23 71.5 0.3 11.230
60609 46.2 21.8 4 73.1 1.3 NA
60653 99.7 21.6 31 65.0 0.9 5.583
60615 73.5 9.0 39 75.4 0.4 8.564
60638 10.7 3.6 15 20.8 0.0 12.102
60629 1.5 5.0 32 61.8 NA 11.876
60636 48.8 28.6 27 78.1 1.4 9.742
60621 98.9 17.4 NA 68.6 2.2 7.520
60637 90.6 11.3 34 73.4 0.8 7.388
60652 1.4 3.4 17 2.0 0.0 13.842
60620 71.2 11.9 46 NA 0.9 11.040
60619 94.1 10.5 42 55.9 0.9 10.332
60649 66.1 10.7 NA 67.5 0.4 10.908
60617 NA 10.8 34 58.0 0.9 11.156
60655 1.0 4.8 19 15.2 0.0 13.323
60643 42.5 10.4 25 40.8 0.5 12.960
60628 35.1 15.6 28 57.8 1.0 NA
60627 47.4 7.0 3 11.4 0.2 10.080
60633 34.0 7.1 23 49.2 0.3 11.428
60645 3.1 4.9 27 NA 0.0 13.731
There are 20 missing observations, denoted by NA here. It's important to know what the missing value code is for the data and/or software you are using. What happens if we try to fit the model:
> g <- lm(involact ~ ., data=chm)
> summary(g)
Call:
lm(formula = involact ~ ., data = chm)
Residuals:
Min 1Q Median 3Q Max
-0.53370 -0.16325 -0.07015 0.12615 0.66316
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.116483 0.605761 -1.843 0.079475 .
race 0.010487 0.003128 3.352 0.003018 **
fire 0.043876 0.010319 4.252 0.000356 ***
theft -0.017220 0.005900 -2.918 0.008215 **
age 0.009377 0.003494 2.684 0.013904 *
income 0.068701 0.042156 1.630 0.118077
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3382 on 21 degrees of freedom
Multiple R-Squared: 0.7911, Adjusted R-squared: 0.7414
F-statistic: 15.91 on 5 and 21 DF, p-value: 1.594e-06
Any case with at least one missing value is omitted from the regression. You can see there are now only 21 degrees of freedom - almost half the data is lost. We can fill in the missing values by their variable means as in:
> cmeans <- apply(chm, 2, mean, na.rm=T)
> cmeans
race fire theft
age involact income
35.6093023 11.4244444 32.6511628 59.9690476 0.6477273 10.7358667
> mchm <- chm
> for (i in c(1,2,3,4,6)) mchm[is.na(chm[,i]), i] <- cmeans[i]
> mchm
race fire theft age involact income
60626 10.0000 6.20000 29.00000 60.40000 NA 11.74400
60640 22.2000 9.50000 44.00000 76.50000 0.1 9.32300
60613 19.6000 10.50000 36.00000 59.96905 1.2 9.94800
60657 17.3000 7.70000 37.00000 59.96905 0.5 10.65600
60614 24.5000 8.60000 53.00000 81.40000 0.7 9.73000
60610 54.0000 34.10000 68.00000 52.60000 0.3 8.23100
60611 4.9000 11.00000 75.00000 42.60000 0.0 21.48000
60625 7.1000 6.90000 18.00000 78.50000 0.0 11.10400
60618 5.3000 7.30000 31.00000 90.10000 NA 10.69400
60647 21.5000 15.10000 32.65116 89.80000 1.1 9.63100
60622 43.1000 29.10000 34.00000 82.70000 1.9 7.99500
60631 1.1000 2.20000 14.00000 40.20000 0.0 13.72200
60646 35.6093 5.70000 11.00000 27.90000 0.0 16.25000
60656 1.7000 2.00000 11.00000 7.70000 0.0 13.68600
60630 1.6000 2.50000 22.00000 63.80000 0.0 12.40500
60634 1.5000 3.00000 32.65116 51.20000 0.0 12.19800
60641 1.8000 5.40000 27.00000 85.10000 0.0 11.60000
60635 1.0000 2.20000 9.00000 44.40000 0.0 12.76500
60639 2.5000 7.20000 29.00000 84.20000 0.2 11.08400
60651 35.6093 15.10000 30.00000 89.80000 0.8 10.51000
60644 59.8000 16.50000 40.00000 59.96905 0.8 9.78400
60624 94.4000 18.40000 32.00000 72.90000 1.8 7.34200
60612 86.2000 36.20000 41.00000 63.10000 1.8 6.56500
60607 50.2000 11.42444 147.00000 83.00000 0.9 7.45900
60623 74.2000 18.50000 22.00000 78.30000 1.9 8.01400
60608 55.5000 11.42444 29.00000 79.00000 1.5 8.17700
60616 35.6093 12.20000 46.00000 48.00000 0.6 8.21200
60632 4.4000 5.60000 23.00000 71.50000 0.3 11.23000
60609 46.2000 21.80000 4.00000 73.10000 1.3 10.73587
60653 99.7000 21.60000 31.00000 65.00000 0.9 5.58300
60615 73.5000 9.00000 39.00000 75.40000 0.4 8.56400
60638 10.7000 3.60000 15.00000 20.80000 0.0 12.10200
60629 1.5000 5.00000 32.00000 61.80000 NA 11.87600
60636 48.8000 28.60000 27.00000 78.10000 1.4 9.74200
60621 98.9000 17.40000 32.65116 68.60000 2.2 7.52000
60637 90.6000 11.30000 34.00000 73.40000 0.8 7.38800
60652 1.4000 3.40000 17.00000 2.00000 0.0 13.84200
60620 71.2000 11.90000 46.00000 59.96905 0.9 11.04000
60619 94.1000 10.50000 42.00000 55.90000 0.9 10.33200
60649 66.1000 10.70000 32.65116 67.50000 0.4 10.90800
60617 35.6093 10.80000 34.00000 58.00000 0.9 11.15600
60655 1.0000 4.80000 19.00000 15.20000 0.0 13.32300
60643 42.5000 10.40000 25.00000 40.80000 0.5 12.96000
60628 35.1000 15.60000 28.00000 57.80000 1.0 10.73587
60627 47.4000 7.00000 3.00000 11.40000 0.2 10.08000
60633 34.0000 7.10000 23.00000 49.20000 0.3 11.42800
60645 3.1000 4.90000 27.00000 59.96905 0.0 13.73100
Note that we don't fill in missing values in the response. Why? Now refit:
>
g <- lm(involact ~ ., data=mchm)
> summary(g)
Call:
lm(formula = involact ~ ., data = mchm)
Residuals:
Min 1Q Median 3Q Max
-1.024012 -0.156811 -0.003327 0.222011 0.811744
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.070802 0.509453 0.139 0.89020
race 0.007117 0.002706 2.631 0.01224 *
fire 0.028742 0.009385 3.062 0.00402 **
theft -0.003059 0.002746 -1.114 0.27224
age 0.006080 0.003208 1.895 0.06570 .
income -0.027092 0.031678 -0.855 0.39779
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3841 on 38 degrees of freedom
Multiple R-Squared: 0.682, Adjusted R-squared: 0.6401
F-statistic: 16.3 on 5 and 38 DF, p-value: 1.409e-08
Compare with the previous results - what differences do you see? Different statistical packages have different ways of handling missing observations. For example, the default behavior in S-PLUS would refuse to fit the model at all.
¡@
We can also use regression methods to predict the missing values of the covariates. Let's try to fill-in the missing race values:
>
gr <- lm(race ~ fire + theft + age + income, data=chm)
>
chm[is.na(chm$race),] # cases for which race is missing
race fire theft age
involact income
60646 NA 5.7 11
27.9 0.0 16.250
60651 NA
15.1 30 89.8 0.8
10.510
60616 NA 12.2 46
48.0 0.6 8.212
60617 NA
10.8 34 58.0 0.9
11.156
> predict(gr,chm[is.na(chm$race),])
60646
60651 60616 60617
-17.84722
26.3597 70.39441 32.62029
Can you see a problem with filling these values in?
More sophisticated and superior methods is required.
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@
¡@