Incomplete Data (Reading: Faraway (2005, 1st edition), Chapter 12)

¡@

Suppose some of the values in the Chicago Insurance dataset were missing. I randomly declared some the observations in this modified dataset. Read it in and take a look:

> chm <- read.table("chmiss.txt")
> chm # take a look

      race fire theft  age involact income

60626 10.0  6.2    29 60.4       NA 11.744

60640 22.2  9.5    44 76.5      0.1  9.323

60613 19.6 10.5    36   NA      1.2  9.948

60657 17.3  7.7    37   NA      0.5 10.656

60614 24.5  8.6    53 81.4      0.7  9.730

60610 54.0 34.1    68 52.6      0.3  8.231

60611  4.9 11.0    75 42.6      0.0 21.480

60625  7.1  6.9    18 78.5      0.0 11.104

60618  5.3  7.3    31 90.1       NA 10.694

60647 21.5 15.1    NA 89.8      1.1  9.631

60622 43.1 29.1    34 82.7      1.9  7.995

60631  1.1  2.2    14 40.2      0.0 13.722

60646   NA  5.7    11 27.9      0.0 16.250

60656  1.7  2.0    11  7.7      0.0 13.686

60630  1.6  2.5    22 63.8      0.0 12.405

60634  1.5  3.0    NA 51.2      0.0 12.198

60641  1.8  5.4    27 85.1      0.0 11.600

60635  1.0  2.2     9 44.4      0.0 12.765

60639  2.5  7.2    29 84.2      0.2 11.084

60651   NA 15.1    30 89.8      0.8 10.510

60644 59.8 16.5    40   NA      0.8  9.784

60624 94.4 18.4    32 72.9      1.8  7.342

60612 86.2 36.2    41 63.1      1.8  6.565

60607 50.2   NA   147 83.0      0.9  7.459

60623 74.2 18.5    22 78.3      1.9  8.014

60608 55.5   NA    29 79.0      1.5  8.177

60616   NA 12.2    46 48.0      0.6  8.212

60632  4.4  5.6    23 71.5      0.3 11.230

60609 46.2 21.8     4 73.1      1.3     NA

60653 99.7 21.6    31 65.0      0.9  5.583

60615 73.5  9.0    39 75.4      0.4  8.564

60638 10.7  3.6    15 20.8      0.0 12.102

60629  1.5  5.0    32 61.8       NA 11.876

60636 48.8 28.6    27 78.1      1.4  9.742

60621 98.9 17.4    NA 68.6      2.2  7.520

60637 90.6 11.3    34 73.4      0.8  7.388

60652  1.4  3.4    17  2.0      0.0 13.842

60620 71.2 11.9    46   NA      0.9 11.040

60619 94.1 10.5    42 55.9      0.9 10.332

60649 66.1 10.7    NA 67.5      0.4 10.908

60617   NA 10.8    34 58.0      0.9 11.156

60655  1.0  4.8    19 15.2      0.0 13.323

60643 42.5 10.4    25 40.8      0.5 12.960

60628 35.1 15.6    28 57.8      1.0     NA

60627 47.4  7.0     3 11.4      0.2 10.080

60633 34.0  7.1    23 49.2      0.3 11.428

60645  3.1  4.9    27   NA      0.0 13.731

There are 20 missing observations, denoted by NA here. It's important to know what the missing value code is for the data and/or software you are using. What happens if we try to fit the model:

> g <- lm(involact ~ ., data=chm)

> summary(g)

  Call:

lm(formula = involact ~ ., data = chm)

 

Residuals:

     Min       1Q   Median       3Q      Max

-0.53370 -0.16325 -0.07015  0.12615  0.66316

 

Coefficients:

             Estimate Std. Error t value Pr(>|t|)   

(Intercept) -1.116483   0.605761  -1.843 0.079475 . 

race         0.010487   0.003128   3.352 0.003018 **

fire         0.043876   0.010319   4.252 0.000356 ***

theft       -0.017220   0.005900  -2.918 0.008215 **

age          0.009377   0.003494   2.684 0.013904 * 

income       0.068701   0.042156   1.630 0.118077   

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 

Residual standard error: 0.3382 on 21 degrees of freedom

Multiple R-Squared: 0.7911,     Adjusted R-squared: 0.7414

F-statistic: 15.91 on 5 and 21 DF,  p-value: 1.594e-06

Any case with at least one missing value is omitted from the regression. You can see there are now only 21 degrees of freedom - almost half the data is lost. We can fill in the missing values by their variable means as in:

> cmeans <- apply(chm, 2, mean, na.rm=T)

> cmeans  

      race       fire      theft        age   involact     income

35.6093023 11.4244444 32.6511628 59.9690476  0.6477273 10.7358667

> mchm <- chm

> for (i in c(1,2,3,4,6)) mchm[is.na(chm[,i]), i] <- cmeans[i]

> mchm

      race     fire     theft      age involact   income

60626 10.0000  6.20000  29.00000 60.40000       NA 11.74400

60640 22.2000  9.50000  44.00000 76.50000      0.1  9.32300

60613 19.6000 10.50000  36.00000 59.96905      1.2  9.94800

60657 17.3000  7.70000  37.00000 59.96905      0.5 10.65600

60614 24.5000  8.60000  53.00000 81.40000      0.7  9.73000

60610 54.0000 34.10000  68.00000 52.60000      0.3  8.23100

60611  4.9000 11.00000  75.00000 42.60000      0.0 21.48000

60625  7.1000  6.90000  18.00000 78.50000      0.0 11.10400

60618  5.3000  7.30000  31.00000 90.10000       NA 10.69400

60647 21.5000 15.10000  32.65116 89.80000      1.1  9.63100

60622 43.1000 29.10000  34.00000 82.70000      1.9  7.99500

60631  1.1000  2.20000  14.00000 40.20000      0.0 13.72200

60646 35.6093  5.70000  11.00000 27.90000      0.0 16.25000

60656  1.7000  2.00000  11.00000  7.70000      0.0 13.68600

60630  1.6000  2.50000  22.00000 63.80000      0.0 12.40500

60634  1.5000  3.00000  32.65116 51.20000      0.0 12.19800

60641  1.8000  5.40000  27.00000 85.10000      0.0 11.60000

60635  1.0000  2.20000   9.00000 44.40000      0.0 12.76500

60639  2.5000  7.20000  29.00000 84.20000      0.2 11.08400

60651 35.6093 15.10000  30.00000 89.80000      0.8 10.51000

60644 59.8000 16.50000  40.00000 59.96905      0.8  9.78400

60624 94.4000 18.40000  32.00000 72.90000      1.8  7.34200

60612 86.2000 36.20000  41.00000 63.10000      1.8  6.56500

60607 50.2000 11.42444 147.00000 83.00000      0.9  7.45900

60623 74.2000 18.50000  22.00000 78.30000      1.9  8.01400

60608 55.5000 11.42444  29.00000 79.00000      1.5  8.17700

60616 35.6093 12.20000  46.00000 48.00000      0.6  8.21200

60632  4.4000  5.60000  23.00000 71.50000      0.3 11.23000

60609 46.2000 21.80000   4.00000 73.10000      1.3 10.73587

60653 99.7000 21.60000  31.00000 65.00000      0.9  5.58300

60615 73.5000  9.00000  39.00000 75.40000      0.4  8.56400

60638 10.7000  3.60000  15.00000 20.80000      0.0 12.10200

60629  1.5000  5.00000  32.00000 61.80000       NA 11.87600

60636 48.8000 28.60000  27.00000 78.10000      1.4  9.74200

60621 98.9000 17.40000  32.65116 68.60000      2.2  7.52000

60637 90.6000 11.30000  34.00000 73.40000      0.8  7.38800

60652  1.4000  3.40000  17.00000  2.00000      0.0 13.84200

60620 71.2000 11.90000  46.00000 59.96905      0.9 11.04000

60619 94.1000 10.50000  42.00000 55.90000      0.9 10.33200

60649 66.1000 10.70000  32.65116 67.50000      0.4 10.90800

60617 35.6093 10.80000  34.00000 58.00000      0.9 11.15600

60655  1.0000  4.80000  19.00000 15.20000      0.0 13.32300

60643 42.5000 10.40000  25.00000 40.80000      0.5 12.96000

60628 35.1000 15.60000  28.00000 57.80000      1.0 10.73587

60627 47.4000  7.00000   3.00000 11.40000      0.2 10.08000

60633 34.0000  7.10000  23.00000 49.20000      0.3 11.42800

60645  3.1000  4.90000  27.00000 59.96905      0.0 13.73100

Note that we don't fill in missing values in the response. Why? Now refit:

> g <- lm(involact ~ ., data=mchm)
> summary(g)

   Call:

lm(formula = involact ~ ., data = mchm)

 

Residuals:

      Min        1Q    Median        3Q       Max

-1.024012 -0.156811 -0.003327  0.222011  0.811744

 

Coefficients:

             Estimate Std. Error t value Pr(>|t|)  

(Intercept)  0.070802   0.509453   0.139  0.89020  

race         0.007117   0.002706   2.631  0.01224 *

fire         0.028742   0.009385   3.062  0.00402 **

theft       -0.003059   0.002746  -1.114  0.27224  

age          0.006080   0.003208   1.895  0.06570 .

income      -0.027092   0.031678  -0.855  0.39779  

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 

Residual standard error: 0.3841 on 38 degrees of freedom

Multiple R-Squared: 0.682,      Adjusted R-squared: 0.6401

F-statistic:  16.3 on 5 and 38 DF,  p-value: 1.409e-08

Compare with the previous results - what differences do you see? Different statistical packages have different ways of handling missing observations. For example, the default behavior in S-PLUS would refuse to fit the model at all.

¡@


We can also use regression methods to predict the missing values of the covariates. Let's try to fill-in the missing race values:

> gr <- lm(race ~ fire + theft + age + income, data=chm)
> chm[is.na(chm$race),] # cases for which race is missing

      race fire theft  age involact income 
60646   NA  5.7    11 27.9      0.0 16.250
60651   NA 15.1    30 89.8      0.8 10.510
60616   NA 12.2    46 48.0      0.6  8.212
60617   NA 10.8    34 58.0      0.9 11.156

> predict(gr,chm[is.na(chm$race),])

     60646   60651    60616    60617 
 -17.84722 26.3597 70.39441 32.62029

Can you see a problem with filling these values in?

More sophisticated and superior methods is required.

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@

¡@