NTHU STAT 5040 Lab - Incomplete Data

Suppose some of the values in the Chicago Insurance dataset were missing. I randomly declared some the observations in this modified dataset. Read it in and take a look:

race fire theft age involact income

60626 10.0 6.2 29 60.4 NA 11.744

60640 22.2 9.5 44 76.5 0.1 9.323

60613 19.6 10.5 36 NA 1.2 9.948

60657 17.3 7.7 37 NA 0.5 10.656

60614 24.5 8.6 53 81.4 0.7 9.730

60610 54.0 34.1 68 52.6 0.3 8.231

60611 4.9 11.0 75 42.6 0.0 21.480

60625 7.1 6.9 18 78.5 0.0 11.104

60618 5.3 7.3 31 90.1 NA 10.694

60647 21.5 15.1 NA 89.8 1.1 9.631

60622 43.1 29.1 34 82.7 1.9 7.995

60631 1.1 2.2 14 40.2 0.0 13.722

60646 NA 5.7 11 27.9 0.0 16.250

60656 1.7 2.0 11 7.7 0.0 13.686

60630 1.6 2.5 22 63.8 0.0 12.405

60634 1.5 3.0 NA 51.2 0.0 12.198

60641 1.8 5.4 27 85.1 0.0 11.600

60635 1.0 2.2 9 44.4 0.0 12.765

60639 2.5 7.2 29 84.2 0.2 11.084

60651 NA 15.1 30 89.8 0.8 10.510

60644 59.8 16.5 40 NA 0.8 9.784

60624 94.4 18.4 32 72.9 1.8 7.342

60612 86.2 36.2 41 63.1 1.8 6.565

60607 50.2 NA 147 83.0 0.9 7.459

60623 74.2 18.5 22 78.3 1.9 8.014

60608 55.5 NA 29 79.0 1.5 8.177

60616 NA 12.2 46 48.0 0.6 8.212

60632 4.4 5.6 23 71.5 0.3 11.230

60609 46.2 21.8 4 73.1 1.3 NA

60653 99.7 21.6 31 65.0 0.9 5.583

60615 73.5 9.0 39 75.4 0.4 8.564

60638 10.7 3.6 15 20.8 0.0 12.102

60629 1.5 5.0 32 61.8 NA 11.876

60636 48.8 28.6 27 78.1 1.4 9.742

60621 98.9 17.4 NA 68.6 2.2 7.520

60637 90.6 11.3 34 73.4 0.8 7.388

60652 1.4 3.4 17 2.0 0.0 13.842

60620 71.2 11.9 46 NA 0.9 11.040

60619 94.1 10.5 42 55.9 0.9 10.332

60649 66.1 10.7 NA 67.5 0.4 10.908

60617 NA 10.8 34 58.0 0.9 11.156

60655 1.0 4.8 19 15.2 0.0 13.323

60643 42.5 10.4 25 40.8 0.5 12.960

60628 35.1 15.6 28 57.8 1.0 NA

60627 47.4 7.0 3 11.4 0.2 10.080

60633 34.0 7.1 23 49.2 0.3 11.428

60645 3.1 4.9 27 NA 0.0 13.731

There are 20 missing observations, denoted by NA here. It's important to know what the missing value code is for the data and/or software you are using. What happens if we try to fit the model:

Call:

lm(formula = involact ~ ., data = chm)

Residuals:

Min 1Q Median 3Q Max

-0.53370 -0.16325 -0.07015 0.12615 0.66316

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -1.116483 0.605761 -1.843 0.079475 .

race 0.010487 0.003128 3.352 0.003018 **

fire 0.043876 0.010319 4.252 0.000356 ***

theft -0.017220 0.005900 -2.918 0.008215 **

age 0.009377 0.003494 2.684 0.013904 *

income 0.068701 0.042156 1.630 0.118077

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3382 on 21 degrees of freedom

Multiple R-Squared: 0.7911, Adjusted R-squared: 0.7414

F-statistic: 15.91 on 5 and 21 DF, p-value: 1.594e-06

Any case with at least one missing value is omitted from the regression. You can see there are now only 21 degrees of freedom - almost half the data is lost. We can fill in the missing values by their variable means as in:

race fire theft age involact income

60626 10.0000 6.20000 29.00000 60.40000 NA 11.74400

60640 22.2000 9.50000 44.00000 76.50000 0.1 9.32300

60613 19.6000 10.50000 36.00000 59.96905 1.2 9.94800

60657 17.3000 7.70000 37.00000 59.96905 0.5 10.65600

60614 24.5000 8.60000 53.00000 81.40000 0.7 9.73000

60610 54.0000 34.10000 68.00000 52.60000 0.3 8.23100

60611 4.9000 11.00000 75.00000 42.60000 0.0 21.48000

60625 7.1000 6.90000 18.00000 78.50000 0.0 11.10400

60618 5.3000 7.30000 31.00000 90.10000 NA 10.69400

60647 21.5000 15.10000 32.65116 89.80000 1.1 9.63100

60622 43.1000 29.10000 34.00000 82.70000 1.9 7.99500

60631 1.1000 2.20000 14.00000 40.20000 0.0 13.72200

60646 35.6093 5.70000 11.00000 27.90000 0.0 16.25000

60656 1.7000 2.00000 11.00000 7.70000 0.0 13.68600

60630 1.6000 2.50000 22.00000 63.80000 0.0 12.40500

60634 1.5000 3.00000 32.65116 51.20000 0.0 12.19800

60641 1.8000 5.40000 27.00000 85.10000 0.0 11.60000

60635 1.0000 2.20000 9.00000 44.40000 0.0 12.76500

60639 2.5000 7.20000 29.00000 84.20000 0.2 11.08400

60651 35.6093 15.10000 30.00000 89.80000 0.8 10.51000

60644 59.8000 16.50000 40.00000 59.96905 0.8 9.78400

60624 94.4000 18.40000 32.00000 72.90000 1.8 7.34200

60612 86.2000 36.20000 41.00000 63.10000 1.8 6.56500

60607 50.2000 11.42444 147.00000 83.00000 0.9 7.45900

60623 74.2000 18.50000 22.00000 78.30000 1.9 8.01400

60608 55.5000 11.42444 29.00000 79.00000 1.5 8.17700

60616 35.6093 12.20000 46.00000 48.00000 0.6 8.21200

60632 4.4000 5.60000 23.00000 71.50000 0.3 11.23000

60609 46.2000 21.80000 4.00000 73.10000 1.3 10.73587

60653 99.7000 21.60000 31.00000 65.00000 0.9 5.58300

60615 73.5000 9.00000 39.00000 75.40000 0.4 8.56400

60638 10.7000 3.60000 15.00000 20.80000 0.0 12.10200

60629 1.5000 5.00000 32.00000 61.80000 NA 11.87600

60636 48.8000 28.60000 27.00000 78.10000 1.4 9.74200

60621 98.9000 17.40000 32.65116 68.60000 2.2 7.52000

60637 90.6000 11.30000 34.00000 73.40000 0.8 7.38800

60652 1.4000 3.40000 17.00000 2.00000 0.0 13.84200

60620 71.2000 11.90000 46.00000 59.96905 0.9 11.04000

60619 94.1000 10.50000 42.00000 55.90000 0.9 10.33200

60649 66.1000 10.70000 32.65116 67.50000 0.4 10.90800

60617 35.6093 10.80000 34.00000 58.00000 0.9 11.15600

60655 1.0000 4.80000 19.00000 15.20000 0.0 13.32300

60643 42.5000 10.40000 25.00000 40.80000 0.5 12.96000

60628 35.1000 15.60000 28.00000 57.80000 1.0 10.73587

60627 47.4000 7.00000 3.00000 11.40000 0.2 10.08000

60633 34.0000 7.10000 23.00000 49.20000 0.3 11.42800

60645 3.1000 4.90000 27.00000 59.96905 0.0 13.73100

Call:

lm(formula = involact ~ ., data = mchm)

Residuals:

Min 1Q Median 3Q Max

-1.024012 -0.156811 -0.003327 0.222011 0.811744

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.070802 0.509453 0.139 0.89020

race 0.007117 0.002706 2.631 0.01224 *

fire 0.028742 0.009385 3.062 0.00402 **

theft -0.003059 0.002746 -1.114 0.27224

age 0.006080 0.003208 1.895 0.06570 .

income -0.027092 0.031678 -0.855 0.39779

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3841 on 38 degrees of freedom

Multiple R-Squared: 0.682, Adjusted R-squared: 0.6401

F-statistic: 16.3 on 5 and 38 DF, p-value: 1.409e-08

Compare with the previous results - what differences do you see? Different statistical packages have different ways of handling missing observations. For example, the default behavior in S-PLUS would refuse to fit the model at all.

We can also use regression methods to predict the missing values of the covariates. Let's try to fill-in the missing race values:

> gr <- lm(race ~ fire + theft + age + income, data=chm)
> chm[is.na(chm$race),] # cases for which race is missing

Incomplete Data (Reading: Faraway (2005, 1st edition), Chapter 12)