# Initial Data Analysis (Reading: Faraway (2005, 1st edition), section 1.2)

This is a critical step that should always be performed. You should

1. understand the background of the dataset and what each variable in it represents.

2. calculate some descriptive statistics, such as means, standard deviations, maxima and minima, correlations, and whatever else is appropriate.

3. draw graphical summaries, such as histograms, box plots, density plots, scatter plots, and many more.

In these numerical and graphical summaries, you can look for

• outliers,

• data-entry errors,

• skewed or unusual distributions and structure,

and check

• whether the data are distributed according to prior expectations and

• whether any assumptions of the models to be used in later analyses are violated.

Here is a data set from a study conducted by the National Institute of Diabetes and Digestive and Kidney Diseases on 768 adult female Pima Indians living near Phoenix. We start by reading the data into R.

> pima # take a look

    pregnant glucose blood triceps insulin  bmi pedigree age test
1          6     148    72      35       0 33.6    0.627  50    1
2          1      85    66      29       0 26.6    0.351  31    0
3          8     183    64       0       0 23.3    0.672  32    1
... much deleted ...
768        1      93    70      31       0 30.4    0.315  23    0

The variables represent:

• pregnant: the number of times pregnant

• glucose: the plasma glucose concentration at 2 hours in an oral glucose tolerance test

• blood: the diastolic blood pressure (mmHg)

• triceps: the triceps skin fold thickness (mm)

• insulin: the 2-hour serum insulin (mu U/ml)

• bmi: the body mass index (weight in kg / (height in m)^2)

• pedigree: the diabetes pedigree function

• age: the age (years)

• test: whether the patient showed signs of diabetes (0 = negative, 1 = positive)

(Q: Are these variables quantitative or qualitative? If quantitative, are they continuous or discrete? If qualitative, is there an order among the levels?)
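One way to start on this question is to ask R how each column is stored. The sketch below is only an illustration: it rebuilds the four rows printed above in a hypothetical data frame called `pima_small`, so that it runs without the full dataset.

```r
# Hypothetical toy data: only the four rows printed above (rows 1, 2, 3, 768).
pima_small <- data.frame(
  pregnant = c(6, 1, 8, 1),
  glucose  = c(148, 85, 183, 93),
  blood    = c(72, 66, 64, 70),
  triceps  = c(35, 29, 0, 31),
  insulin  = c(0, 0, 0, 0),
  bmi      = c(33.6, 26.6, 23.3, 30.4),
  pedigree = c(0.627, 0.351, 0.672, 0.315),
  age      = c(50, 31, 32, 23),
  test     = c(1, 0, 1, 0)
)

str(pima_small)                            # compact overview: one line per variable
col_classes <- sapply(pima_small, class)   # storage class of every column
print(col_classes)

# Number of distinct values per column: a column with very few distinct
# values (such as test) is a candidate for a qualitative variable.
n_distinct <- sapply(pima_small, function(x) length(unique(x)))
print(n_distinct)
```

Every column is stored as numeric here, including test, so the storage class alone does not settle whether a variable is qualitative; that judgment still requires knowing what the variable means.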

At this stage, we are looking for anything unusual or unexpected, such as signs of a data-entry error or anything inconsistent with what we already know about the data. Let's first calculate some numerical summaries.

> summary(pima)  # some numerical summaries

pregnant         glucose          blood          triceps

Min.   : 0.000   Min.   :  0.0   Min.   :  0.0   Min.   : 0.00

1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.0   1st Qu.: 0.00

Median : 3.000   Median :117.0   Median : 72.0   Median :23.00

Mean   : 3.845   Mean   :120.9   Mean   : 69.1   Mean   :20.54

3rd Qu.: 6.000   3rd Qu.:140.3   3rd Qu.: 80.0   3rd Qu.:32.00

Max.   :17.000   Max.   :199.0   Max.   :122.0   Max.   :99.00

insulin           bmi           pedigree           age

Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00

1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00

Median : 30.5   Median :32.00   Median :0.3725   Median :29.00

Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24

3rd Qu.:127.3   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00

Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00

test

Min.   :0.0000

1st Qu.:0.0000

Median :0.0000

Mean   :0.3490

3rd Qu.:1.0000

Max.   :1.0000

Take a close look at the minimum and maximum values of each variable. What have you found?

• A diastolic blood pressure of zero is not plausible (also check the variables glucose, triceps, insulin, and bmi). Let's sort the values of blood to find out how many zeros it contains.

> sort(pima$blood)  # sort the values of this variable from small to large

[1]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

[19]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  24

[37]  30  30  38  40  44  44  44  44  46  46  48  48  48  48  48  50  50  50

... much deleted ...

[739]  94  94  94  94  94  94  95  96  96  96  96  98  98  98 100 100 100 102

[757] 104 104 106 106 106 108 108 110 110 110 114 122

• It seems likely that zero has been used as a missing value code. In a real investigation, one would try to find out what actually happened and, if the values are indeed missing, whether they are missing in some systematic way.

• R uses "NA" as the missing value code. Let's set the zero values of these variables to NA.

> pima$blood[pima$blood == 0] <- NA # set zero values in the variable blood to "NA", where "==" means "equal" in R
> pima$glucose[pima$glucose == 0] <- NA # set zero values in the variable glucose to "NA"
> pima$triceps[pima$triceps == 0] <- NA # set zero values in the variable triceps to "NA"
> pima$insulin[pima$insulin == 0] <- NA # set zero values in the variable insulin to "NA"
> pima$bmi[pima$bmi == 0] <- NA # set zero values in the variable bmi to "NA"
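The assignments above all follow the same pattern, so they can also be written as a loop. The following is a minimal sketch on made-up values; `toy` and `zero_as_missing` are hypothetical names, and treating zero as missing is an assumption about the data, not something R can verify.

```r
# Made-up values for illustration only; zeros play the role of missing codes.
toy <- data.frame(
  pregnant = c(0, 2, 5, 1),      # zero is a legitimate value here
  blood    = c(72, 0, 64, 0),
  bmi      = c(33.6, 26.6, 0, 30.4)
)

# Columns in which a zero is assumed to mean "missing".
zero_as_missing <- c("blood", "bmi")

# Count the zeros first, so we know how many values will be affected.
zero_counts <- colSums(toy[zero_as_missing] == 0)
print(zero_counts)

# Replace the zeros by NA, one column at a time.
for (v in zero_as_missing) {
  toy[[v]][toy[[v]] == 0] <- NA
}

print(colSums(is.na(toy)))   # NA counts should match the zero counts above
```

Listing the affected columns explicitly keeps variables such as pregnant, where zero is a valid value, out of the recoding.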

• The variable test is a qualitative variable, so its numerical coding carries no numerical meaning. In R, a qualitative variable should be stored as a "factor" so that R handles it appropriately.

> pima$test <- factor(pima$test) # store the variable test as a factor in R

> summary(pima$test) # take a look

0   1

500 268

It is even better to use descriptive labels:

> levels(pima$test) # check how variable test is coded now

[1] "0" "1"

> levels(pima$test) <- c("negative", "positive") # assign descriptive labels to variable test

> levels(pima$test) # check how variable test is coded now

[1] "negative" "positive"
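The conversion and the labelling can also be done in a single call, because `factor()` accepts `levels` and `labels` arguments. A minimal sketch on a made-up 0/1 vector (`test_codes` is a hypothetical name):

```r
test_codes <- c(1, 0, 1, 0, 0)   # made-up codes: 0 = negative, 1 = positive

# levels gives the codes in order; labels gives the matching names.
test_factor <- factor(test_codes, levels = c(0, 1),
                      labels = c("negative", "positive"))

print(levels(test_factor))
print(table(test_factor))   # counts per level, similar to summary() on a factor
```

Spelling out `levels` makes the mapping from codes to labels explicit, rather than relying on the sorted order of the codes.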

Now, let's take another look at the summary of the dataset.

> summary(pima) # take a look

pregnant         glucose          blood          triceps

Min.   : 0.000   Min.   : 44.0   Min.   : 24.0   Min.   :  7.00

1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.0   1st Qu.: 22.00

Median : 3.000   Median :117.0   Median : 72.0   Median : 29.00

Mean   : 3.845   Mean   :121.7   Mean   : 72.4   Mean   : 29.15

3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.0   3rd Qu.: 36.00

Max.   :17.000   Max.   :199.0   Max.   :122.0   Max.   : 99.00

                 NA's   :  5.0   NA's   : 35.0   NA's   :227.00

insulin            bmi           pedigree           age

Min.   : 14.00   Min.   : 0.00   Min.   :0.0780   Min.   :21.00

1st Qu.: 76.25   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00

Median :125.00   Median :32.00   Median :0.3725   Median :29.00

Mean   :155.55   Mean   :31.99   Mean   :0.4719   Mean   :33.24

3rd Qu.:190.00   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00

Max.   :846.00   Max.   :67.10   Max.   :2.4200   Max.   :81.00

NA's   :374.00

test

negative:500

positive:268

Compare it with the previous summary and note how the results differ.
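One reason the two summaries differ is that the zeros no longer pull the means and lower quartiles down. A small sketch with made-up numbers (the vector `x` is hypothetical) shows the effect, and also that functions such as `mean()` need `na.rm = TRUE` once NA values are present:

```r
x <- c(0, 0, 64, 72, 80)   # made-up values; the zeros stand for missing codes

mean_with_zeros <- mean(x)  # zeros are treated as real measurements

x[x == 0] <- NA             # recode the zeros as missing

mean_default <- mean(x)                  # NA, because x contains missing values
mean_no_na   <- mean(x, na.rm = TRUE)    # mean of the observed values only

print(c(with_zeros = mean_with_zeros, default = mean_default,
        na_removed = mean_no_na))
print(sum(is.na(x)))                     # number of missing values
```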

Now we can draw some plots to examine the distributions of the variables. We use the variable blood as an example.

> hist(pima$blood) # draw histogram of variable blood

From the plot,

• We see a bell-shaped distribution for the blood pressures centered around 70.

• Note that a histogram may obscure some features of the data, because its construction depends on choices made by the user, such as the bin width on the horizontal axis.

• For this reason, a smoothed version of the histogram might be preferred.

> plot(density(pima$blood, na.rm=TRUE))  # the function "density" computes kernel density estimates; the option "na.rm=TRUE" removes missing values

We see that the density plot avoids the distracting blocks of the histogram.
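To see how much a histogram depends on the user's choices, one can vary the `breaks` argument of `hist()`. The density estimate has a tuning choice of its own, the bandwidth, which the `adjust` argument of `density()` rescales. The sketch below uses simulated values (not the Pima data), and the particular settings are arbitrary:

```r
set.seed(1)                               # make the simulated values reproducible
x <- rnorm(200, mean = 70, sd = 12)       # simulated stand-in for a measurement

# plot = FALSE returns the binning without drawing, so the bins can be compared.
# A single number for breaks is treated by hist() as a suggestion only.
h_coarse <- hist(x, breaks = 5,  plot = FALSE)
h_fine   <- hist(x, breaks = 30, plot = FALSE)
print(length(h_coarse$counts))            # number of bins with the coarse setting
print(length(h_fine$counts))              # number of bins with the fine setting

# The same data with the default bandwidth and with a doubled bandwidth.
d_default <- density(x)
d_smooth  <- density(x, adjust = 2)
print(c(default = d_default$bw, doubled = d_smooth$bw))
```

Plotting `h_coarse` and `h_fine` with `plot()` shows two rather different pictures of the same values, which is the point made above about user-specified inputs.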

Another alternative is to plot the sorted data against its index.

> plot(sort(pima$blood), pch=".")

One advantage of this plot is that we can see all the cases individually, which may offer some information about outliers in addition to the distribution of data.

We can draw the three plots in one window for a better comparison:

> par(mfrow = c(1, 3))  # the graphical parameter "mfrow" is a vector of length 2: the first number sets the number of rows, the second the number of columns; try the command "help(par)" for more information on graphical parameters

> hist(pima$blood); plot(density(pima$blood, na.rm=TRUE)); plot(sort(pima$blood), pch=".")

> par(mfrow = c(1, 1))  # set the parameter back to its original setting
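Because `par()` returns the previous values of the parameters it changes, the old layout can be saved and restored without retyping it. A minimal sketch on simulated values (not the Pima data):

```r
set.seed(1)
x <- rnorm(100, mean = 70, sd = 12)   # simulated stand-in for a measurement

old_par <- par(mfrow = c(1, 3))       # change the layout and keep the old settings
hist(x)
plot(density(x))
plot(sort(x), pch = ".")
par(old_par)                          # restore whatever layout was active before

print(par("mfrow"))
```

This is convenient when the layout in use before the change was not the default one.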

Now, consider a couple of bivariate plots.

> par(mfrow = c(1, 2))

> plot(pedigree ~ blood, pima)  # the command draws a scatter plot because the variable blood is a quantitative variable
> plot(pedigree ~ test, pima)  # it draws a side-by-side box plot because the variable test is a qualitative variable
> par(mfrow = c(1,1))

Notice that

• the scatter plot (left panel) shows the relationship between two quantitative variables,

• the side-by-side box plot (right panel) is suitable for showing how the distribution of a quantitative variable differs across the levels of a qualitative variable.
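A numerical companion to the side-by-side box plot is a summary computed within each level of the qualitative variable. The sketch below uses only the pedigree and test values from the four rows printed earlier, stored in a hypothetical data frame `toy`. With so few rows the numbers mean little, so it illustrates the calls only.

```r
# Hypothetical toy data: pedigree and test from the four rows printed earlier.
toy <- data.frame(
  pedigree = c(0.627, 0.351, 0.672, 0.315),
  test     = factor(c(1, 0, 1, 0), levels = c(0, 1),
                    labels = c("negative", "positive"))
)

# Median pedigree within each level of test; these are the centre lines
# that a side-by-side box plot would draw.
group_medians <- tapply(toy$pedigree, toy$test, median)
print(group_medians)

# The formula interface draws box plots because test is a factor.
plot(pedigree ~ test, data = toy)
```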

Also useful is a scatter plot matrix.

> pairs(pima)  # produce a matrix of scatter plots

What information can you find from these scatter plots? Are there any plots that particularly catch your attention?
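A correlation matrix can serve as a numerical counterpart to the scatter plot matrix. The sketch below is an illustration on a hypothetical data frame `toy`, built from a few columns of the four rows printed earlier, with the zero in triceps already recoded as NA. With only four rows the values themselves mean little; the point is the `use` argument of `cor()`, which controls how missing values are handled.

```r
# Hypothetical toy data: a few columns from the four rows printed earlier,
# with the zero in triceps already recoded as NA.
toy <- data.frame(
  glucose = c(148, 85, 183, 93),
  triceps = c(35, 29, NA, 31),
  bmi     = c(33.6, 26.6, 23.3, 30.4),
  age     = c(50, 31, 32, 23)
)

# Pairwise correlations; each pair uses only the rows where both values are observed.
cor_matrix <- cor(toy, use = "pairwise.complete.obs")
print(round(cor_matrix, 2))
```

Without the `use` argument, every correlation involving triceps would be reported as NA.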