Initial Data Analysis (Reading: Faraway (2005, 1st edition), section 1.2)


This is a critical step that should always be performed. You should

  1. understand the background of the dataset and what each variable in it represents.

  2. calculate some descriptive statistics, such as means, standard deviations, maxima and minima, correlations, and whatever else is appropriate.

  3. draw graphical summaries, such as histograms, box plots, density plots, scatter plots, and many more.


In these numerical and graphical summaries, you can look for

and check


Here is a data set from a study conducted by the National Institute of Diabetes and Digestive and Kidney Diseases on 768 adult female Pima Indians living near Phoenix. We start by reading the data into R.

> pima <- read.table("", header=TRUE) # read the data into R; the first line of the file holds the variable names
> pima # take a look

     pregnant glucose blood triceps insulin   bmi  pedigree age test
1           6     148    72      35       0  33.6     0.627  50    1
2           1      85    66      29       0  26.6     0.351  31    0
3           8     183    64       0       0  23.3     0.672  32    1

... much deleted ...
768         1      93    70      31       0  30.4     0.315  23    0
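
Printing all 768 rows is rarely informative. A minimal sketch of more compact ways to inspect a data frame follows. Because the location of the data file is not given here, the data frame pima_head below is a hypothetical stand-in built from the four rows displayed above, not the full data set.

```r
# Stand-in built from the four rows printed above (the real `pima`
# has 768 rows); `pima_head` is a hypothetical name used for illustration.
pima_head <- data.frame(
  pregnant = c(6, 1, 8, 1),
  glucose  = c(148, 85, 183, 93),
  blood    = c(72, 66, 64, 70),
  triceps  = c(35, 29, 0, 31),
  insulin  = c(0, 0, 0, 0),
  bmi      = c(33.6, 26.6, 23.3, 30.4),
  pedigree = c(0.627, 0.351, 0.672, 0.315),
  age      = c(50, 31, 32, 23),
  test     = c(1, 0, 1, 0)
)

dim(pima_head)      # number of rows and number of columns
str(pima_head)      # storage type and first values of every variable
head(pima_head, 3)  # only the first three rows
```

The same three calls applied to the full pima data frame give a quick overview without flooding the console.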


The variables represent:

pregnant the number of times pregnant
glucose the plasma glucose concentration at 2 hours in an oral glucose tolerance test
blood the diastolic blood pressure (mmHg)
triceps the triceps skin fold thickness (mm)
insulin the 2-hour serum insulin (mu U/ml)
bmi the body mass index (weight in kg / (height in m)^2)
pedigree the diabetes pedigree function
age the age (years)
test whether the patient showed signs of diabetes (0=negative, 1=positive)

(Q: Are these variables quantitative or qualitative? If quantitative, are they continuous or discrete? If qualitative, is there an order among the levels?)
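
How R treats a variable depends on how it is stored, not on what it means. The sketch below illustrates this with the 0/1 coding of test, using only the four values from the rows shown earlier. The level labels are an assumption taken from the variable description (0=negative, 1=positive), and test_f is a hypothetical name.

```r
# The 0/1 coding of `test` in the four rows shown earlier
test <- c(1, 0, 1, 0)
summary(test)    # treated as a number: min, quartiles, mean, max

# Declare it qualitative. The labels are an ASSUMPTION based on the
# variable description (0 = negative, 1 = positive).
test_f <- factor(test, levels = c(0, 1), labels = c("negative", "positive"))
summary(test_f)  # treated as a factor: one count per level
```

A mean of a 0/1 code is just the proportion of ones, whereas the factor version reports counts, which is usually what is wanted for a qualitative variable.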


At this stage, we are looking for anything unusual or unexpected, such as signs of a data-entry error or anything inconsistent with what we already know about the data. Let's first calculate some numerical summaries.

> summary(pima)  # some numerical summaries 

    pregnant         glucose          blood          triceps    
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.0   Min.   : 0.00 
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.0   1st Qu.: 0.00 
 Median : 3.000   Median :117.0   Median : 72.0   Median :23.00 
 Mean   : 3.845   Mean   :120.9   Mean   : 69.1   Mean   :20.54 
 3rd Qu.: 6.000   3rd Qu.:140.3   3rd Qu.: 80.0   3rd Qu.:32.00 
 Max.   :17.000   Max.   :199.0   Max.   :122.0   Max.   :99.00 
    insulin           bmi           pedigree           age      
 Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00 
 1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00 
 Median : 30.5   Median :32.00   Median :0.3725   Median :29.00 
 Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24 
 3rd Qu.:127.3   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00 
 Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00 
      test       
 Min.   :0.0000 
 1st Qu.:0.0000 
 Median :0.0000 
 Mean   :0.3490 
 3rd Qu.:1.0000 
 Max.   :1.0000 

Take a close look at the minimum and maximum values of each variable. What have you found?
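
One check that follows naturally from the minimum values is to count how many zeros each variable contains, since a zero is implausible for measurements such as blood pressure or BMI. The sketch below does this for three variables from the four rows shown earlier. The data frame name toy is hypothetical, and recoding zeros as missing is an assumption that should be confirmed against the study documentation before it is applied to the real data.

```r
# Three variables from the four rows shown earlier (hypothetical name `toy`)
toy <- data.frame(
  blood   = c(72, 66, 64, 70),
  triceps = c(35, 29, 0, 31),
  bmi     = c(33.6, 26.6, 23.3, 30.4)
)

colSums(toy == 0)  # number of zeros in each column

# ASSUMPTION: a zero here means "not recorded", so recode it as NA
toy$triceps[toy$triceps == 0] <- NA

summary(toy$triceps)             # the NA is now reported separately
mean(toy$triceps, na.rm = TRUE)  # mean over the recorded values only
```

Leaving such zeros in place would pull means and quartiles downward, which is why the coding of missing values is worth settling before any further analysis.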

Now we can do some plots to examine the distribution of variables. Use the variable blood as an example.

> hist(pima$blood) # draw histogram of variable blood

From the plot,

Another alternative is to plot the sorted data against its index.

> plot(sort(pima$blood), pch=".") 

One advantage of this plot is that we can see all the cases individually, which may offer some information about outliers in addition to the distribution of data.


We can draw the histogram, a kernel density estimate, and the sorted-data plot side by side in one window for a better comparison:

> par(mfrow = c(1, 3))  # the graphical parameter "mfrow" is a vector of length 2: the first number is the number of rows and the second the number of columns of the plot layout; try "help(par)" for more information on graphical parameters

> hist(pima$blood); plot(density(pima$blood, na.rm=TRUE)); plot(sort(pima$blood), pch=".")

> par(mfrow = c(1, 1))  # set the parameter back to its original setting
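
A common alternative to resetting mfrow by hand is to save the old settings and restore them afterwards, which works whatever layout was active before. The sketch below shows this idiom. The vector x is simulated and only stands in for pima$blood; it is not the real data.

```r
# SIMULATED stand-in for pima$blood (hypothetical values, not the real data)
set.seed(1)
x <- round(rnorm(200, mean = 70, sd = 12))

op <- par(mfrow = c(1, 3))  # par() returns the previous settings
hist(x)                     # histogram
plot(density(x))            # kernel density estimate
plot(sort(x), pch = ".")    # sorted values against their index
par(op)                     # restore the layout that was active before
```

Saving the result of par() in op means the code does not need to know what the original setting was.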


Now, consider a couple of bivariate plots.

> par(mfrow = c(1, 2))

> plot(pedigree ~ blood, pima)  # the command draws a scatter plot because the variable blood is a quantitative variable
> plot(pedigree ~ factor(test), pima)  # it draws side-by-side box plots because factor(test) is a qualitative variable; with test left as the numeric codes 0/1, a scatter plot would be drawn instead
> par(mfrow = c(1,1))
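
The numbers behind side-by-side box plots can also be computed directly, one group at a time. The sketch below uses tapply with the pedigree and test values from the four rows shown earlier. With two observations per group the results only illustrate the mechanics and say nothing about the full data.

```r
# pedigree and test from the four rows shown earlier
pedigree <- c(0.627, 0.351, 0.672, 0.315)
test     <- factor(c(1, 0, 1, 0))

group_median <- tapply(pedigree, test, median)  # one median per level of test
group_median

tapply(pedigree, test, fivenum)  # min, lower hinge, median, upper hinge, max
```

The five-number summary returned by fivenum is what a box plot draws for each group, apart from the rule used to flag outlying points.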

Notice that


Also useful is a scatter plot matrix.

> pairs(pima)  # produce a matrix of scatter plots

What information can you find from these scatter plots? Are there any plots that particularly catch your attention?
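
A correlation matrix is a numerical companion to the scatter plot matrix: each entry summarizes the linear association seen in the corresponding panel. The sketch below uses three variables from the four rows shown earlier. The data frame name toy is hypothetical, and with only four observations the correlations are purely illustrative.

```r
# Three variables from the four rows shown earlier (hypothetical name `toy`)
toy <- data.frame(
  glucose = c(148, 85, 183, 93),
  bmi     = c(33.6, 26.6, 23.3, 30.4),
  age     = c(50, 31, 32, 23)
)

m <- cor(toy)  # pairwise Pearson correlations
round(m, 2)
pairs(toy)     # the matching matrix of scatter plots
```

A correlation near zero does not rule out a nonlinear pattern, so the plots and the matrix are best read together.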