
ST104: Statistical Laboratory WARWICK

Lecture 8
Samuel Touchard
Exploratory Data Analysis

Exploratory Data Analysis (EDA) refers to a collection of techniques for initial exploration of
a data set as a way of summarizing the main characteristics.

Features of EDA include:


- Numerical summaries of centre or location of data: median, mean, trimmed mean.
- Numerical summaries of spread of data: variance, standard deviation, hinges, quartiles and
  interquartile range.
- Numerical summaries of shape of data: modality, skewness and kurtosis.
- Initial graphical plots: stem-and-leaf plots, histograms (or bar charts), time plots, boxplots.

Hinges and quartiles

Definition: The hinges represent the division of the sample in 4 roughly equal parts.
- The lower hinge H1 is the median of the set {data values ≤ sample median}.
- The upper hinge H3 is the median of the set {data values ≥ sample median}.

Definition: The quartiles also represent the division of the sample in 4 roughly equal parts.
- The first quartile $Q_1$ is the data value with rank $(n+1)/4$ (i.e., $Q_1 = x_{((n+1)/4)}$).
- The third quartile $Q_3$ is the data value with rank $3(n+1)/4$ (i.e., $Q_3 = x_{(3(n+1)/4)}$).
- When these ranks are not integers, we will interpolate. Different authors/packages
  interpolate in different ways.
The R command to get Q1 and Q3 is quantile(x, c(0.25,0.75)).
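
To illustrate how interpolation rules can differ, here is a minimal sketch. It assumes, following our reading of the R documentation, that quantile() accepts a type argument, that the default is type = 7, and that type = 6 places the p-quantile at rank (n+1)p as in the definition above. The sample x and the helper rank_quantile() are made up for this illustration.

```r
# Hypothetical sample, made up for this sketch (n = 6)
x <- c(2, 4, 4, 5, 7, 9)
p <- c(0.25, 0.75)

# R's default interpolation rule (type = 7)
q_default <- quantile(x, p)

# Rule that places the p-quantile at rank (n + 1) * p (type = 6)
q_rank <- quantile(x, p, type = 6)

# The same rank-based rule written out by hand
rank_quantile <- function(x, p) {
  s <- sort(x)
  n <- length(s)
  r <- min(max((n + 1) * p, 1), n)    # keep the rank inside 1..n
  lo <- floor(r)
  hi <- ceiling(r)
  s[lo] + (r - lo) * (s[hi] - s[lo])  # linear interpolation between neighbours
}
q_hand <- sapply(p, rank_quantile, x = x)

print(q_default)
print(q_rank)
print(q_hand)
```

On this sample the default rule and the rank-based rule give different values for both quartiles, so it is worth stating which convention is in use when reporting Q1 and Q3.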

Interquartile range

Definition: The interquartile range is defined as IQR = Q3 − Q1 and measures the spread
around the median.

- We define outliers to be points lying more than 1.5 × IQR below the lower hinge (or first
  quartile) or more than 1.5 × IQR above the upper hinge (or third quartile).

Practice: For the dataset {3, 1, 5, 9, 0, −1} find the median, the mean, the 20% trimmed mean,
the variance, the standard deviation, the hinges H1 and H3, the quartiles Q1 and Q3 and the
interquartile range IQR.
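
Once the exercise has been worked by hand, the answers can be checked in R. This is a sketch under two assumptions: that positions 2 and 4 of fivenum() (Tukey's five-number summary) agree with the hinge definition above for this dataset, and that type = 6 in quantile() matches the rank (n+1)/4 convention. The variable names are our own.

```r
x <- c(3, 1, 5, 9, 0, -1)

# Centre: median, mean and 20% trimmed mean
centre <- c(median = median(x), mean = mean(x), trimmed = mean(x, trim = 0.2))

# Spread: sample variance and standard deviation (divisor n - 1)
spread <- c(var = var(x), sd = sd(x))

# Hinges: second and fourth entries of Tukey's five-number summary
hinges <- fivenum(x)[c(2, 4)]

# Quartiles under the rank (n + 1)/4 and 3(n + 1)/4 convention (assumed type = 6)
quartiles <- quantile(x, c(0.25, 0.75), type = 6)

# Interquartile range Q3 - Q1 from those quartiles
iqr <- unname(diff(quartiles))

print(centre)
print(spread)
print(hinges)
print(quartiles)
print(iqr)
```

Note that IQR(x) with its default settings uses a different interpolation rule, so it need not return the same value as iqr here.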

Discoveries dataset
?discoveries
Description: The numbers of “great” inventions and scientific discoveries in each year from
1860 to 1959.

> discoveries
Time Series:
Start = 1860
End = 1959
Frequency = 1
[1] 5 3 0 2 0 3 2 3 6 1 2 1 2 1 3 3 3 5 2 4
[21] 4 0 2 3 7 12 3 10 9 2 3 7 7 2 3 3 6 2 4 3
[41] 5 2 2 4 0 4 2 5 2 3 3 6 5 8 3 6 6 0 5 2
[61] 2 2 6 3 4 4 2 2 4 7 5 3 3 0 2 2 2 1 3 4
[81] 2 2 1 1 1 2 1 4 4 3 2 1 4 1 1 1 0 0 2 0

Discoveries dataset
> median(discoveries)
[1] 3
> mean(discoveries)
[1] 3.1
> mean(discoveries, trim = 0.2)
[1] 2.766667
> var(discoveries)
[1] 5.080808
> sd(discoveries)
[1] 2.254065
> IQR(discoveries)
[1] 2
> summary(discoveries)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 2.0 3.0 3.1 4.0 12.0
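
The hinges and the 1.5 × IQR outlier rule from the earlier slides can be applied to the same data. The following is a sketch rather than a prescribed method: it takes the hinges from fivenum() and compares the flagged points with boxplot.stats(), which, as far as we understand its documentation, applies a 1.5 × (hinge spread) rule by default.

```r
x <- as.numeric(discoveries)

# Hinges from Tukey's five-number summary: min, H1, median, H3, max
five <- fivenum(x)
h1 <- five[2]
h3 <- five[4]

# Fences 1.5 * IQR beyond the hinges
step <- 1.5 * IQR(x)
lower_fence <- h1 - step
upper_fence <- h3 + step

# Points outside the fences are flagged as outliers
outliers <- x[x < lower_fence | x > upper_fence]

# Outliers as reported by boxplot.stats() for comparison
flagged <- boxplot.stats(x)$out

print(c(H1 = h1, H3 = h3, lower = lower_fence, upper = upper_fence))
print(sort(outliers))
print(sort(flagged))
```

For this dataset the hinges coincide with the quartiles reported by summary(), so the two ways of measuring the spread give the same fences.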

Initial Graphical Plots

Graphical plots help you to check for:

- the overall pattern/shape of variation (e.g. symmetric, skewed, bi-modal etc.),
- any unusual features within a pattern or striking deviations from a pattern (outliers),
- whether any unusual features are just random occurrences or are systematic features,
- any evidence of clustering or granularity (data clumping at certain sequences of values,
  reflecting measurement scale).

For more than one variable, begin by examining each variable by itself, then move on to study
relationships between variables.

For data from more than one population, you can check for evidence of variation/pattern
within each data set relative to variation between data sets.

Histogram

A histogram is a graphical display of data using bars of different heights.


- Divide the range of data values into say K intervals (cells or bins) of equal width. If the
  width is too large, the plot may be too coarse to see the details of any pattern; if it is too
  small there may be lots of cells with just one or two observations.
- Count the number (frequency) or the percentage of observations falling into each interval.
  Be consistent with the allocation of values that equal the end points of intervals.
- Display the outcome as a plot of joined columns or bars above each interval, with height
  proportional to the count or percentage for that interval. So taller bars show that more
  data falls in that range.

A histogram can be produced in R using the command hist().

Histogram in R
The standard histogram is produced using the command hist(). The plots can be customised
using further arguments, such as:
- freq=FALSE to display densities rather than frequency counts (with cells of width 1, the
  density of a cell equals the proportion of observations in it),
- breaks to give a certain number of cells or to give cells of a desired width,
- main="Plot name" to add a title,
- xlab="Label for X axis" to add a label to the x-axis, and similarly ylab for the y-axis.
For example we can customise the histogram for the discoveries dataset as follows:

> hist(discoveries, freq=FALSE, breaks=seq(0,12,1),
+ xlab="Yearly Numbers of Important Discoveries", ylab="Proportions",
+ main="Histogram for Yearly Numbers of Important Discoveries")

Note how the prompt in R changes from > to + when a command is continued onto a new line.
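
Because the discoveries data are integer counts, every observation falls exactly on a break when the breaks are seq(0,12,1), so the allocation of end points matters. The sketch below assumes the documented defaults of hist() (right = TRUE, include.lowest = TRUE), under which the first cell collects both the 0s and the 1s. As an alternative of our own choosing, it places the breaks at half-integers so that each count gets its own cell. plot = FALSE returns the cell counts without drawing.

```r
x <- as.numeric(discoveries)

# Breaks at the integers: cells [0,1], (1,2], ..., (11,12] under the defaults
h_edge <- hist(x, breaks = seq(0, 12, 1), plot = FALSE)

# Assumed alternative: breaks at half-integers, one cell per possible count 0..12
h_mid <- hist(x, breaks = seq(-0.5, 12.5, 1), plot = FALSE)

# First cell with integer breaks: years with 0 or 1 discoveries, pooled
print(h_edge$counts[1])

# First two cells with half-integer breaks: years with 0, then years with 1
print(h_mid$counts[1:2])

# With cells of width 1 the densities are proportions and sum to 1
print(sum(h_mid$density))
```

The same breaks vector can be passed to the customised hist() call above in place of seq(0,12,1) to draw the plot.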

Histograms for Discoveries Dataset

Modality and Skewness
Modality: Number of peaks in the sample distribution.

Skewness: Measure of symmetry, or more precisely, the lack of symmetry. A distribution, or
data set, is symmetric if it looks the same to the left and right of the center point.
Definition: The skewness of a random variable $X$ with mean $\mu$ and variance $\sigma^2$, denoted by
$\tilde{\mu}_3(X)$, is the third standardized moment, defined as
\[ \tilde{\mu}_3(X) = E\left[ \left( \frac{X - \mu}{\sigma} \right)^{3} \right]. \]
Definition: We say that a distribution is:
- skewed to the right or has positive skewness if the histogram has a long right-hand tail,
  that is if (H3 - median) > (median - H1).
- skewed to the left or has negative skewness if the histogram has a long left-hand tail, that
  is if (H3 - median) < (median - H1).
- symmetric or has zero skewness if the histogram is symmetric (e.g., normal distribution).
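
A sample version of the third standardized moment can be computed directly. As far as we are aware, base R has no built-in skewness function, so the helper sample_skewness() below is our own, and its plug-in form (dividing by n rather than n - 1) is only one of several conventions in use. The sketch also evaluates the hinge-based comparison from the definition above.

```r
# Plug-in estimate of E[((X - mu) / sigma)^3]; hypothetical helper, divisor n
sample_skewness <- function(x) {
  x <- as.numeric(x)
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^(3 / 2)
}

x <- as.numeric(discoveries)

# Moment-based skewness of the discoveries data
skew <- sample_skewness(x)

# Hinge-based comparison: (H3 - median) against (median - H1)
five <- fivenum(x)
upper_gap <- five[4] - five[3]
lower_gap <- five[3] - five[2]

print(skew)
print(c(upper_gap = upper_gap, lower_gap = lower_gap))

# A sample that is symmetric about its mean has skewness 0
print(sample_skewness(c(-2, -1, 0, 1, 2)))
```

For the discoveries data the moment-based value is positive, reflecting the long right-hand tail, while the two hinge gaps are equal. The two criteria therefore need not agree, because the hinges ignore what happens in the tails.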

Skewness examples

[Figure: three example histograms, labelled (b) Symmetric, (c) Skewed to the right, (d) Skewed to the left]

Kurtosis
Kurtosis: Measure of whether the data are heavy-tailed or light-tailed relative to a normal
distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data
sets with low kurtosis tend to have light tails, or lack of outliers.
Definition: The kurtosis of a random variable $X$ with mean $\mu$ and variance $\sigma^2$, denoted by
$\kappa(X)$, is the fourth standardized moment, defined as
\[ \kappa(X) = E\left[ \left( \frac{X - \mu}{\sigma} \right)^{4} \right]. \]
We say that a distribution is:
- mesokurtic if it has kurtosis equal to 3, the value for the Normal distribution.
- leptokurtic if it has kurtosis > 3, that is if its tails are heavier than those of the Normal
  distribution (typically drawn as slim and long-tailed).
- platykurtic if it has kurtosis < 3, that is if its tails are lighter than those of the Normal
  distribution (typically drawn as flat or short-tailed).
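
The fourth standardized moment can be estimated in the same plug-in way. The helper sample_kurtosis() below is our own (base R, to our knowledge, has no built-in kurtosis function), and it divides by n throughout, which is an assumption rather than the only convention. The simulated Normal sample and the two-point sample are made up for illustration.

```r
# Plug-in estimate of E[((X - mu) / sigma)^4]; hypothetical helper, divisor n
sample_kurtosis <- function(x) {
  x <- as.numeric(x)
  z <- x - mean(x)
  mean(z^4) / mean(z^2)^2
}

# Simulated Normal data: the estimate should be close to 3 (mesokurtic)
set.seed(1)
k_normal <- sample_kurtosis(rnorm(100000))

# Two-point sample with no tails at all: kurtosis below 3 (platykurtic)
k_two_point <- sample_kurtosis(c(-1, 1, -1, 1))

# Discoveries data: a few large counts in the right tail
k_discoveries <- sample_kurtosis(discoveries)

print(c(normal = k_normal, two_point = k_two_point, discoveries = k_discoveries))
```

On the discoveries data the estimate exceeds 3, which is consistent with the handful of unusually large yearly counts.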

Kurtosis examples

Time Plot

A time plot (or just plot) is a plot of the data in the order they were obtained (or recorded).
This plot may give valuable information when the values represent the successive outcomes of
repetitions of a single statistical experiment repeated over time.

In R this can be obtained with the command plot() and it can be customised using
further arguments.

Try the commands in the following slide and compare the corresponding outputs.

Note: If you want to experiment with more datasets, typing data() will give you a list of
available datasets.

Time Plot in R
> x <- as.numeric(discoveries)
> par(mfrow=c(1,2))
> plot(x)
> plot(x, main="Discoveries Data Time Plot",
+ xlab="Time recordings were taken", ylab="Number of Discoveries",
+ cex.main=2.5, cex.lab=2, cex.axis=1.5)
> par(mfrow=c(1,2),mar=c(6,5,5,3))
> plot(x, main="Discoveries Data Time Plot",
+ xlab="Time recordings were taken", ylab="Number of Discoveries",
+ cex.main=2.5, cex.lab=2, cex.axis=1.5,
+ pch=18, col="red")
> plot(x, main="Discoveries Data Time Plot",
+ xlab="Time recordings were taken", ylab="Number of Discoveries",
+ cex.main=2.5, cex.lab=2, cex.axis=1.5,
+ col="blue", type = "l")
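
Converting with as.numeric() drops the time information, so the plots above use the index 1 to 100 on the horizontal axis. As a sketch of an alternative, discoveries can be plotted as the time series object it already is, or the years can be recovered with time(). The axis labels and the choice of type = "b" are our own.

```r
# Plotting the ts object directly puts the years 1860 to 1959 on the x-axis
plot(discoveries, main = "Discoveries Data Time Plot",
     xlab = "Year", ylab = "Number of Discoveries")

# Equivalent approach with the years extracted explicitly
years <- as.numeric(time(discoveries))
x <- as.numeric(discoveries)
plot(years, x, type = "b", pch = 18, col = "red",
     main = "Discoveries Data Time Plot",
     xlab = "Year", ylab = "Number of Discoveries")
```

With the years on the axis it is easier to relate features of the plot, such as the cluster of high counts, to calendar time.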

Time plot examples

More time plot examples

