Chapter 2: Numerical Summary Measures

Outline of Chapter 2

2.1 Measures of Center

2.2 Measures of Variability
2.3 More Detailed Summary Quantities
2.4 Quantile Plots

The sample mean

The most frequently used measure of center is simply the
arithmetic average of the available observations.
The sample mean of observations $x_1, \ldots, x_n$, denoted by
$\bar{x}$, is given by
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \tag{1}$$
The mean suffers from one deficiency that makes it an
inappropriate measure of center under some circumstances: its
value can be greatly affected by the presence of even a single
outlier (i.e., an unusually large or small observation).
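
A minimal Python sketch of equation (1), showing how one outlier shifts the mean. The data values are hypothetical and chosen only for illustration.

```python
def sample_mean(xs):
    """Arithmetic average: (x_1 + ... + x_n) / n, as in equation (1)."""
    return sum(xs) / len(xs)

# Hypothetical observations (not from the slides).
data = [10.0, 12.0, 11.0, 13.0, 9.0]
with_outlier = data + [100.0]  # the same sample plus a single outlier

mean_clean = sample_mean(data)            # 11.0
mean_outlier = sample_mean(with_outlier)  # about 25.83
```

A single unusual value moves the mean from 11.0 to roughly 25.83, far from where most of the observations lie.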

The sample median

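An alternative measure of center that resists the effect of
outliers is the median.
The sample median, denoted by $\tilde{x}$, is obtained by first
ordering the sample observations from smallest to largest. Then
$$\tilde{x} = \begin{cases}
\text{single middle value} = \left(\frac{n+1}{2}\right)\text{th value on ordered list} & \text{if } n \text{ odd} \\
\text{average of two middle values} = \text{average of } \left(\frac{n}{2}\right)\text{th and } \left(\frac{n}{2}+1\right)\text{th values} & \text{if } n \text{ even}
\end{cases} \tag{2}$$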
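
The two cases of the sample median can be sketched in Python as follows. The data are hypothetical; the function name is illustrative only.

```python
def sample_median(xs):
    """Median of a sample: middle value (n odd) or average of the
    two middle values (n even) of the ordered observations."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # The ((n+1)/2)th value in 1-based counting is index n // 2.
        return ordered[mid]
    # The (n/2)th and (n/2 + 1)th values in 1-based counting.
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical observations (not from the slides).
data = [10.0, 12.0, 11.0, 13.0, 9.0]
with_outlier = data + [100.0]

median_clean = sample_median(data)            # n = 5 (odd)  -> 11.0
median_outlier = sample_median(with_outlier)  # n = 6 (even) -> 11.5
```

Adding the outlier 100.0 changes the median only from 11.0 to 11.5, which illustrates its resistance to outliers.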

Discrete distributions

The mean value (expected value) of a discrete variable $x$,
denoted by $\mu$ [or $E(x)$], is given by
$$\mu = \sum x\,p(x) \tag{3}$$
where the summation is over all possible $x$ values.

If $x$ is a binomial variable with parameters $n$ (group size) and
$\rho$ (success proportion), then $\mu = n\rho$.
If $x$ is a Poisson variable with parameter $\lambda$, then the
mean value of $x$ is $\lambda$ itself.
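
As a numerical check of equation (3), the sketch below sums $x\,p(x)$ for a binomial and a Poisson variable. The parameter values are hypothetical, and the Poisson sum is truncated at 60 terms, which is an assumption that is adequate only for a small $\lambda$.

```python
import math

def binomial_pmf(x, n, rho):
    """Binomial mass function p(x) with group size n and success proportion rho."""
    return math.comb(n, x) * rho**x * (1 - rho) ** (n - x)

def poisson_pmf(x, lam):
    """Poisson mass function p(x) with parameter lam."""
    return math.exp(-lam) * lam**x / math.factorial(x)

# Hypothetical parameter values.
n, rho = 10, 0.3
mu_binomial = sum(x * binomial_pmf(x, n, rho) for x in range(n + 1))

lam = 2.5
# The Poisson support is infinite; the terms beyond x = 59 are negligible here.
mu_poisson = sum(x * poisson_pmf(x, lam) for x in range(60))
```

Up to rounding, `mu_binomial` agrees with $n\rho = 3$ and `mu_poisson` with $\lambda = 2.5$.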

Continuous distributions

The mean value (expected value) of a continuous variable $x$
is given by
$$\mu = \int_{-\infty}^{\infty} x f(x)\,dx \tag{4}$$
If $x_1, x_2, \ldots, x_n$ have been randomly selected from some
population or process distribution with mean value $\mu$, then the
sample mean $\bar{x}$ gives a point estimate for $\mu$.
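
The sketch below approximates the integral in equation (4) with a midpoint rule and compares it with the sample mean of simulated data. The normal distribution with $\mu = 1$ and $\sigma = 2$, the integration range, and the seed are all assumptions made for illustration.

```python
from statistics import NormalDist, fmean

# Hypothetical population distribution.
dist = NormalDist(mu=1.0, sigma=2.0)

def integrate_midpoint(g, a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (k + 0.5) * h) for k in range(steps))

# The density is negligible beyond 10 standard deviations from the mean,
# so a finite range stands in for (-infinity, infinity).
lo = dist.mean - 10 * dist.stdev
hi = dist.mean + 10 * dist.stdev
mu_numeric = integrate_midpoint(lambda x: x * dist.pdf(x), lo, hi)

# A random sample from the same distribution; its mean estimates mu.
sample = dist.samples(10_000, seed=1)
x_bar = fmean(sample)
```

Here `mu_numeric` reproduces $\mu = 1$ to high accuracy, while `x_bar` is close to 1 but varies with the sample drawn.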







[Figure: density curve with mean µ = 0; horizontal axis from −6 to 6]

The median of a continuous distribution

Just as the sample median $\tilde{x}$ separates the sample into
two equal halves, the median $\tilde{\mu}$ of a continuous
distribution divides the area under the density curve into two
equal halves, i.e.,
$$\int_{-\infty}^{\tilde{\mu}} f(x)\,dx = .5 \tag{5}$$
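
Equation (5) can be solved numerically when no closed form is at hand. The sketch below uses bisection on the cumulative area for an exponential density $f(x) = \lambda e^{-\lambda x}$, $x \ge 0$. The choice of distribution and the rate $\lambda = 0.01$ are assumptions for illustration, not taken from the slides.

```python
import math

RATE = 0.01  # hypothetical rate parameter of an exponential density

def area_to_left(x):
    """Area under f(t) = RATE * exp(-RATE * t) from 0 to x."""
    return 1.0 - math.exp(-RATE * x)

def median_by_bisection(cdf, lo, hi, tol=1e-10):
    """Find the point where the area to the left equals .5, assuming cdf(lo) < .5 < cdf(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

median = median_by_bisection(area_to_left, 0.0, 1000.0)
closed_form = math.log(2) / RATE  # known median of an exponential distribution
mean = 1 / RATE                   # known mean of an exponential distribution
```

For this right-skewed density the median (about 69.3) lies below the mean (100), since the long upper tail pulls the mean upward.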







[Figure: density curve with the median marked; horizontal axis from 0 to 500]

Measures of variability for sample data

The simplest measure of variability in a sample is the range,
the difference between the largest and smallest sample values.
The sample variance, denoted by $s^2$, is defined by
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{S_{xx}}{n-1} \tag{6}$$
The sample standard deviation, denoted by $s$, is the (positive)
square root of the variance: $s = \sqrt{s^2}$
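
A short Python sketch of the range, equation (6), and the standard deviation, cross-checked against the standard library. The data values are hypothetical.

```python
import math
import statistics

# Hypothetical observations (not from the slides).
data = [10.0, 12.0, 11.0, 13.0, 9.0]
n = len(data)

sample_range = max(data) - min(data)

x_bar = sum(data) / n
s_xx = sum((x - x_bar) ** 2 for x in data)  # sum of squared deviations
s2 = s_xx / (n - 1)                         # sample variance, divisor n - 1
s = math.sqrt(s2)                           # sample standard deviation

# statistics.variance also uses the n - 1 divisor.
s2_library = statistics.variance(data)
```

For these values the deviations from $\bar{x} = 11$ are $-1, 1, 0, 2, -2$, so $S_{xx} = 10$ and $s^2 = 10/4 = 2.5$.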

Variance of a discrete distribution

Let $x$ be a discrete variable with mass function $p(x)$ and mean
value $\mu$.
The variance of a discrete distribution for a variable $x$ is
defined by
$$\sigma^2 = \sum (x - \mu)^2 p(x) \tag{7}$$
where the sum is over all possible $x$ values.

The standard deviation: $\sigma = \sqrt{\sigma^2}$
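
The sketch below evaluates equation (7) for a binomial variable. The parameter values are hypothetical, and the comparison value $n\rho(1-\rho)$ is the standard closed form for the binomial variance, which the slides do not state.

```python
import math

def binomial_pmf(x, n, rho):
    """Binomial mass function p(x) with group size n and success proportion rho."""
    return math.comb(n, x) * rho**x * (1 - rho) ** (n - x)

# Hypothetical parameter values.
n, rho = 10, 0.3
support = range(n + 1)

mu = sum(x * binomial_pmf(x, n, rho) for x in support)
sigma2 = sum((x - mu) ** 2 * binomial_pmf(x, n, rho) for x in support)
sigma = math.sqrt(sigma2)

closed_form = n * rho * (1 - rho)  # standard result, used here only as a check
```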

Variance of a continuous distribution

The variance of a continuous distribution with density
function $f(x)$ is obtained by replacing summation in the
discrete case by integration and substituting $f(x)$ for $p(x)$.
The variance of a continuous distribution is defined by
$$\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx \tag{8}$$

The standard deviation: $\sigma = \sqrt{\sigma^2}$
For a normal distribution with parameters $\mu$ and $\sigma$,
evaluating this integral confirms that the variance equals the
square of the parameter $\sigma$:
$$\int_{-\infty}^{\infty} (x - \mu)^2 \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-\mu)^2/(2\sigma^2)}\,dx = \sigma^2 \tag{9}$$
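
Equation (9) can be checked numerically. The sketch below applies a midpoint rule to the variance integral for a normal density; the parameter values $\mu = 100$, $\sigma = 15$ and the truncation at 12 standard deviations are assumptions for illustration.

```python
from statistics import NormalDist

MU, SIGMA = 100.0, 15.0  # hypothetical normal parameters
dist = NormalDist(MU, SIGMA)

def integrate_midpoint(g, a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (k + 0.5) * h) for k in range(steps))

# The integrand is negligible beyond 12 standard deviations from the mean.
variance = integrate_midpoint(
    lambda x: (x - MU) ** 2 * dist.pdf(x),
    MU - 12 * SIGMA,
    MU + 12 * SIGMA,
)
```

The result agrees with $\sigma^2 = 15^2 = 225$ up to numerical error.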

Quartiles and the interquartile range

The median separates a data set or distribution into two
equal parts (i.e., 50% of the values exceed the median and
50% are smaller than the median).
Quartiles and percentiles give more detailed information
about the location of a data set or distribution by considering
percentages other than 50%.
The lower and upper quartiles, along with the median, separate
a data set or distribution into four equal parts:
25% of all values are smaller than the lower quartile
25% exceed the upper quartile
25% lie between each quartile and the median

Quartiles and the interquartile range: definitions

Separate the $n$ ordered sample observations into a lower half
and an upper half.
If $n$ is an odd number, include the median $\tilde{x}$ in each half.
lower quartile = median of the lower half of the data
upper quartile = median of the upper half of the data
The interquartile range (IQR), a measure of variability that
is resistant to the effect of outliers, is the difference between
the two quartiles: IQR = upper quartile − lower quartile
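
The quartile definitions above can be sketched in Python as follows. Note that this rule (median included in both halves when $n$ is odd) can differ from the default quartile methods of statistical libraries. The function name and data are illustrative.

```python
from statistics import median

def quartiles_and_iqr(xs):
    """Lower quartile, upper quartile, and IQR using the halves rule:
    when n is odd, the median is included in both halves."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[: half + (n % 2)]  # includes the median when n is odd
    upper_half = ordered[half:]             # includes the median when n is odd
    q1 = median(lower_half)
    q3 = median(upper_half)
    return q1, q3, q3 - q1

# Hypothetical samples: one with odd n, one with even n.
odd_result = quartiles_and_iqr([9, 10, 11, 12, 13])          # halves [9,10,11] and [11,12,13]
even_result = quartiles_and_iqr([1, 2, 3, 4, 5, 6, 7, 8])    # halves [1,2,3,4] and [5,6,7,8]
```

The first sample gives quartiles 10 and 12 (IQR 2); the second gives 2.5 and 6.5 (IQR 4.0).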

A boxplot is a visual display of data based on the following
five-number summary:
smallest $x_i$, lower quartile, median, upper quartile, largest $x_i$
To create a boxplot, do the following:
draw a horizontal measurement scale.
place a rectangle (above the axis) whose left and right edges
are at the lower and upper quartiles, respectively.
place a vertical line segment or a symbol inside the rectangle
at the location of the median.
draw "whiskers" out from either end of the rectangle to the
smallest and largest values in the sample.

Any observation farther than 1.5 IQR from the closest quartile
is called an outlier.
An outlier is extreme if it is more than 3 IQR from the
nearest quartile, and it is mild otherwise.
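
A sketch of the five-number summary and the outlier rule, without any plotting. The data and function names are hypothetical; the quartiles follow the halves rule (median included in both halves when $n$ is odd).

```python
from statistics import median

def five_number_summary(xs):
    """(smallest, lower quartile, median, upper quartile, largest)."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[: half + (n % 2)])
    q3 = median(ordered[half:])
    return ordered[0], q1, median(ordered), q3, ordered[-1]

def classify(x, q1, q3):
    """Return 'extreme', 'mild', or None according to the 1.5 IQR and 3 IQR rules."""
    iqr = q3 - q1
    distance = max(q1 - x, x - q3, 0)  # distance from the nearest quartile
    if distance > 3 * iqr:
        return "extreme"
    if distance > 1.5 * iqr:
        return "mild"
    return None

# Hypothetical sample.
data = [2, 10, 11, 12, 13, 14, 15, 16, 30, 60]
summary = five_number_summary(data)
_, q1, _, q3, _ = summary
outliers = {x: classify(x, q1, q3) for x in data if classify(x, q1, q3)}
```

Here the quartiles are 11 and 16, so IQR = 5. The values 2 and 30 lie between 7.5 and 15 units from the nearest quartile (mild outliers), and 60 lies 44 units above the upper quartile (extreme outlier).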

Let $p$ denote a number between 0 and 1. Then the
$(100p)$th percentile, $\eta_p$, also called the $p$th quantile,
separates the smallest $100p\%$ of the data or distribution from
the remaining values.
For instance, 90% of all values lie below the 90th percentile,
$\eta_{.9}$, and only 10% of all values exceed the 90th percentile.
The median is the 50th percentile.
For a continuous distribution, $\eta_p$ is the solution to the
equation
$$\int_{-\infty}^{\eta_p} f(x)\,dx = p \tag{10}$$
where $p$ is the area under the density curve to the left of $\eta_p$.
Example/figure 2.7: ...
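
For a normal distribution, equation (10) can be solved with the inverse cumulative distribution function in Python's standard library. The parameters $\mu = 100$ and $\sigma = 15$ are assumed here purely for illustration.

```python
from statistics import NormalDist

dist = NormalDist(mu=100, sigma=15)  # hypothetical distribution

eta_90 = dist.inv_cdf(0.9)  # 90th percentile: area .9 to its left
eta_50 = dist.inv_cdf(0.5)  # 50th percentile, i.e. the median

# Check equation (10): the area to the left of eta_p should equal p.
area_left_of_eta_90 = dist.cdf(eta_90)
```

The 90th percentile is about 119.22, and the 50th percentile coincides with the mean 100 because the normal density is symmetric.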
An investigator usually wishes to know whether it is plausible
that a numerical sample $x_1, x_2, \ldots, x_n$ was selected from a
particular type of population distribution.
Many inferential procedures are based on the assumption that
the underlying distribution is of a specified type.
However, the use of such procedures is inappropriate if the
actual distribution differs greatly from the assumed type.
Understanding the underlying distribution can sometimes give
insight into the physical mechanisms involved in generating
the data.

Introduction (cont.)

An effective way to check the distribution assumption is to
construct a quantile plot.
The essence of such a plot is that if the plot is based on the
correct distribution, the points in the plot will fall close to a
straight line.
If not, the points should depart substantially from a linear
pattern.

Let $x_{(1)}$ denote the smallest sample observation, $x_{(2)}$ the
second smallest sample observation, ..., and $x_{(n)}$ the largest
sample observation.
Take $x_{(1)}$ to be the $(.5/n)$th sample quantile, $x_{(2)}$ to be
the $(1.5/n)$th sample quantile, ..., and finally $x_{(n)}$ to be the
$((n - .5)/n)$th sample quantile.
Generally, for $i = 1, 2, \ldots, n$, $x_{(i)}$ is taken to be the
$((i - .5)/n)$th sample quantile.
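
The assignment of quantile levels to ordered observations can be sketched as follows. The data values and the function name are hypothetical.

```python
def sample_quantile_levels(xs):
    """Pair each ordered observation x_(i) with its quantile level (i - .5)/n."""
    ordered = sorted(xs)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

# Hypothetical sample with n = 5.
pairs = sample_quantile_levels([12.0, 9.0, 13.0, 10.0, 11.0])
levels = [p for p, _ in pairs]        # approximately .1, .3, .5, .7, .9
ordered_values = [x for _, x in pairs]
```

With $n = 5$ the smallest observation is the .1 quantile and the largest is the .9 quantile, so no observation is treated as the 0th or 100th percentile.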

A normal quantile plot

For $i = 1, 2, \ldots, n$, the $((i - .5)/n)$th quantiles are
determined for a specified population or process distribution
whose plausibility is being investigated.
If the sample was actually selected from the specified
distribution, the sample quantiles should be reasonably
close to the corresponding distributional quantiles, i.e., for
$i = 1, 2, \ldots, n$, there should be reasonable agreement between
$x_{(i)}$ and the $((i - .5)/n)$th quantile of the specified
distribution.
After determining the appropriate quantiles for the
distribution under investigation, form the $n$ pairs as follows:
$$\left(\frac{.5}{n}\text{th quantile},\ x_{(1)}\right), \ldots, \left(\frac{n - .5}{n}\text{th quantile},\ x_{(n)}\right)$$
Each such pair can be plotted as a point on a two-dimensional
coordinate system.

A normal quantile plot: comments and an example

In each pair (i.e., each plotted point), if the first number is
close to the second number, the point in the plot will fall close
to a 45° line (the line with slope 1 passing through the point (0, 0)).
Example: this procedure can be carried out to decide whether a
normal distribution with $\mu = 100$ and $\sigma = 15$ is plausible.
To do so:
determine the appropriate $z$ quantiles ($z$ refers to the standard
normal distribution),
express the quantiles of the normal distribution under
consideration in terms of the $z$ quantiles:
quantile for normal ($\mu$, $\sigma$) distribution =
$\mu$ + (corresponding $z$ quantile) × $\sigma$
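
The example can be sketched in Python as follows. The sample values are hypothetical; only $\mu = 100$ and $\sigma = 15$ come from the example above.

```python
from statistics import NormalDist

MU, SIGMA = 100, 15
std_normal = NormalDist()  # standard normal distribution (z)

# Hypothetical sample of n = 10 observations.
sample = [98, 74, 111, 90, 126, 85, 102, 116, 94, 106]
ordered = sorted(sample)
n = len(ordered)

# z quantiles at levels (i - .5)/n, then normal(mu, sigma) quantiles.
z_quantiles = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
normal_quantiles = [MU + z * SIGMA for z in z_quantiles]

# (distribution quantile, observation) pairs to plot against the 45-degree line.
pairs = list(zip(normal_quantiles, ordered))
max_gap = max(abs(q - x) for q, x in pairs)
```

For this sample every observation lies within about 1.4 units of its matching quantile, so the plotted points would fall close to the 45° line and the normal(100, 15) model looks plausible.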

A normal quantile plot: definition

A normal quantile plot is a plot of the ($z$ quantile,
observation) pairs.
The linear relation between normal ($\mu$, $\sigma$) quantiles and
$z$ quantiles implies that if the sample has come from a normal
distribution with parameters $\mu$ and $\sigma$, the points in the
plot should fall close to a straight line with slope $\sigma$ and
vertical intercept $\mu$.
A plot for which the points fall close to some straight line
suggests that the assumption of a normal population or
process distribution is plausible.
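
A sketch of the ($z$ quantile, observation) pairs for a simulated sample. Fitting a least-squares line to the pairs is an addition made here to quantify "close to a straight line"; it is not part of the slides, and the simulated data, seed, and function names are assumptions.

```python
from statistics import NormalDist, fmean

def normal_quantile_pairs(sample):
    """(z quantile, observation) pairs using quantile levels (i - .5)/n."""
    std_normal = NormalDist()
    ordered = sorted(sample)
    n = len(ordered)
    return [(std_normal.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(ordered, start=1)]

def least_squares_line(pairs):
    """Slope and intercept of the least-squares line through the pairs."""
    zs = [z for z, _ in pairs]
    xs = [x for _, x in pairs]
    z_bar, x_bar = fmean(zs), fmean(xs)
    s_zz = sum((z - z_bar) ** 2 for z in zs)
    s_zx = sum((z - z_bar) * (x - x_bar) for z, x in pairs)
    slope = s_zx / s_zz
    return slope, x_bar - slope * z_bar

# Simulated sample from a normal distribution with mu = 100, sigma = 15.
sample = NormalDist(mu=100, sigma=15).samples(200, seed=42)
pairs = normal_quantile_pairs(sample)
slope, intercept = least_squares_line(pairs)
```

For a sample that really is normal, `slope` should land near $\sigma$ and `intercept` near $\mu$, though both vary from sample to sample. In practice the pairs would be drawn with a plotting library and judged by eye.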

