Descriptive Statistics PDF


Numerical Descriptive

Descriptive Statistics

The best way to work with data is to

summarize and display the data.

Numbers that have not been

summarized and organized are called
raw data.
Descriptive measures

A descriptive measure is a single

number that is used to describe a set
of data.

Descriptive measures include

measures of central tendency and
measures of dispersion.
Summary Definitions
- The central tendency is the extent to which all the data values group around a central value.
- The variation is the amount of dispersion, or scattering, of values.
- The shape is the pattern of the distribution of values from the lowest value to the highest value.
Distribution curve
Describing Data Numerically

Central Tendency: Arithmetic Mean, Median, Mode
Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
Calculating the Mean, Median
and Mode for ungrouped data
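The Sample Mean

The sample mean, pronounced "x-bar", is the sum of the observations divided by the sample size:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

where $x_i$ is the ith observation and $n$ is the sample size (the number of observations).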
Example 1
For the sample data $x_i$: 2, 3, 5, 1, 4, 3, 2, 4, find the sample mean.

i      1  2  3  4  5  6  7  8   Σx_i
x_i    2  3  5  1  4  3  2  4   24

Sample mean: $\bar{x} = \frac{\sum_{i=1}^{8} x_i}{8} = \frac{24}{8} = 3$
Example 2

The following are the ages (in years) of

all eight employees of a small company

53, 32, 61, 27, 39, 44, 49, 57

Find the mean age of these employees.
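
As a quick check on Example 2, the mean can be computed directly from its definition. This is a minimal sketch; the function name `arithmetic_mean` is an illustrative choice, not something from the slides.

```python
# Ages (in years) of the eight employees from Example 2.
ages = [53, 32, 61, 27, 39, 44, 49, 57]


def arithmetic_mean(values):
    """Return the sum of the observations divided by their count."""
    return sum(values) / len(values)


mean_age = arithmetic_mean(ages)
print(f"Sum = {sum(ages)}, n = {len(ages)}, mean = {mean_age} years")
```

The sum of the ages is 362, so the mean age works out to 362 / 8 = 45.25 years.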

Properties of the Sample Mean
- Uniqueness: for a given set of data there is one and only one mean.
- Affected (distorted) by extreme values (outliers).
- May be better replaced by the median when the distribution of the data is skewed.

[Figure: two dot plots on a 0 to 10 axis]
Data 1, 2, 3, 4, 5: Mean = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3
Data 1, 2, 3, 4, 10: Mean = (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4
Measures of Central Tendency:
The Median

The median is the value of the middle observation in a dataset that has been ranked in increasing order.
Measures of Central Tendency:
The Median

- First, arrange the observations in ascending order.
- Then, find the middle position using the following formula:

$$\text{Median position} = \frac{n+1}{2} \text{ position in the ordered data}$$

- Find the median value at that position.
Example 1
Find the median for the following data set.

27 38 12 34 42 40 24 40 23
- The ordered set becomes

Observation  12  23  24  27  34  38  40  40  42
Rank          1   2   3   4   5   6   7   8   9

- The median position is $\frac{9+1}{2} = 5$, i.e. the 5th ranked observation.
- Therefore the median = 34.
Example 2

Sambiri Silicon manufactures computer

monitors. The following data are numbers of
computer monitors produced at the company
for a sample of 10 days. Find the median.

24 31 27 25 35 33 26 40 25 28
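
The position rule can be sketched in code for both worked examples. This is an illustrative sketch only: the function name `median_by_position` is hypothetical, and it assumes that when the position (n + 1)/2 is fractional (even n) the median is taken halfway between the two neighbouring ranked values.

```python
def median_by_position(values):
    """Median via the (n + 1) / 2 position in the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2           # 1-based position, may end in .5
    lower = ordered[int(position) - 1]
    if position.is_integer():        # odd n: a single middle observation
        return float(lower)
    upper = ordered[int(position)]   # even n: average the two middle values
    return (lower + upper) / 2


example_1 = [27, 38, 12, 34, 42, 40, 24, 40, 23]
monitors = [24, 31, 27, 25, 35, 33, 26, 40, 25, 28]

print(median_by_position(example_1))  # position 5 of 9 ranked values
print(median_by_position(monitors))   # position 5.5 of 10 ranked values
```

For the monitor data the ordered values are 24, 25, 25, 26, 27, 28, 31, 33, 35, 40, so the 5.5th position falls between 27 and 28.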
Properties of the Median
- In an ordered array, the median is the "middle" number (50% above, 50% below).
- Uniqueness: there is only one median for each set of data.
- Not affected by extreme values.

[Figure: two dot plots on a 0 to 10 axis; Median = 3 in both]
Measures of Central Tendency:
The Mode
The mode is the most frequently occurring value in a set of observations.

Organizing data into an ordered array (in ascending order) helps to locate the mode.
The Mode

- Find the mode for the data below.

7.00 11.00 14.25 15.00 15.00 15.50
19.00 19.00 19.00 19.00 21.00 22.00
23.00 24.00 25.00 27.00 27.00 28.00
34.22 43.25

The mode is 19.00 because it occurs most often, i.e. four (4) times.
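
Counting frequencies makes the mode easy to find programmatically. The sketch below is illustrative: the function name `modes` is hypothetical, and it assumes the convention that a dataset in which no value repeats has no mode (an empty list is returned).

```python
from collections import Counter

prices = [
    7.00, 11.00, 14.25, 15.00, 15.00, 15.50,
    19.00, 19.00, 19.00, 19.00, 21.00, 22.00,
    23.00, 24.00, 25.00, 27.00, 27.00, 28.00,
    34.22, 43.25,
]


def modes(values):
    """Return all most-frequent values; empty list if nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:                 # assumed convention: no repeats, no mode
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes(prices))           # single mode
print(modes([1, 1, 2, 2, 3]))  # several modes
print(modes([1, 2, 3]))        # no mode
```

Returning a list covers the cases of one mode, several modes and no mode with a single return type.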
Properties of the Mode

- Not affected by extreme values.
- There may be no mode.
- There may be several modes.

[Figure: two dot plots; one on a 0 to 14 axis with Mode = 9, one on a 0 to 6 axis with No Mode]
Measures of Central Tendency:
Review Example

House Prices (Sum = $3,000,000):
- Sample Mean = $600,000
- Median = $300,000
- Mode = $100,000
Relationship among the Mean,
Median and Mode

Knowing the values of the mean,

median and mode can give us some
idea about the shape of a frequency
distribution curve
Symmetric Histogram
Skewed Histogram
Skewed Histogram
Measures of Central Tendency:
Which Measure to Choose?

- The mean is generally used, unless extreme values (outliers) exist.
- The median is often used, since the median is not sensitive to extreme values.
- In some situations it makes sense to report both the mean and the median.
Measures of Central Tendency:

Central Tendency

Measure          Formula / definition
Sample Mean      $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
Median           Middle value in the ordered array
Mode             Most frequently observed value
Geometric Mean   $\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$; rate of change of a variable over time
Measures of Dispersion for
ungrouped data
Measures of Dispersion
The measures of central tendency, such as the mean, median and mode, do not reveal the whole picture of the distribution of the dataset.

- Two datasets with the same mean may have completely different spreads.
- The amount or degree of spread is known as variation.
Measures of Dispersion
- Which dataset has the larger variation?

[Figure: Dataset 1 and Dataset 2]
Measures of Dispersion

                Population 1   Population 2
Range           Narrow         Wide
Variation       Smaller        Larger
Deviation       Smaller        Larger
Observations    Clustered      Spread out

[Figure: Population 1 and Population 2; same centre, different variation]
Measures of Dispersion:
Summary Characteristics
- The more the data are spread out, the greater the range, variance, and standard deviation.
- The more the data are concentrated, the smaller the range, variance, and standard deviation.
Measures of Dispersion:
Summary Characteristics
- If the values are all equal (no variation), all these measures will be zero.
- None of these measures are ever negative.

Measures of Dispersion

Consider the following data on ages of

employees at each of two companies.

The mean age of employees of these

companies is the same, 40 years.
Measures of Dispersion
If we do not know the ages of individual
employees at these two companies and
we are only told that the mean age of
employees at both companies is the
same, we may wrongly deduce that the
employees at these two companies have
a similar age distribution.
Measures of Dispersion

The diagram shows that the ages of the

employees at the second company have a
much larger variation than the ages of the
employees at the first company.
Measures of Dispersion

The mean, median and mode locate the

centre of the distribution.

We also need a measure that can

provide some information about the
variation (spread) among data values.
Measures of Dispersion


Measures of variation: Range, Variance, Standard Deviation, Coefficient of Variation

Measures of variation give information on the spread or variability or dispersion of the data values.

[Figure: same centre, different variation]
Measures of Dispersion:
The Range

Range = $X_{\text{largest}} - X_{\text{smallest}}$

[Figure: dot plot on a 0 to 14 axis]

Range = 13 − 1 = 12
Measures of Dispersion:
Why The Range Can Be Misleading
- Ignores the way in which data are distributed.

[Figure: two differently distributed datasets on a 7 to 12 axis; Range = 12 − 7 = 5 in both]
Measures of Dispersion:
Why The Range Can Be Misleading

- Sensitive to outliers.

[Figure: two datasets; Range = 5 − 1 = 4 for the first and Range = 120 − 1 = 119 for the second]

The Sample Variance
Variance is used to measure the dispersion of values relative to the mean.
Measures of Dispersion:
The Sample Variance

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Where $\bar{x}$ = arithmetic mean
n = sample size
$x_i$ = ith observation of the variable X
The Sample Variance

When values are close to their mean (narrow range) the dispersion is less than when there is scattering over a wide range.
For the sample data $x_i$: 2, 3, 5, 1, 4, 3, 2, 4, find:

1. Sample variance
2. Sample standard deviation

x_i    x_i − x̄         (x_i − x̄)²
2      2 − 3 = −1      1
3      3 − 3 = 0       0
5      5 − 3 = 2       4
1      1 − 3 = −2      4
4      4 − 3 = 1       1
3      3 − 3 = 0       0
2      2 − 3 = −1      1
4      4 − 3 = 1       1
Σ 24   0               12

Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{24}{8} = 3$

Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{12}{7} = 1.714$

The Sample Standard Deviation
- Most commonly used measure of variation.
- Tells us how much the observations in our sample differ from the mean value within our sample.
- Has the same units as the original data.

$$s = \sqrt{s^2}$$
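Solution

Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{24}{8} = 3$

Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{12}{7} = 1.714$

Sample standard deviation: $s = \sqrt{s^2} = \sqrt{1.714} = 1.309$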
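
The same calculation can be reproduced step by step in code. This is a minimal sketch of the definitional formula with the n − 1 divisor; the function name `sample_variance` is an illustrative choice.

```python
import math

data = [2, 3, 5, 1, 4, 3, 2, 4]


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


s2 = sample_variance(data)   # 12 / 7
s = math.sqrt(s2)            # standard deviation, same units as the data
print(f"s^2 = {s2:.3f}, s = {s:.3f}")
```

Python's standard library offers `statistics.variance` and `statistics.stdev` for the same sample quantities; the explicit version above simply mirrors the columns of the worked table.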
Measures of Dispersion:
Comparing Standard Deviations

[Figure: three dot plots on an 11 to 21 axis, all with the same mean]
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.926
Data C: Mean = 15.5, s = 4.570
Measures of Dispersion:
The Coefficient of Variation
Sometimes we may need to compare the variability of two different datasets that have different units of measurement.
Measures of Dispersion:
The Coefficient of Variation
- Measures variation relative to the mean.
- Always expressed as a percentage (%).
- Can be used to compare the variability of two or more sets of data measured with different units of measurement.

$$CV = \left(\frac{s}{\bar{x}}\right) \times 100\%$$
Example 1
The yearly salaries of all employees who work for
a company have a mean of $62,350 and a
standard deviation of $6820. The years of
experience for the same employees have a mean
of 15 years and a standard deviation of 2 years. Is
the relative variation in the salaries larger or
smaller than that in the years of experience for
these employees?
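
One way to answer Example 1 is to compute both coefficients of variation and compare them. The sketch below applies the CV formula as given; the function name `coefficient_of_variation` is an illustrative choice, and because the example describes all employees the figures are treated here simply as a given mean and standard deviation.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100


cv_salary = coefficient_of_variation(6820, 62350)   # dollars
cv_experience = coefficient_of_variation(2, 15)     # years

print(f"CV of salaries:   {cv_salary:.2f}%")
print(f"CV of experience: {cv_experience:.2f}%")
```

The salaries have a CV of about 10.94% and the years of experience about 13.33%, so the relative variation in salaries is smaller, even though the two variables are measured in different units.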
Example 2

For example, we wish to know which is more variable, the price of stock A or the price of stock B.

                     Stock A   Stock B
Average price        $50       $100
Standard deviation   $5        $5
Measures of Dispersion:
Comparing Coefficients of Variation

$$CV_A = \left(\frac{s}{\bar{x}}\right) \cdot 100\% = \frac{5}{50} \cdot 100\% = 10\%$$

$$CV_B = \left(\frac{s}{\bar{x}}\right) \cdot 100\% = \frac{5}{100} \cdot 100\% = 5\%$$

Comparing the coefficients of variation, the relative variation is higher for stock A (10%) than for stock B (5%), even though both have the same standard deviation.

- A low (%) value shows low variability, implying tight clustering of observations about the mean.
- A middle to high (%) value shows high variability, implying that observations are widely spread.
Measures of Position for
ungrouped data
(Quartile Measures)
Quartile Measures

- Quartiles split the ranked data into 4 equal segments, each containing 25% of the observations.

25% | 25% | 25% | 25%
    Q1    Q2    Q3

- The first quartile, Q1, is the value below which the first 25% of the observations lie.
- Q2 is the same as the median; the first 50% of the observations lie below it.
- The third quartile, Q3, is the value below which the first 75% of the observations lie.
Quartile Measures

- Q1 = 25th percentile = P25
- Q2 = 50th percentile = P50
- Q3 = 75th percentile = P75

Locating Quartile Positions
Find a quartile by determining the value in the appropriate position in the ranked data.

First quartile position: Q1 = 0.25(n + 1)
Second quartile position: Q2 = 0.5(n + 1)
Third quartile position: Q3 = 0.75(n + 1)
Quartile Measures:
The Interquartile Range (IQR)
Because the range can be distorted by outliers (extreme values), a modified range which excludes these outliers is often used.

The IQR measures the spread in the middle 50% of the data:

IQR = Q3 − Q1
Quartile Measures:
The Interquartile Range (IQR)
 The IQR is also called the 50%

 This modified range removes outliers,

but it excludes 50% of all observations
from further analysis.
Quartile Measures:
The Interquartile Range (IQR)

The IQR, like the range, also provides no information on the clustering of observations within the dataset as it uses only two observations in its calculation.
Example 1

Given Sample Data in Ordered Array:

11 12 13 16 16 17 18 21 22

Find:
1. Q1 and Q3
2. IQR
Locating First quartile, Q1

11 12 13 16 16 17 18 21 22

(n = 9)
Q1 is in the 0.25(9 + 1) = 2.5 position of the ranked data,
so use the value halfway between the 2nd and 3rd values:

Q1 = (12 + 13)/2 = 12.5
Locating Third Quartile, Q3

11 12 13 16 16 17 18 21 22

(n = 9)
Q3 is in the 0.75(9 + 1) = 7.5 position of the ranked data,
so use the value halfway between the 7th and 8th values:

Q3 = (18 + 21)/2 = 19.5
The Interquartile Range (IQR)

IQR = Q3 − Q1 = 19.5 − 12.5 = 7.0
Example 2

Given Sample Data in Ordered Array:

7 8 9 10 11 12 13 13 14 17 17 45

Find:
1. Q1 and Q3
2. IQR
Locating First quartile, Q1

7 8 9 10 11 12 13 13 14 17 17 45

(n = 12)
Q1 is in the 0.25(12 + 1) = 3.25 position of the ranked data.
So find the value halfway between the 3rd and 4th values, which is (9 + 10)/2 = 9.5.
Q1 lies halfway between the 3rd value and that midpoint: Q1 = (9 + 9.5)/2 = 9.25
Locating Third Quartile, Q3

7 8 9 10 11 12 13 13 14 17 17 45

(n = 12)
Q3 is in the 0.75(12 + 1) = 9.75 position of the ranked data.
So find the value halfway between the 9th and 10th values, which is (14 + 17)/2 = 15.5.
Q3 lies halfway between that midpoint and the 10th value: Q3 = (15.5 + 17)/2 = 16.25
The Interquartile Range (IQR)

IQR = Q3 − Q1 = 16.25 − 9.25 = 7.0
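
Both quartile examples can be reproduced with a short routine. This sketch rests on an assumption: fractional positions are resolved by linear interpolation between the neighbouring ranked values, which gives the same results as the halfway steps used in the worked examples (positions ending in .25, .5 and .75). The names `value_at_position` and `quartiles` are hypothetical, and the sketch assumes at least three observations so every position falls inside the data.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in ranked data."""
    whole = int(position)                  # rank just below the position
    fraction = position - whole
    lower = ordered[whole - 1]
    if fraction == 0:
        return float(lower)
    upper = ordered[whole]
    return lower + fraction * (upper - lower)   # linear interpolation


def quartiles(values):
    """Q1, Q2, Q3 using the 0.25(n+1), 0.5(n+1), 0.75(n+1) positions."""
    ordered = sorted(values)
    n = len(ordered)
    return tuple(value_at_position(ordered, q * (n + 1))
                 for q in (0.25, 0.50, 0.75))


example_1 = [11, 12, 13, 16, 16, 17, 18, 21, 22]
example_2 = [7, 8, 9, 10, 11, 12, 13, 13, 14, 17, 17, 45]

for data in (example_1, example_2):
    q1, q2, q3 = quartiles(data)
    print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {q3 - q1}")
```

Note that the outlier 45 in the second dataset does not affect the IQR, which is 7.0 in both examples.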
Numerical Descriptive
Measures of a Population
Numerical Descriptive Measures
for a Population
- The numerical descriptive measures discussed so far described a sample, not the population.
- These descriptive statistics are called sample statistics.
Numerical Descriptive Measures
for a Population
- Summary measures describing a population are called parameters, and are denoted with Greek letters.
- Important population parameters are the population mean, population variance, and population standard deviation.
Numerical Descriptive Measures
for a Population:
The population mean µ

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

Where μ = population mean
N = population size
$X_i$ = ith observation of the variable X
Numerical Descriptive Measures
For A Population:
The Population Variance σ²

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

Where μ = population mean
N = population size
$X_i$ = ith observation of the variable X
Numerical Descriptive Measures
For A Population: The Population
Standard Deviation σ

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$
Sample statistics versus
population parameters

Measure              Population Parameter   Sample Statistic
Mean                 µ                      x̄
Variance             σ²                     s²
Standard deviation   σ                      s
Proportion           π                      p
Approximating the Mean,
Variance and Standard
deviation from grouped data
Computing Numerical Descriptive
Measures From A Frequency Distribution
We can only compute approximations to the mean, variance and the standard deviation of the data since we are dealing with grouped data.
Approximating the Sample Mean
from a Frequency Distribution
Use the midpoint of a class interval to approximate the values in that class.

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n}$$

Where n = number of observations or sample size
k = number of classes in the frequency distribution
$x_i$ = midpoint of class i
$f_i$ = frequency of observations in class i
Example 1
The table below gives the commuting times (in
minutes) from home to work for 30 employees of
a company

18 15 7 24 10
23 28 10 16 12
5 23 24 16 19
26 17 27 17 17
29 18 23 9 26
12 22 14 26 22
Descriptive Statistics on Raw Data

        n    Range   Mean    Std. Deviation   Variance
Time    30   24      18.50   6.627            43.914
Question 1

Using the grouped data approximate the mean

commuting time for the 30 employees.
Frequency Distribution

Class Limits    f_i   x_i    f_i x_i
5 ≤ x < 10      3     7.5    22.5
10 ≤ x < 15     5     12.5   62.5
15 ≤ x < 20     9     17.5   157.5
20 ≤ x < 25     7     22.5   157.5
25 ≤ x < 30     6     27.5   165.0
Total           30           565.0
The mean commuting time

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n} = \frac{565}{30} = 18.833 \text{ minutes} \approx 18.8 \text{ minutes}$$
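
The grouped-mean approximation can be sketched as a frequency-weighted average of class midpoints. The representation of each class as a `((lower, upper), frequency)` pair and the name `grouped_mean` are illustrative choices, not part of the slides.

```python
# Each class is ((lower limit, upper limit), frequency).
classes = [
    ((5, 10), 3),
    ((10, 15), 5),
    ((15, 20), 9),
    ((20, 25), 7),
    ((25, 30), 6),
]


def grouped_mean(classes):
    """Approximate the mean using class midpoints weighted by frequency."""
    n = sum(f for _, f in classes)
    weighted_total = sum(f * (lower + upper) / 2
                         for (lower, upper), f in classes)
    return weighted_total / n


approx_mean = grouped_mean(classes)
print(f"Approximate mean = {approx_mean:.3f} minutes")
```

The grouped approximation (about 18.8 minutes) differs slightly from the raw-data mean of 18.50 because every observation in a class is replaced by the class midpoint.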
Approximating the Sample Standard
Deviation from a Frequency Distribution

$$s = \sqrt{\frac{\sum_{i=1}^{k} (x_i - \bar{x})^2 f_i}{n - 1}}$$

Where n = number of observations or sample size
k = number of classes in the frequency distribution
$x_i$ = midpoint of class i
$f_i$ = frequency of observations in class i
Question 2

Using the grouped data approximate the sample

variance and standard deviation for commuting
time for the 30 employees.
Frequency Distribution

Class Limits    f_i   Midpoint x_i   (x_i − x̄)   (x_i − x̄)²   (x_i − x̄)² f_i
5 ≤ x < 10      3     7.5            −11.333     128.437      385.311
10 ≤ x < 15     5     12.5           −6.333      40.107       200.535
15 ≤ x < 20     9     17.5           −1.333      1.777        15.993
20 ≤ x < 25     7     22.5           3.667       13.447       94.129
25 ≤ x < 30     6     27.5           8.667       75.117       450.702
Total           30                                            1146.670
The Variance

$$s^2 = \frac{\sum_{i=1}^{k} (x_i - \bar{x})^2 f_i}{n - 1} = \frac{1146.670}{29} = 39.540$$
The Standard Deviation

$$s = \sqrt{s^2} = \sqrt{39.540} = 6.288 \text{ minutes} \approx 6.3 \text{ minutes}$$
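
The grouped variance and standard deviation follow the same pattern as the table columns. The sketch below is illustrative; the function name `grouped_variance` and the use of two parallel lists for midpoints and frequencies are assumptions made for the example.

```python
import math

midpoints = [7.5, 12.5, 17.5, 22.5, 27.5]
frequencies = [3, 5, 9, 7, 6]


def grouped_variance(midpoints, frequencies):
    """Approximate sample variance from class midpoints and frequencies."""
    n = sum(frequencies)
    mean = sum(f * x for x, f in zip(midpoints, frequencies)) / n
    weighted_ss = sum(f * (x - mean) ** 2
                      for x, f in zip(midpoints, frequencies))
    return weighted_ss / (n - 1)


s2 = grouped_variance(midpoints, frequencies)
s = math.sqrt(s2)
print(f"s^2 = {s2:.3f}, s = {s:.3f} minutes")
```

The approximations (variance about 39.54, standard deviation about 6.29 minutes) are close to, but not the same as, the raw-data values of 43.914 and 6.627.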
Class Exercise 1
The frequency distribution table below gives the number of iPods sold by a shop on each of 30 days. Calculate the mean, variance and standard deviation.

iPods sold   f
5 - 9        3
10 - 14      6
15 - 19      8
20 - 24      8
25 - 29      5
Class Exercise 2
Sambiri Silicon manufactures computer monitors.
The following table represents the distribution of
computer monitors produced at the company for
a sample of 30 days. Calculate the mean, variance
and standard deviation.

Class Limits   f
21 - 23        7
24 - 26        6
27 - 29        6
30 - 32        4
33 - 35        7
Class Exercise 3
A sample of 40 randomly selected households
from a city produced the following distribution of
the number of vehicles owned. Find the mean,
variance and standard deviation.
Class f
0 2
1 18
2 11
3 4
4 3
5 2
Approximating the Median
from grouped data
Approximating the Median from a
Frequency Distribution

$$M_e = L + \frac{c\,[0.5n - CF]}{f_{me}}$$

Where L = lower class limit of the median class interval
c = class width
n = sample size
$f_{me}$ = absolute frequency of the median class interval
CF = absolute cumulative frequency of the interval before the median interval
Finding the median

- Calculate the cumulative frequencies (in class order).
- Identify the median position: n/2 = 0.5n.
- Identify the median class interval from the cumulative frequency column. This is the class interval that contains the median value.
- Use the formula.
Question 3
Using the grouped data from the
commuting time for 30 employees
example approximate:

1. The median.
Commuting times Example

Class Limits    f_i   CF
5 ≤ x < 10      3     3
10 ≤ x < 15     5     8
15 ≤ x < 20     9     17    (median interval: contains the median position)
20 ≤ x < 25     7     24
25 ≤ x < 30     6     30
n = Σf_i = 30

L = 15, c = 5, n = 30, f_me = 9, CF = 8
1. The Median commuting time

Median position = n/2 = 30/2 = 15th observation

The median interval is 15 to < 20, as it contains the 15th observation.

L = 15, c = 5, n = 30, f_me = 9, CF = 8

$$M_e = 15 + \frac{5\,[0.5 \times 30 - 8]}{9} = 18.889 \approx 18.9 \text{ minutes}$$
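
The median steps (cumulative frequencies, median position, median class, formula) can be sketched as follows. The function name `grouped_median` is hypothetical, and the sketch assumes equal class widths and that the median class is the first class whose cumulative frequency reaches 0.5n.

```python
from itertools import accumulate

lower_limits = [5, 10, 15, 20, 25]
frequencies = [3, 5, 9, 7, 6]
class_width = 5


def grouped_median(lower_limits, frequencies, class_width):
    """Approximate the median with Me = L + c(0.5n - CF) / f_me."""
    n = sum(frequencies)
    cumulative = list(accumulate(frequencies))   # the CF column
    position = 0.5 * n
    # Median class: first class whose cumulative frequency reaches 0.5n.
    k = next(i for i, cf in enumerate(cumulative) if cf >= position)
    cf_before = cumulative[k - 1] if k > 0 else 0
    return lower_limits[k] + class_width * (position - cf_before) / frequencies[k]


median_time = grouped_median(lower_limits, frequencies, class_width)
print(f"Approximate median = {median_time:.3f} minutes")
```

For the commuting data the cumulative frequencies are 3, 8, 17, 24, 30, so the 15 to < 20 class is selected with CF = 8 and f_me = 9.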
Approximating the Mode from
grouped data
Approximating the Mode from a Frequency Distribution

$$M_o = L + \frac{c\,(f_m - f_{m-1})}{2f_m - f_{m-1} - f_{m+1}}$$

Where L = lower limit of the modal class interval
c = class width of the modal class interval
$f_m$ = frequency of the modal class interval
$f_{m-1}$ = frequency of the class preceding the modal class
$f_{m+1}$ = frequency of the class following the modal class
Class Limits   f_i
5 to < 10      3
10 to < 15     5
15 to < 20     9     (modal interval: contains the modal value)
20 to < 25     7
25 to < 30     6
n = Σf_i = 30

L = 15, c = 5, f_m = 9, f_m−1 = 5, f_m+1 = 7
2. The Modal commuting time

Identify the modal interval. This is the interval associated with the highest frequency.

The modal interval is 15 to < 20, as it contains the highest frequency.

L = 15, c = 5, f_m = 9, f_m−1 = 5, f_m+1 = 7

$$M_o = 15 + \frac{5\,(9 - 5)}{2(9) - 5 - 7} = 18.333 \approx 18.3 \text{ minutes}$$
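
The modal-class calculation can be sketched in a few lines. The function name `grouped_mode` is hypothetical, and two assumptions are made that the slides do not address: if several classes tie for the highest frequency the first one is used, and a missing neighbouring class (modal class at either end) is treated as having frequency 0.

```python
lower_limits = [5, 10, 15, 20, 25]
frequencies = [3, 5, 9, 7, 6]
class_width = 5


def grouped_mode(lower_limits, frequencies, class_width):
    """Approximate the mode with Mo = L + c(fm - fm-1) / (2fm - fm-1 - fm+1)."""
    m = frequencies.index(max(frequencies))      # first modal class (assumption)
    f_m = frequencies[m]
    f_prev = frequencies[m - 1] if m > 0 else 0  # assumed 0 if no class before
    f_next = frequencies[m + 1] if m < len(frequencies) - 1 else 0
    return lower_limits[m] + class_width * (f_m - f_prev) / (2 * f_m - f_prev - f_next)


modal_time = grouped_mode(lower_limits, frequencies, class_width)
print(f"Approximate mode = {modal_time:.3f} minutes")
```

The denominator is zero when the modal class and both neighbours share the same frequency, so a production version would need to handle that case explicitly.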
Finding the mode using the graphical method
[Figure: histogram of commuting times; x-axis: Number of Minutes, class midpoints 7.5, 12.5, 17.5, 22.5, 27.5]
Class Exercise 1

With the aid of an appropriate graph

drawn to scale, determine
1. The mode for the commuting time
Measures of Position for
grouped data
(Quartile Measures)
Question 5

Using the grouped data from Example

on commuting time for 30 employees
approximate the following:

1. Q1
2. Q3
3. The IQR
Class Limits fi CF

5 ≤ x <10 3 3
10 ≤ x <15 5 8
15 ≤ x <20 9 17
20 ≤ x <25 7 24
25 ≤ x <30 6 30
Q1 = 25th percentile

- Identify the interval that contains the 25th percentile.

$$P_{25} = L + \frac{c\,[0.25n - CF]}{f_{p25}}$$
Class Limits   f_i   CF
5 to < 10      3     3
10 to < 15     5     8     (P25 interval: contains the P25 position, 0.25 × 30 = 7.5)
15 to < 20     9     17
20 to < 25     7     24
25 to < 30     6     30
n = Σf_i = 30

- L = 10 → lower limit of the P25 interval
- n = 30 → sample size
- f_p25 = 5 → frequency of the P25 interval
- CF = 3 → cumulative frequency of the interval before the P25 interval
- c = 5 → class width

$$P_{25} = 10 + \frac{5\,[0.25 \times 30 - 3]}{5} = 14.5$$
Q3 = 75th percentile

- Identify the interval that contains the 75th percentile.

$$P_{75} = L + \frac{c\,[0.75n - CF]}{f_{p75}}$$
Class Limits   f_i   CF
5 to < 10      3     3
10 to < 15     5     8
15 to < 20     9     17
20 to < 25     7     24    (P75 interval: contains the P75 position, 0.75 × 30 = 22.5)
25 to < 30     6     30
n = Σf_i = 30

- L = 20 → lower limit of the P75 interval
- n = 30 → sample size
- f_p75 = 7 → frequency of the P75 interval
- CF = 17 → cumulative frequency of the interval before the P75 interval
- c = 5 → class width

$$P_{75} = 20 + \frac{5\,[0.75 \times 30 - 17]}{7} = 23.929 \approx 23.9$$
The Interquartile Range

[Figure: five-number summary with 25% of the observations in each segment]
X minimum = 5, Q1 = 14.5, Median (Q2) = 18.9, Q3 = 23.9, X maximum = 29

Interquartile range = Q3 − Q1 = 23.9 − 14.5 = 9.4
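
The percentile formula generalises to any proportion p, which also covers questions such as an 80th percentile. The sketch below is illustrative: the function name `grouped_percentile` is hypothetical, and it assumes the percentile class is the first class whose cumulative frequency reaches p × n.

```python
class_limits = [(5, 10), (10, 15), (15, 20), (20, 25), (25, 30)]
frequencies = [3, 5, 9, 7, 6]


def grouped_percentile(class_limits, frequencies, p):
    """Approximate the 100p-th percentile: L + c(p*n - CF) / f."""
    n = sum(frequencies)
    target = p * n                   # position of the percentile
    cumulative = 0                   # CF of the classes already passed
    for (lower, upper), f in zip(class_limits, frequencies):
        if f > 0 and cumulative + f >= target:
            return lower + (upper - lower) * (target - cumulative) / f
        cumulative += f
    raise ValueError("p must be between 0 and 1")


q1 = grouped_percentile(class_limits, frequencies, 0.25)
q3 = grouped_percentile(class_limits, frequencies, 0.75)
iqr = q3 - q1
print(f"Q1 = {q1:.1f}, Q3 = {q3:.1f}, IQR = {iqr:.1f}")
```

With p = 0.5 the same function reproduces the grouped median formula, since the median is the 50th percentile.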
Class Exercise 2

With the aid of an appropriate graph drawn to

scale, determine
1. Q1, Q3 and IQR
2. The 80th percentile and
3. The mid-60% range.
