
AST10113 Foundation Statistics

Lecture 2
Measures of Central Tendency and
Variation
Kirk Chan & Charmaine Lau

Foundation Statistics – Measures of Central Tendency and Variation


Summary Measures

Describing Data Numerically

Central Tendency: Arithmetic Mean, Median, Mode, Geometric Mean
Quartiles
Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
Shape: Asymmetry, Skewness



Measures of Central Tendency
Overview
Central Tendency

Arithmetic Mean: $\bar{X} = \dfrac{\sum_{i=1}^{n} X_i}{n}$
Median: midpoint of ranked values
Mode: most frequently observed value
Geometric Mean: $\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$



Arithmetic Mean
◼ The arithmetic mean (usually just called the mean) is the most
common measure of central tendency
◼ Sample mean:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
where n is the sample size and Xi is the ith observation
◼ Population mean:
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$$
where N is the population size and Xi is the ith observation



Arithmetic Mean (cont’d)
 The most common measure of central tendency
 Mean = sum of values divided by the number of values
 Sensitive to extreme values (outliers)
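[Figure: two dot plots on number lines, one from 0 to 10 and one from 0 to 20]

Values 1, 2, 3, 4, 5:
$$\text{Mean} = \frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3$$

Values 1, 2, 3, 4, 20:
$$\text{Mean} = \frac{1 + 2 + 3 + 4 + 20}{5} = \frac{30}{5} = 6$$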
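As a quick check of the arithmetic on this slide, the following minimal Python sketch recomputes both means with the standard library. The variable names are illustrative only and do not come from the lecture.

```python
from statistics import mean

# The two five-value data sets from the slide; only the largest value differs.
without_outlier = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 20]

mean_without = mean(without_outlier)  # 15 / 5
mean_with = mean(with_outlier)        # 30 / 5

print(mean_without, mean_with)
```

Replacing a single value (5 by 20) doubles the mean, which is what "sensitive to extreme values" means in practice.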



Median
 In an ordered array, the median is the “middle” number
(50% above, 50% below)

[Figure: two dot plots on number lines, one from 0 to 10 and one from 0 to 20; Median = 3 in both]

 Robust to extreme values



Finding the Median
◼ The location of the median:
$$\text{Median position} = \frac{n + 1}{2} \text{ position in the ordered data}$$
– If the number of values is odd, the median is the middle number
– If the number of values is even, the median is the average of the
two middle numbers

◼ Note that $\frac{n + 1}{2}$ is not the value of the median, only the
position of the median in the ranked data
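The position rule can be sketched in a few lines of Python. This is only an illustration of the (n + 1)/2 rule; the function name `median_by_position` is an assumption made for this example.

```python
def median_by_position(values):
    """Median found at position (n + 1) / 2 of the ranked data (1-based)."""
    ranked = sorted(values)
    n = len(ranked)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    position = (n + 1) / 2               # a position, not the median itself
    lower = ranked[int(position) - 1]    # value at the whole-number part of the position
    if position.is_integer():            # odd n: a single middle value
        return lower
    upper = ranked[int(position)]        # even n: the next ranked value
    return (lower + upper) / 2           # average of the two middle values

print(median_by_position([1, 2, 3, 4, 20]))
print(median_by_position([11, 12, 13, 16, 16, 17, 18, 21, 22, 25]))
```

For five values the position is 3, so the third ranked value is returned; for ten values the position is 5.5, so the fifth and sixth ranked values are averaged.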



Mode
 A measure of central tendency
 Value that occurs most often
 Not affected by extreme values
 Used for either numerical or categorical (nominal) data
 -> There may be no mode
 -> There may be several modes

[Figure: two dot plots on number lines, one from 0 to 14 and one from 0 to 6; one panel is labelled "Mode = 9"]
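The cases "no mode" and "several modes" can be illustrated with Python's `statistics.multimode`. The helper `modes` and the sample data below are hypothetical, and the convention that data have no mode when every distinct value is equally frequent is an assumption of this sketch.

```python
from statistics import multimode

def modes(values):
    """Return every most-frequent value; an empty list stands for 'no mode'."""
    most_frequent = multimode(values)
    distinct = set(values)
    # Assumed convention: if all distinct values tie (and there is more than
    # one distinct value), no value stands out, so report no mode.
    if len(distinct) > 1 and len(most_frequent) == len(distinct):
        return []
    return most_frequent

print(modes([1, 2, 3, 4, 5]))          # every value occurs once
print(modes([1, 2, 2, 3, 3]))          # two values tie for most frequent
print(modes(["red", "blue", "red"]))   # works for categorical data too
```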



Geometric Mean
◼ Formula for Geometric Mean:
$$GM = \sqrt[n]{(X_1)(X_2)(X_3) \cdots (X_n)}$$
– X1, …, Xn are positive numbers (e.g. rates of change expressed as
growth factors, ratios, etc.)
– n is the total number of values

◼ Widely used in business and economics to find the average of
– Percentages
– Ratios
– Growth rates



Example for Geometric Mean
 Example:
Suppose that Anne receives a 5% increase in salary this
year and a 15% increase next year. Calculate the average
annual percentage increase.

◼ Answer: $GM = \sqrt{(1.05)(1.15)} = 1.098863$
(i.e. average annual % increase = 1.098863 − 1 = 0.098863 = 9.8863%)

This average is the geometric mean rate of return (GMRR).

◼ Assume the original salary for Anne is USD 3000:
– Salary after both increases = 3000 × 1.05 × 1.15 = 3622.50
– Using GM: 3000 × 1.098863 × 1.098863 = 3622.50
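Anne's salary example can be reproduced with a short Python sketch. The function name `geometric_mean_rate` is an assumption for illustration; it converts each percentage increase into a growth factor (1 + rate) before taking the nth root.

```python
import math

def geometric_mean_rate(rates):
    """Geometric mean rate of return for per-period rates such as 0.05 for 5%."""
    if not rates:
        raise ValueError("at least one rate is required")
    factors = [1 + r for r in rates]            # growth factors
    if any(f <= 0 for f in factors):
        raise ValueError("growth factors must be positive")
    gm = math.prod(factors) ** (1 / len(factors))  # nth root of the product
    return gm - 1

rate = geometric_mean_rate([0.05, 0.15])
salary_after_two_years = 3000 * (1 + rate) ** 2

print(round(rate, 6), round(salary_after_two_years, 2))
```

Applying the averaged rate twice gives the same final salary as applying 5% and then 15%, which is why the geometric mean, not the arithmetic mean, is used for growth rates.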
Exercise: Real estate project
 Suppose a real estate project yields 6%, 12% and 10%
increase in the first, second and third year respectively.
Find the geometric mean rate of return (GMRR).

 Answer:



Quartiles
 Quartiles split the ranked data into 4 segments with an
equal number of values per segment
[Figure: ranked data split by Q1, Q2 and Q3 into four segments of 25% each]

◼ The first quartile, Q1, is the value for which 25% of the
observations are smaller and 75% are larger
◼ The second quartile, Q2, is the same as the median (50%
are smaller, 50% are larger)
◼ Only 25% of the observations are greater than the third
quartile, Q3



Quartile Formulas
◼ Find a quartile by determining the value in the appropriate
position in the ranked data, where
◼ First quartile, Q1: position (n+1)/4
◼ Second quartile, Q2: position (n+1)/2
(the median position)
◼ Third quartile, Q3: position 3(n+1)/4

where n is the number of observed values



Quartiles
 Example: Find the first quartile
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22

 As n=9,
Q1 is in the (9+1)/4 = 2.5 position of the ranked data, so
we use the value halfway between the 2nd and the 3rd
values, which yields Q1 = 12.5
 Q1 and Q3 are measures of non-central location while Q2,
i.e. median, is a measure of central tendency




Quartiles (cont’d)
 Example:
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22

◼ Q1 is in the (9+1)/4 = 2.5 position of the ranked data,
so Q1 = (12 + 13)/2 = 12.5
◼ Q2 is in the (9+1)/2 = 5th position of the ranked data,
so Q2 = median = 16
◼ Q3 is in the 3(9+1)/4 = 7.5 position of the ranked data,
so Q3 = (18 + 21)/2 = 19.5



Exercise for Quartiles
◼ Data: 11 12 13 16 16 17 18 21 22 25 (n = 10)

◼ Find Quartiles:
◼ Q1 is in the (10+1)/4 = 2.75 position, which rounds to the 3rd
ranked value, so Q1 = 13
◼ Q2 is in the (10+1)/2 = 5.5 position, halfway between the 5th and
6th ranked values, so Q2 = median = (16 + 17)/2 = 16.5
◼ Q3 is in the 3(10+1)/4 = 8.25 position, which rounds to the 8th
ranked value, so Q3 = 21
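The position rules used in the two quartile examples (take the value at a whole-number position, average the neighbours at a half position, otherwise round the position) can be sketched as follows. The function name `quartile` is hypothetical, and the rounding convention is inferred from the worked answers in these slides; other textbooks and software interpolate instead.

```python
def quartile(values, q):
    """Quartile q (1, 2 or 3) at position q * (n + 1) / 4 of the ranked data."""
    if q not in (1, 2, 3):
        raise ValueError("q must be 1, 2 or 3")
    ranked = sorted(values)
    n = len(ranked)
    if n < 2:
        raise ValueError("this sketch assumes at least two values")
    position = q * (n + 1) / 4           # 1-based position, may be fractional
    whole = int(position)
    fraction = position - whole          # always 0, 0.25, 0.5 or 0.75
    if fraction == 0:                    # exact position: take that value
        return ranked[whole - 1]
    if fraction == 0.5:                  # halfway: average the two neighbours
        return (ranked[whole - 1] + ranked[whole]) / 2
    nearest = whole + 1 if fraction > 0.5 else whole  # otherwise round the position
    return ranked[nearest - 1]

nine = [11, 12, 13, 16, 16, 17, 18, 21, 22]
ten = [11, 12, 13, 16, 16, 17, 18, 21, 22, 25]
print([quartile(nine, q) for q in (1, 2, 3)])
print([quartile(ten, q) for q in (1, 2, 3)])
```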



Measures of Variation (Dispersion)

Variation

Range, Interquartile Range, *Variance, *Standard Deviation, *Coefficient of Variation

◼ Measures of variation give information on the spread or
variability of the data values

[Figure: two distributions with the same center but different variation, one labelled "Small variation" and one labelled "Large variation"]
Range
 Simplest measure of variation
 Difference between the largest and the smallest values in
a set of data:
Range = Xlargest – Xsmallest

 Example:

[Figure: dot plot on a number line from 0 to 14]

Range = 14 – 1 = 13



Disadvantages of Range
 Ignores the way in which data are distributed

[Figure: two dot plots on number lines from 7 to 12 whose values are distributed differently]
Range = 12 – 7 = 5 in both cases

 Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5

Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120

Range = 120 – 1 = 119 (a misleading statistic)
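The outlier effect can be checked with a small Python sketch that rebuilds the two 25-value data sets from this slide. The helper name `value_range` is an assumption for illustration.

```python
def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

# Eleven 1s, eight 2s, four 3s, then 4 and 5 (25 values in total).
base = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
# Same data with the largest value replaced by the outlier 120.
with_outlier = base[:-1] + [120]

print(value_range(base), value_range(with_outlier))
```

Twenty-four of the twenty-five values are identical in both lists, yet the range jumps from 4 to 119.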



Interquartile Range
◼ Can eliminate some outlier problems by using the
interquartile range

◼ Eliminate some high- and low-valued observations and
calculate the range from the remaining values

◼ Interquartile range = 3rd quartile – 1st quartile = Q3 – Q1



Interquartile Range
 Example

X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70
(each of the four segments holds 25% of the values)

Interquartile range = Q3 – Q1 = 57 – 30 = 27



Boxplot: Exploratory Data Analysis
◼ Boxplot: a graphical display of data using the 5-number
summary:
Minimum -- Q1 -- Median -- Q3 -- Maximum

[Figure: boxplot labelled Minimum, 1st Quartile, Median, 3rd Quartile and Maximum]



Boxplot (cont’d)
 Boxplot can be drawn either horizontally or vertically
 Outliers can be detected and shown



Shape of Distribution
 Describes how data are distributed
 Measures of shape
– Symmetric or skewed

Left-Skewed: Mean < Median
Symmetric: Mean = Median
Right-Skewed: Median < Mean

◼ Skewness: whether the data are concentrated on one side.



Distribution Shape and Boxplot
[Figure: boxplots of a left-skewed, a symmetric and a right-skewed distribution, each marked with Q1, Q2 and Q3]

– Negative skew: The left tail is longer; the mass of the distribution is
concentrated on the right of the figure. It has relatively few low
values. The distribution is said to be left-skewed.
– Positive skew: The right tail is longer; the mass of the distribution is
concentrated on the left of the figure. It has relatively few high
values. The distribution is said to be right-skewed.



Example for Boxplot
◼ Below is a boxplot for the following data:

0 2 2 2 3 3 4 5 5 10 27

Min = 0, Q1 = 2, Q2 = 3, Q3 = 5, Max = 27

[Figure: boxplot with marks at 0, 2, 3, 5 and 27]
◼ The data are right-skewed, as the plot depicts
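The five-number summary behind this boxplot can be recomputed with Python's `statistics.quantiles`. This is a hedged sketch: the default "exclusive" method places quartiles at multiples of (n + 1)/4 like the slides, but it interpolates linearly at fractional positions rather than rounding. For these 11 values the positions (3, 6, 9) are whole numbers, so both approaches agree. Comparing the mean with the median is used here as a rough indicator of skew direction.

```python
from statistics import mean, quantiles

data = [0, 2, 2, 2, 3, 3, 4, 5, 5, 10, 27]

# Default method="exclusive": quartile positions are multiples of (n + 1) / 4.
q1, q2, q3 = quantiles(data, n=4)

five_number_summary = (min(data), q1, q2, q3, max(data))
iqr = q3 - q1

# Rough shape check from the slides: Median < Mean suggests right skew.
looks_right_skewed = q2 < mean(data)

print(five_number_summary, iqr, looks_right_skewed)
```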



In-class Exercise: Boxplot
 Data set:
2, 3, 4, 4, 6, 7, 7, 9, 10, 11, 13, and 20
Draw a boxplot and describe the shape for the above data

Answer:



Variance
◼ Measures the data dispersion

[Figure: data points scattered around their mean]

◼ Variance measures the dispersion of a set of data points
around their mean



Variance
◼ Average of squared differences of values from the mean
◼ Sample variance:
$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
where n is the sample size, Xi is the ith observation and $\bar{X}$ is the sample mean
◼ Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
where N is the population size, Xi is the ith observation and μ is the population mean



Variance
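◼ Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
• Dispersion is non-negative
• Non-negative values don’t cancel out
• Amplifies the effect of large differences

[Figure: points spread far from the mean give a higher $\sigma^2$; points close to the mean give a lower $\sigma^2$]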



Example - Variance
 Population of 5 observations:
 1, 2, 3, 4, 5
 Task: Calculate the population variance
Answer: N = 5

$$\mu = \frac{1 + 2 + 3 + 4 + 5}{5} = 3.00$$

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = \frac{(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2}{5} = \frac{10}{5} = 2.00$$
Example - Variance (cont’d)
◼ What if the values 1, 2, 3, 4, 5 are a sample (n = 5)?
$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
◼ Sample variance $s^2 = \frac{10}{4} = 2.50$

◼ Why is the sample variance different from the population
variance?
◼ Because the sample carries uncertainty: it may not have captured the full
variability of the population, so the sum is divided by n − 1 rather than n.

◼ Imaginary population: 1, 1, 1, 2, 3, 4, 5, 5, 5, 5
◼ $\sigma^2 = 2.96$

◼ Our sample variance has rightly been corrected upwards in order to
reflect the higher potential variability.
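The three variance figures on these slides can be reproduced from the two formulas directly. The following is a minimal sketch; the function names are illustrative, and Python's `statistics.pvariance` and `statistics.variance` would give the same results.

```python
def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

observations = [1, 2, 3, 4, 5]
imaginary_population = [1, 1, 1, 2, 3, 4, 5, 5, 5, 5]

print(population_variance(observations))                     # treats the 5 values as a population
print(sample_variance(observations))                         # treats them as a sample
print(round(population_variance(imaginary_population), 2))   # the imaginary population
```

The only difference between the two functions is the divisor, which is what moves the result from 2.00 to 2.50 for the same five values.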



Standard Deviation (SD)
◼ Square root of the average of squared differences of values from the mean
 Most commonly used measure of variation
 Shows variation about the mean
 It’s the square root of the variance
 Has the same units as the original data
◼ Sample standard deviation:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
where n is the sample size, Xi is the ith observation and $\bar{X}$ is the sample mean
◼ Population standard deviation:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$
where N is the population size, Xi is the ith observation and μ is the population mean



Example for sample S.D.
Sample data (Xi): 10 12 14 15 17 18 18 24

n = 8, sample mean $\bar{X} = 16$

$$S = \sqrt{\frac{(10 - \bar{X})^2 + (12 - \bar{X})^2 + (14 - \bar{X})^2 + \cdots + (24 - \bar{X})^2}{n - 1}}$$

$$= \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}}$$

$$= \sqrt{\frac{130}{7}} = 4.3095$$

A measure of the “average” scatter around the mean
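The worked example can be verified with the standard library, where `statistics.stdev` uses the n − 1 divisor. The variable names below are illustrative only.

```python
from statistics import mean, stdev

sample = [10, 12, 14, 15, 17, 18, 18, 24]

x_bar = mean(sample)                                   # sample mean
sum_of_squares = sum((x - x_bar) ** 2 for x in sample) # numerator under the root
s = stdev(sample)                                      # sqrt(sum_of_squares / (n - 1))

print(x_bar, sum_of_squares, round(s, 4))
```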



Example for Variance vs S.D.
◼ Pizza prices at 10 different places in New York and HK:

New York (USD):  1.00  2.00  3.00  3.00  5.00  6.00  7.00  8.00  9.00  11.00
Hong Kong (HKD): 75.00  80.00  90.00  90.00  95.00  100.00  100.00  110.00  130.00  150.00

                             USD          HKD
Mean                         $ 5.50       HK$ 102.00
Sample variance              $² 10.72     HK$² 523.33
Sample standard deviation    $ 3.27       HK$ 22.88

Image Credit: CC BY-NC-ND



Coefficient of Variation
 Relative measure of dispersion: comparing two or more
data sets
 Expressed as a % rather than the units
 Useful for comparing data sets which are expressed in
different units of measurement
◼ Also useful for data sets with the same unit of measurement
whose means and/or standard deviations differ greatly

$$CV = \left(\frac{S}{\bar{X}}\right) \times 100\%$$



Example: Compare prices of pizza
 Continue from previous example:
                     New York     Hong Kong
Mean                 US$ 5.50     HK$ 102
Standard Deviation   US$ 3.27     HK$ 22.88

$$CV = \left(\frac{S}{\bar{X}}\right) \times 100\%$$

CV for New York: 3.27/5.5 × 100% ≈ 60%
CV for Hong Kong: 22.88/102 × 100% ≈ 22%

Interpret this result:
Although Hong Kong has the larger standard deviation, it has the lower
coefficient of variation
-> the prices of pizza in Hong Kong are relatively less volatile.
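The comparison can be recomputed from the raw pizza prices of the previous example. This is a minimal sketch; the function name `coefficient_of_variation` is an assumption, and it uses the sample standard deviation as the slides do.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV = (sample standard deviation / mean) x 100, expressed in percent."""
    return stdev(values) / mean(values) * 100

new_york_usd = [1, 2, 3, 3, 5, 6, 7, 8, 9, 11]
hong_kong_hkd = [75, 80, 90, 90, 95, 100, 100, 110, 130, 150]

cv_new_york = coefficient_of_variation(new_york_usd)
cv_hong_kong = coefficient_of_variation(hong_kong_hkd)

print(round(cv_new_york), round(cv_hong_kong))
```

Because the CV is unit-free, the two results can be compared directly even though the underlying prices are in different currencies.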
Observations and Implications
 Observations of data dispersion:
– The more spread out, or dispersed, the data are, the larger will be
the range, the inter-quartile range, the variance, and the standard
deviation
– The more concentrated, or homogeneous, the data are, the
smaller will be the range, the inter-quartile range, the variance,
and the standard deviation
– If the observations are all the same, the range, the inter-quartile
range, the variance, and the standard deviation will all be zero
– None of the measures of variation can ever be negative

 Implications:
– Helps to know how a set of data clusters around its mean
– In any data set, at least a guaranteed minimum proportion of the observed
values lies within k standard deviations of the mean (Chebyshev's Rule)



The Empirical Rule
◼ If the data distribution is approximately bell-shaped, then
the interval:
◼ μ ± 1σ contains about 68% of the values in the
population or the sample

Example:
Consider the lifetime of a certain brand of battery
µ = 100 hr
σ = 2 hr

◼ Therefore, about 68% of batteries have a lifetime between 98 and 102 hours (μ ± 1σ)

[Figure: bell curve with the central 68% (μ ± 1σ) marked]



The Empirical Rule (cont’d)
◼ μ ± 2σ contains about 95% of the values in the
population or the sample
◼ μ ± 3σ contains about 99.7% of the values in the
population or the sample

[Figure: bell curves with the central 95% (μ ± 2σ) and 99.7% (μ ± 3σ) marked]



Example
◼ Did you know that the average IQ score is 100?
Example:
Consider IQ scores with
µ = 100
σ = 15

Therefore, about 68% of people have an IQ score between 85 and 115 (μ ± 1σ).

About 95% of people have an IQ score between 70 and 130 (μ ± 2σ).
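The intervals in both empirical-rule examples follow mechanically from μ and σ, as this short Python sketch shows. The function name `empirical_intervals` is hypothetical, and the percentages only apply when the data are approximately bell-shaped.

```python
def empirical_intervals(mu, sigma):
    """Map k to (lower, upper, approximate share) for mu ± k * sigma, k = 1, 2, 3."""
    approximate_share = {1: 0.68, 2: 0.95, 3: 0.997}
    return {
        k: (mu - k * sigma, mu + k * sigma, share)
        for k, share in approximate_share.items()
    }

iq = empirical_intervals(100, 15)
battery = empirical_intervals(100, 2)

print(iq[1], iq[2])
print(battery[1])
```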



Chebyshev’s Rule
◼ Regardless of how the data are distributed, at least
(1 − 1/k²) × 100% of the values will fall within k standard
deviations of the mean (for k > 1)
◼ Example:
At least
(1 − 1/1²) × 100% = 0% of the values lie within k = 1 (μ ± 1σ), so the bound says nothing there
(1 − 1/2²) × 100% = 75% of the values lie within k = 2 (μ ± 2σ)
(1 − 1/3²) × 100% ≈ 89% of the values lie within k = 3 (μ ± 3σ)



Example for Chebyshev’s Rule
 Example: Consider lifetime of certain brand of battery
with µ = 100hr and σ = 2hr
◼ Using Chebyshev’s theorem, between what values would
you expect at least 80% of the battery lifetimes to lie?
◼ Answer:
$$1 - \frac{1}{k^2} = 0.8 \quad\Rightarrow\quad k^2 = 5 \quad\Rightarrow\quad k = 2.2361$$
The lifetimes lie between μ ± kσ, i.e. between 100 ± 2.2361(2),
i.e. between 95.5278 hr and 104.4722 hr
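Both directions of Chebyshev's Rule (from k to a minimum share, and from a target share back to k) can be sketched in Python. The function names are assumptions made for this illustration.

```python
import math

def chebyshev_min_share(k):
    """Minimum share of values within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("the bound is only meaningful for k >= 1")
    return 1 - 1 / k ** 2

def k_for_share(share):
    """Smallest k for which Chebyshev's Rule guarantees the given share."""
    if not 0 <= share < 1:
        raise ValueError("share must be in [0, 1)")
    return math.sqrt(1 / (1 - share))   # solve 1 - 1/k^2 = share for k

mu, sigma = 100, 2                      # battery lifetime example, in hours
k = k_for_share(0.8)
lower, upper = mu - k * sigma, mu + k * sigma

print(round(k, 4), round(lower, 4), round(upper, 4))
```

The slide rounds k to 2.2361 before multiplying, so its interval endpoints can differ from the unrounded computation in the fourth decimal place.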
