Lec02 - Central Tendency (Student)

AST10113 Foundation Statistics
Lecture 2
Measures of Central Tendency and
Variation
Kirk Chan & Charmaine Lau
Foundation Statistics – Measures of Central Tendency and Variation

Summary Measures
Describing Data Numerically
– Central Tendency: Arithmetic Mean, Median, Mode, Geometric Mean
– Quartiles
– Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
– Shape: Asymmetry, Skewness

Measures of Central Tendency
Overview
Central Tendency
– Arithmetic Mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
– Median: midpoint of ranked values
– Mode: most frequently observed value
– Geometric Mean: $\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$

Arithmetic Mean
 The arithmetic mean (usually just called the mean) is the most
common measure of central tendency
 Sample mean:
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$
where n is the sample size and $X_i$ is the ith observation
 Population mean:
$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$
where N is the population size and $X_i$ is the ith observation

Arithmetic Mean (cont’d)
 The most common measure of central tendency
 Mean = sum of values divided by the number of values
 Sensitive to extreme values (outliers)
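Example (values 1, 2, 3, 4, 5 on a number line from 0 to 10): Mean = 3
$\frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3$
Example (values 1, 2, 3, 4, 20 on a number line from 0 to 20): Mean = 6
$\frac{1 + 2 + 3 + 4 + 20}{5} = \frac{30}{5} = 6$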

Median
 In an ordered array, the median is the “middle” number
(50% above, 50% below)
[Figure: two number lines, one from 0 to 10 and one from 0 to 20; Median = 3 in both]
 Robust to extreme values
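
The contrast between the mean and the median can be checked with a short Python sketch. It assumes the median slide uses the same two data sets as the arithmetic mean slide (1, 2, 3, 4, 5 and 1, 2, 3, 4, 20); the variable names are illustrative only.

```python
from statistics import mean, median

# Data from the arithmetic mean slide; the second set swaps 5 for the outlier 20.
# Assumption: the median slide refers to these same two data sets.
without_outlier = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 20]

for data in (without_outlier, with_outlier):
    # The mean is pulled towards the outlier; the median stays at the middle value.
    print(data, "mean =", mean(data), "median =", median(data))
```

The mean moves from 3 to 6 when the outlier is introduced, while the median stays at 3.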

Finding the Median
 The location of the median:
Median position = $\frac{n + 1}{2}$ position in the ordered data
– If the number of values is odd, the median is the middle number
– If the number of values is even, the median is the average of the
two middle numbers
 Note that $\frac{n + 1}{2}$ is not the value of the median, only the
position of the median in the ranked data
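
The position rule can be sketched in Python as follows. The function name `median_by_position` is hypothetical; the two data sets are the ordered arrays used in the quartile slides of this lecture.

```python
import math


def median_by_position(values):
    """Return (position, median) using the (n + 1) / 2 position rule."""
    if not values:
        raise ValueError("at least one value is required")
    ranked = sorted(values)
    n = len(ranked)
    position = (n + 1) / 2  # 1-based position; ends in .5 when n is even
    low = ranked[math.floor(position) - 1]   # value just at or below the position
    high = ranked[math.ceil(position) - 1]   # value just at or above the position
    # For odd n, low and high are the same value, so the average is that value.
    return position, (low + high) / 2


print(median_by_position([11, 12, 13, 16, 16, 17, 18, 21, 22]))
print(median_by_position([11, 12, 13, 16, 16, 17, 18, 21, 22, 25]))
```

The first call reports position 5 with median 16; the second reports position 5.5 with median 16.5, which shows that the position and the value are different things.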

Mode
 A measure of central tendency
 Value that occurs most often
 Not affected by extreme values
 Used for either numerical or categorical (nominal) data
 -> There may be no mode
 -> There may be several modes
[Figure: two dot plots, one on a 0 to 14 scale with Mode = 9 and one on a 0 to 6 scale]
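
A minimal Python sketch of the three cases (one mode, several modes, no mode) is given below. The data sets are made up for illustration and are not taken from the slides. The convention that a data set has no mode when every distinct value occurs equally often is an assumption of this sketch.

```python
from statistics import multimode


def describe_mode(values):
    """Return the list of modes, or an empty list when there is no mode."""
    modes = multimode(values)  # every value tied for the highest frequency
    # Assumed convention: if all distinct values are tied, report "no mode".
    if len(modes) > 1 and len(modes) == len(set(values)):
        return []
    return modes


print(describe_mode([1, 3, 9, 9, 9, 12]))                       # one mode
print(describe_mode(["red", "blue", "red", "blue", "green"]))   # several modes, categorical data
print(describe_mode([0, 1, 2, 3, 4, 5, 6]))                     # no mode
```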

Geometric Mean
 Formula for Geometric Mean:
$GM = \sqrt[n]{(X_1)(X_2)(X_3) \cdots (X_n)}$
– $X_1, \ldots, X_n$ is a set of positive numbers (e.g. rates of change in
percentages, ratios, etc.)
– n is the total number of values
 Widely used in business and economics to find the average of
– Percentages
– Ratios
– Growth rates

Example for Geometric Mean
 Example:
Suppose that Anne receives a 5% increase in salary this
year and a 15% increase next year. Calculate the average
annual percentage increase.
 Answer: $GM = \sqrt[2]{(1.05)(1.15)} = 1.098863$
(i.e. average annual % increase = 1.098863 - 1 = 0.098863 = 9.8863%)

Geometric mean rate of return (GMRR)
 Assume the original salary for Anne is USD3000,
– Salary after the two increases = 3000*1.05*1.15 = 3622.50
– Using GM: 3000 * 1.098863*1.098863 = 3622.50
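
The salary example can be reproduced with a small Python sketch. The function name `geometric_mean_rate` is hypothetical, and rates are assumed to be given as decimals (0.05 for 5%).

```python
import math


def geometric_mean_rate(rates):
    """Geometric mean rate of return for per-period rates given as decimals."""
    factors = [1 + r for r in rates]               # 5% becomes the growth factor 1.05
    gm = math.prod(factors) ** (1 / len(factors))  # n-th root of the product
    return gm - 1


rate = geometric_mean_rate([0.05, 0.15])
salary = 3000 * (1 + rate) ** 2  # apply the average rate for two years

print(f"average annual increase: {rate:.4%}")
print(f"salary after two years:  {salary:.2f}")
```

Applying the geometric mean rate twice gives the same final salary as applying the 5% and 15% increases in turn, which is the property that makes it the appropriate average for growth rates.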
Exercise: Real estate project
 Suppose a real estate project yields increases of 6%, 12% and 10%
in the first, second and third years respectively.
Find the geometric mean rate of return (GMRR).
 Answer:

Quartiles
 Quartiles split the ranked data into 4 segments with an
equal number of values per segment
[Diagram: ranked data split as 25% | Q1 | 25% | Q2 | 25% | Q3 | 25%]
◼ The first quartile, Q1, is the value for which 25% of the
observations are smaller and 75% are larger
◼ The second quartile, Q2, is the same as the median (50%
are smaller, 50% are larger)
◼ Only 25% of the observations are greater than the third
quartile, Q3

Quartile Formulas
 Find a quartile by determining the value in the appropriate
position in the ranked data, where
 Q1, First quartile position: Q1 = (n+1)/4
 Q2, Second quartile position: Q2 = (n+1)/2
(the median position)
 Q3, Third quartile position: Q3 = 3(n+1)/4
where n is the number of observed values

Quartiles
 Example: Find the first quartile
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22
 As n=9,
Q1 is in the (9+1)/4 = 2.5 position of the ranked data, so
we use the value halfway between the 2nd and the 3rd
values, which yields Q1 = 12.5
 Q1 and Q3 are measures of non-central location while Q2,
i.e. median, is a measure of central tendency

Quartiles (cont’d)
 Example:
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22
 Q1 is in the (9+1)/4 = 2.5 position of the ranked data,
so Q1 = 12.5
 Q2 is in the (9+1)/2 = 5th position of the ranked data,
so Q2 = median = 16
 Q3 is in the 3(9+1)/4 = 7.5 position of the ranked data,
so Q3 = 19.5

Exercise for Quartiles
 Data: 11 12 13 16 16 17 18 21 22 25 (n = 10)
 Find Quartiles:
 Q1 is in the (10+1)/4 = 2.75 position, rounded to the 3rd
ranked value, so Q1 = 13
 Q2 is in the (10+1)/2 = 5.5 position, halfway between the 5th and 6th
ranked values, so Q2 = median = 16.5
 Q3 is in the 3(10+1)/4 = 8.25 position, rounded to the 8th
ranked value, so Q3 = 21
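
The position rule used in the two quartile examples can be sketched in Python. The rounding convention here is inferred from the worked answers above rather than stated as a rule in the slides: a whole-number position takes that value, a position ending in .5 averages its two neighbours, and any other position is rounded to the nearest whole position. The function name `quartile` is hypothetical.

```python
def quartile(values, q):
    """q-th quartile (q = 1, 2 or 3) using the q(n + 1)/4 position rule."""
    ranked = sorted(values)
    n = len(ranked)
    if n < 2:
        raise ValueError("at least two values are required")
    position = q * (n + 1) / 4      # 1-based position in the ranked data
    whole = int(position)
    fraction = position - whole     # always 0, 0.25, 0.5 or 0.75
    if fraction == 0:
        return ranked[whole - 1]
    if fraction == 0.5:
        # Halfway between two ranked values: take their average.
        return (ranked[whole - 1] + ranked[whole]) / 2
    # Otherwise round the position to the nearest whole number.
    return ranked[round(position) - 1]


nine = [11, 12, 13, 16, 16, 17, 18, 21, 22]
ten = nine + [25]
print([quartile(nine, q) for q in (1, 2, 3)])
print([quartile(ten, q) for q in (1, 2, 3)])
```

Library routines such as `numpy.percentile` use other interpolation conventions by default, so they may return slightly different quartiles for the same data.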

Measures of Variation (Dispersion)
Variation
– Range
– Interquartile Range
– *Variance
– *Standard Deviation
– *Coefficient of Variation
◼ Measures of variation give information on the spread or
variability of the data values
[Figure: two distributions with the same center but different variation, one small and one large]

Range
 Simplest measure of variation
 Difference between the largest and the smallest values in
a set of data:
Range = Xlargest – Xsmallest
 Example (values on a number line from 0 to 14, smallest 1 and largest 14):
Range = 14 – 1 = 13

Disadvantages of Range
 Ignores the way in which data are distributed
[Figure: two data sets on a 7 to 12 scale with different distributions but the same range]
Range = 12 - 7 = 5 in both cases
 Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119 (misleading statistic)
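
The outlier effect can be reproduced in a few lines of Python using the two data sets above. The helper name `data_range` is hypothetical.

```python
def data_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


# The 24 values shared by both data sets: eleven 1s, eight 2s, four 3s and one 4.
shared = [1] * 11 + [2] * 8 + [3] * 4 + [4]

print(data_range(shared + [5]))    # last value is 5
print(data_range(shared + [120]))  # last value is the outlier 120
```

Changing a single observation from 5 to 120 moves the range from 4 to 119, although the other 24 values are identical.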

Interquartile Range
 Can eliminate some outlier problems by using the
interquartile range
 Eliminate some high- and low-valued observations and
calculate the range from the remaining values
 Interquartile range = 3rd quartile – 1st quartile = Q3 – Q1

Interquartile Range
 Example
[Figure: boxplot with X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70, and 25% of the values in each of the four segments]
Interquartile range = 57 – 30 = 27

Boxplot: Exploratory Data Analysis
 Boxplot: A Graphical display of data using 5-number
summary:
Minimum -- Q1 -- Median -- Q3 -- Maximum
[Figure: boxplot marking Minimum, 1st Quartile, Median, 3rd Quartile and Maximum, with 25% of the data in each half of the box]

Boxplot (cont’d)
 Boxplot can be drawn either horizontally or vertically
 Outliers can be detected and shown

Shape of Distribution
 Describes how data are distributed
 Measures of shape
– Symmetric or skewed
– Left-Skewed: Mean < Median
– Symmetric: Mean = Median
– Right-Skewed: Median < Mean
 Skewness: whether the data are concentrated on one side.

Distribution Shape and Boxplot
[Figure: boxplots of left-skewed, symmetric and right-skewed distributions, each marking Q1, Q2 and Q3]
– Negative skew: The left tail is longer; the mass of the distribution is
concentrated on the right of the figure. It has relatively few low
values. The distribution is said to be left-skewed.
– Positive skew: The right tail is longer; the mass of the distribution is
concentrated on the left of the figure. It has relatively few high
values. The distribution is said to be right-skewed.

Example for Boxplot
 Below is a boxplot for the following data:
0 2 2 2 3 3 4 5 5 10 27
[Figure: boxplot with Min = 0, Q1 = 2, Q2 = 3, Q3 = 5, Max = 27]
 The data are right-skewed, as the plot depicts
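
The five-number summary behind this boxplot can be checked with a short Python sketch. For n = 11 all three quartile positions are whole numbers, so the helper `value_at` (a hypothetical name) handles only that case. Classifying the shape by comparing the mean with the median follows the rule of thumb from the Shape of Distribution slide.

```python
from statistics import mean, median

data = [0, 2, 2, 2, 3, 3, 4, 5, 5, 10, 27]  # ordered array from the slide


def value_at(ranked, position):
    """Value at a 1-based position; this sketch handles whole positions only."""
    if position != int(position):
        raise ValueError("position is not a whole number")
    return ranked[int(position) - 1]


n = len(data)
q1 = value_at(data, (n + 1) / 4)      # 3rd ranked value
q2 = value_at(data, (n + 1) / 2)      # 6th ranked value
q3 = value_at(data, 3 * (n + 1) / 4)  # 9th ranked value

five_number = (min(data), q1, q2, q3, max(data))
iqr = q3 - q1

# Rule of thumb from the slides: Median < Mean suggests a right-skewed shape.
if mean(data) > median(data):
    shape = "right-skewed"
elif mean(data) < median(data):
    shape = "left-skewed"
else:
    shape = "symmetric"

print(five_number, "IQR =", iqr, shape)
```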

In-class Exercise: Boxplot
 Data set:
2, 3, 4, 4, 6, 7, 7, 9, 10, 11, 13, and 20
Draw a boxplot and describe the shape for the above data
Answer:

Variance
 Measures the data dispersion
 Variance measures the dispersion of a set of data points
around their mean
[Figure: data points scattered around the Mean]

Variance
 Average of squared differences of values from the mean
 Sample variance:
$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
where $\bar{X}$ is the sample mean
 Population variance:
$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
where μ is the population mean

Variance
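 Population variance:
$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
• Dispersion is non-negative
• Non-negative values don’t cancel out
• Amplifies the effect of large differences
[Figure: squared distances from the Mean; a larger distance gives a higher result, a smaller distance a lower result]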

Example - Variance
 Population of 5 observations:
 1, 2, 3, 4, 5
 Task: Calculate the population variance
$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
Answer:
N = 5
Mean $\mu = \frac{1 + 2 + 3 + 4 + 5}{5} = 3.00$
$\sigma^2 = \frac{(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2}{5} = 2.00$

Example - Variance (cont’d)
 What if they are a sample (1, 2, 3, 4, 5), n = 5?
$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
 Sample variance $s^2$ = 2.50
 Why is the sample variance different from the population variance?
 Because the sample has uncertainty.
 Imaginary population: 1, 1, 1, 2, 3, 4, 5, 5, 5, 5
 $\sigma^2$ = 2.96
 Our sample variance is rightfully corrected upwards in order to
reflect the higher potential variability.
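
The three variances quoted in this example can be verified with a short Python sketch that follows the two formulas directly. The function names are illustrative.

```python
def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


observations = [1, 2, 3, 4, 5]
imaginary_population = [1, 1, 1, 2, 3, 4, 5, 5, 5, 5]

print(round(population_variance(observations), 2))          # treated as a population
print(round(sample_variance(observations), 2))              # treated as a sample
print(round(population_variance(imaginary_population), 2))  # the imaginary population
```

Dividing by n - 1 instead of N raises the result from 2.00 to 2.50, which is closer to the 2.96 of the imaginary population from which the sample could have been drawn.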

Standard Deviation (SD)
 Square root of the average of squared differences of values from the mean
 Most commonly used measure of variation
 Shows variation about the mean
 It’s the square root of the variance
 Has the same units as the original data
 Sample standard deviation:
$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
where $\bar{X}$ is the sample mean
 Population standard deviation:
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$
where μ is the population mean

Example for sample S.D.
Sample Data ($X_i$): 10 12 14 15 17 18 18 24
n = 8, Sample mean = $\bar{X}$ = 16
$s = \sqrt{\frac{(10 - \bar{X})^2 + (12 - \bar{X})^2 + (14 - \bar{X})^2 + \cdots + (24 - \bar{X})^2}{n - 1}}$
$= \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}}$
$= \sqrt{\frac{130}{7}} = 4.3095$
A measure of the “average” scatter around the mean
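
The intermediate steps of this calculation (mean, sum of squared deviations, division by n - 1, square root) can be traced in Python. Variable names are illustrative.

```python
import math

data = [10, 12, 14, 15, 17, 18, 18, 24]  # sample data from the slide

n = len(data)
x_bar = sum(data) / n                                    # sample mean
squared_deviations = sum((x - x_bar) ** 2 for x in data) # numerator of the variance
s = math.sqrt(squared_deviations / (n - 1))              # sample standard deviation

print(f"n = {n}, mean = {x_bar}, sum of squares = {squared_deviations}, s = {s:.4f}")
```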

Example for Variance vs S.D.
 Pizza prices at 10 different places in New York and HK:
USD        HKD
$ 1.00     HK$ 75.00
$ 2.00     HK$ 80.00
$ 3.00     HK$ 90.00
$ 3.00     HK$ 90.00
$ 5.00     HK$ 95.00
$ 6.00     HK$ 100.00
$ 7.00     HK$ 100.00
$ 8.00     HK$ 110.00
$ 9.00     HK$ 130.00
$ 11.00    HK$ 150.00

                            USD         HKD
Mean                        $ 5.50      HK$ 102.00
Sample variance             $² 10.72    HK$² 523.33
Sample standard deviation   $ 3.27      HK$ 22.88
Image Credit: CC BY-NC-ND

Coefficient of Variation
 Relative measure of dispersion: comparing two or more
data sets
 Expressed as a % rather than the units
 Useful for comparing data sets which are expressed in
different units of measurement
 Also useful for data sets with the same unit of measurement
whose means and/or SDs differ greatly
$CV = \left( \frac{s}{\bar{X}} \right) \times 100\%$

Example: Compare prices of pizza
 Continuing from the previous example:
                     New York    Hong Kong
Mean                 US$ 5.5     HK$ 102
Standard Deviation   US$ 3.27    HK$ 22.88
$CV = \left( \frac{s}{\bar{X}} \right) \times 100\%$
CV for New York: 3.27/5.5*100% = 60%
CV for Hong Kong: 22.88/102*100% = 22%
Interpret this result:
Although HK has the larger standard deviation, it has the lower
coefficient of variation
-> the prices of pizza in HK are relatively less volatile.
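
The comparison can be recomputed from the raw pizza prices with a short Python sketch. The function name `coefficient_of_variation` is hypothetical, and the sample standard deviation (dividing by n - 1) is used, as in the slides.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the sample mean."""
    return stdev(values) / mean(values) * 100


usd = [1, 2, 3, 3, 5, 6, 7, 8, 9, 11]                # New York prices in USD
hkd = [75, 80, 90, 90, 95, 100, 100, 110, 130, 150]  # Hong Kong prices in HKD

cv_usd = coefficient_of_variation(usd)
cv_hkd = coefficient_of_variation(hkd)

print(f"New York:  mean = {mean(usd)}, s = {stdev(usd):.2f}, CV = {cv_usd:.0f}%")
print(f"Hong Kong: mean = {mean(hkd)}, s = {stdev(hkd):.2f}, CV = {cv_hkd:.0f}%")
```

Because the CV is unit-free, the two figures can be compared directly even though one data set is in USD and the other in HKD.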
Observations and Implications
 Observations of data dispersion:
– The more spread out, or dispersed, the data are, the larger will be
the range, the inter-quartile range, the variance, and the standard
deviation
– The more concentrated, or homogeneous, the data are, the
smaller will be the range, the inter-quartile range, the variance,
and the standard deviation
– If the observations are all the same, the range, the inter-quartile
range, the variance, and the standard deviation will all be zero
– None of the measures of variation can ever be negative
 Implications:
– Helps to know how a set of data clusters around its mean
– In any data set, at least a certain proportion of the observed values lie
within a given number of standard deviations above or below the mean
(Chebyshev's Rule)

The Empirical Rule
 If the data distribution is approximately bell-shaped, then
the interval:
 μ ± 1σ contains about 68% of the values in the
population or the sample
Example:
Consider the lifetime of a certain brand of battery
µ = 100 hr
σ = 2 hr
 Therefore, about 68% of battery lifetimes lie between 98 and 102 hours
[Figure: bell curve with 68% of the area within μ ± 1σ]

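The Empirical Rule (cont’d)
 μ ± 2σ contains about 95% of the values in the
population or the sample
 μ ± 3σ contains about 99.7% of the values in the
population or the sample
[Figure: bell curves with 95% of the area within μ ± 2σ and 99.7% within μ ± 3σ]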

Example
 Did you know that the average IQ Score is 100?
Example:
Consider IQ Score
µ = 100
σ = 15
Therefore, about 68% of people have an IQ Score between 85 and 115 (μ ± 1σ).
About 95% of people have an IQ Score between 70 and 130 (μ ± 2σ).
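
The battery and IQ intervals can be generated with a small Python sketch. The function name `empirical_interval` and the lookup table are illustrative, and the coverage figures are the approximate ones quoted in the slides, valid only for roughly bell-shaped data.

```python
# Approximate coverage for bell-shaped data, as quoted in the slides.
EMPIRICAL_COVERAGE = {1: 0.68, 2: 0.95, 3: 0.997}


def empirical_interval(mu, sigma, k):
    """Return (lower, upper, approximate coverage) for mu ± k·sigma, k in {1, 2, 3}."""
    return mu - k * sigma, mu + k * sigma, EMPIRICAL_COVERAGE[k]


print(empirical_interval(100, 2, 1))   # battery lifetime, within 1 standard deviation
print(empirical_interval(100, 15, 1))  # IQ score, within 1 standard deviation
print(empirical_interval(100, 15, 2))  # IQ score, within 2 standard deviations
```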

Chebyshev’s Rule
 Regardless of how the data are distributed, at least
$(1 - 1/k^2) \times 100\%$ of the values will fall within k standard
deviations of the mean (for k > 1)
 Example:
At least the following share of the values lies within each interval:
$(1 - 1/1^2) \times 100\% = 0\%$ for k = 1 (μ ± 1σ)
$(1 - 1/2^2) \times 100\% = 75\%$ for k = 2 (μ ± 2σ)
$(1 - 1/3^2) \times 100\% = 89\%$ for k = 3 (μ ± 3σ)

Example for Chebyshev’s Rule
 Example: Consider the lifetime of a certain brand of battery
with µ = 100 hr and σ = 2 hr
 Using Chebyshev’s theorem, between what values would
you expect at least 80% of battery lifetimes to lie?
 Answer:
$1 - \frac{1}{k^2} = 0.8$
$k^2 = 5$, so $k = \sqrt{5} = 2.2361$
between μ ± kσ
between 100 ± 2.2361(2)
i.e. between 95.5278 hr and 104.4722 hr
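
Both directions of Chebyshev's Rule (from k to a minimum proportion, and from a target proportion back to k) can be sketched in Python. The function names are hypothetical. The code keeps k unrounded, so the interval ends can differ from the slide's figures in the fourth decimal place, because the slide rounds k to 2.2361 first.

```python
import math


def chebyshev_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("the bound is only informative for k >= 1")
    return 1 - 1 / k ** 2


def chebyshev_k(proportion):
    """Solve 1 - 1/k^2 = proportion for k, with 0 <= proportion < 1."""
    if not 0 <= proportion < 1:
        raise ValueError("proportion must be in [0, 1)")
    return 1 / math.sqrt(1 - proportion)


for k in (1, 2, 3):
    print(f"k = {k}: at least {chebyshev_bound(k):.0%}")

mu, sigma = 100, 2              # battery lifetime example
k = chebyshev_k(0.8)
lower, upper = mu - k * sigma, mu + k * sigma
print(f"k = {k:.4f}: between {lower:.3f} hr and {upper:.3f} hr")
```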
