Chapter 3
Part 1: Measures of Location:
a) Mean (sample, weighted and geometric mean)
b) Mode
c) Median
d) Percentiles
e) Quartiles
Part 1
Measures of Location
1. Mean
   a) Sample mean
   b) Weighted mean
   c) Geometric mean
2. Mode
3. Median
4. Percentiles
5. Quartiles
If the measures are computed for data from a sample, they are called sample statistics.
If the measures are computed for data from a population, they are called population parameters.
A sample statistic is referred to as the point estimator of the corresponding population parameter.
1. Mean (Sample, Weighted and Geometric Mean)
also called “Average”
a) The sample mean ($\bar{x}$), also called “Arithmetic mean”:
The mean of a data set is the average of all the data values:
$$\bar{x} = \frac{\sum x_i}{n}$$
where $\sum x_i$ is the sum of all observations and $n$ is the total number of observations.
The sample mean $\bar{x}$ is a point estimate of the population mean $\mu$.
Example A: Apartment Rents
Seventy apartments were randomly
sampled in a small university town.
The monthly rent prices (in €) for
these apartments are listed below:
445 615 430 590 435 600 460 600 440 615
440 440 440 525 425 445 575 445 450 450
465 450 525 450 450 460 435 460 465 480
450 470 490 472 475 475 500 480 570 465
600 485 580 470 490 500 549 500 500 480
570 515 450 445 525 535 475 550 480 510
510 575 490 435 600 435 445 435 430 440
The sample mean $\bar{x}$:
$$\bar{x} = \frac{\sum x_i}{n} = \frac{445 + 615 + \dots + 440}{70} = \frac{34356}{70} = 490.80$$
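The calculation above can be checked with a short Python sketch. The variable names (`rents`, `sample_mean`) are illustrative and not part of the slides; the values are the 70 rents listed in Example A.

```python
# Monthly rents (in EUR) from Example A, in the order given on the slide.
rents = [
    445, 615, 430, 590, 435, 600, 460, 600, 440, 615,
    440, 440, 440, 525, 425, 445, 575, 445, 450, 450,
    465, 450, 525, 450, 450, 460, 435, 460, 465, 480,
    450, 470, 490, 472, 475, 475, 500, 480, 570, 465,
    600, 485, 580, 470, 490, 500, 549, 500, 500, 480,
    570, 515, 450, 445, 525, 535, 475, 550, 480, 510,
    510, 575, 490, 435, 600, 435, 445, 435, 430, 440,
]

n = len(rents)           # number of observations
total = sum(rents)       # sum of all observations
sample_mean = total / n  # x-bar = sum(x_i) / n

print(n, total, round(sample_mean, 2))  # 70 34356 490.8
```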
b) The weighted mean ($\bar{x}_w$)
$$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}$$
where $w_i$ is the weight of observation $x_i$.
Using the number of funds as weights, compute the weighted average total return for the mutual funds covered by Morningstar.
[Table: Number of Funds ($w_i$), Total Return in % ($x_i$) and the product $w_i \times x_i$ for each fund category; the individual rows are not reproduced here.]
$$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i} = \frac{(9191 \times 4.65) + \dots + (2900 \times 6.75)}{9191 + 2621 + 1419 + 2900} = \frac{126004.1}{16131} = 7.81$$
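A minimal sketch of the weighted mean formula in Python follows. The helper `weighted_mean` and the two-value illustration are hypothetical (the full Morningstar table is not available here); only the totals 126004.1 and 16131 come from the slide.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical illustration (not from the slides): value 10 with weight 3, value 20 with weight 1.
illustration = weighted_mean([10.0, 20.0], [3, 1])

# Final step of the Morningstar example, using the totals given on the slide.
sum_of_weights = 9191 + 2621 + 1419 + 2900
morningstar_mean = 126004.1 / sum_of_weights

print(illustration, sum_of_weights, round(morningstar_mean, 2))  # 12.5 16131 7.81
```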
c) The geometric mean ($\bar{x}_g$)
$$\bar{x}_g = \sqrt[n]{x_1 \times x_2 \times \dots \times x_n}$$
Example C: Share Price
Consider the following five data values, which represent the share price of a company at the beginning of five successive years, relative to the price at the start of the previous year.
1.11 1.35 0.80 1.40 1.05
$$\bar{x}_g = \sqrt[n]{x_1 \times x_2 \times \dots \times x_n} = \sqrt[5]{1.11 \times 1.35 \times 0.80 \times 1.40 \times 1.05} = 1.120$$
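As a quick check of Example C, the geometric mean can be computed as the nth root of the product. This sketch assumes Python 3.8 or later (for `math.prod`); the variable names are illustrative.

```python
import math

# Yearly share-price ratios from Example C.
growth_factors = [1.11, 1.35, 0.80, 1.40, 1.05]

product = math.prod(growth_factors)                       # x_1 * x_2 * ... * x_n
geometric_mean = product ** (1 / len(growth_factors))     # nth root of the product

print(round(geometric_mean, 3))  # 1.12
```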
2. Mode
• The mode of a data set is the value that occurs with greatest
frequency.
• If the data have exactly two modes, the data are bimodal.
• If the data have more than two modes, the data are multimodal.
Example:
In Example A (Apartment Rents), the value 450 is repeated 7 times, more often than any other value; therefore the mode of the data set is 450.
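The mode of the apartment rents can be found by counting how often each value occurs. This is a sketch using `collections.Counter`; the names `counts` and `modes` are illustrative.

```python
from collections import Counter

# Monthly rents (in EUR) from Example A.
rents = [
    445, 615, 430, 590, 435, 600, 460, 600, 440, 615,
    440, 440, 440, 525, 425, 445, 575, 445, 450, 450,
    465, 450, 525, 450, 450, 460, 435, 460, 465, 480,
    450, 470, 490, 472, 475, 475, 500, 480, 570, 465,
    600, 485, 580, 470, 490, 500, 549, 500, 500, 480,
    570, 515, 450, 445, 525, 535, 475, 550, 480, 510,
    510, 575, 490, 435, 600, 435, 445, 435, 430, 440,
]

counts = Counter(rents)                # frequency of each rent value
max_count = max(counts.values())       # greatest frequency
# All values sharing the greatest frequency (one value: unimodal, two: bimodal, ...).
modes = sorted(v for v, c in counts.items() if c == max_count)

print(modes, max_count)  # [450] 7
```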
3. Median also called “Second Quartile (𝑄2 )”
• The median is a measure of central location provided by the value in the
middle when the data are arranged in ascending order.
• Whenever a data set has extreme values, the median is the preferred
measure of central location.
To get the median, first arrange the data in ascending order (smallest value to
largest value), then:
a) For an odd number of observations, the median is the middle value.
b) For an even number of observations, the median is the average of the two
middle values.
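Example (odd number of observations): with n = 5, the median is the value in position 3 of the sorted data.
Example (even number of observations): Back to Example A (Apartment Rents), n = 70.
Sorted Data:
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
The median is the average of the values in positions 35 and 36:
$$\text{Median} = \frac{475 + 475}{2} = 475$$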
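The odd/even rule for the median can be sketched as a small function. The helper name `median` is illustrative; it is applied here to the apartment rents (even n) and to five class sizes (odd n).

```python
def median(values):
    """Middle value for odd n; average of the two middle values for even n."""
    data = sorted(values)            # arrange in ascending order first
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]             # odd n: the middle value
    return (data[mid - 1] + data[mid]) / 2   # even n: average of the two middle values

rents = [
    445, 615, 430, 590, 435, 600, 460, 600, 440, 615,
    440, 440, 440, 525, 425, 445, 575, 445, 450, 450,
    465, 450, 525, 450, 450, 460, 435, 460, 465, 480,
    450, 470, 490, 472, 475, 475, 500, 480, 570, 465,
    600, 485, 580, 470, 490, 500, 549, 500, 500, 480,
    570, 515, 450, 445, 525, 535, 475, 550, 480, 510,
    510, 575, 490, 435, 600, 435, 445, 435, 430, 440,
]

rents_median = median(rents)                   # n = 70, positions 35 and 36
odd_median = median([46, 54, 42, 46, 32])      # n = 5, position 3

print(rents_median, odd_median)  # 475.0 46
```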
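4. Percentiles
Steps to calculate the pth percentile:
1. Arrange the data in ascending order (smallest value to largest value).
2. Compute the index $i = \frac{p}{100} n$, where $n$ is the number of observations.
3. If $i$ is not an integer, round it up; the pth percentile is the value in that position.
   If $i$ is an integer, the pth percentile is the average of the values in positions $i$ and $i + 1$.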
Example: We need to find the 20th percentile of the apartment rents in Example A, therefore $p = 20$ and $n = 70$.
$$i = \frac{p}{100} n = \frac{20}{100} \times 70 = 14$$
Since $i$ is an integer, use positions 14 and 15 of the sorted data:
$$\text{20th percentile} = \frac{445 + 445}{2} = 445$$
Example: We need to find the 35th percentile of the apartment rents in Example A, therefore $p = 35$ and $n = 70$.
$$i = \frac{p}{100} n = \frac{35}{100} \times 70 = 24.5$$
Since $i$ is not an integer, round it up to 25:
$$\text{35th percentile} = \text{value in position 25 of the sorted data} = 450$$
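The percentile rule used in these examples (index $i = \frac{p}{100} n$, round up if not an integer, otherwise average positions $i$ and $i + 1$) can be sketched as follows. The function name `percentile` is illustrative, and the sketch assumes an integer p with 0 < p < 100 so that integer arithmetic avoids floating-point rounding.

```python
def percentile(values, p):
    """pth percentile by the slide's rule; p is an integer with 0 < p < 100."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    data = sorted(values)
    n = len(data)
    if (p * n) % 100 == 0:
        i = p * n // 100                     # i is an integer
        return (data[i - 1] + data[i]) / 2   # average of positions i and i + 1
    i = p * n // 100 + 1                     # i is not an integer: round up
    return data[i - 1]                       # value in position i (1-based)

rents = [
    445, 615, 430, 590, 435, 600, 460, 600, 440, 615,
    440, 440, 440, 525, 425, 445, 575, 445, 450, 450,
    465, 450, 525, 450, 450, 460, 435, 460, 465, 480,
    450, 470, 490, 472, 475, 475, 500, 480, 570, 465,
    600, 485, 580, 470, 490, 500, 549, 500, 500, 480,
    570, 515, 450, 445, 525, 535, 475, 550, 480, 510,
    510, 575, 490, 435, 600, 435, 445, 435, 430, 440,
]

p20 = percentile(rents, 20)   # i = 14, average of positions 14 and 15
p35 = percentile(rents, 35)   # i = 24.5, rounded up to position 25

print(p20, p35)  # 445.0 450
```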
5. Quartiles
• Quartiles are specific percentiles:
First Quartile ($Q_1$) = 25th Percentile (25% of the data are less than or equal to $Q_1$)
Second Quartile ($Q_2$) = 50th Percentile (50% of the data are less than or equal to $Q_2$)
Third Quartile ($Q_3$) = 75th Percentile (75% of the data are less than or equal to $Q_3$)
• Quartile rule: the positions of the quartiles in the sorted data are
$$Q_1 = \left(\frac{n+1}{4}\right)^{th}, \quad Q_2 = \left(\frac{2(n+1)}{4}\right)^{th}, \quad Q_3 = \left(\frac{3(n+1)}{4}\right)^{th}$$
Note: These formulas give the positions of the quartiles in the data set. If the position is an integer, we take its value directly from the data; if it is not an integer, we take the average of the two values between which the position falls.
[Figure: Quartiles on Number Line]
Example: Back to Example A: Apartment Rents
Sorted Data:
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
a) First Quartile ($Q_1$):
• 1st method: Using the Quartile rule with n = 70:
$$Q_1 = \left(\frac{n+1}{4}\right)^{th} = \left(\frac{70+1}{4}\right)^{th} = 17.75^{th} = \frac{17^{th} + 18^{th}}{2} = \frac{445 + 445}{2} = 445$$
Therefore $Q_1 = 445$.
• 2nd method: Using the Percentile rule: $Q_1$ = 25th Percentile, n = 70 and p = 25. Then:
$$i = \frac{p}{100} n = \frac{25}{100} \times 70 = 17.5 \cong 18, \text{ therefore } Q_1 \cong 445$$
b) Second Quartile ($Q_2$):
• 1st method: Using the Quartile rule with n = 70:
$$Q_2 = \left(\frac{2(n+1)}{4}\right)^{th} = \left(\frac{2(70+1)}{4}\right)^{th} = (2 \times 17.75)^{th} = 35.5^{th} = \frac{35^{th} + 36^{th}}{2} = \frac{475 + 475}{2} = 475$$
Therefore $Q_2 = 475$.
• 2nd method: Using the Percentile rule: $Q_2$ = 50th Percentile, n = 70 and p = 50. Then:
$$i = \frac{p}{100} n = \frac{50}{100} \times 70 = 35$$
Since $i$ is an integer, use positions 35 and 36 of the sorted data:
$$Q_2 = \frac{475 + 475}{2} = 475$$
c) Third Quartile ($Q_3$):
• 1st method: Using the Quartile rule with n = 70:
$$Q_3 = \left(\frac{3(n+1)}{4}\right)^{th} = \left(\frac{3(70+1)}{4}\right)^{th} = (3 \times 17.75)^{th} = 53.25^{th} = \frac{53^{th} + 54^{th}}{2} = \frac{525 + 525}{2} = 525$$
Therefore $Q_3 = 525$.
• 2nd method: Using the Percentile rule: $Q_3$ = 75th Percentile, n = 70 and p = 75. Then:
$$i = \frac{p}{100} n = \frac{75}{100} \times 70 = 52.5 \cong 53, \text{ therefore } Q_3 \cong 525$$
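The quartile rule (position $k(n+1)/4$, averaging the two neighbouring values when the position is not an integer) can be sketched as below. The function name `quartile` is illustrative, and the sketch assumes the computed position lies strictly inside the data, as it does for n = 70.

```python
def quartile(values, k):
    """k-th quartile (k = 1, 2, 3) at position k * (n + 1) / 4 of the sorted data."""
    data = sorted(values)
    n = len(data)
    numerator = k * (n + 1)
    lower = numerator // 4                      # integer part of the position (1-based)
    if numerator % 4 == 0:
        return data[lower - 1]                  # integer position: take the value directly
    # Non-integer position: average the two values between which the position falls.
    return (data[lower - 1] + data[lower]) / 2

rents = [
    445, 615, 430, 590, 435, 600, 460, 600, 440, 615,
    440, 440, 440, 525, 425, 445, 575, 445, 450, 450,
    465, 450, 525, 450, 450, 460, 435, 460, 465, 480,
    450, 470, 490, 472, 475, 475, 500, 480, 570, 465,
    600, 485, 580, 470, 490, 500, 549, 500, 500, 480,
    570, 515, 450, 445, 525, 535, 475, 550, 480, 510,
    510, 575, 490, 435, 600, 435, 445, 435, 430, 440,
]

q1, q2, q3 = (quartile(rents, k) for k in (1, 2, 3))
print(q1, q2, q3)  # 445.0 475.0 525.0
```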
Part 2
Measures of Variability
It is often desirable to consider measures of variability (dispersion), as well as
measures of location.
For example, in choosing supplier A or supplier B we might consider not only the
average delivery time for each, but also the variability in delivery time for each.
1. Range
2. Interquartile Range
3. Variance
4. Standard Deviation
5. Coefficient of Variation
1. Range
• The range of a data set is the difference between the largest and smallest data values:
Range = Largest value - Smallest value
Note: The range is very sensitive to the smallest and largest data values.
Example:
• Find the range of the data in Example A (Apartment Rents).
The smallest value of the sorted data is 425 and the largest is 615, so Range = 615 - 425 = 190.
2. Interquartile Range
• The interquartile range (IQR) is the range for the middle 50% of the data:
IQR = $Q_3 - Q_1$
Example:
For the apartment rents in Example A, IQR = 525 - 445 = 80.
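The range and interquartile range of the apartment rents reduce to two subtractions. In this sketch the smallest and largest values and the quartiles $Q_1 = 445$ and $Q_3 = 525$ are taken from the sorted data and the quartile example; the variable names are illustrative.

```python
# Values read from the sorted apartment rents of Example A.
smallest, largest = 425, 615
q1, q3 = 445, 525   # first and third quartiles found in the quartile example

data_range = largest - smallest   # Range = largest value - smallest value
iqr = q3 - q1                     # IQR = Q3 - Q1, the spread of the middle 50%

print(data_range, iqr)  # 190 80
```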
3. Variance
$$\text{Sample Variance} = S^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
Note: To compute the variance, you first need to obtain the mean.
Example D: Class size.
Consider the following class size data for a sample of five college classes:
46 54 42 46 32
The sample mean $\bar{x}$:
$$\bar{x} = \frac{\sum x_i}{n} = \frac{46 + 54 + 42 + 46 + 32}{5} = 44$$
With $\bar{x} = 44$:
Number of Students   Sample Mean   Deviation about the Mean   Squared Deviation about the Mean
in Class ($x_i$)     ($\bar{x}$)   ($x_i - \bar{x}$)          ($(x_i - \bar{x})^2$)
46                   44              2                          4
54                   44             10                        100
42                   44             -2                          4
46                   44              2                          4
32                   44            -12                        144
Total                                0                        256
$$\text{Sample Variance} = S^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{256}{5 - 1} = 64$$
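The table above can be reproduced step by step in Python. This is a sketch with illustrative variable names; it follows the same columns (deviation, squared deviation) as the worked example.

```python
# Class sizes from Example D.
class_sizes = [46, 54, 42, 46, 32]

n = len(class_sizes)
mean = sum(class_sizes) / n                         # sample mean, x-bar
deviations = [x - mean for x in class_sizes]        # x_i - x-bar
squared_deviations = [d ** 2 for d in deviations]   # (x_i - x-bar)^2

# Sample variance divides by n - 1, not n.
sample_variance = sum(squared_deviations) / (n - 1)

print(mean, sum(deviations), sum(squared_deviations), sample_variance)
# 44.0 0.0 256.0 64.0
```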
4. Standard Deviation
• The standard deviation of a data set is the positive square root of the variance.
• The standard deviation is computed as follows:
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample standard deviation: $S = \sqrt{S^2}$
Example: Using the data in Example D (Class Size), with $\bar{x} = 44$ and sample variance $S^2 = 64$:
$$S = \sqrt{S^2} = \sqrt{64} = 8$$
5. Coefficient of Variation (C.V)
• The coefficient of variation indicates how large the
standard deviation is in relation to the mean.
• The coefficient of variation is computed as follows:
$$\text{Population coefficient of variation} = \left(\frac{\sigma}{\mu} \times 100\right)\%$$
$$\text{Sample coefficient of variation} = \left(\frac{S}{\bar{x}} \times 100\right)\%$$
Note: To compute the coefficient of variation, you first need to obtain the mean and the standard deviation.
Example:
Using the data in Example D (Class Size), compute the sample coefficient of variation for the class size data.
Notes: $\bar{x} = 44$ and sample standard deviation $S = 8$.
Therefore,
$$\text{Sample coefficient of variation} = \left(\frac{S}{\bar{x}} \times 100\right)\% = \left(\frac{8}{44} \times 100\right)\% = 18.2\%$$
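The standard deviation and coefficient of variation for the class size data can be checked with Python's standard `statistics` module, whose `stdev` function uses the n - 1 denominator. The variable names are illustrative.

```python
import statistics

# Class sizes from Example D.
class_sizes = [46, 54, 42, 46, 32]

mean = statistics.mean(class_sizes)    # sample mean
s = statistics.stdev(class_sizes)      # sample standard deviation (n - 1 denominator)
cv = s / mean * 100                    # coefficient of variation, in percent

print(mean, s, round(cv, 1))  # 44 8.0 18.2
```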
Extra Exercise:
Try to find the Variance, Standard Deviation as well as the Coefficient of Variation of the Example A data set (Apartment Rents).
• Variance
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = 2996.16$$
• Standard Deviation
$$s = \sqrt{s^2} = \sqrt{2996.16} = 54.74$$
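One way to check the exercise answers is the `statistics` module again. This sketch also computes the coefficient of variation, which the exercise asks for; the variable names are illustrative.

```python
import statistics

# Monthly rents (in EUR) from Example A.
rents = [
    445, 615, 430, 590, 435, 600, 460, 600, 440, 615,
    440, 440, 440, 525, 425, 445, 575, 445, 450, 450,
    465, 450, 525, 450, 450, 460, 435, 460, 465, 480,
    450, 470, 490, 472, 475, 475, 500, 480, 570, 465,
    600, 485, 580, 470, 490, 500, 549, 500, 500, 480,
    570, 515, 450, 445, 525, 535, 475, 550, 480, 510,
    510, 575, 490, 435, 600, 435, 445, 435, 430, 440,
]

mean = statistics.mean(rents)          # 490.8
variance = statistics.variance(rents)  # sample variance (n - 1 denominator)
std_dev = statistics.stdev(rents)      # sample standard deviation
cv = std_dev / mean * 100              # coefficient of variation, in percent

print(round(variance, 2), round(std_dev, 2), round(cv, 2))  # 2996.16 54.74 11.15
```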
Part 3
1. Empirical Rule
2. Boxplot and Outliers
3. Skewness
4. Measures of association between two variables
1. Empirical Rule
For data having a bell-shaped distribution:
[Figure: bell-shaped curve centered at $\mu$, with 68.26% of the values within $\mu \pm 1\sigma$, 95.44% within $\mu \pm 2\sigma$ and 99.72% within $\mu \pm 3\sigma$]
• 68.26% of the values of a normal random variable are within +/- 1 standard deviation of its mean: $(\bar{x} - S \,;\, \bar{x} + S)$
• 95.44% of the values of a normal random variable are within +/- 2 standard deviations of its mean: $(\bar{x} - 2S \,;\, \bar{x} + 2S)$
• 99.72% of the values of a normal random variable are within +/- 3 standard deviations of its mean: $(\bar{x} - 3S \,;\, \bar{x} + 3S)$
Example:
Using the data in Example D (Class Size: number of students in class). Data: 46 54 42 46 32
Answer:
Sorted data: 32 42 46 46 54, with $\bar{x} = 44$ and $S = 8$.
b) What is the percentage for the interval (28 ; 60)? Note: use the empirical rule to get the answer.
$$(28 \,;\, 60) = (44 - 16 \,;\, 44 + 16) = (44 - 2 \times 8 \,;\, 44 + 2 \times 8) = (\bar{x} - 2S \,;\, \bar{x} + 2S)$$
Therefore the percentage for the interval (28 ; 60) is 95.44%.
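The three empirical-rule intervals for the class size data follow directly from $\bar{x} = 44$ and $S = 8$. The sketch below builds them in a dictionary; the name `intervals` is illustrative.

```python
# Sample mean and sample standard deviation of the class size data (Example D).
mean, s = 44, 8

# Empirical-rule percentages for 1, 2 and 3 standard deviations around the mean.
percentages = {1: 68.26, 2: 95.44, 3: 99.72}

# Interval (x-bar - k*S ; x-bar + k*S) for each k.
intervals = {k: (mean - k * s, mean + k * s) for k in percentages}

for k, (low, high) in intervals.items():
    print(f"{percentages[k]}% of values within ({low} ; {high})")
# 68.26% of values within (36 ; 52)
# 95.44% of values within (28 ; 60)
# 99.72% of values within (20 ; 68)
```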
2. Boxplot & Outliers
• A box plot is a graphical summary of data that is used to study
the shape of the distribution of data as well as to detect
potential outliers.
[Figure: boxplot of a variable, showing the Start point, $Q_1$, $Q_2$, $Q_3$ and the End point, with potential outliers marked beyond both ends]
Drawing the boxplot requires the following steps:
1. Find $Q_1$ (First Quartile), $Q_2$ (Median) and $Q_3$ (Third Quartile).
2. Find the Lower Limit = $Q_1 - 1.5 \times IQR$ and then determine if you have outliers from the bottom (any value in the data less than the lower limit is considered an outlier).
3. Find the Upper Limit = $Q_3 + 1.5 \times IQR$ and then determine if you have outliers from the top (any value in the data more than the upper limit is considered an outlier).
4. The Start point and End point of the boxplot are the lowest and highest values in your dataset that lie between the lower and upper limits.
Note: You may also draw the boxplot approximately (without using the lower and upper limits, assuming no outliers in the dataset) using the five-number summary, but this will not detect the outliers in your data set.
[Figure: boxplot drawn from the five-number summary: Minimum, $Q_1$, $Q_2$, $Q_3$, Maximum]
Skewed data show an uneven boxplot, where the median cuts the box into two unequal boxes:
• If the longer part of the box is to the right of (or above) the median, the data is said to be skewed right.
• If the longer part is to the left of (or below) the median, the data is skewed left.
Example: Draw the boxplot of the data in Example A (Apartment Rents). Are there any potential outliers?
From the sorted data: $Q_1 = 445$, $Q_2 = 475$, $Q_3 = 525$ and $IQR = Q_3 - Q_1 = 80$.
1. Lower Limit = $Q_1 - 1.5 \times IQR = 445 - 1.5 \times 80 = 325$ (no outliers from the bottom side, since the minimum 425 is above 325)
2. Upper Limit = $Q_3 + 1.5 \times IQR = 525 + 1.5 \times 80 = 645$ (no outliers from the top side, since the maximum 615 is below 645)
3. The Start point is the Minimum (425), and the End point is the Maximum (615).
[Figure: boxplot of apartment rents with Minimum = 425, $Q_1$ = 445, $Q_2$ = 475, $Q_3$ = 525, Maximum = 615]
The data has no potential outliers, and the distribution is right skewed (positively skewed) since the right box is bigger than the left box.
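The boxplot steps for the apartment rents can be sketched in Python. The quartiles are taken from the worked example ($Q_1 = 445$, $Q_3 = 525$); names such as `lower_limit` and `start_point` are illustrative.

```python
# Monthly rents (in EUR) from Example A.
rents = [
    445, 615, 430, 590, 435, 600, 460, 600, 440, 615,
    440, 440, 440, 525, 425, 445, 575, 445, 450, 450,
    465, 450, 525, 450, 450, 460, 435, 460, 465, 480,
    450, 470, 490, 472, 475, 475, 500, 480, 570, 465,
    600, 485, 580, 470, 490, 500, 549, 500, 500, 480,
    570, 515, 450, 445, 525, 535, 475, 550, 480, 510,
    510, 575, 490, 435, 600, 435, 445, 435, 430, 440,
]

q1, q3 = 445, 525                  # quartiles from the worked example
iqr = q3 - q1                      # interquartile range
lower_limit = q1 - 1.5 * iqr       # values below this are potential outliers
upper_limit = q3 + 1.5 * iqr       # values above this are potential outliers

outliers = [x for x in rents if x < lower_limit or x > upper_limit]
inside = [x for x in rents if lower_limit <= x <= upper_limit]
start_point, end_point = min(inside), max(inside)   # ends of the boxplot

print(lower_limit, upper_limit, outliers, start_point, end_point)
# 325.0 645.0 [] 425 615
```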
3. Skewness
• The skewness of the dataset shows the shape of the distribution: whether it is symmetric or skewed.
• Symmetric data means that the data is mainly centered around the mean (average). Right skew means that the bulk of the data belongs to the bottom range, and left skew means that the majority of the data falls in the top range.
• Skewness can be studied using the Histogram (covered in part 1), the Boxplot, and the comparison between the mean and the median:
If $\bar{x} = Q_2$ then the data is symmetric
If $\bar{x} < Q_2$ then the data is Left Skew (Negatively Skewed)
If $\bar{x} > Q_2$ then the data is Right Skew (Positively Skewed)
Example: What can you say about the skewness of the data in Example A (Apartment Rents)?
Note: We have $Q_2 = 475$, $\bar{x} = 490.80$ and the boxplot as follows:
[Figure: boxplot of apartment rents with Minimum = 425, $Q_1$ = 445, $Q_2$ = 475, $Q_3$ = 525, Maximum = 615]
Answer: Since $\bar{x} = 490.80 > Q_2 = 475$, the data is right skewed (positively skewed). This agrees with the boxplot, where the right box is bigger than the left box.
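The mean-versus-median comparison can be written as a small helper. The function name `skew_direction` is hypothetical; it only encodes the three rules stated in the skewness section.

```python
def skew_direction(mean, median):
    """Classify the shape of a distribution by comparing its mean and median."""
    if mean > median:
        return "right skew (positively skewed)"
    if mean < median:
        return "left skew (negatively skewed)"
    return "symmetric"

# Apartment rents (Example A): sample mean 490.80, median Q2 = 475.
direction = skew_direction(490.80, 475)
print(direction)  # right skew (positively skewed)
```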
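4. Measures of association between two variables
a) Covariance:
$$\text{Sample Covariance} = S_{XY} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$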
Example E: Sales
The store’s manager wants to determine the relationship between the number of weekend television commercials shown and the sales at the store during the following week.
Sample data with sales expressed in hundreds of dollars are provided in the following table:
Number of Commercials (x)   Sales Volume in $100s (y)
2                           50
5                           57
1                           41
3                           54
4                           54
1                           38
5                           63
3                           48
4                           59
2                           46
Compute the sample covariance.
With $n = 10$, $\bar{x} = \frac{30}{10} = 3$ and $\bar{y} = \frac{510}{10} = 51$:
x      y      $x_i - \bar{x}$   $y_i - \bar{y}$   $(x_i - \bar{x})(y_i - \bar{y})$
2      50     -1                -1                 1
5      57      2                 6                12
1      41     -2               -10                20
3      54      0                 3                 0
4      54      1                 3                 3
1      38     -2               -13                26
5      63      2                12                24
3      48      0                -3                 0
4      59      1                 8                 8
2      46     -1                -5                 5
Total  30     510                0                 0                99
(The Total row lists the sums of x, y, the two deviation columns and the product column.)
$$\text{Sample Covariance} = S_{XY} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{99}{10 - 1} = 11 > 0$$
Therefore we have a positive relationship.
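The covariance table can be reproduced with a few lines of Python. This is a sketch with illustrative variable names (`commercials`, `sales`); it computes the same product column and divides by n - 1.

```python
# Example E data: number of commercials (x) and sales volume in $100s (y).
commercials = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]

n = len(commercials)
x_mean = sum(commercials) / n   # 3.0
y_mean = sum(sales) / n         # 51.0

# Products of deviations: (x_i - x-bar) * (y_i - y-bar).
products = [(x - x_mean) * (y - y_mean) for x, y in zip(commercials, sales)]

sample_covariance = sum(products) / (n - 1)

print(sum(products), sample_covariance)  # 99.0 11.0
```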
b) Correlation Coefficient:
• Correlation is a measure of linear association and not necessarily causation.
• Just because two variables are highly correlated, it does not mean that one variable is the cause of the other.
$$\text{Sample Correlation Coefficient} = r_{XY} = \frac{S_{XY}}{S_X S_Y}$$
where $S_{XY}$ is the sample covariance, $S_X$ is the sample standard deviation of x, and $S_Y$ is the sample standard deviation of y.
The correlation coefficient can take only values between -1 and +1.
[Figure: scale of the correlation coefficient from -1 to +1, with regions labelled from negative to positive, including “Moderate Negative” and “Moderate Positive”]
Example:
Using the data in Example E (Sales), compute the sample correlation coefficient.
Notes: $S_{XY} = 11$, $\bar{x} = 3$ and $\bar{y} = 51$, with $S = \sqrt{S^2}$ and $S^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$.
x      y      $x_i - \bar{x}$   $y_i - \bar{y}$   $(x_i - \bar{x})^2$   $(y_i - \bar{y})^2$
2      50     -1                -1                 1                     1
5      57      2                 6                 4                    36
1      41     -2               -10                 4                   100
3      54      0                 3                 0                     9
4      54      1                 3                 1                     9
1      38     -2               -13                 4                   169
5      63      2                12                 4                   144
3      48      0                -3                 0                     9
4      59      1                 8                 1                    64
2      46     -1                -5                 1                    25
Total  30     510                0                 0                    20                  566
(The Total row lists the sums of x, y, the two deviation columns and the two squared-deviation columns.)
$$S_X = \sqrt{S_X^2} = \sqrt{\frac{20}{10 - 1}} = 1.49 \quad \text{and} \quad S_Y = \sqrt{S_Y^2} = \sqrt{\frac{566}{10 - 1}} = 7.93$$
$$\text{Sample correlation coefficient} = r_{XY} = \frac{S_{XY}}{S_X S_Y} = \frac{11}{1.49 \times 7.93} = 0.93 > 0.7$$
Therefore, we have a strong positive linear relationship between “Number of Commercials” and “Sales Volume”.
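The correlation coefficient for Example E can be checked from the raw data. This sketch computes the covariance and both sample standard deviations by hand, with illustrative variable names, and divides as in the formula for $r_{XY}$.

```python
import math

# Example E data: number of commercials (x) and sales volume in $100s (y).
commercials = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]

n = len(commercials)
x_mean = sum(commercials) / n
y_mean = sum(sales) / n

# Sample covariance and sample standard deviations (all with n - 1 denominators).
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(commercials, sales)) / (n - 1)
s_x = math.sqrt(sum((x - x_mean) ** 2 for x in commercials) / (n - 1))
s_y = math.sqrt(sum((y - y_mean) ** 2 for y in sales) / (n - 1))

r_xy = s_xy / (s_x * s_y)   # sample correlation coefficient

print(round(s_x, 2), round(s_y, 2), round(r_xy, 2))  # 1.49 7.93 0.93
```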
Extra Exercise:
Given the following dataset:
2 4 6 8 9 10 12 14 16
Answer the following questions:
1. Find the Mode and Median of this data set
2. Calculate the sample mean and deduce the shape of the distribution
3. Find the 23rd percentile of the data
4. Compute the Inter Quartile Range (IQR)
5. Find the sample variance, sample standard deviation and deduce the coefficient of
Variation.
6. Use the empirical rule to compute the 68.26% interval of this dataset.
7. Which of the following graphs (A or B) is the boxplot of our data set?
[Figure: Graph A and Graph B, two candidate boxplots, each with axis marks at 2, 5, 9, 13 and 16]
End of Session