Numerical Descriptive Measures: Dr. Tran Anh Vu, SEEE, HUST 1
\[ \text{Mean} = \frac{\text{Sum of measurements}}{\text{Number of measurements}} \]
1, 3, 5, 2, 4, 3
Solution:
\[ \bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{1 + 3 + 5 + 2 + 4 + 3}{6} = 3.0 \]
Solution: 20 families
Average number of children in a family is
\[ \bar{x} = \frac{\sum_{i=1}^{20} x_i}{20} = \frac{x_1 + x_2 + \dots + x_{20}}{20} = \frac{3(0) + 4(1) + 7(2) + 2(3) + 4(4)}{20} = 2.0 \]
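The two solutions above can be checked with a short Python sketch. The variable names are illustrative, and the frequency table (3 families with 0 children, 4 with 1, and so on) is read off the numerator of the second solution.

```python
from statistics import mean

# Six measurements from the first worked example.
measurements = [1, 3, 5, 2, 4, 3]
simple_mean = mean(measurements)  # sum of measurements / number of measurements

# Families example: number of children -> number of families with that many.
children_counts = {0: 3, 1: 4, 2: 7, 3: 2, 4: 4}
total_families = sum(children_counts.values())
total_children = sum(k * f for k, f in children_counts.items())
grouped_mean = total_children / total_families

print(simple_mean, grouped_mean)
```

Weighting each value by its frequency gives the same result as listing all 20 observations one by one.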
Example 4
Seven employee salaries were recorded (in ‘000s):
42, 45, 40, 46, 44, 40, 43.
(a) Find the median salary.
(b) Suppose the director’s salary of $200 000 was
added to the group recorded before. Find the
median salary.
(c) Compare the mean and the median values for the
data in parts a and b.
Sorted data, part (a): 40, 40, 42, 43, 44, 45, 46
Sorted data, part (b): 40, 40, 42, 43, 44, 45, 46, 200
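As a rough check of Example 4, the sketch below computes the mean and the median with and without the director's salary, using Python's standard `statistics` module. Salaries are in thousands, as in the example; the variable names are illustrative.

```python
from statistics import mean, median

# Seven employee salaries (in thousands), from Example 4.
salaries = [42, 45, 40, 46, 44, 40, 43]
# Part (b): add the director's salary of 200 (thousand).
with_director = salaries + [200]

median_a = median(salaries)
median_b = median(with_director)
mean_a = mean(salaries)
mean_b = mean(with_director)

print(median_a, median_b)  # the median barely moves
print(mean_a, mean_b)      # the mean is pulled up by the extreme value
```

The single extreme value shifts the mean far more than the median, which is the point of part (c).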
[Figure: histogram with the modal class marked; horizontal axis classes from 0 to 100 and "More".]
How spread out are the measurements
around the average value?
E.g.
Data: {4, 4, 4, 4, 50} Range = 46
Data: {4, 8, 15, 24, 39, 50} Range = 46
The range is the same in both cases, but the data sets have
very different distributions…
[Figure: the range runs from the smallest measurement to the largest measurement; the values in between (? ? ?) are unknown.]
Population A: 8, 9, 10, 11, 12 (mean 10). Sum of deviations = 0
Population B: 4, 7, 10, 13, 16 (mean 10)
4 - 10 = -6, 7 - 10 = -3, 13 - 10 = +3, 16 - 10 = +6. Sum of deviations = 0
The sum of deviations is zero in both cases, therefore another measure is needed.
Variance…
Let us calculate the variance of the two populations.
\[ \sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2 \]
\[ \sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18 \]
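A minimal sketch of the same calculation in Python, assuming populations A = {8, 9, 10, 11, 12} and B = {4, 7, 10, 13, 16} as in the example. The helper `deviations` is illustrative; `pvariance` is the population variance from the standard `statistics` module.

```python
from statistics import mean, pvariance

pop_a = [8, 9, 10, 11, 12]
pop_b = [4, 7, 10, 13, 16]

def deviations(values):
    """Return each value's deviation from the mean of the list."""
    m = mean(values)
    return [x - m for x in values]

# Deviations cancel out, so their sum says nothing about spread.
dev_sum_a = sum(deviations(pop_a))
dev_sum_b = sum(deviations(pop_b))

# Squaring removes the sign; averaging gives the population variance.
var_a = pvariance(pop_a)
var_b = pvariance(pop_b)

print(dev_sum_a, dev_sum_b, var_a, var_b)
```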
Why is the variance defined as the average squared deviation?
Why not use the sum of squared deviations as a measure of dispersion instead?
After all, the sum of squared deviations increases in magnitude when the dispersion of a data set increases!
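Data set A: 1, 1, 1, 1, 1, 3, 3, 3, 3, 3 (mean 2)
Data set B: 1, 5 (mean 3)
Data set B is more dispersed around the mean.
\[ \text{Sum}_A = \underbrace{(1-2)^2 + \dots + (1-2)^2}_{5 \text{ times}} + \underbrace{(3-2)^2 + \dots + (3-2)^2}_{5 \text{ times}} = 10 \]
\[ \text{Sum}_B = (1-3)^2 + (5-3)^2 = 8 \]
\[ \sigma_A^2 = \text{Sum}_A / N = 10/10 = 1 \qquad \sigma_B^2 = \text{Sum}_B / N = 8/2 = 4 \]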
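The comparison can be reproduced with the sketch below, which assumes data set A holds five 1s and five 3s and data set B holds the values 1 and 5. The function name `sum_sq_dev` is illustrative.

```python
set_a = [1] * 5 + [3] * 5   # ten observations, mean 2
set_b = [1, 5]              # two observations, mean 3

def sum_sq_dev(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)

sum_a = sum_sq_dev(set_a)
sum_b = sum_sq_dev(set_b)

# Dividing by the number of observations removes the effect of set size.
var_a = sum_a / len(set_a)
var_b = sum_b / len(set_b)

print(sum_a, sum_b)  # the raw sum ranks A above B
print(var_a, var_b)  # the average ranks B above A
```

The raw sum grows with the number of observations, so it ranks A as more dispersed; the average corrects for this.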
Variance…
As you can see, you have to calculate the sample mean \(\bar{X}\)
in order to calculate the sample variance.
Alternatively, there is a shortcut formula that calculates the
sample variance directly from the data, without the
intermediate step of calculating the mean. It is given by:
\[ s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right] \]
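A hedged sketch of the shortcut, assuming the usual form: the sum of squares minus the squared sum divided by n, all divided by n - 1. The function name `shortcut_variance` is illustrative, and the six-value data set is reused from the earlier mean example only as a convenient check against `statistics.variance`.

```python
from statistics import variance

def shortcut_variance(values):
    """Sample variance from the sum and the sum of squares only."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

data = [1, 3, 5, 2, 4, 3]
via_shortcut = shortcut_variance(data)
via_definition = variance(data)  # computes the mean first

print(via_shortcut, via_definition)
```

Both routes agree on this data. On large values with small spread the shortcut can lose precision in floating point, so the definitional form is usually the safer choice in code.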
Solution:
Sample Mean
…as opposed to µ or \(\sigma^2\)
Example 7: Solution…
Sample Variance
Trust A: 12.3, -2.2, 24.9, 1.3, 37.6, 46.9, 28.4, 9.2, 7.1, 34.5
Trust B: 15.1, 0.2, 9.4, 15.2, 30.8, 28.3, 21.2, 13.7, 1.7, 14.4
                     Trust A    Trust B
Mean                 20         15
Standard Error       5.295      3.152
Median               18.6       14.75
Mode                 #N/A       #N/A
Standard Deviation   16.743     9.969
Sample Variance      280.340    99.373
Kurtosis             -1.342     -0.464
Skewness             0.217      0.107
Range                49.1       30.6
Minimum              -2.2       0.2
Maximum              46.9       30.8
Sum                  200        150
Count                10         10
Even though Trust A has a higher average return, it should be
considered riskier because its standard deviation is larger.
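Several rows of the summary table can be reproduced with Python's standard `statistics` module, as sketched below. The sketch takes the second Trust A return as -2.2, which is what the table's minimum, range and sum imply. The helper `summarize` is illustrative.

```python
from statistics import mean, median, stdev, variance

trust_a = [12.3, -2.2, 24.9, 1.3, 37.6, 46.9, 28.4, 9.2, 7.1, 34.5]
trust_b = [15.1, 0.2, 9.4, 15.2, 30.8, 28.3, 21.2, 13.7, 1.7, 14.4]

def summarize(returns):
    """Sample statistics matching some rows of the summary table."""
    return {
        "mean": mean(returns),
        "median": median(returns),
        "stdev": stdev(returns),        # sample standard deviation (n - 1)
        "variance": variance(returns),  # sample variance (n - 1)
        "range": max(returns) - min(returns),
    }

summary_a = summarize(trust_a)
summary_b = summarize(trust_b)

for name, summary in (("Trust A", summary_a), ("Trust B", summary_b)):
    print(name, {k: round(v, 3) for k, v in summary.items()})
```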
\((\bar{X} - s, \bar{X} + s)\) contains approximately 68% of the measurements
\((\bar{X} - 2s, \bar{X} + 2s)\) contains approximately 95% of the measurements
\((\bar{X} - 3s, \bar{X} + 3s)\) contains virtually all the measurements.
[Figure: histogram; horizontal axis classes 2, 5, 8, 11, 14, 17, 20, More; vertical axis from 0 to 10.]
\[ s \approx \frac{\text{Range}}{4} = \frac{30.6}{4} = 7.65 \text{ percent} \]
The actual standard deviation of Trust B returns is 9.97%.
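A small sketch of the range approximation, assuming the rule s ≈ range / 4 and the Trust B returns listed earlier. The function name `range_rule_estimate` is illustrative. Note that 30.6 / 4 evaluates to 7.65.

```python
from statistics import stdev

trust_b = [15.1, 0.2, 9.4, 15.2, 30.8, 28.3, 21.2, 13.7, 1.7, 14.4]

def range_rule_estimate(values):
    """Rough estimate of the standard deviation: range divided by 4."""
    return (max(values) - min(values)) / 4

approx_s = range_rule_estimate(trust_b)
actual_s = stdev(trust_b)

print(round(approx_s, 2), round(actual_s, 2))
```

The estimate is only a quick plausibility check; here it falls noticeably below the actual sample standard deviation.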
Coefficient of Variation
The coefficient of variation of a set of
measurements is the standard deviation divided by
the mean value.
Sample coefficient of variation: \( cv = \frac{s}{\bar{x}} \)
Population coefficient of variation: \( CV = \frac{\sigma}{\mu} \)
This coefficient provides a proportionate measure of
variation.
A standard deviation of 10 may be perceived as large
when the mean value is 100, but only moderately large
when the mean value is 500.
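The comparison above can be made concrete with a short sketch. The function name `coefficient_of_variation` is illustrative; the inputs are the standard deviation of 10 and the two mean values from the text.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Standard deviation expressed as a proportion of the mean."""
    return std_dev / mean_value

cv_small_mean = coefficient_of_variation(10, 100)
cv_large_mean = coefficient_of_variation(10, 500)

print(cv_small_mean, cv_large_mean)
```

The same standard deviation is 10% of the mean in the first case and 2% in the second.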
Location of Percentiles
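Find the location of any percentile using the formula
\[ L_P = (n + 1)\frac{P}{100} \]
where \(L_P\) is the location of the \(P\)th percentile in the ordered data and \(n\) is the number of observations.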
\[ L_{50} = (10 + 1)\frac{50}{100} = 5.5 \]
The 50th percentile is halfway between the fifth and
sixth observations (in the middle between 8 and 9),
that is, 8.5.
\[ L_{75} = (10 + 1)\frac{75}{100} = 8.25 \]
The 75th percentile is one quarter of the distance
between the eighth and ninth observations, that is,
\( 14 + 0.25(22 - 14) = 16 \).
Position   1   2 | 3   4   5   6   7    8 | 9    10
Value      0   0 | 5   7   8   9   12   14 | 22   33
25th percentile: position 2.75 (between the 2nd and 3rd observations), value 3.75
75th percentile: position 8.25 (between the 8th and 9th observations), value 16
Lp determines the position in the data set where
the percentile value lies, not the value of the
percentile itself.
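The two steps, locating the position and then interpolating the value, can be sketched as follows. The sketch assumes the location rule (n + 1) × P / 100 and linear interpolation between neighbouring observations, as in the worked example. The function names are illustrative. Statistical packages often use slightly different percentile conventions, so their results may differ.

```python
def percentile_location(n, p):
    """Position of the p-th percentile among n ordered observations."""
    return (n + 1) * p / 100

def percentile_value(sorted_values, p):
    """Interpolate the percentile value at the computed position."""
    n = len(sorted_values)
    loc = percentile_location(n, p)
    lower = int(loc)     # whole-number part of the position (1-based)
    frac = loc - lower   # fractional distance toward the next observation
    if lower < 1:
        return sorted_values[0]
    if lower >= n:
        return sorted_values[-1]
    below = sorted_values[lower - 1]
    above = sorted_values[lower]
    return below + frac * (above - below)

data = [0, 0, 5, 7, 8, 9, 12, 14, 22, 33]
p25 = percentile_value(data, 25)
p50 = percentile_value(data, 50)
p75 = percentile_value(data, 75)

print(p25, p50, p75)
```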
[Figure: positions of Q1, Q2 and Q3 on a positively skewed histogram and on a negatively skewed histogram.]
Large values of the interquartile range (IQR = Q3 - Q1) mean that
the 1st and 3rd quartiles are far apart, indicating a high level of
variability.
[Figure: box plot marking S, Q1, Q2, Q3 and L, with whiskers of length up to 1.5×IQR on each side of the box.]
• the first, second, and third quartiles.
S     Q1    Q2    Q3    L
410   530   560   590   700
[Figure: box plot of these values. 50% of the observations lie between Q1 and Q3, and 25% lie in each of the ranges S to Q1 and Q3 to L.]
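As a rough illustration, the sketch below computes the interquartile range and the whisker limits for the five values shown. It assumes the convention suggested by the figure label, where each whisker extends at most 1.5×IQR beyond the box. The variable names are illustrative.

```python
# Five-number summary from the box plot example.
smallest, q1, q2, q3, largest = 410, 530, 560, 590, 700

iqr = q3 - q1                # spread of the middle 50% of the data
lower_limit = q1 - 1.5 * iqr # lowest point a whisker may reach
upper_limit = q3 + 1.5 * iqr # highest point a whisker may reach

# Observations beyond the whisker limits are treated as outliers.
smallest_is_outlier = smallest < lower_limit
largest_is_outlier = largest > upper_limit

print(iqr, lower_limit, upper_limit)
print(smallest_is_outlier, largest_is_outlier)
```

Under this convention both the smallest and the largest observation fall outside the whisker limits.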
Covariance…
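In much the same way that there was a ‘shortcut’ for
calculating the sample variance without first calculating
the sample mean, there is also a shortcut for calculating
the sample covariance without first calculating the
means:
\[ \mathrm{cov}(X, Y) = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}\right] \]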
[Figure: scatter diagrams; panel labels include COV(X,Y) = 0 and r → -1.]
Coefficient of Correlation…
Strong positive linear relationship
If the two variables have a very strong positive
linear relationship, the coefficient value is close to +1.
Strong negative linear relationship
If the two variables have a very strong negative
linear relationship, the coefficient value is close to –1.
No linear relationship
No linear (straight line) relationship is indicated by
a coefficient value close to zero.
Advert Sales
1 30
3 40
5 40
4 50
2 35
5 50
3 35
2 25
\[ r = \frac{\mathrm{cov}(X, Y)}{s_x s_y} \]
\[ s_x^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right] \]
Interpretation
• The covariance (10.2679) indicates that
advertisement expenditure and sales level are
positively related
• The coefficient of correlation (0.797) indicates that
there is a strong positive linear relationship between
advertisement expenditure and sales level.
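The covariance and correlation quoted in the interpretation can be reproduced from the Advert and Sales columns with the sketch below. It assumes the shortcut forms based on sums, sums of squares and sums of cross-products with an n - 1 divisor. The function names are illustrative.

```python
from math import sqrt

advert = [1, 3, 5, 4, 2, 5, 3, 2]
sales = [30, 40, 40, 50, 35, 50, 35, 25]

def shortcut_cov(xs, ys):
    """Sample covariance from sums, without computing the means first."""
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return (sum_xy - sum(xs) * sum(ys) / n) / (n - 1)

def shortcut_var(xs):
    """Sample variance from the sum and the sum of squares."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

cov_xy = shortcut_cov(advert, sales)
# Coefficient of correlation: covariance scaled by both standard deviations.
r = cov_xy / sqrt(shortcut_var(advert) * shortcut_var(sales))

print(round(cov_xy, 4), round(r, 3))
```

Unlike the covariance, the coefficient of correlation is unit-free and always lies between -1 and +1, which is why it is used to judge the strength of the relationship.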
The Least Squares Method…
Recall, the slope-intercept equation for a line is
expressed in these terms:
y = mx + b
where:
m is the slope of the line
b is the y-intercept.
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \]
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]
\[ \hat{y} = 9.587 + 2.245x \]
Electrical cost = 9.587 + 2.245 (Number of tools)
The y-intercept is 9.587.
That is, the regression line strikes the y-axis at 9.587.
This is simply the value of \(\hat{y}\) when x = 0.
When x = 0, we are producing no tools, and
hence the estimated fixed cost of electricity is $9.59
per day.
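A minimal sketch of a least squares fit, assuming the standard result that the slope is the sum of cross-deviations divided by the sum of squared x-deviations, and that the intercept is the mean of y minus the slope times the mean of x. The tools and electricity data are not reproduced in these notes, so the data below is hypothetical and chosen to lie exactly on the line y = 1 + 2x. The function name `least_squares_line` is illustrative.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line through the data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: co-movement of x and y relative to the spread of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # Intercept: forces the line through the point (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
b0, b1 = least_squares_line(xs, ys)

print(b0, b1)
```

Setting x = 0 in the fitted line returns the intercept, which is how the value 9.587 is read in the electricity example.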
\( R^2 = 0.758 \)