Measures of Dispersion (Autosaved)
Single Number
Unit # 4
Measures of Dispersion
Although the arithmetic mean is a concise way of presenting
statistical data, it is inadequate for several reasons;
for example, it gives no indication of its own reliability.
A mean weight derived from a sample of 1200 will usually be
more reliable than a mean weight derived from 4 or 5
weights.
However, the reliability of the mean depends not only on the
number of observations but also on the variability of the
individual observations. For example, the mean weight of 20
children selected from a group with weights between 45 and
50 kg will be more reliable than the mean weight of 20
children selected from a group with weights varying between
35 and 70 kg.
Since the mean does not by itself give a clear picture of a distribution,
another type of measure that helps to describe the shape of the
distribution more clearly is dispersion (or variability).
This variability among the individual observations is the essence
of statistical data. It indicates how the observations are spread
out from the average.
A measure of central tendency along with a measure of dispersion
gives an adequate description of statistical data.
Types of Measures of Dispersion
There are two main types of measures of dispersion:
1. Absolute Measure of Dispersion
2. Relative Measure of Dispersion
Measures of Dispersion
The commonly used measures of absolute dispersion
are:
1. Range
2. Quartile Deviation
3. Variance and Standard Deviation
Measures of Spread
Distance Based Measures of Spread
• The range
• The semi-interquartile range
Centre Based Measures of Spread
• The variance
• The standard deviation
Distance Based Measures of Spread
Range
Range. The difference between the largest observation ($X_m$) and the
smallest observation ($X_0$):
$\text{Range} = X_m - X_0$
• Coefficient of Range.
$\text{Coefficient of Range} = \dfrac{X_m - X_0}{X_m + X_0}$
Range
Example: The following data set shows weekly TV viewing
times, in hours. Find the range and the coefficient of range.
25 41 27 32 43 66 35 31 15 5
34 26 32 38 16 30 38 30 20 21
$X_m = 66$, $X_0 = 5$
$\text{Range} = X_m - X_0 = 66 - 5 = 61$
$\text{Coefficient of Range} = \dfrac{X_m - X_0}{X_m + X_0} = \dfrac{66 - 5}{66 + 5} = 0.86$
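The range calculation above can be checked with a short script. This is only a minimal sketch: the function name `range_and_coefficient` is an illustrative choice, and the data are the 20 viewing times listed in the example.

```python
# Weekly TV viewing times in hours, copied from the example.
hours = [25, 41, 27, 32, 43, 66, 35, 31, 15, 5,
         34, 26, 32, 38, 16, 30, 38, 30, 20, 21]


def range_and_coefficient(data):
    """Return (range, coefficient of range) for a list of numbers."""
    x_max, x_min = max(data), min(data)
    data_range = x_max - x_min                   # X_m - X_0
    coefficient = data_range / (x_max + x_min)   # (X_m - X_0) / (X_m + X_0)
    return data_range, coefficient


data_range, coefficient = range_and_coefficient(hours)
print(data_range)             # 61
print(round(coefficient, 2))  # 0.86
```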
Disadvantages of the Range
• Ignores the way in which data are distributed
[Figure: two dot plots on the same axis from 7 to 12 with differently distributed points; in both, Range = 12 - 7 = 5]
• Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
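The sensitivity to outliers can be reproduced directly from the two lists above. The sketch below rebuilds them (the helper name `value_range` is an assumption made for illustration) and shows that changing a single value moves the range from 4 to 119.

```python
def value_range(data):
    """Largest observation minus smallest observation."""
    return max(data) - min(data)


# 25 values from the slide: eleven 1s, eight 2s, four 3s, then 4 and 5.
base = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]

# Same data, but the last value (5) is replaced by the outlier 120.
with_outlier = base[:-1] + [120]

print(value_range(base))          # 4
print(value_range(with_outlier))  # 119
```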
Interquartile Range
Semi Interquartile Range or Quartile Deviation
Quartile Deviation. Find the IQR, Q.D and Coefficient of Q.D for the following ordered data (in hours):
5 15 16 20 21 25 26 27 30 30
31 32 32 34 35 37 38 41 43 66
LOCATION of Q1 = $\frac{1(20+1)}{4}$th observation in the data
= 5.25th observation
VALUE of Q1 = 5th obs. + 0.25{6th obs. - 5th obs.}
= 21 + 0.25{25 - 21} = 22.0 Hours
LOCATION of Q2 = $\frac{2(20+1)}{4}$th observation in the data
= 10.50th observation
VALUE of Q2 = 10th obs. + 0.50{11th obs. - 10th obs.}
= 30 + 0.50{31 - 30} = 30.5 Hours
LOCATION of Q3 = $\frac{3(20+1)}{4}$th observation in the data
= 15.75th observation
VALUE of Q3 = 15th obs. + 0.75{16th obs. - 15th obs.}
= 35 + 0.75{37 - 35} = 36.5 Hours
Q1 = 22.0 Hours, Q3 = 36.5 Hours
$\text{I.Q.R} = Q_3 - Q_1 = 36.5 - 22.0 = 14.50$
$\text{Q.D} = \dfrac{Q_3 - Q_1}{2} = \dfrac{36.5 - 22.0}{2} = 7.25$
$\text{Coefficient of Q.D} = \dfrac{Q_3 - Q_1}{Q_3 + Q_1} = \dfrac{36.5 - 22.0}{36.5 + 22.0} = \dfrac{14.5}{58.5} = 0.25$
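The quartile steps used in this example (locate the k(n+1)/4-th observation, then interpolate between its neighbours) can be sketched in code as follows. The function name `quantile_at` and its `parts` parameter are illustrative assumptions; other textbooks and libraries use slightly different location rules, so their results may differ.

```python
def quantile_at(sorted_data, k, parts=4):
    """Value at location k(n+1)/parts, interpolating between neighbours."""
    n = len(sorted_data)
    position = k * (n + 1) / parts   # 1-based and possibly fractional
    if not 1 <= position <= n:
        raise ValueError("location falls outside the data")
    whole = int(position)            # e.g. 5 for the 5.25th observation
    fraction = position - whole      # e.g. 0.25
    value = sorted_data[whole - 1]
    if fraction > 0:
        value += fraction * (sorted_data[whole] - sorted_data[whole - 1])
    return value


# Ordered data from the example (hours).
hours = [5, 15, 16, 20, 21, 25, 26, 27, 30, 30,
         31, 32, 32, 34, 35, 37, 38, 41, 43, 66]

q1, q2, q3 = (quantile_at(hours, k) for k in (1, 2, 3))
iqr = q3 - q1
quartile_deviation = iqr / 2
coefficient_qd = (q3 - q1) / (q3 + q1)

print(q1, q2, q3)                # 22.0 30.5 36.5
print(iqr, quartile_deviation)   # 14.5 7.25
print(round(coefficient_qd, 3))  # 0.248
```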
Centre Based Measures of Spread
• Several statistics use the centre of the data as a
point of reference and reflect how data is
clustered around it
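Variance: is defined as the arithmetic mean of
the squared deviations of the observations from their
mean; it is denoted by $S^2$.
$S^2 = \dfrac{\sum (X - \bar{X})^2}{n}$
or, using the divisor $n - 1$,
$S^2 = \dfrac{\sum (X - \bar{X})^2}{n - 1}$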
Q. Find the variance $S^2$ of the given data.

X       (X – X̄)²
4       36
6       16
9       1
12      4
13      9
16      36
ΣX = 60     Σ(X – X̄)² = 102

$\bar{X} = \dfrac{\sum X}{n} = \dfrac{60}{6} = 10$
$S^2 = \dfrac{\sum (X - \bar{X})^2}{n} = \dfrac{102}{6} = 17$
$S^2 = \dfrac{\sum (X - \bar{X})^2}{n - 1} = \dfrac{102}{5} = 20.4$
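Q. Find the variance $S^2$ of the given data using the sums $\sum X$ and $\sum X^2$.

X (cm)  X²
4       16
6       36
9       81
12      144
13      169
16      256
ΣX = 60     ΣX² = 702

$\bar{X} = \dfrac{\sum X}{n} = \dfrac{60}{6} = 10$
$S^2 = \dfrac{1}{n}\left[\sum X^2 - \dfrac{(\sum X)^2}{n}\right] = \dfrac{1}{6}\left[702 - \dfrac{(60)^2}{6}\right] = 17 \text{ cm}^2$
$S^2 = \dfrac{1}{n - 1}\left[\sum X^2 - \dfrac{(\sum X)^2}{n}\right] = \dfrac{1}{5}\left[702 - \dfrac{(60)^2}{6}\right] = 20.4 \text{ cm}^2$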
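Both routes to the variance (squared deviations from the mean, and the shortcut using ΣX and ΣX²) can be compared in a few lines. This is a sketch only; the function names and the `use_n_minus_1` flag are assumptions introduced for illustration.

```python
data = [4, 6, 9, 12, 13, 16]  # the six observations from the example (cm)


def variance_from_deviations(xs, use_n_minus_1=False):
    """Sum of squared deviations from the mean, divided by n or n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    sum_sq_dev = sum((x - mean) ** 2 for x in xs)
    return sum_sq_dev / (n - 1 if use_n_minus_1 else n)


def variance_shortcut(xs, use_n_minus_1=False):
    """Same quantity computed from sum(X) and sum(X^2) only."""
    n = len(xs)
    sum_sq_dev = sum(x * x for x in xs) - sum(xs) ** 2 / n
    return sum_sq_dev / (n - 1 if use_n_minus_1 else n)


print(variance_from_deviations(data))           # 17.0
print(variance_shortcut(data))                  # 17.0
print(variance_from_deviations(data, True))     # 20.4
print(variance_shortcut(data, True))            # 20.4
```

The shortcut avoids computing each deviation, but on data with a large mean and small spread it can lose precision in floating point, so the deviation form is usually the safer default in code.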
Standard Deviation: is defined as the positive square
root of the arithmetic mean of the squared
deviations of the observations from their mean; it is denoted by
S.
For the data above ($\sum (X - \bar{X})^2 = 102$, $n = 6$):
$S = \sqrt{\dfrac{\sum (X - \bar{X})^2}{n}} = \sqrt{\dfrac{102}{6}} = 4.123$
$S = \sqrt{\dfrac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{\dfrac{102}{5}} = 4.517$
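Q. Find the S.D and Coefficient of Variation of the given data.

X (cm)  X²
4       16
6       36
9       81
12      144
13      169
16      256
ΣX = 60     ΣX² = 702

$\bar{X} = \dfrac{\sum X}{n} = \dfrac{60}{6} = 10$
$S = \sqrt{\dfrac{1}{n}\left[\sum X^2 - \dfrac{(\sum X)^2}{n}\right]} = \sqrt{\dfrac{1}{6}\left[702 - \dfrac{(60)^2}{6}\right]} = \sqrt{17} = 4.123 \text{ cm}$
$S = \sqrt{\dfrac{1}{n - 1}\left[\sum X^2 - \dfrac{(\sum X)^2}{n}\right]} = \sqrt{\dfrac{1}{5}\left[702 - \dfrac{(60)^2}{6}\right]} = \sqrt{20.4} = 4.517 \text{ cm}$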
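As a cross-check, Python's standard `statistics` module provides both versions of the standard deviation: `pstdev` divides by n and `stdev` divides by n - 1. The sketch below applies them to the six observations and also derives the coefficient of variation from the divisor-n value.

```python
import statistics

data = [4, 6, 9, 12, 13, 16]  # observations in cm

sd_n = statistics.pstdev(data)          # divisor n
sd_n_minus_1 = statistics.stdev(data)   # divisor n - 1
cv_percent = sd_n / statistics.mean(data) * 100

print(round(sd_n, 3))          # 4.123
print(round(sd_n_minus_1, 3))  # 4.517
print(round(cv_percent, 2))    # 41.23
```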
Coefficient of Variation (CV)
• Always in percentage (%)
• Shows relative variability, that is, variability relative to
the magnitude of the data (variation relative to the mean)
• Can be used to compare two or more sets of data
measured in different units, or in the same units but with
different average sizes
$\text{C.V} = \dfrac{S}{\bar{X}} \times 100\%$
Coefficient of Variation (CV)
$\bar{X} = 10$
$S = 4.123$
$\text{CV} = \dfrac{4.123}{10} \times 100\% = 41.23\%$
Comparing Coefficient of Variation
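Mr. Ali
Average marks out of 60 = $\bar{X}_1 = 53$
Standard deviation = $S_1 = 5$
$\text{CV}_{Ali} = \dfrac{S_1}{\bar{X}_1} \times 100\% = \dfrac{5}{53} \times 100\% = 9.43\%$
Mr. Zain
Average marks out of 80 = $\bar{X}_2 = 72$
Standard deviation = $S_2 = 10$
$\text{CV}_{Zain} = \dfrac{S_2}{\bar{X}_2} \times 100\% = \dfrac{10}{72} \times 100\% = 13.89\%$
Mr. Ali is more consistent in performance, because his coefficient of variation is smaller.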
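The comparison of the two students reduces to computing two ratios and picking the smaller one. A minimal sketch, with `coefficient_of_variation` as an illustrative helper name:

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    return sd / mean * 100


cv_ali = coefficient_of_variation(5, 53)    # marks out of 60
cv_zain = coefficient_of_variation(10, 72)  # marks out of 80

# The smaller CV indicates the more consistent performance.
more_consistent = "Ali" if cv_ali < cv_zain else "Zain"

print(round(cv_ali, 2))   # 9.43
print(round(cv_zain, 2))  # 13.89
print(more_consistent)    # Ali
```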
Comparing Coefficient of Variation
• Stock A:
o Average price last year = $50
o Standard deviation = $5
$\text{CV}_A = \dfrac{s}{\bar{x}} \times 100\% = \dfrac{\$5}{\$50} \times 100\% = 10\%$
• Stock B:
o Average price last year = $100
o Standard deviation = $5
$\text{CV}_B = \dfrac{s}{\bar{x}} \times 100\% = \dfrac{\$5}{\$100} \times 100\% = 5\%$
Both stocks have the same standard deviation, but stock B is less
variable relative to its price.
Which variable [WEIGHT or HEIGHT] has greater dispersion? [No meaningful answer can be given]
Which variable has greater dispersion relative to its average, e.g., greater Coefficient of Dispersion (SD
relative to mean)?
$\text{CV}_{Weight} = \dfrac{S}{\bar{X}} \times 100\% = \dfrac{30}{160} = \dfrac{13.6}{72.6} = \dfrac{0.015}{0.08} = 18.7\%$
$\text{CV}_{Height} = \dfrac{S}{\bar{X}} \times 100\% = \dfrac{4}{66} = \dfrac{0.33}{5.5} = \dfrac{10.2}{168} = 6.1\%$
Note that the Coefficient of Variation is a pure number, not expressed in any units, and is the same
whatever units the variable is measured in.
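The unit-free property of the coefficient of variation can be illustrated numerically. In the sketch below the weights are hypothetical values invented for the demonstration (they are not the data behind the figures above), and the conversion factor of about 2.20462 lb per kg is the only other assumption.

```python
import math
import statistics

# Hypothetical weights in kilograms, made up for this illustration.
weights_kg = [58.0, 64.5, 70.2, 72.6, 81.3, 90.1]

# The same weights expressed in pounds (1 kg is about 2.20462 lb).
weights_lb = [w * 2.20462 for w in weights_kg]


def cv_percent(data):
    """Coefficient of variation: standard deviation over mean, in percent."""
    return statistics.pstdev(data) / statistics.mean(data) * 100


# Rescaling multiplies both S and the mean by the same factor,
# so their ratio is unchanged.
print(math.isclose(cv_percent(weights_kg), cv_percent(weights_lb)))  # True
```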
Choosing Appropriate
Measure of Variability
• If data are symmetric, with no serious outliers, use
range and standard deviation.
• If data are skewed, and/or have serious outliers,
use IQR.
• If comparing variation across two data sets, use
coefficient of variation (C.V)
Five Number Summary
The five number summary of a data set consists of the
minimum value, the first quartile, the second
quartile, the third quartile and the maximum value,
written in that order: Min, Q1, Q2, Q3, Max.
Five Number Summary
X0 = Min value = 5
Xm = Max value = 66
LOCATION of Q1 = $\frac{1(20+1)}{4}$th observation in the data
= 5.25th observation
VALUE of Q1 = 5th obs. + 0.25{6th obs. - 5th obs.}
= 21 + 0.25{25 - 21} = 22.0 Hours
LOCATION of Q2 = $\frac{2(20+1)}{4}$th observation in the data
= 10.50th observation
VALUE of Q2 = 10th obs. + 0.50{11th obs. - 10th obs.}
= 30 + 0.50{31 - 30} = 30.5 Hours
LOCATION of Q3 = $\frac{3(20+1)}{4}$th observation in the data
= 15.75th observation
VALUE of Q3 = 15th obs. + 0.75{16th obs. - 15th obs.}
= 35 + 0.75{37 - 35} = 36.5 Hours
Construction of Box Whisker Plot
1. Minimum Value=5.0
2. Q1=22.0
3. Q2=30.5
4. Q3=36.5
5. Maximum Value=66.0
[Figure: box-and-whisker plot of these five values on a vertical axis from 0 to 70]
Box and Whisker Plot
[Figure: horizontal box-and-whisker plot on an axis from 0 to 65, with the box labelled]
Interpretation of Box-Whisker Plot
Detection of Outliers
Determine Inner and Outer Fences
Given Q1 = 22.0, Q2 = 30.5 and Q3 = 36.5,
the inner fences and outer fences are defined as follows:
$\text{I.Q.R} = Q_3 - Q_1 = 36.5 - 22 = 14.5$
Inner fences:
$\text{Lower Inner Fence (LIF)} = Q_1 - 1.5(\text{IQR}) = 22 - 1.5(14.5) = 0.25$
$\text{Upper Inner Fence (UIF)} = Q_3 + 1.5(\text{IQR}) = 36.5 + 1.5(14.5) = 58.25$
Outer fences:
$\text{Lower Outer Fence (LOF)} = Q_1 - 3(\text{IQR}) = 22 - 3(14.5) = -21.5$
$\text{Upper Outer Fence (UOF)} = Q_3 + 3(\text{IQR}) = 36.5 + 3(14.5) = 80.0$
Box and Whisker Plot
[Figure: box-and-whisker plot on an axis from -30 to 90 marking LOF = -21.5, LIF = 0.25, Median = 30.5, UIF = 58.25 and UOF = 80; the value 66 is labelled as a suspected outlier]
Identification of Suspected and Sure Outliers
1. The values that lie within the inner fences are normal
values.
2. The values that lie outside the inner fences but
inside the outer fences are possible/suspected/mild
outliers.
3. The values that lie outside the outer fences are sure outliers.
Only 66 is a mild outlier.
Plot each suspected outlier with an asterisk and
each sure outlier with a hollow dot.
[Figure: box plot on a vertical axis from 0 to 80 with 66 marked by an asterisk]
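The fence rules can be sketched as a small classifier. The function names `fences` and `classify` are illustrative, the quartiles are taken from the worked example rather than recomputed, and treating a value that lies exactly on a fence as inside it is an assumption, since the slides do not cover that case.

```python
def fences(q1, q3):
    """Return ((lower inner, upper inner), (lower outer, upper outer))."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    return inner, outer


def classify(value, inner, outer):
    """Label a value as normal, suspected outlier, or sure outlier."""
    if inner[0] <= value <= inner[1]:
        return "normal"
    if outer[0] <= value <= outer[1]:
        return "suspected outlier"
    return "sure outlier"


# Ordered viewing times (hours) and the quartiles from the example.
hours = [5, 15, 16, 20, 21, 25, 26, 27, 30, 30,
         31, 32, 32, 34, 35, 37, 38, 41, 43, 66]
inner, outer = fences(22.0, 36.5)

suspected = [x for x in hours if classify(x, inner, outer) == "suspected outlier"]
sure = [x for x in hours if classify(x, inner, outer) == "sure outlier"]

print(inner, outer)  # (0.25, 58.25) (-21.5, 80.0)
print(suspected)     # [66]
print(sure)          # []
```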
Treatment for outlier
When an outlier is found, its cause should be determined if
possible.
1. If it is discovered that an outlier is due to a measurement or
recording error or that for some other reason it clearly does not
belong to the set of data, then the outlier should be removed
from the data.
2. It sometimes helps to determine the cause of extreme values,
because outliers can often provide useful insights to the
situations under consideration (such as better ways of doing
things).
3. If no explanation for an outlier is apparent, then the decision
whether to retain it in the set of data can often be difficult and
calls for a judgment by the researcher.
Skewness
A distribution in which the values equidistant from the centre have equal frequencies is defined
to be symmetrical and any departure from symmetry is called skewness. That is, in
symmetrical distributions the points equidistant from the centre have equal concentration.
The frequency curve for a symmetrical distribution can be folded along the central
maximum in such a way that the two halves of the curve coincide.
Skewness
For a symmetrical distribution, the following relations hold:
Positive Skewness
For a positively skewed distribution the following
relations hold:
Negative Skewness
A distribution is negatively skewed, if the observations tend to concentrate
more at the upper end of the possible values of the variable than the
lower end. A negatively skewed frequency curve has a longer tail on
the left side.
Negative Skewness
For a negatively skewed distribution,
the following relations hold:
Measures of Skewness
A distribution in which the values equidistant from the centre
have equal frequencies is defined to be symmetrical, and
any departure from symmetry is called skewness.
Karl Pearson Coefficient of Skewness
1. $Sk = \dfrac{\text{Mean} - \text{Mode}}{S}$
2. $Sk = \dfrac{3(\text{Mean} - \text{Median})}{S}$
Bowley Coefficient of Skewness
$Sk = \dfrac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}$
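The two coefficients that need no mode can be sketched as below. The function names are illustrative. As assumptions for the demonstration, Pearson's second coefficient is applied to the six-value data set from the variance example using the divisor-n standard deviation, and Bowley's coefficient is applied to the quartiles of the viewing-time example (22.0, 30.5, 36.5).

```python
import statistics


def pearson_second_coefficient(data):
    """3 * (mean - median) / S, with S computed using divisor n."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    sd = statistics.pstdev(data)
    return 3 * (mean - median) / sd


def bowley_coefficient(q1, q2, q3):
    """(Q3 + Q1 - 2*Q2) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * q2) / (q3 - q1)


sample = [4, 6, 9, 12, 13, 16]
pearson_sk = pearson_second_coefficient(sample)
bowley_sk = bowley_coefficient(22.0, 30.5, 36.5)

print(round(pearson_sk, 3))  # -0.364
print(round(bowley_sk, 3))   # -0.172
```

A negative value points to a longer left tail and a positive value to a longer right tail; values near zero suggest approximate symmetry.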
From the information given below, find the Coefficient of Skewness.
Measures of Kurtosis
Coefficient of Kurtosis = K
1. $K = \dfrac{Q_3 - Q_1}{2(P_{90} - P_{10})}$
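A possible way to evaluate this coefficient on the viewing-time data is sketched below. The slides do not say how P10 and P90 are located; this sketch assumes the same rule used for the quartiles, with the k-th percentile at the k(n+1)/100-th observation. The helper name `value_at` is also an assumption.

```python
def value_at(sorted_data, position):
    """Interpolated value at a 1-based, possibly fractional position."""
    whole = int(position)
    fraction = position - whole
    value = sorted_data[whole - 1]
    if fraction > 0:
        value += fraction * (sorted_data[whole] - sorted_data[whole - 1])
    return value


hours = [5, 15, 16, 20, 21, 25, 26, 27, 30, 30,
         31, 32, 32, 34, 35, 37, 38, 41, 43, 66]
n = len(hours)

q1 = value_at(hours, 1 * (n + 1) / 4)       # 5.25th observation
q3 = value_at(hours, 3 * (n + 1) / 4)       # 15.75th observation
p10 = value_at(hours, 10 * (n + 1) / 100)   # 2.1th observation (assumed rule)
p90 = value_at(hours, 90 * (n + 1) / 100)   # 18.9th observation (assumed rule)

kurtosis = (q3 - q1) / (2 * (p90 - p10))
print(round(p10, 1), round(p90, 1))  # 15.1 42.8
print(round(kurtosis, 3))            # 0.262
```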
The Empirical Rule
If the data distribution is approximately bell-shaped, then the interval:
• $\bar{X} \pm S$ contains about 68% of values
• $\bar{X} \pm 2S$ contains about 95% of values
• $\bar{X} \pm 3S$ contains about 99.7% of values
Question-1:
‘How many measurements are within 1 standard deviation of
the mean?’
Question-2:
‘How many measurements are within 2 standard
deviations?’
and so on.
According to the empirical rule:
a) Approximately 68% of the measurements will fall
within 1 standard deviation of the mean, i.e. within
the interval ($\bar{X} - S$, $\bar{X} + S$)
b) Approximately 95% of the measurements will fall
within 2 standard deviations of the mean, i.e. within
the interval ($\bar{X} - 2S$, $\bar{X} + 2S$)
c) Approximately 100% (practically all) of the
measurements will fall within 3 standard deviations
of the mean, i.e. within the interval ($\bar{X} - 3S$, $\bar{X} + 3S$)
EXAMPLE. The percentages of revenues spent on R&D (i.e. Research
and Development) by 50 companies are:
13.5  9.5   8.2   6.5   8.4   8.1   6.9   7.5   10.5  13.5
7.2   7.1   9.0   9.9   8.2   13.2  9.2   6.9   9.6   7.7
9.7   7.5   7.2   5.9   6.6   11.1  8.8   5.2   10.6  8.2
11.3  5.6   10.1  8.0   8.5   11.7  7.1   7.7   9.4   6.0
8.0   7.4   10.5  7.8   7.9   6.5   6.9   6.5   6.8   9.5
$\bar{X} = 8.49$ and $S = 1.98$
Calculate the proportions of these measurements
that lie within the intervals $\bar{X} \pm S$, $\bar{X} \pm 2S$ and $\bar{X} \pm 3S$ and
compare the results with the theoretical values. The
mean and standard deviation of these data come out to be
8.49 and 1.98, respectively.
EXAMPLE
($\bar{X} - S$, $\bar{X} + S$)
= (8.49 – 1.98, 8.49 + 1.98)
= (6.51, 10.47)
A check of the measurements reveals that 34 of the 50
measurements, or 68%, fall between 6.51 and 10.47.
Similarly, the interval
($\bar{X} - 2S$, $\bar{X} + 2S$)
= (8.49 – 3.96, 8.49 + 3.96)
= (4.53, 12.45)
contains 47 of the 50 measurements, i.e. 94% of
the data values.
Finally, the 3-standard deviation interval around the mean,
($\bar{X} - 3S$, $\bar{X} + 3S$)
= (8.49 – 5.94, 8.49 + 5.94)
= (2.55, 14.43)
contains all, or 100%, of the measurements.
In spite of the fact that the distribution of these
data is skewed to the right, the percentages of data-
values falling within 1, 2, and 3 standard deviations of
the mean are remarkably close to the theoretical
values (68%, 95%, and 100%) given by the Empirical
Rule.
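The counts quoted in this example can be reproduced with a short script. This is a minimal sketch: it uses the rounded mean and standard deviation stated in the example (8.49 and 1.98) rather than recomputing them, and the function name `share_within` is an illustrative choice.

```python
# Percentages of revenue spent on R&D by the 50 companies in the example.
revenues = [
    13.5, 9.5, 8.2, 6.5, 8.4, 8.1, 6.9, 7.5, 10.5, 13.5,
    7.2, 7.1, 9.0, 9.9, 8.2, 13.2, 9.2, 6.9, 9.6, 7.7,
    9.7, 7.5, 7.2, 5.9, 6.6, 11.1, 8.8, 5.2, 10.6, 8.2,
    11.3, 5.6, 10.1, 8.0, 8.5, 11.7, 7.1, 7.7, 9.4, 6.0,
    8.0, 7.4, 10.5, 7.8, 7.9, 6.5, 6.9, 6.5, 6.8, 9.5,
]
mean, sd = 8.49, 1.98  # values quoted in the example


def share_within(data, centre, spread, k):
    """Count and percentage of values strictly inside centre +/- k*spread."""
    lower, upper = centre - k * spread, centre + k * spread
    count = sum(lower < x < upper for x in data)
    return count, 100 * count / len(data)


for k in (1, 2, 3):
    count, percent = share_within(revenues, mean, sd, k)
    print(k, count, percent)
# 1 34 68.0
# 2 47 94.0
# 3 50 100.0
```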