
MT-2005

Single Number

Unit # 4
Measures of Dispersion
Although the arithmetic mean is a concise way of summarizing
statistical data, it is inadequate on its own for several reasons;
for example, it gives no indication of its reliability.
A mean weight derived from a sample of 1200 will usually be
more reliable than a mean weight derived from 4 or 5
weights.
However, not only the number of observations but also the
variability of the individual observations affects the
reliability of the mean. For example, the mean weight of 20
children selected from a group with weights between 45 and
50 kg will be more reliable than the mean weight of 20
children selected from a group with weights varying between
35 and 70 kg.
Since the mean does not by itself give a clear picture of a distribution,
another type of measure is needed to describe the shape of the
distribution more clearly: dispersion (or variability).
This variability among the individual observations is the essence
of statistical data. It indicates how the observations are spread
out from the average.

1. The spread in the sample observations may be of great interest in order to
estimate the spread in the population.
2. The spread of a distribution may be of interest insofar as it aids in
gauging the precision with which a sample mean estimates the
corresponding population mean.

A measure of central tendency along with a measure of dispersion gives an
adequate description of statistical data.
Measures of Dispersion

The scatter of the values about their center is called
dispersion, and measures used to find the
amount of scatter about the center are called
measures of dispersion.
Measures of variation express the variation present
among the values in a data set as a single number,
so they are summary measures of the
spread of values in the data.
Types of Measures of Dispersion
There are two main types of measures of dispersion:
1. Absolute Measure of Dispersion
2. Relative Measure of Dispersion

Absolute Measure of Dispersion
An absolute measure of dispersion expresses the variation present among
the observations in the unit of the variable or in the square of that
unit.

Relative Measure of Dispersion
A relative measure of dispersion expresses the variation present among
the observations relative to their average. It is expressed as a
ratio or a percentage and is independent of the unit of measurement.
Measures of Dispersion
The commonly used measures of absolute dispersion
are:
1. Range
2. Quartile Deviation
3. Variance and Standard Deviation

Their corresponding measures of relative dispersion are:


1. Coefficient of Range/Coefficient of dispersion
2. Coefficient of Quartile Deviation
3. Coefficient of Variation (CV)

Measures of Spread
Distance Based Measures of Spread
• The range
• The Semi interquartile range
Centre Based Measures of Spread
• The variance
• The standard deviation

Distance Based Measures of Spread
Range
The range is the difference between the largest and the
smallest observations. Writing $X_m = X_{\max}$ for the largest and
$X_0 = X_{\min}$ for the smallest observation:
$$\text{Range} = X_m - X_0$$

• Coefficient of Range:
$$\text{Coefficient of Range} = \frac{X_m - X_0}{X_m + X_0}$$
Range
Example: The following data set shows the weekly TV viewing
times, in hours. Find the Range and the Coefficient of Range.
25 41 27 32 43 66 35 31 15 5
34 26 32 38 16 30 38 30 20 21

$X_m = 66$, $X_0 = 5$
$$\text{Range} = X_m - X_0 = 66 - 5 = 61 \text{ hours}$$
$$\text{Coefficient of Range} = \frac{X_m - X_0}{X_m + X_0} = \frac{66 - 5}{66 + 5} = 0.86$$
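The range calculation for the TV viewing data can be checked with a short Python sketch. The function names `value_range` and `coefficient_of_range` are illustrative choices, not part of the slides.

```python
# Weekly TV viewing times (hours) from the example.
hours = [25, 41, 27, 32, 43, 66, 35, 31, 15, 5,
         34, 26, 32, 38, 16, 30, 38, 30, 20, 21]


def value_range(data):
    """Largest observation minus smallest observation."""
    return max(data) - min(data)


def coefficient_of_range(data):
    """(Xm - X0) / (Xm + X0), a unit-free relative measure."""
    largest, smallest = max(data), min(data)
    return (largest - smallest) / (largest + smallest)


print(value_range(hours))                     # 61
print(round(coefficient_of_range(hours), 2))  # 0.86
```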
Disadvantages of the Range
• Ignores the way in which data are distributed

[Figure: two dot plots on a scale from 7 to 12 with differently distributed points; in both cases Range = 12 - 7 = 5]

• Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
Interquartile Range

The interquartile range is the difference between the upper quartile
and the lower quartile.

With $Q_3$ = upper quartile and $Q_1$ = lower quartile:
$$\text{I.Q.R} = Q_3 - Q_1$$
Semi Interquartile Range or Quartile Deviation

Half of the difference between the upper quartile and the
lower quartile.

With $Q_3$ = upper quartile and $Q_1$ = lower quartile:
$$\text{Q.D} = \frac{Q_3 - Q_1}{2}$$
$$\text{Coefficient of Q.D} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$
Quartile Deviation Example: Find the I.Q.R, Q.D and Coefficient of Q.D for the ordered TV viewing data (hours) below.

5 15 16 20 21 25 26 27 30 30
31 32 32 34 35 37 38 41 43 66

Location of $Q_1$ = $\frac{1(20+1)}{4}$th observation in the data = 5.25th observation
Value of $Q_1$ = 5th obs. + 0.25{6th obs. - 5th obs.}
= 21 + 0.25{25 - 21} = 22.0 hours

Location of $Q_2$ = $\frac{2(20+1)}{4}$th observation in the data = 10.50th observation
Value of $Q_2$ = 10th obs. + 0.50{11th obs. - 10th obs.}
= 30 + 0.50{31 - 30} = 30.5 hours

Location of $Q_3$ = $\frac{3(20+1)}{4}$th observation in the data = 15.75th observation
Value of $Q_3$ = 15th obs. + 0.75{16th obs. - 15th obs.}
= 35 + 0.75{37 - 35} = 36.5 hours
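$Q_1 = 22.0$ hours, $Q_3 = 36.5$ hours
$$\text{I.Q.R} = Q_3 - Q_1 = 36.5 - 22.0 = 14.50 \text{ hours}$$
$$\text{Q.D} = \frac{Q_3 - Q_1}{2} = \frac{36.5 - 22.0}{2} = 7.25 \text{ hours}$$
$$\text{Coefficient of Q.D} = \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{36.5 - 22.0}{36.5 + 22.0} = \frac{14.5}{58.5} = 0.25$$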

Centre Based Measures of Spread
• Several statistics use the centre of the data as a
point of reference and reflect how data is
clustered around it

• Of these, the variance and the standard deviation are the most
widely used. All are based on the arithmetic mean.
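Variance: defined as the arithmetic mean of
the squared deviations of the observations from
their mean; it is denoted by $S^2$.

(Biased Estimate of Variance)
$$S^2 = \frac{\sum (X - \bar{X})^2}{n} \quad \text{or} \quad S^2 = \frac{1}{n}\left[\sum X^2 - \frac{(\sum X)^2}{n}\right]$$

(Unbiased Estimate of Variance)
$$S^2 = \frac{\sum (X - \bar{X})^2}{n - 1} \quad \text{or} \quad S^2 = \frac{1}{n - 1}\left[\sum X^2 - \frac{(\sum X)^2}{n}\right]$$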
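Q. Find the variance, $S^2$, of the given data.

X         (X - X̄)²
4         36
6         16
9         1
12        4
13        9
16        36
ΣX = 60   Σ(X - X̄)² = 102

$$\bar{X} = \frac{\sum X}{n} = \frac{60}{6} = 10$$
Biased estimate:
$$S^2 = \frac{\sum (X - \bar{X})^2}{n} = \frac{102}{6} = 17$$
Unbiased estimate:
$$S^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{102}{5} = 20.4$$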
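As a check on this example, the sketch below computes the variance of the six observations in two ways: from the squared deviations about the mean, and from the shortcut form based on ΣX² and (ΣX)²/n. The function names and the `unbiased` flag are illustrative assumptions.

```python
data = [4, 6, 9, 12, 13, 16]


def variance_definitional(xs, unbiased=False):
    """Sum of squared deviations from the mean, divided by n or n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)
    return ss / (n - 1 if unbiased else n)


def variance_shortcut(xs, unbiased=False):
    """Same quantity from sum(X^2) - (sum X)^2 / n, without deviations."""
    n = len(xs)
    ss = sum(x * x for x in xs) - sum(xs) ** 2 / n
    return ss / (n - 1 if unbiased else n)


print(variance_definitional(data))                 # 17.0
print(variance_shortcut(data))                     # 17.0
print(variance_definitional(data, unbiased=True))  # 20.4
```

Both forms agree because Σ(X - X̄)² = ΣX² - (ΣX)²/n; the shortcut avoids computing each deviation but can lose precision on large values.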

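Q. Find the variance, $S^2$, of the given data (shortcut formula).

X (cm)    X²
4         16
6         36
9         81
12        144
13        169
16        256
ΣX = 60   ΣX² = 702,   n = 6

$$\bar{X} = \frac{\sum X}{n} = \frac{60}{6} = 10 \text{ cm}$$
Biased estimate:
$$S^2 = \frac{1}{n}\left[\sum X^2 - \frac{(\sum X)^2}{n}\right] = \frac{1}{6}\left[702 - \frac{(60)^2}{6}\right] = 17 \text{ cm}^2$$
Unbiased estimate:
$$S^2 = \frac{1}{n - 1}\left[\sum X^2 - \frac{(\sum X)^2}{n}\right] = \frac{1}{5}\left[702 - \frac{(60)^2}{6}\right] = 20.4 \text{ cm}^2$$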
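Standard Deviation: defined as the positive square
root of the arithmetic mean of the squared
deviations of the observations from their mean; it is denoted by
S.

(Biased Estimate of S.D)
$$S = \sqrt{\frac{\sum (X - \bar{X})^2}{n}} \quad \text{or} \quad S = \sqrt{\frac{1}{n}\left[\sum X^2 - \frac{(\sum X)^2}{n}\right]}$$

(Unbiased Estimate of S.D)
$$S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} \quad \text{or} \quad S = \sqrt{\frac{1}{n - 1}\left[\sum X^2 - \frac{(\sum X)^2}{n}\right]}$$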
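Q. Find the S.D, S, of the given data.

X         (X - X̄)²
4         36
6         16
9         1
12        4
13        9
16        36
ΣX = 60   Σ(X - X̄)² = 102

$$\bar{X} = \frac{\sum X}{n} = \frac{60}{6} = 10$$
Biased estimate:
$$S = \sqrt{\frac{\sum (X - \bar{X})^2}{n}} = \sqrt{\frac{102}{6}} = 4.123$$
Unbiased estimate:
$$S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{\frac{102}{5}} = 4.517$$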
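Q. Find the S.D and the Coefficient of Variation of the given data (shortcut formula).

X (cm)    X²
4         16
6         36
9         81
12        144
13        169
16        256
ΣX = 60   ΣX² = 702,   n = 6

$$\bar{X} = \frac{\sum X}{n} = \frac{60}{6} = 10 \text{ cm}$$
Biased estimate:
$$S = \sqrt{\frac{1}{n}\left[\sum X^2 - \frac{(\sum X)^2}{n}\right]} = \sqrt{\frac{1}{6}\left[702 - \frac{(60)^2}{6}\right]} = \sqrt{17} = 4.123 \text{ cm}$$
Unbiased estimate:
$$S = \sqrt{\frac{1}{n - 1}\left[\sum X^2 - \frac{(\sum X)^2}{n}\right]} = \sqrt{\frac{1}{5}\left[702 - \frac{(60)^2}{6}\right]} = \sqrt{20.4} = 4.517 \text{ cm}$$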
Coefficient of Variation (CV)
• Always expressed as a percentage (%)
• Shows relative variability, i.e. variation relative to the
mean (the magnitude of the data)
• Can be used to compare two or more sets of data
measured in different units, or in the same units but with different
average sizes

$$C.V = \frac{S}{\bar{X}} \times 100\%$$
Coefficient of Variation (CV)

$\bar{X} = 10$
$S = 4.123$

$$CV = \frac{4.123}{10} \times 100\% = 41.23\%$$
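The standard deviation and coefficient of variation for the six observations (4, 6, 9, 12, 13, 16) can be reproduced with the sketch below. The function names are illustrative, and the CV here uses the biased S.D, as in the worked value of 41.23%.

```python
import math


def std_dev(xs, unbiased=False):
    """Positive square root of the variance (divisor n, or n - 1 if unbiased)."""
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)
    return math.sqrt(ss / (n - 1 if unbiased else n))


def coefficient_of_variation(xs, unbiased=False):
    """Standard deviation as a percentage of the mean."""
    mean = sum(xs) / len(xs)
    return 100 * std_dev(xs, unbiased) / mean


data = [4, 6, 9, 12, 13, 16]
print(round(std_dev(data), 3))                   # 4.123
print(round(std_dev(data, unbiased=True), 3))    # 4.517
print(round(coefficient_of_variation(data), 2))  # 41.23
```

The standard library functions `statistics.pstdev` and `statistics.stdev` correspond to the biased and unbiased versions respectively.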

Comparing Coefficient of Variation
Mr. Ali
Average marks out of 60 = $\bar{X}_1$ = 53
Standard deviation = $S_1$ = 5
$$C.V_{\text{Ali}} = \frac{S_1}{\bar{X}_1} \times 100\% = \frac{5}{53} \times 100\% = 9.43\%$$

Mr. Zain
Average marks out of 80 = $\bar{X}_2$ = 72
Standard deviation = $S_2$ = 10
$$C.V_{\text{Zain}} = \frac{S_2}{\bar{X}_2} \times 100\% = \frac{10}{72} \times 100\% = 13.89\%$$

Mr. Ali has the smaller C.V, so he is more consistent in performance.
Comparing Coefficient of Variation
• Stock A:
o Average price last year = $50
o Standard deviation = $5
$$CV_A = \frac{s}{\bar{x}} \times 100\% = \frac{\$5}{\$50} \times 100\% = 10\%$$
• Stock B:
o Average price last year = $100
o Standard deviation = $5
$$CV_B = \frac{s}{\bar{x}} \times 100\% = \frac{\$5}{\$100} \times 100\% = 5\%$$

Both stocks have the same standard deviation, but stock B is less
variable relative to its price.
Coefficient of Variation
Summary statistics for WEIGHT and HEIGHT (both ratio variables) of Pakistani adults in different units:
         Weight            Height
Mean     160 pounds        66 inches
         72.6 kilograms    5.5 feet
         0.08 tons         168 centimeters
SD       30 pounds         4 inches
         13.6 kilograms    0.33 feet
         0.015 tons        10.2 centimeters

Which variable [WEIGHT or HEIGHT] has greater dispersion? [No meaningful answer can be given]
Which variable has greater dispersion relative to its average, i.e., the greater Coefficient of Variation (SD
relative to mean)?

$$CV_{\text{Weight}} = \frac{S}{\bar{X}} \times 100\% = \frac{30}{160} \times 100\% = \frac{13.6}{72.6} \times 100\% = \frac{0.015}{0.08} \times 100\% \approx 18.7\%$$
$$CV_{\text{Height}} = \frac{S}{\bar{X}} \times 100\% = \frac{4}{66} \times 100\% = \frac{0.33}{5.5} \times 100\% = \frac{10.2}{168} \times 100\% \approx 6.1\%$$

Weight therefore has the greater dispersion relative to its mean.
Note that the Coefficient of Variation is a pure number, not expressed in any units, and is the same
whatever units the variable is measured in.
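The unit-free behaviour of the coefficient of variation can be illustrated with a small Python sketch that reuses the figures from the slides (the two stocks, and weight and height in three units each). The helper `cv_percent` and the dictionary layout are assumptions made for this example; small differences between units come only from the rounding of the tabulated means and SDs.

```python
def cv_percent(mean, sd):
    """Coefficient of variation: SD as a percentage of the mean."""
    return 100 * sd / mean


# (mean, SD) pairs taken from the slides.
stocks = {"A": (50, 5), "B": (100, 5)}
weight = {"pounds": (160, 30), "kilograms": (72.6, 13.6), "tons": (0.08, 0.015)}
height = {"inches": (66, 4), "feet": (5.5, 0.33), "centimeters": (168, 10.2)}

for name, (mean, sd) in stocks.items():
    print(f"Stock {name}: CV = {cv_percent(mean, sd):.2f}%")

for label, table in (("Weight", weight), ("Height", height)):
    for unit, (mean, sd) in table.items():
        # The CV stays (approximately) the same whatever unit is used.
        print(f"{label} in {unit}: CV = {cv_percent(mean, sd):.2f}%")
```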
Choosing Appropriate
Measure of Variability
• If data are symmetric, with no serious outliers, use
range and standard deviation.
• If data are skewed, and/or have serious outliers,
use IQR.
• If comparing variation across two data sets, use
coefficient of variation (C.V)

Five Number Summary
The five number summary of a data set consists of the
minimum value, the first quartile, the second
quartile, the third quartile and the maximum value,
written in that order: Min, Q1, Q2, Q3, Max.

From the three quartiles we can obtain a measure of
central tendency (the median, Q2) and measures of
variation of the two middle quarters of the
distribution: Q2-Q1 for the second quarter and Q3-Q2
for the third quarter.
Five Number Summary
Example: The following data set shows the weekly TV viewing
times, in hours.
25 41 27 32 43 66 35 31 15 5
34 26 32 38 16 30 38 30 20 21

Determine the five number summary.


The array of the above data is given below:
5 15 16 20 21 25 26 27 30 30
31 32 32 34 35 37 38 41 43 66

Five Number Summary
$X_0$ = Min value = 5
$X_m$ = Max value = 66

Location of $Q_1$ = $\frac{1(20+1)}{4}$th observation in the data = 5.25th observation
Value of $Q_1$ = 5th obs. + 0.25{6th obs. - 5th obs.}
= 21 + 0.25{25 - 21} = 22.0 hours

Location of $Q_2$ = $\frac{2(20+1)}{4}$th observation in the data = 10.50th observation
Value of $Q_2$ = 10th obs. + 0.50{11th obs. - 10th obs.}
= 30 + 0.50{31 - 30} = 30.5 hours

Location of $Q_3$ = $\frac{3(20+1)}{4}$th observation in the data = 15.75th observation
Value of $Q_3$ = 15th obs. + 0.75{16th obs. - 15th obs.}
= 35 + 0.75{37 - 35} = 36.5 hours
Construction of Box Whisker Plot

1. Minimum Value = 5.0
2. Q1 = 22.0
3. Q2 = 30.5
4. Q3 = 36.5
5. Maximum Value = 66.0

[Figure: vertical box-and-whisker plot of the data on a scale from 0 to 70]
Box and Whisker Plot
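[Figure: horizontal box-and-whisker plot on a scale from 0 to 65. The box runs from the Lower Quartile = 22 to the Upper Quartile = 36.5 with the Median = 30.5 marked inside; the whiskers extend to the Lowest Value = 5 and the Highest Value = 66.]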
Interpretation of Box-Whisker Plot

A box-whisker plot is useful for identifying:
• From the upper and lower whiskers:
the maximum and minimum values in the data
• From the line within the box, i.e. Q2:
the average size of the data
• From the length of the box, i.e. Q3-Q1 = IQR:
the variability in the data (a longer box indicates more variability)
• From the position of the line within the box:
the shape of the data
Line at the centre of the box: symmetrical
Line above the centre of the box: negatively skewed
Line below the centre of the box: positively skewed
• Detection of outliers in the data
Outliers
Outliers are values that fall well outside the overall pattern of
the data. An outlier may be
• The result of a measurement or recording error.
• A member of a different population than the rest of the sample.
• Simply an unusual extreme value.

An extreme value need not be an outlier;
it may instead be an indication of
skewness.

Professor John Wilder Tukey suggested a method for defining
outliers: we can use the quartiles and the IQR = Q3-Q1 to identify
them.
Detection of Outliers
Example: The following data set shows the weekly TV viewing
times, in hours.
25 41 27 32 43 66 35 31 15 5
34 26 32 38 16 30 38 30 20 21

Determine the outliers if any


The array of the above data is given below:
5 15 16 20 21 25 26 27 30 30
31 32 32 34 35 37 38 41 43 66

Detection of Outliers
$X_0$ = Min value = 5
$X_m$ = Max value = 66

Location of $Q_1$ = $\frac{1(20+1)}{4}$th observation in the data = 5.25th observation
Value of $Q_1$ = 5th obs. + 0.25{6th obs. - 5th obs.}
= 21 + 0.25{25 - 21} = 22.0 hours

Location of $Q_2$ = $\frac{2(20+1)}{4}$th observation in the data = 10.50th observation
Value of $Q_2$ = 10th obs. + 0.50{11th obs. - 10th obs.}
= 30 + 0.50{31 - 30} = 30.5 hours

Location of $Q_3$ = $\frac{3(20+1)}{4}$th observation in the data = 15.75th observation
Value of $Q_3$ = 15th obs. + 0.75{16th obs. - 15th obs.}
= 35 + 0.75{37 - 35} = 36.5 hours
Determine Inner and Outer Fences
Given Q1 = 22.0, Q2 = 30.5 and Q3 = 36.5,
the inner fences and outer fences are defined as follows:
$$\text{I.Q.R} = Q_3 - Q_1 = 36.5 - 22 = 14.5$$
$$\text{Lower Inner Fence (LIF)} = Q_1 - 1.5(\text{IQR}) = 22 - 1.5(14.5) = 0.25$$
$$\text{Upper Inner Fence (UIF)} = Q_3 + 1.5(\text{IQR}) = 36.5 + 1.5(14.5) = 58.25$$
$$\text{Lower Outer Fence (LOF)} = Q_1 - 3(\text{IQR}) = 22 - 3(14.5) = -21.5$$
$$\text{Upper Outer Fence (UOF)} = Q_3 + 3(\text{IQR}) = 36.5 + 3(14.5) = 80.0$$
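The fence calculation and the resulting classification of observations can be sketched as follows. The function names and the dictionary keys (LOF, LIF, UIF, UOF) follow the abbreviations used in the slides; treating values that sit exactly on an inner fence as normal is an assumption of this sketch.

```python
def tukey_fences(q1, q3):
    """Inner fences at 1.5 * IQR and outer fences at 3 * IQR from the quartiles."""
    iqr = q3 - q1
    return {
        "LOF": q1 - 3 * iqr,    # lower outer fence
        "LIF": q1 - 1.5 * iqr,  # lower inner fence
        "UIF": q3 + 1.5 * iqr,  # upper inner fence
        "UOF": q3 + 3 * iqr,    # upper outer fence
    }


def classify_outliers(data, q1, q3):
    """Split data into mild (suspected) and sure outliers."""
    f = tukey_fences(q1, q3)
    mild = [x for x in data
            if f["LOF"] <= x < f["LIF"] or f["UIF"] < x <= f["UOF"]]
    sure = [x for x in data if x < f["LOF"] or x > f["UOF"]]
    return mild, sure


hours = [25, 41, 27, 32, 43, 66, 35, 31, 15, 5,
         34, 26, 32, 38, 16, 30, 38, 30, 20, 21]

print(tukey_fences(22.0, 36.5))
# {'LOF': -21.5, 'LIF': 0.25, 'UIF': 58.25, 'UOF': 80.0}
print(classify_outliers(hours, 22.0, 36.5))  # ([66], [])
```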
Box and Whisker Plot
[Figure: horizontal box-and-whisker plot on a scale from -30 to 90 with LOF = -21.5, LIF = 0.25, Median = 30.5, UIF = 58.25 and UOF = 80 marked; the value 66 is flagged as a suspected outlier.]
Identification of Suspected and Sure Outliers
1. Values that lie within the
inner fences are normal
values.
2. Values that lie
outside the inner fences but
inside the outer fences are
possible/suspected/mild
outliers.
3. Values that lie
outside the outer fences are
sure outliers.

Plot each suspected outlier with an asterisk and
each sure outlier with a hollow dot.
In this example, only 66 is a mild outlier.

[Figure: vertical box plot on a scale from 0 to 80 with the value 66 marked by an asterisk]
Treatment for outlier
When an outlier is found, its cause should be determined if
possible.
1. If it is discovered that an outlier is due to a measurement or
recording error or that for some other reason it clearly does not
belong to the set of data, then the outlier should be removed
from the data.
2. It sometimes helps to determine the cause of extreme values,
because outliers can often provide useful insights to the
situations under consideration (such as better ways of doing
things).
3. If no explanation for the outlier is apparent, then the decision
whether to retain it in the set of data can be difficult and
calls for a judgment by the researcher.
Skewness
A distribution in which the values equidistant from the centre have equal frequencies is defined
to be symmetrical and any departure from symmetry is called skewness. That is, in
symmetrical distributions the points equidistant from the centre have equal concentration.
The frequency curve for a symmetrical distribution can be folded along the central
maximum in such a way that the two halves of the curve coincide.

Skewness
For a symmetrical distribution, the following relations hold:

1. Length of Right Tail = Length of Left Tail
2. Mean = Median = Mode
3. (Q3-Q2) = (Q2-Q1)
4. All odd order moments about the mean vanish, i.e.,
m3 = m5 = … = m2n-1 = 0
5. b1 = 0
6. g1 = 0
M. Yaseen, Deptt. of Math & Stat, UAF
Positive Skewness
A distribution is positively skewed, if the observations tend to concentrate
more at the lower end of the possible values of the variable than the
upper end. A positively skewed frequency curve has a longer tail on
the right hand side.

Positive Skewness
For a positively skewed distribution the following
relations hold:

1. Length of Right Tail > Length of Left Tail


2. Mean > Median > Mode
3. (Q3-Q2) > (Q2-Q1)
4. m3 > 0
5. b1 > 0
6. g1 > 0

Negative Skewness
A distribution is negatively skewed, if the observations tend to concentrate
more at the upper end of the possible values of the variable than the
lower end. A negatively skewed frequency curve has a longer tail on
the left side.

Negative Skewness
For a negatively skewed distribution,
the following relations hold:

1. Length of Right Tail < Length of Left Tail


2. Mean < Median < Mode
3. (Q3-Q2) < (Q2-Q1)
4. m3 < 0
5. b1 > 0
6. g1 < 0

Measures of Skewness
A distribution in which the values equidistant from the centre
have equal frequencies is defined to be symmetrical, and
any departure from symmetry is called skewness.
Karl Pearson Coefficient of Skewness
$$1.\ Sk = \frac{\text{Mean} - \text{Mode}}{S}$$
$$2.\ Sk = \frac{3(\text{Mean} - \text{Median})}{S}$$
Bowley Coefficient of Skewness
$$Sk = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}$$
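From the information given below, find the Coefficients of Skewness.

Mean = 140, Median = Q2 = 142, Mode = 144,
Q1 = 62, Q3 = 195, S = 30

Karl Pearson Coefficient of Skewness
$$1.\ Sk = \frac{\text{Mean} - \text{Mode}}{S} = \frac{140 - 144}{30} = -0.13$$
(The original working substitutes 149 for the mode, which would give -0.30; the listed Mode = 144 is used here.)
$$2.\ Sk = \frac{3(\text{Mean} - \text{Median})}{S} = \frac{3(140 - 142)}{30} = -0.20$$
Bowley Coefficient of Skewness
$$Sk = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1} = \frac{195 + 62 - 2(142)}{195 - 62} = \frac{-27}{133} = -0.20$$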
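The three skewness coefficients can be evaluated with the sketch below, using the listed summary values (Mean = 140, Median = 142, Mode = 144, Q1 = 62, Q3 = 195, S = 30). The function names are illustrative, and the Bowley coefficient is assumed to use the usual denominator Q3 - Q1.

```python
def pearson_skew_mode(mean, mode, sd):
    """Karl Pearson's first coefficient: (Mean - Mode) / S."""
    return (mean - mode) / sd


def pearson_skew_median(mean, median, sd):
    """Karl Pearson's second coefficient: 3 (Mean - Median) / S."""
    return 3 * (mean - median) / sd


def bowley_skew(q1, q2, q3):
    """Bowley's quartile coefficient: (Q3 + Q1 - 2 Q2) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * q2) / (q3 - q1)


print(round(pearson_skew_mode(140, 144, 30), 2))    # -0.13
print(round(pearson_skew_median(140, 142, 30), 2))  # -0.2
print(round(bowley_skew(62, 142, 195), 2))          # -0.2
```

All three values are negative, which points to a negatively skewed distribution.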
Measures of Kurtosis

Kurtosis is the degree of peakedness or flatness of a
unimodal (single-humped) distribution.
• When the values of a variable are highly concentrated
around the mode, the peak of the curve becomes relatively
high; the curve is leptokurtic.
• When the values of a variable have low concentration
around the mode, the peak of the curve becomes relatively
flat; the curve is platykurtic.
• A curve that is neither very peaked nor very flat-topped
is taken as the basis for comparison and is
called mesokurtic (normal).
Measures of Kurtosis
Coefficient of Kurtosis = K
$$K = \frac{Q_3 - Q_1}{2(P_{90} - P_{10})}$$

1. If K > 0.263, the distribution is leptokurtic.
2. If K = 0.263, the distribution is mesokurtic.
3. If K < 0.263, the distribution is platykurtic.
From the information given below, find the Coefficient of
Kurtosis.
P90 = 210, P10 = 35, Q1 = 62, Q3 = 195, Q2 = 142

$$K = \frac{Q_3 - Q_1}{2(P_{90} - P_{10})} = \frac{195 - 62}{2(210 - 35)} = \frac{133}{350} = 0.38$$
Since K = 0.38 > 0.263, the distribution is leptokurtic.
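The percentile coefficient of kurtosis and its classification against the reference value 0.263 can be sketched as follows. The function names and the `reference` parameter are illustrative assumptions.

```python
def percentile_kurtosis(q1, q3, p10, p90):
    """K = (Q3 - Q1) / (2 (P90 - P10))."""
    return (q3 - q1) / (2 * (p90 - p10))


def classify_kurtosis(k, reference=0.263):
    """Compare K with the value for a normal (mesokurtic) curve."""
    if k > reference:
        return "leptokurtic"
    if k < reference:
        return "platykurtic"
    return "mesokurtic"


k = percentile_kurtosis(q1=62, q3=195, p10=35, p90=210)
print(round(k, 2), classify_kurtosis(k))  # 0.38 leptokurtic
```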
Describing a Frequency Distribution
To describe the major characteristics of a frequency
distribution, the following five quantities are needed:

1. The number of observations that describes the size of the data.


2. A measure of central tendency such as the mean or median
that provides information about the centre or average value.
3. A measure of dispersion such as standard deviation that
indicates the variability of the data.
4. A measure of skewness that shows the lack of symmetry in
the frequency distribution.
5. A measure of kurtosis that gives information about its peakedness.

The Empirical Rule
If the data distribution is approximately bell-shaped,
then the interval:
• $\bar{X} \pm S$ contains about 68% of the values
• $\bar{X} \pm 2S$ contains about 95% of the values
• $\bar{X} \pm 3S$ contains about 99.7% of the values
Question-1:
‘How many measurements are within 1 standard deviation of
the mean?’
Question-2:
‘How many measurements are within 2 standard
deviations?’
and so on.

For any specific data set, we can answer these questions by


counting the number of measurements in each of the
intervals.
However, if we are interested in obtaining a general
answer to these questions the problem is a bit more difficult.
Next, let us consider the Empirical
Rule.
This is a rule of thumb that applies to data sets
with frequency distributions that are mound-shaped
and symmetric, as follows:
[Figure: mound-shaped, symmetric frequency curve; horizontal axis: Measurements, vertical axis: Relative Frequency]
According to the empirical rule:
a) Approximately 68% of the measurements will fall
within 1 standard deviation of the mean, i.e. within
the interval $(\bar{X} - S,\ \bar{X} + S)$
b) Approximately 95% of the measurements will fall
within 2 standard deviations of the mean, i.e. within
the interval $(\bar{X} - 2S,\ \bar{X} + 2S)$
c) Approximately 100% (practically all) of the
measurements will fall within 3 standard deviations
of the mean, i.e. within the interval $(\bar{X} - 3S,\ \bar{X} + 3S)$
EXAMPLE. The percentages of revenues spent on R&D (i.e. Research
and Development) by 50 companies are:
13.5   9.5   8.2   6.5   8.4   8.1   6.9   7.5  10.5  13.5
 7.2   7.1   9.0   9.9   8.2  13.2   9.2   6.9   9.6   7.7
 9.7   7.5   7.2   5.9   6.6  11.1   8.8   5.2  10.6   8.2
11.3   5.6  10.1   8.0   8.5  11.7   7.1   7.7   9.4   6.0
 8.0   7.4  10.5   7.8   7.9   6.5   6.9   6.5   6.8   9.5

Calculate the proportions of these measurements
that lie within the intervals $\bar{X} \pm S$, $\bar{X} \pm 2S$ and $\bar{X} \pm 3S$, and
compare the results with the theoretical values. The
mean and standard deviation of these data come out to be
$\bar{X} = 8.49$ and $S = 1.98$, respectively.
Hence

$(\bar{X} - S,\ \bar{X} + S)$
= (8.49 – 1.98, 8.49 + 1.98)
= (6.51, 10.47)
A check of the measurements reveals that 34 of the 50
measurements, or 68%, fall between 6.51 and 10.47.
Similarly, the interval
$(\bar{X} - 2S,\ \bar{X} + 2S)$
= (8.49 – 3.96, 8.49 + 3.96)
= (4.53, 12.45)
contains 47 of the 50 measurements, i.e. 94% of
the data values.
Finally, the 3-standard-deviation interval around the mean,
$(\bar{X} - 3S,\ \bar{X} + 3S)$
= (8.49 – 5.94, 8.49 + 5.94)
= (2.55, 14.43)
contains all, or 100%, of the measurements.
In spite of the fact that the distribution of these
data is skewed to the right, the percentages of data
values falling within 1, 2, and 3 standard deviations of
the mean are remarkably close to the theoretical
values (68%, 95%, and 100%) given by the Empirical
Rule.
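The counts in this example can be verified by a direct tally. The sketch below uses the 50 R&D percentages and the stated mean 8.49 and standard deviation 1.98 as given constants rather than recomputing them; `share_within` is an illustrative helper name.

```python
rd_percent = [
    13.5, 9.5, 8.2, 6.5, 8.4, 8.1, 6.9, 7.5, 10.5, 13.5,
    7.2, 7.1, 9.0, 9.9, 8.2, 13.2, 9.2, 6.9, 9.6, 7.7,
    9.7, 7.5, 7.2, 5.9, 6.6, 11.1, 8.8, 5.2, 10.6, 8.2,
    11.3, 5.6, 10.1, 8.0, 8.5, 11.7, 7.1, 7.7, 9.4, 6.0,
    8.0, 7.4, 10.5, 7.8, 7.9, 6.5, 6.9, 6.5, 6.8, 9.5,
]
MEAN, SD = 8.49, 1.98  # values stated in the example


def share_within(data, mean, sd, k):
    """Count and percentage of values within k standard deviations of the mean."""
    low, high = mean - k * sd, mean + k * sd
    inside = sum(low <= x <= high for x in data)
    return inside, 100 * inside / len(data)


for k in (1, 2, 3):
    count, percent = share_within(rd_percent, MEAN, SD, k)
    print(k, count, percent)
# 1 34 68.0
# 2 47 94.0
# 3 50 100.0
```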
