
49 | G a b i n o P .

P e t i l o s

MEASURES OF LOCATIONS

3.1 INTRODUCTION
In the previous unit, you learned how to organize data into tables and how to
present the same data in various graphical forms. From the graph of the data, it is
possible to extract information about the values on the scale where most of the other
scores are clustered. These values are called “central values” or “averages” which can
only be estimated from the table or graph. At times, instead of a central value, we may
want to find a value below which or above which we can find a certain percent of the
entire set of scores. These “noncentral values” are called quantiles.

In this unit, you will learn how to find the exact location of the central values and
quantiles of a given set of data or distribution. Central values of a distribution are called
“measures of central tendency” while quantiles are called “measures of relative
position”.

3.2 MEASURES OF CENTRAL TENDENCY

In everyday life, we are often exposed to statements involving the concept of
“average”, such as “the average price of a car”, “the average amount of rainfall in
the Philippines” or “the average salary of nurses abroad”. Informed individuals
generally make decisions based on their understanding of this important idea.

There are three measures of central tendency, namely, the mean, the median and
the mode. Each of these measures will be discussed separately. The discussion will be
based on raw data (data gathered in their original form) and grouped data (data that
have been organized into a frequency distribution).

3.2.1 Mean (Raw Data)

For raw data, the mean is defined as the arithmetic average of a set of data. To
find the mean, we first sum up all the scores and divide the resulting sum by the total
number of cases. The Greek capital letter sigma ($\Sigma$) is used to denote a sum. When
followed by a variable, say X, the notation $\sum X$ means “the sum of the values of X”.
Because the values of the variable are added, the mean is appropriate when the data are
measured in at least the interval scale.

Since data can be obtained from a defined population or sample, we have to
understand some symbols that are universally used to denote the mean. Table 3.1 below
contains these symbols.

Table 3.1 Notations for Population Mean, Sample Mean, and Sample Size

Type of Data   Symbol for Mean                          Size   Formula for Mean
Population     $\mu$ (Greek letter mu), a parameter     N      $\mu = \frac{\sum X}{N}$
Sample         $\bar{X}$ (read: x bar), a statistic     n      $\bar{X} = \frac{\sum X}{n}$

The population mean is denoted by the Greek letter $\mu$ and is defined by
$\mu = \frac{\sum X}{N}$, while the sample mean is denoted by $\bar{X}$ and is defined by
$\bar{X} = \frac{\sum X}{n}$. Note that the formulas are almost identical except for the
denominator. Thus, for population data, we divide the sum of all scores by the
population size N, while for sample data, we divide the sum by the sample size n.

Since it is not always feasible to study the entire population, the parameter $\mu$ is
usually estimated using the statistic $\bar{X}$. Let us now consider some examples.

Example 3.1 Compute the average score in science of 12 students who obtained the
following scores: 22, 24, 17, 21, 19, 18, 21, 30, 30, 17, 15, and 20.

Solution: Using a calculator, we have $\sum X = 254$ and n = 12. Therefore

$$\bar{X} = \frac{\sum X}{n} = \frac{254}{12} \approx 21.167 \approx 21.17$$

Because the mean is obtained by division, we have to follow a convention for
rounding off the result. In research, the rule is to round off heavily, which means that
we retain only the digits that are meaningful. For instance, the extra digits after the
first digit 6 of the value 21.16666667 do not add information about the mean value.

In statistics, we round off the mean to one more decimal place from the original
data, which means that if the data are gathered to the nearest whole numbers, the mean
should be reported to the nearest tenths; if the data are gathered to the nearest tenths,
the mean should be reported to the nearest hundredths etc. Following this convention,
we report the mean score above as 21.2.

In the preceding example, the mean score of 21.2 is not one of the given scores.
Such is the property of the mean, which means that more often than not, the mean as an
average will be different from all the given scores. Also, when we say that the average
score is 21.2, we are not saying that all students got a score of 21.2, but rather, we say
that the scores of the students are close to this value.

If the perfect score is 40, then, the performance of the 12 students in the science
test is not quite good since the value tells us that most of them got only nearly half of the
total score, on the average. The mean, as a single index, is very useful in describing the
entire data set, whether we are dealing with sample or population data. The mean is
also useful in comparing two or more sets of sample or population data.

Example 3.2 The following data represent the net take home pay of five rank and file
employees of a certain company. Determine the average net take home
pay of the employees.

NET TAKE HOME PAY: P4,750; P4,535; P4,380; P3,895; and P9,307.

Solution: Summing up the amounts and dividing by 5, we have

$$\bar{X} = \frac{\sum X}{n} = \frac{4,750 + 4,535 + 4,380 + 3,895 + 9,307}{5} = \frac{26,867}{5} = 5,373.40 \text{ (pesos)}.$$

Therefore the average take home pay of the employees is P 5,373.40 .

In this example, the average net take home pay does not seem to represent (that is,
it is not typical of) the net take home pay of the five employees, since the average
value is higher than four of the five amounts. This occurred because of one extremely
high amount of P9,307. Thus, the mean is affected by extreme values, so that when such
extreme values occur, the mean does not represent the entire set of data well. However,
the mean is amenable to further mathematical manipulation, which makes it popular in
inferential statistics, as we shall see later.
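The computations in Examples 3.1 and 3.2 can be sketched in a few lines of Python. This is only an illustration: the function name `raw_mean` and the "what if" step that drops the extreme amount are assumptions added here, not part of the original examples.

```python
# Data taken from Examples 3.1 and 3.2.
science_scores = [22, 24, 17, 21, 19, 18, 21, 30, 30, 17, 15, 20]
take_home_pay = [4750, 4535, 4380, 3895, 9307]


def raw_mean(values):
    """Return the sum of the values divided by the number of cases."""
    return sum(values) / len(values)


mean_score = raw_mean(science_scores)   # 254 / 12
mean_pay = raw_mean(take_home_pay)      # 26,867 / 5

# Hypothetical "what if": drop the extreme amount to see its pull on the mean.
pay_without_extreme = [p for p in take_home_pay if p != 9307]
mean_pay_without_extreme = raw_mean(pay_without_extreme)

print(round(mean_score, 1))   # reported to one more decimal place than the data
print(round(mean_pay, 2))
print(round(mean_pay_without_extreme, 2))
```

Without the amount of P9,307, the mean of the remaining four amounts is P4,390.00, which is much closer to the typical take home pay. This shows how strongly a single extreme value can pull the mean.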

3.2.2 Mean (Grouped Data)

Sometimes the data available are already sorted out and organized into a table
or frequency distribution. If the mean of the data is desired, then we have to estimate it
from the information available in the frequency distribution. Normally, only the class
intervals and frequency counts are given. How do we estimate the mean of the data if
the scores are not available? Let us illustrate this problem by considering the
hypothetical frequency distribution of the mental ability of 10 students as shown in
Table 3.2.

Table 3.2 Distribution of Mental Ability of 10 Students

IQ (Mental Ability)    f     Percent
119 – 127              2     20%
110 – 118              5     50%
101 – 109              3     30%
Total                  10    100%

Since the IQ scores are not known, we can estimate them using the class marks
or midpoints of the intervals. Here, we assume that the average of the scores in a
particular interval is equal to the midpoint of the interval. The midpoints of the class
intervals are 105, 114 and 123; therefore, using the frequency counts, we can reconstruct
the 10 scores using these midpoints. Thus the scores in ascending order are: 105, 105,
105, 114, 114, 114, 114, 114, 123, and 123 (why?). The sum of these estimated scores is
1,131. Therefore the estimated mean IQ of the 10 students is given by

$$\bar{X} = \frac{1,131}{10} = 113.1.$$

The technique presented above is tedious to apply, especially when there
are many class intervals and the frequency counts are large numbers. Fortunately, the
above idea can be used to generate a formula for computing the mean of grouped data.
This formula is given by

$$\bar{X} = \frac{\sum fX}{N}$$ (Equation 1)

where f = class frequency; X = class mark; and N is the total number of cases in
the frequency distribution. The symbol $\sum fX$ means adding the products of all
paired class marks and class frequencies.

Example 3.3 Find the mean of the data as shown in the frequency distribution below.

Class Interval f
95 – 99 4
90 – 94 8
85 – 89 12
80 – 84 7
75 – 79 6
70 – 74 2
65 – 69 1

Solution: First, we expand the table by constructing the columns for X (midpoints) and
fX , (the products of corresponding midpoints and class frequencies). The
constructed table is shown below.

Class Interval f X fX
95 – 99 4 97 388
90 – 94 8 92 736
85 – 89 12 87 1,044
80 – 84 7 82 574
75 – 79 6 77 462
70 – 74 2 72 144
65 – 69 1 67 67
Sum 40 3,415

Using the summary values from the table, the mean of the 40 scores is given
by

$$\bar{X} = \frac{\sum fX}{N} = \frac{3,415}{40} = 85.375 \approx 85.38.$$

We rounded off the computed mean to the nearest hundredths because the
class intervals actually extend up to the nearest tenths below and above the given
apparent limits.

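Remark: Equation 1 can be used only when the lowest and highest class intervals are not
open-ended (why?). Strictly speaking, Equation 1 does not require equal class widths,
since only the class marks are used; equal widths are needed for the coded deviation
formula (Equation 3).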

When there are many class intervals and the class frequencies are large numbers,
Equation 1 may not be convenient to use. However this formula can be simplified by
transforming the midpoints X into some numbers u defined in Equation 2.

$$u = \frac{X - A}{c}$$ (Equation 2)

In Equation 2, A is the assumed mean (which could be any of the class midpoints) while c
is the class size. If we solve for X in this equation and substitute the result in Equation 1,
we arrive at the following equation:

$$\bar{X} = A + \left(\frac{\sum fu}{N}\right)(c)$$ (Equation 3)

Equation 3 is called the “coded deviation” formula for calculating the mean of grouped
data.
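The substitution that leads to Equation 3 can be written out step by step. The sketch below uses only the symbols already defined (f, X, u, A, c, N) and the fact that $\sum f = N$.

```latex
\begin{align*}
u = \frac{X - A}{c} \quad &\Longrightarrow \quad X = A + cu \\
\sum fX &= \sum f(A + cu) = A\sum f + c\sum fu = AN + c\sum fu \\
\bar{X} = \frac{\sum fX}{N} &= \frac{AN + c\sum fu}{N} = A + \left(\frac{\sum fu}{N}\right)(c)
\end{align*}
```

Because A and c are constants, they can be factored out of the sums, which is why any class midpoint may serve as the assumed mean.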

Example 3.4 Let us compute the mean of the same data as presented in Example 3.3.

Class Interval f
95 – 99 4
90 – 94 8
85 – 89 12
80 – 84 7
75 – 79 6
70 – 74 2
65 – 69 1

Solution: First let us decide on our assumed mean A. As mentioned earlier, A can be
any of the class midpoints. However, it is easier to compute the mean of the
entire data by choosing A corresponding to the midpoint of the middle class
interval. When there are two middle class intervals (which arises when the
number of categories is even), we choose for our A the midpoint of the class
interval with the larger frequency. From the frequency distribution above,
our A is the midpoint of the class interval 80 – 84 which is 82. Hence A = 82.

We present again the frequency distribution with additional columns
indicating the midpoints X, the values of u and the values of the products fu.

Class Interval f X u fu
95 – 99 4 97 3 12
90 – 94 8 92 2 16
85 – 89 12 87 1 12
80 – 84 7 A = 82 0 0
75 – 79 6 77 -1 -6
70 – 74 2 72 -2 -4
65 – 69 1 67 -3 -3
N = $\sum f = 40$          $\sum fu = 27$

Note that the value of u corresponding to the assumed mean A is 0
while the values of u for all higher intervals and all lower intervals are
consecutive positive integers and consecutive negative integers, respectively
(Check this using the formula!). Such is always the case when the coded
deviation method is used. Thus, without using the formula, the values of u can
be generated in this manner.

From the information gathered, we have

$$\bar{X} = A + \left(\frac{\sum fu}{N}\right)(c) = 82 + \left(\frac{27}{40}\right)(5) = 85.375$$

or 85.38 as before.

[Note: Verify that other values of A will yield the same value of the mean $\bar{X}$.]
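The two grouped-data formulas can be compared with a short Python sketch. The function names `grouped_mean` and `coded_mean` are illustrative choices, and the data are the midpoints and frequencies of Example 3.3; the loop at the end carries out the suggested check that every choice of A gives the same mean.

```python
# Class marks (midpoints) and frequencies from Example 3.3.
midpoints = [97, 92, 87, 82, 77, 72, 67]
freqs = [4, 8, 12, 7, 6, 2, 1]
class_size = 5


def grouped_mean(midpoints, freqs):
    """Equation 1: sum of f*X divided by N."""
    n = sum(freqs)
    return sum(f * x for f, x in zip(freqs, midpoints)) / n


def coded_mean(midpoints, freqs, assumed_mean, class_size):
    """Equation 3: A + (sum of f*u / N) * c, with u = (X - A) / c."""
    n = sum(freqs)
    sum_fu = sum(f * (x - assumed_mean) / class_size
                 for f, x in zip(freqs, midpoints))
    return assumed_mean + (sum_fu / n) * class_size


print(grouped_mean(midpoints, freqs))
print(coded_mean(midpoints, freqs, assumed_mean=82, class_size=class_size))

# Any class midpoint can serve as the assumed mean A.
for a in midpoints:
    print(a, round(coded_mean(midpoints, freqs, a, class_size), 3))
```

Both functions should agree up to floating-point rounding, since Equation 3 is only an algebraic rearrangement of Equation 1.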

3.2.3 Weighted Mean

Suppose the mean ages of the pupils in four Grade Six classes with 30, 40, 42, and
38 pupils are 12.62 years, 13.25 years, 13.75 years and 12.80 years, respectively.
How do we get the mean age of all the pupils in these four classes? If we simply take
the mean of the four means, the result is 13.105.

If we apply the definition of the mean, we have to sum up the ages of all the
pupils in these four classes and divide it by the total number of pupils which is 30 + 40 +
42 + 38 = 150. The sum of the ages in all classes may be obtained by multiplying the
mean age per class by the number of pupils per class. Thus, the sum of the ages of all
pupils is given by

$$30(12.62) + 40(13.25) + 42(13.75) + 38(12.80) = 1,972.5.$$

Therefore the actual mean age of the 150 pupils is $\frac{1,972.5}{150} = 13.15$, which is
not the same as the mean value 13.105 obtained by getting the average of the given
means per class.

We have the following formula for getting the weighted mean of a set of data
with given weights:

$$\bar{X}_w = \frac{\sum wX}{\sum w}$$

where w is the weight corresponding to a given score X.

The grade point average of a student in a given semester is actually a weighted
average since this value is obtained by multiplying each grade by the corresponding
number of units and the sum of these products is divided by the total number of units
enrolled by the student.

The mean for grouped data is also known as weighted mean where the
frequencies corresponding to the different class intervals serve as the weights to be
multiplied by the class marks.
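A minimal Python sketch of the weighted mean, using the four Grade Six classes above, is given below. The function name `weighted_mean` is an illustrative choice; the class sizes play the role of the weights w and the class means play the role of the scores X.

```python
# Class sizes (weights) and mean ages per class from the Grade Six illustration.
class_sizes = [30, 40, 42, 38]
class_means = [12.62, 13.25, 13.75, 12.80]


def weighted_mean(scores, weights):
    """Sum of w*X divided by the sum of the weights w."""
    return sum(w * x for w, x in zip(weights, scores)) / sum(weights)


overall_mean = weighted_mean(class_means, class_sizes)
mean_of_means = sum(class_means) / len(class_means)

print(round(overall_mean, 3))    # weighted by class size
print(round(mean_of_means, 3))   # ignores class size
```

The two results differ because the unweighted average treats a class of 30 pupils as if it counted as much as a class of 42 pupils.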

3.2.4 Median (Raw Data)

The median is the score where half of the total number of scores is found
below it and the other half above it. The median therefore is the middle score
when all the scores have been ranked. There is no common symbol for the
median. For our purpose, we will use the notation Md for median. Because
ranking of the data is involved in getting the median value, the median is an
appropriate measure of central tendency when the variable is measured in at
least the ordinal scale.

The following are the steps in finding the median based on raw data:

1. Arrange the data or scores in ascending order (from lowest to highest);
2. If n is odd, there will be a middle score. This middle score is the
median. If n is even, there will be two middle scores and the median is
taken as the arithmetic average of the two middle scores.

Because the median depends only on the middle score(s) of the ranked data
and not on the magnitudes of the extreme scores, it is preferred over the
mean whenever extreme values occur in a data set.

Example 3.5 Let us find the median value of the data in Example 3.2.

Solution: Arranging the data from lowest to highest, we get

P3,895
P4,380
P4,535 ← middle score
P4,750
P9,307.

Therefore the median net take home pay of the employees is
P4,535.00, which is more representative of the five amounts compared with
the mean of P5,373.40 computed earlier.

Example 3.6 The scores of nine students in a science test are: 22, 24, 17, 21, 19, 18, 21, 30,
30 . Find the median score.

Solution: Arranging the scores in ascending order and identifying the middle score, the
median score is 21.
17
18
19
21
21 ← middle score (Median), Md = 21
22
24
30
30

Example 3.7 The ages (in years) of six teachers are listed below: 42, 23, 24, 30, 27, and 34.
Find the median age of the six teachers.

Solution: Because the number of cases is 6 (even), there will be two middle scores. The
ages in ascending order are listed below. From the arranged data, the two
middle scores are 27 and 30. Hence, the median is 28.5.
23
24
27 ← middle score
30 ← middle score
34
42

$$Md = \frac{27 + 30}{2} = 28.5$$

When n is large, locating the position of the median by ocular inspection may not
be easy. However, based on the definition given and the parity (oddness or evenness) of
n, we can use the following formulas for locating the middle score(s).

• If n is odd, the median is the $\left(\frac{n+1}{2}\right)$th score;
• If n is even, the median is the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2} + 1\right)$th scores.

Thus, for instance, if n = 157 (odd), the median is the $\left(\frac{157 + 1}{2}\right)$th = 79th score. On
the other hand, if n is 346 (even), $\frac{346}{2} = 173$ and $\frac{346}{2} + 1 = 174$. Hence, the median is
the average of the 173rd and 174th scores.

Although the median is easier to determine than the mean, it is less stable since
its value depends only on the middle score(s) of the ranked data and not on the value
of every score. Also, the median is not as extensively used in inferential statistics as
the mean because the formula is not amenable to further mathematical manipulation.
However, when a set of data is skewed, the median will be more representative of the
data than the mean.
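The position rules for odd and even n translate directly into code. The sketch below is one possible way to write them in Python; the function name `median_raw` is an illustrative choice, and the data come from Examples 3.5 to 3.7. Note that Python lists are indexed from 0, so the (n+1)/2-th score sits at index n // 2 when n is odd.

```python
def median_raw(values):
    """Median of raw data using the odd/even position rules."""
    ordered = sorted(values)          # step 1: rank the scores
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the ((n + 1) / 2)-th score, which is index n // 2 (0-based)
        return ordered[mid]
    # n even: average of the (n / 2)-th and (n / 2 + 1)-th scores
    return (ordered[mid - 1] + ordered[mid]) / 2


take_home_pay = [4750, 4535, 4380, 3895, 9307]           # Example 3.5
science_scores = [22, 24, 17, 21, 19, 18, 21, 30, 30]    # Example 3.6
teacher_ages = [42, 23, 24, 30, 27, 34]                  # Example 3.7

print(median_raw(take_home_pay))
print(median_raw(science_scores))
print(median_raw(teacher_ages))
```

Replacing P9,307 with any larger amount leaves the median of the take home pay unchanged, which illustrates why the median is not affected by extreme values.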

3.2.5 Median (Grouped Data)

For grouped data, the median is usually computed on the assumption that the
scores in the class category containing the median are evenly distributed throughout that
interval. To find the median for grouped data, we first identify the median class, the class
interval containing the median. Since the median divides the distribution into two equal
parts, we first get 50% of the total number of cases or scores. We then identify the
interval containing the score where 50% of the cases would fall below this value.

Let us consider again the frequency distribution given in Example 3.3 (labeled Table
3.3 below). Since the total number of cases is 40, 50%(40) is 20. Hence, the median is
the 20th score. Since the scores are not given, we will use the idea of interpolation to
find the median value. Next, we have to construct the less than cumulative frequencies
(<cf) and reflect the class boundaries of the intervals (Table 3.4).

Table 3.3

Class Interval    f
95 – 99           4
90 – 94           8
85 – 89           12
80 – 84           7
75 – 79           6
70 – 74           2
65 – 69           1
                  $\sum f = 40$

Table 3.4

Class Interval    f     <cf
94.5 – 99.5       4     40
89.5 – 94.5       8     36
84.5 – 89.5       12    28    ← Median Class
79.5 – 84.5       7     16
74.5 – 79.5       6     9
69.5 – 74.5       2     3
64.5 – 69.5       1     1
                  $\sum f = 40$

Note that up to the score of 84.5, only 16 scores fall below this value, hence the
class interval 79.5 – 84.5 does not contain the median score. On the other hand, up to
the score of 89.5, there are already 28 scores that fall below this value. Hence we
conclude that the median score lies in the class interval 84.5 - 89.5 (median class).

The interpolation of the median value is illustrated in Fig. 3.1. In this figure, we
consider the score and the corresponding less than cumulative frequency (<cf) as a pair.
Note that the score corresponding to <cf = 20 is supposed to be the median Md
(Why?).

SCORE     <cf
89.5      28
Md        20
84.5      16

Difference between 89.5 and 84.5:   5 (scores),   12 (<cf)
Difference between Md and 84.5:     Md − 84.5 (scores),   4 (<cf)

Fig. 3.1 Interpolation of the Median Value

Using the principle of interpolation, the ratio of the differences of scores and the
ratio of the corresponding differences of the less than cumulative frequencies (<cf’s)
must be equal. Therefore,
$$\frac{Md - 84.5}{5} = \frac{4}{12}.$$

Solving for the median value, we get $Md = 84.5 + \left(\frac{4}{12}\right)(5)$ or Md = 86.17.

Let us interpret the equation $Md = 84.5 + \left(\frac{4}{12}\right)(5)$. We said earlier that the
median score is the 20th score since $\frac{1}{2}$ of 40 is 20. Since there are only 16 scores up to the
value of 84.5, we have to “add a certain amount to 84.5, the lower class boundary of
the median class” to reach the median value. Now, we need four (4) more scores to
reach the needed 20 scores and since the median class contains 12 scores, we shall take
“4 out of the 12” scores in this interval. Moreover, since these 12 scores are assumed to
be evenly distributed in this interval and the class width of the interval is 5, we multiply
the ratio $\frac{4}{12}$ by 5. Therefore, the amount to be added to 84.5 to reach the median value
is $\left(\frac{4}{12}\right)(5)$.

The above procedure is the basis for the following formula for getting the median
for grouped data:

$$Md = LL + \left(\frac{\frac{N}{2} - F_b}{f}\right)(c)$$ (Equation 4)

where, LL = true lower limit or lower class boundary of the median class;
$F_b$ = the sum of all frequencies below the median class
(or the <cf directly below the median class)
f = frequency corresponding to the median class; and
c = class size.

Notes: 1. It is important to identify first the class interval containing the median since
the values needed in the formula to compute the median are determined
with reference to the median class.
2. If 50% (N) is one of the “<cf’s”, the median is the upper class boundary of the
interval corresponding to this “<cf”.

Example 3.8 Find the median of the following frequency distribution.

Class Interval    f     <cf
95 – 99           1     30
90 – 94           4     29
85 – 89           8     25
80 – 84           10    17    ← Median Class
75 – 79           4     7
70 – 74           3     3
                  $\sum f = 30$

Solution: We note that $\frac{N}{2}$ = 50%(30) = 15. Looking at the <cf, 15 is between 7 and 17.
Hence the median class is 80 – 84. With reference to the median class, we
have LL = 79.5; $F_b$ = 7; f = 10; and c = 5. Therefore, the value of the median
is given by

$$Md = 79.5 + \left(\frac{15 - 7}{10}\right)(5) = 83.50.$$

Another formula for computing the median which is useful for checking purposes
(also derived using the idea of interpolation) is given by

$$Md = UL - \left(\frac{\frac{N}{2} - F_a}{f}\right)(c)$$ (Equation 5)

where, UL = true upper limit or upper class boundary of the median class;
$F_a$ = the sum of all frequencies above the median class
(or the >cf directly above the median class)
f = frequency corresponding to the median class; and
c = class size.

Example 3.9 Verify the answer in Example 3.8 using Equation 5.

Solution: The median class is 80 – 84. With reference to this median class, we have:
UL = 84.5; $F_a$ = 13 (8 + 4 + 1); f = 10; and c = 5. Hence,

$$Md = 84.5 - \left(\frac{15 - 13}{10}\right)(5) = 83.50$$

which is equal to the obtained value in Example 3.8.
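Equations 4 and 5 can be checked against each other with a short Python sketch. The function name `grouped_median` and its argument layout (classes listed from lowest to highest, with their lower class boundaries) are assumptions made for this illustration, and equal class widths are assumed throughout.

```python
def grouped_median(lower_boundaries, freqs, class_size):
    """Return the median from Equation 4 and from Equation 5.

    Classes must be listed from the lowest to the highest interval.
    """
    n = sum(freqs)
    half = n / 2
    cum_below = 0                          # Fb: frequencies below the class
    for ll, f in zip(lower_boundaries, freqs):
        if cum_below + f >= half:          # first class whose <cf reaches N/2
            fa = n - cum_below - f         # Fa: frequencies above the class
            ul = ll + class_size           # upper class boundary
            md_eq4 = ll + (half - cum_below) / f * class_size
            md_eq5 = ul - (half - fa) / f * class_size
            return md_eq4, md_eq5
        cum_below += f
    raise ValueError("empty frequency distribution")


# Table 3.4 (data of Example 3.3), lowest class first.
print(grouped_median([64.5, 69.5, 74.5, 79.5, 84.5, 89.5, 94.5],
                     [1, 2, 6, 7, 12, 8, 4], 5))

# Example 3.8, lowest class first.
print(grouped_median([69.5, 74.5, 79.5, 84.5, 89.5, 94.5],
                     [3, 4, 10, 8, 4, 1], 5))
```

When N/2 is exactly equal to one of the <cf values, the condition `cum_below + f >= half` selects the class whose <cf equals N/2, and Equation 4 then returns its upper class boundary, in line with Note 2 above.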

3.2.6 Mode (Raw Data)

The mode is the value or the score that occurs most frequently in a collection of
scores. Normally, the mode is represented by the tallest column on a histogram or the
highest peak in a frequency polygon. Because finding the mode requires only counting
how often each value occurs, the mode is appropriate even when the variable is
measured in the nominal scale.

The definition of the mode is instructive since it gives us an idea on how to find
the value for a given set of scores. Similar to the median, there is no common symbol
used to denote the mode. For our purpose, we shall use the notation MO for the mode.
The following are the steps in determining the mode for raw data.

Steps: 1. Arrange the data or scores in ascending order (lowest to highest).
2. Record the frequency of each distinct score.
3. The score which has the highest frequency is declared as the mode.

Example 3.10 Find the modal age of the 10 Grade III pupils whose ages are listed below:
Age x (in years): 10.25, 9.0, 10.25, 9.5, 9.0, 10, 9.0, 9.25, 10.75, 10.

Solution: We first arrange the scores in ascending order and take note of the
frequency of occurrence of each score.
9.0, 9.0, 9.0        (f = 3)
9.25                 (f = 1)
9.5                  (f = 1)
10.0, 10.0           (f = 2)
10.25, 10.25         (f = 2)
10.75                (f = 1)

Based on the arranged scores, the modal age is MO = 9.0 years.


The advantage of the mode over the mean and the median is that it is easily
obtained by inspection. However, there are problems associated with the mode as a
measure of central tendency. First, the mode is the most unstable among the three
measures of central tendency since it changes drastically from sample to sample.
Second, when all the scores are distinct, no mode can be declared from the data. For this
reason, the mode is recommended only when the sample size is large.

Third, there are situations when two or more scores would appear the same
number of times in a set of data. In this situation, you declare these scores as the modes
of the distribution, hence it is possible for a distribution to have several modes. A
distribution having two modes is called a bimodal distribution.

Finally, the mode is not popular in inferential statistics since there is no formula
(unlike the mean) that can be manipulated for further analysis of the data. For this
reason, the mode is called a terminal statistic since its usefulness is generally limited only
to descriptive statistics.
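The three steps for finding the mode, together with the cases of no mode and several modes, can be sketched in Python as follows. The function name `modes` and the decision to return an empty list when all scores are distinct are choices made for this illustration.

```python
from collections import Counter


def modes(values):
    """Return a sorted list of modes; empty if all scores are distinct."""
    counts = Counter(values)               # frequency of each distinct score
    top = max(counts.values())
    if top == 1 and len(counts) > 1:
        return []                          # every score appears once: no mode
    return sorted(v for v, c in counts.items() if c == top)


# Ages of the 10 Grade III pupils in Example 3.10.
ages = [10.25, 9.0, 10.25, 9.5, 9.0, 10, 9.0, 9.25, 10.75, 10]

print(modes(ages))               # a single mode
print(modes([1, 1, 2, 2, 3]))    # two modes: a bimodal set of scores
print(modes([1, 2, 3]))          # all scores distinct
```

Returning a list rather than a single number makes the bimodal and no-mode situations explicit instead of silently picking one value.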

3.2.7 Mode (Grouped Data)

For data that are summarized in a frequency distribution, an approximate value
of the mode called the “crude mode” is defined as the midpoint of the class interval with
the highest class frequency (called the modal class), and thus is also easily obtained by
inspection.

A more accurate value of the mode (exact mode) for grouped data is also
obtained by linear interpolation. The resulting formula is given by

$$MO = LL + \left(\frac{d_1}{d_1 + d_2}\right)(c)$$ (Equation 6)

where, LL = true lower limit or lower boundary of the modal class;
$d_1$ = absolute difference between the frequencies of the modal class
and the lower class interval (interval with scores less than the mode);
$d_2$ = absolute difference between the frequencies of the modal class
and the higher class interval (interval with scores greater than the mode);
c = the class size.

Example 3.11 Find the exact mode based on the same frequency distribution given in
Example 3.3.

Solution: The modal class of the frequency distribution shown below is 85 – 89
since it has the largest value of f.

Class Interval    f
95 – 99           4
90 – 94           8
85 – 89           12    ← Modal Class
80 – 84           7
75 – 79           6
70 – 74           2
65 – 69           1
                  $\sum f = 40$

Thus, with reference to the modal class, the frequency of the
lower class interval is 7 while the frequency of the higher class interval is
8. The values needed to compute the exact mode are:

LL = 84.5; $d_1$ = |12 − 7| = 5; $d_2$ = |12 − 8| = 4; and c = 5.

Therefore,

$$MO = LL + \left(\frac{d_1}{d_1 + d_2}\right)(c) = 84.5 + \left(\frac{5}{5 + 4}\right)(5) = 87.28.$$

As a check, we can also use the formula below to compute the mode for grouped
data.

$$MO = UL - \left(\frac{d_2}{d_1 + d_2}\right)(c)$$ (Equation 7)

where UL is the true upper limit or upper boundary of the modal class while the rest are
defined as before.

Thus, using the data in Example 3.11, we have

$$MO = UL - \left(\frac{d_2}{d_1 + d_2}\right)(c) = 89.5 - \left(\frac{4}{5 + 4}\right)(5) = 87.28$$

which is consistent with the previous result.

There is also a rule in statistics called the “empirical rule” which allows us to
approximate the value of the mode for grouped data when the mean and the median are
available. This rule is given by the formula

$$MO \approx 3(\text{Median}) - 2(\text{Mean})$$ (Equation 8)

For the data in Example 3.11, the mean is 85.38 while the median is 86.17. Using
these values, the value of the mode using the empirical rule would be
MO ≈ 3(Median) − 2(Mean) = 3(86.17) − 2(85.38) = 87.75

which is only slightly higher than the computed mode in Example 3.11.
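Equations 6, 7 and 8 can be compared in a short Python sketch. The function names are illustrative. Two simplifying assumptions are made here that the text does not state: if the modal class is the lowest or highest class, the missing neighbor is treated as having frequency 0, and if several classes tie for the highest frequency, the first one found is used.

```python
def grouped_mode(lower_boundaries, freqs, class_size):
    """Return the exact mode from Equation 6 and from Equation 7.

    Classes must be listed from the lowest to the highest interval.
    """
    i = max(range(len(freqs)), key=lambda k: freqs[k])   # modal class index
    # Assumption: a missing neighboring class counts as frequency 0.
    f_lower = freqs[i - 1] if i > 0 else 0
    f_higher = freqs[i + 1] if i < len(freqs) - 1 else 0
    d1 = abs(freqs[i] - f_lower)
    d2 = abs(freqs[i] - f_higher)
    ll = lower_boundaries[i]
    ul = ll + class_size
    mo_eq6 = ll + d1 / (d1 + d2) * class_size
    mo_eq7 = ul - d2 / (d1 + d2) * class_size
    return mo_eq6, mo_eq7


def empirical_mode(mean, median):
    """Equation 8: approximate mode from the mean and the median."""
    return 3 * median - 2 * mean


# Data of Example 3.11, lowest class first.
eq6, eq7 = grouped_mode([64.5, 69.5, 74.5, 79.5, 84.5, 89.5, 94.5],
                        [1, 2, 6, 7, 12, 8, 4], 5)
print(round(eq6, 2), round(eq7, 2))
print(round(empirical_mode(85.38, 86.17), 2))
```

The sketch does not guard against the case where the modal class and both neighbors have the same frequency, in which $d_1 + d_2 = 0$ and the interpolation formula is undefined.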

3.2.8 Relationship of the Three Measures of Central Tendency

When the distribution of a set of data is symmetric and unimodal, the three
measures of central tendency have the same value (see Fig. 3.2).

Mean = Median = Mode

Fig. 3.2 Symmetric Distribution

For skewed distributions, the three measures will have different values. For
instance, when the distribution is negatively skewed, majority of the scores would be
high and there will only be few extremely low scores. These extremely low scores will
tend to pull down the value of the mean, hence among the three measures, the mean
will have the least value. The mode would correspond to the value where the highest
peak occurs and hence it will be greater than the mean as well as the median. Finally,
the median will not be affected by the extremely low scores so it will lie between the
mean and the mode. Fig. 3.3 illustrates the relationship of the three measures of central
tendency for negatively skewed distributions.

Mean < Median < Mode

(Positions on the horizontal axis, from left to right: Mean, Median, Mode)
Fig. 3.3 Negatively Skewed Distribution

For positively skewed distributions, the mean will have the largest value and
the mode will have the least value (Why?). The median will be between the mode and
the mean. Fig. 3.4 illustrates the relationship of the three measures of central tendency
when the distribution is positively skewed.

Mean > Median > Mode

(Positions on the horizontal axis, from left to right: Mode, Median, Mean)
Fig. 3.4 Positively Skewed Distribution

3.2.9 Summary of Measures of Central Tendency

Table 3.6 summarizes the three measures of central tendency, indicating their
common names, when they are supposed to be used, and their advantages and
disadvantages.

Table 3.6 Characteristics of the Mean, Median and Mode

Measure: Mean (Common Name: Arithmetic Average)
   When to use:
      • There are no extreme scores
      • When the data are interval or ratio
   Advantage:
      • Most reliable, i.e., stable and less variable from sample to sample
      • Can be further manipulated mathematically which makes it useful in
        inferential statistics
   Disadvantage:
      • Affected by extreme scores

Measure: Median (Common Name: Middle score)
   When to use:
      • When the distribution is skewed
      • When the data are ranks
   Advantage:
      • Not affected by extreme scores
      • Can be computed for grouped data with open-ended class
   Disadvantage:
      • Less stable from sample to sample
      • Merely indicates the middle value and is not related to the sum of the
        entire set of data

Measure: Mode (Common Name: Typical score)
   When to use:
      • When a quick estimate of the typical score is to be determined
      • When the data are frequency counts
      • When the distribution is skewed
   Advantage:
      • Can be obtained by ocular inspection
   Disadvantage:
      • The most unstable measure especially when the number of scores is small
      • Difficult to interpret when the distribution is not unimodal

3.3 MEASURES OF RELATIVE POSITION

An individual score has meaning only in relation to the rest of the scores. Thus, to
interpret a score, we have to use the entire distribution as basis for interpreting the
individual scores.

We learned that the median is that point in the distribution below which we can
find 50% of the scores. In exactly the same manner, we can calculate the values on the
scale below which we can find a certain percent of all the other scores. These values are
called quantiles and are referred to here as measures of relative position.

The values that divide a distribution into 100 equal parts are called percentiles.
If Px denotes the xth percentile value, then

Px = value on the scale below which we can find x% of the scores.

Thus,

• P90 = the 90th percentile value is the value in the distribution below which we can
find 90% of all the other scores.
In a class consisting of 50 pupils, a pupil whose final grade corresponds to
or is greater than P90 is said to belong to the upper 10% of the pupils in the
class. This also means that his grade is better than the grades of 90%(50) = 45
pupils in the class.
• P10 = the 10th percentile value is the value below which we can find 10% of all the
other scores in the distribution.
• P50 = the 50th percentile value is the value below which we can find 50% of all the
other scores in the distribution. Thus, P50 is the same as the median.

Other Quantiles:

• Deciles - values on the scale that divide the distribution into 10 equal parts
D1 - first decile = value on the scale below which lie 10% of the scores in the
distribution.
D5 - fifth decile = P50 = Median
D9 - ninth decile = P90 = 90th percentile

• Quartiles - values on the scale that divide the distribution into 4 equal parts
Q1 - first quartile = P25 = value on the scale below which lie 25% of the scores in
the distribution. Thus, 75% of all the scores are higher than Q1.
Q2 - middle quartile = D5 = P50 = Median
Q3 - upper quartile = P75 = 75th percentile value.

3.3.1 Computation of Quantiles for Raw data

The literature on quantiles (or percentiles) describes two
methods for computing percentiles for ungrouped data, one using n and the other using
(n+1), where n refers to the total number of cases. Both methods were developed based
on the following definition of the pth percentile value.

Definition: "The pth percentile value (PV) of a set of data is a value along the
measurement scale of the data with the property that approximately p
percent of the entire data set are less than or equal to PV."

Note that to find PV, we always sort the data from lowest to highest. Thus, if
there are n cases or observations in a data set, the ordered data may be denoted by

X1, X2, X3, ..., Xj, Xj+1,...Xn-1, Xn.

As noted above, the two ways of computing quantiles for raw data use either n or
n+1 as the basis, where n represents the total number of cases or observations in the data set.

1. Computing PV using n.
Let np = j + g, where j is the integer part and g the fractional part of the product. For
instance, if n = 6 and p = 25%, np = 6(0.25) = 1.5, so j = 1 and g = 0.5. If n = 9 and
p = 75%, then np = 9(0.75) = 6.75, so j = 6 and g = 0.75.

RULE A. If g = 0, then $PV = \dfrac{X_j + X_{j+1}}{2}$.

RULE B. If g > 0, then $PV = X_{j+1}$.

Thus, if np is a whole number j, the pth percentile value is just the average of
the numbers (cases) in the jth and (j+1)th positions of the ordered data set (Rule A).
On the other hand, if np is not a whole number but contains a fractional part, we
"round up" this value of np to j+1 and declare the number or value in the (j+1)th
position of the ordered set of numbers as the pth percentile value (Rule B).

2. Computing PV using (n+1).

If (n+1)p = jg, where j is the integral part and g is the fractional part of the
product, then

PV = Xj + g(Xj+1 – Xj)

Thus, PV is the number in the jth position (Xj) of the ordered data plus g multiplied by
the difference between the succeeding value (Xj+1) and Xj.
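
The two procedures above can be sketched in Python as follows. This is only an illustrative sketch: the function names `percentile_n` and `percentile_n_plus_1` are hypothetical, p is assumed to be a whole-number percent with 0 < p < 100, and in the (n+1) method a position that falls before the first score is assumed to map to the first score (a position beyond the last score maps to the last score, as is done for P99 in Example 3.12).

```python
def percentile_n(data, p):
    """pth percentile using n (Rules A and B); p is a whole-number percent."""
    if not 0 < p < 100:
        raise ValueError("p must satisfy 0 < p < 100")
    xs = sorted(data)               # always sort from lowest to highest
    n = len(xs)
    # np = j + g; integer arithmetic keeps the check "g = 0" exact
    j, r = divmod(n * p, 100)       # g = r / 100
    if r == 0:
        # Rule A: average of the jth and (j+1)th scores (1-based positions)
        return (xs[j - 1] + xs[j]) / 2
    # Rule B: "round up" to the (j+1)th score
    return xs[j]


def percentile_n_plus_1(data, p):
    """pth percentile using (n+1): PV = Xj + g(Xj+1 - Xj)."""
    if not 0 < p < 100:
        raise ValueError("p must satisfy 0 < p < 100")
    xs = sorted(data)
    n = len(xs)
    j, r = divmod((n + 1) * p, 100)  # (n+1)p = j + g
    g = r / 100
    if j < 1:                        # assumption: before the 1st score
        return xs[0]
    if j >= n:                       # no score beyond the nth score
        return xs[-1]
    return xs[j - 1] + g * (xs[j] - xs[j - 1])


# Data of Example 3.12
scores = [95, 81, 59, 68, 100, 92, 75, 67, 85, 79,
          71, 88, 100, 94, 87, 65, 93, 72, 83, 91]
print(percentile_n(scores, 25), percentile_n(scores, 75))                # 71.5 92.5
print(percentile_n_plus_1(scores, 25), percentile_n_plus_1(scores, 75))  # 71.25 92.75
```

Working with the whole-number percent and `divmod` avoids floating-point round-off when deciding whether the fractional part g is exactly zero.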


Let us illustrate the above rules using the following examples.

Example 3.12 Find Q1, Q3, and P99 for the following data:

95 81 59 68 100 92 75 67 85 79
71 88 100 94 87 65 93 72 83 91

Solution: We first arrange the data in ascending order as follows:

59 65 67 68 71 72 75 79 81 83
85 87 88 91 92 93 94 95 100 100

1. Computing PV using n.
Since Q1 = P25, we get 25%(20) = 5, which means that j = 5 and g = 0. Since there
is no fractional part of 25%(20), Q1 is the average of the 5th score and the 6th score in the
ordered data set. Thus,

$$Q_1 = \frac{\text{5th score} + \text{6th score}}{2} = \frac{71 + 72}{2} = 71.5$$

Similarly, to find Q3 = P75, we get 75%(20) = 15, which means that j = 15 and g = 0. Thus,
Q3 is the average of the 15th score and the 16th score, which is (92 + 93)/2 = 92.5.
Therefore, Q3 = 92.5.

To find P99, we get 99%(20) = 19.8, so that j = 19 and g = 0.8. Since g > 0, we take
the score in the (19+1)th or 20th position as the value of P99. Thus, P99 = 100.

2. Computing PV using n+1.


For Q1 which is the same as P25, we get 25% (20+1) = 5.25, so that j = 5 and g =
0.25 which means that Q1 is the 5th score plus 0.25 of the difference between the 6th
score and the 5th score. Put another way, we say that Q1 is the value which is 0.25 of
the way from the 5th score to the 6th score. Thus, to get Q1, we add to the 5th score
0.25 of the difference between the 6th score (72) and the 5th score (71). Hence,

Q1 = 71 + .25(72 – 71) = 71 + .25 = 71.25.

For Q3, which is the same as P75, we get 75%(20+1) = 15.75, so that j = 15 and
g = 0.75. Thus, Q3 is the 15th score plus 0.75 of the difference between the 16th and
15th scores. Therefore,

Q3 = 92 + 0.75(93 – 92) = 92 + 0.75 = 92.75.

Finally for P99, we note that 99%(20+1) = 20.79. Thus, P99 is the score which is
.79 of the way from the 20th score to the next score. Since we do not have a score
beyond the 20th score we take the 20th score as the value of P99.

Example 3.13 Find P11 and P93 for the data given above.

Solution: Since x = 11, we get 11%(20+1) = 2.31. The value 2.31 suggests that P11 is a
value that is 0.31 of the way from the 2nd score to the 3rd score. To get P11,
we take 0.31 of the difference between the 2nd score (65) and the 3rd score (67)
and add the result to the 2nd score. Thus we have

0.31(67 – 65) = 0.31(2) = 0.62.

Therefore, P11 = 65 + 0.62 = 65.62.

Similarly, 93%(20+1) = 19.53. Thus, P93 is the score that is 0.53 of the way
from the 19th score (100) to the 20th score (100). Thus, P93 = 100 + 0.53(100 – 100) = 100.

Remarks:
1. If we compare the results in Example 3.12, there is a slight difference between
the resulting quantiles using n and n+1. We will agree to use n+1 instead of n
when computing quantiles for raw data, and we shall refer to the obtained
values as the exact values of the quantiles.

2. In some books, quantiles are obtained by simply rounding off the product
np to the nearest whole number and using this number to locate the quantile.

3. A simple procedure for finding the lower and upper quartiles is to get the
median of the scores below Q2 for the lower quartile (Q1) and the median of
the scores above Q2 for the upper quartile (Q3).

To illustrate this idea, consider the data consisting of n = 10 cases as shown
below. Since the data are already arranged in ascending order, the median value is
Q2 = (71+72)/2 = 71.5. The 10 cases are divided by this median value and are separated
using a bar.

          Q1                         Q3
59   65   67   68   71   |   72   75   79   81   83
(values below Q2)            (values above Q2)

Thus, Q1 is the median of the values below Q2, which is 67, while Q3 is the
median of the values above Q2, which is 79.
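
The procedure in Remark 3 can be sketched in Python as shown below. The function name `quartiles_by_halves` is hypothetical, and the treatment of an odd number of cases (leaving the middle score out of both halves) is an assumption, since the illustration above covers only an even number of cases.

```python
from statistics import median


def quartiles_by_halves(data):
    """Return (Q1, Q2, Q3) using the median of each half of the ordered data."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                    # values below Q2
    # Assumption: for odd n the middle score belongs to neither half
    upper = xs[half:] if n % 2 == 0 else xs[half + 1:]
    return median(lower), median(xs), median(upper)


values = [59, 65, 67, 68, 71, 72, 75, 79, 81, 83]
print(quartiles_by_halves(values))       # (67, 71.5, 79)
```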


3.3.2 Computation of Quantiles for Grouped Data

The computation of quantiles of grouped data is similar to the computation of
the median (P50). Recall that the formula for finding the median is given by

$$Md = LL + \left(\frac{\frac{N}{2} - F_b}{f}\right)(c) = LL + \left(\frac{50\%(N) - F_b}{f}\right)(c).$$

Since Px is defined as the value below which x% of the total number of cases lies,
we can revise the above formula by merely replacing 50%(N) with x%(N). Thus,

$$P_x = LL + \left(\frac{x\%(N) - F_b}{f}\right)(c) \qquad \text{(Equation 9)}$$

where LL = the true lower limit of the class interval containing Px;
Fb = sum of all frequencies below the interval containing Px
(the “<cf” of the interval just below the one containing Px);
f = frequency of the interval containing Px;
N = number of cases; and
c = class size.
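
Equation 9 can be sketched in Python as follows. The function name `grouped_percentile` and the input layout (a list of `(lower_limit, upper_limit, frequency)` tuples ordered from the lowest class to the highest) are assumptions made for illustration. The sketch also assumes class limits recorded in whole units, so that the true lower limit is the stated lower limit minus 0.5 and the class size is upper − lower + 1. The table used is the one in Example 3.14 of this section.

```python
def grouped_percentile(classes, p):
    """Px = LL + ((x%(N) - Fb) / f) * c for a frequency distribution."""
    N = sum(f for _, _, f in classes)      # number of cases
    target = p * N / 100                   # x%(N)
    Fb = 0                                 # cumulative frequency below the class
    for lower, upper, f in classes:
        if f > 0 and Fb + f >= target:     # this interval contains Px
            LL = lower - 0.5               # true lower limit (whole-unit limits)
            c = upper - lower + 1          # class size
            return LL + (target - Fb) / f * c
        Fb += f
    raise ValueError("p must be between 0 and 100")


# Frequency table of Example 3.14, lowest class first
table = [(70, 74, 3), (75, 79, 4), (80, 84, 10),
         (85, 89, 8), (90, 94, 4), (95, 99, 1)]
print(grouped_percentile(table, 25))       # 79.75
print(grouped_percentile(table, 75))       # 87.9375
```

The second value, 87.9375, rounds to 87.94 when the computation is carried to two decimal places.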

Example 3.14 Find a) Q1 and b) Q3 for the data given below:

Class Interval     f     <cf
95 – 99            1     30
90 – 94            4     29
85 – 89            8     25
80 – 84           10     17
75 – 79            4      7
70 – 74            3      3
                Σf = 30

Solution: a) Since Q1 = P25, we first get 25%(N) to determine the interval containing Q1.
Note that 25%(30) = 7.5. With reference to the “<cf” column, 7.5 is
between 7 and 17, so the interval 80 – 84 contains P25. Thus, with
reference to this interval, we have LL = 79.5; Fb = 7; f = 10; c = 5.
Therefore,

$$P_{25} = LL + \left(\frac{25\%(N) - F_b}{f}\right)(c) = 79.5 + \left(\frac{7.5 - 7}{10}\right)(5) = 79.5 + 0.25 = 79.75.$$


b) Since Q3 = P75, we first get 75%(N) to determine the interval containing Q3.
Using N = 30, we have 75%(30) = 22.5. Looking at the “<cf” column, we
note that 22.5 is between 17 and 25. Thus, the interval containing Q3 or
P75 is defined by the limits 85 – 89. With reference to this class interval,
we have LL = 84.5; Fb = 17; f = 8; c = 5. Therefore,

$$P_{75} = LL + \left(\frac{75\%(N) - F_b}{f}\right)(c) = 84.5 + \left(\frac{22.5 - 17}{8}\right)(5) = 84.5 + 3.44 = 87.94.$$

3.4 Box Plots or “Box and Whiskers Diagrams”

A box plot is also a graphical presentation of quantitative data that indicates what
the extreme values of the data are, where the data are centered, and how spread out the
data are. It is obtained by plotting the values of five descriptive statistics of the data:
the smallest value, the lower quartile (Q1), the median (Q2), the upper
quartile (Q3), and the largest value. Box plots are useful in identifying any outliers of
the data, that is, cases or observations that fall outside the overall pattern of the data.

A case or observation is a suspected outlier if it falls more than 1.5 × IQR above
Q3 or below Q1, where IQR (Interquartile Range) = Q3 – Q1.
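
The outlier criterion can be sketched in Python as follows. The function names `outlier_fences` and `suspected_outliers` are hypothetical; the quartile values used in the illustration are the exact values Q1 = 71.25 and Q3 = 92.75 obtained for the data of Example 3.12.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the limits Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1                      # interquartile range
    return q1 - k * iqr, q3 + k * iqr


def suspected_outliers(data, q1, q3):
    """List the observations lying outside the two limits."""
    low, high = outlier_fences(q1, q3)
    return [x for x in data if x < low or x > high]


scores = [59, 65, 67, 68, 71, 72, 75, 79, 81, 83,
          85, 87, 88, 91, 92, 93, 94, 95, 100, 100]
print(outlier_fences(71.25, 92.75))              # (39.0, 125.0)
print(suspected_outliers(scores, 71.25, 92.75))  # []
```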

Procedure for constructing a box plot:


1. Draw a box that spans from Q1 to Q3.
2. Draw a line segment in the box that marks the median Q2.
3. Draw line segments (called whiskers) that extend from the box to the
smallest and largest values of the data.

Example 3.15 Construct a box plot using the data presented in Example 3.12. The
ordered data are reproduced below.
59 65 67 68 71 72 75 79 81 83
85 87 88 91 92 93 94 95 100 100
Solution:
Based on this data, we have the following five summary values:
1. Smallest value: 59
2. Q1 = 71.25
3. Q2 = 84
4. Q3 = 92.75
5. Largest value: 100
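
As a cross-check, the five summary values can be obtained with Python's standard library (version 3.8 or later). In this sketch, the helper name `five_number_summary` is hypothetical. The call `statistics.quantiles(..., method="exclusive")` interpolates at positions based on n+1, which for these data agrees with the exact values computed with the (n+1) rule.

```python
from statistics import quantiles


def five_number_summary(data):
    """Smallest value, Q1, Q2, Q3, and largest value of the data."""
    # n=4 asks for the three cut points that divide the data into quartiles
    q1, q2, q3 = quantiles(data, n=4, method="exclusive")
    return min(data), q1, q2, q3, max(data)


scores = [59, 65, 67, 68, 71, 72, 75, 79, 81, 83,
          85, 87, 88, 91, 92, 93, 94, 95, 100, 100]
print(five_number_summary(scores))   # (59, 71.25, 84.0, 92.75, 100)
```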


The resulting box plot is shown in Figure 3.5.

[Figure 3.5: box plot marking the smallest value (S), Q1, Q2, Q3, and the largest value (L)
on a horizontal scale running from 60 to 105]

Fig. 3.5 Box Plot for the Data in Example 3.12

From the given data, IQR = 92.75 – 71.25 = 21.5, so 1.5 × IQR = 32.25. Thus,
Q1 – 1.5 × IQR = 71.25 – 32.25 = 39 while Q3 + 1.5 × IQR = 92.75 + 32.25 = 125. Since all
the scores lie between 39 and 125, the given data have no potential outlier.

Box plots can be generated automatically using the Statistical Package for the Social
Sciences (SPSS) software. Figure 3.6 shows the box plots of the level of knowledge on
AIDS of a random sample of high school students and college students in Tacloban
City¹, generated using SPSS.

Fig. 3.6 Box Plot for Knowledge Scores among HS and College Students

From the figure, it is easy to compare the levels of knowledge on AIDS between
the two groups of respondents by simply looking at the location of the medians. Also,
the lengths of the whiskers suggest that the knowledge scores among the college
students are more homogeneous than the knowledge scores of the high school students.
Note that there is one outlier for the knowledge scores among the high school students.

As shown in this illustration, box plots are useful not only in describing data sets
but, more importantly, in comparing two or more data sets.

¹ Barbosa, Jovy (2012). Knowledge, Attitude and Practices on HIV/AIDS Among Senior Secondary
Students and Senior Nursing Students in Selected Schools in Tacloban City. Unpublished
Master’s Thesis. Remedios T. Romualdez Medical Foundation College.


Exercise 3.

1. For each given data set, compute the mean, median and mode and discuss which
measure is preferable. Justify your answer.
a. Contributions of 15 students (in pesos) to the Red Cross campaign fund:
5, 7, 2, 15, 2, 6, 12, 150, 5, 5, 20, 10, 12, 10, 8.

b. The yearly operating expenses of a group of 17 apartment houses:
13,000, 19,000, 17,000, 16,000, 12,000, 27,000, 16,000, 14,000, 13,000, 10,000,
15,000, 13,000, 17,000, 14,000, 16,000, 16,000, 12,000.

c. The completion time (in minutes) on a special test administered to 14 students:
11, 15, 16, 19, 22, 24, 28, 30, 35, 38, 40, 52, 58, 70.

d. The monthly income of 17 government employees:
13,000; 19,000; 17,000; 16,000; 12,000; 27,000; 16,000; 14,000; 13,000; 10,000;
15,000; 13,000; 17,000; 14,000; 16,000; 16,000; 12,000.

2. The data shown below are the scores of 16 students in a Math test. Find the exact
value of each indicated quantile.
63, 61, 62, 58, 57, 61, 59, 56, 65, 63, 57, 59, 62, 64, 58, 60
a. P10 b. P90 c. Q1 d. Q3

3. Construct a box plot for the data in exercise 2 above.

4. Find the exact value of the indicated measures of relative position based on the
following data:
85, 78, 82, 79, 83, 80, 89, 87, 83, 90, 88, 85
a. P15 b. D3 c. D7 d. P85

5. For the data in 1d above, use the procedure discussed in Remark 3 of Section 3.3.1 to find
the lower and the upper quartiles of the data.

6. The table below presents the High School GPA of five sections of fourth year students
in a particular school.

Section         1      2      3      4      5
n              35     40     38     36     40
X̄ (mean)    85.2   84.0   83.7   82.3   80.3

What is the High School GPA of all the fourth year students in this school?


7. Tell which measure of central tendency would probably best represent the data
described. Give a reason for your answer.
a) A decorator surveys 36 people to ask their favorite colors.
b) A professor wants to describe class performance on a test for which the scores
range from 13 to 99, with all but two students scoring above 65.
c) A manufacturer wants to know what shoe size the “average” woman wears.
d) A job applicant takes an aptitude test with six parts, each of which tests a skill
essential to the job.

8. In a quality control study in a knitting mill, a manager examines 10 randomly chosen
sweaters a day for six weeks. The numbers of defects he finds in each group of 10 are
summarized in this frequency distribution:

No. of defects    0    1    2    3    4    5    6    7    8    9   10
Frequency        12    7    4    1    2    1    2    0    1    0    0

Find the mean, median and modal number of defects for this data. (6 points)

9. The ages of a random sample of 267 high school students are summarized in a
frequency distribution as shown below. Find the mean, median and modal age of the
students. Which measure is appropriate for describing the age of the respondents?
Explain.

Age(years) f
14 – 15 103
16 – 17 116
18 – 19 31
20 – 21 12
22 – 23 5
10. The frequency distribution of the scores in Statistics of 44 graduate students is shown
below:
Class Interval f
95 – 99 1
90 – 94 2
85 – 89 4
80 – 84 6
75 – 79 8
70 – 74 12
65 – 69 7
60 – 64 3
55 – 59 1
44

a. Find the mean, median and mode for this data. On the basis of the computed
values, describe the type of skewness exhibited by the data.


b. Find the values of the following quantiles: P10, Q1, Q3, P90, D4 and D6, and interpret
each computed value.

11. The frequency distribution of the monthly take home pay of a random sample of 60
rank and file employees is shown in the table below.

Class Interval f
6,000 – 6,499 2
5,500 – 5,999 3
5,000 – 5,499 4
4,500 – 4,999 8
4,000 – 4,499 12
3,500 – 3,999 18
3,000 – 3,499 9
2,500 – 2,999 4
60

a. Find the mean, median and mode for this data. What type of skewness is
exhibited by the data based on the computed measures of central tendency?

b. Find the values of the following quantiles: D2, D7, Q3, Q1, P80, and P95, and interpret
each computed value.

12. Show that the formula $\bar{x} = \dfrac{\sum_{i=1}^{n} f_i X_i}{n}$ reduces to the formula
$\bar{x} = A + \left(\dfrac{\sum_{i=1}^{n} f_i d_i}{n}\right)c$ by letting $d_i = \dfrac{X_i - A}{c}$,
where $X_i$ is the ith class mark, A is the assumed mean and c is the class size.
