Stat (I) 1-3 Material
1.3 STAGES IN STATISTICAL INVESTIGATION
A (statistical) population is the complete set of possible measurements about which inferences
are to be made. The population is the target of an investigation, and the objective of the
investigation is to draw conclusions about it; hence it is sometimes called the target
population.
Examples
➢ Population of trees under specified climatic conditions
➢ Population of animals fed a certain type of diet
➢ Population of farms having a certain type of natural fertility
➢ Population of households, etc.
Sample: a part of the population selected so as to be representative of it.
The population could be finite or infinite (an imaginary collection of units).
There are two ways of investigation: census and sample survey.
Census: a complete enumeration of the population. In most real problems a census cannot be
carried out, hence we take a sample.
Sample: A sample from a population is the set of measurements that are actually collected in the
course of an investigation. It should be selected using some pre-defined sampling technique in
such a way that it represents the population very well. In practice a census is rarely
feasible, so a sample survey is conducted instead.
Parameter: Characteristic or measure obtained from a population.
Statistic: Characteristic or measure obtained from a sample.
Sampling: The process or method of sample selection from the population.
Sample size: The number of elements or observations to be included in the sample.
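The distinction between a parameter and a statistic can be illustrated with a small simulation. This is only a sketch: the population of tree heights, the seed and the sample size of 50 are all made-up assumptions, and simple random sampling is just one possible sampling technique.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical finite population: heights (cm) of 1,000 trees (invented data).
population = [random.gauss(300, 40) for _ in range(1000)]

sample_size = 50
# Simple random sampling without replacement.
sample = random.sample(population, sample_size)

parameter = statistics.mean(population)  # measure obtained from the population
statistic = statistics.mean(sample)      # measure obtained from the sample

print(f"population mean (parameter): {parameter:.1f}")
print(f"sample mean (statistic):     {statistic:.1f}")
```

The two printed values are usually close but not equal, which is why a statistic is used only as an estimate of the parameter.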
1.5 APPLICATIONS, USES AND LIMITATIONS OF STATISTICS.
1.5.1 APPLICATIONS OF STATISTICS:
➢ In almost all fields of human endeavor.
➢ In daily life, where almost everyone has to obtain and interpret numerical facts.
➢ In specific processes, e.g. the invention of certain drugs and measuring the extent of
environmental pollution.
➢ In industry, especially in the area of quality control.
1.5.2 USES OF STATISTICS
The main function of statistics is to enlarge our knowledge of complex phenomena. Some uses of
statistics:
➢ It presents facts in a definite and precise form.
➢ Data reduction.
➢ Measuring the magnitude of variations in data.
➢ Furnishes a technique of comparison
➢ Estimating unknown population characteristics.
➢ Formulating and testing hypotheses.
➢ Studying the relationship between two or more variables.
➢ Forecasting future events.
2. Quantitative variables are numerical variables and can be measured. Examples: balance
in a checking account, number of children in a family.
Note that quantitative variables are either discrete or continuous.
Discrete variable: It assumes a finite or countable number of possible values. It is usually
obtained by counting. Example: number of children in a family, number of cars at a traffic
light.
Continuous variable: It can assume any value within a defined range. Continuous variables
are usually obtained by measuring. Example: weight in kg, height, time, air pressure in a tire.
Measurement scales
Proper knowledge about the nature and type of data to be dealt with is essential in order to
specify and apply the proper statistical method for their analysis and inferences. Measurement
scale refers to the property of value assigned to the data based on the properties of order, distance
and fixed zero.
Order
The property of order exists when an object that has more of the attribute than another object, is
given a bigger number by the rule system.
Distance
The property of distance is concerned with the relationship of differences between objects. If a
measurement system possesses the property of distance it means that the unit of measurement
means the same thing throughout the scale of numbers. More precisely, an equal difference
between two numbers reflects an equal difference in the "real world" between the objects that
were assigned the numbers.
Fixed zero (true zero)
True zero is related to the property of absolute absence of characteristic under consideration.
The property of fixed zero (true zero) is necessary for ratios between numbers to be meaningful.
Scale types
Four levels of measurement scales are commonly distinguished: nominal, ordinal, interval, and
ratio. Each possesses different properties of measurement systems.
Nominal Scales
Nominal scales are measurement systems that possess none of the three properties stated above.
• Level of measurement which classifies data into mutually exclusive, all inclusive
categories in which no order or ranking can be imposed on the data.
• No arithmetic and relational operation can be applied.
Examples:
• Sex (Male or Female),
• Marital status (married, single, widowed, divorced)
• Country code
• Regional differentiation of Ethiopia.
Ordinal Scales
Ordinal Scales are measurement systems that possess the property of order, but not the property
of distance. The property of fixed zero is not important if the property of distance is not satisfied.
• Level of measurement which classifies data into categories that can be ranked.
Differences between the ranks are not meaningful.
• Arithmetic operations are not applicable but relational operations are applicable.
• Ordering is the sole property of ordinal scale.
Example: Rating scales (Excellent, Very good, Good, Fair, poor), Military status.
Interval Scales
Interval scales are measurement systems that possess the properties of Order and distance, but
not the property of fixed zero.
• Level of measurement which classifies data that can be ranked and differences are
meaningful. However, there is no meaningful zero, so ratios are meaningless.
• All arithmetic operations except division are applicable.
• Relational operations are also possible.
Example: Temperature in degrees Celsius (°C) or Fahrenheit (°F),
Your score on an individual intelligence test as a measure of your intelligence.
A temperature of 0°C does not mean that there is no temperature. Furthermore, a temperature of
30°C in town X on a specific day is not twice as warm as 15°C on another day in the same
town, because the zero point of the Celsius scale is arbitrary.
Ratio Scales
Ratio scales are measurement systems that possess all three properties: order, distance, and fixed
zero. The added power of a fixed zero allows ratios of numbers to be meaningfully interpreted;
e.g. the ratio of the first person’s height to another person’s height is 1.32, whereas this is not
possible with interval scales.
• Level of measurement which classifies data that can be ranked, differences are
meaningful, and there is a true zero. True ratios exist between the different units of
measure.
• All arithmetic and relational operations are applicable.
Examples: Weight, Height, Number of students, Age
Exercises: Classify the following different measurement systems into one of the four types of
scales.
1. Your checking account number as a name for your account.
2. Your checking account balance as a measure of the amount of money you have in that account
3. Your score on the first statistics test as a measure of your knowledge of statistics
4. A response to the statement "Abortion is a woman's right" where "Strongly Disagree" = 1,
"Disagree" = 2, "No Opinion" = 3, "Agree" = 4, and "Strongly Agree" = 5, as a measure of
attitude toward abortion.
5. Times for swimmers to complete a 50-meter race
6. Months of the year Meskerm, Tikimit…
7. Socioeconomic status of a family when classified as low, middle and upper classes.
8. Blood type of individuals, A, B, AB and O.
9. Pollen counts provided as numbers between 1 and 10, where 1 implies there is almost no pollen
and 10 that it is rampant, but for which the values do not represent actual counts of grains of
pollen.
10. Region numbers of Ethiopia
11. The number of students in a college
12. The net wages of a group of workers
CHAPTER TWO
According to the role of time, data are classified into cross-section and time series data.
Cross-section data are a set of observations taken at one point in time, while time series data
are a set of observations collected over a sequence of times, usually at equal intervals, which
may be weekly, monthly, quarterly, yearly, etc.
Before any statistical work can be done data must be collected. Depending on the type of
variable and the objective of the study different data collection methods can be employed. In the
collection of data we have to be systematic. If data are collected haphazardly, it will be difficult
to answer our research questions in a conclusive way.
Various data collection techniques can be used such as:
• Observation • Using available information
• Interview (Face-to-face/telephone interviews) • Focus group discussions (FGD)
• Questionnaire (mailed and self-administered questionnaire)
• Other data collection techniques – life histories, case studies, etc.
i) Observation – It includes all methods from simple visual observation to the use of high-level
machines and measurements and of sophisticated equipment or facilities, such as radiographic
and X-ray machines and microscopes.
An observation guide should be prepared prior to data collection.
Advantages: Gives relatively more detailed, accurate and context-related information.
Disadvantages: The investigator's or observer's own biases, prejudices, desires, etc. can affect
the results, and more resources and skilled manpower are needed when high-level machines are used.
ii) Interview
Could be face to face /telephone interview
Advantage:
- suitable for use with illiterates
- permits clarifications of questions
- higher response rate than self-administered questionnaire
Disadvantage:
- presence of interviewer can influence the response
- more costly than self-administered questionnaire
iii) Questionnaire (Mailed and self-administered questionnaire)
A questionnaire is a list of questions arranged in a predetermined sequence for a predetermined
purpose.
Self-administered questionnaires: under this method, the questionnaire is distributed by hand to
the respondents. The use of self-administered questionnaires is simpler and cheaper; such
questionnaires can be administered to many persons simultaneously (e.g. to a class of students).
Mailed Questionnaire Method
- The questionnaires are sent by post to the informants.
Limitations of questionnaire:
➢ The method can be used only if the respondents are educated.
➢ The response rates tend to be relatively low.
➢ Informants may not return the completed questionnaire, and even if they do, they
may have filled it in incorrectly.
➢ It may not give the investigator a chance to explain the questions or ask supplementary
and follow up questions.
Types of questions used in a questionnaire
Depending on how questions are asked and recorded we can distinguish two major possibilities:
open-ended questions and closed-ended questions.
a) Open-ended questions: Open-ended questions permit free responses that should be recorded
in the respondent’s own words. The respondent is not given any possible answers to choose
from. Such questions are useful to obtain information on:
• Facts with which the researcher is not very familiar
• Opinions, attitudes, suggestions of informants, or Sensitive issues
b) Closed-ended questions: Closed questions offer a list of possible options or answers from
which the respondents must choose. When designing closed questions one should try to:
• Offer a list of options that are exhaustive and mutually exclusive
• Keep the number of options as few as possible.
1.1.2. Methods of Data Presentation
The data collected in a survey is called raw data. In most cases, useful information is not
immediately evident from the mass of unsorted data. Collected data need to be organized in such
a way as to condense the information they contain in a way that will show patterns of variation
clearly. Precise methods of analysis can be decided upon only when the characteristics of the
data are understood. For this purpose, different techniques of data organization
and presentation, such as ordered arrays, tables and diagrams, are used.
Statistical Tables
A statistical table is an orderly and systematic presentation of data in rows and columns. Rows
are horizontal and columns are vertical arrangements. The use of tables for organizing, for
example, qualitative data involves grouping the data into mutually exclusive categories of the
variable and counting the number of occurrences (frequency) in each category.
The simple frequency table is used when the individual observations involve only a single
variable, whereas cross tabulation is used to obtain the frequency distribution of one variable
by the subsets of another variable.
Examples:
Simple or one-way table
Table 1: Immunization status of 210 children in a certain Woreda
Immunization status     Number of children   Percent (%)
Not immunized           75                   35.7
Partially immunized     57                   27.1
Fully immunized         78                   37.2
Two-way table: This table shows two characteristics and is formed when either the row or the
column is divided into two or more parts.
Table 2: Immunization status by marital status of the women of childbearing age in a town.
                  Immunization Status
Marital Status    Immunized   Non Immunized   Total
Single            58          177             235
Married           156         294             450
Divorced          10          18              28
Widowed           7           7               14
Total             231         496             727
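The row and column totals of a two-way table can be checked mechanically. The following is a minimal sketch using the counts of Table 2; the dictionary layout and variable names are illustrative choices, not part of the original material.

```python
# Counts from Table 2: (immunized, non-immunized) for each marital status.
table = {
    "Single":   (58, 177),
    "Married":  (156, 294),
    "Divorced": (10, 18),
    "Widowed":  (7, 7),
}

# Row totals: one total per marital status.
row_totals = {status: sum(counts) for status, counts in table.items()}
# Column totals: [immunized, non-immunized].
col_totals = [sum(col) for col in zip(*table.values())]
grand_total = sum(row_totals.values())

for status, (imm, non_imm) in table.items():
    print(f"{status:<10}{imm:>6}{non_imm:>6}{row_totals[status]:>6}")
print(f"{'Total':<10}{col_totals[0]:>6}{col_totals[1]:>6}{grand_total:>6}")
```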
Frequency distributions
For data to be more easily appreciated and to draw quick comparisons, it is often useful to
arrange the data in the form of a table, or in one of a number of different graphical forms.
Frequency: is the number of times a certain value of the variables is repeated in the given data.
It is the number of observations belonging to a given value or a group.
Frequency distribution: is a table which contains the values and the corresponding frequencies.
From the definition, a frequency distribution has two parts, namely- the values of the variables
on the one hand and the number of observations (frequency) corresponding to the values of the
variables on the other.
Array (ordered array): is a serial arrangement of numerical data in an ascending or descending
order.
Types of frequency distribution
There are two types of frequency distributions: categorical (qualitative) and numerical
(quantitative).
1. Categorical frequency distribution: Here data are classified according to non-
numerical categories. To construct a categorical frequency distribution, the categories
contained in the frequency distribution must be mutually exclusive and exhaustive. In
other words, an element must be counted in one and only one category.
Example: Seniors of a high school were interviewed on their plan after completing high school.
The following data give plans of 548 seniors of a high school.
Steps in the construction of a grouped continuous frequency distribution:
• Determine the number of classes to use, preferably between 5 and 20. The approximate
number of classes (K) can be obtained from Sturges' formula, given by:
K = 1 + 3.322×log(n), where n is the number of observations and log is the base-10 logarithm.
• Determine the class size (class width) as:
W = (Maximum value – Minimum value)/K = Range/K.
• Pick a suitable starting point less than or equal to the minimum value. The starting point
is called the lower limit of the first class. Continue to add the class width to this lower
limit to get the rest of the lower limits.
• To find the upper limit of the first class, subtract U from the lower limit of the second
class, where U is the unit of measurement (U = 1 for data recorded as whole numbers). Then
continue to add the class width to this upper limit to find the rest of the upper limits.
• Find the boundaries by subtracting U/2 units from the lower limits and adding U/2 units
to the upper limits.
• Find the frequency and relative frequency of each class.
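The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a fixed recipe: the function names are invented here, the class width is rounded up to a whole number, the unit of measurement is taken as U = 1, and the figures passed in (80 observations ranging from 10 to 44) are illustrative.

```python
import math

def number_of_classes(n):
    """Sturges' formula K = 1 + 3.322 * log10(n), rounded to a whole number."""
    return round(1 + 3.322 * math.log10(n))

def class_limits(minimum, maximum, k, unit=1):
    """Return the class width and the (lower, upper) limits of k classes."""
    width = math.ceil((maximum - minimum) / k)   # assumption: round the width up
    lowers = [minimum + i * width for i in range(k)]
    uppers = [low + width - unit for low in lowers]
    return width, list(zip(lowers, uppers))

def class_boundaries(limits, unit=1):
    """Subtract U/2 from each lower limit and add U/2 to each upper limit."""
    return [(low - unit / 2, up + unit / 2) for low, up in limits]

k = number_of_classes(80)
width, limits = class_limits(10, 44, k)
print("classes:", k, "width:", width)
print("limits:", limits)
print("boundaries:", class_boundaries(limits))
```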
Example: Construct a grouped frequency distribution of the following data on the amount of time
(in hours) that 80 college students devoted to leisure activities during a typical school week:
23 24 18 14 20 24 24 26 23 21 16 15 19 20 22 14 13 20 19 27 29 22 38
28 34 44 23 19 21 31 16 28 19 18 12 27 15 21 25 16 30 17 22 29 29 18
25 20 16 11 17 12 15 24 25 21 22 17 18 15 21 20 23 18 17 15 16 26 23
22 11 16 18 20 23 19 17 15 20 10
Solution:
Maximum value = 44 and Minimum value = 10, so Range = 44 – 10 = 34.
Using the above formula: K = 1 + 3.322 × log(80) = 7.32 ≈ 7 classes.
Class width: W = Range/K = 34/7 = 4.857 ≈ 5.
Let 10 be the lower limit of the first class. That is, LCL1 = 10, LCL2 = 10 + W = 10 + 5 = 15, etc.
10, 15, 20, 25, 30, 35, and 40 are the lower class limits.
Find the upper class limits; e.g. the first upper class limit is UCL1 = 15 – U = 15 – 1 = 14,
UCL2 = 14 + W = 14 + 5 = 19, etc.
14, 19, 24, 29, 34, 39, and 44 are the upper class limits.
The class boundaries are calculated by: UCB = UCL + ½*U and LCB = LCL – ½*U.
Example: consider the above example and determine the class boundaries.
With U = 1: UCB1 = UCL1 + ½*U = 14 + 1/2 = 14.5 and LCB1 = LCL1 – ½*U = 10 – 1/2 = 9.5, etc.
The class marks are calculated as: m1 = ½*(UCL1 + LCL1) = ½*(UCB1 + LCB1) = 12,
m2 = ½*(UCL2 + LCL2) = 17, etc.
So, the complete frequency distribution table with cumulative frequencies is as follows.
Class limit   Class boundary   Class mark (mi)   Frequency (fi)   Relative frequency   Less than cf   Greater than cf
10 – 14       9.5 – 14.5       12                8                0.1                  8              80
15 – 19       14.5 – 19.5      17                28               0.35                 36             72
20 – 24       19.5 – 24.5      22                27               0.3375               63             44
25 – 29       24.5 – 29.5      27                12               0.15                 75             17
30 – 34       29.5 – 34.5      32                3                0.0375               78             5
35 – 39       34.5 – 39.5      37                1                0.0125               79             2
40 – 44       39.5 – 44.5      42                1                0.0125               80             1
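The frequencies and cumulative frequencies in the table can be tallied directly from the raw data. The sketch below transcribes the 80 observations of the example, with 22 and 38 taken as separate observations so that there are 80 values; the variable names are illustrative.

```python
hours = [
    23, 24, 18, 14, 20, 24, 24, 26, 23, 21, 16, 15, 19, 20, 22, 14, 13, 20, 19, 27, 29, 22, 38,
    28, 34, 44, 23, 19, 21, 31, 16, 28, 19, 18, 12, 27, 15, 21, 25, 16, 30, 17, 22, 29, 29, 18,
    25, 20, 16, 11, 17, 12, 15, 24, 25, 21, 22, 17, 18, 15, 21, 20, 23, 18, 17, 15, 16, 26, 23,
    22, 11, 16, 18, 20, 23, 19, 17, 15, 20, 10,
]
limits = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39), (40, 44)]

n = len(hours)
# Frequency of a class: number of observations between its limits (inclusive).
freqs = [sum(low <= h <= up for h in hours) for low, up in limits]

# "Less than" cumulative frequency: running total from the first class.
less_than_cf = []
running = 0
for f in freqs:
    running += f
    less_than_cf.append(running)

# "Greater than" cumulative frequency: observations in this class or above.
more_than_cf = [n - (cf - f) for cf, f in zip(less_than_cf, freqs)]

for (low, up), f, lcf, mcf in zip(limits, freqs, less_than_cf, more_than_cf):
    mark = (low + up) / 2
    print(f"{low:>2} - {up:<2}  mark={mark:<5} f={f:<3} rel={f / n:<7} <cf={lcf:<3} >cf={mcf}")
```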
[Figure: simple bar chart with immunization status on the horizontal axis (not immunized,
partially immunized, fully immunized) and bar heights 75, 57 and 78]
Fig.1 Immunization status
b) Component Bar chart: Bars are sub-divided into component parts of the figure. These
sorts of diagrams are constructed when each total is built up from two or more
component figures. This is done by dividing the bars into parts representing the
components and shading them accordingly.
Consider the data on immunization status of women by marital status (Table 2).
[Figure: component bar chart with one bar per marital status (single, married, divorced,
widowed), each bar divided into immunized and non-immunized parts]
Fig. 2. Immunization status by marital status of women 15-49 years
c) Multiple bar charts: In this type of chart the component figures are shown as separate
bars adjoining each other. The height of each bar represents the actual frequency of the
component figure. It is used when the distributional pattern of more than one variable is to be
shown and comparisons between the components are desired.
Example of multiple bar chart: consider the data on immunization status of women by marital
status.
[Figure: multiple bar chart with marital status on the horizontal axis (single, married,
divorced, widowed) and adjoining immunized and non-immunized bars: 58 and 177, 156 and 294,
10 and 18, 7 and 7]
Fig. 3. Immunization status by marital status of women 15-49 years
2) Pie chart: a circle that represents categorical data by dividing the circle into sectors whose
central angles, out of 360°, are proportional to the amount associated with each category. The
share of each category can be expressed either as a percentage or as an angle.
That is, the central angle of a category = (amount of the category / total amount) × 360°, and the
proportion of a category = (frequency of the category / total frequency) × 100%.
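As an illustration of these two formulas, the sketch below computes the central angle and percentage of each sector for the immunization data of Table 1. The variable names are illustrative only.

```python
# Counts from Table 1 (immunization status of 210 children).
counts = {"Not immunized": 75, "Partially immunized": 57, "Fully immunized": 78}
total = sum(counts.values())

# Central angle = (amount of the category / total amount) * 360 degrees.
angles = {status: count / total * 360 for status, count in counts.items()}
# Proportion = (frequency of the category / total frequency) * 100 percent.
percents = {status: count / total * 100 for status, count in counts.items()}

for status in counts:
    print(f"{status:<20} {angles[status]:6.1f} degrees  {percents[status]:5.1f} %")
```

The angles always add up to 360° and the percentages to 100%, which is a quick check on the arithmetic.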
[Figure: pie chart with sectors NI (not immunized) 36%, PI (partially immunized) 27% and
FI (fully immunized) 37%]
Fig. 4. Immunization status of children
Type of Graphs
The following are the most commonly used graphical presentations of data.
1) Histograms: A histogram is the graph of the frequency distribution of continuous
measurement variables. It is constructed on the basis of the following principles:
a) The horizontal axis is a continuous scale running from one extreme end of the distribution to
the other. It should be labeled with the name of the variable and the units of measurement.
b) For each class in the distribution a vertical rectangle is drawn with (i) its base on the
horizontal axis extending from one class boundary of the class to the other, so that
there is never any gap between the histogram rectangles, and (ii) the width of its base
determined by the width of the class interval. If a distribution with unequal class
intervals is to be presented by means of a histogram, it is necessary to adjust for the
varying magnitudes of the class intervals.
Example: Consider the data on time (in hours) that 80 college students devoted to leisure
activities during a typical school week. Draw the histogram
2) Frequency Polygon: If we join the midpoints of the tops of the adjacent rectangles of the
histogram with line segments, a frequency polygon is obtained. When the polygon is continued to
the X-axis just outside the range of the data, the total area under the polygon will be equal to
the total area under the histogram.
Example: Consider the above data on time spent on leisure activities.
[Figure: frequency polygon with a horizontal axis running from 0 to 45 and plotted frequencies
8, 28, 27, 12, 3, 1 and 1]
Fig 5: Frequency polygon curve on time spent for leisure activities by students
[Figure: less than ogive and more than ogive plotted against the class boundaries 9.5, 14.5,
19.5, 24.5, 29.5, 34.5, 39.5 and 44.5; the less than ogive rises through 0, 8, 36, 63, 75, 78,
79, 80 and the more than ogive falls through 80, 72, 44, 17, 5, 2, 1, 0]
Fig 7: Cumulative frequency curve for amount of time college students devoted to leisure
activities
CHAPTER THREE
3.1.Measures of Central Tendency
When we want to make comparison between groups of numbers it is good to have a single value
that is considered to be a good representative of each group. This single value is called the
average of the group. Averages are also called measures of central tendency.
Objectives
Since the number of sample points is frequently large and it is easy to lose track of the overall
picture by looking at all the data at once, the data must be summarized as briefly as possible.
Some objectives of measuring central tendency:
• To comprehend (understand) the data easily.
• To facilitate comparison.
• To make further statistical analysis.
The Summation Notation
Let X1, X2, X3, …, Xn be a number of measurements, where n is the total number of observations
and Xi is the i-th observation.
The symbol $\sum_{i=1}^{n} X_i$ (read as “the sum of Xi where i runs from 1 to n”) is mathematical
shorthand for X1+X2+X3+...+Xn. That is, $\sum_{i=1}^{n} X_i = X_1 + X_2 + \dots + X_n$.
Example: Suppose the following were scores made on the first homework assignment for five
students in the class: 5, 7, 7, 6, and 8.
$\sum_{i=1}^{5} X_i = X_1 + X_2 + X_3 + X_4 + X_5 = 5 + 7 + 7 + 6 + 8 = 33$
Properties of Summation
$\sum_{i=1}^{n} kX_i = k\sum_{i=1}^{n} X_i$, where k is any constant.
$\sum_{i=1}^{n} (a + bX_i) = na + b\sum_{i=1}^{n} X_i$, where a and b are constants.
$\sum_{i=1}^{n} (X_i + Y_i) = \sum_{i=1}^{n} X_i + \sum_{i=1}^{n} Y_i$
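The properties can be checked numerically. In the sketch below X holds the five homework scores from the earlier example, while Y and the constants k, a and b are hypothetical values chosen only for illustration.

```python
X = [5, 7, 7, 6, 8]   # homework scores from the earlier example
Y = [1, 2, 3, 4, 5]   # hypothetical second variable, for illustration only
n = len(X)
k, a, b = 4, 3, 2     # arbitrary constants (assumptions)

sum_x = sum(X)
sum_y = sum(Y)

# Property 1: a constant factor can be taken outside the sum.
prop1 = sum(k * x for x in X) == k * sum_x
# Property 2: the sum of (a + b*X_i) equals n*a + b * (sum of X_i).
prop2 = sum(a + b * x for x in X) == n * a + b * sum_x
# Property 3: the sum of (X_i + Y_i) equals the sum of X_i plus the sum of Y_i.
prop3 = sum(x + y for x, y in zip(X, Y)) == sum_x + sum_y

# Caution: the sum of products is in general NOT the product of the sums.
sum_xy = sum(x * y for x, y in zip(X, Y))
print(prop1, prop2, prop3)
print(sum_xy, sum_x * sum_y)
```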
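b) $\sum_{i=1}^{5} Y_i = 36$
c) $\sum_{i=1}^{5} 10 = 10 \times 5 = 50$
d) $\sum_{i=1}^{5} (X_i + Y_i) = \sum_{i=1}^{5} X_i + \sum_{i=1}^{5} Y_i = 69$
f) $\sum_{i=1}^{5} X_i Y_i = 241$
g) $\sum_{i=1}^{5} X_i^2 = 223$
h) $\left(\sum_{i=1}^{5} X_i\right)\left(\sum_{i=1}^{5} Y_i\right) = 1188$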
If X is a variable having values X1, X2,…,Xk occurring with frequencies of f1, f2,…, fk
respectively, then its arithmetic mean is given by:
$\bar{X} = \frac{X_1 f_1 + X_2 f_2 + \dots + X_k f_k}{f_1 + f_2 + \dots + f_k} = \frac{\sum_{i=1}^{k} X_i f_i}{\sum_{i=1}^{k} f_i}$.
Example: Suppose the X values are 3, 5, 4, 2, 7 and 6 with corresponding frequencies of 2, 1, 3,
2, 1 and 1 respectively. Then find the mean for the data.
Xi 3 5 4 2 7 6
frequency, fi 2 1 3 2 1 1
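A minimal sketch of this computation, applying the frequency formula for the mean to the values and frequencies in the table above (the variable names are illustrative):

```python
values = [3, 5, 4, 2, 7, 6]   # X_i
freqs = [2, 1, 3, 2, 1, 1]    # f_i

total = sum(x * f for x, f in zip(values, freqs))  # sum of X_i * f_i
n = sum(freqs)                                     # sum of f_i
mean = total / n

print(total, n, mean)  # 40 10 4.0
```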
▪ mi is the midpoint of the ith class and
▪ fi is the ith class frequency.
Example: Calculate the mean for grouped data on the amount of time (in hours) that 80 college
students devoted to leisure activities during a typical school week given below:
Time spent (hours) Frequency
10 – 14 8
15 – 19 28
20 – 24 27
25 – 29 12
30 – 34 3
35 – 39 1
40 - 44 1
Solution:
• First find the class marks (midpoints)
• Find the product of frequency and class marks
• Find mean using the formula.
The class marks of the distribution are: 12, 17, 22, 27, 32, 37, 42.
Then the mean of the data is computed as:
$\bar{X} = \frac{\sum_{i=1}^{7} m_i f_i}{\sum_{i=1}^{7} f_i} = \frac{12 \times 8 + 17 \times 28 + \dots + 42 \times 1}{8 + 28 + \dots + 1} = \frac{1665}{80} \approx 20.8$ hours.
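The three solution steps can be reproduced with a short script, which is a convenient way to check the sum of products term by term. This is a sketch; the names are illustrative.

```python
limits = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39), (40, 44)]
freqs = [8, 28, 27, 12, 3, 1, 1]

# Step 1: class marks (midpoints of the class limits).
marks = [(low + up) / 2 for low, up in limits]
# Step 2: products of frequency and class mark.
products = [m * f for m, f in zip(marks, freqs)]
# Step 3: grouped mean = sum of products / total frequency.
grouped_mean = sum(products) / sum(freqs)

print(marks)
print(products, sum(products))
print(grouped_mean)
```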
Example: The mean final exam mark of one class of 50 students is 30, and the mean mark
of another class of 100 students in the same final exam is 40. What is the mean mark of all 150
students?
Solution: $\bar{X}_c = \frac{50 \times 30 + 100 \times 40}{50 + 100} = \frac{5500}{150} \approx 36.7$.
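The same calculation works for any number of groups. The helper below is a hypothetical illustration (the function name is not from the original notes); it weights each group mean by its group size.

```python
def combined_mean(sizes, means):
    """Mean of all observations pooled from several groups."""
    total = sum(n * m for n, m in zip(sizes, means))  # sum of the group totals
    return total / sum(sizes)

mean_all = combined_mean([50, 100], [30, 40])
print(round(mean_all, 1))  # 36.7
```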
4) If a wrong figure has been used when calculating the mean, then the correct mean can be
obtained without repeating the whole process using:
Correct mean = wrong mean + (correct value − wrong value)/n,
where n = number of observations.
Example: The average weight of 10 students was calculated to be 65 kg. Later it was discovered
that one weight was misread as 40 kg instead of 80 kg.
Calculate the correct average weight.
Correct mean = 65 + (80 − 40)/10 = 65 + 4 = 69 kg
5) The effect of transforming original series on the mean.
a) If a constant k is added to (or subtracted from) every observation, then the new mean
will be the old mean + k (or the old mean − k), respectively.
b) If every observation is multiplied by a constant k, then the new mean will be
k*old mean.
Example: The mean of a set of numbers is 500.
a. If 10 is added to each of the numbers in the set, then what will be the mean of the new
set?
New mean = 500 + 10 = 510
b. If each of the numbers in the set is multiplied by -5, then what will be the mean of the
new set? New mean = -5*500 = -2500
Example: The mean of n observations X1, X2, …, Xn is known to be 12. A new set of
observations is obtained by the linear transformation Yi = 2Xi – 0.5 (i = 1, 2, …, n). What
will be the mean of the new set of observations?
Solutions: New Mean = 2* Old Mean – 0.5 = 2*12 – 0.5 = 23.5.
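The rule can be verified on any data set whose mean is 12. The three X values below are hypothetical, chosen only so that their mean is 12; the transformation is the one from the example.

```python
import math
import statistics

X = [10, 12, 14]               # hypothetical data with mean 12
Y = [2 * x - 0.5 for x in X]   # linear transformation Y_i = 2*X_i - 0.5

old_mean = statistics.mean(X)
new_mean = statistics.mean(Y)

print(old_mean, new_mean)
# The mean follows the same transformation as the observations.
print(math.isclose(new_mean, 2 * old_mean - 0.5))
```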
Advantages of arithmetic mean
▪ It is based on all values
▪ It is easy to calculate and simple to understand
▪ It is suitable for further mathematical treatment.
▪ It is stable average, i.e. it is not affected by fluctuations of sampling to some extent.
Disadvantages of arithmetic mean
▪ It is affected by extreme observations.
▪ It cannot be used in the case of open end classes.
▪ It cannot be determined by the method of inspection.
▪ It cannot be used when dealing with qualitative characteristics, such as intelligence, honesty,
beauty.
▪ Sometimes it leads to wrong conclusion if the details of the data from which it is obtained are
not available.
Weighted Mean
In computing the arithmetic mean we gave equal importance to each observation. When
averaging quantities, however, it is often necessary to account for the fact that not all of
them are equally important in the phenomenon being described. In order to give the quantities
being averaged their proper degree of importance, we assign them relative importance values
called weights, and then calculate a weighted mean.
In general, the weighted mean $\bar{X}_w$ of a set of values X1, X2, …, Xn, whose relative importance is
expressed numerically by a corresponding set of weights W1, W2, …, Wn, is given by:
$\bar{X}_w = \frac{X_1 W_1 + X_2 W_2 + \dots + X_n W_n}{W_1 + W_2 + \dots + W_n} = \frac{\sum_{i=1}^{n} X_i W_i}{\sum_{i=1}^{n} W_i}$.
Example: A student obtained results of 60, 75, 63, 59, and 55 in English, Biology, Mathematics,
Physics and Chemistry examinations respectively. Find the student's weighted arithmetic mean if
weights 1, 2, 1, 3, 3 respectively are allotted to the subjects.
Solution: $\bar{X}_w$ = (60*1 + 75*2 + 63*1 + 59*3 + 55*3)/(1+2+1+3+3) = 615/10 = 61.5.
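The weighted mean of this example can be reproduced as follows. The function name is an illustrative choice; the marks and weights are those given in the example.

```python
def weighted_mean(values, weights):
    """Sum of value*weight divided by the sum of the weights."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)

marks = [60, 75, 63, 59, 55]   # English, Biology, Mathematics, Physics, Chemistry
weights = [1, 2, 1, 3, 3]

result = weighted_mean(marks, weights)
print(result)  # 61.5
```

With all weights equal, the function reduces to the ordinary arithmetic mean.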
$G.M = \sqrt[n]{X_1 \times X_2 \times \dots \times X_n}$
Taking the logarithms of both sides:
$\log(G.M) = \log\sqrt[n]{X_1 X_2 \cdots X_n} = \log (X_1 X_2 \cdots X_n)^{1/n}$
$\log(G.M) = \frac{1}{n}\log(X_1 X_2 \cdots X_n) = \frac{1}{n}(\log X_1 + \log X_2 + \dots + \log X_n)$
$\log(G.M) = \frac{1}{n}\sum_{i=1}^{n} \log X_i$
The logarithm of the G.M of a set of observations is the arithmetic mean of
their logarithms.
$G.M = \mathrm{Antilog}\left(\frac{1}{n}\sum_{i=1}^{n} \log X_i\right)$
Example 2.7: Find the G.M of the numbers 2, 4, 8.
Solutions:
$G.M = \sqrt[n]{X_1 \times X_2 \times \dots \times X_n} = \sqrt[3]{2 \times 4 \times 8} = \sqrt[3]{64} = 4$
Remark: The Geometric Mean is useful and appropriate for finding averages of ratios.
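The logarithmic form of the geometric mean translates directly into code. The sketch below uses natural logarithms and the exponential function as the antilog; any logarithm base gives the same result. The function name is illustrative.

```python
import math

def geometric_mean(values):
    """Antilog of the arithmetic mean of the logarithms (positive values only)."""
    mean_log = sum(math.log(x) for x in values) / len(values)
    return math.exp(mean_log)

gm = geometric_mean([2, 4, 8])
print(round(gm, 6))  # 4.0, up to floating-point rounding
```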
$H.M = \frac{n}{\sum_{i=1}^{k} \frac{f_i}{X_i}}$, where $n = \sum_{i=1}^{k} f_i$.
If observations X1, X2, …, Xn have weights W1, W2, …, Wn respectively, then their harmonic mean
is given by
$H.M = \frac{\sum_{i=1}^{n} W_i}{\sum_{i=1}^{n} \frac{W_i}{X_i}}$. This is called the Weighted Harmonic Mean.
Remark: The Harmonic Mean is useful and appropriate in finding average speeds and average
rates.
Example 2.1.8: A cyclist pedals from his house to his college at speed of 10 km/hr and back
from the college to his house at 15 km/hr. Find the average speed.
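A sketch of the calculation for this example, assuming (as the problem implies) that the distance is the same in both directions. The one-way distance of 30 km used in the cross-check is a hypothetical figure; any distance gives the same average speed.

```python
import math

def harmonic_mean(values):
    """Number of values divided by the sum of their reciprocals."""
    return len(values) / sum(1 / x for x in values)

average_speed = harmonic_mean([10, 15])   # km/hr out and back
print(round(average_speed, 2))  # 12.0

# Cross-check with total distance / total time for a hypothetical 30 km route.
distance = 30
total_time = distance / 10 + distance / 15   # hours out + hours back
print(2 * distance / total_time)
```

The result is below the arithmetic mean of 12.5 km/hr because more time is spent travelling at the slower speed.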
∆2 = fmo − f3 ;
fmo = frequency of the modal class
f1 = frequency of the class immediately preceding the modal class;
f3 = frequency of the class immediately succeeding the modal class.
Note: The modal class is a class with the highest frequency.
Example: Consider the following grouped quantitative data. Calculate the modal value of the
data.
Class limit Class boundary Frequency
6 – 11 5.5 – 11.5 2
12 – 17 11.5 – 17.5 2
18 – 23 17.5 – 23.5 7
24 – 29 23.5 – 29.5 4
30 – 35 29.5 – 35.5 3
36 – 41 35.5 – 41.5 2
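Solution: The modal class is 18 – 23 (frequency 7), so L = 17.5, w = 6, ∆1 = 7 − 2 = 5 and
∆2 = 7 − 4 = 3.
$\text{Mode} = 17.5 + \frac{5}{5 + 3} \times 6 = 21.25$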
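The calculation can be sketched in code under the assumption that the grouped mode is L + ∆1/(∆1 + ∆2) × w, with ∆1 = fmo − f1 and ∆2 = fmo − f3, where L is the lower boundary of the modal class and w its width. The function name is illustrative, and a missing neighbouring class is treated as having frequency 0.

```python
def grouped_mode(lower_boundaries, freqs, width):
    """Mode of a grouped distribution with equal class widths (assumed formula)."""
    i = freqs.index(max(freqs))                           # position of the modal class
    f_mo = freqs[i]
    f_prev = freqs[i - 1] if i > 0 else 0                 # f1: class before the modal class
    f_next = freqs[i + 1] if i < len(freqs) - 1 else 0    # f3: class after the modal class
    d1 = f_mo - f_prev
    d2 = f_mo - f_next
    return lower_boundaries[i] + d1 / (d1 + d2) * width

lower_boundaries = [5.5, 11.5, 17.5, 23.5, 29.5, 35.5]
freqs = [2, 2, 7, 4, 3, 2]

mode = grouped_mode(lower_boundaries, freqs, width=6)
print(mode)  # 21.25
```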
series. It is the middle most value in the sense that the number of values less than the median is
equal to the number of values greater than it.
Suppose there are n observations in a sample and these observations are ordered from smallest
to largest. Then the sample median for ungrouped data is defined as:
(1) The $\left(\frac{n+1}{2}\right)$-th observation if n is odd.
(2) The average of the $\left(\frac{n}{2}\right)$-th and $\left(\frac{n}{2}+1\right)$-th observations if n is even.
b) Ascending order: 1, 2, 3, 5, 7, 8, 8, 9 (n = 8)
Median = (4th + 5th observation)/2 = (5 + 7)/2 = 6
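The two-case definition of the median for ungrouped data can be written as a small function. This is a sketch with an illustrative function name; the even-sized list is the one from part b), and the odd-sized list is a made-up example.

```python
def median(values):
    """Middle observation if n is odd, average of the two middle ones if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # the ((n+1)/2)-th observation
    return (ordered[mid - 1] + ordered[mid]) / 2 # (n/2)-th and (n/2 + 1)-th observations

print(median([1, 2, 3, 5, 7, 8, 8, 9]))  # 6.0
print(median([4, 1, 9]))                 # 4 (hypothetical odd-sized data)
```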
Median for Grouped Data
For a grouped (continuous) frequency distribution, median is calculated as:
$\text{Median} = L + \frac{\left(\frac{n}{2} - cf\right)}{f} \times w$, where
L = lower class boundary of the median class
w = length of the interval
n = total frequency of the sample
cf = Cumulative frequency preceding the median class.
f = Frequency of that interval containing the median.
The median class is the class with the smallest cumulative frequency (less than type) greater than
or equal to $\frac{n}{2}$.
Example: Find the median for the following distribution
Class      Frequency   Cumulative frequency (less than type)
40 – 44    7           7
45 – 49    10          17
50 – 54    22          39
55 – 59    15          54
60 – 64    12          66
65 – 69    6           72
70 – 74    3           75
$\frac{n}{2} = \frac{75}{2} = 37.5$
39 is the first cumulative frequency to be greater than or equal to 37.5.
Therefore, 50 – 54 is the median class. L = 49.5, n = 75, w = 5, cf = 17, f = 22
Hence, $\text{Median} = L + \frac{\left(\frac{n}{2} - cf\right)}{f} \times w = 49.5 + \frac{(37.5 - 17) \times 5}{22} = 54.16$
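The search for the median class and the interpolation can be sketched as below. The function name is illustrative, and the lower class boundary is obtained from the lower class limit by subtracting half a unit, assuming U = 1 as in this example.

```python
def grouped_median(lower_limits, freqs, width, unit=1):
    """Median of a grouped distribution: L + (n/2 - cf) / f * w."""
    n = sum(freqs)
    cf = 0  # cumulative frequency of the classes before the current one
    for low, f in zip(lower_limits, freqs):
        if cf + f >= n / 2:               # first class whose cumulative frequency reaches n/2
            boundary = low - unit / 2     # lower class boundary L
            return boundary + (n / 2 - cf) / f * width
        cf += f
    raise ValueError("empty distribution")

lower_limits = [40, 45, 50, 55, 60, 65, 70]
freqs = [7, 10, 22, 15, 12, 6, 3]

result = grouped_median(lower_limits, freqs, width=5)
print(round(result, 2))  # 54.16
```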
Note:
• Median is a positional average and hence not influenced by extreme observations.
• Median can be calculated in the case of open end intervals.
• Median can be located even if the data are incomplete.
Other measures of locations (Quantiles: quartiles, deciles, percentiles)
When a distribution is arranged in order of magnitude of items, the median is the value of the
middle term. Other measures that likewise depend on position in the distribution, namely
quartiles, deciles, and percentiles, are collectively called quantiles.
Quartiles: Quartiles are measures that divide the frequency distribution into four equal parts.
The values of the variable corresponding to these divisions are denoted Q1, Q2, and Q3, often
called the first, the second and the third quartile respectively.
Q1 is the value that 25% of the items are less than or equal to. Q2 has 50% of the items with
values less than or equal to it, and Q3 has 75% of the items with values less than or equal to it.
The k-th quartile Qk for ungrouped data is the value of the item in the $\frac{k(n+1)}{4}$-th position,
where k = 1, 2, 3 and n is the total number of observations.
The computation of the three quartiles for grouped data can be done as follows:
• Calculate $\frac{kn}{4}$ and search for the minimum cumulative frequency which is greater than or
equal to $\frac{kn}{4}$, k = 1, 2, 3.
• The class corresponding to this cumulative frequency is the kth quartile class. This is the
class where Qk lies.
• Thus, $Q_k = L + \frac{w\left(\frac{kn}{4} - cf\right)}{f}$, k = 1, 2, 3, where
L = lower class boundary of the kth quartile class
n = the total number of observations
cf = the less than cumulative frequency corresponding to the class immediately preceding
the kth quartile class
w = the class width of the quartile class and
f = frequency of the kth quartile class
Deciles: Deciles are measures that divide the frequency distribution in to ten equal parts. The
values of the variables corresponding to these divisions are denoted D1, D2,.. D9 often called the
f = frequency of the kth percentiles class
Note: To compute quantiles, we first sort the data in ascending order.
Q2 = D5 = P50 = median, P25 = Q1, P75 = Q3, and $D_i = P_{10i}$, i = 1, 2, 3, …, 9.
Example: Considering the following distribution
Calculate: a) All quartiles b) The 7th decile c) The 90th percentile.
Class limit Frequency Cumulative freq.(less than type)
141 – 150 17 17
151 – 160 29 46
161 – 170 42 88
171 – 180 72 160
181 – 190 84 244
191 – 200 107 351
201 – 210 49 400
211 – 220 34 434
221 – 230 31 465
231 – 240 16 481
241 – 250 12 493
Solution a) quartiles
Q1: Determine the class containing the first quartile.
$\frac{n}{4} = 123.25$. Hence, 171 – 180 is the class containing the first quartile.
L = 170.5, n = 493, w = 10, cf = 88, f = 72
$Q_1 = L + \frac{w\left(\frac{n}{4} - cf\right)}{f} = 170.5 + \frac{10(123.25 - 88)}{72} = 175.40$
Q2: Determine the class containing the second quartile.
$\frac{2n}{4} = 246.5$. Hence, 191 – 200 is the class containing the second quartile.
L = 190.5, n = 493, w = 10, cf = 244, f = 107
$Q_2 = L + \frac{w\left(\frac{2n}{4} - cf\right)}{f} = 190.5 + \frac{10(246.5 - 244)}{107} = 190.73$
Q3: Determine the class containing the third quartile.
$\frac{3n}{4} = 369.75$. Hence, 201 – 210 is the class containing the third quartile.
L = 200.5, n = 493, w = 10, cf = 351, f = 49
$Q_3 = L + \frac{w\left(\frac{3n}{4} - cf\right)}{f} = 200.5 + \frac{10(369.75 - 351)}{49} = 204.33$
b) D7: Determine the class containing the 7th decile.
$\frac{7n}{10} = 345.1$. Hence, 191 – 200 is the class containing the seventh decile.
L = 190.5, n = 493, w = 10, cf = 244, f = 107
$D_7 = L + \frac{w\left(\frac{7n}{10} - cf\right)}{f} = 190.5 + \frac{10(345.1 - 244)}{107} = 199.95$
c) P90: Determine the class containing the 90th percentile.
$\frac{90n}{100} = 443.7$. Hence, 221 – 230 is the class containing the 90th percentile.
L = 220.5, n = 493, w = 10, cf = 434, f = 31
$P_{90} = L + \frac{w\left(\frac{90n}{100} - cf\right)}{f} = 220.5 + \frac{10(443.7 - 434)}{31} = 223.63$
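Quartiles, deciles and percentiles for grouped data all follow the same pattern, so one function can cover them. The sketch below is an illustration with an invented function name; it assumes equal class widths and a unit of measurement U = 1, and it can be used to re-check each of the values worked out above.

```python
def grouped_quantile(k, parts, lower_limits, freqs, width, unit=1):
    """k-th quantile when the distribution is divided into `parts` equal parts
    (parts=4 for quartiles, 10 for deciles, 100 for percentiles)."""
    n = sum(freqs)
    position = k * n / parts
    cf = 0  # less than cumulative frequency before the current class
    for low, f in zip(lower_limits, freqs):
        if cf + f >= position:            # first class reaching the position
            boundary = low - unit / 2     # lower class boundary L
            return boundary + width * (position - cf) / f
        cf += f
    raise ValueError("position exceeds the total frequency")

lower_limits = [141, 151, 161, 171, 181, 191, 201, 211, 221, 231, 241]
freqs = [17, 29, 42, 72, 84, 107, 49, 34, 31, 16, 12]

q1, q2, q3 = (grouped_quantile(k, 4, lower_limits, freqs, 10) for k in (1, 2, 3))
d7 = grouped_quantile(7, 10, lower_limits, freqs, 10)
p90 = grouped_quantile(90, 100, lower_limits, freqs, 10)

for name, value in [("Q1", q1), ("Q2", q2), ("Q3", q3), ("D7", d7), ("P90", p90)]:
    print(f"{name} = {value:.2f}")
```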
REVIEW EXERCISES
1. A company was experiencing a chronic weld defect problem with a water outlet tube
assembly. Each assembly manufactured is leak tested in a water tank. Data were collected
on a gap between the flange and the pipe for 6 assemblies that leaked and 6 good
assemblies that passed the leak test. Leakers: 0.290, 0.104, 0.207, 0.145, 0.104, 0.124
i. Calculate the sample mean $\bar{x}$.
2. The following are the numbers of minutes that a person had to wait for the bus to work on
15 working days: 10, 1, 13, 9, 5, 9, 2, 10, 3, 8, 6, 17, 2, 10, and 15. Find
A. the mean; B. the median;
3. For each of the following distributions, decide whether it is possible to find the mean and
whether it is possible to find the median. Explain your answers.
Grade Frequency
40-49 5
50-59 18
60-69 27
70-79 15
80-89 6
IQ Frequency
Less than 90 3
90-99 14
100-109 22
110-119 19
More than 119 7
Find the first and third quartiles Q1 and Q3 for grouped data.
4. The average annual salaries paid to top-level management in three companies are
$94,000, $102,000, and $99,000. If the respective numbers of top-level executives in
these companies are 4, 15, and 11, find the average salary paid to these 30 executives.
5. In a nuclear engineering class there are 22 juniors, 18 seniors, and 10 graduate students.
If the juniors averaged 71 in the midterm examination, the seniors averaged 78, and the
graduate students averaged 89, what is the mean for the entire class?
6. If an instructor counts the final examination in a course four times as much as each 1-
hour examination, what is the weighted average grade of a student who received grades
of 69, 75, 56, and 72 in four 1-hour examinations and a final examination grade of 78?