AGE 301 Handout by Dr. Show
Introduction
Statistics is a universal subject used in all disciplines and in all areas of human endeavour. The
word statistics was originally applied only to such data as the state required for its official
purposes. To a layman, it also refers to any set of quantitative data relating to a particular
measurement, whether or not that data is of interest.
The systematic collection of official statistics for political purposes originated in Germany
towards the end of the 18th Century, by comparing data such as population, industrial and
agricultural output. Also in England, a collection of numerical data enabled government
departments to predict levels of revenues and expenditure with more precision than before.
1.1 Meaning of statistics
The earliest origin of statistics lies in the desire of rulers to count the numbers of inhabitants or
measure the value of taxable land in their domains. This has developed to careful measurement
of weight, distance or counting of physical quantities and items in many disciplines such as
agriculture, life and behavioral sciences.
The study of statistics is therefore essential for sound reasoning, precise judgment and
objective decision-making based on up-to-date, accurate and reliable data.
Most of us, especially media reporters, have little or nothing to do with a large mass
of data.
Statistics is also the science and practice of developing human knowledge through the use of
empirical data expressed in quantitative form. It is based on statistical theory. It is a branch of
applied mathematics where randomness and uncertainty are modelled by probability theory
(Wikipedia Encyclopedia).
In Nigeria, the official data collection and its usage started with the Statistical Act of 1947 which
established the Department of Statistics in the office of the Governor General of the Federation.
Thus, many researchers, educationalists, businessmen and government agencies at the national,
state or local level rely on data to answer fundamental questions pertaining to their operations
and programs. In fact, there can be no meaningful science without statistics.
In-Text Question
Statistics could also be defined as?
a. A structural science
b. Codes that help programmers in programming
c. a branch of applied mathematics where randomness and uncertainty are modeled by
probability theory
d. None of the above
In-Text Answer
c.) A branch of applied mathematics where randomness and uncertainty are modelled by
probability theory
Descriptive Statistics
It is the act of summarizing and giving a descriptive account of numerical information in the
form of reports, charts and diagrams. The goal of descriptive statistics is to gain information
from collected data. It begins with the collection of data by either counting or measurement in an
inquiry.
It involves the summary of specific aspects of the data, such as the average value and measures of
spread. Suitable graphs, diagrams and charts are then used to gain understanding and clear
interpretation of the phenomenon under investigation, keeping firmly in mind where the data
come from.
Statistical Method
This is a device for classifying data and making clear the relationship between the variables under
consideration. This can be achieved by using statistical tools and formulae. It ranges from
the computation of simple summaries of data (mean, median, mode, etc.) to complex modelling
used in policy formulation.
Inferential Statistics
This is the act of making an inductive statement about a population from the quantities computed
from a representative sample of it. It is a process of making inferences or generalizing about the
population under certain conditions and assumptions. Statistical inference involves the processes
of estimation of parameters and hypothesis testing.
1.4 Terms and Concepts in Statistics
There are many terms and concepts in statistics that we need to learn in order to understand
the subject better. The terms and concepts discussed below are used
daily in the field of statistics.
Sample: A sample is a representative part of a population observed for the purpose of making a
scientific statement or taking decisions about the population. A good sample must be randomly
selected and adequate.
A sample can be random or purposive. A random sample may be obtained by tossing a coin,
throwing a die, drawing discs from a container or using a table of random numbers. A purposive
(judgmental) sample is obtained when members of a population are selected by discretion or
personal judgment.
Statistic: A statistic is a quantity or summary calculated from a sample for the purpose of
drawing conclusions about the related population, e.g. the sample mean ($\bar{x}$), the sample variance ($S^2$),
etc.
The characteristics of units in the population can be measured or counted (quantitative), e.g.
weight, height, age, number of cars. They can also be observed (qualitative or attributes), e.g. colour of
eyes, beauty, complexion, etc.
Variate: A variate (variable) is any quantity or attribute whose value varies from one unit of
observation to another. A quantitative variate (variable) may be discrete or continuous.
Continuous Variate: A continuous variate is a variate which may take all values within a given
range. Its values are obtained by measurement, e.g. height, volume, time, examination score, etc.
Discrete Random Variate: A discrete random variate is one whose value changes by steps. Its
value may be obtained by counting. It normally takes integer values e.g. number of cars, number
of chairs.
Statistical data: These are data obtained through objective measurement or enumeration of
characteristics using the state of the art equipment that is precise and unbiased. Such data when
subjected to statistical analysis produce results with high precision.
Internal Data
When data is collected from within the organization and used in the organization concerned, it is
called internal data. Examples are data from accounts and internal records of an establishment.
External Data
If data is collected from outside the organization, it is called external data. Examples are data
from journals not published by the organization itself. There are two major sources of statistical
data: the internal source and the external source.
Primary Data
These are data generated first-hand, i.e. data obtained directly from respondents by personal
interview, measurement or observation.
Secondary Data
These are data obtained from publications, newspapers, magazines and annual reports. They are
usually summarized data that are being used for a purpose other than the one for which they were originally collected.
Summary
In Study Session 1, you have learnt that:
1. The study of statistics is essential for sound reasoning, precise judgment and objective
decision in the face of up-to-date accurate and reliable data.
2. Statistics can be defined as the science of collecting, organizing and interpreting numerical
facts, which we call data
3. The branches of statistics are descriptive statistics, statistical methods and statistical
inference
4. Statistics could be used for a lot of our day to day activities
5. A population is the collection of items under investigation
6. A parameter is a summary / quantity computed from a population
7. A variate (variables) is any quantity or attributes whose value varies from one unit of
observation to another
8. Data can be described as a mass of unprocessed information obtained from the measurement
or counting of a characteristic or phenomenon
9. Data can be categorized as internal or external data.
Self-Assessment Question (SAQs) for Study Session 1
Now that you have completed this study session, you can assess how well you have achieved its
Learning outcomes by answering the following questions. Write your answers in your study
Diary and discuss them with your Tutor at the next study Support Meeting. You can check your
answers with the Notes on the Self-Assessment questions at the end of this Module.
1. Define population
2. What are the sources of statistical data?
Notes on SAQ
SAQ 1.1
Statistics can simply be defined as the “science of data”. It is the science of collecting, organizing
and interpreting numerical facts, which we call data.
SAQ 1.2
SAQ 1.3
SAQ 1.4
2.
Study Session 2 Presentation of Data
Introduction
The aim of this study session is to introduce the various methods of presenting statistical data.
Presentation of data in tables, charts and diagrams facilitates understanding of the important
feature of the data.
We shall discuss the frequency table, cumulative Frequency table, Stem plot, Box plot and
Histogram assuming that we are very familiar with other graphs such as pie chart, frequency
curve, frequency polygon etc.
In-Text Question
Why is it necessary to present data in tables, charts and diagrams?
a. To have a clear understanding of the data and illustrate the relationship between variables
b. To break information into pieces
c. Allow a blind man understand the data
d. To have a clear understanding of the data
In-Text Answer
a.) To have a clear understanding of the data and illustrate the relationship between variables
Procedure
Given a set of observations x1, x2, …, xN for a single variable:
1. Find the range (R), i.e. the difference between the largest and smallest values of the data.
2. Determine the number of classes (K), depending on the size of the data.
3. Find the class interval (C), i.e. the range divided by the number of classes.
4. Tally (i.e. assign the values to classes).
5. Find the class frequencies.
Note: With the advent of computers, all these steps can be accomplished easily.
Example 2.1: The following are the scores of 40 students in Mathematics test:
50, 08, 14, 20, 46, 23, 26, 47, 32, 31, 48, 40, 49, 40, 41,
38, 51, 86, 55, 82, 56, 72, 60, 98, 59, 76, 55, 80, 52, 63,
57, 67, 53, 70, 69, 63, 65, 66, 22, 27
Construct a frequency table for the above data.
Solution
Range: 98 – 08 = 90
No. of classes = 10
Class interval = Range / No. of classes = 90 / 10 = 9
Working Table
Table 2.1
Frequency Table
Table 2.2
Score           Frequency
01 up to 10         1
11 up to 20         2
21 up to 30         4
31 up to 40         5
41 up to 50         6
51 up to 60         9
61 up to 70         7
71 up to 80         3
81 up to 90         2
91 up to 100        1
Total              40
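The tallying step of the procedure can be checked with a short script. The following is a minimal sketch in Python, not part of the original handout; the function name frequency_table and its parameters are illustrative assumptions. It counts the scores of Example 2.1 into the ten classes used in Table 2.2.

```python
# Scores of the 40 students in Example 2.1
scores = [50, 8, 14, 20, 46, 23, 26, 47, 32, 31, 48, 40, 49, 40, 41,
          38, 51, 86, 55, 82, 56, 72, 60, 98, 59, 76, 55, 80, 52, 63,
          57, 67, 53, 70, 69, 63, 65, 66, 22, 27]


def frequency_table(data, start, width, n_classes):
    """Return a list of (lower limit, upper limit, frequency) tuples.

    Classes are start..start+width-1, start+width..start+2*width-1, and so on
    (hypothetical helper written for this illustration).
    """
    table = []
    for k in range(n_classes):
        lower = start + k * width
        upper = lower + width - 1
        freq = sum(1 for x in data if lower <= x <= upper)  # tally step
        table.append((lower, upper, freq))
    return table


table = frequency_table(scores, start=1, width=10, n_classes=10)
for lower, upper, freq in table:
    print(f"{lower:02d} up to {upper:<4d}{freq:>3d}")
print("Total", sum(freq for _, _, freq in table))
```

The printed frequencies should agree with Table 2.2 and add up to the 40 observations.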
2.2.1 Cumulative Curve (OGIVE)
The graph of the cumulative frequency of a single variable is called an OGIVE. It is drawn by
plotting the cumulative frequency against the upper class boundary of each class interval. From the
OGIVE it is possible to obtain the median, the quartiles and the inter-quartile range (IQR).
Example 2.2: Using the data in Example 2.1, construct the cumulative frequency curve.
Solution
Table 2.3
Diagram 2.1: OGIVE (cumulative frequency plotted against score)
Example 2.3
The following data represent the ages (in years) of people living in a housing estate in Ibadan.
30, 31 17 16 6 2 8 43 18 18 32 33
9 18 33 19 21 13 14 13 14 6 45 52 61
23 26 14 15 14 15 27 19 36 37 11 12
11 12 20 39 40 20 63 69 64 29 28 27
15
Present the above data in a frequency table using a suitable class interval.
Solution
Maximum value = 69
Minimum value = 2
Range = 69 – 2 = 67
A choice of 10 classes will result in some classes with zero frequencies while the choice of 6
classes is more reasonable with at least one item in each class. In practice, it is easy to determine
the number of classes for a given set of data. We are using K = 6 as our number of classes.
Class interval = R / K = 67 / 6 ≈ 11.17, which is taken here as a convenient width of 10.
Table 2.4
Class interval is a sub-division of the total range of values which a (continuous) variable
may take.
Class frequency is the number of observations of the variate which falls in a given
interval (column 3)
Relative frequency for a class is the actual frequency of the class divided by total
frequency. (Column 4). Sometimes, it is better to work with relative frequencies
[especially in the calculation of probability values].
Cumulative frequency of a class is the sum of all the frequencies before the class up to
and including the frequency of that class (column 5).
Relative Cumulative Frequency: When the cumulative frequency of a class is expressed as
a proportion of the total frequency, what we have is called the relative cumulative frequency
(column 6). It is sometimes called the distribution function.
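The four derived columns defined above can be computed directly from the class frequencies. The sketch below is an illustration only: it uses the frequencies of Table 2.2 (the mathematics scores) rather than Table 2.4, and the variable names are assumptions.

```python
from itertools import accumulate

# Class frequencies taken from Table 2.2 (scores of 40 students)
frequencies = [1, 2, 4, 5, 6, 9, 7, 3, 2, 1]
total = sum(frequencies)

# Relative frequency: class frequency divided by the total frequency
relative = [f / total for f in frequencies]

# Cumulative frequency: running total up to and including each class
cumulative = list(accumulate(frequencies))

# Relative cumulative frequency: cumulative frequency divided by the total
relative_cumulative = [c / total for c in cumulative]

for f, r, c, rc in zip(frequencies, relative, cumulative, relative_cumulative):
    print(f"{f:>3d} {r:6.3f} {c:>3d} {rc:6.3f}")
```

The last cumulative frequency equals the total frequency, so the last relative cumulative frequency is always 1.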
Exercise
Now answer the following questions from the table.
How many residents are aged between 11 and 30 years?
ii. What is the probability that a person selected at random from the Estate will be less than
31 years old?
Answers
(i) 29 (ii) 16 (iii) 0.68
them without detailed reference to the text, on the grounds that users may well pick things up
from the tables or graphs without reading the whole text. Below are some ways of presenting data.
2.3.1 Histogram
A histogram is a chart used for presenting the frequency distribution of the values of a variable
(assuming the variate is of a continuous type).
A histogram is a group of rectangles drawn above each class interval such that the area of each
rectangle is proportional to frequency of the observations falling in the corresponding class
interval. The chart is constructed by plotting the values of the variable along the X-axis and the
frequencies along the Y-axis.
Vertical lines are drawn at the lower and upper class boundary of each class up to the
frequencies. Horizontal lines representing the width of each class interval are then drawn on top
of each vertical line.
In a situation where the class intervals are not the same, the height must be adjusted so that the
area represents the frequency.
Draw the histogram of the data in Example 2.3 above.
Diagram 2.2: Histogram of Ages (horizontal axis: Ages in years; vertical axis: Frequency)
Constructing a Stemplot
To construct a stemplot, take note of the following steps:
The stemplot is drawn with two columns separated by a vertical line. The stems are listed to the
left of the vertical line. It is important that each stem is listed only once and that no numbers are
skipped, even if it means that some stems will have no leaves. The leaves are listed in increasing
order in a row to the right of each stem.
Example 2.4
Present the following data in a stem-and-leaf plot
68 66 72 75 76 106 54 57 56 63 59 66 68
64 88 84 81
Solution
Table 2.5
Stem | Leaf
   5 | 4 6 7 9
   6 | 3 4 6 6 8 8
   7 | 2 5 6
   8 | 1 4 8
   9 |
  10 | 6
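The construction steps can be sketched in a few lines of Python. This is an illustration under stated assumptions: the data are whole numbers, the last digit is used as the leaf, and stemplot is a hypothetical helper name. Every stem between the smallest and the largest is listed, even when it has no leaves.

```python
from collections import defaultdict


def stemplot(data):
    """Return the rows of a stem-and-leaf plot for whole-number data."""
    leaves = defaultdict(list)
    for x in sorted(data):            # sorting puts the leaves in increasing order
        stem, leaf = divmod(x, 10)    # last digit is the leaf, the rest is the stem
        leaves[stem].append(leaf)
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):   # no stem is skipped
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem:>2} | {row}".rstrip())
    return rows


# Data of Example 2.4
data = [68, 66, 72, 75, 76, 106, 54, 57, 56, 63, 59, 66, 68, 64, 88, 84, 81]
rows = stemplot(data)
print("\n".join(rows))
```

Each listed value contributes exactly one leaf, so counting the leaves is a quick check that no observation was lost.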
Example 2.5
Given the weight of 20 rams at the end of two weeks feeding on a special diet as follows:
46, 59, 35, 41, 46, 21, 24, 33, 40, 45, 49, 53, 48, 54, 61, 36, 70, 58, 47, 12
Make a stem plot for these data
Solution
The stem plot is given below
1 | 2
2 | 1 4
3 | 3 5 6
4 | 0 1 5 6 6 7 8 9
5 | 3 4 8 9
6 | 1
7 | 0
Important Features
Example 2.6
Suppose 20 cows were fed with the same special feed as in Example 2.5; the back-to-back stem
plot is shown below:
Table 2.6
Weight of Cow | Stem | Weight of Ram
            0 |  1   | 2
            1 |  2   | 1 4
            2 |  3   | 3 5 6
          3 1 |  4   | 0 1 5 6 6 7 8 9
        5 4 2 |  5   | 3 4 8 9
7 6 5 5 4 2 1 |  6   | 1
          4 2 |  7   | 0
              |  8   |
            1 |  9   |
Observations
NOTE: Stem plot works well for small set of data especially when the observations are all
greater than zero.
In-Text Question
The box plot chart is most useful when comparing two or more sets of sample data. True or False
In-Text Answer
True
In a box plot, the ends of the box are at the quartiles, so that the length of the box is the
inter-quartile range. The median is marked by a line within the box. The ‘whiskers’ are the two lines
outside the box that extend to the smallest and largest observations. Outliers are shown as dots
outside the whiskers.
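The quantities a box plot displays can be computed before any drawing is done. The sketch below uses the ram weights of Example 2.5. Quartiles of ungrouped data can be defined in several ways; here they are assumed to be the medians of the lower and upper halves of the ordered data, which is one common convention and not necessarily the one intended by the handout. The name five_number_summary is illustrative.

```python
from statistics import median


def five_number_summary(data):
    """Smallest value, first quartile, median, third quartile, largest value.

    Assumption: quartiles are the medians of the lower and upper halves of
    the ordered data (the middle value is left out when n is odd).
    """
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return {
        "min": ordered[0],
        "q1": median(lower),
        "median": median(ordered),
        "q3": median(upper),
        "max": ordered[-1],
    }


# Weights of the 20 rams in Example 2.5
rams = [46, 59, 35, 41, 46, 21, 24, 33, 40, 45,
        49, 53, 48, 54, 61, 36, 70, 58, 47, 12]

summary = five_number_summary(rams)
iqr = summary["q3"] - summary["q1"]   # length of the box
print(summary, iqr)
```

The ends of the box sit at q1 and q3, the line inside the box at the median, and the whiskers run out to min and max.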
Example 2.7
Consider the data in example 2.6 above. Construct the box plots.
Solution
Diagram 2.3: Boxplots of Weight of Ram (Fig. 1.2a) and Weight of Cow (Fig. 1.2b)
In a box plot, the center, the inter-quartile range, the spread are immediately apparent. However,
the box plot is generally inferior to the stem plot or histogram in that it shows only the center and
the partition values; it tells nothing about the shape of the distribution and other values in the
data set.
A stem plot (for large data set) provides a clearer display of a single distribution especially, when
accompanied by the median and quartile as numerical sign post.
Summary
In Study Session 2, you have learnt that:
1. It is necessary to present data in tables, charts and diagrams in order to have a clear
understanding of the data
2. The first step in examining intelligently a set of data for a single quantitative variable is
by constructing a frequency table
3. Frequency table is a tabular arrangement of data into various classes together with their
corresponding frequencies
4. The graph of the cumulative frequency of a single variable is called an OGIVE
5. Data can be presented in the text, in a table, or pictorially as a chart, diagram or graph.
6. Histogram is a chart used for presenting the frequency distribution of the values of a
variable
7. Stemplot is a graphical display of quantitative data that is useful in visualizing the shape
of distribution
8. Back-to-back stemplots are used to compare two distributions side-by-side.
9. Box Plot is useful when comparing two or more sets of sample data.
Self-Assessment Question (SAQs) for Study Session 2
Now that you have completed this study session, you can assess how well you have achieved its
Learning outcomes by answering the following questions. Write your answers in your study
Diary and discuss them with your Tutor at the next study Support Meeting. You can check your
answers with the Notes on the Self-Assessment questions at the end of this Module.
1. The first step in examining a set of data for a single quantitative variable is by
constructing a frequency table. True or False
2. What is a Frequency Table
Notes on SAQ
SAQ 2.1
They give a clear understanding of the data, and to illustrate the relationship existing between the
variables being examined.
SAQ 2.2
1. True
2. This is a tabular arrangement of data into various classes together with their
corresponding frequencies.
SAQ 2.3
i. Histogram
ii. Stemplot
iii. Boxplot
Study Session 3 Measure of the Centre of a Set of Observations
Introduction
The primary aim of any investigator is to obtain a simple summary value (average) that can be
used to describe all the observations in a set. Thus an average is a single value that can
represent all the observations in a distribution.
The most representative value is one that is at the center of the distribution. Such values are otherwise
referred to as measures of location or measures of central tendency.
3.1 Measures of Central Tendency
These are measures of the center of a distribution. They are single values that give a description
of the data. They are also referred to as measures of location. Some of them are the
arithmetic mean, mode, median, geometric mean and harmonic mean. We shall discuss them
one after the other. They are otherwise known as descriptive statistics.
In-Text Question
Measures of central tendency are multiple values that give a description of the data. True or
False
In-Text Answer
False
1. Be single-valued
2. Be algebraically tractable
3. Should consider every observed value
3.2 Mean
The average (arithmetic mean) of a set of observations is the sum of the observations divided by
the number of observations. Given n observations denoted by x1, x2, x3, ..., xn, the mean is
defined by
$$\bar{X} = \frac{1}{n}(x_1 + x_2 + x_3 + \cdots + x_n)$$
Or, in compact notation, it can be written as
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
The above formula is for a simple series and is most useful when few (n < 20) observations are
considered.
Example 3.1
Here are the ages of 15 students in a class 16, 18, 20, 21, 22, 19, 17, 18, 19, 17, 17, 18, 17, 17,
20. Calculate the mean.
Solution
The average age of the students is
$$\bar{X} = \frac{1}{n}\sum x_i = \frac{1}{15}[16 + 18 + 20 + \cdots + 17 + 20] = \frac{276}{15} = 18.4$$
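As a check on the arithmetic, the mean of the fifteen ages listed in Example 3.1 can be computed in a couple of lines. This is a minimal sketch; the variable names are illustrative.

```python
# Ages of the 15 students listed in Example 3.1
ages = [16, 18, 20, 21, 22, 19, 17, 18, 19, 17, 17, 18, 17, 17, 20]

total = sum(ages)              # sum of the observations
mean_age = total / len(ages)   # divided by the number of observations
print(total, len(ages), mean_age)
```

The fifteen listed ages add up to 276, which gives a mean of 18.4 years.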
In a grouped frequency table for a continuous variate, the Xi's are the centers of the class intervals (i.e. the average
of the upper and lower class boundaries of a class), otherwise known as class marks.
Example 3.2
Given the frequency distribution of a random variable X as follows:
Table 3.1
Group      Frequency
1 – 5          2
6 – 10         4
11 – 15        8
16 – 20        5
21 – 25        3
26 – 30        1
Total         23
Find the class mark of a particular class by adding the lower and upper class boundaries of the
class and divide by 2.
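Table 3.2
Group      Class mark (x)   Frequency (f)    fx
1 – 5            3               2             6
6 – 10           8               4            32
11 – 15         13               8           104
16 – 20         18               5            90
21 – 25         23               3            69
26 – 30         28               1            28
Total                           23           329
N = ∑f = 23
$$\bar{X} = \frac{\sum fx}{N} = \frac{329}{23} = 14.304$$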
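The grouped-data calculation can be reproduced as follows. This is a sketch using the classes and frequencies of Table 3.1; the variable names are assumptions made for the illustration.

```python
# Class limits and frequencies from Table 3.1
classes = [(1, 5), (6, 10), (11, 15), (16, 20), (21, 25), (26, 30)]
freqs = [2, 4, 8, 5, 3, 1]

# Class mark: midpoint of each class
marks = [(lower + upper) / 2 for lower, upper in classes]

n = sum(freqs)                                    # N = sum of f
sum_fx = sum(f * x for f, x in zip(freqs, marks))  # sum of f * x
mean = sum_fx / n
print(marks, n, sum_fx, round(mean, 3))
```

The midpoint of the class limits equals the midpoint of the class boundaries, so either may be used to obtain the class mark.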
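Using an assumed mean A, the mean is
$$\bar{X} = A + \frac{\sum d}{n}, \quad \text{where } d = X - A$$
If a constant factor C is used then
$$\bar{X} = A + \left(\frac{\sum U}{n}\right) C$$
and, for a frequency distribution,
$$\bar{X} = A + \left(\frac{\sum fU}{\sum f}\right) C, \quad \text{where } U = \frac{X - A}{C}$$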
Example 3.3
The exact pension allowance paid (in Naira) to 25 workers of a company is given in the table
below.
Table 3.3
Pension in ₦    No. of Persons (f)
25                    7
30                    5
35                    6
40                    4
45                    3
Calculate the mean using an assumed mean 35 and 5 as the common factor.
Solution:
Table 3.4
Pension in ₦    No. of Persons (f)    U = (X − A)/C     fU
25                    7                   -2            -14
30                    5                   -1             -5
35                    6                    0              0
40                    4                    1              4
45                    3                    2              6
Total                25                                  -9
Let A = 35, C = 5
$$\bar{X} = 35 + \left(\frac{-9}{25}\right) \times 5 = 33.20$$
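The assumed-mean calculation of Example 3.3 can be verified with the sketch below. The function name coded_mean is a hypothetical choice; the result should agree with the direct mean ∑fx / ∑f whatever values of A and C are chosen.

```python
def coded_mean(values, freqs, assumed_mean, factor):
    """Mean by the assumed-mean method: A + C * (sum of fU) / (sum of f)."""
    u = [(x - assumed_mean) / factor for x in values]   # U = (X - A) / C
    sum_fu = sum(f * ui for f, ui in zip(freqs, u))
    return assumed_mean + factor * sum_fu / sum(freqs)


# Data of Example 3.3: pension (in Naira) and number of persons
pension = [25, 30, 35, 40, 45]
persons = [7, 5, 6, 4, 3]

mean = coded_mean(pension, persons, assumed_mean=35, factor=5)

# Direct calculation, for comparison
direct = sum(f * x for f, x in zip(persons, pension)) / sum(persons)
print(mean, direct)
```

Both routes give 33.20; the assumed mean only simplifies the hand arithmetic.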
Example 3.4
Consider the data in example 2.3, using a suitable assumed mean and constant factor, compute
the mean.
Table 3.5
$$\bar{X} = 15 + \left(\frac{53}{50}\right) \times 10 = 15 + 10.6 = 25.6$$
Note:
It is always easier to select the class mark with the largest frequency as the assumed mean.
Merits
The mean is an average that considers all the observations in the data set. It is simple and easy to
compute and it is the most widely used average.
Demerits
Its value is greatly affected by extremely large or extremely small observations.
3.3 Median
The median is an average of position. It is the value of the variable that divides a distribution
into two equal parts when the values are arranged in order of magnitude.
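To compute the median of a distribution:
i. Arrange the observations in order of magnitude.
ii. If n is odd, the median $\tilde{X}$ is the $\left(\frac{n+1}{2}\right)$th item.
iii. If n is even, the median $\tilde{X}$ is the average of the two middle observations in the ordered
list, i.e.
$$\tilde{X} = \frac{X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}}{2}$$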
Example 3.5 (n is even)
The values of a random variable X are given as 11, 10, 13, 9, 13, 14, 16, and 20. Find the
median.
Solution
In an array: 9, 10, 11, 13, 13, 14, 16, 20. Since n = 8 is even,
$$\text{Median} = \tilde{X} = \frac{X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}}{2} = \frac{X_4 + X_5}{2} = \frac{13 + 13}{2} = 13$$
Example 3.6 (n is odd)
The values of a random variable X are given as 9, 7, 5, 20, 2, 12 and 1. Find the median.
In an array: 1, 2, 5, 7, 9, 12, 20
n = 7 is odd, therefore the median is
$$\tilde{X} = X_{\left(\frac{7+1}{2}\right)} = X_4 = 7$$
Note: The occurrence of 7 in the above example is just a coincidence; it could have been any
other value in the middle of the data set.
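The odd and even cases can be written as one short function. The sketch below is illustrative (the name median_of is an assumption) and is checked against Examples 3.5 and 3.6.

```python
def median_of(values):
    """Median of ungrouped data: middle item, or average of the two middle items."""
    ordered = sorted(values)          # arrange in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # the ((n + 1) / 2)th item, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2   # (n/2)th and (n/2 + 1)th items


even_case = median_of([11, 10, 13, 9, 13, 14, 16, 20])   # Example 3.5
odd_case = median_of([9, 7, 5, 20, 2, 12, 1])            # Example 3.6
print(even_case, odd_case)
```

Python lists are indexed from 0, which is why the code uses mid and mid - 1 where the formula uses the (n/2)th and (n/2 + 1)th items.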
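For grouped data, the median is obtained from
$$\tilde{X} = L_m + \left(\frac{\frac{N}{2} - Cf_b}{f_m}\right) w$$
where Lm = Lower limit (class boundary) of the median class
fm = Frequency of the median class
Cfb = Cumulative frequency of the class before the median class
w = Width of the median class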
Example 3.7
The table below shows the length of 100 rods (in inches) produced in a factory
Table 3.6
Length (inches)    Number of rods (f)
1 – 2                    1
3 – 4                    8
5 – 6                   26
7 – 8                   38
9 – 10                  19
11 – 12                  7
13 – 14                  1
Solution
The first thing to do is to obtain the cumulative frequency distribution as follow
Table 3.7
Class       f     Cumulative Frequency (cf)
1 – 2       1          1
3 – 4       8          9
5 – 6      26         35
7 – 8      38         73
9 – 10     19         92
11 – 12     7         99
13 – 14     1        100
i. Determine N/2 = 100/2 = 50; clearly the median value belongs to the class
(7 – 8).
ii. The lower class boundary (Lm) of the median class is 6.5.
$$\tilde{X} = 6.5 + \left(\frac{50 - 35}{38}\right) \times 2 = 6.5 + 0.789 = 7.289 \approx 7.29 \text{ (2 dp)}$$
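The steps of this solution (accumulate the frequencies, locate the class containing the N/2-th value, then interpolate within it) can be sketched as below. The function name grouped_median and the assumption that all classes share one width are choices made for this illustration.

```python
def grouped_median(lower_boundaries, freqs, width):
    """Median of grouped data: Lm + ((N/2 - cfb) / fm) * w.

    Assumes every class has the same width and that lower_boundaries holds
    the lower class boundary of each class, in increasing order.
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("no observations")
    target = n / 2
    cum_before = 0                        # cumulative frequency before the class
    for lower, f in zip(lower_boundaries, freqs):
        if cum_before + f >= target:      # this is the median class
            return lower + (target - cum_before) / f * width
        cum_before += f
    raise ValueError("median class not found")


# Example 3.7: lengths of 100 rods, classes 1-2, 3-4, ..., 13-14
boundaries = [0.5, 2.5, 4.5, 6.5, 8.5, 10.5, 12.5]
rods = [1, 8, 26, 38, 19, 7, 1]

median_length = grouped_median(boundaries, rods, width=2)
print(round(median_length, 2))
```

For Example 3.7 the loop stops at the class 7 – 8, where the cumulative frequency first reaches 50.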
Example 3.8
The following data represent the weight (in kg) of products manufactured in a factory.
Table 3.8
Weight        Number of Products
45 – 54              1
55 – 64              3
65 – 74              5
75 – 84             18
85 – 94             33
95 – 104            25
105 – 114           21
115 – 124           12
125 – 134            5
135 – 144            2
Calculate the median.
Solution
First obtain the cumulative frequency distribution as in Example 3.7.
The following can then be read from the cumulative frequency table:
N/2 = 125/2 = 62.5; cfb = 60
Lm = 94.5, fm = 25, w = 10 (i.e. 104.5 – 94.5)
$$\tilde{X} = 94.5 + \left(\frac{62.5 - 60}{25}\right) \times 10 = 94.5 + 1 = 95.5$$
Merit
1. It is easy to calculate
3. Its value is not affected by extreme values; thus it is a resistant measure of central
tendency.
Demerit
1. It does not take into consideration all the values of the variable.
3.4 Mode
The mode is the value of the variable that occurs most often in a set of data. It is the most
unstable measure of location. Unlike the arithmetic mean, it is not a unique measure of location. In
some cases it may not exist, and when it exists there may be more than one mode (e.g. a bimodal
distribution).
Let us see how the mode can be obtained from discrete data.
Example 3.9
Consider the data in Example 3.5. The modal value is 13, since it is the only value that occurs
twice.
Example 3.10
Consider the data in example 3.6.
The mode does not exist.
Example 3.11
From Example 2 the mode is X̂ = 2 i.e. the value with the highest frequency.
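For grouped data, the mode can be obtained
i. from the frequency curve, by finding the value at the highest point, or
ii. from the formula
$$\hat{X} = L_m + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) W$$
Where Lm = lower limit (class boundary) of the modal class.
Δ1 = difference between the frequency of the modal class and that of the class before it.
Δ2 = difference between the frequency of the modal class and that of the class above it.
W = width of the modal class.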
Example 3.12
From the data in Example 3.7
Calculate mode:
i. The modal class is the one with the highest frequency, i.e. (7 – 8).
ii. Lm = 6.5
Δ1 = 38 – 26 = 12
Δ2 = 38 – 19 = 19
w = 2
$$\hat{X} = 6.5 + \left(\frac{12}{12 + 19}\right) \times 2 = 6.5 + 0.774 = 7.27$$
Example 3.13
Also consider the data in Example 3.8; the mode is obtained as
$$\hat{X} = 84.5 + \left(\frac{15}{15 + 8}\right) \times 10 = 84.5 + 6.52 = 91.02$$
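The two worked examples above follow the same three steps: find the class with the highest frequency, take the differences with the neighbouring classes, and interpolate. A minimal sketch is given below; grouped_mode is a hypothetical name, a single modal class of the common width is assumed, and a missing neighbour (when the modal class is first or last) is treated as having frequency zero.

```python
def grouped_mode(lower_boundaries, freqs, width):
    """Mode of grouped data: Lm + (d1 / (d1 + d2)) * w."""
    m = max(range(len(freqs)), key=freqs.__getitem__)    # index of the modal class
    before = freqs[m - 1] if m > 0 else 0                 # assumed 0 if no class before
    after = freqs[m + 1] if m < len(freqs) - 1 else 0     # assumed 0 if no class after
    d1 = freqs[m] - before
    d2 = freqs[m] - after
    return lower_boundaries[m] + d1 / (d1 + d2) * width


# Example 3.7 data (rods) and Example 3.8 data (weights)
rods_lower = [0.5, 2.5, 4.5, 6.5, 8.5, 10.5, 12.5]
rods = [1, 8, 26, 38, 19, 7, 1]
weights_lower = [44.5 + 10 * k for k in range(10)]
weights = [1, 3, 5, 18, 33, 25, 21, 12, 5, 2]

mode_rods = grouped_mode(rods_lower, rods, width=2)
mode_weights = grouped_mode(weights_lower, weights, width=10)
print(round(mode_rods, 2), round(mode_weights, 2))
```

When two classes tie for the highest frequency, this sketch simply reports the first one, which is one reason the mode is described as not unique.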
Merit
1. The mode is easily understood by many people.
2. It is easy to calculate.
Demerit
1. It is not a unique measure of location.
2. It presents a misleading picture of the distribution.
3. It does not take into account all the available data
4. It is the most ideal measure of location when the distribution is highly skewed. e.g.
distribution of wages of workers in a factory.
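$$Q_1 = l_{q1} + \left(\frac{\frac{N}{4} - Cf_{b1}}{f_{q1}}\right) w_1 \quad \text{for grouped data}$$
Where lq1 = Lower limit of the quartile one class
fq1 = Frequency of the q1 class
Cfb1 = Cumulative frequency before the q1 class
w1 = Width of the q1 class
$$Q_3 = l_{q3} + \left(\frac{\frac{3N}{4} - Cf_{b3}}{f_{q3}}\right) w_3$$
Where lq3 = Lower limit of the quartile three class
fq3 = Frequency of the q3 class
Cfb3 = Cumulative frequency before the q3 class
w3 = Width of the q3 class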
Midrange
The value halfway between the smallest and the largest observations in a set of data is called the
midrange or range midpoint. It is obtained by adding the smallest and the largest observations together and
dividing the result by 2.
Example 3.14
Find the midrange of the following data: 1, 5, 7, 15, 12, 9, 7,
Solution
Smallest observation = 1
Largest observation = 15
$$\text{Midrange} = \frac{15 + 1}{2} = 8$$
Example 3.15
Find the midrange of the following data representing the number of children in 12 households in
Agbowo area of Ibadan.
4, 2, 1, 0, 2, 6, 2, 3, 5, 1,
Solution
$$\text{Midrange} = \frac{6 + 0}{2} = 3$$
Usefulness
Information on midrange of temperature reading by Meteorologists is used by visitors in the
tourism industry.
Limitations
It takes into account only the extreme observation.
Geometric Mean
Given observations X1, X2, ..., Xn of a random variable X, the geometric mean, denoted by GM, is
defined as the nth root of the product of the n observations in the set, i.e.
$$GM = \sqrt[n]{X_1 \cdot X_2 \cdots X_n}$$
Example 3.16
Find the geometric mean of the data in Example 3.14.
Solution
$$GM = \sqrt[7]{1 \cdot 5 \cdot 7 \cdot 15 \cdot 12 \cdot 9 \cdot 7} = \sqrt[7]{396900} = 6.31$$
Example 3.17
Obtain the geometric mean of the data in Example 3.15
Solution
$$GM = \sqrt[10]{4 \cdot 1 \cdot 0 \cdots 5 \cdot 1}$$
Usefulness
Geometric mean is very useful in the computation of rates and indices e.g. Computation of price
indices, etc.
Limitation
1. It cannot be calculated when the value zero is one of the observation to be used.
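The definition and the limitation can both be seen in a short sketch. The function below is illustrative (geometric_mean is an assumed name for a hand-written helper); it refuses data containing zero, and as an extra assumption also refuses negative values, for which the nth root of the product is not generally meaningful.

```python
import math


def geometric_mean(values):
    """nth root of the product of n strictly positive observations."""
    if any(x <= 0 for x in values):
        raise ValueError("geometric mean needs strictly positive observations")
    return math.prod(values) ** (1 / len(values))


# Data of Example 3.14
gm = geometric_mean([1, 5, 7, 15, 12, 9, 7])   # 7th root of 396900
print(round(gm, 2))

# Data listed in Example 3.15 contain a zero, so the function raises an error
try:
    geometric_mean([4, 2, 1, 0, 2, 6, 2, 3, 5, 1])
except ValueError as error:
    print("not defined:", error)
```

For long data sets the product can become very large; averaging the logarithms and exponentiating the result is a common alternative that gives the same value.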
Harmonic Mean
Given the observations x1, x2, ..., xn of a random variable X, the harmonic mean, denoted by
HM, is defined as the reciprocal of the mean of the reciprocals of the observations, i.e.
$$HM = \frac{1}{\frac{1}{n}\sum_{i=1}^{n} \frac{1}{x_i}}$$
Example 3.18
Find the harmonic mean of the data in Example 3.14.
Solution
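$$HM = \frac{1}{\frac{1}{7}\left(\frac{1}{1} + \frac{1}{5} + \frac{1}{7} + \frac{1}{15} + \frac{1}{12} + \frac{1}{9} + \frac{1}{7}\right)} = 4.01$$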
Example 3.19
Find the harmonic mean of the data in Example 3.15.
Solution
$$HM = \frac{1}{\frac{1}{10}\left(\frac{1}{4} + \frac{1}{1} + \frac{1}{0} + \cdots + \frac{1}{1}\right)}$$
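The harmonic mean can be checked numerically with the sketch below. The helper name harmonic_mean_of is an assumption; the function rejects data containing zero, because the reciprocal 1/0 is undefined.

```python
def harmonic_mean_of(values):
    """Reciprocal of the mean of the reciprocals of the observations."""
    if any(x == 0 for x in values):
        raise ValueError("harmonic mean is undefined when an observation is zero")
    n = len(values)
    return n / sum(1 / x for x in values)    # same as 1 / ((1/n) * sum of 1/x)


# Data of Example 3.14
hm = harmonic_mean_of([1, 5, 7, 15, 12, 9, 7])
print(round(hm, 2))
```

For the data of Example 3.14 the sum of the reciprocals is about 1.7468, so the harmonic mean evaluates to about 4.01.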
Limitations
Deciles are those values that divide a distribution into ten equal parts. They are denoted by
D1, D2, D3, ..., D9.
For grouped data, decile two (D2) is defined as
$$D_2 = L_{D_2} + \left(\frac{\frac{2N}{10} - cfb_{D_2}}{f_{D_2}}\right) w_2$$
where
LD2 = Lower limit of the decile two class
fD2 = Frequency of the decile two class
cfbD2 = Cumulative frequency before the decile two class
w2 = Width of the decile two class
Percentiles are those values that divide a distribution into one hundred equal parts. They are
denoted by P1, P2, P3, ..., P99. For a grouped distribution the 65th percentile is defined as
$$P_{65} = L_{P_{65}} + \left(\frac{\frac{65N}{100} - cfb_{P_{65}}}{f_{P_{65}}}\right) w$$
where
LP65 = Lower limit of the 65th percentile class
fP65 = Frequency of the 65th percentile class
cfbP65 = Cumulative frequency before the 65th percentile class
w = Width of the 65th percentile class
Example 3.20
Consider the data in Example 3.8
Calculate the i. first quartile (q1)
ii. third quartile (q3)
iii. 4th Decile (D4)
iv. 45th Percentile (P45)
Solution
From the table in Example 3.8
Table 3.9
Class          f      cf
45 – 54        1       1
55 – 64        3       4
65 – 74        5       9
75 – 84       18      27
85 – 94       33      60
95 – 104      25      85
105 – 114     21     106
115 – 124     12     118
125 – 134      5     123
135 – 144      2     125
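i. $$q_1 = L_{q_1} + \left(\frac{\frac{N}{4} - cfb_{q_1}}{f_{q_1}}\right) w = 84.5 + \left(\frac{31.25 - 27}{33}\right) \times 10 = 84.5 + 1.29 = 85.79$$
ii. $$q_3 = L_{q_3} + \left(\frac{\frac{3N}{4} - cfb_{q_3}}{f_{q_3}}\right) w = 104.5 + \left(\frac{93.75 - 85}{21}\right) \times 10 = 104.5 + 4.17 = 108.67$$
iii. $$D_4 = L_{D_4} + \left(\frac{\frac{4N}{10} - cfb_{D_4}}{f_{D_4}}\right) w = 84.5 + \left(\frac{50 - 27}{33}\right) \times 10 = 84.5 + 6.97 = 91.47$$
iv. $$P_{45} = L_{P_{45}} + \left(\frac{\frac{45N}{100} - cfb_{P_{45}}}{f_{P_{45}}}\right) w = 84.5 + \left(\frac{56.25 - 27}{33}\right) \times 10 = 84.5 + 8.86 = 93.36$$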
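Quartiles, deciles and percentiles of grouped data all use the same interpolation; only the fraction of N changes. The sketch below makes that explicit. It is an illustration, not the handout's own method: partition_value is a hypothetical name, all classes are assumed to share one width, and the 4th decile is taken as the value below which 4/10 of the observations fall.

```python
def partition_value(lower_boundaries, freqs, width, fraction):
    """Value below which the given fraction of the grouped observations fall.

    Uses L + ((fraction * N - cfb) / f) * w, where L, f and cfb refer to the
    class in which the cumulative frequency first reaches fraction * N.
    """
    n = sum(freqs)
    target = fraction * n
    cum_before = 0
    for lower, f in zip(lower_boundaries, freqs):
        if f > 0 and cum_before + f >= target:
            return lower + (target - cum_before) / f * width
        cum_before += f
    raise ValueError("fraction must lie between 0 and 1")


# Table 3.9: weights of 125 products, classes 45-54, 55-64, ..., 135-144
lower = [44.5 + 10 * k for k in range(10)]
freqs = [1, 3, 5, 18, 33, 25, 21, 12, 5, 2]

q1 = partition_value(lower, freqs, 10, 1 / 4)       # first quartile
q3 = partition_value(lower, freqs, 10, 3 / 4)       # third quartile
d4 = partition_value(lower, freqs, 10, 4 / 10)      # 4th decile
p45 = partition_value(lower, freqs, 10, 45 / 100)   # 45th percentile
print(round(q1, 2), round(q3, 2), round(d4, 2), round(p45, 2))
```

Since the 4th decile is the 40th percentile, it must lie at or below the 45th percentile, which gives a quick sanity check on the results.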
Summary
In Study Session 3, you have learnt that:
1. Measures of central tendency are single values that give a description of the data.
2. The arithmetic mean (average) of a set of observations is the sum of the observations
divided by the number of observations
3. The mean is an average that considers all the observations in the data set
4. The median is an average of position.
5. Median is a good measure of location in a skewed distribution.
6. The Mode is the value of the variable that occurs most often in a set of data
7. The mode is not a unique measure of location
8. The partition values are: the quartile, deciles and percentiles.
SAQ 3.3 (Tests Learning Outcomes 3.3)
What is the formula for the Calculation of Median From a grouped data?
1. Define Mode
2. Give three demerits of the mode
Notes on SAQ
SAQ 3.1
SAQ 3.2
1. The arithmetic mean of a set of observations is the sum of the observations divided by the
number of observations
2. $$\bar{X} = \frac{\sum_{i=1}^{K} f_i X_i}{N}; \quad N = \sum_{i=1}^{K} f_i$$
SAQ 3.3
$$\tilde{X} = L_m + \left(\frac{\frac{N}{2} - Cf_b}{f_m}\right) w$$
SAQ 3.4
1. The Mode is the value of the variable that occurs most often in a set of data. It is the most
unstable measure of location
2.
SAQ 3.5
SAQ 3.6
Geometric mean is very useful in the computation of rates and indices e.g. Computation of price
indices
Study Session 4 Measures of Dispersion/Variation
Introduction
Dispersion/Variation is the degree of scatter or variation of the individual values of a variable about a
central value such as the median or the mean.
In this Study Session we shall discuss the range, the semi-interquartile range, the mean deviation from
the mean and from the median, the variance and the standard deviation.
51
Box 4.1: Definition of Variation
Variation can be defined as a way to show how data is dispersed, or spread out.
Several measures of variation are used in statistics; these will be discussed in the course of this
study session.
The range is thus a measure which is very easy to determine and use. The range is reasonably efficient for
small samples (n < 10); for larger samples it is not a good measure, as it ignores all the values in between.
It is commonly used in statistical quality control.
However, the range may fail to discriminate if the distributions are of different types.
Semi-Interquartile Range: this is half the difference between the third and first quartiles. It is a good
measure of spread for the midrange and the quartiles.
$$S.I.R = \frac{Q_3 - Q_1}{2}$$
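Mean deviation from the mean, for a simple series:
$$MD = \frac{\sum_{i=1}^{N} |X_i - \bar{X}|}{N}$$
For grouped data:
$$MD_{\bar{X}} = \frac{\sum_{i=1}^{N} f_i\,|X_i - \bar{X}|}{\sum_{i=1}^{N} f_i}$$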
Example 4.1
Below are the ages of 10 heads of household randomly selected from a community:
54, 59, 35, 41, 46, 25, 47, 60, 54, 46
Find the (i) Range (ii) Mean (iii) Mean deviation from the mean (iv) Mean deviation
from the median.
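Solution
i. Range = 60 – 25 = 35
ii. Mean: $\bar{X} = \dfrac{\sum X}{n} = \dfrac{54 + 59 + \dots + 46}{10} = \dfrac{467}{10} = 46.7$
iii. Mean deviation from the mean: $MD_{\bar{X}} = \dfrac{\sum |X_i - \bar{X}|}{n} = \dfrac{81}{10} = 8.10$
iv. Array: 25, 35, 41, 46, 46, 47, 54, 54, 59, 60
Median: $\tilde{X} = \dfrac{X_{(n/2)} + X_{(n/2+1)}}{2} = \dfrac{46 + 47}{2} = 46.5$
Mean deviation from the median:
$MD_{\tilde{X}} = \dfrac{|54 - 46.5| + |59 - 46.5| + \dots + |46 - 46.5|}{10}$
$= \dfrac{7.5 + 12.5 + 11.5 + 5.5 + 0.5 + 21.5 + 0.5 + 13.5 + 7.5 + 0.5}{10} = \dfrac{81}{10} = 8.1$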
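The calculations in Example 4.1 can be checked with a short script. This is only a sketch using Python's standard library; the helper name `mean_deviation` is an illustrative choice, not something defined in this handout.

```python
from statistics import mean, median

# Data from Example 4.1
ages = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46]


def mean_deviation(values, centre):
    """Average absolute deviation of the values from a chosen centre."""
    return sum(abs(x - centre) for x in values) / len(values)


data_range = max(ages) - min(ages)        # largest minus smallest value
x_bar = mean(ages)                        # arithmetic mean
x_med = median(ages)                      # middle value of the ordered array
md_mean = mean_deviation(ages, x_bar)     # mean deviation from the mean
md_median = mean_deviation(ages, x_med)   # mean deviation from the median

print(data_range, x_bar, x_med, round(md_mean, 2), round(md_median, 2))
```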
Example 4.2
The table below shows the frequency distribution of the scores of 42 students in STA 111 test.
Table 4.1
Scores     No. of Students (f)
0 – 10     2
10 – 20    5
20 – 30    8
30 – 40    12
40 – 50    9
50 – 60    5
60 – 70    1
Find the mean deviation from the mean for the data.
Solution
Table 4.2
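$\bar{X} = \dfrac{\sum fX}{\sum f} = \dfrac{1450}{42} = 34.52$, where X is the class mid-point.
Mean deviation: $MD_{\bar{X}} = \dfrac{\sum f\,|X - \bar{X}|}{\sum f} = \dfrac{465.76}{42} = 11.089$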
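As a check on Example 4.2, the grouped mean and mean deviation can be computed directly from Table 4.1. The sketch below assumes each class is represented by its mid-point; the variable names are illustrative.

```python
# Class limits and frequencies from Table 4.1
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
freqs = [2, 5, 8, 12, 9, 5, 1]

# Represent every class by its mid-point
mids = [(lo + hi) / 2 for lo, hi in classes]
n = sum(freqs)

# Grouped mean: sum of f * X divided by sum of f
grouped_mean = sum(f * x for f, x in zip(freqs, mids)) / n

# Grouped mean deviation: sum of f * |X - mean| divided by sum of f
grouped_md = sum(f * abs(x - grouped_mean) for f, x in zip(freqs, mids)) / n

print(round(grouped_mean, 2), round(grouped_md, 2))
```

Because the script keeps the unrounded mean, its result may differ from a hand calculation in the third decimal place.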
4.4 The Variance
The variance of a set of observations is the average of the squared deviations from the mean.
Let $x_1, x_2, x_3, \dots, x_n$ be a random sample from a population. The sample variance $S^2$ is
defined as:
$$S^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2, \quad \text{where } \bar{X} = \frac{\sum X_i}{n}$$
for discrete data or a simple series, and
$$S^2 = \frac{\sum f_i (X_i - \bar{X})^2}{\sum f_i}$$
for grouped data.
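Another formula for calculating the variance can be derived from the above as follows:
$$S^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2$$
$$nS^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} \left(X_i^2 - 2\bar{X}X_i + \bar{X}^2\right)$$
$$= \sum X_i^2 - 2\bar{X}\sum X_i + n\bar{X}^2$$
$$= \sum X_i^2 - 2n\bar{X}^2 + n\bar{X}^2 \quad (\text{since } \sum X_i = n\bar{X})$$
$$= \sum X_i^2 - n\bar{X}^2$$
Therefore
$$S^2 = \frac{1}{n}\sum X_i^2 - \bar{X}^2$$
and, for grouped data,
$$S^2 = \frac{\sum f_i X_i^2}{\sum f_i} - \bar{X}^2$$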
In-Text Question
Standard deviation could be defined as?
In-Text Answer
b.) The square root of the variance
$$S = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}} \quad \text{or} \quad S = \sqrt{\frac{\sum X_i^2}{n} - \bar{X}^2}$$
Example 4.2
Consider the data in example 4.1, calculate the standard deviation and coefficient of variation.
Solution:
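i. Standard deviation: $S = \sqrt{\dfrac{\sum (X - \bar{X})^2}{n}} = \sqrt{\dfrac{1076.1}{10}} = 10.37$
ii. Coefficient of variation: $C.V = \dfrac{S}{\bar{X}} \times 100 = \dfrac{10.37}{46.7} \times 100 = 22.21$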
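The standard deviation and coefficient of variation for the data of Example 4.1 can be verified as follows. The sketch computes the variance both from the definition and from the shortcut formula derived in Section 4.4, dividing by n as this handout does; the variable names are illustrative.

```python
from math import sqrt

# Data from Example 4.1
ages = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46]
n = len(ages)
x_bar = sum(ages) / n

# Definition: average squared deviation from the mean (divisor n)
variance = sum((x - x_bar) ** 2 for x in ages) / n

# Shortcut: mean of the squares minus the square of the mean
variance_short = sum(x * x for x in ages) / n - x_bar ** 2

sd = sqrt(variance)
cv = sd / x_bar * 100  # coefficient of variation, in percent

print(round(variance, 2), round(sd, 2), round(cv, 2))
```

Both variance formulas should agree up to floating-point rounding.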
Comparison of Dispersion: Comparison of two distributions with different means and unit of
measurement is done using the coefficient of variation.
i.e. $C.V = \dfrac{S}{\bar{X}} \times 100$
The distribution with the smaller C.V is said to be better, in the sense that it is less variable relative to its mean.
4.6 Coding Method
This method is used when large values of the variable are involved in the calculation.
It is carried out by choosing one of the values (or class marks) as the assumed mean (A) and
determining the common factor (C). The values of the variable $X_i$ (or class marks) are transformed
using the code:
$$U_i = \frac{X_i - A}{C}$$
Thus the formula for calculating the variance becomes
$$S^2 = C^2\left[\frac{\sum f_i U_i^2}{\sum f_i} - \left(\frac{\sum f_i U_i}{\sum f_i}\right)^2\right]$$
Example 4.3
Given the following grouped data, compute the (i) mean, (ii) standard deviation and
(iii) coefficient of variation, using an assumed mean of 77 and 5 as the common factor.
Table 4.3
Class f
50 – 54 1
55 – 59 2
60 – 64 10
65 – 69 12
70 – 74 18
75 – 79 25
80 – 84 9
85 – 89 6
90 – 94 4
95 – 99 3
Total 90
Solution
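Table 4.5
Class     Class mark (X)   f    U = (X – 77)/5   fU    fU²
50 – 54   52               1    –5               –5    25
55 – 59   57               2    –4               –8    32
60 – 64   62               10   –3               –30   90
65 – 69   67               12   –2               –24   48
70 – 74   72               18   –1               –18   18
75 – 79   77               25   0                0     0
80 – 84   82               9    1                9     9
85 – 89   87               6    2                12    24
90 – 94   92               4    3                12    36
95 – 99   97               3    4                12    48
Total                      90                    –40   330
A = 77, C = 5
$\bar{X} = A + C\,\dfrac{\sum fU}{\sum f} = 77 + 5\left(\dfrac{-40}{90}\right) = 77 - 2.22 = 74.78$
$S^2 = C^2\left[\dfrac{\sum fU^2}{\sum f} - \left(\dfrac{\sum fU}{\sum f}\right)^2\right] = 25\left[\dfrac{330}{90} - \left(\dfrac{-40}{90}\right)^2\right] = 25(3.667 - 0.198) = 86.73$
$S = \sqrt{S^2} = \sqrt{86.73} = 9.31$
Coefficient of variation: $C.V = \dfrac{S}{\bar{X}} \times 100 = \dfrac{9.31}{74.78} \times 100 = 12.45$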
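The coding method can be cross-checked against a direct calculation. The sketch below builds the totals from the frequencies in Table 4.3, applies the coded formulas with A = 77 and C = 5, and compares them with the mean and variance computed from the class marks themselves. Variable names are illustrative.

```python
from math import sqrt

# Class limits and frequencies from Table 4.3
limits = [(50, 54), (55, 59), (60, 64), (65, 69), (70, 74),
          (75, 79), (80, 84), (85, 89), (90, 94), (95, 99)]
freqs = [1, 2, 10, 12, 18, 25, 9, 6, 4, 3]
A, C = 77, 5  # assumed mean and common factor

marks = [(lo + hi) / 2 for lo, hi in limits]   # class marks
u = [(x - A) / C for x in marks]               # coded values U = (X - A) / C
n = sum(freqs)

sum_fu = sum(f * ui for f, ui in zip(freqs, u))
sum_fu2 = sum(f * ui ** 2 for f, ui in zip(freqs, u))

# Coded formulas for the mean and variance
coded_mean = A + C * sum_fu / n
coded_var = C ** 2 * (sum_fu2 / n - (sum_fu / n) ** 2)

# Direct calculation from the class marks, for comparison
direct_mean = sum(f * x for f, x in zip(freqs, marks)) / n
direct_var = sum(f * (x - direct_mean) ** 2 for f, x in zip(freqs, marks)) / n

coded_sd = sqrt(coded_var)
coded_cv = coded_sd / coded_mean * 100
print(n, sum_fu, sum_fu2, round(coded_mean, 2), round(coded_sd, 2), round(coded_cv, 2))
```

Agreement between the coded and direct results is a useful check that the factor C² has been applied to the variance.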
Summary
In Study Session 4, you have learnt that:
Self-Assessment Question (SAQs) for Study Session 4
Now that you have completed this study session, you can assess how well you have achieved its
Learning outcomes by answering the following questions. Write your answers in your study
Diary and discuss them with your Tutor at the next study Support Meeting. You can check your
answers with the Notes on the Self-Assessment questions at the end of this Module.
Notes on SAQ
SAQ 4.1
Variation can be defined as a way to show how data is dispersed, or spread out.
SAQ 4.2
This is the simplest measure of variation. It is the difference between the largest and the smallest
value in a set of data.
SAQ 4.3
$$MD_{\bar{X}} = \frac{\sum_{i=1}^{N} f_i\,|X_i - \bar{X}|}{\sum_{i=1}^{N} f_i}$$
SAQ 4.4
The variance of a set of observations is the average of the squared deviation from the mean
SAQ 4.5
a.) The root mean squared deviation from the mean (RMSD)
SAQ 4.6
This method is used when larger values of the variable are involved in calculation.
Study Session 5 Algebraic Treatment of Mean and Variance
Introduction
It is sometimes necessary to adjust the values of the mean and variance to correct for mistakes. It may also be
desired to combine these statistics without recourse to the individual observations of the variable.
The various methods of doing this will be discussed in this study session.
Given two sets of data consisting of $n_1$ and $n_2$ items, with means $\bar{X}_1$ and $\bar{X}_2$ and variances $S_1^2$
and $S_2^2$ respectively, the combined mean is defined by
$$\bar{X}_{12} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2}$$
and the combined variance by
$$\sigma_{12}^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2 + n_1(\bar{X}_1 - \bar{X}_{12})^2 + n_2(\bar{X}_2 - \bar{X}_{12})^2}{n_1 + n_2 - 2}$$
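Suppose we have k variables and, for i = 1, 2, 3, …, k,
$n_i$ = number of observations in variable i,
$\bar{X}_i$ = mean of variable i,
$S_i^2$ = variance of variable i.
Then the combined mean is
$$\bar{X}_{12 \dots k} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2 + \dots + n_k\bar{X}_k}{n_1 + n_2 + \dots + n_k} = \frac{\sum_{i=1}^{k} n_i\bar{X}_i}{\sum_{i=1}^{k} n_i}$$
and the combined variance is
$$\sigma_{12 \dots k}^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2 + \dots + (n_k - 1)S_k^2 + n_1(\bar{X}_1 - \bar{X}_{12 \dots k})^2 + \dots + n_k(\bar{X}_k - \bar{X}_{12 \dots k})^2}{n_1 + n_2 + \dots + n_k - k}$$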
Example 5.1
The means and standard deviations of two variables of 100 and 150 items are 50 and 5, and 40 and 6,
respectively. Find the standard deviation of all the 250 items taken together.
Solution
Example 5.2
A survey was conducted at three locations in a community to study a single variable. At each
location, the sample size ($n_i$), the mean ($\bar{X}_i$) and the standard deviation ($\sigma_i$) are given in the
following table.
Table 5.1
Location       I     II    III
$n_i$          200   250   300
$\bar{X}_i$    95    10    15
$\sigma_i$     3     4     5
Obtain the combined mean and standard deviation for the variable in all the three locations
Solution
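$$\bar{X}_{123} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2 + n_3\bar{X}_3}{n_1 + n_2 + n_3} = \frac{200(95) + 250(10) + 300(15)}{750} = \frac{26000}{750} = 34.67$$
$$\sigma_{123}^2 = \frac{(n_1 - 1)\sigma_1^2 + (n_2 - 1)\sigma_2^2 + (n_3 - 1)\sigma_3^2 + n_1(\bar{X}_1 - \bar{X}_{123})^2 + n_2(\bar{X}_2 - \bar{X}_{123})^2 + n_3(\bar{X}_3 - \bar{X}_{123})^2}{n_1 + n_2 + n_3 - 3}$$
$$= \frac{13250 + 996166.67}{747} = 1351.29$$
$$\sigma_{123} = \sqrt{1351.29} = 36.76$$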
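The combined mean and variance formulas of this session can be wrapped in one small function and applied to Examples 5.1 and 5.2. This is a sketch of the formula as stated in this handout (divisor: total sample size minus the number of groups); the function name `combine_groups` is an illustrative choice. Small differences from hand-worked figures can arise from rounding of intermediate values.

```python
from math import sqrt


def combine_groups(sizes, means, sds):
    """Combined mean and variance of k groups from their summary statistics."""
    k = len(sizes)
    total = sum(sizes)
    mean_all = sum(n * m for n, m in zip(sizes, means)) / total
    # Variation within the groups: sum of (n_i - 1) * S_i^2
    within = sum((n - 1) * s ** 2 for n, s in zip(sizes, sds))
    # Variation of the group means about the combined mean
    between = sum(n * (m - mean_all) ** 2 for n, m in zip(sizes, means))
    var_all = (within + between) / (total - k)
    return mean_all, var_all


# Example 5.1: two variables with 100 and 150 items
mean_51, var_51 = combine_groups([100, 150], [50, 40], [5, 6])

# Example 5.2: three locations (Table 5.1)
mean_52, var_52 = combine_groups([200, 250, 300], [95, 10, 15], [3, 4, 5])

print(round(mean_51, 2), round(sqrt(var_51), 2))
print(round(mean_52, 2), round(sqrt(var_52), 2))
```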
5.2 Adjusting Values of Mean and Standard Deviations for Mistakes
Sometimes mistakes occur in the computation of mean and variance of a set of data when a
correct value in the original data is replaced by an incorrect one. Instead of going through the
entire process to correct such mistakes, some simple algebraic adjustment can be made as shown
in the following examples.
Example 5.3
The mean and standard deviation of a set of 100 observations were worked out as 40 and 5
respectively by a student who by mistake took the value 50 in place of 40 for one observation.
Recalculate the correct mean and standard deviation.
Solution
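$n = 100$; $\bar{X} = 40$; $\sigma^2 = 25$
$\bar{X} = \dfrac{\sum X}{n}$, so $40 = \dfrac{\sum X}{100}$
Incorrect: ∑X = 4000
Correct: ∑X = 4000 – 50 + 40 = 3990
Corrected mean $\bar{X} = \dfrac{3990}{100} = 39.90$
$\sigma^2 = \dfrac{\sum X^2}{n} - \bar{X}^2$, so $25 = \dfrac{\sum X^2}{100} - 40^2$
2500 = ∑X² – 160,000
Incorrect: ∑X² = 162,500
Correct: ∑X² = 162,500 – 50² + 40² = 161,600
Correct $\sigma^2 = \dfrac{161600}{100} - (39.90)^2 = 1616 - 1592.01 = 23.99$
Correct $\sigma = \sqrt{23.99} = 4.90$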
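The adjustment in Example 5.3 follows a fixed recipe: recover ∑X and ∑X² from the reported mean and standard deviation, swap the wrong value for the right one, and recompute. A minimal sketch, with illustrative variable names:

```python
from math import sqrt

n = 100
wrong_mean, wrong_sd = 40, 5          # values worked out by the student
wrong_value, right_value = 50, 40     # 50 was used in place of 40

# Recover the sums implied by the reported mean and standard deviation
sum_x = n * wrong_mean                           # sum of X
sum_x2 = n * (wrong_sd ** 2 + wrong_mean ** 2)   # sum of X squared

# Remove the wrong observation and put in the right one
corrected_sum_x = sum_x - wrong_value + right_value
corrected_sum_x2 = sum_x2 - wrong_value ** 2 + right_value ** 2

corrected_mean = corrected_sum_x / n
corrected_var = corrected_sum_x2 / n - corrected_mean ** 2
corrected_sd = sqrt(corrected_var)

print(corrected_mean, round(corrected_var, 2), round(corrected_sd, 2))
```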
Study session 6: Measure of Skewness and Kurtosis
Introduction
A fundamental task in many statistical analyses is to characterize the location and variability of a
data set. A further characterization of the data includes skewness and kurtosis.
In this Study session, you will learn the definition of skewness and kurtosis, you will also learn
how to calculate measure of skewness and kurtosis from simple series and grouped data.
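For a simple series, the measure of skewness is
$$\alpha_3 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^3}{(N - 1)s^3}$$
and for grouped data
$$\alpha_3 = \frac{\sum f_i (X_i - \bar{X})^3}{(N - 1)s^3}$$
where $\bar{X}$ is the mean, s is the standard deviation, and N is the number of data points.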
The skewness for a normal distribution is zero, and any symmetric data should have a skewness
near zero.
Negative values for the skewness indicate data that are skewed left, and
positive values for the skewness indicate data that are skewed right. By skewness to the left,
we mean that the left tail is long relative to the right tail. Similarly, skewness to the right means
that the right tail is long relative to the left tail.
Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution.
That is, data sets with high kurtosis tend to have a distinct peak near the mean, decline rather
rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the
mean rather than a sharp peak. A uniform distribution would be the extreme case. Kurtosis is
the standardized fourth central moment of a distribution.
The histogram is an effective graphical technique for showing both the skewness and kurtosis of
data set.
For univariate data $X_1, X_2, \dots, X_N$, the formula for kurtosis is:
For discrete data:
$$\alpha_4 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^4}{(N - 1)s^4}$$
For grouped data:
$$\alpha_4 = \frac{\sum f_i (X_i - \bar{X})^4}{(N - 1)s^4}$$
where $\bar{X}$ is the mean, s is the standard deviation, and N is the number of data points.
In-text Question
_____________ is a measure of symmetry, or more precisely, the lack of symmetry?
a) Skewness
b) Kurtosis
c) Grouped data
d) Sample series
In-text Answer
a) Skewness
6.2 Calculating measure of skewness and kurtosis from simple series and
grouped data
Excess Kurtosis: The kurtosis for a standard normal distribution is three. For this reason,
excess kurtosis is defined as follows.
For discrete data:
$$K = \alpha_4 - 3 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^4}{(N - 1)s^4} - 3$$
For grouped data:
$$K = \alpha_4 - 3 = \frac{\sum f_i (X_i - \bar{X})^4}{(N - 1)s^4} - 3$$
The standard normal distribution has excess kurtosis of zero. Positive excess kurtosis indicates a
“peaked” distribution and negative excess kurtosis indicates a “flat” distribution.
In-text Question
A distribution, or data set, is symmetric if it looks the same to the left and right of the center
point. True/False
a) False
b) True
c) None of the above
d) All of the above
In-text Answer
b) True
[Figure: three frequency curves labelled A, B and C]
A -------------- Leptokurtic
B -------------- Mesokurtic (normal)
C -------------- Platykurtic
Example 6.1
Twelve numbers generated by a computer are as follows:
10, 43, 67, 89, 70, 80, 62, 80, 03, 42, 71, 35
a. Obtain the measures of skewness and kurtosis.
b. Interpret your result.
Solution
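Table 6.1
Columns: $X$, $X_i - \bar{X}$, $(X_i - \bar{X})^2$, $(X_i - \bar{X})^3$, $(X_i - \bar{X})^4$
$\bar{X} = \dfrac{\sum X}{n} = \dfrac{652}{12} = 54.3$
$S = \sqrt{\dfrac{\sum (X_i - \bar{X})^2}{N - 1}} = \sqrt{\dfrac{8516.68}{11}} = 27.825$
Skewness: $\alpha_3 = \dfrac{\sum (x_i - \bar{x})^3}{(N - 1)S^3} = \dfrac{-145673.444}{11 \times (27.825)^3} = \dfrac{-145673.444}{236972.6385} = -0.6147$
That is, the distribution is negatively skewed.
Kurtosis: $\alpha_4 = \dfrac{\sum_{i=1}^{N} (x_i - \bar{x})^4}{(N - 1)S^4} = \dfrac{13445494.99}{11 \times (27.825)^4} = \dfrac{13445494.99}{6593763.668} = 2.039$
Excess kurtosis: $K = \alpha_4 - 3 = 2.039 - 3 = -0.961$
i.e. platykurtic.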
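The measures in Example 6.1 can be reproduced with the sketch below, which uses the formulas of this session (divisor N – 1 throughout). The function name `shape_measures` is illustrative. The script keeps the unrounded mean (54.33...), so its results differ slightly from a hand calculation based on the rounded mean 54.3.

```python
from math import sqrt

# Data from Example 6.1
values = [10, 43, 67, 89, 70, 80, 62, 80, 3, 42, 71, 35]


def shape_measures(data):
    """Return mean, standard deviation, skewness and kurtosis (divisor N - 1)."""
    n = len(data)
    x_bar = sum(data) / n
    s = sqrt(sum((x - x_bar) ** 2 for x in data) / (n - 1))
    skew = sum((x - x_bar) ** 3 for x in data) / ((n - 1) * s ** 3)
    kurt = sum((x - x_bar) ** 4 for x in data) / ((n - 1) * s ** 4)
    return x_bar, s, skew, kurt


x_bar, s, skew, kurt = shape_measures(values)
excess = kurt - 3  # excess kurtosis relative to the normal distribution

print(round(x_bar, 1), round(s, 3), round(skew, 3), round(kurt, 3), round(excess, 3))
```

A negative skewness and a negative excess kurtosis lead to the same interpretation as in the worked solution: negatively skewed and platykurtic.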
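In-text Question
The kurtosis for a standard normal distribution is three. For this reason, excess kurtosis is
defined as ____________ ?
a) $\alpha_4 = \dfrac{\sum_{i=1}^{N} (x_i - \bar{x})^4}{(N - 1)S^4}$
b) $\alpha_3 = \dfrac{\sum (x_i - \bar{x})^3}{(N - 1)S^3}$
c) $\bar{X} = \dfrac{\sum X}{n} = \dfrac{652}{12} = 54.3$
d) $K = \alpha_4 - 3 = \dfrac{\sum_{i=1}^{N} (X_i - \bar{X})^4}{(N - 1)s^4} - 3$
In-text Answer
d) $K = \alpha_4 - 3 = \dfrac{\sum_{i=1}^{N} (X_i - \bar{X})^4}{(N - 1)s^4} - 3$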
Table 6.2
Class f
10-14 1
15-19 4
20-24 8
25-29 19
30-34 35
35-39 20
40-44 7
45-49 5
50-54 1
Solution
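Diagram 6.2: Histogram of the distribution in Table 6.2 (vertical axis: frequency, 0 to 40; horizontal axis: class boundaries 9.5, 14.5, 19.5, 24.5, 29.5, 34.5, 39.5, 44.5, 49.5, 54.5)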
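Table 6.3
Class   Mid-point $X_i$   f   $fX_i$   $X_i - \bar{X}$   $(X_i - \bar{X})^2$   $(X_i - \bar{X})^3$   $(X_i - \bar{X})^4$
10-14   12                1   12       -20.1             404.01                -8120.6               163224.04
$\sum fX_i = 3210$
$\sum f = 100$
$\bar{X} = \dfrac{\sum fX_i}{\sum f} = \dfrac{3210}{100} = 32.1$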
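Table 6.4
Columns: $f(X_i - \bar{X})^2$, $f(X_i - \bar{X})^3$, $f(X_i - \bar{X})^4$
$S = \sqrt{\dfrac{\sum f(X_i - \bar{X})^2}{N - 1}} = \sqrt{\dfrac{5299}{99}} = \sqrt{53.52} = 7.316$
Skewness: $\alpha_3 = \dfrac{\sum f(x_i - \bar{x})^3}{(N - 1)S^3} = \dfrac{910.257}{99 \times 391.58} = 0.0235$
Kurtosis: $\alpha_4 = \dfrac{\sum f(x_i - \bar{x})^4}{(N - 1)S^4} = \dfrac{949317.96}{99 \times 2864.8} = \dfrac{949317.96}{283615.2} = 3.35$
Excess kurtosis: $K = \alpha_4 - 3 = 0.35$
Since skewness = 0.0235, kurtosis = 3.35 and excess kurtosis = 0.35,
the distribution is near normal. The small positive excess kurtosis indicates a slightly peaked
distribution, i.e. leptokurtic.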
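For grouped data the same measures are obtained by weighting each class mid-point by its frequency. The sketch below applies this to Table 6.2, assuming each class is represented by its mid-point; variable names are illustrative.

```python
from math import sqrt

# Class mid-points and frequencies from Table 6.2
mids = [12, 17, 22, 27, 32, 37, 42, 47, 52]
freqs = [1, 4, 8, 19, 35, 20, 7, 5, 1]

n = sum(freqs)
g_mean = sum(f * x for f, x in zip(freqs, mids)) / n


def weighted_power_sum(power):
    """Sum of f * (X - mean)^power over all classes."""
    return sum(f * (x - g_mean) ** power for f, x in zip(freqs, mids))


sum_sq = weighted_power_sum(2)
g_sd = sqrt(sum_sq / (n - 1))
g_skew = weighted_power_sum(3) / ((n - 1) * g_sd ** 3)
g_kurt = weighted_power_sum(4) / ((n - 1) * g_sd ** 4)
g_excess = g_kurt - 3

print(g_mean, round(g_sd, 3), round(g_skew, 4), round(g_kurt, 2), round(g_excess, 2))
```

A skewness close to zero together with a small positive excess kurtosis points to a nearly normal, slightly peaked distribution.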
SAQ 6.1-6.2
1. Consider the data in post test question 3 in chapter 4, obtain the measure of skewness and
kurtosis.
2. Consider the data in post test question 1 in chapter 4, obtain a measure of Excess
Kurtosis and interpret your results.
3. Consider the post test question 1 in chapter 5, calculate the measure of Skewness and
interpret your result.
Study Session 7: Methods of Collecting Statistical Data
Introduction
In the previous session, you learnt the various ways in which a set of data can be summarized
and calculated some descriptive statistics, examined the shape and how summaries can be
combined and corrected for errors.
In this Study session, you will learn about the various methods that can be employed in the
collection of statistical data.
Methods of Data Collection
There are five broad methods of data collection. They are:
An enquiry concerned with the leisure activities of a town's population may well begin by getting
statistical data about the use made of the local libraries, attendance at cinemas, and membership of
clubs and societies.
A mass of information on the topics commonly studied in social surveys is available in historical
documents, statistical reports, records of institutions and other surveys.
Some materials are collected in the form of case records by psychiatrists, social workers etc., which
are of interest to sociologists and psychologists. Such materials have limitations for the
research worker in that they can only represent a highly specialized population, i.e. only the cases
that happen to come before social workers.
There are personal documents which can come directly from the informants such as diaries,
autobiographies and surveys. These give insight into personal character, experiences and beliefs
that formal interviewing can hardly achieve.
The possibility of any investigator bias affecting their contents is eliminated. The use of this
method has many difficulties, e.g.:
a. How to get the documents.
b. How to get a representative collection of documents.
Some people are better at writing letters and essays than others, not everybody can produce
documents, and documents are at their best when unsolicited. The method of data collection is
usually by copying out the relevant data from the records available.
An example where this method is suitable is in the case of traffic censuses. Actual measurement
or counting also comes under the heading of observations. Examples occur in statistical quality
control.
Problems
i. If the characteristics of the population are to be inferred from those of sample, the sample
should ideally be randomly selected.
ii. To instruct an investigator to observe people of all types, men and women of different
ages, social class etc. does not make the sample a random one. It does not ensure that the
resultant group is representative.
iii. The observer can hardly be expected to observe and note everything relevant to the
subject.
iv. His selection of the aspects of behaviour and environment which he notes may follow
certain channels.
v. If what he is studying is very familiar, he may fail to note the normal, etc.
In-text Question
___________ as a method of data collection is defined as the accurate watching and noting of
phenomena as they occur in nature, and is the classic method of scientific enquiry.
a) Problems
b) Merits
c) Observation
d) Demerits
In-text Answer
c) Observation
Merits
The advantages of this method are similar to personal interview and the method has some unique
advantages such as:
i. Providing more reliable information.
ii. Supplying of additional and necessary information
Demerits
The disadvantages are also similar to personal interview.
i. It is exceptionally certified.
ii. Highly trained personnel are needed for observation.
iii. Because of scrutiny, it is time consuming.
3. Mail or Postal Questionnaires: This is one of the most widely used methods of data
collection mostly in social surveys. Questionnaires are mailed out to respondents who in turn are
expected to send them back through the post when they are duly completed. The choice of this
method is governed by:
a. Limited resources
b. Economic advantages
c. Potential efficiency.
In-text Question
_________ is one of the most widely used methods of data collection mostly in social surveys.
a) Diary
b) Telephone
c) Interview
d) Mail or Postal Questionnaires
In-text Answer
d) Mail or Postal Questionnaires
Merits
i. It is generally quicker and cheaper than other methods.
ii. It avoids the problems associated with the use of interviewers.
iii. It is useful when information concerning several members of household is required and
allows for some intra-household consultation.
iv. It is useful where the questions demand considered answers rather than immediate
ones.
v. Questions of personal or embarrassing nature are answered more willingly and accurately
than when the respondents are together with the interviewer; who is a complete stranger
to them.
vi. The problem of non-contacts in the sense of respondent not being at home is avoided.
Demerits
i. The method can only be considered when the questions are sufficiently simple and
straight forward to be understood with the help of the printed instructions and definitions.
It is unsuitable where the objectives of the survey take a good deal of explanation.
ii. The answers to mail questionnaire have to be accepted as final. There is no opportunity
to probe beyond the given answers.
iv. The answers cannot be treated as independent since the respondent can see all the
questions before answering any of them.
Some of the disadvantages of this method can be overcome by combining it with interview
method.
4. Personal Interview: This is the method used in most surveys. It could be a
formal interview, in which set questions are asked and the answers recorded in a standard form, or
a less formal one, in which the interviewer is at liberty to vary the sequence of questions, to
explain their meanings or to change the wording, or may not have a set of questions
at all but only a number of key points around which to build the interview.
The interviewer should possess some vital qualities such as (a) Honesty, (b) Interest (c)
Accuracy (d) Adaptability (e) Personality and temperament (f) Intelligence and education.
Merits
i. The interviewer is free and has more opportunity to restructure questions whenever it is
necessary to do so.
ii. It allows more accurate information to be obtained by asking the respondent for further
explanation.
iii. A skilled interviewer can easily persuade an unwilling respondent. This will increase the
number of responses.
iv. A skilled interviewer will know when to make call backs and then make more effective
efforts.
v. In addition to recording verbal answers, the interviewer can note the non-verbal reactions
of respondents to questions.
vi. It can be used for persons of all educational levels.
vii. It can be used to explore areas in which little information exists.
In-text Question
___________ is the method that is used mainly in most surveys?
a) Intelligence
b) Personal interview
c) Adaptability
d) All of the above
In-text Answer
b) Personal interview
Demerits
i. Personal interviews are expensive to conduct if the sample to be taken is widely scattered
geographically.
iii. The respondent, in order to boost his image or to please the interviewer, may give biased
answers.
iv. It may be difficult to interview some individuals, such as high-income and influential
people, who are not always available.
v. If recalls are necessary, and when the sample is large, it will take more time than
necessary to complete the survey.
vi. Respondents may give inaccurate or false information due to lapse in memory,
misunderstanding or may be deliberate.
5. Telephone: This is the method of collecting data through the telephone. Like other methods, it
has many advantages, especially in industrialized countries. In a developing country like
Nigeria, this method of collecting information cannot be efficient because of the inefficiency
of the telephone system.
Merits
vii. It facilitates recording of replies without causing any embarrassment to the respondent.
Apart from the fact that the telephone system is not effective in a developing country and
therefore renders the method unsuitable, it has other demerits.
Demerits
b. If the population is widely located all over the country, cost consideration will limit
extensive coverage of the country.
d. Cost consideration may restrict the number of questions asked or the time given to the
respondents to answer the questions.
e. Answers given may not be treated in confidence, as the telephone could be bugged or
the conversation eavesdropped on.
In-text Question
Collection of data through phone like other method has many advantages. True\ False
In-text Answer
c) True
SAQ 7.1-7.2
1. What is statistical data collection?
Study Session 8: Regression Analysis
Introduction
In this study session, you will be introduced to the theory of linear regression analysis. Different types of
relationships will be shown on the scatter plot and the estimated parameters of the model will be
obtained by the method of least squares. An introduction to the test of significance of the regression
line will also be given.
The variable being predicted is usually referred to as the response (dependent) variable and the
other variable is called the explanatory (independent) variable. The values of the explanatory
variable are usually fixed and under the control of the investigator while the values of the
response variable are determined by the values of the explanatory variables.
Thus regression analysis attempts to determine how changes in the explanatory variable affect
the response variable. The variables involved are assumed to be measured and recorded as
interval scaled or ratio scale data. If the variables are strictly qualitative (i.e. attributes) the
method of regression cannot be used.
The appropriate method used in studying association between two qualitative variables will be
discussed.
In-Text Question
The values of the explanatory variable are usually fixed. True or False
In-Text Answer
True
A regression model is simple if there is only one explanatory variable, and multiple if there
is more than one explanatory variable.
A regression model is linear if its parameters do not contain any exponents and are not
multiplied by other parameters in the model; otherwise, the model is said to be non-linear. The
value of the highest power in a model is called the order of the model.
Scatter Plot
The first step in the study of the relationship between two variables is to draw the scatter diagram.
It portrays the direction, form and strength of any relationship between quantitative variables. It
is drawn by plotting the values of the response variable (on the Y-axis) against the values of the
explanatory variable (on the X-axis).
The shape of the scattered points on the graph gives an idea of the type of relationship between
the two variables.
Figure 8.1: Some of the common types of relationship that exist between two variables (panels (a) to (d), each with Y on the vertical axis and X on the horizontal axis)
Figure 8.1(a) depicts a linear relationship.
Figure 8.1(b) and (c) depict non-linear relationships: Figure 8.1(b) is a quadratic relationship
while Figure 8.1(c) is an exponential relationship. Figure 8.1(d) shows no
relationship between the variables X and Y (i.e. a spurious relationship), since neither a line
nor a curve can be fitted to the scatter plot.
In-Text Question
The following are types of regression models except _______________
A. Simple
B. Multiple
C. Short
D. Non-Linear
In-Text Answer
C. Short
The simple linear regression model is
$$Y_i = \beta_0 + \beta_1 X_i + e_i, \quad i = 1, 2, \dots, n$$
where
$\beta_0$ is the intercept (the point at which the regression line cuts the Y-axis, i.e. the value of Y when
X = 0);
$\beta_1$ is the slope (regression coefficient) of the line. It gives the rate of change in Y per unit
change in X;
$e_i$ is the random error term, distributed with mean 0 and variance $\sigma_e^2$. The parameters
of the model can be estimated by the method of Ordinary Least Squares (OLS).
Taking expectations of the model,
$$E(Y_i) = E(\beta_0 + \beta_1 X_i + e_i) = \beta_0 + \beta_1 X_i$$
Similarly,
$$Var(Y_i) = Var(e_i) = \sigma_e^2$$
where $\sigma_e^2$ can be estimated by
$$S^2 = \frac{\sum (Y_i - \hat{Y}_i)^2}{n - 1}$$
8.4 Estimation of Parameters
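Suppose there are n pairs of observations on X and Y:
$(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$.
The assumed linear relationship is
$$Y_i = a + bX_i + e_i \qquad (8.1)$$
where a and b are estimates of $\beta_0$ and $\beta_1$ in the original model. Equation (8.1) can be
expressed as
$$e_i = Y_i - a - bX_i \qquad (8.2)$$
Let
$$Q = \sum e_i^2 = \sum (Y_i - a - bX_i)^2 \qquad (8.3)$$
Setting the partial derivatives of Q with respect to a and b equal to zero gives
$$\frac{\partial Q}{\partial a} = -2\sum (Y_i - a - bX_i) = 0 \qquad (8.4)$$
$$\frac{\partial Q}{\partial b} = -2\sum (Y_i - a - bX_i)X_i = 0 \qquad (8.5)$$
The normal equations 8.4 and 8.5 can be solved simultaneously to obtain
$$a = \bar{Y} - b\bar{X} \qquad (8.6)$$
and
$$b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{SS_{XY}}{SS_{XX}} \qquad (8.7)$$
The standard error of the estimate is
$$S_e = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n - 1}} = \sqrt{\frac{\sum (Y_i - \hat{Y}_i)^2}{n - 1}} \qquad (8.9)$$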
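where $Y_i$ is the observed value of Y and $\hat{Y}_i$ is the corresponding fitted value.
The variance of the intercept is
$$Var(\hat{\beta}_0) = \frac{\sigma_e^2 \sum X_i^2}{n\,S_{XX}} \qquad (8.10)$$
and
the variance of the regression coefficient is
$$Var(\hat{\beta}_1) = \frac{\sigma_e^2}{S_{XX}} \qquad (8.11)$$
where $S_{XX} = SS_{XX} = \sum X^2 - n\bar{X}^2$.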
Coefficient of Determination
This is the proportion of variation in the response variable (Y) that is explained by the
explanatory variable (X).
It is defined by
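$$R^2 = \frac{SS_{Regression}}{SS_{Total}} = \frac{\hat{\beta}_1\left(\sum_{i=1}^{n} XY - n\bar{X}\bar{Y}\right)}{\sum_{i=1}^{n} Y^2 - n\bar{Y}^2} \qquad (8.12)$$
$$= \frac{\text{Explained variation}}{\text{Total variation}}$$
where $0 \le R^2 \le 1$.
$R^2 = 0$ when $b = 0$, and $R^2 = 1$ when all the points fall on the fitted regression line.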
This can be done by examining the residual plot, i.e. the plot of the residual errors $e_i = Y_i - \hat{Y}_i$ against
the data points.
The most objective method is to arrange the sums of squares and cross products in an Analysis
of Variance (ANOVA) table, and carry out Fisher's test (F-test) or Student's t-test as follows:
Table 8.1
ANOVA TABLE
SOURCE   df      SS                                      MS   F-cal
Total    n – 1   $\sum Y_i^2 - n\bar{Y}^2 = SS_Y$
where $V_1 = K - 1$; $V_2 = n - K - 1$.
$\alpha$ = 0.05 or 0.01
Decision rule
Reject $H_0$ if $F_{cal} > F_{V_1, V_2, \alpha}$ at the $\alpha$ level of significance and conclude there is enough evidence to
Example 8.1
The table below shows the weight losses in kilograms (Y) of a sample of persons and the
number of months (X) they have been on a special weight-reducing diet.
Table 8.2
Y 4 17 14 1 11 22 9 12 4 7
X 7 32 26 1 20 34 17 21 5 12
Solution
Diagram 8.2: Scatter plot of weight loss (vertical axis, 0 to 30) against number of months (horizontal axis, 0 to 40)
Table 8.3
Y    X    XY   X²   Y²   $\hat{Y}$   $(Y - \hat{Y})^2$
4    7    28   49   16   4.159       0.02528
1    1    1    1    1    0.763       0.0561
4    5    20   25   16   3.027       0.9467
$\bar{Y} = 10.1$
$\bar{X} = 17.5$
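b. Regression of Y on X: $\hat{Y} = a + bX$
$b = \dfrac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \dfrac{646.5}{1142.5} = 0.566$
$a = \bar{Y} - b\bar{X} = 10.1 - 0.566(17.5) = 0.197$ (using the unrounded value of b)
$\hat{Y} = 0.197 + 0.566X$
c. a = 0.197 means that when X = 0, Y = 0.197.
b = 0.566 implies that for every month spent on the special weight-reducing diet, there is
an average loss of 0.57 kilogramme in weight.
d. $\hat{Y} = 0.197 + 0.566(27) = 0.197 + 15.28 = 15.48$ kg.
e. $S_e = \sqrt{\dfrac{\sum_{i=1}^{n} e_i^2}{n - 1}} = \sqrt{\dfrac{\sum (Y_i - \hat{Y}_i)^2}{n - 1}}$
From the working table, $\sum (Y_i - \hat{Y}_i)^2 = 11.0688$, so
$S_e = \sqrt{\dfrac{11.0688}{9}} = 1.109$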
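The least squares estimates in Example 8.1 can be reproduced with equations 8.6 and 8.7. The sketch below uses the data of Table 8.2 and follows this handout in dividing the residual sum of squares by n – 1 (many texts divide by n – 2 for simple linear regression, which is noted here only as a caution). Variable names are illustrative.

```python
from math import sqrt

# Data from Table 8.2: weight loss (Y, kg) and months on the diet (X)
y = [4, 17, 14, 1, 11, 22, 9, 12, 4, 7]
x = [7, 32, 26, 1, 20, 34, 17, 21, 5, 12]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Corrected sums of cross products and squares
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
ss_xx = sum(xi ** 2 for xi in x) - n * x_bar ** 2

b = ss_xy / ss_xx          # slope, equation 8.7
a = y_bar - b * x_bar      # intercept, equation 8.6

# Residual sum of squares and standard error of the estimate
fitted = [a + b * xi for xi in x]
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
se = sqrt(sse / (n - 1))   # divisor n - 1, as in equation 8.9

prediction = a + b * 27    # predicted weight loss after 27 months

print(round(a, 3), round(b, 3), round(se, 3), round(prediction, 2))
```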
Example 8.2
A quality control manager collects 10 samples of iron rods from the production line at regular
intervals of time. Each time, the average length (Y) and diameter (X) of the rods are measured.
The results are given below.
Table 8.4
Average Diameter (X) in mm   Average Length (Y) in cm
18.1                         8.8
23.0                         9.5
17.5                         8.9
20.2                         9.1
14.7                         8.6
13.8                         8.3
15.1                         8.5
13.8                         8.2
16.1                         9.4
12.6                         7.2
$\sum X = 164.9$, $\sum Y = 86.5$
$\bar{X} = 16.49$, $\bar{Y} = 8.65$
$\sum X^2 = 2813.85$, $\sum Y^2 = 752.25$
$\sum XY = 1441.85$
Hypothesis
$H_0: \beta_1 = 0$
$H_1: \beta_1 \ne 0$
$\alpha = 0.05$
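a. $b = \dfrac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \dfrac{15.465}{94.649} = 0.163$
$a = \bar{Y} - b\bar{X} = 8.65 - 0.163(16.49) = 5.96$
$\hat{Y} = 5.96 + 0.163X$
b. $SS_{TOTAL} = \sum (Y - \bar{Y})^2 = \sum Y_i^2 - n\bar{Y}^2 = 752.25 - 10(8.65)^2 = 4.025$
$SS_{Trt} = b\left(\sum XY - n\bar{X}\bar{Y}\right) = 0.163(15.465) = 2.52$
Table 8.5
ANOVA
Source       df   SS      MS      Fc
Regression   1    2.52    2.52    13.40
Error        8    1.505   0.188
TOTAL        9    4.025
$F_{0.95,\,1,\,8} = 5.32$
Conclusion: Since $F_c > F_{0.95,\,1,\,8}$,
we reject $H_0$ and conclude that there is enough evidence that the diameter influences the
length of the rods at the 5% level of significance.
c. $S.E(b) = \sqrt{\dfrac{MSE}{\sum (X - \bar{X})^2}} = \sqrt{\dfrac{MSE}{\sum X^2 - n\bar{X}^2}} = \sqrt{\dfrac{0.188}{94.649}} = 0.045$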
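The F-test of Example 8.2 can be sketched as follows. The script computes the sums from Table 8.4 directly, so its figures may differ slightly in the last decimal from the hand-worked values. The critical value 5.32 is the tabulated value quoted in this example, entered by hand rather than computed; variable names are illustrative.

```python
from math import sqrt

# Data from Table 8.4: average diameter (X, mm) and average length (Y, cm)
x = [18.1, 23.0, 17.5, 20.2, 14.7, 13.8, 15.1, 13.8, 16.1, 12.6]
y = [8.8, 9.5, 8.9, 9.1, 8.6, 8.3, 8.5, 8.2, 9.4, 7.2]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

ss_xx = sum(xi ** 2 for xi in x) - n * x_bar ** 2
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
ss_total = sum(yi ** 2 for yi in y) - n * y_bar ** 2

b = ss_xy / ss_xx
a = y_bar - b * x_bar

ss_reg = b * ss_xy              # regression sum of squares, 1 df
ss_err = ss_total - ss_reg      # error sum of squares, n - 2 df
mse = ss_err / (n - 2)
f_cal = ss_reg / mse            # mean square regression / mean square error

f_crit = 5.32                   # tabulated F(0.95; 1, 8) quoted in the example
reject_h0 = f_cal > f_crit

se_b = sqrt(mse / ss_xx)        # standard error of the slope

print(round(b, 3), round(a, 2), round(f_cal, 2), reject_h0, round(se_b, 3))
```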
SAQ 8.1-8.5
1. A test was performed to determine the relationship between the chemical content (Y)
of a particular solution and the crystallization temperature (X) in degrees. The following
quantities were calculated.
$\sum X_i^2 = 8800$, $\sum X_iY_i = 4300$, $\sum Y_i^2 = 2620$
Assuming a linear relationship $Y_i = \beta_0 + \beta_1 X_i + e_i$:
a. Calculate the least squares estimates of $\beta_0$ and $\beta_1$, each correct to two significant
figures.
b. Test the significance of the fitted model at the 5% level of significance.
c. Obtain the standard error of each parameter in the model.
d. A previous similar exercise with n = 1.5 shows a regression coefficient $\beta_1$ of 0.10
with a standard error of 0.008. Test the hypothesis that the slope of your regression
model is the same as that of the previous exercise at the 5% level of significance.
2. Twelve students took two papers in the same subject and the marks in percentages were
as follows:
S/No. 1 2 3 4 5 6 7 8 9 10 11 12
Paper I 65 73 42 52 84 60 70 79 60 83 57 7
Paper II 78 88 60 73 92 77 84 89 70 99 73 8
3. A random sample of ten families had the following income and food expenditure (in N
per week).
Families A B C D E F G H I J
Family Income 20 30 33 40 15 13 26 38 35 43
Family 7 9 8 11 5 4 8 10 9 10
Expenditure
a. Estimate the regression line of food expenditure on income and interpret your results.
b. Obtain the regression line of income on food expenditure and interpret the result.
4. The following results have been obtained from a sample of 11 observations on the value
of sales (Y) of a firm and the corresponding prices (X).
$\bar{X} = 519.18$, $\bar{Y} = 217.82$, $\sum X^2 = 3134543$, $\sum X_iY_i = 1296836$,
$\sum Y^2 = 539512$
a. Estimate the regression line of sales on price and interpret the results.
b. What is the part of the variation in sales which is not explained by the regression line?
5. The following table includes the gross national product (X) and the demand for food (Y)
measured in arbitrary units, in an underdeveloped country over the ten year period 1960
– 1969.
1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
Y 6 7 8 10 8 9 10 9 11 10
X 50 52 55 59 57 58 62 65 68 70
108
References
Brookes C B and Dick W.F (1969): An Introduction to Statistical Method .Second Edition
Published by H.E.B.Paperback.
Clarke G.M.and Cooke D (1993): A Basic Course in Statistics. Third edition London: Arnold &
Stoughton
Cyprian A. Oyeka (1990): An Introduction to Applied Statistical Method in the Sciences. Enugu:
Nobern Avocation publishing coy.
Gupta, C. B. (1973)“An Introduction to Statistical Methods” New Delhi: Vikas Publishing House
PVT Ltd.
Moore D.S and Mc Cabe G.P (1993): Introduction to the Practise of Statistics, Second Edition.
New Delhi: W.H. Freeman and coy.
Study Session 9: Correlation and Association
Introduction
In Study Session 8, you learnt how to measure the direction and strength of the
relationship between the explanatory variable and the response variable for the purpose of
predicting one from the other. In this study session, you will learn how to measure the relation
or association between two variables without distinguishing between them, and without the
aim of prediction.
9.1 Correlation
Correlation refers to the relationship or association between two or more variables, while the
correlation coefficient is a quantity that measures the strength of the linear relationship between
two quantitative variables. The measure of relationship between two attributes (qualitative
variables) is usually referred to as association. This will be discussed in the next section.
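$$r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{\left(\sum X^2 - n\bar{X}^2\right)\left(\sum Y^2 - n\bar{Y}^2\right)}} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$$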
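If r is positive, x and y are said to be directly or positively correlated and the regression line is
upward sloping on the scatter plot; r = +1 indicates a perfect positive linear relationship.
If r is negative, x and y are said to be inversely or negatively correlated and the regression line is
downward sloping on the scatter plot; r = -1 indicates a perfect negative linear relationship.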
$$b = \frac{S_{XY}}{S_X^2} \quad \text{and} \quad r = \frac{S_{XY}}{S_X S_Y}$$
Therefore
$$r = b\,\frac{S_X}{S_Y} = b\sqrt{\frac{S_X^2}{S_Y^2}}$$
$\sum XY - n\bar{X}\bar{Y} = 646.5$
$\sum X^2 - n\bar{X}^2 = 1142.5$
$\sum Y^2 - n\bar{Y}^2 = 376.9$
$$r = \frac{646.5}{\sqrt{(1142.5)(376.9)}} = \frac{646.5}{656.21} = 0.985$$
Alternatively,
$$r = b\sqrt{\frac{S_X^2}{S_Y^2}} = (0.566)\,\frac{33.80}{19.41} = 0.985$$
where $33.80 = \sqrt{1142.5}$ and $19.41 = \sqrt{376.9}$.
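The two routes to r used above can be checked with a short Python sketch. It starts from the three corrected sums quoted in the example (646.5, 1142.5 and 376.9); the function and variable names are illustrative assumptions and are not part of the handout.

```python
import math

# Corrected sums of products and squares quoted in the worked example
s_xy = 646.5    # sum(XY)  - n * mean(X) * mean(Y)
s_xx = 1142.5   # sum(X^2) - n * mean(X)^2
s_yy = 376.9    # sum(Y^2) - n * mean(Y)^2

def pearson_r(s_xy, s_xx, s_yy):
    """Product moment correlation coefficient from the corrected sums."""
    return s_xy / math.sqrt(s_xx * s_yy)

def slope(s_xy, s_xx):
    """Least squares regression coefficient b of Y on X."""
    return s_xy / s_xx

r_direct = pearson_r(s_xy, s_xx, s_yy)
b = slope(s_xy, s_xx)
# Second route: r = b * sqrt(S_X^2 / S_Y^2)
r_from_slope = b * math.sqrt(s_xx / s_yy)

print(round(r_direct, 3), round(b, 3), round(r_from_slope, 3))
```

Both routes give the same value because the second is only an algebraic rearrangement of the first.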
To obtain the rank correlation coefficient, the observed values of the variables are replaced by
their respective ranks, in either ascending or descending order of magnitude.
The coefficient of rank correlation is given by
$$R = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$
where d is the difference between the ranks of each pair of observations and n is the number of
pairs. R lies in the range $-1 \le R \le 1$ and its interpretation is the same as for the product
moment correlation coefficient.
If there are ties, the average of the tied ranks is assigned to each of the units involved.
Example 9.3
Two judges were asked to assess twelve contestants in a beauty contest. The twelve
contestants were ranked according to their performance as follows:
Table 9.1
Contestant   1   2   3   4   5   6   7   8   9  10  11  12
Judge A     11   9   7  10   5   1   4  12   8   3   2   6
Judge B      5   7  11  12   6   4   8   9  10   2   1   3
Table 9.2
n = 12
Contestant   1   2   3   4   5   6   7   8   9  10  11  12
d            6   2  -4  -2  -1  -3  -4   3  -2   1   1   3
d^2         36   4  16   4   1   9  16   9   4   1   1   9
$\sum d^2 = 110$
$$R = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(110)}{12(144 - 1)} = 1 - 0.385 = 0.615$$
Comment: There is moderate positive agreement between the opinions of the two judges.
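The arithmetic of this example can be verified with a minimal Python sketch that recomputes the rank differences, their squares and R directly from the ranks in Table 9.1. The function name is an illustrative assumption; the formula is the one given earlier in this section and applies when there are no ties.

```python
def spearman_from_ranks(rank_a, rank_b):
    """Return (sum of d^2, R) using R = 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(rank_a)
    d_squared = [(a - b) ** 2 for a, b in zip(rank_a, rank_b)]
    total = sum(d_squared)
    return total, 1 - 6 * total / (n * (n ** 2 - 1))

# Ranks awarded by the two judges to contestants 1..12 (Table 9.1)
judge_a = [11, 9, 7, 10, 5, 1, 4, 12, 8, 3, 2, 6]
judge_b = [5, 7, 11, 12, 6, 4, 8, 9, 10, 2, 1, 3]

sum_d2, R = spearman_from_ranks(judge_a, judge_b)
print(sum_d2, round(R, 3))
```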
Example 9.4
A study was conducted to determine the relationship between the level of smoking, measured by the
number of sticks of cigarette smoked per day (X), and a Tercim index of health (Y). The
following data were obtained from a random sample of 10 male smokers.
Table 9.3
X 8 20 15 12 15 9 16 10 12 8
Y 4 5 5 7 10 13 8 6 3 8
Calculate the Spearman rank correlation coefficient and comment on your result.
Solution
Table 9.4
$\sum d^2 = 169.5$
$$R = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(169.5)}{10(100 - 1)} = 1 - 1.027 = -0.027$$
Comment
The above result shows that there is a very weak negative association between smoking habit and
the reported health index.
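Because both X and Y in Table 9.3 contain tied values, the ranks must be averaged before the differences are taken. The following Python sketch shows one way to do this for the data of Example 9.4; the helper name `average_ranks` is an assumption made for illustration, and ranking is done in ascending order.

```python
def average_ranks(values):
    """Rank values in ascending order; tied values share the mean of their ranks."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1           # 1-based position of the first occurrence
        count = ordered.count(v)               # number of tied observations
        rank_of[v] = first + (count - 1) / 2   # mean of the tied positions
    return [rank_of[v] for v in values]

x = [8, 20, 15, 12, 15, 9, 16, 10, 12, 8]   # sticks of cigarette per day
y = [4, 5, 5, 7, 10, 13, 8, 6, 3, 8]        # health index

rx, ry = average_ranks(x), average_ranks(y)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
n = len(x)
R = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(rx)
print(ry)
print(sum_d2, round(R, 3))
```

The sum of squared differences agrees with the value 169.5 used in the solution, and R comes out slightly below zero.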
SAQ 9.1-9.2
A group of sportsmen take part in a competition which includes two gymnasium tests: squat
jumps and chins. The score for each exercise is the number performed in one minute. The scores
of eight sportsmen taken from this group are given below:
Sportsmen A B C D E F G H
Squat jumps 47 72 60 44 56 63 71 64
Chins 25 48 30 40 27 35 30 34
a. Calculate the Spearman coefficient of rank correlation between these two sets of scores.
b. The overall winner of the gymnasium tests is the sportsman with the highest total score
when the number of squat jumps is added to the number of chins.
Determine the total scores and state which sportsman was the winner.
c. The rank correlation between the total scores and the number of squat jumps is 0.86
for the data above. Calculate the rank correlation between the total score and the
number of chins. If, to save time, only one exercise were to be used in
future, state, giving a reason, which one you would recommend.
References
Adamu S.O and Johnson Tinuke L (1998):”Statistics for Beginners: Book 1. Ibadan: SAAL
Publications ISBN: 978-34411-3-2
Olubosoye O.E, Olaomi J.O and Shittu O.I (2002):’Statistics for Engineering , Physical and
Biological Sciences”. Ibadan:A Divine Touch Publications. ISBN: 978- 35606-7-0
Brookes C B and Dick W.F (1969): An Introduction to Statistical Method. Second Edition
Published by H.E.B.Paperback.
Clarke G.M.and Cooke D (1993): A Basic Course in Statistics. Third edition. London: Arnold
& Stoughton
Moore D.S and Mc Cabe G.P (1993): Introduction to the Practise of Statistics, second
Edition. New York: W.H. Freeman and coy.
Study Session 10: Proportions, Rates and Indices
Introduction
Rates, ratios and indices have become very important in the descriptive analysis of certain events
and characteristics. They are especially useful in the study of vital characteristics such as price,
death, birth, population growth, epidemics, etc.
In this study session, you will be introduced to the three concepts, their uses and applications, using
some sample data, with particular emphasis on price indices.
10.1 Explain the meaning of the terms proportion, rate and indices;
10.2 Explain items to be taken into consideration when constructing an index number
10.1 Proportion, Rates and indices
Definition: Proportion is the ratio of the number of items with a certain characteristic (X) to the
total number of items exposed to that characteristic (N).
It is defined as
$$P(X) = \frac{n(X)}{N}$$
The above expresses the chance of occurrence of the characteristic (i.e. the probability of event
X).
Example
The voting age population (people 18 years and above) in a ward consists of 550 males and 600
females. What is the proportion of males?
Solution
n (males) = 550
Total population: N = 550 + 600 = 1150
Proportion of males:
$$P(\text{Males}) = \frac{n(\text{males})}{N} = \frac{550}{1150} = 0.478$$
Rates
When a proportion refers to the number of events or cases occurring during a certain period of time,
it becomes a rate and is usually expressed as so many per 1000. Thus, we refer to the birth rate as the
number of births per 1000 population in a year.
Similarly, we have the death rate, migration rate, marriage rate, etc. Some examples shall be given
later to illustrate this concept.
Index Number
An index is a real number that measures the rate of increase or decrease in the wage, production,
value, quantity, price or volume of a certain phenomenon in the current period relative to a
specific period in the past (the base period). It is usually expressed as a percentage.
An Index number is a device for estimating trends in prices, wages, production and other
economic variables.
In its simplest form, an index number represents a special kind of average or weighted average,
compiled from a sample of items judged to be representative of the whole. In this study session, our
focus shall be on the construction of the consumer price index, since the principles and methods that
will be discussed apply equally to indices of sales, production, wage, value and quantity.
In-Text Question
When proportion refers to the number of events or cases occurring during certain period of time,
it becomes a _______
A. Rate
B. An index
C. Ratio
D. Map
In-Text Answer
A. Rate
As mentioned in the earlier part of this lecture, an index number is a special kind of
average that considers the prices of many commodities expressed in different units, or
quantities also measured in different units. The commodities could also be of different weights in
the “basket” of goods considered for the index. All these constitute the problems usually
encountered in the construction of an index number.
Thus, in the construction of a price index, the following factors are considered:
a. Choice of Item
A decision should be taken on the items to be included in the index. Each commodity included
should be (i) relevant, (ii) representative, (iii) reliable and (iv) comparable over a period of time.
b. Source of Data
A decision should also be taken on the source of data for the items composing the “basket” to be
used in the construction of the index number: should the prices of commodities be collected
from a local market, a supermarket or an urban market? Great care should be taken to ensure that
prices are collected from a popular market that is patronized by different categories of people and
where the majority of the selected commodities can be found.
c. The Base Period
A base year is a reference period. The chosen year should generally be a fairly “normal” year,
free of occurrence of unusual events such as war, famine, prolonged strike or hyper-inflation. If
it is difficult to select a year in particular, the average of a series of years can be taken.
d. The Weight
Different weights are used in different parts of the country for a particular commodity. For
instance, “congo” is used in the Western part of Nigeria, ‘mudu’ in the North and ‘tin’ in the
East. For the purpose of constructing an index number, the weights in the different regions need to be
harmonized to a single unit.
In-Text Question
Choice of item can determine price index. True or False
In-Text Answer
True
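$$PR = \frac{P_n}{P_0} \times 100$$
where $P_n$ is the price in the current period and $P_0$ is the price in the base period.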
2. Simple Aggregate Method: This method considers the price of a basket of goods and
services in the current year relative to that of the base period. It is denoted by:
$$SAM = \frac{\sum P_n}{\sum P_0} \times 100$$
Limitation: It attaches equal weight to all commodities. It does not take into account the relative
importance of the commodities.
3. Simple Average Relative Method: This is the sum of the price relatives divided by the
number of items considered. It is denoted by
$$SAP = \frac{\sum (P_n / P_0)}{N} \times 100$$
where N is the number of items.
Its limitation is the same as in (2).
4. Weighted Simple Average Relative Method: To circumvent the problem of assigning equal
weight to different items, the weighted simple average of price relatives is given as
$$WSAR = \frac{\sum W (P_n / P_0)}{\sum W} \times 100$$
Example 10.1
The following are the prices of commodities A, B and C in 1975 and 1985,
using 1975 as the base year.
Table 10.1
Commodity   1975   1985
A             40     50
B             12     35
C             45     95
Solution
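Table 10.2
Total 97 180
Simple aggregate price index = $\frac{\sum P_n}{\sum P_0} \times 100 = \frac{180}{97} \times 100 = 185.6$
Sum of price relatives = $\frac{50}{40} + \frac{35}{12} + \frac{95}{45} = 1.25 + 2.92 + 2.11 = 6.28$
Simple average of price relatives = $\frac{6.28}{3} \times 100 = 209.3$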
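The simple aggregate and average-of-relatives methods can be sketched in Python using the prices of Table 10.1. The sketch assumes that the first price column is the base year and the second the current year; the function names are illustrative. The `weights` argument shows how the weighted version generalises the simple average (no weights are given in this example, so equal weights are used).

```python
def simple_aggregate_index(base_prices, current_prices):
    """Sum of current prices divided by sum of base prices, times 100."""
    return sum(current_prices) / sum(base_prices) * 100

def average_of_relatives(base_prices, current_prices, weights=None):
    """Weighted mean of the price relatives Pn/P0, times 100.

    With weights=None every item gets weight 1, which is the simple average.
    """
    relatives = [pn / p0 for p0, pn in zip(base_prices, current_prices)]
    if weights is None:
        weights = [1] * len(relatives)
    weighted_sum = sum(w * r for w, r in zip(weights, relatives))
    return weighted_sum / sum(weights) * 100

base_prices = [40, 12, 45]      # commodities A, B, C in the base year
current_prices = [50, 35, 95]   # commodities A, B, C in the current year

sam = simple_aggregate_index(base_prices, current_prices)
sap = average_of_relatives(base_prices, current_prices)
print(round(sam, 1), round(sap, 1))
```

Commodity B, the cheapest item, has the largest price relative, which is why the average of relatives is higher than the simple aggregate index.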
Example 10.2
Given the prices of some staple foods in 1980 and 1996 with the corresponding weight. Using
1980 = 100
Table 10.3
Compute i. the price relatives
ii. simple aggregate price index
iii. weighted average price index
Table 10.4
144.5
= x 100
57.25
= 252.14
$$P_L = \frac{\sum P_n q_0}{\sum P_0 q_0} \times 100$$
Limitation: Since the base year quantities reflect an outmoded purchasing pattern, the index
gives undue weight to items that have increased in price. Therefore, the Laspeyres price index tends
to overestimate the rise in prices.
6. Paasche's Price Index: This method assumes that the consumption pattern of the
consumer has changed in the current year. It is denoted by:
$$P_P = \frac{\sum P_n q_n}{\sum P_0 q_n} \times 100$$
Limitation: Since people tend to spend less on goods that have risen in price, the current-year
weighting procedure (Paasche's) gives undue weight to items that have reduced in price. It therefore
tends to understate the rise in prices, and hence underestimates the price index.
7. Fisher's Ideal Index Number: This method overcomes the problems of the Paasche and
Laspeyres indices. It is considered the most efficient method of constructing an index
number. It is the geometric mean of the Laspeyres and Paasche price indices, denoted by:
$$\sqrt{P_L \cdot P_P}$$
8. Marshall-Edgeworth Price Index: This method takes into account the pattern of
consumption in the current and base periods. It uses the arithmetic mean of the base and
current period quantities as weight. It is given by:
$$ME_P = \frac{\sum P_n (q_0 + q_n)}{\sum P_0 (q_0 + q_n)} \times 100$$
Example 10.3
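The prices and quantities demanded of commodities A, B and C in the base and current years are
given below.
Table 10.5
Commodity   P0   q0   Pn   qn
A            4   50   10   40
B            3   10    9    2
C            2    5    4    2
Table 10.6
Commodity   P0   q0   Pn   qn   Pn q0   P0 q0   Pn qn   P0 qn
A            4   50   10   40     500     200     400     160
B            3   10    9    2      90      30      18       6
C            2    5    4    2      20      10       8       4
Total                             610     240     426     170
i. Laspeyres: $P_L = \frac{\sum P_n q_0}{\sum P_0 q_0} \times 100 = \frac{610}{240} \times 100 = 254.2$
ii. Paasche: $P_P = \frac{\sum P_n q_n}{\sum P_0 q_n} \times 100 = \frac{426}{170} \times 100 = 250.6$
iii. Marshall-Edgeworth: $ME_P = \frac{610 + 426}{240 + 170} \times 100 = 252.7$
iv. Fisher: $\sqrt{P_L \cdot P_P} = \sqrt{(254.2)(250.6)} = 252.4$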
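The four weighted indices of this example can be reproduced with a short Python sketch. It assumes the columns of Table 10.5 are, in order, base price, base quantity, current price and current quantity (an ordering consistent with the totals 610, 240, 426 and 170 used in the solution); the helper name `value` is illustrative.

```python
import math

# Columns of Table 10.5 for commodities A, B, C
p0 = [4, 3, 2]     # base-year prices
q0 = [50, 10, 5]   # base-year quantities
pn = [10, 9, 4]    # current-year prices
qn = [40, 2, 2]    # current-year quantities

def value(prices, quantities):
    """Total value of a basket: sum of price times quantity."""
    return sum(p * q for p, q in zip(prices, quantities))

laspeyres = value(pn, q0) / value(p0, q0) * 100   # base-year weights
paasche = value(pn, qn) / value(p0, qn) * 100     # current-year weights
fisher = math.sqrt(laspeyres * paasche)           # geometric mean of the two

# Marshall-Edgeworth: weights are q0 + qn (the factor 1/2 of the mean cancels)
q_sum = [a + b for a, b in zip(q0, qn)]
marshall_edgeworth = value(pn, q_sum) / value(p0, q_sum) * 100

print(round(laspeyres, 1), round(paasche, 1),
      round(marshall_edgeworth, 1), round(fisher, 1))
```

As expected, the Fisher and Marshall-Edgeworth values both fall between the Paasche and Laspeyres values.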
In-Text Question
Changes in purchasing power and real income can be measured using the consumer price
indices. True or False
In-Text Answer
True
1961 1962
Assuming that palm oil is twice as important as kerosene, what is the price index for 1962
taking 1961 = 100.
2. Five feed components are to be used in the construction of an animal feedstuff index
number. From the figures given in the following table, calculate a Laspeyres price index
taking 1964 = 100.
References
Adamu S.O and Johnson Tinuke L (1998):”Statistics for Beginners: Book 1. Ibadan: SAAL
Publications. ISBN: 978-34411-3-2
Adewoye, G. O. and Shittu O.I (1999): “Introduction to Socio-Economic Statistics (Survey
methods and Indicators).” Lagos:Victory Ventures. ISBN 978-33867-1-9
Brookes C B and Dick W.F (1969): An Introduction to Statistical Method .Second Edition
Published by H.E.B.Paperback.
Clarke G.M.and Cooke D (1993): A Basic Course in Statistics. Third Edition. London: Arnold &
Stoughton.
Moore D.S and Mc cabe G.P (1993): Introduction to the Practice of Statistics. Second edition.
New York: W.H. Freeman and coy.
Study Session 11: Time Series Analysis
Introduction
Time series analysis is the application of time series techniques to time-structured data, usually
referred to as time series data. Time series data are records of observations measuring a certain
quantity of interest at regular or irregular intervals of time.
In this study session, you will learn about time series data, methods of analysis of time series
data and the components of a time series.
capable of describing the system that generates the time-structured data, making reliable forecasts
for the future and testing hypotheses.
Thus, time series analysis provides a basis for economic and business planning, production and system
planning, and the control and optimization of industrial processes. The intrinsic nature of a time
series is that its observations are dependent or correlated, and the order of the observations is
therefore important.
Since life must be understood looking backwards and must be lived looking forwards, time
series analysis provides useful tools that help to predict the future by means of approximating
models that use past data.
A discrete time series is one where observations are taken at discrete, specific time intervals,
usually equally spaced, e.g. interest rates, yields, volume of sales and production. Such series
arise in fields such as agriculture, business circles, etc.
A continuous time series consists of observations taken at any time t ($t \in T$) in the index set T.
This type of series is common in engineering, geophysics and the medical sciences.
In-Text Question
Discrete time series is one where observations are taken at discrete specific time intervals,
usually equally spaced. True or False
In-Text Answer
True
11.2 Methods of Analysis of Time Series Data
Time series data can be analyzed using either the deterministic method or the dynamic method.
Deterministic Method
A time series is said to be deterministic if future values are determined exactly by some
mathematical function. For example
(i) $X_t = a + bt$
(ii) $X_t = \cos(2t)$
where a and b are constants and t is time, which is fixed.
(iii) $X_t = T_t + S_t + C_t + I_t$
where $T_t$ is the trend component, $S_t$ is the seasonal component, $C_t$ is the cyclical component
and $I_t$ is the irregular component.
This method involves the use of the autocorrelation function (ACF), the partial autocorrelation
function (PACF) and the correlogram in the discrete domain. It also involves Fourier transforms in
the frequency domain. Analysis in the frequency domain is carried out using an extension of the
Fourier method and the spectral density function.
In-Text Question
A time series is said to be non-deterministic if future values can only be determined in terms of a
probability distribution guided by some assumptions. True or False
In-Text Answer
True
Deterministic Time series Analysis
The analysis of time series depends on the type of system that generates the data. Analysis in
time domain refers to the analysis of discrete time series
Time Plot
The first and most important diagnostic tool of time series data is the time plot. It is a graphical
representation of a time series data. It is constructed by plotting the observation xt on the
vertical axis against time ' t ' on the horizontal axis. When properly drawn, it shows up the
important features of the series such as trend, seasonality, discontinuities and outliers. The time
plot of the data gives an idea of the type of model that is suitable for the data.
It could also indicate whether it would be necessary to transform the observed data to achieve
certain stable conditions suitable for meaningful analysis and inference.
[Figure: Time Plot of Xt (vertical axis: Xt, 0 to 35; horizontal axis: t, 4 to 96)]
11.3 Components of Time Series
Movements in time-structured data are governed by some peculiar and inherent forces, which
may be characterized by their regularity or periodicity and their effect on the entire series. The
forces could also be due to changes in the social, economic, psychological or environmental
characteristics of the system.
The patterns generated by these forces are referred to as components of time series. Some of the
components are: the trend (Tt), Seasonal movement (St), cyclical movement (Ct) and the
Irregular movement (It).
Trend can be upward or downward. Upward trend is displayed in the time plot. This type of plot
is expected from sales of a commodity where increase is always expected.
Seasonal Variation
This refers to identical or almost identical patterns which a time series appears to follow during
corresponding months of successive years, mainly due to recurring events that take place
annually. The movement appears to be periodic (it exhibits variation at a fixed time within a given
interval of time). Many time series, such as sales figures and temperature readings, exhibit
variations which are periodic annually.
Several factors are responsible for this repetitive pattern year after year, and the major factor is
the weather. During winter, more woollen clothes are sold in the UK and some other parts of the
world. Also, regardless of an increasing trend in the sales of ice cream, more ice cream is sold
during summer than winter.
Seasonal variation is denoted by (St). A time plot of monthly rainfall is given
below. A season is completed within one year; therefore a complete cycle is detected in the time
plot below. Seasonal variation is also found in quarterly data. In that case, a complete cycle
will be completed within the four quarters that make a year.
Cyclical Variation
Cyclical variation refers to a long-term oscillation about the trend, which may or may not be
periodic, due to some other physical causes. The movement may or may not follow exactly the same
pattern after equal intervals of time. Examples include daily variation in temperature and rainfall.
Irregular Variation
This refers to the erratic or sporadic movement of a time series due to the occurrence of random or
chance events which are unforeseen; hence, it cannot be isolated directly. Such movements are not
deterministic. These variations may or may not be random.
Although it is assumed that these chance events produce short-term variation, they can be
very intense and may result in a new cyclical or other variation. Included among these random
factors are such events as strikes, floods, volcanic eruptions, earthquakes, fire outbreaks, sudden
changes in government policy and so on. It is denoted by It.
Traditionally, it is possible to decompose a time series into the trend, seasonal, cyclical and
irregular components, using either the
Additive model: $X_t = T_t + S_t + C_t + I_t$
or the Multiplicative model: $X_t = T_t \times S_t \times C_t \times I_t$
where $T_t$ is the trend component, $S_t$ is the seasonal component, $C_t$ is the cyclical component and
$I_t$ is the irregular component.
The resulting trend equation can be used for forecasting while the original data can be de-
seasonalized.
(i) semi-average,
Assuming all these methods are familiar to us, the least squares method uses the equation
$X_t = a + bt + \varepsilon_t$, with the assumption that the error terms $\varepsilon_t$ are independent
and not serially correlated. Otherwise, the regression equation is spurious, i.e. the parameters of
the model are biased and inconsistent due to the presence of a lagged dependent variable, and the
estimated OLS standard error is invalid.
Decomposition of a time series can be achieved using any of the following models:
(a) Additive model
Xt = Tt + St + Ct + It
(b) Multiplicative model
Xt = Tt . St . Ct . It
(c) Mixed model
Xt = Tt StCt + It
or Xt = Tt + St . Ct . It
The additive model assumes that the actual values are the sum of the four separate effects. This
assumption is probably true when short periods are involved or where the rate of growth or
decline in the trend is small as may be shown in the time plot.
The multiplicative model suggests that the actual values are the product of the separate effects.
This model is indicated when there is a marked (or sharp) growth or decline in a time series data
as may be shown in the time plot.
Additive Model: $X_t - T_t = S_t + C_t + I_t$, or
Multiplicative Model: $\frac{X_t}{T_t} = S_t \times C_t \times I_t$
multiplicative, the deviations are arranged in a table with a view to obtaining the average
otherwise referred to as seasonal indices St.
For the additive model, the condition $\sum_{i=1}^{K} S_i = 0$ is imposed. That is, the sum of the
seasonal effects (indices) over the quarters adds up to zero because, if there were no seasonal
effect, we would expect $X_t - T_t = 0$. If the means do not sum to zero, their sum is averaged among
the quarters / months / days / weeks, and the seasonal effects are adjusted by subtracting (or adding)
this average from each mean to obtain the adjusted means (i.e. seasonal effects). That is,
$$\frac{1}{n}\sum d_i = S_i; \qquad \sum S_i = 0, \text{ but if } \sum S_i = m \text{ then } \frac{m}{K} = g, \text{ therefore } \sum (S_i - g) = 0 \qquad (1.3)$$
For the multiplicative model, the condition $\sum_{j=1}^{S} S_j = S$ is imposed, where S is the number
of quarters in a quarterly series. That is, the sum of the seasonal effects over the year is S. The
ratio of the actual values ($X_t$) to the trend ($T_t$) is obtained as $X_t / T_t$ because, if there
were no seasonal effect, we would expect $X_t / T_t = 1$ for each time period. Thus
$\frac{1}{S}\sum_{j=1}^{S} S_j = 1$, that is,
$$\sum_{j=1}^{S} S_j = S \qquad (1.4)$$
The averaging procedure which produces the seasonal components follows the same pattern as in
the additive model, except that the adjustment to the averages corrects their sum to S.
This is achieved by summing the averages, dividing S by this sum and multiplying the resultant
quotient by the unadjusted averages.
Let $\sum S_j = C$; then the adjusted seasonal index is $\frac{S}{C} S_j$. Thus the de-trended,
de-seasonalized series can be obtained by eliminating the trend ($T_t$) and seasonal components
($S_t$) for each time period from the actual data, by subtraction or division depending on whether
the additive or multiplicative model was used.
Additive model: $X_t - T_t - S_t = C_t + I_t$
Multiplicative model: $\frac{X_t}{T_t S_t} = C_t \times I_t$
The de-seasonalized series is obtained after the seasonally adjusted data have been calculated. The
residual ratio is obtained either by dividing the seasonally adjusted figures $X_t / S_t$ by the
trend values or by dividing the de-trended series $X_t / T_t$ by the respective seasonal
indices.
Finally, the cyclical variation ($C_t$) can be found by smoothing the joint $C_t$ and $I_t$ components,
and is eliminated as before.
Additive model: $X_t - T_t - S_t - C_t = I_t$
Multiplicative model: $\frac{X_t}{T_t S_t C_t} = I_t$
Although the general method of decomposition has included the four possible components which
make up a time series, it should be noted that it is not a rule for all the four to be present. If
annual data are being used, there can be no seasonal component. Similarly, if short periods of
time are involved, the cyclical components can be ignored. In both cases one of the steps
outlined in the decomposition of time series above may be omitted.
Prediction / Forecasting
The essence of decomposing a time series is for a statistician to measure the effect of each
component and to make meaningful and reliable forecast; taking into consideration the effect of
the component on the forecast values for different time periods.
Thus, if a multiplicative model was used, a sensible predictor for the period k might be
$$\hat{X}_t(k) = \hat{T}_{t+k} \times \hat{S}_{t+k}$$
where $\hat{T}_t$ and $\hat{S}_t$ are the estimated trend and seasonal effects respectively.
Similarly, if the additive model was used, the predictor for period k might be
$$\hat{X}_t(k) = \hat{T}_{t+k} + \hat{S}_{t+k}$$
Example
The data below gives the monthly sales of umbrella in XYZ company from 2004 – 2011
Table 11.1
Month   2004  2005  2006  2007  2008  2009  2010  2011
JAN       10    15    10     8    10     8     9    10
FEB       18    12    10     9     9    12    13     8
MAR       22    13     9    12    10    10     3    11
APRIL      8    15    20    14    12    15     8    13
MAY       16    11    10    19    10    12    15     8
JUN       10    16    18    20    18    11    15    11
JUL       18    22    16    25    28    16    18    15
AUG       20    30    20    25    30    13    19    17
SEP       15    20    21    17    15     9    10    11
OCT       10    15    18    15    17    18    18     4
NOV       14    25    16    15    15    22    23    15
DEC       11    10    14     7     7    10     9     8
(a) Use a suitable average to decompose the series into trend and seasonal component, hence or
otherwise forecast the sales for 2012 – 2013 using the additive model.
(b) Which is the most appropriate model in the sense of providing the better forecast?
Solution:
The first thing to do is to construct the time plot in order to view the maximum and minimum
values, examine the existence of outliers and fluctuations.
Table 11.2: Computation of the Trend, Seasonal Indices and De-seasonalized Data
Year  Month  Xt  12-month total  2-period total  Moving average (Tt)  Xt - Tt   S.I.   De-seasonalized
2004  MAY    16                                                                 -2.15  18.2
2006  APRIL  20  187             365             15.21                4.79     -0.45  20.4
2007  DEC     7  173             349             14.54               -7.54     -4.76  11.8
2009  NOV    22  147             298             12.42                9.58      4.09  17.9
2011  JUL    15                                                                  5.72   9.3
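The moving-average columns of this table can be reproduced with a minimal Python sketch. It takes the monthly sales from Table 11.1, assuming the eight data columns are the years 2004 to 2011 in order, forms successive 12-month totals, adds adjacent pairs of totals and divides by 24. The function names are illustrative. The sketch only reproduces the arithmetic; which month each averaged value is placed against is a separate centring decision.

```python
# Monthly umbrella sales from Table 11.1: one row per month, columns 2004..2011
sales_by_month = [
    [10, 15, 10, 8, 10, 8, 9, 10],     # JAN
    [18, 12, 10, 9, 9, 12, 13, 8],     # FEB
    [22, 13, 9, 12, 10, 10, 3, 11],    # MAR
    [8, 15, 20, 14, 12, 15, 8, 13],    # APRIL
    [16, 11, 10, 19, 10, 12, 15, 8],   # MAY
    [10, 16, 18, 20, 18, 11, 15, 11],  # JUN
    [18, 22, 16, 25, 28, 16, 18, 15],  # JUL
    [20, 30, 20, 25, 30, 13, 19, 17],  # AUG
    [15, 20, 21, 17, 15, 9, 10, 11],   # SEP
    [10, 15, 18, 15, 17, 18, 18, 4],   # OCT
    [14, 25, 16, 15, 15, 22, 23, 15],  # NOV
    [11, 10, 14, 7, 7, 10, 9, 8],      # DEC
]
# Chronological series: Jan 2004, Feb 2004, ..., Dec 2011
series = [sales_by_month[m][y] for y in range(8) for m in range(12)]

def moving_totals(x, width=12):
    """Totals of every run of `width` consecutive observations."""
    return [sum(x[i:i + width]) for i in range(len(x) - width + 1)]

def centred_moving_average(x, width=12):
    """Average of two adjacent moving totals, divided by 2 * width."""
    totals = moving_totals(x, width)
    return [(a + b) / (2 * width) for a, b in zip(totals, totals[1:])]

totals = moving_totals(series)
trend = centred_moving_average(series)

# Totals starting in Nov 2005 and Dec 2005, and the resulting average
print(totals[22], totals[23], totals[22] + totals[23], round(trend[22], 2))
```

The pair of totals printed here (187 and 178, summing to 365, with average 15.21) matches the figures in the April 2006 row of the table.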
Month/Year     JAN     FEB     MAR   APRIL     MAY     JUN     JUL     AUG     SEP     OCT     NOV     DEC
2006          0.08   -3.54   -2.96   -1.63   -6.04   -0.79    5.50   13.75    3.71   -1.46    8.50   -6.33
2007         -5.67   -5.29   -6.46    4.79   -5.00    2.92    1.04    4.96    6.08    2.96    0.50   -1.96
2008         -8.54   -7.58   -4.29   -2.13    3.21    4.42    9.33    9.42    1.58    0.04    0.50   -7.54
2009         -4.88   -6.00   -5.00   -3.08   -5.08    3.00   12.96   14.83   -0.29    1.50   -0.29   -7.50
2010         -5.29   -0.33   -2.13    2.54   -0.88   -2.04    2.88    0.13   -3.29    5.88    9.58   -2.67
2011         -4.00   -0.29  -10.33   -5.38    1.63    1.63    4.79    5.67   -3.88    4.21    9.67   -4.04
Total       -31.13  -27.83  -32.42   -3.21  -15.13    4.58   40.00   54.88    5.13    9.25   28.54  -33.38
AVG          -4.45   -3.98   -4.63   -0.46   -2.16    0.65    5.71    7.84    0.73    1.32    4.08   -4.77   -0.10
Adjustment    0.01    0.01    0.01    0.01    0.01    0.01    0.01    0.01    0.01    0.01    0.01    0.01   -0.01
S.I          -4.44   -3.97   -4.62   -0.45   -2.15    0.66    5.72    7.85    0.74    1.33    4.09   -4.76    0.00
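The adjustment step in the last three rows of this table can be sketched in Python as follows. The sketch starts from the AVG row as printed, computes the correction g as the mean of the twelve averages, and subtracts it from each average so that the seasonal indices sum to zero, as required for the additive model. The function name is an illustrative assumption.

```python
# Unadjusted monthly averages of (Xt - Tt), JAN..DEC, from the AVG row
monthly_averages = [-4.45, -3.98, -4.63, -0.46, -2.16, 0.65,
                    5.71, 7.84, 0.73, 1.32, 4.08, -4.77]

def adjust_additive(averages):
    """Subtract g = (sum of averages) / K so the indices sum to zero."""
    g = sum(averages) / len(averages)
    return [s - g for s in averages]

seasonal_indices = adjust_additive(monthly_averages)
print([round(s, 2) for s in seasonal_indices])

# De-seasonalized value for an observation of 20 in April (additive model)
april_index = seasonal_indices[3]
print(round(20 - april_index, 2))
```

Because the averages sum to a small negative number, g is negative and each index is slightly larger than its unadjusted average, which is the 0.01 shown in the Adjustment row.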
Fig. 1.5: Time Plot of Sales of Umbrella, Moving average and De-seasonalized data
(horizontal axis: months from January 2004 to 2011; vertical axis: 0 to 35; series plotted: Sales of Umbrella, Moving Average, De-seasonalized data)
1. The components of time series were described with charts for their illustration
2. The additive and multiplicative method of analysis of time series data with examples.
3. The procedure for construction of seasonal indices and de-seasonalized data. Some useful
examples were given to illustrate the techniques.
(i) Obtain the time-plot of the observation
(ii) Use a 3-point moving average to obtain the trend values.
3. For the following time series:
Year Yt
1990 2.4
1991 3.6
1992 5.4
1993 7.8
1994 11.6
1995 17.3
References
Brookes C B and Dick W.F (1969): An Introduction to Statistical Method .Second Edition
Published by H.E.B.Paperback.
Clarke G.M.and Cooke D (1993): A Basic Course in Statistics. Third edition London: Arnold &
Stoughton
Cyprian A. Oyeka (1990): An Introduction to Applied Statistical Method in the Sciences. Enugu:
Nobern Avocation publishing coy.
Gupta, C. B. (1973)“An Introduction to Statistical Methods” New Delhi: Vikas Publishing House
PVT Ltd.
Moore D.S and Mc Cabe G.P (1993): Introduction to the Practise of Statistics, Second Edition.
New Delhi: W.H. Freeman and coy.
Shittu, O. I. and Yaya, O. S. (2011): “Introduction to Time Series Analysis”, Babs-Tunde
Intercontinental Print, Nigeria. ISBN 978-33867-1-9. pp. 282