AGE 301 Handout by Dr. Show


Study Session 1 What is Statistics?

Introduction
Statistics is a universal subject used in all disciplines and in all areas of human endeavour. The
word statistics was originally applied only to such data as the state required for its official
purposes. To a layman, it also refers to any set of quantitative data relating to a particular
measurement, whether or not that data is of interest.

The systematic collection of official statistics for political purposes originated in Germany
towards the end of the 18th Century, by comparing data such as population, industrial and
agricultural output. Also in England, a collection of numerical data enabled government
departments to predict levels of revenues and expenditure with more precision than before.

Learning Outcomes for Study Session 1


When you have studied this session, you should be able to:

1.1 Explain the meaning of statistics;
1.2 Discuss the nature, scope and coverage of statistics;
1.3 Mention the uses of statistics in our day-to-day activities;
1.4 Define terms and concepts that facilitate understanding of this course.

1.1 Meaning of statistics
The earliest origin of statistics lies in the desire of rulers to count the numbers of inhabitants or
measure the value of taxable land in their domains. This has developed to careful measurement
of weight, distance or counting of physical quantities and items in many disciplines such as
agriculture, life and behavioral sciences.

The study of statistics is therefore essential for sound reasoning, precise judgment and
objective decision-making based on up-to-date, accurate and reliable data.

Box 1.1: Meaning of Statistics


Statistics can simply be defined as the “science of data”. It is the science of collecting,
organizing and interpreting numerical facts, which we call data.

Most of us, especially media reporters, have little or nothing to do with a large mass
of data.

Statistics is also the science and practice of developing human knowledge through the use of
empirical data expressed in quantitative form. It is based on statistical theory. It is a branch of
applied mathematics where randomness and uncertainty are modelled by probability theory
(Wikipedia Encyclopedia).

In Nigeria, the official data collection and its usage started with the Statistical Act of 1947 which
established the Department of Statistics in the office of the Governor General of the Federation.

Thus, many researchers, educationalists, businessmen and government agencies at the national,
state or local level rely on data to answer fundamental questions pertaining to their operations
and programs. In fact, there can be no meaningful science without statistics.

In-Text Question
Statistics could also be defined as:

a. A structural science
b. Codes that help programmers in programming
c. A branch of applied mathematics where randomness and uncertainty are modelled by
probability theory
d. None of the above

In-Text Answer
c. A branch of applied mathematics where randomness and uncertainty are modelled by
probability theory

1.2 Branches of Statistics


The science of data, statistics, can be divided into three broad parts which are not mutually
exclusive, viz. descriptive statistics, statistical methods and statistical inference.

Descriptive Statistics
It is the act of summarizing and giving a descriptive account of numerical information in the
form of reports, charts and diagrams. The goal of descriptive statistics is to gain information
from collected data. It begins with the collection of data by either counting or measurement in an
inquiry.

It involves summarizing specific aspects of the data, such as the average value and a measure of
spread. Suitable graphs, diagrams and charts are then used to gain understanding and a clear
interpretation of the phenomenon under investigation, keeping firmly in mind where the data
come from.

Statistical Method
This is a device for classifying data and clarifying the relationships between the variables under
consideration. This can be achieved by using statistical tools and formulae. It ranges from
the computation of simple summaries of data (mean, median, mode, etc.) to the complex modelling
used in policy formulation.

Statistical Inference
This is the act of drawing conclusions about a population from quantities computed
from a representative sample. It is a process of generalizing about the
population under certain conditions and assumptions. Statistical inference involves the processes
of parameter estimation and hypothesis testing.

1.3 Uses of Statistics


Statistics is used in many of our day-to-day activities, including the following:

1. Planning and decision making by individuals, states, business organizations, research
institutions etc.
2. Forecasting and prediction of the future based on a good model, provided that its basic
assumptions are not violated.
3. Project implementation and control; this is especially useful in ongoing projects such as
network analysis, construction of roads and bridges, and implementation of government
programs and policies.
4. Monitoring and evaluation of plans, projects, programmes and policy initiatives, including
the activities of government programmes.

1.4 Terms and Concepts in Statistics
There are many terms and concepts in statistics that we need to learn in order to understand
the subject better. The terms and concepts discussed below are used
daily in the field of statistics.

1.4.1 Population, Sample and Variate


In the earlier part of this study session, we explained that the main aim of statistics is to gain
information about a population. We may want to know what the population is:

Population: A population is the collection of items under investigation. It may be finite
(countable) or infinite (uncountable).

Parameter: A parameter is a summary quantity computed from a population, e.g. the population mean ($\mu$),
the population variance ($\sigma^2$) etc.

Sample: A sample is a representative part of a population observed for the purpose of making a
scientific statement or taking decisions about the population. A good sample must be randomly
selected and adequate.

A sample can be random or purposive. A random sample may be obtained by tossing a coin,
throwing a die, drawing discs from a container or using a table of random numbers. A purposive
(judgmental) sample is obtained when members of a population are selected by discretion or
personal judgment.
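
As an illustration only, a simple random sample can also be drawn by computer instead of from a table of random numbers. The sketch below uses Python's standard `random` module; the population of 100 numbered units, the sample size of 10 and the seed are assumptions made for the example and do not come from the course text.

```python
import random

# Hypothetical population: 100 units numbered 1 to 100.
population = list(range(1, 101))

# A fixed seed is used only so that the sketch gives the same sample each run.
rng = random.Random(301)

# Simple random sample of 10 units, drawn without replacement.
sample = rng.sample(population, k=10)
print(sorted(sample))
```

Every unit has the same chance of selection, which is what distinguishes this from a purposive sample chosen by personal judgment.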

Statistic: A statistic is a summary quantity calculated from a sample for the purpose of
drawing conclusions about the related population, e.g. the sample mean ($\bar{x}$), the sample variance ($s^2$)
etc.

The characteristics of units in the population can be measured or counted (quantitative), e.g.
weight, height, age, number of cars. They can also be observed (qualitative, or attributes), e.g. colour of
eyes, beauty, complexion etc.

Variate: A variate (variable) is any quantity or attribute whose value varies from one unit of
observation to another. A quantitative variate may be discrete or continuous.

Continuous Variate: A continuous variate is a variate which may take all values within a given
range. Its values are obtained by measurement, e.g. height, volume, time, examination score etc.

Discrete Random Variate: A discrete random variate is one whose value changes by steps. Its
value may be obtained by counting. It normally takes integer values e.g. number of cars, number
of chairs.

1.4.2 What is Data?


Having defined statistics as the science of data, it is necessary at this juncture to ask ourselves
the pertinent question: What is data?
Data: Data can be described as a mass of unprocessed information obtained from the measurement
or counting of a characteristic or phenomenon. In their raw form, data are usually massive and
disorderly. They become meaningful only when they have been reduced to some kind of
order by means of tables or diagrams.

Statistical data: These are data obtained through objective measurement or enumeration of
characteristics using state-of-the-art equipment that is precise and unbiased. Such data, when
subjected to statistical analysis, produce results with high precision.

Sources of Statistical Data


Statistical data can be obtained from:

1. Census: complete enumeration of all the units of the population.
2. Surveys: the study of a representative part of a population.
3. Experimentation: observation from experiments carried out in laboratories and research
centres.

Types of Data
Data can be categorized as internal or external data.

Internal Data
When data is collected from within the organization and used in the organization concerned, it is
called internal data. Examples are data from accounts and internal records of an establishment.

External Data
If data is collected from outside the organization, it is called external data. Examples are data
from journals not published by the organization itself. There are two major sources of statistical
data: the internal source and the external source.

Primary Data
These are data generated first-hand, i.e. obtained directly from respondents by personal
interview, measurement or observation.

Secondary Data
These are data obtained from publications, newspapers, magazines and annual reports. They are
usually summarized data used for a purpose other than the one for which they were originally collected.

Summary
In Study Session 1, you have learnt that:

1. The study of statistics is essential for sound reasoning, precise judgment and objective
decision-making based on up-to-date, accurate and reliable data.
2. Statistics can be defined as the science of collecting, organizing and interpreting numerical
facts, which we call data.
3. The branches of statistics are descriptive statistics, statistical methods and statistical
inference.
4. Statistics is used in many of our day-to-day activities.
5. A population is the collection of items under investigation.
6. A parameter is a summary quantity computed from a population.
7. A variate (variable) is any quantity or attribute whose value varies from one unit of
observation to another.
8. Data can be described as a mass of unprocessed information obtained from the measurement
or counting of a characteristic or phenomenon.
9. Data can be categorized as internal or external data.

Self-Assessment Questions (SAQs) for Study Session 1
Now that you have completed this study session, you can assess how well you have achieved its
Learning Outcomes by answering the following questions. Write your answers in your Study
Diary and discuss them with your Tutor at the next Study Support Meeting. You can check your
answers with the Notes on the Self-Assessment Questions at the end of this Module.

SAQ 1.1 (Tests Learning Outcomes 1.1)


What is the meaning of Statistics?

SAQ 1.2 (Tests Learning Outcomes 1.2)


List the branches of Statistics

SAQ 1.3 (Tests Learning Outcomes 1.3)


Mention three uses of statistics.

SAQ 1.4 (Tests Learning Outcomes 1.4)

1. Define population.
2. What are the sources of statistical data?

Notes on SAQ
SAQ 1.1

Statistics can simply be defined as the “science of data”. It is the science of collecting, organizing
and interpreting numerical facts, which we call data.

SAQ 1.2

Descriptive statistics, statistical methods and statistical inference

SAQ 1.3

I. Planning and decision making by individuals, states, business organizations, research
institutions etc.
II. Forecasting and prediction of the future based on a good model, provided that its basic
assumptions are not violated.
III. Monitoring and evaluation of the activities of government programs.

SAQ 1.4

1. A population is the collection of items under investigation.

2. Sources of statistical data:

i. Census: complete enumeration of all the units of the population.
ii. Surveys: the study of a representative part of a population.
iii. Experimentation: observation from experiments carried out in laboratories and research
centres.

References
Brookes, B. C. and Dick, W. F. L. (1969): An Introduction to Statistical Method, 2nd Edition, H. E.
B. Publishers.

Moore, D. S. and McCabe, G. P. (1993): Introduction to the Practice of Statistics, 2nd Edition,
New York: W. H. Freeman and Company.

Adamu, S. O. and Johnson, T. L. (1997): Statistics for Beginners, Book 1, SAAL Publications.

Study Session 2 Presentation of Data

Introduction
The aim of this study session is to introduce the various methods of presenting statistical data.
Presentation of data in tables, charts and diagrams facilitates understanding of the important
features of the data.

Learning Outcomes for Study Session 2


When you have studied this session, you should be able to:

2.1 Explain the various ways of presenting a mass of data;
2.2 Construct a frequency table;
2.3 Explain and carry out simple descriptive analysis of data in tables and diagrams.

2.1 Ways of Presenting a Mass of Data


Numerical information (data) about the characteristics of a variable, when collected, is often
massive and complex. More often than not, it is necessary to present data in tables, charts and
diagrams in order to have a clear understanding of the data and to illustrate the relationship
existing between the variables being examined.

We shall discuss the frequency table, cumulative frequency table, stem plot, box plot and
histogram, assuming that we are already familiar with other graphs such as the pie chart, frequency
curve, frequency polygon etc.

In-Text Question
Why is it necessary to present data in tables, charts and diagrams?

a. To have a clear understanding of the data and illustrate the relationship between variables
b. To break information into pieces
c. Allow a blind man understand the data
d. To have a clear understanding of the data

In-Text Answer
a.) To have a clear understanding of the data and illustrate the relationship between variables

2.2 Frequency Table


The first step in examining a set of data for a single quantitative variable intelligently is to
construct a frequency table. This is a tabular arrangement of data into various classes
together with their corresponding frequencies.

Procedure
Given a set of observations $x_1, x_2, \ldots, x_N$ for a single variable:

1. Find the range (R), i.e. the difference between the largest and smallest values of the data.
2. Determine the number of classes (K), depending on the size of the data.
3. Find the class interval (C), i.e. the range divided by the number of classes.
4. Tally (i.e. assign the values to classes).
5. Find the class frequencies.

Note: With the advent of computers, all these steps can be accomplished easily.
Example 2.1: The following are the scores of 40 students in Mathematics test:
50, 08, 14, 20, 46, 23, 26, 47, 32, 31, 48, 40, 49, 40, 41,
38, 51, 86, 55, 82, 56, 72, 60, 98, 59, 76, 55, 80, 52, 63,
57, 67, 53, 70, 69, 63, 65, 66, 22, 27
Construct a frequency table for the above data.
Solution
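Range = 98 – 08 = 90
No. of classes = 10

$$\text{Class interval} = \frac{\text{Range}}{\text{No. of classes}} = \frac{90}{10} = 9$$

The tables below use classes of width 10 (1 – 10, 11 – 20 and so on), a convenient rounding of this value.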

Working Table
Table 2.1

Class       Tally          Frequency
1 – 10      I              1
11 – 20     II             2
21 – 30     IIII           4
31 – 40     IIIII          5
41 – 50     IIIII I        6
51 – 60     IIIII IIII     9
61 – 70     IIIII II       7
71 – 80     III            3
81 – 90     II             2
91 – 100    I              1

Frequency Table
Table 2.2

Score Frequency

01 up to 10 1
11 up to 20 2
21 up to 30 4
31 up to 40 5
41 up to 50 6
51 up to 60 9
61 up to 70 7
71 up to 80 3
81 up to 90 2
91 up to 100 1

Total 40
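
The tallying in Example 2.1 can also be done by computer. The following is a minimal Python sketch, assuming whole-number scores and classes of equal width starting at 1; the function name `frequency_table` and its arguments are illustrative and are not part of the course text.

```python
def frequency_table(data, first_lower, width, n_classes):
    """Count the observations in each class.

    Classes are first_lower..first_lower+width-1, then the next `width`
    values, and so on. Integer data are assumed.
    """
    counts = [0] * n_classes
    for x in data:
        index = (x - first_lower) // width  # which class x belongs to
        if not 0 <= index < n_classes:
            raise ValueError(f"{x} lies outside the classes")
        counts[index] += 1
    return counts


# Scores of the 40 students in Example 2.1.
scores = [50, 8, 14, 20, 46, 23, 26, 47, 32, 31, 48, 40, 49, 40, 41,
          38, 51, 86, 55, 82, 56, 72, 60, 98, 59, 76, 55, 80, 52, 63,
          57, 67, 53, 70, 69, 63, 65, 66, 22, 27]

freqs = frequency_table(scores, first_lower=1, width=10, n_classes=10)
for i, f in enumerate(freqs):
    lower = 1 + 10 * i
    print(f"{lower:>3} - {lower + 9:<3} {f}")
print("Total", sum(freqs))
```

The integer division places a score such as 20 in the class 11 – 20 and a score such as 50 in the class 41 – 50, matching the class limits used in the tables.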

2.2.1 Cumulative Frequency Curve (OGIVE)
The graph of the cumulative frequency of a single variable is called an OGIVE. It is drawn by
plotting the cumulative frequency against the upper class boundary of each class interval. From the
OGIVE it is possible to obtain the median, the quartiles and the inter-quartile range (IQR).
Example 2.2: Using the data in Example 2.1, construct the cumulative frequency curve.
Solution

Table 2.3

Score           Frequency    Cumulative Frequency
01 up to 10     1            1
11 up to 20     2            3
21 up to 30     4            7
31 up to 40     5            12
41 up to 50     6            18
51 up to 60     9            27
61 up to 70     7            34
71 up to 80     3            37
81 up to 90     2            39
91 up to 100    1            40
Total           40

Diagram 2.1: OGIVE (cumulative frequency plotted against score)

Example 2.3
The following data represent the ages (in years) of people living in a housing estate in Ibadan.
30, 31 17 16 6 2 8 43 18 18 32 33
9 18 33 19 21 13 14 13 14 6 45 52 61
23 26 14 15 14 15 27 19 36 37 11 12
11 12 20 39 40 20 63 69 64 29 28 27
15
Present the above data in a frequency table using a suitable class interval.
Solution
Maximum value = 69
Minimum value = 2
Range = 69 – 2 = 67

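A choice of 10 classes would leave some classes with zero frequency, while a choice of 6
classes is more reasonable, with at least one item in each class. In practice, it is easy to determine
the number of classes for a given set of data. We start from K = 6 as our number of classes.

$$\text{Class interval} = \frac{R}{K} = \frac{67}{6} \approx 11.17$$

For convenience this is rounded to a class width of 10, so that seven classes (1 – 10 up to 61 – 70) are
needed to cover the data, as in Table 2.4.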

Table 2.4

(1)        (2)                        (3)          (4)              (5)               (6)
Class      Tally                      Frequency    Relative         Cumulative        CRF
                                      (F)          Frequency (RF)   Frequency (CF)
1 – 10     IIIII                      5            0.10             5                 0.10
11 – 20    IIIII IIIII IIIII IIIII    20           0.40             25                0.50
21 – 30    IIIII IIII                 9            0.18             34                0.68
31 – 40    IIIII III                  8            0.16             42                0.84
41 – 50    III                        3            0.06             45                0.90
51 – 60    I                          1            0.02             46                0.92
61 – 70    IIII                       4            0.08             50                1.00
Total                                 50           1.00

It is pertinent to define the columns in the frequency table for better understanding.

Class interval is a sub-division of the total range of values which a (continuous) variable
may take.
Class frequency is the number of observations of the variate which fall in a given
interval (column 3).
Relative frequency of a class is the actual frequency of the class divided by the total
frequency (column 4). Sometimes it is better to work with relative frequencies,
especially in the calculation of probability values.
Cumulative frequency of a class is the sum of the frequencies of all classes up to
and including that class (column 5).
Relative cumulative frequency: when the cumulative frequency of a class is expressed as
a proportion of the total frequency, we have the relative cumulative frequency
(column 6). It is sometimes called the distribution function.
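
Columns 4 to 6 of Table 2.4 follow directly from the class frequencies in column 3. The short Python sketch below recomputes them from those frequencies; the variable names are illustrative only.

```python
from itertools import accumulate

classes = ["1-10", "11-20", "21-30", "31-40", "41-50", "51-60", "61-70"]
freqs = [5, 20, 9, 8, 3, 1, 4]            # column 3 of Table 2.4

total = sum(freqs)                         # total frequency
rel = [f / total for f in freqs]           # relative frequency (column 4)
cum = list(accumulate(freqs))              # cumulative frequency (column 5)
cum_rel = [c / total for c in cum]         # relative cumulative frequency (column 6)

print(f"{'Class':<8}{'F':>4}{'RF':>7}{'CF':>5}{'CRF':>7}")
for row in zip(classes, freqs, rel, cum, cum_rel):
    print("{:<8}{:>4}{:>7.2f}{:>5}{:>7.2f}".format(*row))
```

The last cumulative frequency equals the total frequency, and the last relative cumulative frequency equals 1, which is a quick check on the arithmetic.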

Box 2.1: Observations from the Table


The data have been summarized and we now have a clearer picture of the distribution of the ages
of inhabitants of the Estate.

Exercise
Now answer the following questions from the table.

i. How many residents are aged between 11 and 30 years?

ii. How many residents are aged above 30 years?

iii. What is the probability that a person selected at random from the Estate will be less than
31 years old?

Answers
(i) 29 (ii) 16 (iii) 0.68

2.3 Simple descriptive analysis of data in tables and diagrams


Data can be presented in the text, in a table, or pictorially as a chart, diagram or graph. Tables,
charts and graphs should, ideally, be self-explanatory. The reader should be able to understand
them without detailed reference to the text, because users may well pick things up
from the tables or graphs without reading the whole text. Below are some ways of analysing data.

2.3.1 Histogram
A histogram is a chart used for presenting the frequency distribution of the values of a variable
(assuming the variate is of a continuous type).

A histogram is a group of rectangles drawn above each class interval such that the area of each
rectangle is proportional to the frequency of the observations falling in the corresponding class
interval. The chart is constructed by plotting the values of the variable along the X-axis and the
frequencies along the Y-axis.
Vertical lines are drawn at the lower and upper class boundaries of each class up to the
frequency. A horizontal line representing the width of each class interval is then drawn across the top
of each pair of vertical lines.
Where the class intervals are not all of the same width, the heights must be adjusted so that the
area represents the frequency.
Draw the histogram of the data in Example 2.3 above.

Diagram 2.2: Histogram of Ages (frequency plotted against ages in years)
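
The adjustment of heights for unequal class intervals can be sketched as follows. The bar height is taken as frequency divided by class width (often called frequency density), so that the area of each bar equals its frequency. The grouping used here is hypothetical: the last three classes of Table 2.4 are merged into one class 41 – 70, and class boundaries of 0.5, 10.5 and so on are assumed.

```python
def bar_heights(boundaries, freqs):
    """Return frequency / class width for each class (area = frequency)."""
    widths = [upper - lower for lower, upper in zip(boundaries, boundaries[1:])]
    return [f / w for f, w in zip(freqs, widths)]


# Assumed class boundaries; the final class (40.5 to 70.5) is three times as wide.
boundaries = [0.5, 10.5, 20.5, 30.5, 40.5, 70.5]
freqs = [5, 20, 9, 8, 8]   # 3 + 1 + 4 = 8 in the merged class

heights = bar_heights(boundaries, freqs)
for lower, upper, h in zip(boundaries, boundaries[1:], heights):
    print(f"{lower:>5} to {upper:<5} height = {h:.3f}")
```

Although the classes 31 – 40 and 41 – 70 both contain 8 observations, the wider class gets a bar one third as tall, so the two bars have the same area.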

2.3.2 Stem plots (Stem-and-Leaf Plots)

In statistics, a stemplot (or stem-and-leaf plot) is a graphical display of quantitative data that is
similar to a histogram and is useful in visualizing the shape of a distribution. It was invented by J.
W. Tukey (1915 – 2000). Stemplots contain more information than histograms because,
unlike in a histogram where bars are used, the individual data values are displayed in a table-like
format, in order of increasing magnitude. A basic stemplot contains two columns separated by a
vertical line. The left column contains the stems and the right column contains the leaves.

Constructing a Stemplot
To construct a stemplot, take note of the following steps:

I. The observations must first be sorted in ascending order.
II. It must be determined what the stems will represent and what the leaves will represent.
Typically, the leaf contains the last digit of the number and the stem contains all of the
other digits. In the case of very large or very small numbers, the data values may be
rounded to a particular place value (such as the hundreds place) that will be used for the
leaves; the remaining digits to the left of the rounded place value are used as the stems.

The stemplot is drawn with two columns separated by a vertical line. The stems are listed to the
left of the vertical line. It is important that each stem is listed only once and that no numbers are
skipped, even if it means that some stems will have no leaves. The leaves are listed in increasing
order in a row to the right of each stem.
Example 2.4
Present the following data in a stem-and-leaf plot

68 66 72 75 76 106 54 57 56 63 59 66 68
64 88 84 81
Solution

Table 2.5
Stem | Leaf
   5 | 4 6 7 9
   6 | 3 4 6 6 8 8
   7 | 2 5 6
   8 | 1 4 8
   9 |
  10 | 6
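
The two construction steps can be sketched in Python as below, applied to the values as listed in Example 2.4. This is only an illustration: it assumes whole-number data with the last digit as the leaf, and the function name `stemplot` is not from the course text.

```python
from collections import defaultdict


def stemplot(data):
    """Return the rows of a stem-and-leaf plot (last digit = leaf)."""
    leaves = defaultdict(list)
    for x in sorted(data):                 # step I: sort in ascending order
        leaves[x // 10].append(x % 10)     # step II: split into stem and leaf
    rows = []
    # List every stem from smallest to largest, even those with no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row = f"{stem:>2} | " + " ".join(str(d) for d in leaves.get(stem, []))
        rows.append(row.rstrip())
    return rows


values = [68, 66, 72, 75, 76, 106, 54, 57, 56, 63, 59, 66, 68, 64, 88, 84, 81]
rows = stemplot(values)
print("\n".join(rows))
```

Stem 9 is printed with no leaves, in line with the rule that no stem may be skipped.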
Example 2.5
Given the weight of 20 rams at the end of two weeks feeding on a special diet as follows:
46, 59, 35, 41, 46, 21, 24, 33, 40, 45, 49, 53, 48, 54, 61, 36, 70, 58, 47, 12
Make a stem plot for these data

Solution
The stem plot is given below
1 | 2
2 | 1 4
3 | 3 5 6
4 | 0 1 5 6 6 7 8 9
5 | 3 4 8 9
6 | 1
7 | 0

Important Features

i. It is easy to locate the centre of the distribution, i.e. median = 46.
ii. It is also possible to examine the shape of the distribution. Turn the stem plot on its side so
that the larger observations fall on the right (e.g. the above distribution is symmetric).
It is likewise possible to read off the median, the first quartile (q1), the third quartile (q3) and the
inter-quartile range (IQR).
iii. It is also possible to look for deviations from the overall shape of the data, e.g. outliers.
2.3.3 Back-to-back stemplot


Back-to-back stemplots are used to compare two distributions side-by-side. This type of double
stemplot contains three columns, each separated by a vertical line. The center column contains
the stems. The first and third columns each contain the leaves of a different distribution. The
numbers for the leaves of the distribution in the leftmost column are aligned to the right and are
listed in increasing order from right to left. Here is an example of a back-to-back stemplot
comparing the distribution of the weights of cows with that of the weights of rams.

Example 2.6

Suppose 20 cows were fed with the same special feed as in Example 2.5; the back-to-back stem
plot is shown below:

Table 2.6
Weight of Cow | Stem | Weight of Ram
            0 |  1   | 2
            1 |  2   | 1 4
            2 |  3   | 3 5 6
          3 1 |  4   | 0 1 5 6 6 7 8 9
        5 4 2 |  5   | 3 4 8 9
7 6 5 5 4 2 1 |  6   | 1
          4 2 |  7   | 0
              |  8   |
            1 |  9   |

Observations

i. Weight of Ram is symmetric.
ii. Weight of Cow is skewed to the left (its longer tail is towards the smaller weights).
iii. There is an outlier in the weight of cow (i.e. 91 kg).

NOTE: A stem plot works well for a small set of data, especially when the observations are all
greater than zero.

2.3.4 Box Plot (Box and Whiskers Plot)


This is a chart that looks like a box when drawn. Box plots are most useful when comparing two or
more sets of sample data. A box plot shows the center and spread of the data, gives a clear
picture of the symmetry of a data set and shows outliers very clearly. It is constructed by first
calculating the median and the 1st and 3rd quartiles.

In-Text Question
The box plot chart is most useful when comparing two or more sets of sample data. True or False

In-Text Answer
True

In a box plot, the ends of the box are at the quartiles, so that the length of the box is the inter-quartile
range. The median is marked by a line within the box. The ‘whiskers’ are the two lines
outside the box that extend to the smallest and largest observations. Outliers are shown as dots
outside the whiskers.
Example 2.7
Consider the data in Example 2.6 above. Construct the box plots.
Solution

Diagram 2.3: Boxplots of Weight of Ram (Fig. 1.2a) and Weight of Cow (Fig. 1.2b)
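
The quantities needed to draw a box plot (smallest value, first quartile, median, third quartile, largest value) can be computed as in the sketch below, shown for the ram weights of Example 2.5. Textbooks and software use slightly different rules for quartiles; this sketch assumes the rule that the first and third quartiles are the medians of the lower and upper halves of the ordered data, which may differ a little from other conventions. The function name is illustrative.

```python
from statistics import median


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum).

    Q1 and Q3 are taken as the medians of the lower and upper halves of the
    ordered data; the middle value is left out of both halves when n is odd.
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]


rams = [46, 59, 35, 41, 46, 21, 24, 33, 40, 45,
        49, 53, 48, 54, 61, 36, 70, 58, 47, 12]

low, q1, med, q3, high = five_number_summary(rams)
print("min =", low, "Q1 =", q1, "median =", med, "Q3 =", q3, "max =", high)
print("IQR =", q3 - q1)
```

The box is drawn from Q1 to Q3 with a line at the median, and the whiskers run out to the smallest and largest observations.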

In a box plot, the center, the inter-quartile range and the spread are immediately apparent. However,
the box plot is generally inferior to the stem plot or histogram in that it shows only the center and
the partition values; it tells little about the detailed shape of the distribution and the other values in the
data set.
A stem plot (or, for a large data set, a histogram) provides a clearer display of a single distribution, especially when
accompanied by the median and quartiles as numerical signposts.

Summary
In Study Session 2, you have learnt that:

1. It is necessary to present data in tables, charts and diagrams in order to have a clear
understanding of the data
2. The first step in examining intelligently a set of data for a single quantitative variable is
by constructing a frequency table
3. Frequency table is a tabular arrangement of data into various classes together with their
corresponding frequencies
4. The graph of the cumulative frequency of a single variable is called an OGIVE
5. Data can be presented in the text, in a table, or pictorially as a chart, diagram or graph.
6. Histogram is a chart used for presenting the frequency distribution of the values of a
variable
7. Stemplot is a graphical display of quantitative data that is useful in visualizing the shape
of distribution
8. Back-to-back stemplots are used to compare two distributions side-by-side.
9. Box Plot is useful when comparing two or more sets of sample data.

Self-Assessment Questions (SAQs) for Study Session 2
Now that you have completed this study session, you can assess how well you have achieved its
Learning Outcomes by answering the following questions. Write your answers in your Study
Diary and discuss them with your Tutor at the next Study Support Meeting. You can check your
answers with the Notes on the Self-Assessment Questions at the end of this Module.

SAQ 2.1 (Tests Learning Outcomes 2.1)


Why is it necessary to present data in tables, charts and diagrams?

SAQ 2.2 (Tests Learning Outcomes 2.2)

1. The first step in examining a set of data for a single quantitative variable is by
constructing a frequency table. True or False
2. What is a Frequency Table

SAQ 2.3 (Tests Learning Outcomes 2.3)


Mention three ways of analysing data.

Notes on SAQ

SAQ 2.1
Tables, charts and diagrams give a clear understanding of the data and illustrate the relationship existing between the
variables being examined.
SAQ 2.2

1. True
2. This is a tabular arrangement of data into various classes together with their
corresponding frequencies.

SAQ 2.3

i. Histogram
ii. Stemplot
iii. Box plot

References
Brookes, B. C. and Dick, W. F. (1969): An Introduction to Statistical Method, 2nd Edition,
H. E. B. Paperback.

Clarke, G. M. and Cooke, D. (1993): A Basic Course in Statistics, 3rd Edition, London:
Arnold & Stoughton.

Moore, D. S. and McCabe, G. P. (1993): Introduction to the Practice of Statistics, 2nd Edition,
New York: W. H. Freeman and Company.

Study Session 3 Measure of the Centre of a Set of Observations

Introduction
The primary aim of any investigator is to obtain a simple summary value (average) that can be
used to describe all the observations in a set. Thus an average is a single value that can
represent all the observations in a distribution.

The most representative value is one that is at the center of the distribution. Such values are
referred to as measures of location or measures of central tendency.

Learning Outcomes for Study Session 3


When you have studied this session, you should be able to:

3.1 Discuss the term measures of central tendency;
3.2 Explain and calculate the mean;
3.3 Explain and calculate the median;
3.4 Explain and calculate the mode from grouped data;
3.5 Define and calculate the partition values;
3.6 Discuss other measures of central tendency.

3.1 Measures of Central Tendency
These are measures of the center of a distribution. They are single values that give a description
of the data. They are also referred to as measures of location. Some of them are the
arithmetic mean, mode, median, geometric mean and harmonic mean. We shall discuss them
one after the other. They are otherwise known as descriptive statistics.

In-Text Question
Measures of central tendency are multiple values that give a description of the data. True or
False

In-Text Answer
False

A descriptive statistic should possess the following desirable properties. It should:

1. Be single-valued
2. Be algebraically tractable
3. Consider every observed value

3.2 Mean
The average (arithmetic mean) of a set of observations is the sum of the observations divided by
the number of observations. Given n observations denoted by $x_1, x_2, x_3, \ldots, x_n$, the mean is
defined by

$$\bar{X} = \frac{1}{n}(x_1 + x_2 + x_3 + \cdots + x_n)$$

Or, in compact notation, it can be written as

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The above formula is for a simple series and is most useful when few (n < 20) observations are
considered.

Example 3.1
Here are the ages of 15 students in a class: 16, 18, 20, 21, 22, 19, 17, 18, 19, 17, 17, 18, 17, 17,
20. Calculate the mean.
Solution

The average age of the students is

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{15}[16 + 18 + \cdots + 17 + 20] = \frac{276}{15} = 18.4$$

i.e. approximately 18 years 5 months.
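
The calculation for a simple series is easy to check by computer. The lines below are a minimal Python sketch of the formula (sum of the observations divided by their number), applied to the ages listed in Example 3.1; the variable names are illustrative.

```python
ages = [16, 18, 20, 21, 22, 19, 17, 18, 19, 17, 17, 18, 17, 17, 20]

n = len(ages)            # number of observations
total = sum(ages)        # sum of the observations
mean_age = total / n     # arithmetic mean

print("n =", n, " sum =", total, " mean =", mean_age)
```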

3.2.1 Calculation of Mean from Grouped Data


We have seen in Study Session 2 that a large set of observations can be summarized in a
frequency table, which elicits some information about the data. This makes the computation of
the mean from grouped data very easy.
The mean of a set of N observations of a discrete (continuous) variate, grouped so that the value
$X_i$ ($X_i$ is the centre of the i-th interval), i = 1, 2, …, K, occurs with frequency $f_i$, is

$$\bar{X} = \frac{1}{N}\sum_{i=1}^{K} f_i X_i, \qquad N = \sum_{i=1}^{K} f_i$$

In a grouped frequency table for a continuous variate, the $X_i$ are the centres of the intervals (i.e. the average
of the upper and lower class boundaries of a class), otherwise known as class marks.

Example 3.2
Given the frequency distribution of a random variable X as follows:
Table 3.1

Group Frequency

1–5 2
6 – 10 4
11 – 15 8
16 – 20 5
21 – 25 3
26 - 30 1

Total 23

Find the mean of the distribution.


Solution

Find the class mark of a particular class by adding the lower and upper class boundaries of the
class and divide by 2.

Table 3.2

Group      Class Mark (X)    f     fx
1 – 5      3                 2     6
6 – 10     8                 4     32
11 – 15    13                8     104
16 – 20    18                5     90
21 – 25    23                3     69
26 – 30    28                1     28
Total                        23    329

$$N = \sum f = 23$$

$$\bar{X} = \frac{\sum fx}{N} = \frac{329}{23} = 14.304$$
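
The grouped-data calculation of Example 3.2 can be sketched in Python as follows. Each class is written as (lower limit, upper limit, frequency), and the class mark is taken as the average of the two limits, which gives the same class marks as in the solution above. The variable names are illustrative.

```python
# (lower limit, upper limit, frequency) for each group in Example 3.2
groups = [(1, 5, 2), (6, 10, 4), (11, 15, 8), (16, 20, 5), (21, 25, 3), (26, 30, 1)]

class_marks = [(lower + upper) / 2 for lower, upper, _ in groups]
freqs = [f for _, _, f in groups]

n = sum(freqs)                                           # N = sum of f
sum_fx = sum(f * x for f, x in zip(freqs, class_marks))  # sum of f times x
mean = sum_fx / n

print("N =", n, " sum fx =", sum_fx, " mean =", round(mean, 3))
```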

Use of Assumed Mean


Sometimes large values of the variable are involved in the calculation of the mean. To make
the computation easier, we may assume one of the values as the mean. The revised formula
for the mean is then as follows.
If the assumed mean is A, then

$$\bar{X} = A + \frac{\sum d}{n}, \quad \text{where } d = X - A$$

If a constant factor C is used, then

$$\bar{X} = A + C\left(\frac{\sum U}{n}\right), \quad \text{where } U = \frac{X - A}{C}$$

For grouped data

$$\bar{X} = A + C\left(\frac{\sum fU}{\sum f}\right), \quad \text{where } U = \frac{X - A}{C}$$

Example 3.3
The exact pension allowance paid (in Naira) to 25 workers of a company is given in the table
below.

Table 3.3

Pension in ₦    No. of Persons (f)
25              7
30              5
35              6
40              4
45              3

Calculate the mean using an assumed mean of 35 and 5 as the common factor.

Solution:
Table 3.4

Pension in ₦    No. of Persons (f)    U = (X - A)/C    fU
25              7                     -2               -14
30              5                     -1               -5
35              6                     0                0
40              4                     1                4
45              3                     2                6
Total           25                                     -9

Let A = 35, C = 5

$$\bar{X} = 35 + \left(\frac{-9}{25}\right) \times 5 = 33.20$$

Example 3.4
Consider the data in Example 2.3. Using a suitable assumed mean and constant factor, compute
the mean.

Table 3.5

Group       f     X    X − A    U = (X − A)/C    fU
1 – 10      5     5     −10          −1          −5
11 – 20    20    15       0           0           0
21 – 30     9    25      10           1           9
31 – 40     8    35      20           2          16
41 – 50     3    45      30           3           9
51 – 60     1    55      40           4           4
61 – 70     4    65      50           5          20
Total      50                                    53

A = 15, C = 10

$$\bar{X} = 15 + \left(\frac{53}{50}\right) \times 10 = 15 + 10.6 = 25.6$$

Note:
It is always easier to select the class mark with the largest frequency as the assumed mean.
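As a check on the assumed-mean method, the sketch below recomputes the mean of the pension data in Example 3.3 both by coding and directly. The function name `coded_mean` is illustrative and not part of the handout.

```python
def coded_mean(values, freqs, A, C):
    """Mean by the assumed-mean (coding) method: A + C * sum(f*U) / sum(f)."""
    U = [(x - A) / C for x in values]  # coded values U = (X - A) / C
    return A + C * sum(f * u for f, u in zip(freqs, U)) / sum(freqs)


pensions = [25, 30, 35, 40, 45]  # X, from Table 3.3
persons = [7, 5, 6, 4, 3]        # f, from Table 3.3

coded = coded_mean(pensions, persons, A=35, C=5)
direct = sum(f * x for f, x in zip(persons, pensions)) / sum(persons)
print(round(coded, 2), round(direct, 2))  # 33.2 33.2
```

Both routes give the same answer because coding only shifts and rescales the values; the choice of A and C affects the arithmetic, not the result.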

Merits
The mean is an average that considers all the observations in the data set. It is simple and easy to
compute, and it is the most widely used average.

Demerits
Its value is greatly affected by extremely large or extremely small observations.

3.3 Median
The median is an average of position. It is the value of the variable that divides a distribution
into two equal parts when the values are arranged in order of magnitude.
To compute the median of a distribution:

i. Arrange all observations in order of size, from smallest to largest.

ii. If n (the number of observations) is odd, the median $\tilde{X}$ is the centre observation in the
ordered list. The location of the median is

$$\tilde{X} = \left(\frac{n+1}{2}\right)^{th} \text{ item}$$

iii. If n is even, the median $\tilde{X}$ is the average of the two middle observations in the ordered
list, i.e.

$$\tilde{X} = \frac{X_{(n/2)} + X_{(n/2+1)}}{2}$$
Example 3.5 (n is even)
The values of a random variable X are given as 11, 10, 13, 9, 13, 14, 16 and 20. Find the
median.

Solution
In an array: 9, 10, 11, 13, 13, 14, 16, 20. Since n = 8 is even,

$$\tilde{X} = \frac{X_{(n/2)} + X_{(n/2+1)}}{2} = \frac{X_4 + X_5}{2} = \frac{13 + 13}{2} = 13$$

Example 3.6 (n is odd)
The values of a random variable X are given as 9, 7, 5, 20, 2, 12 and 1. Find the median.

Solution
In an array: 1, 2, 5, 7, 9, 12, 20. Since n = 7 is odd,

$$\tilde{X} = X_{\left(\frac{7+1}{2}\right)} = X_4 = 7$$

Note: The occurrence of 7 in the above example is just a coincidence; it could have been any
other value in the middle of the data set.
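The odd and even cases can be sketched in a few lines of code, using the data of Examples 3.5 and 3.6. The function name `median` is illustrative; note that list positions in the code start at 0, whereas the formulas above count items from 1.

```python
def median(values):
    """Median of a simple series, following steps i to iii above."""
    ordered = sorted(values)  # step i: arrange in order of size
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the ((n + 1) / 2)th item, which is index n // 2 when counting from 0
        return ordered[mid]
    # n even: average of the (n/2)th and (n/2 + 1)th items
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([11, 10, 13, 9, 13, 14, 16, 20]))  # 13.0
print(median([9, 7, 5, 20, 2, 12, 1]))          # 7
```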

3.3.1 Calculation of Median from Grouped Data


The formula for calculating the median from grouped data is

$$\tilde{X} = L_m + \left(\frac{\frac{N}{2} - Cf_b}{f_m}\right) w$$

where $L_m$ = lower class boundary of the median class

$f_m$ = frequency of the median class

$N = \sum f$ is the total frequency

$Cf_b$ = cumulative frequency before the median class

$w$ = class width.

Example 3.7
The table below shows the lengths of 100 rods (in inches) produced in a factory.
Table 3.6

Length (inches)    Number of rods (f)
1 – 2                     1
3 – 4                     8
5 – 6                    26
7 – 8                    38
9 – 10                   19
11 – 12                   7
13 – 14                   1

Calculate the median.

Solution
The first thing to do is to obtain the cumulative frequency distribution as follows.

Table 3.7

Class       f     Cumulative Frequency (cf)
1 – 2       1              1
3 – 4       8              9
5 – 6      26             35
7 – 8      38             73
9 – 10     19             92
11 – 12     7             99
13 – 14     1            100

i. Determine N/2 = 100/2 = 50. Clearly the median value belongs to the class (7 – 8).

ii. The lower class boundary ($L_m$) of the median class is 6.5.

iii. The frequency of the median class ($f_m$) is 38.

iv. The cumulative frequency before the median class ($Cf_b$) is 35.

v. The class width (w) is 2, and the median is obtained as

$$\tilde{X} = 6.5 + \left(\frac{50 - 35}{38}\right) \times 2 = 6.5 + 0.789 = 7.289 \approx 7.29 \text{ (2 dp)}$$

Example 3.8
The following data represent the weights (in kg) of products manufactured in a factory.
Table 3.8

Weight (kg)    Number of Products
45 – 54                1
55 – 64                3
65 – 74                5
75 – 84               18
85 – 94               33
95 – 104              25
105 – 114             21
115 – 124             12
125 – 134              5
135 – 144              2

Calculate the median.

Solution
First obtain the cumulative frequency distribution as in Example 3.7. The following
quantities are then read from it:

N/2 = 125/2 = 62.5, so the median class is 95 – 104;
$L_m$ = 94.5, $Cf_b$ = 60, $f_m$ = 25, w = 10 (i.e. 104.5 – 94.5)

$$\tilde{X} = 94.5 + \left(\frac{62.5 - 60}{25}\right) \times 10 = 94.5 + 1 = 95.5$$
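The steps used in Examples 3.7 and 3.8 can be sketched as a small function. This is an illustration only: the name `grouped_median` is hypothetical, the classes are assumed to have equal width, and the lower class boundary is assumed to be the stated lower limit minus 0.5 (as with 6.5 for the class 7 – 8).

```python
def grouped_median(lower_limits, freqs, width):
    """Median of grouped data: L_m + ((N/2 - Cf_b) / f_m) * w."""
    half = sum(freqs) / 2          # N / 2
    cum = 0                        # cumulative frequency before the current class
    for lo, f in zip(lower_limits, freqs):
        if cum + f >= half:        # first class whose cf reaches N/2 is the median class
            boundary = lo - 0.5    # assumed lower class boundary
            return boundary + (half - cum) / f * width
        cum += f


# Example 3.7: lengths of rods
rods = grouped_median([1, 3, 5, 7, 9, 11, 13], [1, 8, 26, 38, 19, 7, 1], 2)
# Example 3.8: weights of products
weights = grouped_median(list(range(45, 136, 10)),
                         [1, 3, 5, 18, 33, 25, 21, 12, 5, 2], 10)
print(round(rods, 2), round(weights, 2))  # 7.29 95.5
```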

Merit

1. It is easy to calculate

2. It is easily understood by many people.

3. Its value is not affected by extreme values; thus it is a resistant measure of central
tendency.

4. It is a good measure of location in a skewed distribution.

Demerit

1. It does not take into consideration all the values of the variable.

3.4 Mode
The mode is the value of the variable that occurs most often in a set of data. It is the most
unstable measure of location. Unlike the arithmetic mean, it is not a unique measure of location:
in some cases it may not exist, and when it exists there may be more than one (e.g. in a bimodal
distribution).

Let us see how the mode can be obtained from discrete data.

Example 3.9
Consider the data in Example 3.5. The modal value is 13, since it is the only value that occurs
twice.

Example 3.10
Consider the data in Example 3.6.
The mode does not exist, since no value occurs more than once.

Example 3.11

From Example 2 the mode is X̂ = 2, i.e. the value with the highest frequency.
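For a simple series, the mode can be found by counting how often each value occurs. The sketch below follows the convention of Example 3.10 (no mode when every value occurs once) and returns all modal values when there are several; the function name `modes` is illustrative.

```python
from collections import Counter


def modes(values):
    """Return the list of modal values; empty if every value occurs only once."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []                  # no value repeats, so the mode does not exist
    return sorted(v for v, c in counts.items() if c == top)


print(modes([11, 10, 13, 9, 13, 14, 16, 20]))  # [13]
print(modes([9, 7, 5, 20, 2, 12, 1]))          # []
```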

3.4.1 Calculation of Mode from Grouped Data


The mode of a grouped distribution can be obtained either

i. from the frequency curve, by finding the value at the highest point, or

ii. by calculation, using the following formula.


For grouped data the mode is defined as

$$\hat{X} = L_m + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) w$$

where $L_m$ = lower class boundary of the modal class,

$\Delta_1$ = difference between the frequency of the modal class and that of the class before it,

$\Delta_2$ = difference between the frequency of the modal class and that of the class after it,

$w$ = class width.

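Example 3.12
From the data in Example 3.7, calculate the mode.

Solution
i. The modal class is the one with the highest frequency, i.e. (7 – 8).
ii. $L_m$ = 6.5, $\Delta_1$ = 38 – 26 = 12, $\Delta_2$ = 38 – 19 = 19, w = 2

$$\hat{X} = 6.5 + \left(\frac{12}{12 + 19}\right) \times 2 = 6.5 + 0.774 = 7.27$$

Example 3.13
Also consider the data in Example 3.8. The modal class is 85 – 94, with $L_m$ = 84.5,
$\Delta_1$ = 33 – 18 = 15, $\Delta_2$ = 33 – 25 = 8 and w = 10, so the mode is obtained as

$$\hat{X} = 84.5 + \left(\frac{15}{15 + 8}\right) \times 10 = 84.5 + 6.52 = 91.02$$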
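The grouped-mode formula can be sketched in code as follows. The function name `grouped_mode` is hypothetical; the sketch assumes equal class widths, a single modal class, and a lower class boundary equal to the stated lower limit minus 0.5.

```python
def grouped_mode(lower_limits, freqs, width):
    """Mode of grouped data: L_m + (d1 / (d1 + d2)) * w."""
    m = max(range(len(freqs)), key=freqs.__getitem__)   # index of the modal class
    before = freqs[m - 1] if m > 0 else 0               # frequency of the class before
    after = freqs[m + 1] if m < len(freqs) - 1 else 0   # frequency of the class after
    d1 = freqs[m] - before
    d2 = freqs[m] - after
    return (lower_limits[m] - 0.5) + d1 / (d1 + d2) * width


# Example 3.12 (rods) and Example 3.13 (weights)
mode_rods = grouped_mode([1, 3, 5, 7, 9, 11, 13], [1, 8, 26, 38, 19, 7, 1], 2)
mode_weights = grouped_mode(list(range(45, 136, 10)),
                            [1, 3, 5, 18, 33, 25, 21, 12, 5, 2], 10)
print(round(mode_rods, 2), round(mode_weights, 2))  # 7.27 91.02
```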

Merit
1. The mode is easily understood by many people.
2. It is easy to calculate.
3. It is a suitable measure of location when the distribution is highly skewed, e.g. the
distribution of wages of workers in a factory.

Demerit
1. It is not a unique measure of location.
2. It may present a misleading picture of the distribution.
3. It does not take into account all the available data.

3.5 Partition Values


We have seen in Section 3.3 that the median is an average that divides a distribution into two
equal parts. Similarly, there are other quantities that divide a set of data (in an array) into
several equal parts. Such data must have been arranged in order of magnitude. Some of the
partition values are the quartiles, deciles and percentiles.
Quartiles divide a set of data in an array into four equal parts.

For simple series

First quartile: $Q_1 = X_{(N/4)}$, i.e. the $(N/4)^{th}$ item

Second quartile: $Q_2 = \tilde{X}$ = median = $X_{(N/2)}$, i.e. the $(N/2)^{th}$ item

Third quartile: $Q_3 = X_{(3N/4)}$, i.e. the $(3N/4)^{th}$ item

For grouped data

i. First Quartile

$$Q_1 = l_{q_1} + \left(\frac{\frac{N}{4} - Cf_{b1}}{f_{q_1}}\right) w_1$$

where $l_{q_1}$ = lower class boundary of the $Q_1$ class
$f_{q_1}$ = frequency of the $Q_1$ class
$w_1$ = width of the $Q_1$ class
$Cf_{b1}$ = cumulative frequency below the $Q_1$ class

ii. Third Quartile

$$Q_3 = l_{q_3} + \left(\frac{\frac{3N}{4} - Cf_{b3}}{f_{q_3}}\right) w_3$$

where $l_{q_3}$ = lower class boundary of the $Q_3$ class
$f_{q_3}$ = frequency of the $Q_3$ class
$w_3$ = width of the $Q_3$ class
$Cf_{b3}$ = cumulative frequency below the $Q_3$ class

3.6 Other Measures of Central Tendency


Other measures of central tendency include the Midrange, Harmonic mean and Geometric mean.

Midrange
The value halfway between the smallest and the largest observation in a set of data is called the
midrange or range midpoint. It is obtained by adding the smallest and the largest observations
together and dividing the result by 2.

Example 3.14
Find the midrange of the following data: 1, 5, 7, 15, 12, 9, 7
Solution
Smallest observation = 1
Largest observation = 15

$$\text{Midrange} = \frac{15 + 1}{2} = 8$$

Example 3.15
Find the midrange of the following data representing the number of children in 10 households in
the Agbowo area of Ibadan.
4, 2, 1, 0, 2, 6, 2, 3, 5, 1

Solution

$$\text{Midrange} = \frac{6 + 0}{2} = 3$$

Usefulness
Information on the midrange of temperature readings taken by meteorologists is used by visitors
in the tourism industry.

Limitations
It takes into account only the extreme observations.

Geometric Mean
Given observations X1, X2, …, Xn of a random variable X, the geometric mean, denoted by GM,
is defined as the nth root of the product of the n observations in the set, i.e.

$$GM = \sqrt[n]{X_1 \cdot X_2 \cdots X_n}$$

Example 3.16
Find the geometric mean of the data in Example 3.14.

Solution

$$GM = \sqrt[7]{1 \cdot 5 \cdot 7 \cdot 15 \cdot 12 \cdot 9 \cdot 7} = \sqrt[7]{396900} = 6.31$$

Example 3.17
Obtain the geometric mean of the data in Example 3.15.

Solution

$$GM = \sqrt[10]{4 \cdot 2 \cdot 1 \cdot 0 \cdots 5 \cdot 1} = 0 \text{ (since zero is one of the observations)}$$

Usefulness
The geometric mean is very useful in the computation of rates and indices, e.g. the computation
of price indices.

Limitation

1. It cannot be meaningfully calculated when zero is one of the observations, since the product
(and hence the geometric mean) is then zero.

2. It is not a readily used measure of location.

Harmonic Mean
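Given the observations x1, x2, …, xn of a random variable X, the harmonic mean, denoted by
HM, is defined as the reciprocal of the mean of the reciprocals of the observations, i.e.

$$HM = \frac{1}{\frac{1}{n}\sum_{i=1}^{n}\frac{1}{x_i}} = \frac{n}{\sum_{i=1}^{n}\frac{1}{x_i}}$$

Example 3.18
Find the harmonic mean of the data in Example 3.14.

Solution

$$HM = \frac{1}{\frac{1}{7}\left(\frac{1}{1} + \frac{1}{5} + \frac{1}{7} + \frac{1}{15} + \frac{1}{12} + \frac{1}{9} + \frac{1}{7}\right)} = \frac{7}{1.7468} = 4.01$$

Example 3.19
Find the harmonic mean of the data in Example 3.15.

Solution

$$HM = \frac{1}{\frac{1}{10}\left(\frac{1}{4} + \frac{1}{2} + \frac{1}{1} + \frac{1}{0} + \cdots + \frac{1}{1}\right)}$$

Since 0 is one of the observations, the term 1/0 is undefined and the harmonic mean cannot be
calculated (it is taken as 0, the limiting value).

Note: HM ≤ GM ≤ AM for positive observations, with equality only when all the observations are
equal.


Usefulness
The harmonic mean is used in the calculation of average rates, e.g. average speed.

Limitations

1. It is hardly used in practice.

2. It cannot be calculated when zero is one of the observations in the set.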
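The three means can be compared numerically on the data of Example 3.14. The sketch below uses only the Python standard library (`math.prod` needs Python 3.8 or later); the variable names are illustrative.

```python
import math

data = [1, 5, 7, 15, 12, 9, 7]   # data of Example 3.14
n = len(data)

am = sum(data) / n                   # arithmetic mean
gm = math.prod(data) ** (1 / n)      # geometric mean: nth root of the product
hm = n / sum(1 / x for x in data)    # harmonic mean: n over the sum of reciprocals

print(round(am, 2), round(gm, 2), round(hm, 2))
print(hm < gm < am)  # True: the ordering HM < GM < AM holds for this data
```

The ordering check only makes sense for strictly positive data, which is why the data of Example 3.15 (containing a zero) is not used here.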

3.6.1 Other Partition Values from Grouped Data


The other partition values that can be calculated from grouped data are the deciles and the
percentiles.

Deciles are those values that divide a distribution into ten equal parts. They are denoted by
$D_i$, i = 1, 2, …, 9, i.e. D1, D2, D3, …, D9.
For grouped data, decile two ($D_2$) is defined as

$$D_2 = L_{D_2} + \left(\frac{\frac{2N}{10} - Cf_{bD_2}}{f_{D_2}}\right) w_2$$

where
$L_{D_2}$ = lower class boundary of the decile two class
$f_{D_2}$ = frequency of the decile two class
$w_2$ = width of the decile two class
$Cf_{bD_2}$ = cumulative frequency below the decile two class

Percentiles are those values that divide a distribution into one hundred equal parts. They are
denoted by P1, P2, P3, …, P99. For a grouped distribution the 65th percentile is defined as

$$P_{65} = L_{P_{65}} + \left(\frac{\frac{65N}{100} - Cf_{bP_{65}}}{f_{P_{65}}}\right) w$$

where
$L_{P_{65}}$ = lower class boundary of the 65th percentile class
$f_{P_{65}}$ = frequency of the 65th percentile class
$w$ = width of the 65th percentile class
$Cf_{bP_{65}}$ = cumulative frequency below the 65th percentile class

Example 3.20
Consider the data in Example 3.8.
Calculate the
i. first quartile (q1)
ii. third quartile (q3)
iii. 4th decile (D4)
iv. 45th percentile (P45)

Solution
From the table in Example 3.8, N = 125 and the cumulative frequencies are as follows.
Table 3.9

Class         f     cf
45 – 54       1      1
55 – 64       3      4
65 – 74       5      9
75 – 84      18     27
85 – 94      33     60
95 – 104     25     85
105 – 114    21    106
115 – 124    12    118
125 – 134     5    123
135 – 144     2    125

i. N/4 = 31.25, so the q1 class is 85 – 94.

$$q_1 = L_{q_1} + \left(\frac{\frac{N}{4} - Cf_{bq_1}}{f_{q_1}}\right) w = 84.5 + \left(\frac{31.25 - 27}{33}\right) \times 10 = 84.5 + 1.29 = 85.79$$

ii. 3N/4 = 93.75, so the q3 class is 105 – 114.

$$q_3 = L_{q_3} + \left(\frac{\frac{3N}{4} - Cf_{bq_3}}{f_{q_3}}\right) w = 104.5 + \left(\frac{93.75 - 85}{21}\right) \times 10 = 104.5 + 4.17 = 108.67$$

iii. 4N/10 = 50, so the D4 class is 85 – 94.

$$D_4 = L_{D_4} + \left(\frac{\frac{4N}{10} - Cf_{bD_4}}{f_{D_4}}\right) w = 84.5 + \left(\frac{50 - 27}{33}\right) \times 10 = 84.5 + 6.97 = 91.47$$

iv. 45N/100 = 56.25, so the P45 class is 85 – 94.

$$P_{45} = L_{P_{45}} + \left(\frac{\frac{45N}{100} - Cf_{bP_{45}}}{f_{P_{45}}}\right) w = 84.5 + \left(\frac{56.25 - 27}{33}\right) \times 10 = 84.5 + 8.86 = 93.36$$
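All the grouped partition values share one pattern: locate the class in which the required position falls, then interpolate within it. The sketch below expresses the position as a fraction p of N (0.25 for q1, 0.75 for q3, 0.45 for P45). The function name `grouped_quantile` is hypothetical, and the lower class boundary is assumed to be the stated lower limit minus 0.5.

```python
def grouped_quantile(p, lower_limits, freqs, width):
    """Partition value below which a fraction p of the N observations lies."""
    target = p * sum(freqs)   # position: N/4, 3N/4, 45N/100, ...
    cum = 0                   # cumulative frequency below the current class
    for lo, f in zip(lower_limits, freqs):
        if cum + f >= target:
            return (lo - 0.5) + (target - cum) / f * width
        cum += f


limits = list(range(45, 136, 10))                  # 45, 55, ..., 135 (Table 3.8)
freqs = [1, 3, 5, 18, 33, 25, 21, 12, 5, 2]

q1 = grouped_quantile(0.25, limits, freqs, 10)
q3 = grouped_quantile(0.75, limits, freqs, 10)
p45 = grouped_quantile(0.45, limits, freqs, 10)
print(round(q1, 2), round(q3, 2), round(p45, 2))  # 85.79 108.67 93.36
```

The same function gives the median with p = 0.5 and the k-th decile with p = k/10.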

Summary
In Study Session 3, you have learnt that:

1. Measures of central tendency are single values that give a description of the data.
2. The arithmetic mean of a set of observations is the sum of the observations divided by the
number of observations.
3. The mean is an average that considers all the observations in the data set
4. The median is an average of position.
5. Median is a good measure of location in a skewed distribution.
6. The Mode is the value of the variable that occurs most often in a set of data
7. The mode is not a unique measure of location
8. The partition values are: the quartile, deciles and percentiles.

Self-Assessment Question (SAQs) for Study Session 3


Now that you have completed this study session, you can assess how well you have achieved its
Learning outcomes by answering the following questions. Write your answers in your study
Diary and discuss them with your Tutor at the next study Support Meeting. You can check your
answers with the Notes on the Self-Assessment questions at the end of this Module.

SAQ 3.1 (Tests Learning Outcomes 3.1)


List four measures of central tendency

SAQ 3.2 (Tests Learning Outcomes 3.2)

1. What is an arithmetic mean?


2. What is the formula for the Calculation of Mean from Grouped Data?

SAQ 3.3 (Tests Learning Outcomes 3.3)
What is the formula for the calculation of the median from grouped data?

SAQ 3.4 (Tests Learning Outcomes 3.4)

1. Define Mode
2. Give three demerits of the mode

SAQ 3.5 (Tests Learning Outcomes 3.5)


Name the partition values

SAQ 3.6 (Tests Learning Outcomes 3.6)


Mention the usefulness of the midrange and Geometric mean

Notes on SAQ
SAQ 3.1

Arithmetic mean, mode, median, geometric mean

SAQ 3.2

1. The arithmetic mean of a set of observations is the sum of the observations divided by the
number of observations.

2. $$\bar{X} = \frac{\sum_{i=1}^{K} f_i X_i}{N}; \qquad N = \sum_{i=1}^{K} f_i$$

SAQ 3.3

$$\tilde{X} = L_m + \left(\frac{\frac{N}{2} - Cf_b}{f_m}\right) w$$

SAQ 3.4

1. The Mode is the value of the variable that occurs most often in a set of data. It is the most
unstable measure of location

2.

i. It is not a unique measure of location,


ii. It presents a misleading picture of the distribution
iii. It does not take into account all the available data

SAQ 3.5

The quartile, deciles and percentiles

SAQ 3.6

Information on the midrange of temperature readings taken by meteorologists is used by visitors
in the tourism industry.

The geometric mean is very useful in the computation of rates and indices, e.g. the computation
of price indices.

Study Session 4 Measures of Dispersion/Variation

Introduction
Dispersion (or variation) is the degree of scatter of the individual values of a variable about a
central value such as the median or the mean.

In this Study Session we shall discuss the range, the semi-interquartile range, the mean deviation
from the mean and from the median, the variance and the standard deviation.

Learning Outcomes for Study Session 4


When you have studied this session, you should be able to:

4.1 Explain the meaning of variation and its measures


4.2 Explain the Range
4.3 Explain Mean deviation and its Calculation
4.4 Explain The variance and its calculation
4.5 Explain Standard Deviation
4.6 Explain the use of coding method when dealing with large values of a variable

4.1 Variation and its Measures


Weight, like so many other things, is not static or unchanging. Not everyone who is 5 feet tall
weighs 100 pounds; there is some variability. When reporting these numbers or reviewing them
for a project, a researcher needs to understand how much difference there is in the scores. This
is where measures of variability come in.

Box 4.1: Definition of Variation

Variation can be defined as a way to show how data is dispersed, or spread out.

Several measures of variation are used in statistics; they will be discussed in the course of this
study session.

4.2 The Range


This is the simplest measure of variation. It is the difference between the largest and the smallest
value in a set of data.
Range = X(max) − X(min)

The range is thus a measure which is very easy to determine and use. It is reasonably efficient
only for small samples (n < 10); for larger samples it is a poor measure, as it ignores all the
values in between the extremes. It is commonly used in statistical quality control.
However, the range may fail to discriminate if the distributions are of different types.

Semi-Interquartile Range: This is half the difference between the first and third quartiles. It is
a good measure of spread for the midrange and the quartiles.

$$S.I.R = \frac{Q_3 - Q_1}{2}$$

4.3 The Mean Absolute Deviation


Mean deviation is the mean absolute deviation from the centre. The measure of the centre could be
the arithmetic mean or the median. It can be shown that the mean deviation of a distribution is
least when the deviations are taken from the median. Given a set X1, X2, …, XN, the mean
deviation from the arithmetic mean is defined by:

$$MD_{\bar{X}} = \frac{\sum_{i=1}^{N} |X_i - \bar{X}|}{N} \quad \text{for simple series}$$

For grouped data

$$MD_{\bar{X}} = \frac{\sum_{i=1}^{N} f_i |X_i - \bar{X}|}{\sum_{i=1}^{N} f_i}$$
Example 4.1
Below are the ages of 10 heads of household randomly selected from a community:
54, 59, 35, 41, 46, 25, 47, 60, 54, 46

Find the (i) range (ii) mean (iii) mean deviation from the mean (iv) mean deviation
from the median.
Solution
i. Range = 60 – 25 = 35

ii. Mean:

$$\bar{X} = \frac{\sum X}{n} = \frac{54 + 59 + \cdots + 46}{10} = \frac{467}{10} = 46.7$$

iii. Mean deviation from the mean:

$$MD_{\bar{X}} = \frac{\sum |X - \bar{X}|}{n} = \frac{|54 - 46.7| + |59 - 46.7| + \cdots + |46 - 46.7|}{10}$$

$$= \frac{7.3 + 12.3 + 11.7 + 5.7 + 0.7 + 21.7 + 0.3 + 13.3 + 7.3 + 0.7}{10} = \frac{81}{10} = 8.10$$

iv. Array: 25, 35, 41, 46, 46, 47, 54, 54, 59, 60

$$\text{Median} = \frac{X_{(n/2)} + X_{(n/2+1)}}{2} = \frac{46 + 47}{2} = 46.5$$

$$MD_{\tilde{X}} = \frac{|54 - 46.5| + |59 - 46.5| + \cdots + |46 - 46.5|}{10}$$

$$= \frac{7.5 + 12.5 + 11.5 + 5.5 + 0.5 + 21.5 + 0.5 + 13.5 + 7.5 + 0.5}{10} = \frac{81}{10} = 8.1$$
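The two mean deviations in Example 4.1 can be verified with a short script. The helper name `mean_deviation` is illustrative; `statistics.mean` and `statistics.median` come from the Python standard library.

```python
import statistics

data = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46]   # data of Example 4.1


def mean_deviation(values, centre):
    """Mean absolute deviation of the values about a chosen centre."""
    return sum(abs(x - centre) for x in values) / len(values)


mean = statistics.mean(data)       # 46.7
median = statistics.median(data)   # 46.5
md_mean = mean_deviation(data, mean)
md_median = mean_deviation(data, median)
print(round(md_mean, 2), round(md_median, 2))  # 8.1 8.1
```

For this particular data set the two values coincide; in general the mean deviation about the median is never larger than that about the mean.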

Example 4.2
The table below shows the frequency distribution of the scores of 42 students in an STA 111 test.

Table 4.1

Scores      No. of Students (f)
0 – 10              2
10 – 20             5
20 – 30             8
30 – 40            12
40 – 50             9
50 – 60             5
60 – 70             1

Find the mean deviation from the mean for the data.

Solution

Table 4.2

Classes     X     f     fX    X − X̄    |X − X̄|   f|X − X̄|
0 – 10      5     2     10   −29.52     29.52      59.04
10 – 20    15     5     75   −19.52     19.52      97.60
20 – 30    25     8    200    −9.52      9.52      76.16
30 – 40    35    12    420     0.48      0.48       5.76
40 – 50    45     9    405    10.48     10.48      94.32
50 – 60    55     5    275    20.48     20.48     102.40
60 – 70    65     1     65    30.48     30.48      30.48
Total            42   1450                        465.76

$$\bar{X} = \frac{\sum fX}{\sum f} = \frac{1450}{42} = 34.52$$

$$MD_{\bar{X}} = \frac{\sum f|X - \bar{X}|}{\sum f} = \frac{465.76}{42} = 11.09$$

4.4 The Variance
The variance of a set of observations is the average of the squared deviations from the mean.

Let x1, x2, x3, …, xn be a random sample from a population. The sample variance $S^2$ is
defined as:

$$S^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2, \quad \text{where } \bar{X} = \frac{\sum X_i}{n}$$

for discrete data or simple series.

For grouped data, the sample variance is defined as:

$$S^2 = \frac{\sum f_i (X_i - \bar{X})^2}{\sum f_i}$$

Another formula for calculating the variance can be derived from the above as follows:

$$nS^2 = \sum_{i=1}^{n}(X_i - \bar{X})^2 = \sum_{i=1}^{n}\left(X_i^2 - 2\bar{X}X_i + \bar{X}^2\right)$$

$$= \sum X_i^2 - 2\bar{X}\sum X_i + n\bar{X}^2$$

$$= \sum X_i^2 - n\bar{X}^2 \quad \text{(since } \sum X_i = n\bar{X}\text{)}$$

Therefore

$$S^2 = \frac{1}{n}\sum X_i^2 - \bar{X}^2$$

Similarly, for grouped data

$$S^2 = \frac{\sum f_i X_i^2}{\sum f_i} - \bar{X}^2$$

4.5 Standard Deviation


The standard deviation is the square root of the variance. It is sometimes referred to as the root
mean squared deviation from the mean (RMSD).
It should be noted that the variance is measured in squared units of X rather than in the units of
X itself. This makes it difficult to interpret the size of the variance. A measure of variability that
is closely related to the variance but expressed in the same unit as the observations is called the
standard deviation.
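The definitional form of the variance (average squared deviation) and the shortcut form (mean of the squares minus the square of the mean) from Section 4.4 can be compared numerically. The sketch below uses the data of Example 4.1 and divides by n, as the handout does; the variable names are illustrative.

```python
data = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46]   # data of Example 4.1
n = len(data)
mean = sum(data) / n

# Definitional form: average of the squared deviations from the mean.
var_definition = sum((x - mean) ** 2 for x in data) / n
# Shortcut form: mean of the squares minus the square of the mean.
var_shortcut = sum(x * x for x in data) / n - mean ** 2

print(round(var_definition, 2), round(var_shortcut, 2))  # 107.61 107.61
```

The shortcut form needs only the running totals of X and X squared, which is why it is convenient for hand calculation.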

In-Text Question
Standard deviation could be defined as?

a. The cube root of the variance


b. The square root of the variance
c. Both the square and cube root of the variance
d. Fourth root of the variance

In-Text Answer
b.) The square root of the variance

Standard deviation is the positive square root of the variance. It is defined as

$$S = \sqrt{\frac{\sum_{i=1}^{N}(X_i - \bar{X})^2}{N}} \quad \text{or} \quad S = \sqrt{\frac{\sum X_i^2}{n} - \bar{X}^2}$$
Example 4.2

Consider the data in Example 4.1. Calculate the standard deviation and the coefficient of variation.

Solution:

i. Standard deviation:

$$S = \sqrt{\frac{\sum (X - \bar{X})^2}{n}} = \sqrt{\frac{(54 - 46.7)^2 + \cdots + (46 - 46.7)^2}{10}} = \sqrt{\frac{1076.10}{10}} = 10.37$$

ii. Coefficient of variation:

$$C.V = \frac{S}{\bar{X}} \times 100 = \frac{10.37}{46.7} \times 100 = 22.21$$
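The standard deviation and coefficient of variation for the data of Example 4.1 can be checked as follows. The sketch divides by n (the population form used in this handout), which is what `statistics.pstdev` from the Python standard library computes.

```python
import statistics

data = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46]   # data of Example 4.1

mean = statistics.mean(data)
sd = statistics.pstdev(data)   # square root of the variance with divisor n
cv = sd / mean * 100           # coefficient of variation, as a percentage

print(round(sd, 2), round(cv, 2))  # 10.37 22.21
```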

Comparison of Dispersion: Two distributions with different means and units of measurement are
compared using the coefficient of variation.

Definition: The coefficient of variation (C.V) is a dimensionless quantity that measures the
relative variation between two series observed in different units.
It is defined as the ratio of the standard deviation to the mean of a set of data, expressed as a
percentage, i.e.

$$C.V = \frac{S}{\bar{X}} \times 100$$

The distribution with the smaller C.V is said to be more consistent (less variable).

4.6 Coding Method
This is the method used when large values of the variable are involved in the calculation.

It is achieved by choosing one of the values (or class marks) as the assumed mean (A) and
determining the common factor (C). The values of the variable $X_i$ (or class marks) are
transformed using the code:

$$U_i = \frac{X_i - A}{C}$$

Thus the formula for calculating the variance becomes

$$S^2 = C^2\left[\frac{\sum f_i U_i^2}{\sum f_i} - \left(\frac{\sum f_i U_i}{\sum f_i}\right)^2\right]$$

Example 4.3

Given the following grouped data, compute the (i) mean, (ii) standard deviation and
(iii) coefficient of variation, using an assumed mean of 77 and 5 as the common factor.

Table 4.3

Class       f
50 – 54     1
55 – 59     2
60 – 64    10
65 – 69    12
70 – 74    18
75 – 79    25
80 – 84     9
85 – 89     6
90 – 94     4
95 – 99     3
Total      90

Solution

Table 4.5

Classes     f    Class Mark (X)   X − A   U = (X − A)/C    fU    U²    fU²
50 – 54     1         52           −25        −5           −5    25     25
55 – 59     2         57           −20        −4           −8    16     32
60 – 64    10         62           −15        −3          −30     9     90
65 – 69    12         67           −10        −2          −24     4     48
70 – 74    18         72            −5        −1          −18     1     18
75 – 79    25         77             0         0            0     0      0
80 – 84     9         82             5         1            9     1      9
85 – 89     6         87            10         2           12     4     24
90 – 94     4         92            15         3           12     9     36
95 – 99     3         97            20         4           12    16     48
Total      90                                             −40          330

A = 77, C = 5

$$\bar{X} = A + \left(\frac{\sum fU}{\sum f}\right) C = 77 + \left(\frac{-40}{90}\right) \times 5 = 77 - 2.22 = 74.78$$

$$S^2 = C^2\left[\frac{\sum fU^2}{\sum f} - \left(\frac{\sum fU}{\sum f}\right)^2\right] = 25\left[\frac{330}{90} - \left(\frac{-40}{90}\right)^2\right] = 25\,(3.6667 - 0.1975) = 86.73$$

$$S = \sqrt{S^2} = \sqrt{86.73} = 9.31$$

Coefficient of variation:

$$C.V = \frac{S}{\bar{X}} \times 100 = \frac{9.31}{74.78} \times 100 = 12.45$$
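The coding method can be cross-checked against the direct grouped formulas. The sketch below uses the class marks and frequencies of Table 4.3 with A = 77 and C = 5; the variable names are illustrative, and the variance uses the divisor ∑f as in Section 4.4.

```python
import math

marks = list(range(52, 98, 5))              # class marks 52, 57, ..., 97
freqs = [1, 2, 10, 12, 18, 25, 9, 6, 4, 3]  # frequencies from Table 4.3
A, C = 77, 5                                # assumed mean and common factor

n = sum(freqs)
U = [(x - A) // C for x in marks]           # coded values -5, -4, ..., 4
sum_fu = sum(f * u for f, u in zip(freqs, U))
sum_fu2 = sum(f * u * u for f, u in zip(freqs, U))

# Coding method
mean = A + C * sum_fu / n
variance = C ** 2 * (sum_fu2 / n - (sum_fu / n) ** 2)
sd = math.sqrt(variance)
cv = sd / mean * 100

# Direct grouped formulas, for comparison
direct_mean = sum(f * x for f, x in zip(freqs, marks)) / n
direct_var = sum(f * (x - direct_mean) ** 2 for f, x in zip(freqs, marks)) / n

print(round(mean, 2), round(sd, 2), round(cv, 2))
print(math.isclose(mean, direct_mean), math.isclose(variance, direct_var))
```

Note the factor C squared in the variance: coding divides every deviation by C, so the squared deviations must be scaled back by C squared.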

Summary
In Study Session 4, you have learnt that:

1. Variation is a way to show how data is dispersed, or spread out.


2. The range is the simplest measure of variation
3. The range is the difference between the largest and the smallest value in a set of data
4. Mean deviation is the mean absolute deviation from the center
5. The variance of a set of observations is the average of the squared deviation from the
mean
6. The standard deviation is the square root of the variance.
7. The Standard Deviation is also referred to as the root mean squared deviation from the
mean
8. The Coding Method is used when larger values of the variable are involved in calculation

Self-Assessment Question (SAQs) for Study Session 4
Now that you have completed this study session, you can assess how well you have achieved its
Learning outcomes by answering the following questions. Write your answers in your study
Diary and discuss them with your Tutor at the next study Support Meeting. You can check your
answers with the Notes on the Self-Assessment questions at the end of this Module.

SAQ 4.1 (Tests Learning Outcomes 4.1)


Define variation

SAQ 4.2 (Tests Learning Outcomes 4.2)


What is a range?

SAQ 4.3 (Tests Learning Outcomes 4.3)


What is the formula for calculating the mean deviation from grouped data?

SAQ 4.4 (Tests Learning Outcomes 4.4)


What is a variance?

SAQ 4.5 (Tests Learning Outcomes 4.5)


The standard deviation is sometimes referred to as?

a. The root mean squared deviation from the mean (RMSD)


b. The root mean square
c. The cube root mean square of the deviation (CMSD)
d. Standard means of measurement

SAQ 4.6 (Tests Learning Outcomes 4.6)


When is the coding method used in calculations?

Notes on SAQ
SAQ 4.1
Variation can be defined as a way to show how data is dispersed, or spread out.

SAQ 4.2
This is the simplest measure of variation. It is the difference between the largest and the smallest
value in a set of data.

SAQ 4.3

$$MD_{\bar{X}} = \frac{\sum_{i=1}^{N} f_i |X_i - \bar{X}|}{\sum_{i=1}^{N} f_i}$$

SAQ 4.4

The variance of a set of observations is the average of the squared deviation from the mean

SAQ 4.5

a.) The root mean squared deviation from the mean (RMSD)

SAQ 4.6

This method is used when larger values of the variable are involved in calculation.

Study Session 5 Algebraic Treatment of Mean and Variance

Introduction
It is sometimes necessary to adjust the values of the mean and variance to correct for mistakes.
It may also be desired to combine these statistics without recourse to the individual observations
of the variable. The various methods of doing this will be discussed in this study session.

Learning Outcomes for Study Session 5


When you have studied this session, you should be able to:

5.1 Calculate the pooled mean of two or more variables


5.2 Adjust the values of mean, variances and standard deviation for mistakes

5.1 Pooled Mean and Variance


You have learnt how to compute the mean and variance from univariate data. Sometimes we
may have information about the means and variances of two or more variates and desire to
find the combined mean and variance. This can be achieved without using the individual values
of the variables.

Given two sets of data consisting of $n_1$ and $n_2$ items, with means $\bar{X}_1$ and $\bar{X}_2$ and
variances $S_1^2$ and $S_2^2$ respectively, the combined mean is defined by

$$\bar{X}_{12} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2}$$

and, when the two sets have the same mean, the combined variance is

$$S^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$

Suppose we have, for i = 1, 2, 3, …, k,

$n_i$ = number of observations in variable i,

$\bar{X}_i$ = mean of variable i,

$S_i^2$ = variance of variable i.

Then the pooled (combined) mean is defined as

$$\bar{X}_{12 \ldots k} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2 + \cdots + n_k\bar{X}_k}{n_1 + n_2 + \cdots + n_k} = \frac{\sum_{i=1}^{k} n_i\bar{X}_i}{\sum_{i=1}^{k} n_i}$$

The pooled (combined) variance is given by

$$\sigma^2_{12 \ldots k} = \frac{(n_1 - 1)S_1^2 + \cdots + (n_k - 1)S_k^2 + n_1(\bar{X}_1 - \bar{X}_{12 \ldots k})^2 + \cdots + n_k(\bar{X}_k - \bar{X}_{12 \ldots k})^2}{n_1 + n_2 + \cdots + n_k - k}$$

Example 5.1
The means and standard deviations of two variables of 100 and 150 items are 50 and 5, and 40
and 6 respectively. Find the standard deviation of all the 250 items taken together.
Solution

$$\bar{X}_{12} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2} = \frac{100(50) + 150(40)}{250} = 44$$

$$\sigma^2_{12} = \frac{99(5^2) + 149(6^2) + 100(50 - 44)^2 + 150(40 - 44)^2}{248} = \frac{13839}{248} = 55.80$$

$$\sigma_{12} = \sqrt{55.80} = 7.47$$

Example 5.2
A survey was conducted at three locations in a community to study a single variable. At each
location, the sample size ($n_i$), the mean ($\bar{X}_i$) and the standard deviation ($\sigma_i$) are
given in the following table.

Table 5.1

Location      I     II    III
$n_i$        200    250   300
$\bar{X}_i$   95     10    15
$\sigma_i$     3      4     5

Obtain the combined mean and standard deviation for the variable in all the three locations.
Solution

$$\bar{X}_{123} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2 + n_3\bar{X}_3}{n_1 + n_2 + n_3} = \frac{200(95) + 250(10) + 300(15)}{200 + 250 + 300} = \frac{26000}{750} = 34.7 \text{ (or 35)}$$

$$\sigma^2_{123} = \frac{n_1(\bar{X}_1 - \bar{X}_{123})^2 + n_2(\bar{X}_2 - \bar{X}_{123})^2 + n_3(\bar{X}_3 - \bar{X}_{123})^2 + (n_1 - 1)\sigma_1^2 + (n_2 - 1)\sigma_2^2 + (n_3 - 1)\sigma_3^2}{n_1 + n_2 + n_3 - 3}$$

$$= \frac{200(95 - 34.7)^2 + 250(10 - 34.7)^2 + 300(15 - 34.7)^2 + 199(9) + 249(16) + 299(25)}{747}$$

$$= \frac{727218 + 152522.5 + 116427 + 13250}{747} = \frac{1009417.5}{747} = 1351.30$$

$$\sigma_{123} = \sqrt{1351.30} = 36.76$$
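The pooled mean and variance formulas of Section 5.1 can be sketched as a function that takes only the group sizes, means and standard deviations. The function name `pooled_mean_variance` is hypothetical; the divisor N − k follows the formula given in this session.

```python
import math


def pooled_mean_variance(ns, means, sds):
    """Combined mean and variance of k groups from their summary statistics."""
    N = sum(ns)
    k = len(ns)
    mean = sum(n * m for n, m in zip(ns, means)) / N
    within = sum((n - 1) * s ** 2 for n, s in zip(ns, sds))         # (n_i - 1) * S_i^2
    between = sum(n * (m - mean) ** 2 for n, m in zip(ns, means))   # n_i * (mean_i - pooled mean)^2
    return mean, (within + between) / (N - k)


# Summary statistics of Example 5.1
mean_12, var_12 = pooled_mean_variance([100, 150], [50, 40], [5, 6])
print(mean_12, round(var_12, 2), round(math.sqrt(var_12), 2))
```

The "between" term is what makes the combined variance larger than either group variance when the group means differ.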

5.2 Adjusting Values of Mean and Standard Deviations for Mistakes
Sometimes mistakes occur in the computation of mean and variance of a set of data when a
correct value in the original data is replaced by an incorrect one. Instead of going through the
entire process to correct such mistakes, some simple algebraic adjustment can be made as shown
in the following examples.

Example 5.3
The mean and standard deviation of a set of 100 observations were worked out as 40 and 5
respectively by a student who by mistake took the value 50 in place of 40 for one observation.
Recalculate the correct mean and standard deviation.

Solution

n = 100; $\bar{X}$ = 40; $\sigma^2$ = 25

$$\bar{X} = \frac{\sum X}{n} \quad \Rightarrow \quad 40 = \frac{\sum X}{100}$$

Incorrect: ∑X = 4000
Correct: ∑X = 4000 – 50 + 40 = 3990

$$\text{Corrected mean } \bar{X} = \frac{3990}{100} = 39.90$$

$$\sigma^2 = \frac{\sum X^2}{n} - \bar{X}^2 \quad \Rightarrow \quad 25 = \frac{\sum X^2}{100} - 40^2$$

2500 = ∑X² – 160,000

∴ Incorrect ∑X² = 162,500

Correct ∑X² = 162,500 – 50² + 40²
= 161,600

$$\text{Correct } \sigma^2 = \frac{161600}{100} - (39.90)^2 = 1616 - 1592.01 = 23.99$$

$$\sigma = \sqrt{23.99} = 4.90$$
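The adjustment in Example 5.3 works on the totals ∑X and ∑X² rather than on the individual observations. A minimal sketch, with illustrative variable names:

```python
import math

n, wrong_mean, wrong_sd = 100, 40, 5
wrong_value, right_value = 50, 40        # value used by mistake, and the correct value

# Recover the (incorrect) totals from the reported mean and standard deviation.
sum_x = n * wrong_mean                          # 4000
sum_x2 = n * (wrong_sd ** 2 + wrong_mean ** 2)  # 162500, from sd^2 = sum_x2/n - mean^2

# Take out the wrong value and put in the right one.
sum_x += right_value - wrong_value
sum_x2 += right_value ** 2 - wrong_value ** 2

mean = sum_x / n
variance = sum_x2 / n - mean ** 2
sd = math.sqrt(variance)

print(mean, round(variance, 2))  # 39.9 23.99
print(round(sd, 3))
```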

Self-Assessment Question (SAQs) for Study Session 5


Now that you have completed this study session, you can assess how well you have achieved its
Learning outcomes by answering the following questions. Write your answers in your study
Diary and discuss them with your Tutor at the next study Support Meeting. You can check your
answers with the Notes on the Self-Assessment questions at the end of this Module.

SAQ (Tests Learning Outcomes)

1. Find the mean, median and mode of the following observations:
5, 6, 10, 15, 22, 16, 6, 10, 6
2. The six numbers 4, 9, 8, 7, 4 and X have a mean of 7. Find the value of X and hence
calculate the coefficient of variation for the six numbers.
3. The arithmetic mean of five observations is 4.4 and the variance is 8.24. If 3 of the 5
observations are 1, 2 and 6, find the other two.
4. The mean and standard deviation of 120 items were found by a student to be 60 and 5
respectively. If at the time of calculation two items were wrongly recorded as 45 and 55
instead of 54 and 70, find the correct mean and standard deviation.

Study session 6: Measure of Skewness and Kurtosis

Introduction
A fundamental task in many statistical analyses is to characterize the location and variability of a
data set. A further characterization of the data includes skewness and kurtosis.

In this Study session, you will learn the definition of skewness and kurtosis, you will also learn
how to calculate measure of skewness and kurtosis from simple series and grouped data.

Learning outcomes for study session 6


At the end of this study session you should be able to:

6.1 Define skewness and kurtosis;


6.2 Calculate measure of skewness and kurtosis from simple series and grouped data;
6.3 Determine whether a set of data is normally distributed, the direction of skewness and the
level of peakedness, and interpret your result.

6.1 Define skewness and kurtosis


• Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A
distribution, or data set, is symmetric if it looks the same to the left and right of the center
point. For univarite data X1, X2, -----, XN, the formula for skewness is
For discrete data,

 X X
N 3
i 1
Skewness:  3 
i

( N  1) s 3

 f X  X
3

3
i
For grouped data 
( N  1) s 3

Where X is the mean, S is the standard deviation, and N is the number of data points.

The skewness for a normal distribution is zero, and any symmetric data should have a skewness
near zero.
Negative values for the skewness indicate data that are skewed left, and positive values for the
skewness indicate data that are skewed right. By skewness to the left, we mean that the left tail
is long relative to the right tail. Similarly, skewness to the right means that the right tail is
long relative to the left tail.

• Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution.
That is, data sets with high kurtosis tend to have a distinct peak near the mean, decline rather
rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the
mean rather than a sharp peak. A uniform distribution would be the extreme case. Kurtosis is
the standardized fourth central moment of a distribution.

The histogram is an effective graphical technique for showing both the skewness and kurtosis of
a data set.
For univariate data X1, X2, ..., XN, the formula for kurtosis is:

For discrete data,

$$\text{Kurtosis} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^4}{(N-1)\,s^4}$$

For grouped data,

$$\text{Kurtosis} = \frac{\sum f_i (X_i - \bar{X})^4}{(N-1)\,s^4}$$

where $\bar{X}$ is the mean, $s$ is the standard deviation, and $N$ is the number of data points.

In-text Question
_____________ is a measure of symmetry, or more precisely, the lack of symmetry?

a) Skewness
b) Kurtosis
c) Grouped data
d) Sample series

In-text Answer
a) Skewness

6.2 Calculating measure of skewness and kurtosis from simple series and grouped data

• Excess Kurtosis: The kurtosis for a standard normal distribution is three. For this reason,
excess kurtosis is defined as follows.

For discrete data:

$$K = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^4}{(N-1)\,s^4} - 3$$

For grouped data:

$$K = \frac{\sum f_i (X_i - \bar{X})^4}{(N-1)\,s^4} - 3$$

The standard normal distribution has excess kurtosis of zero. Positive excess kurtosis indicates a
“peaked” distribution and negative excess kurtosis indicates a “flat” distribution.

The peakedness of a distribution can be shown as in the diagram below:

In-text Question
A distribution, or data set, is symmetric if it looks the same to the left and right of the center
point. True\ False

a) False
b) True

c) None of the above
d) All of the above

In-text Answer
b) True

Diagram 6.1 - Peakedness of a Distribution
[Diagram: three frequency curves labelled A, B and C]

A -------------- Leptokurtic
B -------------- Mesokurtic (Normal)
C -------------- Platykurtic

Example 6.1
Twelve numbers generated by a computer are as follows:
10, 43, 67, 89, 70, 80, 62, 80, 03, 42, 71, 35
a. Obtain the measures of skewness and kurtosis.
b. Interpret your result.

Solution

Table 6.1

$X_i$    $X_i-\bar{X}$    $(X_i-\bar{X})^2$    $(X_i-\bar{X})^3$    $(X_i-\bar{X})^4$
03       -51.3            2631.69              -135005.697          6925792.26
10       -44.3            1962.49              -86938.307           3851367.00
35       -19.3            372.49               -7189.057            138748.80
42       -12.3            151.29               -1860.867            22888.66
43       -11.3            127.69               -1442.897            16304.74
62       7.7              59.29                456.533              3515.30
67       12.7             161.29               2048.383             26014.46
70       15.7             246.49               3869.893             60757.32
71       16.7             278.89               4657.463             77779.63
80       25.7             660.49               16974.593            436247.04
80       25.7             660.49               16974.593            436247.04
89       34.7             1204.09              41781.923            1449832.73
Total 652                 8516.68              -145673.444          13445494.99

$$\bar{X} = \frac{\sum X}{n} = \frac{652}{12} = 54.3$$

$$S = \sqrt{\frac{8516.68}{11}} = 27.825$$

$$\text{Skewness} = \frac{\sum (x_i - \bar{x})^3}{(N-1)\,S^3} = \frac{-145673.444}{11 \times (27.825)^3} = \frac{-145673.444}{236972.6385} = -0.6147$$

That is, the distribution is negatively skewed.

$$\text{Kurtosis} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^4}{(N-1)\,S^4} = \frac{13445494.99}{11 \times (27.825)^4} = \frac{13445494.99}{6593763.668} = 2.039$$

$$\text{Excess Kurtosis: } K = \text{Kurtosis} - 3 = 2.039 - 3 = -0.961$$

i.e. platykurtic.
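
As a cross-check of Example 6.1, the discrete-data formulas can be evaluated directly. The sketch below is illustrative only: the function name `moment_measures` is an assumption of this example, and it works with the unrounded mean (54.33 rather than 54.3), so its results differ slightly from the hand calculation.

```python
import math


def moment_measures(data):
    """Return (skewness, kurtosis, excess kurtosis) for ungrouped data.

    Hypothetical helper for illustration; it uses the (N - 1) * s**k
    denominators given in the formulas of this study session.
    """
    n = len(data)
    mean = sum(data) / n
    # Sample standard deviation with divisor N - 1.
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    skewness = sum((x - mean) ** 3 for x in data) / ((n - 1) * s ** 3)
    kurtosis = sum((x - mean) ** 4 for x in data) / ((n - 1) * s ** 4)
    return skewness, kurtosis, kurtosis - 3


# The twelve computer-generated numbers of Example 6.1.
numbers = [10, 43, 67, 89, 70, 80, 62, 80, 3, 42, 71, 35]
skewness, kurtosis, excess = moment_measures(numbers)
print(round(skewness, 3), round(kurtosis, 3), round(excess, 3))
```

Run as written, the printed values should be close to -0.62, 2.04 and -0.96, which agrees with the negatively skewed, platykurtic conclusion of the example.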

In-text Question
The kurtosis for a standard normal distribution is three. For this reason, excess kurtosis is
defined as ____________ ?

a) $\dfrac{\sum_{i=1}^{N} (x_i - \bar{x})^4}{(N-1)\,S^4}$

b) $\dfrac{\sum (x_i - \bar{x})^3}{(N-1)\,S^3}$

c) $\bar{X} = \dfrac{\sum X}{n} = \dfrac{652}{12} = 54.3$

d) $K = \dfrac{\sum_{i=1}^{N} (X_i - \bar{X})^4}{(N-1)\,s^4} - 3$

In-text Answer

d) $K = \dfrac{\sum_{i=1}^{N} (X_i - \bar{X})^4}{(N-1)\,s^4} - 3$

6.3 Determining whether a set of data is normally distributed, the direction of skewness
and the level of peakedness
Example 6.2
Given the data below:

Table 6.2
Class f

10-14 1
15-19 4
20-24 8
25-29 19
30-34 35
35-39 20
40-44 7
45-49 5
50-54 1

a. Draw the histogram for the above data.
b. Obtain the measure of (i) skewness and (ii) kurtosis.
c. Interpret your result.

Solution

Diagram 6.2: Histogram of the data in Table 6.2, with frequency (0 to 40) on the vertical axis
and the class boundaries from 9.5 to 54.5 on the horizontal axis.

Table 6.3

Class    Mid-point $X_i$    f      $fX_i$    $X_i-\bar{X}$    $(X_i-\bar{X})^2$    $(X_i-\bar{X})^3$    $(X_i-\bar{X})^4$
10-14    12                 1      12        -20.1            404.01               -8120.60             163224.08
15-19    17                 4      68        -15.1            228.01               -3442.95             51988.56
20-24    22                 8      176       -10.1            102.01               -1030.30             10406.04
25-29    27                 19     513       -5.1             26.01                -132.65              676.52
30-34    32                 35     1120      -0.1             0.01                 -0.001               0.0001
35-39    37                 20     740       4.9              24.01                117.65               576.48
40-44    42                 7      294       9.9              98.01                970.299              9605.96
45-49    47                 5      235       14.9             222.01               3307.95              49288.44
50-54    52                 1      52        19.9             396.01               7880.599             156823.92
Total                       100    3210

$\sum fX_i = 3210$

$\sum f = 100$

$$\bar{X} = \frac{\sum fX_i}{\sum f} = \frac{3210}{100} = 32.1$$
Table 6.4

$f(X_i-\bar{X})^2$    $f(X_i-\bar{X})^3$    $f(X_i-\bar{X})^4$
404.01                -8120.6               163224.08
912.04                -13771.8              207954.24
816.08                -8242.4               83248.32
494.19                -2520.35              12853.88
0.35                  -0.035                0.0035
480.20                2353.00               11529.60
686.07                6792.093              67241.72
1110.05               16539.75              246442.20
396.01                7880.599              156823.92
Total 5299.00         910.257               949317.96

$$S^2 = \frac{\sum f (X_i - \bar{X})^2}{N-1} = \frac{5299}{99} = 53.52$$

$$S = 7.316$$

$$\text{Skewness} = \frac{\sum f (x_i - \bar{x})^3}{(N-1)\,S^3} = \frac{910.257}{99 \times 391.58} = 0.0235$$

$$\text{Kurtosis} = \frac{\sum f (x_i - \bar{x})^4}{(N-1)\,S^4} = \frac{949317.96}{99 \times 2864.8} = \frac{949317.96}{283615.5} = 3.3$$

$$\text{Excess Kurtosis: } K = \text{Kurtosis} - 3 = 0.3$$

Since skewness = 0.0235, kurtosis = 3.3 and excess kurtosis = 0.3, the distribution is near
normal. The small positive excess kurtosis indicates a peak slightly sharper than that of the
normal curve, i.e. the distribution is slightly leptokurtic.
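
The grouped-data calculation of Example 6.2 can be checked in the same way. This is a minimal sketch under the assumption that each class is represented by its mid-point; the function name `grouped_moment_measures` is hypothetical and is not part of the handout.

```python
import math


def grouped_moment_measures(midpoints, freqs):
    """Return (mean, skewness, kurtosis) for a frequency distribution.

    Hypothetical helper: each class is represented by its mid-point and
    the (N - 1) * s**k denominators of this study session are used.
    """
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / n
    variance = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs)) / (n - 1)
    s = math.sqrt(variance)
    skewness = sum(f * (x - mean) ** 3 for x, f in zip(midpoints, freqs)) / ((n - 1) * s ** 3)
    kurtosis = sum(f * (x - mean) ** 4 for x, f in zip(midpoints, freqs)) / ((n - 1) * s ** 4)
    return mean, skewness, kurtosis


# Mid-points and frequencies of the classes 10-14, 15-19, ..., 50-54 (Table 6.2).
midpoints = [12, 17, 22, 27, 32, 37, 42, 47, 52]
freqs = [1, 4, 8, 19, 35, 20, 7, 5, 1]

mean, skewness, kurtosis = grouped_moment_measures(midpoints, freqs)
print(round(mean, 1), round(skewness, 4), round(kurtosis, 2))
```

The results should be close to a mean of 32.1, a skewness of about 0.02 and a kurtosis between 3.3 and 3.4, consistent with a near-normal, slightly leptokurtic distribution.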

Summary for study session 6


In this study session, you have learnt:

1. The concept of Skewness and Kurtosis.


2. How to distinguish between Kurtosis and excess kurtosis and their interpretations.
3. Useful examples were given to illustrate the different formulae for their computation.

Self-Assessment Questions (SAQs) for Study Session 6


Now that you have completed this study session, you can assess how well you have achieved its
Learning outcomes by answering the following questions.

SAQ 6.1-6.2
1. Consider the data in post test question 3 in chapter 4, obtain the measure of skewness and
kurtosis.
2. Consider the data in post test question 1 in chapter 4, obtain a measure of Excess
Kurtosis and interpret your results.
3. Consider the post test question 1 in chapter 5, calculate the measure of Skewness and
interpret your result.


Study Session 7: Methods of Collecting Statistical Data

Introduction
In the previous sessions, you learnt the various ways in which a set of data can be summarized,
calculated some descriptive statistics, examined the shape of distributions and saw how
summaries can be combined and corrected for errors.
In this study session, you will learn about the various methods that can be employed in the
collection of statistical data.

Learning outcomes for study session 7


At the end of this lecture, you should be able to:

7.1 Explain the various methods of data collection


7.2 Discuss the problems of data collection in Nigeria.

7.1 The various methods of data collection


Data collection is an activity aimed at getting information to satisfy some decision objectives or
for purpose of scientific inquiry. The process of data collection varies with the nature of inquiry,
objective of the study and characteristic of the unit of inquiry.

Methods of Data Collection
There are five broad methods of data collection. They are:

Figure 7.1: methods of data collection

1. Documentary Sources: It is sometimes possible to answer some of the questions a survey is
intended to cover from available data.

An enquiry concerned with the leisure activities of a town's population may well begin by
getting statistical data about the use made of the local libraries, attendance at cinemas, and
membership of clubs and societies.

A mass of information about the populations studied in social surveys is available in historical
documents, statistical reports, records of institutions and other surveys.

Government departments possess a mass of information relating to individuals. Some of these
are census schedules, employment records, insurance cards, health records etc.
The only difficulty is that a survey researcher can hardly expect to gain access to these materials.

Some materials are collected in the form of case records by psychiatrists, social workers etc.,
and are of interest to sociologists and psychologists. Such materials have limitations for the
research worker in that they can only represent a highly specialized population, i.e. only the
cases that happen to come before social workers.

There are personal documents which can come directly from the informants such as diaries,
autobiographies and surveys. These give insight into personal character, experiences and beliefs
that formal interviewing can hardly achieve.

The possibility of any investigator bias affecting their contents is eliminated. The use of this
method has many difficulties, e.g.:
a. How to get the documents.
b. How to get a representative collection of documents.

Some people are better at writing letters and essays than others, not everybody can produce
documents, and documents are at their best when unsolicited. Data are usually collected by
copying out the relevant data from the available records.

2. Observation: Observation, the classic method of scientific enquiry, is defined as the
accurate watching and noting of phenomena as they occur in nature. The observer positions
himself and observes the activities and life of a community. How the observer positions
himself depends on:
a. The nature and size of the community.
b. What he wishes to observe.
c. His own personality and skill.

An example where this method is suitable is in the case of traffic censuses. Actual measurement
or counting also comes under the heading of observations. Examples occur in statistical quality
control.

Problems
i. If the characteristics of the population are to be inferred from those of sample, the sample
should ideally be randomly selected.
ii. To instruct an investigator to observe people of all types, men and women of different
ages, social class etc. does not make the sample a random one. It does not ensure that the
resultant group is representative.
iii. The observer can hardly be expected to observe and note everything relevant to the
subject.

iv. His selection of the aspects of behaviour and environment which he notes may follow
certain channels.
v. If what he is studying is very familiar, he may fail to note what is normal.

In-text Question
___________ is the classic method of scientific enquiry and is defined as the accurate watching
and noting of phenomena as they occur in nature.

a) Problems
b) Merits
c) Observation
d) Demerits

In-text Answer
c) Observation

Merits
The advantages of this method are similar to personal interview and the method has some unique
advantages such as:
i. Providing more reliable information.
ii. Supplying of additional and necessary information

Demerits
The disadvantages are also similar to personal interview.
i. It is exceptionally certified.
ii. Highly trained personnel are needed for observation.
iii. Because of scrutiny, it is time consuming.

3. Mail or Postal Questionnaires: This is one of the most widely used methods of data
collection mostly in social surveys. Questionnaires are mailed out to respondents who in turn are
expected to send them back through the post when they are duly completed. The choice of this
method is governed by:
a. Limited resources
b. Economic advantages
c. Potential efficiency.

In-text Question
_________ is one of the most widely used methods of data collection mostly in social surveys.

a) Diary
b) Telephone
c) Interview
d) Mail or Postal Questionnaires

In-text Answer
d) Mail or Postal Questionnaires

Merits
i. It is generally quicker and cheaper than other methods.
ii. It avoids the problems associated with the use of interviewers.

iii. It is useful when information concerning several members of household is required and
allows for some intra-household consultation.

iv. It is useful where the questions demand a considered answer rather than an immediate
one.

v. Questions of a personal or embarrassing nature are answered more willingly and accurately
than when the respondent is face to face with an interviewer who is a complete stranger.
vi. The problem of non-contacts in the sense of respondent not being at home is avoided.

Demerits
i. The method can only be considered when the questions are sufficiently simple and
straight forward to be understood with the help of the printed instructions and definitions.
It is unsuitable where the objectives of the survey take a good deal of explanation.

ii. The answers to mail questionnaire have to be accepted as final. There is no opportunity
to probe beyond the given answers.

iii. It is inappropriate where spontaneous (unplanned) answers are wanted, where it is
important that the views of one person only are obtained, or where it is essential that one
particular person in each household, and no one else, fills in the questionnaire.

iv. The answers cannot be treated as independent since the respondent can see all the
questions before answering any of them.

v. There is no opportunity to supplement the respondent's answers with observational data
such as his house, appearance, manner etc.

Some of the disadvantages of this method can be overcome by combining it with interview
method.

4. Personal Interview: This is the method used in most surveys. It could be a formal interview,
in which set questions are asked and the answers recorded in a standard form, or a less formal
one, in which the interviewer is at liberty to vary the sequence of questions, to explain their
meanings and to change the wording, or in which he/she may not have a set of questions at all
but only a number of key points around which to build the interview.

The interviewer should possess some vital qualities such as (a) Honesty, (b) Interest (c)
Accuracy (d) Adaptability (e) Personality and temperament (f) Intelligence and education.

Merits

i. The interviewer is free and has more opportunity to restructure questions whenever it is
necessary to do so.

ii. It allows more accurate information to be obtained by asking the respondent for further
explanation.

iii. A skilled interviewer can easily persuade an unwilling respondent. This will increase the
number of responses.

iv. A skilled interviewer will know when to make call backs and then make more effective
efforts.

v. In addition to recording verbal answers, the interviewer can note the non-verbal reactions
of respondents to questions.
vi. It can be used for persons of all educational levels.
vii. It can be used to explore areas in which little information exists.

In-text Question
___________ is the method that is used mainly in most surveys?

a) Intelligence
b) Personal interview
c) Adaptability
d) All of the above

In-text Answer
b) Personal interview

Demerits

i. Personal interviews are expensive to conduct if the sample to be taken is widely scattered
geographically.

ii. Unscrupulous interviewers may introduce bias by influencing the respondent's answers or
by recording answers to suit themselves.

iii. The respondent, in order to boost his image or to please the interviewer, may give biased
answers.

iv. It may be difficult to interview some individuals, such as high-income and influential
people who are not always available.

v. If recalls are necessary, and when the sample is large, it will take more time than
necessary to complete the survey.

vi. Respondents may give inaccurate or false information due to lapse in memory,
misunderstanding or may be deliberate.

vii. Larger field staffs are needed for interviewing.

5. Telephone: This is the method of collecting data through the telephone. Like other methods,
it has many advantages, especially in industrialized countries. In a developing country like
Nigeria, this method of collecting information cannot be efficient because of the inefficiency
of the telephone system.

Merits

i. It is faster than other methods.

ii. It is cheaper to collect information by phone than personal interview.

iii. It is more flexible than postal questionnaires.

iv. It encourages higher response rate than postal questionnaire.

v. Recall of respondents is quicker and easier than any other method.

vi. It is the best method of access to very difficult respondents.

vii. It facilitates recording of replies without causing any embarrassment to the respondent.

viii. It is very suitable for radio and television surveys.

Apart from the fact that the telephone system is not effective in a developing country and
therefore renders the method unsuitable, it has other demerits.

Demerits

a. Survey by telephone is limited to respondents having telephones, an obvious source of
bias.

b. If the population is widely located all over the country, cost consideration will limit
extensive coverage of the country.

c. The interviewer may be biased and as a result, influence the respondent.

d. Cost consideration may restrict the number of questions asked or the time given to the
respondents to answer the questions.

e. Answers given may not be treated in confidence as the telephone could be bugged or
even dropped.

7.2 Limitations of Data Collection in Nigeria


Generally, secondary data are limited in scope, and the information derived from them may not
satisfy all the needs of the researcher. This may lead to a reduction in the scope of the
research work or to the introduction of certain assumptions to fill the gaps created by
insufficient information.

In-text Question
Collection of data through phone like other method has many advantages. True\ False

In-text Answer
True

Summary for study session 7


In this study session, you have learnt:

1. The various methods of data collection.


2. The situations under which each of them can be employed were also highlighted, as well as
their relative merit and demerits.
3. Also the problems usually encountered in the process of collecting statistical data.

Self-Assessment Questions (SAQs) for Study Session 7


Now that you have completed this study session, you can assess how well you have achieved its
Learning outcomes by answering the following questions.

SAQ 7.1-7.2
1. What is statistical data collection?

2. What are the merits of personal interview?

3. Discuss the demerits of postal questionnaire method.

4. Observational method of data collection is best in social science research, Discuss.


Study Session 8: Regression Analysis

Introduction
In this study session, you will be introduced to the theory of linear regression analysis.
Different types of relationships will be shown on the scatter plot, and estimates of the model
parameters will be obtained by the method of least squares. An introduction to testing the
significance of the regression line will also be given.

Learning outcomes for Study Session 8


At the end of this study, you should be able to:

8.1 Discuss the concept of regression analysis;


8.2 Identify the types of regression models
8.3 Estimate the parameters of a regression model; and
8.4 Explain the testing of the significance of the model

8.1 Regression Analysis


Regression analysis is a statistical tool which helps to study the trend and pattern of movement in
one variable in response to changes in another variable on the basis of an assumed relationship
existing between them. Once this pattern is established, it can be used to predict one variable
from the other.

The variable being predicted is usually referred to as the response (dependent) variable and the
other variable is called the explanatory (independent) variable. The values of the explanatory
variable are usually fixed and under the control of the investigator while the values of the
response variable are determined by the values of the explanatory variables.

Thus regression analysis attempts to determine how changes in the explanatory variable affect
the response variable. The variables involved are assumed to be measured and recorded as
interval scaled or ratio scale data. If the variables are strictly qualitative (i.e. attributes) the
method of regression cannot be used.
The appropriate method used in studying association between two qualitative variables will be
discussed.

In-Text Question
The values of the explanatory variable are usually fixed. True or False

In-Text Answer
True

8.2 Types of Regression Models


A regression model may be:

Figure 8.1: Types of regression model

A regression model is simple if there is only one explanatory variable, and multiple if there
is more than one explanatory variable.

A regression model is linear if its parameters do not carry any exponents and are not
multiplied by other parameters in the model; otherwise, the model is said to be non-linear. The
value of the highest power in a model is called the order of the model.

Scatter Plot
The first step in the study of the relationship between two variables is to draw the scatter
diagram. It portrays the direction, form and strength of any relationship between quantitative
variables. It is drawn by plotting the values of the response variable (on the Y-axis) against
the values of the explanatory variable (on the X-axis).
The shape of the scattered points on the graph gives an idea of the type of relationship between
the two variables.

Types of Scatter Diagram

Diagram 8.1: Scatter diagrams (a) to (d), each plotting Y against X, showing some of the common
types of relationship that exist between two variables.

Diagram 8.1(a) depicts a linear relationship.

Diagram 8.1(b) and (c) depict non-linear relationships: Diagram 8.1(b) is a quadratic
relationship while Diagram 8.1(c) is an exponential relationship. Diagram 8.1(d) shows no
relationship between variables X and Y (i.e. a spurious relationship), since neither a line
nor a curve can be fitted to the scatter plot.

Please note that:


i. Scatter plot cannot be used for more than two variables.
ii. A non-linear regression model can be made linear through appropriate transformation.
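
As an illustration of note (ii), an exponential relationship can be made linear by taking natural logarithms. The symbols $a$, $b$, $Y'$ and $a'$ below are introduced for this example only, and the transformation assumes $Y > 0$ and $a > 0$.

```latex
Y = a\,e^{bX}
\quad\Longrightarrow\quad
\ln Y = \ln a + bX
\quad\Longrightarrow\quad
Y' = a' + bX, \qquad \text{where } Y' = \ln Y \text{ and } a' = \ln a .
```

The transformed model is linear in $a'$ and $b$, so a straight line can be fitted to the points $(X, \ln Y)$.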

In-Text Question
The following are types of regression models except _______________

A. Simple
B. Multiple
C. Short
D. Non-Linear

In-Text Answer
C. Short

8.3 The Simple Regression Model


The simple linear regression model describing the relationship between the response variable (Y)
and the explanatory variable (X) can be expressed as

$$Y_i = \beta_0 + \beta_1 X_i + e_i \qquad \text{for } i = 1, 2, \ldots, n$$

where there are n observations on both X and Y and

i. $Y_i$ is the ith observation on Y.
ii. $X_i$ is the ith observation on X.

$\beta_0$ is the intercept (the point at which the regression line cuts the Y-axis, i.e. the
value of Y when X = 0).

$\beta_1$ is the slope (regression coefficient) of the line. It gives the rate of change in Y per
unit change in X.

$e_i$ is the random error term, with mean 0 and variance $\sigma^2$. The parameters of the model
can be estimated by the method of Ordinary Least Squares (OLS).

Basic Assumptions of OLS

i. The relationship between X and Y is assumed to be linear.
ii. The $X_i$'s are predetermined (fixed) values assumed to be measured without error.
iii. The error terms $e_i$ are independent of X, i.e. $E(e_i X_i) = 0$.
iv. The error term is assumed to be normally distributed with mean zero and variance
$\sigma^2$, i.e. $e \sim N(0, \sigma^2)$.

The above assumptions imply that

$$Y_i = \beta_0 + \beta_1 X_i + e_i$$

is a random variable with the expectation (mean)

$$E(Y_i) = E(\beta_0 + \beta_1 X_i + e_i) = \beta_0 + \beta_1 X_i$$

since $\beta_0$ and $\beta_1$ are constants and $E(e_i) = 0$.

Similarly,

$$V(Y_i) = \mathrm{Var}(\beta_0 + \beta_1 X_i + e_i) = \mathrm{Var}(e_i) = \sigma_e^2$$

where $\sigma_e^2$ can be estimated by

$$S^2 = \frac{\sum (Y_i - \hat{Y}_i)^2}{n-1}$$

8.4 Estimation of Parameters
Suppose there are n pairs of observations on X and Y as
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
The assumed linear relationship is

$$Y_i = a + bX_i + e_i \qquad (8.1)$$

where a and b are estimates of $\beta_0$ and $\beta_1$ in the original model. Equation (8.1) can
be expressed as

$$e_i = Y_i - a - bX_i \qquad (8.2)$$

Let

$$Q = \sum e_i^2 = \sum (Y_i - a - bX_i)^2 \qquad (8.3)$$

The constants a and b can be obtained by minimizing Q with respect to a and b, i.e.

$$\frac{\partial Q}{\partial a} = -2\sum (Y_i - a - bX_i) = 0 \qquad (8.4)$$

$$\frac{\partial Q}{\partial b} = -2\sum (Y_i - a - bX_i)X_i = 0 \qquad (8.5)$$

The normal equations 8.4 and 8.5 can be solved simultaneously to obtain

$$a = \bar{Y} - b\bar{X} \qquad (8.6)$$

and

$$b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{SS_{XY}}{SS_{XX}} \qquad (8.7)$$

and $\sigma_e$ can be estimated by

$$S_e = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n-1}} = \sqrt{\frac{\sum (Y_i - \hat{Y}_i)^2}{n-1}} \qquad (8.9)$$
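
The step from the normal equations (8.4) and (8.5) to the estimates (8.6) and (8.7) is not written out above. One way to fill it in, using only the symbols already defined, is sketched below.

```latex
\begin{align*}
\sum (Y_i - a - bX_i) = 0
  &\;\Rightarrow\; \sum Y_i = na + b\sum X_i
  \;\Rightarrow\; a = \bar{Y} - b\bar{X}, \\
\sum (Y_i - a - bX_i)X_i = 0
  &\;\Rightarrow\; \sum X_iY_i = a\sum X_i + b\sum X_i^2 \\
  &\;\Rightarrow\; \sum X_iY_i = (\bar{Y} - b\bar{X})\,n\bar{X} + b\sum X_i^2 \\
  &\;\Rightarrow\; b = \frac{\sum X_iY_i - n\bar{X}\bar{Y}}{\sum X_i^2 - n\bar{X}^2}.
\end{align*}
```

The third line substitutes the expression for $a$ from the first line and uses $\sum X_i = n\bar{X}$.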

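where $Y_i$ is the observed value of Y and $\hat{Y}_i$ is the estimated value of Y.

The variance of the intercept is

$$\mathrm{Var}(\hat{\beta}_0) = \frac{\sigma^2 \sum X_i^2}{n\,SS_{XX}} \qquad (8.10)$$

and the variance of the regression coefficient is

$$\mathrm{Var}(\hat{\beta}_1) = \frac{\sigma^2}{SS_{XX}} \qquad (8.11)$$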

Coefficient of Determination
This is the proportion of variation in the response variable (Y) that is explained by the
explanatory variable (X).

It is defined by

$$R^2 = \frac{SS_{\text{Regression}}}{SS_{\text{Total}}} = \frac{b\left(\sum_{i=1}^{n} XY - n\bar{X}\bar{Y}\right)}{\sum_{i=1}^{n} Y^2 - n\bar{Y}^2} = \frac{\text{Explained variation}}{\text{Total variation}} \qquad (8.12)$$

where $0 \le R^2 \le 1$.

$R^2 = 0$ when $b = 0$, and $R^2 = 1$ when all the points fall on the fitted regression line.

8.5 Testing the Significance of the Model


It is always desirable to test the significance of the model, that is, to examine whether the
regression line is a good fit. If the line is a good fit, then the points on the scatter diagram
must fall on the line or lie very close to it.

This can be done by examining the residual plot, i.e. a plot of the residual errors
$e_i = Y_i - \hat{Y}_i$ against the data points.

The most objective method is to arrange the sums of squares and cross products in an Analysis
of Variance (ANOVA) table and carry out Fisher's test (F-test) or Student's test (t-test) as
follows:

Specify the Hypothesis

$H_0: b = 0$ (i.e. no relationship between X and Y)

$H_1: b \neq 0$ (i.e. a relationship exists between X and Y)

Choose $\alpha$, the level of significance.

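Table 8.1

ANOVA TABLE
SOURCE        df       SS                                    MS                     F-cal
Regression    K - 1    SSR = b SS_XY                         MSR = SSR/(K - 1)      MSR/MSE
Error         n - K    SSE = SSY - SSR                       MSE = SSE/(n - K)
Total         n - 1    SSY = $\sum Y_i^2 - (\sum Y_i)^2/n$

Here K is the number of parameters in the model (K = 2 for the simple linear regression model),
so that the regression and error degrees of freedom add up to n - 1.

The critical value is $F_{V_1, V_2, \alpha}$, where $V_1 = K - 1$, $V_2 = n - K$ and
$\alpha$ = 0.05 or 0.01.

Decision rule

Reject H0 if $F_{cal} > F_{V_1, V_2, \alpha}$ at the $\alpha$ level of significance and conclude
that there is enough evidence to show that variables X and Y are related; otherwise, do not
reject H0.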

Example 8.1

The table below shows the weight losses in kilograms (Y) of a sample of persons and the
number of months (X) they have been on a special weight-reducing diet.

Table 8.2

Y 4 17 14 1 11 22 9 12 4 7

X 7 32 26 1 20 34 17 21 5 12

a. Draw the scatter diagram of the above data.
b. Fit the regression equation of Y on X.
c. Interpret the parameters of your regression model.
d. An individual is known to have been on the special reducing diet for 27 months. Estimate
his weight loss in kilograms.
e. Obtain an estimate of the standard error of the model.

Solution

Diagram 8.2: Scatter plot of weight loss (vertical axis) against number of months (horizontal
axis).

Table 8.3

Y      X      XY      $X^2$    $Y^2$    $\hat{Y}$    $(Y - \hat{Y})^2$
4      7      28      49       16       4.159        0.02528
17     32     544     1024     289      18.309       1.7135
14     26     364     676      196      14.913       0.8336
1      1      1       1        1        0.763        0.0561
11     20     220     400      121      11.517       0.2673
22     34     748     1156     484      19.441       6.5485
9      17     153     289      81       9.819        0.6708
12     21     252     441      144      12.083       0.00689
4      5      20      25       16       3.027        0.9467
7      12     84      144      49       6.989        0.000121
101    175    2414    4205     1397                  11.0688

$\bar{Y} = 10.1$

$\bar{X} = 17.5$

b. Regression of Y on X: $Y = a + bX$

$$b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{2414 - 10(10.1)(17.5)}{4205 - 10(17.5)^2} = \frac{646.5}{1142.5} = 0.566$$

$$a = \bar{Y} - b\bar{X} = 10.1 - 0.566(17.5) = 0.197 \quad \text{(using the unrounded value of } b\text{)}$$

$$\hat{Y} = 0.197 + 0.566X$$

c. a = 0.197 means that when X = 0, Y = 0.197.

b = 0.566 implies that for every month spent on the special weight-reducing diet, there is
an average weight loss of 0.57 kilogram.

d. $\hat{Y} = 0.197 + 0.566(27) = 0.197 + 15.28 = 15.48$ kg.

e. $$S_e = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n-1}} = \sqrt{\frac{\sum (Y_i - \hat{Y}_i)^2}{n-1}}$$

From the working table, $\sum (Y_i - \hat{Y}_i)^2 = 11.0688$, so

$$S_e = \sqrt{\frac{11.0688}{9}} = 1.109$$
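
The hand calculation of Example 8.1 can be reproduced with a few lines of code. The sketch below is illustrative: the function name `fit_simple_regression` and the variable names are assumptions of this example, and the standard error uses the divisor n - 1 because that is the formula adopted in this study session.

```python
import math


def fit_simple_regression(x, y):
    """Return the least-squares intercept a and slope b of y = a + b*x.

    Hypothetical helper implementing equations (8.6) and (8.7).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    ss_xx = sum(xi ** 2 for xi in x) - n * x_bar ** 2
    b = ss_xy / ss_xx
    a = y_bar - b * x_bar
    return a, b


# Data of Table 8.2: months on the diet (X) and weight loss in kg (Y).
months = [7, 32, 26, 1, 20, 34, 17, 21, 5, 12]
weight_loss = [4, 17, 14, 1, 11, 22, 9, 12, 4, 7]

a, b = fit_simple_regression(months, weight_loss)
predicted_27 = a + b * 27  # estimated weight loss after 27 months

# Residual sum of squares and the standard error of the model.
residual_ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(months, weight_loss))
# Divisor n - 1 follows the handout; many textbooks use n - 2 instead.
s_e = math.sqrt(residual_ss / (len(months) - 1))

print(round(a, 3), round(b, 3), round(predicted_27, 2), round(s_e, 3))
```

Up to rounding, this should reproduce a = 0.197, b = 0.566, an estimated weight loss of about 15.48 kg after 27 months and a standard error of about 1.109.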

Example 8.2

A quality control manager collects 10 samples of iron rods from the production line at regular
intervals of time. Each time, the average length (Y) and diameter (X) of the rods are measured.
The results are given below.

Table 8.4

Average Average
Diameter (X) Length (Y)
in mm. in cm.
18.1 8.8
23.0 9.5
17.5 8.9
20.2 9.1
14.7 8.6
13.8 8.3
15.1 8.5
13.8 8.2
16.1 9.4
12.6 7.2

a. Calculate the linear regression of mean length on diameter.
b. Is there any evidence to show that the diameter influences the length of the rods?
c. Calculate the standard error of the regression coefficient.

$n = 10$

$\sum X = 164.9$, $\sum Y = 86.5$

$\bar{X} = 16.49$, $\bar{Y} = 8.65$

$\sum X^2 = 2813.85$, $\sum Y^2 = 752.25$

$\sum XY = 1441.85$

Hypothesis

$H_0: \beta_1 = 0$

$H_1: \beta_1 \neq 0$

$\alpha = 0.05$

a. $$b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{1441.85 - 10(16.49)(8.65)}{2813.85 - 10(16.49)^2} = \frac{15.465}{94.649} = 0.163$$

$$a = \bar{Y} - b\bar{X} = 8.65 - 0.163(16.49) = 5.96$$

$$\hat{Y} = 5.96 + 0.163X$$

b. $$SS_{Total} = \sum (Y - \bar{Y})^2 = \sum Y_i^2 - n\bar{Y}^2 = 752.25 - 10(8.65)^2 = 4.025$$

$$SS_{Trt} = b\left(\sum XY - n\bar{X}\bar{Y}\right) = 0.163(15.465) = 2.52$$

Table 8.5
ANOVA

Source       df    SS       MS       Fc
Treatment    1     2.52     2.52     13.40
Error        8     1.50     0.188
TOTAL        9     4.025

$F_{0.95,\,1,\,8} = 5.32$

Conclusion: Since $F_c > F_{0.95,\,1,\,8}$, we reject H0 and conclude that there is evidence, at
the 5% level of significance, that the diameter influences the length of the rods.

c. $$S.E.(b) = \sqrt{\frac{\hat{\sigma}_e^2}{\sum (X - \bar{X})^2}} = \sqrt{\frac{MSE}{\sum X^2 - n\bar{X}^2}} = \sqrt{\frac{0.188}{94.649}} = 0.045$$
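
The ANOVA calculation of Example 8.2 can also be sketched in code, starting from the summary totals printed in the example rather than from the raw data. The variable names are assumptions of this illustration, and the critical value 5.32 is simply copied from the example, not computed.

```python
import math

# Summary statistics as printed in Example 8.2.
n = 10
sum_x, sum_y = 164.9, 86.5
sum_x2, sum_y2, sum_xy = 2813.85, 752.25, 1441.85

x_bar, y_bar = sum_x / n, sum_y / n
ss_xy = sum_xy - n * x_bar * y_bar      # corrected sum of cross products
ss_xx = sum_x2 - n * x_bar ** 2         # corrected sum of squares of X
ss_total = sum_y2 - n * y_bar ** 2      # total sum of squares of Y

b = ss_xy / ss_xx                       # slope, equation (8.7)
ss_regression = b * ss_xy               # called "Treatment" in Table 8.5
ss_error = ss_total - ss_regression
ms_error = ss_error / (n - 2)           # error degrees of freedom = 8
f_cal = ss_regression / ms_error        # regression degrees of freedom = 1
se_b = math.sqrt(ms_error / ss_xx)      # standard error of the slope

F_CRITICAL = 5.32  # tabulated F value for (1, 8) df at the 5% level, from the example
reject_h0 = f_cal > F_CRITICAL
print(round(b, 3), round(f_cal, 2), round(se_b, 3), reject_h0)
```

Because the code carries the unrounded slope through every step, its F ratio (about 13.5) and standard error (about 0.044) differ slightly from the hand-calculated 13.40 and 0.045, but the decision to reject H0 is the same.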

Summary for Study Session 8


In this study session 8, you have learnt:

1. The theory of regression analysis, and its uses.


2. The difference between linear and non-linear models, simple and multiple regression
models.
3. The method of Ordinary Least Squares (OLS) for estimating the parameters of a simple
linear regression model and the procedure for carrying out the test of significance of a
regression line was given

Self-Assessment Questions (SAQs) for Study Session 8


Now that you have completed this study session, you can assess how well you have achieved its
Learning outcomes by answering the following questions.

SAQ 8.1-8.5
1. A test was performed to determine the relationship between the chemical content (Y)
of a particular solution and the crystallization temperature (X) in deg. The following
quantities are calculated.

n = 20, $\sum X_i = 400$, $\sum Y_i = 220$

$\sum X_i^2 = 8800$, $\sum X_i Y_i = 4300$, $\sum Y_i^2 = 2620$

Assuming a linear relationship $Y_i = \alpha + \beta X_i + e_i$:

a. Calculate the least squares estimates of $\alpha$ and $\beta$, each correct to two significant
figures.
b. Test the significance of the fitted model at the 5% level of significance.
c. Obtain the standard error of each parameter in the model.

d. A previous similar exercise with n = 15 shows a regression coefficient $\beta_1$ of 0.10
with a standard error of 0.008. Test the hypothesis that the slope of your regression
model is the same as that of the previous exercise at the 5% level of significance.

2. Twelve students took two papers in the same subject and the marks in percentages were
as follows:

S/No. 1 2 3 4 5 6 7 8 9 10 11 12

Paper I 65 73 42 52 84 60 70 79 60 83 57 7

Paper II 78 88 60 73 92 77 84 89 70 99 73 8

a. Construct a scatter diagram for the above data.


b. Calculate the regression equation of paper II on paper I.
c. Two boys were each absent for one paper. One scored 63 on paper I, the other 81 on
paper II. Estimate the marks of these students in the paper they did not take.
d. Obtain the standard error of your regression coefficient in (b) above.
e. Construct a 95% confidence interval for your regression coefficient in (b) above.

3. A random sample of ten families had the following income and food expenditure (in N
per week).

Families             A    B    C    D    E    F    G    H    I    J
Family Income        20   30   33   40   15   13   26   38   35   43
Family Expenditure   7    9    8    11   5    4    8    10   9    10

a. Estimate the regression line of food expenditure on income and interpret your results.
b. Obtain the regression line of income on food expenditure and interpret the result.

4. The following results have been obtained from a sample of 11 observations on the value
of sales (Y) of a firm and the corresponding prices (X).

$\bar{X} = 519.18$, $\bar{Y} = 217.82$, $\sum X^2 = 3134543$, $\sum X_i Y_i = 1296836$, $\sum Y^2 = 539512$

a. Estimate the regression line of sales on price and interpret the results.
b. What part of the variation in sales is not explained by the regression line?

5. The following table gives the gross national product (X) and the demand for food (Y),
measured in arbitrary units, in an underdeveloped country over the ten-year period 1960
to 1969.

Year   1960  1961  1962  1963  1964  1965  1966  1967  1968  1969
Y      6     7     8     10    8     9     10    9     11    10
X      50    52    55    59    57    58    62    65    68    70

(a) Estimate the food function Y = b0 + b1X + U


(b) What is the meaning of this result?
(c) Compute the coefficient of determination and find the explained and unexplained
variation in the food expenditure.
(d) Find the regression of X on Y.

Study Session 9: Correlation and Association

Introduction
In Study Session 8, you learnt how to measure the direction and strength of the
relationship between an explanatory variable and a response variable for the purpose of
predicting one from the other. In this study session, you will learn how to measure the
relationship or association between two variables without distinguishing between them and
without the aim of prediction.

Learning outcomes for Study Session 9


At the end of this study, you should be able to:

9.1 Explain the meaning of correlation


9.2 Explain coefficient of rank correlation

9.1 Correlation
Correlation refers to the relationship or association between two or more variables, while the
correlation coefficient is a quantity that measures the strength of the linear relationship between
two quantitative variables. The measure of relationship between two attributes (qualitative
variables) is usually referred to as association. This will be discussed in the next section.

Product Moment Correlation Coefficient


Suppose we have n observations on two variables x and y denoted by
(x1, y1) (x2, y2)… (xn, yn)
The correlation coefficient r for variables X and Y computed from n cases is

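$r = \dfrac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{\left(\sum X^2 - n\bar{X}^2\right)\left(\sum Y^2 - n\bar{Y}^2\right)}} = \dfrac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$

where r ranges from -1 to +1.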


If r = 0, the two variables are uncorrelated.

If r = +1, x and y are said to be perfectly directly (positively) correlated and the regression line
is upward sloping on the scatter plot.

If r = -1, x and y are said to be perfectly inversely (negatively) correlated and the regression line
is downward sloping on the scatter plot.

If 0 < r < 0.5, x and y are said to be weakly positively correlated.

If 0.5 < r < 1, x and y are said to be strongly positively correlated.

If -0.5 < r < 0, x and y are said to be weakly negatively correlated.
If -1 < r < -0.5, x and y are said to be strongly negatively correlated.
Note that r is also referred to as the product moment correlation coefficient.
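It can be shown from Study Session 8 that

$b = \dfrac{S_{XY}}{S_X^2}$ and $r = \dfrac{S_{XY}}{S_X S_Y}$

Therefore $r = \dfrac{S_{XY}}{S_X^2} \cdot \dfrac{S_X}{S_Y} = b\sqrt{\dfrac{S_X^2}{S_Y^2}}$

where b is the regression coefficient, $S_{XY} = \sum XY - n\bar{X}\bar{Y}$, $S_X^2 = \sum X^2 - n\bar{X}^2$ and $S_Y^2 = \sum Y^2 - n\bar{Y}^2$.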


Example 1:
Consider example 8.1. Calculate the product moment correlation coefficient for this data.
Solution:

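n = 10, $\bar{X} = 10.1$, $\bar{Y} = 17.5$

$\sum XY - n\bar{X}\bar{Y} = 646.5$

$\sum X^2 - n\bar{X}^2 = 1142.5$

$\sum Y^2 - n\bar{Y}^2 = 376.9$

$\therefore r = \dfrac{646.5}{\sqrt{(1142.5)(376.9)}} = \dfrac{646.5}{656.21} = 0.985$

Alternatively,

$r = b\sqrt{\dfrac{S_X^2}{S_Y^2}} = (0.566)\dfrac{33.80}{19.41} = 0.985$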
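
Both routes to r can be reproduced in a few lines. The sketch below assumes only the three corrected sums quoted in the solution; the function name `pearson_r` is an illustrative choice, not something defined in this handout.

```python
import math

def pearson_r(s_xy, s_xx, s_yy):
    """Product moment correlation from corrected sums of products and squares."""
    return s_xy / math.sqrt(s_xx * s_yy)

# Corrected sums quoted in the solution above.
s_xy, s_xx, s_yy = 646.5, 1142.5, 376.9

r = pearson_r(s_xy, s_xx, s_yy)          # direct formula
b = s_xy / s_xx                          # regression coefficient
r_from_b = b * math.sqrt(s_xx / s_yy)    # alternative route through b

print(round(r, 3), round(r_from_b, 3))
```

The two values agree because the second expression is only an algebraic rearrangement of the first.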

9.2 Coefficient of Rank Correlation


This is a measure of the strength of the relationship between two qualitative variables (or attributes).
It is also used when exact measurement of quantitative variables is inaccurate,
impossible or impracticable.

To obtain the rank correlation coefficient, the observed values of the variables are replaced by
their respective ranks either in ascending or descending order of magnitude.
The coefficient of rank correlation is given by

$R = 1 - \dfrac{6\sum d^2}{n(n^2 - 1)}$

where d is the difference between the ranks for each pair of observations and $-1 \le R \le 1$. The
interpretation is the same as for the product moment correlation coefficient.
If there are ties, the average of the tied ranks is assigned to each of the units involved.

Example 9.3

Two judges were asked to assess twelve contestants in a beauty contest. The twelve
contestants were ranked according to their performance as follows:

Table 9.1

Judge 1 2 3 4 5 6 7 8 9 10 11 12

A 11 9 7 10 5 1 4 12 8 3 2 6

B 5 7 11 12 6 4 8 9 10 2 1 3

Is there any agreement between the two judges?


Solution

Table 9.2
n = 12

Contestant  1    2    3    4    5    6    7    8    9    10   11   12
d           6    2    -4   -2   -1   -3   -4   3    -2   1    1    3
d²          36   4    16   4    1    9    16   9    4    1    1    9

$\sum d^2 = 110$

$R = 1 - \dfrac{6\sum d^2}{n(n^2 - 1)} = 1 - \dfrac{6(110)}{12(144 - 1)} = 1 - 0.38 = 0.62$

Comment: There is a moderately strong agreement in the opinions of the judges.

Example 9.4

A study was conducted to determine the relationship between the level of smoking, measured by the
number of sticks of cigarette smoked per day (X), and a Tercim index of health (Y). The
following data were obtained from a random sample of 10 male smokers.

Table 9.3

X 8 20 15 12 15 9 16 10 12 8

Y 4 5 5 7 10 13 8 6 3 8

Calculate the Spearman rank correlation coefficient and comment on your result.

Solution

Table 9.4

RX   9.5    1      3.5   5.5    3.5    8    2      7   5.5     9.5
RY   9      7.5    7.5   5      2      1    3.5    6   10      3.5
d    0.5    -6.5   -4    0.5    1.5    7    -1.5   1   -4.5    6
d²   0.25   42.25  16    0.25   2.25   49   2.25   1   20.25   36

$\sum d^2 = 169.5$

$R = 1 - \dfrac{6\sum d^2}{n(n^2 - 1)} = 1 - \dfrac{6(169.5)}{10(100 - 1)} = 1 - 1.027 = -0.027$

Comment
The above result shows that there is a very weak negative association between smoking habit and the
reported health index.
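
The ranking with ties and the rank correlation formula can be sketched in code as follows. This is a minimal illustration using the data of Table 9.3; the helper names `average_ranks` and `spearman_R` are hypothetical, and ranks are assigned in descending order (largest value = rank 1) to match Table 9.4.

```python
def average_ranks(values):
    """Rank values in descending order; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of observations tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1      # positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_R(x, y):
    """Return (R, sum of d squared) using R = 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1)), sum_d2

# Data of Table 9.3: cigarettes per day (X) and health index (Y).
x = [8, 20, 15, 12, 15, 9, 16, 10, 12, 8]
y = [4, 5, 5, 7, 10, 13, 8, 6, 3, 8]

R, sum_d2 = spearman_R(x, y)
print(sum_d2, round(R, 3))
```

Ranking in ascending order instead would change the sign of every d but leave each d² and therefore R unchanged.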

Summary for Study Session 9


In Study Session 9, you have learnt about:

1. The concept of correlation.


2. The distinction between correlation and association.
3. The association between qualitative variables (attributes).
4. The method of interpreting the correlation coefficient.

Self-Assessment Questions (SAQs) for Study Session 9


Now that you have completed this study session, you can assess how well you have achieved its
Learning outcomes by answering the following questions.

SAQ 9.1-9.2
A group of sportsmen took part in a competition which included two gymnasium tests: squat
jumps and chins. The score for each exercise is the number performed in one minute. The scores
of eight sportsmen taken from this group are given below:

Sportsmen A B C D E F G H

Squat jumps 47 72 60 44 56 63 71 64

Chins 25 48 30 40 27 35 30 34

a. Calculate the Spearman coefficient of rank correlation between these two sets of scores.
b. The overall winner of the gymnasium tests is the sportsman with the highest total score
when the number of squat jumps is added to the number of chins.
Determine the total scores and state which sportsman was the winner.
c. The rank correlation between the total scores and the number of squat jumps is 0.86
for the data above. Calculate the rank correlation between the total score and the
number of chins. If, to save time, only one exercise were to be used in
future, state, giving a reason, which one you would recommend.

d. Consider the data in example 8.2

i. Calculate the coefficient of Spearman’s rank correlation


ii. Comment on your result.

Study Session 10: Proportions, Rates and Indices

Introduction
Rates, ratios and indices have become very important in the descriptive analysis of certain events
and characteristics. They are especially useful in the study of vital characteristics such as prices,
deaths, births, population growth, epidemics, etc.

In this study, you will be introduced to the three concepts, their uses and applications using some
sample data with particular emphasis on price indices.

Learning Outcomes for Study Session 10


At the end of this study, you should be able to:

10.1 Explain the meaning of the terms proportion, rate and indices;

10.2 Explain items to be taken into consideration when constructing an index number

10.3 Discuss the different methods of construction of price index

10.4 Identify the uses of consumer price index

10.1 Proportion, Rates and indices
Definition: Proportion is the ratio of the number of items with a certain characteristic (X) to the
total number of items exposed to that characteristic (N).
It is defined as

$P(X) = \dfrac{n(X)}{N}$

The above expresses the chance of occurrence of the characteristic (i.e. the probability of event
X).
Example

If the voting age population (people 18 years and above) in a ward consists of 550 males and 600
females, what is the proportion of males?
Solution
n(males) = 550
Total population: N = 550 + 600 = 1150

Proportion of males: $P(\text{Males}) = \dfrac{n(\text{males})}{N} = \dfrac{550}{1150} = 0.478$
Rates
When a proportion refers to the number of events or cases occurring during a certain period of time,
it becomes a rate and is usually expressed as so many per 1000. Thus we refer to the birth rate as the
number of births per 1000 population in a year.

We also have the death rate, migration rate, marriage rate, etc. Some examples shall be given to
illustrate this concept later.
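
As a small illustration of how a proportion becomes a rate, the sketch below scales a count of events to "per 1000 population". The birth and population figures are hypothetical numbers chosen for the arithmetic, not data from this handout, and the function names are illustrative.

```python
def proportion(count, total):
    """Share of the total that has the characteristic of interest."""
    return count / total

def rate_per_1000(events, population):
    """Number of events in the period per 1000 population."""
    return 1000 * events / population

# Proportion of males from the example above.
print(round(proportion(550, 550 + 600), 3))

# Hypothetical figures: 4,200 births in a year in a population of 120,000.
births, population = 4200, 120000
print(rate_per_1000(births, population))  # 35.0 births per 1000 population
```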

Index Number
An index is a real number that measures the rate of increase or decrease in wage, production,
value, quantity, price or volume of a certain phenomenon in the current period relative to a
specific period in the past (a base period). It is usually measured as a percentage.

An index number is a device for estimating trends in prices, wages, production and other
economic variables.

In its simplest form, an index number represents a special kind of average or weighted average,
compiled from a sample of items judged to be representative of the whole. In this study session, our
focus shall be on the construction of the consumer price index, since the principles and methods that
will be discussed apply equally to indices of sales, production, wages, value and quantity.

In-Text Question
When proportion refers to the number of events or cases occurring during certain period of time,
it becomes a _______

A. Rate
B. An index
C. Ratio
D. Map

In-Text Answer
A. Rate

10.2 Consideration for an Index Number


Quite a number of methods and formulae are used in the computation of index numbers; there
are, however, a number of criteria that must be satisfied.

A good index number:


a. Should be simple in conception.
b. Should be easily interpreted, so that the man on the street can understand an index that
tries to measure the changing cost of the things he bought in a particular year.

As mentioned earlier in this study session, an index number is a special kind of
average that considers the prices of many commodities expressed in different units, or
quantities also measured in different units. The commodities could also be of different weights in
the “basket” of goods considered for the index. All these constitute the problems usually
encountered in the construction of an index number.

Thus, in the construction of a price index, the following factors are considered:

Figure 10.1: factors that determine price index

a. Choice of Item
Decision should be taken on the item to be included in an index. Such commodity to be included
should be (i) relevant, (ii) representative (iii) reliable and (iv) comparable over a period of time.

b. Source of Data

A decision should also be taken on the source of data for the items composing the “basket” to be
used in the construction of the index number: should the prices of commodities be collected
from a local market, a supermarket or an urban market? Great care should be taken to ensure that
prices are collected from a popular market that is patronized by different categories of people and
where the majority of the selected commodities can be found.

c. The Base Period
A base year is a reference period. The chosen year should generally be a fairly “normal” year,
free of occurrence of unusual events such as war, famine, prolonged strike or hyper-inflation. If
it is difficult to select a year in particular, the average of a series of years can be taken.

d. The Weight
Different weights are used in different parts of the country for a particular commodity. For
instance, “congo” is used in the Western part of Nigeria, ‘mudu’ in the North and ‘tin’ in the
East. For the purpose of constructing an index number, the weights in different regions need to be
harmonized to a single unit.

In-Text Question
Choice of item can determine price index. True or False

In-Text Answer
True

10.3 Methods of Construction of Price Index


There are different methods of constructing a price index. Some of them are given below.
Let Pn represent the price in the current year,
P0 represent the price in the base year,
qn represent the quantity in the current year,
q0 represent the quantity in the base year.
1. Price Relative: This is the simplest method of calculating an index number. It is defined as the
price in the current year expressed as a percentage of the price in the base period for a single
commodity. The base period is always taken as 100.

$PR = \dfrac{P_n}{P_0} \times 100$

2. Simple Aggregate Method: This method considers the price of a basket of goods and
services in the current year relative to that of the base period. It is denoted by:

$SAM = \dfrac{\sum P_n}{\sum P_0} \times 100$

Limitation: It attaches equal weight to all commodities. It does not take into account the relative
importance of the commodities.

3. Simple Average Relative Method: This is the sum of the price relatives divided by the
number of items considered. It is denoted by

$SAP = \dfrac{\sum (P_n / P_0)}{N} \times 100$, where N is the number of items.
Its limitation is the same as in (2).

4. Weighted Simple Average Relative Method: To circumvent the problem of assigning equal
weight to different items, the weighted simple average of price relatives is given as

$WSAR = \dfrac{\sum W (P_n / P_0)}{\sum W} \times 100$

Example 10.1
The following are the prices of commodities A, B and C in 1975 and 1985.
Using 1975 as the base year,
Table 10.1

Commodity 1975 1985

A 40 50

B 12 35

C 45 95

Calculate i. Price relative for each item


ii. Simple aggregative price index

Solution

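Table 10.2

Commodity   1975 (Po)   1985 (Pn)   Pn/Po x 100
A           40          50          50/40 x 100 = 125
B           12          35          35/12 x 100 = 291.7
C           45          95          95/45 x 100 = 211.1
Total       97          180         627.8

SAM (aggregate) $= \dfrac{\sum P_n}{\sum P_0} \times 100 = \dfrac{180}{97} \times 100 = 185.6$

SAP (simple average) $= \dfrac{\sum (P_n / P_0)}{N} \times 100 = \dfrac{6.28}{3} \times 100 = 209.3$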
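
The unweighted indices of Example 10.1 can be sketched in code as below. The dictionary layout and variable names are illustrative assumptions; the prices are those of Table 10.1 with 1975 taken as the base year.

```python
# Prices from Table 10.1: base year 1975 (P0) and current year 1985 (Pn).
base = {"A": 40, "B": 12, "C": 45}
current = {"A": 50, "B": 35, "C": 95}

# Price relative of each commodity: Pn / P0 x 100.
price_relatives = {k: current[k] / base[k] * 100 for k in base}

# Simple aggregate method: sum of current prices over sum of base prices.
simple_aggregate = sum(current.values()) / sum(base.values()) * 100

# Simple average of relatives: mean of the price relatives.
simple_average = sum(price_relatives.values()) / len(price_relatives)

for name, pr in price_relatives.items():
    print(name, round(pr, 1))
print(round(simple_aggregate, 1), round(simple_average, 1))
```

Neither index uses weights, which is the limitation noted for methods (2) and (3).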
Example 10.2

Given the prices of some staple foods in 1980 and 1996 with the corresponding weights, and using
1980 = 100:

Table 10.3

Staple Foods   Weight   Price 1980   Price 1996
Elubo          3        1.25         10.50
Gari           5        4.00         12.50
Rice           1        35.00        75.00
Beans          3        12.00        38.00
Yam            2        5.00         8.50
Compute i. the price relatives
ii. the simple aggregate price index
iii. the weighted average price index

Table 10.4

Staple Foods   Weight   1980 (Po)   1996 (Pn)   Pn/Po x 100   W(Pn/Po)
Elubo          3        1.25        10.50       840.0         25.20
Gari           5        4.00        12.50       312.5         15.63
Rice           1        35.00       75.00       214.3         2.14
Beans          3        12.00       38.00       316.7         9.50
Yam            2        5.00        8.50        170.0         3.40
Total          14       57.25       144.50                    55.87

Simple Aggregate Price Index $= \dfrac{\sum P_n}{\sum P_0} \times 100 = \dfrac{144.5}{57.25} \times 100 = 252.40$

Weighted Average Relative Index $= \dfrac{\sum W (P_n / P_0)}{\sum W} \times 100 = \dfrac{55.87}{14} \times 100 = 399.1$
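
A short sketch of the weighted calculation, computed directly from the weights and prices of Table 10.3, is given below. The data structure and function name are illustrative assumptions. Working from the raw prices, Gari's price relative is 12.50/4.00 x 100 = 312.5, which is the value the weighted index depends on most heavily because Gari carries the largest weight.

```python
# name: (weight, 1980 price P0, 1996 price Pn), taken from Table 10.3.
foods = {
    "Elubo": (3, 1.25, 10.50),
    "Gari": (5, 4.00, 12.50),
    "Rice": (1, 35.00, 75.00),
    "Beans": (3, 12.00, 38.00),
    "Yam": (2, 5.00, 8.50),
}

def weighted_average_of_relatives(items):
    """WSAR = sum(W * Pn/P0) / sum(W) x 100."""
    total_weight = sum(w for w, _, _ in items.values())
    weighted_sum = sum(w * pn / p0 for w, p0, pn in items.values())
    return weighted_sum / total_weight * 100

simple_aggregate = (
    sum(pn for _, _, pn in foods.values())
    / sum(p0 for _, p0, _ in foods.values())
    * 100
)
wsar = weighted_average_of_relatives(foods)

print(round(simple_aggregate, 2), round(wsar, 1))
```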

5. Laspeyres’ Price Index: This is a weighted method of constructing an index
number. It assumes that the pattern of consumption has not changed over the years with
the change in price. It is denoted by:

$P_L = \dfrac{\sum P_n q_0}{\sum P_0 q_0} \times 100$

Limitation: Since the base-year quantities reflect an outmoded purchasing pattern, the index
gives undue weight to items that have increased in price. Laspeyres’ price index therefore tends
to overestimate the rise in prices.

6. Paasche’s Price Index: This method assumes that the consumption pattern of the
consumer has changed in the current year, so prices are weighted by current-year quantities.
It is denoted by:

$P_p = \dfrac{\sum P_n q_n}{\sum P_0 q_n} \times 100$

Limitation: Because people tend to buy less of the goods that have risen in price, the current-year
weighting procedure (Paasche’s) gives undue weight to items that have fallen in price. The index
therefore tends to understate the rise in prices.

7. Fisher’s Ideal Index Number: This method overcomes the problems of Paasche’s and
Laspeyres’ indices. It is considered the most efficient method of constructing an index
number. It is the geometric mean of the Laspeyres’ and Paasche’s price indices, denoted by:

Fisher’s Ideal Index $= \sqrt{P_L \cdot P_p} = \sqrt{\dfrac{\sum P_n q_0}{\sum P_0 q_0} \cdot \dfrac{\sum P_n q_n}{\sum P_0 q_n}} \times 100$

8. Marshall-Edgeworth Price Index: This method takes into account the pattern of
consumption in the current and base periods. It uses the arithmetic mean of the base and
current period quantities as weight. It is given by:

$M\text{-}E_p = \dfrac{\sum P_n (q_0 + q_n)}{\sum P_0 (q_0 + q_n)} \times 100$

Example 10.3

The prices and quantities demanded of commodities A, B and C in the base year (1960) and the
current year (1970) are given below.

Table 10.5

Commodity   1960 Price (Po)   1960 Quantity (Qo)   1970 Price (Pn)   1970 Quantity (Qn)
A           4                 50                   10                40
B           3                 10                   9                 2
C           2                 5                    4                 2

Construct price index numbers from the above data using

i. Laspeyres’ method
ii. Paasche’s method
iii. Marshall-Edgeworth method and
iv. Fisher’s ideal index number
Solution

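Table 10.6

Commodity   Po   Qo   Pn   Qn   PoQo   PnQo   PnQn   PoQn
A           4    50   10   40   200    500    400    160
B           3    10   9    2    30     90     18     6
C           2    5    4    2    10     20     8      4
Total                           240    610    426    170

i. Laspeyres’ $P_L = \dfrac{\sum P_n Q_0}{\sum P_0 Q_0} \times 100 = \dfrac{610}{240} \times 100 = 254.2$

ii. Paasche’s $P_p = \dfrac{\sum P_n Q_n}{\sum P_0 Q_n} \times 100 = \dfrac{426}{170} \times 100 = 250.6$

iii. M-E $= \dfrac{\sum P_n Q_0 + \sum P_n Q_n}{\sum P_0 Q_0 + \sum P_0 Q_n} \times 100 = \dfrac{610 + 426}{240 + 170} \times 100 = 252.7$

iv. Fisher’s $= \sqrt{P_L \cdot P_p} = \sqrt{254.2 \times 250.6} = 252.4$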
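
The four weighted indices of Example 10.3 can be checked with the sketch below. The tuple layout and the helper name `total` are illustrative assumptions; the prices and quantities are those of Table 10.5.

```python
import math

# commodity: (P0, Q0, Pn, Qn) from Table 10.5 (1960 base, 1970 current).
data = {"A": (4, 50, 10, 40), "B": (3, 10, 9, 2), "C": (2, 5, 4, 2)}

def total(term):
    """Sum a price-times-quantity term over all commodities."""
    return sum(term(p0, q0, pn, qn) for p0, q0, pn, qn in data.values())

p0q0 = total(lambda p0, q0, pn, qn: p0 * q0)   # base prices, base quantities
pnq0 = total(lambda p0, q0, pn, qn: pn * q0)   # current prices, base quantities
pnqn = total(lambda p0, q0, pn, qn: pn * qn)   # current prices, current quantities
p0qn = total(lambda p0, q0, pn, qn: p0 * qn)   # base prices, current quantities

laspeyres = pnq0 / p0q0 * 100
paasche = pnqn / p0qn * 100
marshall_edgeworth = (pnq0 + pnqn) / (p0q0 + p0qn) * 100
fisher = math.sqrt(laspeyres * paasche)        # geometric mean of the two

print(round(laspeyres, 1), round(paasche, 1),
      round(marshall_edgeworth, 1), round(fisher, 1))
```

In this example the Fisher and Marshall-Edgeworth values both fall between the Laspeyres' and Paasche's figures.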

10.4 Uses of Consumer Price Index


1. Consumer price indices are used, among other things, to measure changes in the retail prices of
a specific quantity of goods and services in a given geographical region over a period of
time.
2. They help in wage and salary negotiations and adjustments of allowances.
3. Government agencies use consumer price indices to formulate wage policy, price control
policy, taxation and general economic policy.
4. Changes in purchasing power and real income can be measured using the consumer price
indices.
5. They are used in international comparisons.
6. Construction of the Human Suffering Index.
7. Construction of the cost of living index.

In-Text Question
Changes in purchasing power and real income can be measured using the consumer price
indices. True or False

In-Text Answer
True

Summary for Study Session 10


In study session 10, you have learnt about:

1. The various methods of constructing an index number


2. The problem associated with the construction of index number
3. The uses of the consumer price index

Self-Assessment Questions (SAQs) for Study Session 10


Now that you have completed this study session, you can assess how well you have achieved its
Learning outcomes by answering the following questions.

SAQ 10.1 -10.4


1. a. Explain what is meant by an index number
b. What are the uses of consumer price index?
c. The price relatives for palm oil and kerosene are shown in the table below.

Commodity   Price Relative
            1961   1962
Palm Oil    100    108
Kerosene    100    114

Assuming that palm oil is twice as important as kerosene, what is the price index for 1962,
taking 1961 = 100?

2. Five feed components are to be used in the construction of an animal feedstuff index
number. From the figures given in the following table, calculate a Laspeyres’ price index
taking 1964 = 100.

Component   1964 Price per ton   1964 Consumption (tons)   1970 Price per ton   1970 Consumption (tons)
A           40                   3,600                     41                   2,750
B           39                   2,750                     53                   1,500
C           38                   2,050                     35                   2,350
D           37                   500                       30                   750
E           36                   1,475                     24                   2,850

3. Given the following data on commodities A, B, C and D:

            Base Year     Current Year
Commodity   Po    qo      Pn    qn
A           10    12      12    15
B           7     15      5     20
C           5     24      9     20
D           16    5       14    5

Show that Fisher’s ideal index is 115.7

Study Session 11: Time Series Analysis

Introduction
Time series analysis is the application of time series techniques to time-structured data, usually
referred to as time series data. Time series data is a record of observations measuring a certain
quantity of interest at regular or irregular intervals of time.

The observations may be recorded daily, weekly, quarterly, yearly or bi-annually. A time series is a
realization or sample function from a certain stochastic process. Time series occur in many
fields such as Agriculture, Engineering, Business and Economics, Geophysics, Medical
Sciences, Meteorology, Quality Control, Social Sciences, and so on.

In this study session, you will learn about time series data, methods of analysis of time series
data and the components of a time series.

Learning Outcomes for Study Session 11


At the end of this study, you should be able to:
11.1 Define and identify a time series data;
11.2 Identify the methods of analysis of time series data
11.3 Estimate and isolate the components of a time series

11.1 Time Series Analysis


The goal of time series analysis is to identify, within a given class of flexible models, a model that
reasonably approximates the time-structured relationship of the process that generates the
data. The original use of time series analysis was primarily as an aid to forecasting. In recent
times, the task has grown to the extent that time series analysts develop reasonably simple models
capable of describing the system that generates the time-structured data, making reliable forecasts
for the future and testing hypotheses.

Uses of Time Series Models


Time series analysis is the study of the time-structured relationship in a variable. This involves
the use of basic tools to analyze a given time series with a view to:
• Constructing simple mathematical systems that explain the time-structured relationship in
the economic and social series in a concise way.
• Using the model to explain the behavior of the series and make reliable forecasts for the
future on the basis of the dynamic dependence of the series on its past values.

Thus time series analysis provides a basis for economic and business planning, production and system
planning, and control and optimization of industrial processes. The intrinsic nature of a time series is
that its observations are dependent or correlated, and the order of the observations is therefore
important.

Since life must be understood looking backwards but lived looking forward, time
series analysis provides useful tools that help to predict the future by fitting models that use past
data.

A discrete time series is one where observations are taken at specific discrete time intervals,
usually equally spaced, e.g. interest rates, yields, volume of sales and production. Such series
arise in fields such as Agriculture, Business circles, etc.

A continuous time series consists of observations taken at any time t ($t \in T$) in the index set T. This
type of series is common in Engineering, Geophysics and Medical Sciences.

In-Text Question
Discrete time series is one where observations are taken at discrete specific time intervals,
usually equally spaced. True or False

In-Text Answer
True

11.2 Methods of Analysis of Time Series Data
Time series data can be analyzed using either the deterministic method or the dynamic method.

Deterministic Method
A time series is said to be deterministic if future values are determined exactly by some
mathematical function. For example:
(i) $X_t = a + bt$
(ii) $X_t = \cos(2\pi t)$
where a and b are constants and t is time, which is fixed.

(iii) $X_t = T_t + S_t + C_t + I_t$
where $T_t$ is the trend component, $S_t$ is the seasonal component, $C_t$ is the cyclical component
and $I_t$ is the irregular component.

Dynamic (Non-deterministic) Method


A time series is said to be non-deterministic if future values can only be determined in terms of a
probability distribution guided by some assumptions. For example:
(i) $X_t = a + bt + \varepsilon_t$
(ii) $X_t = A \sin(\omega t + \theta)$
where $\varepsilon_t$ is normally distributed with mean zero and variance unity, A is a constant and $\theta$ is a
random variable from a uniform distribution on the interval $[-\pi, \pi]$, independent of A.

This method involves the use of the autocorrelation function (ACF), the partial autocorrelation
function (PACF) and the correlogram in the discrete domain. It also involves Fourier transforms in
the frequency domain. Analysis in the frequency domain is carried out using the extension of the Fourier
method and the spectral density function.
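
The contrast between the two kinds of series can be illustrated with a small simulation. This is only a sketch under stated assumptions: it takes a linear trend a + bt as the deterministic part and adds a standard normal error term for the non-deterministic case; the constants, function names and seed are hypothetical choices.

```python
import random

def deterministic_series(a, b, n):
    """X_t = a + b*t for t = 1..n: every future value is known exactly."""
    return [a + b * t for t in range(1, n + 1)]

def non_deterministic_series(a, b, n, seed=0):
    """X_t = a + b*t + e_t, with e_t drawn from a normal(0, 1) distribution."""
    rng = random.Random(seed)  # seeded only so that the illustration is repeatable
    return [a + b * t + rng.gauss(0.0, 1.0) for t in range(1, n + 1)]

print(deterministic_series(2.0, 0.5, 4))       # [2.5, 3.0, 3.5, 4.0]
print(non_deterministic_series(2.0, 0.5, 4))   # same trend plus random error
```

For the second series only the distribution of a future value can be stated, not the value itself, which is the sense of "non-deterministic" used above.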

In-Text Question
A time series is said to be non-deterministic if future values can only be determined in terms of a
probability distribution guided by some assumptions. True or False

In-Text Answer
True

Deterministic Time series Analysis
The analysis of a time series depends on the type of system that generates the data. Analysis in the
time domain refers to the analysis of discrete time series.

Simple Descriptive Analysis


Most social and economic data including data generated in medicine are time structured. They
need to be summarized with a view to make inference about the system that generates the data.

Time Plot
The first and most important diagnostic tool for time series data is the time plot. It is a graphical
representation of the time series. It is constructed by plotting the observations $x_t$ on the
vertical axis against time t on the horizontal axis. When properly drawn, it shows up the
important features of the series such as trend, seasonality, discontinuities and outliers. The time
plot of the data gives an idea of the type of model that is suitable for the data.
It can also indicate whether it is necessary to transform the observed data to achieve
certain stable conditions suitable for meaningful analysis and inference.

[Figure: time plot of Xt, with Xt on the vertical axis and t on the horizontal axis]

Fig. 1.1: Time Plot of a series

11.3 Components of Time Series
Movements in time-structured data are governed by some peculiar and inherent forces which
may be characterized by their regularity or periodicity and their effect on the entire series. The
forces could also be due to changes in the social, economic, psychological or environmental
characteristics of the system.
The patterns generated by these forces are referred to as components of a time series. The
components are: the trend (Tt), seasonal movement (St), cyclical movement (Ct) and the
irregular movement (It).

Trend or Secular Movement


This is the long-term movement in a series in the same direction over a long period of time. It is
usually characterized by a continuous increase or decrease in the values of a variable over time. This
movement is generally referred to as secular variation or secular movement. A line can be
freely drawn by hand through the plotted points on the graph of such a time series stretching over a
long period; such a line is called the trend. It is denoted by (Tt). The time plot below shows a trend in
a series.

Fig. 1.2: Time plot showing Trend

Trend can be upward or downward. Upward trend is displayed in the time plot. This type of plot
is expected from sales of a commodity where increase is always expected.

Seasonal Variation
This refers to identical or almost identical patterns which a time series appears to follow during
corresponding months of successive years, mainly due to recurring events that take place
annually. The movement appears to be periodic (it exhibits variation at a fixed time within a given
interval of time). Many time series, such as sales figures and temperature readings, exhibit
variation that is periodic annually.
There are factors responsible for this repetitive pattern year after year, and the major factor is
weather conditions. During winter, more woolen clothes are sold in the UK and some other parts of the
world. Also, regardless of an increasing trend in the sales of ice cream, there are more sales of ice
cream during summer than winter.
Seasonal variation is denoted by (St). A time plot of monthly rainfall is given
below. A season is completed within one year; therefore a complete cycle is detected in the time
plot below. Seasonal variation is also found in quarterly data. In that case, a complete cycle
will be completed within the four quarters that make a year.

Fig. 1.3 : Time plot showing seasonal variation

Cyclical Variation
Cyclical variation refers to a long-term oscillation about the trend, which may or may not be
periodic, due to some other physical causes. The movement may or may not follow exactly the same
pattern after equal intervals of time. Examples include daily variation in temperature and rainfall
as well as some social and economic variables. Cyclical variation is denoted by (Ct).

Fig. 1.4: Time Plot showing Cyclical Variation

Irregular Variation
This refers to erratic or sporadic movement of a time series due to the occurrence of random or chance
events which are unforeseen; hence, it cannot be isolated directly. Such movements are not deterministic.
These variations may or may not be random.
Though it is assumed that these chance events produce short-term variation, they can be
very intense and may result in a new cyclical or other variation. Included among these random
factors are such events as strikes, floods, volcanic eruptions, earthquakes, fire outbreaks, sudden
changes in government policy and so on. It is denoted by (It).

Method of Combining Components


The task of the statistician is to segregate each of these factors in so far as this is possible. By
isolating or removing individual components, the impact of each of the components may be
assessed. It may happen that not all of the components are present.

Traditionally, it is possible to decompose a time series into the trend, seasonal, cyclical and
irregular components, using either of

Additive model: Xt = Tt + St + Ct + It

or Multiplicative model: Xt = Tt × St × Ct × It
where Tt is the trend component; St is the seasonal component; Ct is the cyclical component and
It is the irregular component.

The resulting trend equation can be used for forecasting, while the original data can be de-seasonalized.

For example, the trend can be estimated using any of

(i) k-point moving average,

(ii) semi-average,

(iii) least squares method.

Assuming all these methods are familiar to us, the least squares method fits the trend equation
Xt = a + bt + εt with the assumption that the error terms εt are independent and not serially
correlated. Otherwise, the regression is spurious: in the presence of a lagged dependent variable
the parameter estimates of the model are biased and inconsistent, and the estimated OLS
standard errors are invalid.
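
The least squares trend line can be sketched in a few lines of Python. This is only an illustration under stated assumptions: time is coded as t = 1, 2, ..., n, the function name fit_linear_trend is hypothetical, and the five-point series is made up so that it lies exactly on Xt = 2 + 3t.

```python
def fit_linear_trend(values):
    """Least squares fit of Xt = a + b*t, with t coded as 1, 2, ..., n."""
    n = len(values)
    t = range(1, n + 1)
    t_mean = sum(t) / n
    x_mean = sum(values) / n
    # Slope: sum of cross-deviations over sum of squared deviations of t
    sxy = sum((ti - t_mean) * (xi - x_mean) for ti, xi in zip(t, values))
    stt = sum((ti - t_mean) ** 2 for ti in t)
    b = sxy / stt
    a = x_mean - b * t_mean
    return a, b


# Hypothetical series lying exactly on Xt = 2 + 3t
a, b = fit_linear_trend([5, 8, 11, 14, 17])
print(a, b)
```

The fitted a and b give the trend value a + bt for any period t, including periods beyond the end of the data.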

Decomposition of a time series can be achieved using any of the following models:

(a) Additive model

Xt = Tt + St + Ct + It

(b) Multiplicative model

Xt = Tt . St . Ct . It

(c) Mixed model

Xt = Tt StCt + It

or Xt = Tt + St . Ct . It

We shall concentrate first on (a) and (b).

The additive model assumes that the actual values are the sum of the four separate effects. This
assumption is probably true when short periods are involved or where the rate of growth or
decline in the trend is small as may be shown in the time plot.

The multiplicative model suggests that the actual values are the product of the separate effects.
This model is indicated when there is a marked (or sharp) growth or decline in a time series data
as may be shown in the time plot.
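
The difference between the two models can be seen in a small sketch. All numbers here are hypothetical and chosen only for illustration: a rising trend is combined with an additive seasonal effect of plus or minus 5 units and with a multiplicative seasonal effect of plus or minus 5 per cent (cyclical and irregular components are left out).

```python
trend = [100, 110, 120, 130]             # hypothetical Tt
seasonal_add = [5, -5, 5, -5]            # hypothetical additive St (units)
seasonal_mul = [1.05, 0.95, 1.05, 0.95]  # hypothetical multiplicative St (ratios)

additive = [t + s for t, s in zip(trend, seasonal_add)]
multiplicative = [t * s for t, s in zip(trend, seasonal_mul)]

# Size of the seasonal swing around the trend in each period
swing_add = [abs(x - t) for x, t in zip(additive, trend)]
swing_mul = [abs(x - t) for x, t in zip(multiplicative, trend)]

print(additive)
print(multiplicative)
print(swing_add)   # stays at 5 units whatever the level of the trend
print(swing_mul)   # grows as the trend grows
```

Under the additive model the swing stays the same size as the trend rises, while under the multiplicative model it grows in proportion to the trend.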

Decomposition of the Components


Either of these models may be used to effect the decomposition of the time series. The idea is to
decompose a time series into each of the basic components, analyze each component separately
and then recombine them in order to describe the variation in the series as a whole.
The process involves systematic evaluation of each component from the data. The first stage is
usually to estimate the trend and eliminate it from each time period from the actual data by
subtraction or division to give a de-trended series.

De-trended series can be obtained using:

Additive Model: Xt – Tt = St + Ct + It or

Multiplicative Model: Xt / Tt = St Ct It
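
As a sketch of the first stage, the code below estimates the trend with a centred 12-point moving average and then de-trends by subtraction (additive model). The data are the first fourteen monthly sales from Table 11.1 in the worked example. The function name is hypothetical, and setting the first trend value against the sixth observation is an assumption that follows the layout of Table 11.2; the more usual centring convention would set it against the seventh.

```python
def centred_moving_average(values, period=12):
    """Return moving totals, sums of adjacent totals, and the centred average."""
    totals = [sum(values[i:i + period])
              for i in range(len(values) - period + 1)]
    pairs = [totals[i] + totals[i + 1] for i in range(len(totals) - 1)]
    trend = [p / (2 * period) for p in pairs]
    return totals, pairs, trend


# First fourteen monthly sales from Table 11.1 (January of year 1 onwards)
sales = [10, 18, 22, 8, 16, 10, 18, 20, 15, 10, 14, 11, 15, 12]
totals, pairs, trend = centred_moving_average(sales)

# Alignment assumption: first trend value set against the 6th observation,
# as in Table 11.2 (index 5 in zero-based counting)
detrended = [x - t for x, t in zip(sales[5:], trend)]

print(totals)                           # 12-point moving totals
print(pairs)                            # totals added in pairs
print([round(t, 2) for t in trend])     # moving average (trend)
print([round(d, 2) for d in detrended]) # deviation from trend, Xt - Tt
```

For the multiplicative model the last step would divide instead of subtract, giving the ratio Xt / Tt.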

Estimation of Seasonal indices


The first step in the estimation of the seasonal effect is to obtain the deviation from the trend, Xt – Tt
(for the additive model), or the ratio to trend, Xt / Tt (for the multiplicative model).
The de-trended series is averaged day by day, month by month etc. to produce an estimate of the
seasonal components. Depending on whether the seasonal effect is thought to be additive or
multiplicative, the deviations (or ratios) are arranged in a table with a view to obtaining the averages,
otherwise referred to as seasonal indices St.

For the additive model, the condition $\sum_{i=1}^{K} S_i = 0$ is imposed. That is, the sum of the seasonal effects
(indices) over the quarters adds up to zero because, if there were no seasonal effect, we expect Xt
– Tt = 0. If the means do not sum to zero, their sum is averaged among the quarters /
months / days / weeks, and the seasonal effects are adjusted by subtracting (or adding) this
average from each mean to obtain the adjusted means (i.e. seasonal effects). That is,

$S_i = \frac{1}{n}\sum d_i$ (the mean of the n deviations d_i recorded for season i); ideally $\sum_{i=1}^{K} S_i = 0$, but if $\sum_{i=1}^{K} S_i = m$ then $\frac{m}{K} = g$, and therefore $\sum_{i=1}^{K} (S_i - g) = 0$ (1.3)
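
The adjustment in (1.3) can be checked numerically. The sketch below takes the twelve unadjusted monthly averages from the AVG row of Table 11.3, computes g = m / K, and subtracts g from each average. The function name adjust_additive is hypothetical.

```python
def adjust_additive(means):
    """Subtract g = (sum of means) / K from each mean so the result sums to zero."""
    g = sum(means) / len(means)
    return [m - g for m in means]


# AVG row of Table 11.3 (JAN to DEC)
avg = [-4.45, -3.98, -4.63, -0.46, -2.16, 0.65,
       5.71, 7.84, 0.73, 1.32, 4.08, -4.77]

seasonal_indices = adjust_additive(avg)
print([round(s, 2) for s in seasonal_indices])
print(round(sum(seasonal_indices), 10))
```

Rounded to two decimal places, the adjusted values agree with the S.I row of Table 11.3.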

For the multiplicative model, the condition $\sum_{j=1}^{S} S_j = S$ is imposed, where S is the number of quarters
in a quarterly series. That is, the sum of the seasonal effects over the year is S. The ratio of the
actual values (Xt) to the trend (Tt) is obtained as Xt / Tt because, if there were no seasonal
effect, we expect Xt / Tt = 1 for each time period. Thus $\frac{1}{S}\sum_{j=1}^{S} S_j = 1$, or equivalently

$\sum_{j=1}^{S} S_j = S$ (1.4)

The averaging procedure which produces the seasonal components follows the same pattern as in
the additive model, except that the adjustment to the averages corrects their sum to S.
This is achieved by summing the averages, dividing S by that sum, and multiplying the resultant
quotient by the unadjusted averages.

Let $\sum_{j} S_j = C$; then the adjustment uses the mean of the averages, $\frac{1}{S}\sum_{j} S_j = \frac{C}{S}$: each unadjusted
average is divided by this mean (equivalently, multiplied by the quotient $\frac{S}{C}$), so that the adjusted indices sum to S.
Thus the de-trended, de-seasonalized series can
be obtained by eliminating the trend (Tt) and seasonal components (St) for each time period from
the actual data by subtraction or division, depending on whether the additive or multiplicative
model was used.
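
A minimal sketch of the multiplicative adjustment follows. The four quarterly averages are hypothetical (they are not taken from the worked example), and the function name adjust_multiplicative is an assumption made for illustration.

```python
def adjust_multiplicative(ratios):
    """Scale the unadjusted averages so that they sum to S, the number of seasons."""
    s = len(ratios)   # S: number of seasons (4 for quarterly data)
    c = sum(ratios)   # C: sum of the unadjusted averages
    factor = s / c    # the quotient S / C
    return [r * factor for r in ratios]


# Hypothetical unadjusted quarterly averages of Xt / Tt
unadjusted = [0.92, 1.10, 1.21, 0.81]
adjusted = adjust_multiplicative(unadjusted)

print(round(sum(unadjusted), 2))  # C, not equal to 4
print(round(sum(adjusted), 10))   # sums to S after adjustment
```

Because every average is multiplied by the same factor, the relative sizes of the seasonal indices are unchanged.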

Additive model: Xt – Tt – St = Ct + It

Multiplicative model: Xt / (Tt St) = Ct It

The de-seasonalized series is obtained after the seasonally adjusted data have been calculated. The
residual ratio is obtained either by dividing the seasonally adjusted figures (Xt / St) by the
trend values or by dividing the de-trended series (Xt / Tt) by the respective seasonal
indices.

Finally, the cyclical variation (Ct ) can be found by smoothing the joint Ct and It components and
is eliminated as before.

Residual irregular components (It) can be obtained by subtraction or division:

Additive model: Xt – Tt – St – Ct = It

Multiplicative model: Xt / (Tt St Ct) = It

Although the general method of decomposition has included the four possible components which
make up a time series, it should be noted that it is not a rule for all the four to be present. If
annual data are being used, there can be no seasonal component. Similarly, if short periods of
time are involved, the cyclical components can be ignored. In both cases one of the steps
outlined in the decomposition of time series above may be omitted.
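
The additive steps can be sketched with figures from the worked example. The sales, trend values and seasonal indices below are those shown against JUN, JUL and AUG of the first year in Table 11.2. As an assumption of this sketch, the cyclical component is ignored (short period), so what remains after removing trend and season is treated as the irregular part.

```python
# Values shown against JUN, JUL, AUG of the first year in Table 11.2
sales = [10, 18, 20]              # Xt
trend = [14.54, 14.50, 13.88]     # Tt (moving average)
seasonal = [0.66, 5.72, 7.85]     # St (seasonal indices)

# De-seasonalized data: Xt - St
deseasonalized = [x - s for x, s in zip(sales, seasonal)]

# Residual after removing trend and season: Xt - Tt - St
# (cyclical component assumed negligible in this sketch)
residual = [x - t - s for x, t, s in zip(sales, trend, seasonal)]

print([round(v, 2) for v in deseasonalized])
print([round(v, 2) for v in residual])
```

For a multiplicative model the subtractions would be replaced by divisions.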

Prediction / Forecasting
The essence of decomposing a time series is for a statistician to measure the effect of each
component and to make meaningful and reliable forecasts, taking into consideration the effect of
the components on the forecast values for different time periods.
Thus, if a multiplicative model was used, a sensible predictor for the period k might be

$\hat{X}_t(k) = \hat{T}_{t+k}\,\hat{S}_{t+k}$

where $\hat{T}_t$ and $\hat{S}_t$ are the estimated trend and seasonal effects respectively.

Similarly, if the additive model was used, the predictor for the period k might be

$\hat{X}_t(k) = \hat{T}_{t+k} + \hat{S}_{t+k}$
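
The two predictors can be sketched as follows. Several things here are assumptions made for illustration only: the trend is taken to be a straight line a + bt with t counted from 1, the series is taken to start in the first season of the year, the function names are hypothetical, and the values of a and b are made up. The monthly seasonal indices are the S.I row of Table 11.3.

```python
def forecast_additive(a, b, t, k, seasonal):
    """Trend at period t + k plus the seasonal index of that period."""
    step = t + k
    return a + b * step + seasonal[(step - 1) % len(seasonal)]


def forecast_multiplicative(a, b, t, k, seasonal):
    """Trend at period t + k times the seasonal index of that period."""
    step = t + k
    return (a + b * step) * seasonal[(step - 1) % len(seasonal)]


# S.I row of Table 11.3 (JAN to DEC)
seasonal_indices = [-4.44, -3.97, -4.62, -0.45, -2.15, 0.66,
                    5.72, 7.85, 0.74, 1.33, 4.09, -4.76]

# Hypothetical trend line, NOT fitted to the data: Tt = 15.0 - 0.02 t
a, b = 15.0, -0.02

# One step ahead of the 96th monthly observation, i.e. the following January
print(round(forecast_additive(a, b, 96, 1, seasonal_indices), 2))
```

In practice a and b would come from a trend fitted to the series, for example by least squares.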

Example

The data below give the monthly sales of umbrellas in XYZ company from 2004 to 2011.

Table 11.1

2004 2005 2006 2007 2008 2009 2010 2011

JAN 10 15 10 8 10 8 9 10

FEB 18 12 10 9 9 12 13 8

MAR 22 13 9 12 10 10 3 11

APRIL 8 15 20 14 12 15 8 13

MAY 16 11 10 19 10 12 15 8

JUN 10 16 18 20 18 11 15 11

JUL 18 22 16 25 28 16 18 15

AUG 20 30 20 25 30 13 19 17

SEP 15 20 21 17 15 9 10 11

OCT 10 15 18 15 17 18 18 4

NOV 14 25 16 15 15 22 23 15

DEC 11 10 14 7 7 10 9 8

(a) Use a suitable average to decompose the series into trend and seasonal components; hence or
otherwise forecast the sales for 2012 – 2013 using the additive model.
(b) Which is the most appropriate model in the sense of providing the better forecast?
Solution:

The first thing to do is to construct the time plot in order to view the maximum and minimum
values, examine the existence of outliers and fluctuations.

Table 11.2 Showing the Computations of the Trend, Seasonal indices and De-seasonalized data.

Col. 1       Col. 2   Col. 3     Col. 4         Col. 5    Col. 6      Col. 7     Col. 8
Year/Month   Sales    12-point   Add-in-pairs   Moving    Dev. from   Seasonal   De-seasonalized
                      MT                        Average   Trend       Indices    data
2004 JAN     10       -          -              -         -           -4.44      14.4
     FEB     18       -          -              -         -           -3.97      22.0
     MAR     22       -          -              -         -           -4.62      26.6
     APRIL   8        -          -              -         -           -0.45      8.4
     MAY     16       -          -              -         -           -2.15      18.2
     JUN     10       172        349            14.54     -4.54       0.66       9.3
     JUL     18       177        348            14.50     3.50        5.72       12.3
     AUG     20       171        333            13.88     6.13        7.85       12.2
     SEP     15       162        331            13.79     1.21        0.74       14.3
     OCT     10       169        333            13.88     -3.88       1.33       8.7
     NOV     14       164        334            13.92     0.08        4.09       9.9
     DEC     11       170        344            14.33     -3.33       -4.76      15.8
2005 JAN     15       174        358            14.92     0.08        -4.44      19.4
     FEB     12       184        373            15.54     -3.54       -3.97      16.0
     MAR     13       189        383            15.96     -2.96       -4.62      17.6
     APRIL   15       194        399            16.63     -1.63       -0.45      15.4
     MAY     11       205        409            17.04     -6.04       -2.15      13.2
     JUN     16       204        403            16.79     -0.79       0.66       15.3
     JUL     22       199        396            16.50     5.50        5.72       16.3
     AUG     30       197        390            16.25     13.75       7.85       22.2
     SEP     20       193        391            16.29     3.71        0.74       19.3
     OCT     15       198        395            16.46     -1.46       1.33       13.7
     NOV     25       197        396            16.50     8.50        4.09       20.9
     DEC     10       199        392            16.33     -6.33       -4.76      14.8
2006 JAN     10       193        376            15.67     -5.67       -4.44      14.4
     FEB     10       183        367            15.29     -5.29       -3.97      14.0
     MAR     9        184        371            15.46     -6.46       -4.62      13.6
     APRIL   20       187        365            15.21     4.79        -0.45      20.4
     MAY     10       178        360            15.00     -5.00       -2.15      12.2
     JUN     18       182        362            15.08     2.92        0.66       17.3
     JUL     16       180        359            14.96     1.04        5.72       10.3
     AUG     20       179        361            15.04     4.96        7.85       12.2
     SEP     21       182        358            14.92     6.08        0.74       20.3
     OCT     18       176        361            15.04     2.96        1.33       16.7
     NOV     16       185        372            15.50     0.50        4.09       11.9
     DEC     14       187        383            15.96     -1.96       -4.76      18.8
2007 JAN     8        196        397            16.54     -8.54       -4.44      12.4
     FEB     9        201        398            16.58     -7.58       -3.97      13.0
     MAR     12       197        391            16.29     -4.29       -4.62      16.6
     APRIL   14       194        387            16.13     -2.13       -0.45      14.4
     MAY     19       193        379            15.79     3.21        -2.15      21.2
     JUN     20       186        374            15.58     4.42        0.66       19.3
     JUL     25       188        376            15.67     9.33        5.72       19.3
     AUG     25       188        374            15.58     9.42        7.85       17.2
     SEP     17       186        370            15.42     1.58        0.74       16.3
     OCT     15       184        359            14.96     0.04        1.33       13.7
     NOV     15       175        348            14.50     0.50        4.09       10.9
     DEC     7        173        349            14.54     -7.54       -4.76      11.8
2008 JAN     10       176        357            14.88     -4.88       -4.44      14.4
     FEB     9        181        360            15.00     -6.00       -3.97      13.0
     MAR     10       179        360            15.00     -5.00       -4.62      14.6
     APRIL   12       181        362            15.08     -3.08       -0.45      12.4
     MAY     10       181        362            15.08     -5.08       -2.15      12.2
     JUN     18       181        360            15.00     3.00        0.66       17.3
     JUL     28       179        361            15.04     12.96       5.72       22.3
     AUG     30       182        364            15.17     14.83       7.85       22.2
     SEP     15       182        367            15.29     -0.29       0.74       14.3
     OCT     17       185        372            15.50     1.50        1.33       15.7
     NOV     15       187        367            15.29     -0.29       4.09       10.9
     DEC     7        180        348            14.50     -7.50       -4.76      11.8
2009 JAN     8        168        319            13.29     -5.29       -4.44      12.4
     FEB     12       151        296            12.33     -0.33       -3.97      16.0
     MAR     10       145        291            12.13     -2.13       -4.62      14.6
     APRIL   15       146        299            12.46     2.54        -0.45      15.4
     MAY     12       153        309            12.88     -0.88       -2.15      14.2
     JUN     11       156        313            13.04     -2.04       0.66       10.3
     JUL     16       157        315            13.13     2.88        5.72       10.3
     AUG     13       158        309            12.88     0.13        7.85       5.2
     SEP     9        151        295            12.29     -3.29       0.74       8.3
     OCT     18       144        291            12.13     5.88        1.33       16.7
     NOV     22       147        298            12.42     9.58        4.09       17.9
     DEC     10       151        304            12.67     -2.67       -4.76      14.8
2010 JAN     9        153        312            13.00     -4.00       -4.44      13.4
     FEB     13       159        319            13.29     -0.29       -3.97      17.0
     MAR     3        160        320            13.33     -10.33      -4.62      7.6
     APRIL   8        160        321            13.38     -5.38       -0.45      8.4
     MAY     15       161        321            13.38     1.63        -2.15      17.2
     JUN     15       160        321            13.38     1.63        0.66       14.3
     JUL     18       161        317            13.21     4.79        5.72       12.3
     AUG     19       156        320            13.33     5.67        7.85       11.2
     SEP     10       164        333            13.88     -3.88       0.74       9.3
     OCT     18       169        331            13.79     4.21        1.33       16.7
     NOV     23       162        320            13.33     9.67        4.09       18.9
     DEC     9        158        313            13.04     -4.04       -4.76      13.8
2011 JAN     10       155        308            12.83     -2.83       -4.44      14.4
     FEB     8        153        307            12.79     -4.79       -3.97      12.0
     MAR     11       154        294            12.25     -1.25       -4.62      15.6
     APRIL   13       140        272            11.33     1.67        -0.45      13.4
     MAY     8        132        263            10.96     -2.96       -2.15      10.2
     JUN     11       131        -              -         -           0.66       10.3
     JUL     15       -          -              -         -           5.72       9.3
     AUG     17       -          -              -         -           7.85       9.2
     SEP     11       -          -              -         -           0.74       10.3
     OCT     4        -          -              -         -           1.33       2.7
     NOV     15       -          -              -         -           4.09       10.9
     DEC     8        -          -              -         -           -4.76      12.8

Table 11.3 Showing Seasonal Indices

Year        JAN     FEB     MAR     APRIL   MAY     JUN     JUL     AUG     SEP     OCT     NOV     DEC
2004        -       -       -       -       -       -4.54   3.50    6.13    1.21    -3.88   0.08    -3.33
2005        0.08    -3.54   -2.96   -1.63   -6.04   -0.79   5.50    13.75   3.71    -1.46   8.50    -6.33
2006        -5.67   -5.29   -6.46   4.79    -5.00   2.92    1.04    4.96    6.08    2.96    0.50    -1.96
2007        -8.54   -7.58   -4.29   -2.13   3.21    4.42    9.33    9.42    1.58    0.04    0.50    -7.54
2008        -4.88   -6.00   -5.00   -3.08   -5.08   3.00    12.96   14.83   -0.29   1.50    -0.29   -7.50
2009        -5.29   -0.33   -2.13   2.54    -0.88   -2.04   2.88    0.13    -3.29   5.88    9.58    -2.67
2010        -4.00   -0.29   -10.33  -5.38   1.63    1.63    4.79    5.67    -3.88   4.21    9.67    -4.04
2011        -2.83   -4.79   -1.25   1.67    -2.96   -       -       -       -       -       -       -
Total       -31.13  -27.83  -32.42  -3.21   -15.13  4.58    40.00   54.88   5.13    9.25    28.54   -33.38
AVG         -4.45   -3.98   -4.63   -0.46   -2.16   0.65    5.71    7.84    0.73    1.32    4.08    -4.77   -0.10
Adjustment  0.01    0.01    0.01    0.01    0.01    0.01    0.01    0.01    0.01    0.01    0.01    0.01    -0.01
S.I         -4.44   -3.97   -4.62   -0.45   -2.15   0.66    5.72    7.85    0.74    1.33    4.09    -4.76   0.00
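
The Total and AVG rows of Table 11.3 can be reproduced column by column. The sketch below does this for two of the twelve months, using the seven deviations listed in the JAN and AUG columns; the function name seasonal_means is hypothetical. Totals computed from the rounded deviations may differ from the printed totals in the last digit, presumably because the table was built from unrounded figures.

```python
def seasonal_means(deviations_by_month):
    """Average the deviations from trend recorded for each month."""
    return {month: sum(devs) / len(devs)
            for month, devs in deviations_by_month.items()}


# Deviations from trend taken from the JAN and AUG columns of Table 11.3
deviations = {
    "JAN": [0.08, -5.67, -8.54, -4.88, -5.29, -4.00, -2.83],
    "AUG": [6.13, 13.75, 4.96, 9.42, 14.83, 0.13, 5.67],
}

means = seasonal_means(deviations)
for month, mean in means.items():
    print(month, round(sum(deviations[month]), 2), round(mean, 2))
```

Repeating this for all twelve months gives the unadjusted averages, which are then corrected so that they sum to zero.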

Fig. 1.5: Time Plot of Sales of Umbrella, Moving average and De-seasonalized data (monthly, JAN 2004 to DEC 2011; vertical axis from 0 to 35)

Summary for Study Session 11


In this study session, you have learnt about:

1. The components of a time series, described with charts for illustration.
2. The additive and multiplicative methods of analysing time series data, with examples.
3. The procedure for constructing seasonal indices and de-seasonalized data. Some useful
examples were given to illustrate the techniques.

Self-Assessment Questions (SAQs) for Study Session 11


Now that you have completed this study session, you can assess how well you have achieved its
Learning outcomes by answering the following questions.

SAQ 11.1 -11.3


1. Explain clearly the reasons for analyzing time series data.
2. Sixteen successive observations of a given time series are:

1.6, 0.8, 1.2, 0.5, 0.9, 1.1, 1.1, 0.6, 1.5,

0.8, 0.9, 1.2, 0.5, 1.3, 0.8, 1.2

(i) Obtain the time plot of the observations.
(ii) Use a 3-point moving average to obtain the trend values.
3. For the following time series:

Year Yt

1990 2.4

1991 3.6

1992 5.4

1993 7.8

1994 11.6

1995 17.3

(i) Fit a linear trend to the above data and


(ii) Fit a Quadratic trend.

