CM201


CENTRE FOR DISTANCE AND

ONLINE EDUCATION

B.Com (III Semester)


Business Statistics
(CM-201)

ALIGARH MUSLIM UNIVERSITY


ALIGARH, INDIA
SYLLABUS
BUSINESS STATISTICS (CM-201)
B.Com. IIIrd Semester (CBCS)

Credit –04
Max.Marks-100
Assignment – 30 Marks
Examination - 70 Marks

Objective: This course enables students to gain an understanding of statistical techniques as applicable to business.
Course outcomes
At the end of the course students are able to:
1. Apply statistical techniques for business applications.
2. Have a strong foundation in the principles of statistics.
3. Understand the importance, uses and limitations of statistical methods.
UNIT-I MEASURES OF CENTRAL TENDENCY & DISPERSION: Concept of
Central Tendency: Mean, Median and Mode.
Dispersion: Range, Inter Quartile Range, Quartile Deviation, Mean
Deviation and Standard Deviation, Coefficient of variation, Lorenz Curve.
Skewness: Concept and its Measures.
UNIT-II ANALYSIS OF TIME SERIES AND FORECASTING:
Time Series: Meaning and importance, Causes of variation in time
series data, Components of a time series: Determination of trend -
Moving averages method and method of least squares (including
linear, second degree, parabolic, and exponential trend). Computation
of Seasonal Indices: Simple average method, ratio-to-trend method,
ratio-to-moving average method.
Forecasting: Concept and Methods of Forecasting.
UNIT - III CORRELATION AND REGRESSION ANALYSIS:
Correlation: Meaning of Correlation, Simple, Multiple and Partial;
Linear & Non- linear, Correlation & Causation, Scatter diagram,
Pearson’s co-efficient of correlation, Calculation & Properties (proofs
not required), Correlation, Standard and Probable error, Rank
Correlation.
Regression: Principle of Least Squares and Regression lines,
Regression equations and estimations, Properties of Regression
Coefficients, Relationship between Correlation and Regression
Coefficients; Standard Error of Estimate.
UNIT - IV INDEX NUMBER: Meaning, Types and Uses; Methods of constructing
Price and Quantity indices (simple and aggregate); Tests of Adequacy;
Chain-base index numbers; Base shifting, Splicing, and deflating;
Problems in constructing index numbers.
THEORY OF PROBABILITY: Concept; The three approaches to
defining probability; Addition and multiplication laws of probability,
Types of events, Conditional probability; Bayes' Theorem;
Mathematical Expectation, Concept of Combination and Permutation.
Suggested Readings
1. Ahmad, M.M.: Probability and Probability Distributions; AMU Press, Aligarh.
2. Hooda, R.P.: Statistics for Business and Economics; Macmillan, New Delhi.
3. Ya-lun Chou: Statistical Analysis with Business and Economic Applications; Holt, Rinehart & Winston, New York.
4. Levin and Rubin: Statistics for Management; Prentice-Hall of India, New Delhi.
5. Hoel & Jessen: Basic Statistics for Business and Economics; John Wiley and Sons, New York.
6. Gaur, Ajay S. & Gaur, S.S.: Statistical Methods for Practice and Research; Response Books, New Delhi.
7. Patri, D.N.: Statistical Methods; Kalyani Publication, New Delhi, 2011.
8. Elhance, D.N.: Fundamentals of Statistics.
9. Shukla, S.M. and Sahai, S.P.: Business Statistics; Sahitya Bhawan Publication, 2012.
10. Lind, Douglas: Statistical Techniques in Business and Economics; Tata McGraw Hill, New Delhi, 2010.
CONTENT
UNIT-I
MEASURES OF CENTRAL TENDENCY & DISPERSION:
Structure
1.1 Objectives
1.2 Concept of Central Tendency Mean, Median and Mode.
1.3 Dispersion: Range, Inter Quartile Range, Quartile Deviation, Mean Deviation and
Standard Deviation, Coefficient of variation, Lorenz Curve.
1.4 Skewness: Concept and its Measures.
1.5 Summary
1.6 Questions
1.7 Suggested Reading
UNIT-II
ANALYSIS OF TIME SERIES AND FORECASTING
Structure
2.1 Objectives
2.2 Introduction
2.3 Time Series: Meaning
2.4 Components of a time series
2.5 Methods of estimating the trends
2.6 Forecasting: Concept and Methods of Forecasting.
2.7 Summary
2.8 Questions
2.9 Suggested Reading
UNIT - III
CORRELATION AND REGRESSION ANALYSIS:
Structure
3.1 Objectives
3.2 Introduction
3.3 Correlation: Meaning of Correlation
o Utility and Importance of Correlation
o Correlation and Cause and Effect Relationship
o Types of Correlation
o Degree of Correlation
o Methods of Determining Correlation
   - Graphical Method
   - Mathematical Methods
o Probable error
3.4 Regression:
o Meaning and definition
o Utility of Regression Analysis
o Types of Regressions
o Difference between Correlation and Regression
o Regression Lines
o Functions or uses of Regressions Lines
o Regression equations
o Regression Coefficient
   - Properties of Regression Coefficients
o Some Important points relating to Regression Analysis
o Standard Error of the Estimate
o Ratio of Variation

3.5 Summary
3.6 Questions
3.7 Suggested Reading
UNIT - IV
INDEX NUMBER & PROBABILITY
Structure
4.1 Objectives
4.2 Index Number Meaning, Types and Uses;
o Methods of constructing Price and Quantity indices
o Problems Relating to Methods of Base Year
o Problems in constructing index numbers.
4.3 THEORY OF PROBABILITY:
o Meaning and definitions of Probability

o Types of Probability

o Basic concept of Probability

o Methods to Use in Solving Probability Problems

o Approaches of Assigning Probabilities


4.4 Summary
4.5 Questions
4.6 Suggested Reading
UNIT-I

MEASURES OF CENTRAL TENDENCY & DISPERSION:

Structure

1.1 Objectives
1.2 Concept of Central Tendency Mean, Median and Mode.
1.3 Dispersion: Range, Inter Quartile Range, Quartile Deviation, Mean Deviation and
Standard Deviation, Coefficient of variation, Lorenz Curve.
1.4 Skewness: Concept and its Measures.
1.5 Summary
1.6 Questions
1.7 Suggested Reading
1.1 Objectives
The objective of the lesson is to make you understand:
1. The different kinds of statistical averages.
2. The concept and calculation of Arithmetic mean.
3. The concept and calculation of median.
4. The utilities of averages.
1.2 Concept of Central Tendency Mean, Median and Mode.
Collecting data and organizing them in tables, diagrams and graphs does not by itself yield sound, workable conclusions. The collected data must also be presented in a condensed form so that they are understandable, comparable and fit for scientific treatment and practical use. For this purpose a central value, which represents the whole mass of data, is worked out. This value is called the Central Tendency, Central Value or 'Average'. In other words, it is the expected value of the variable, or the value around which the items of the data tend to concentrate or cluster.
An average gives us the gist of a huge mass of data. It always lies between the two extreme observations, i.e., the largest and smallest observations of the distribution. An average is thus a single value within the range of the data which is used to represent all the values in the series. For this reason averages are sometimes called 'Measures of Central Tendency' or 'Measures of Location'.
Utility of Averages
Some of the main uses of averages are:

a. Distribution is explained in a precise manner.
b. Useful for comparative study of distributions.
c. Useful for measuring other statistical measures like dispersion, skewness, kurtosis
etc.

Kinds of Statistical Averages


The Averages are broadly of three types:
a. Mathematical Averages,
b. Positional Averages and
c. Commercial Averages;
Mathematical Averages can be further sub-divided into:
a. Arithmetic mean or Average ($\bar{X}$)
b. Geometric mean (G)
c. Harmonic mean (H)

Positional Averages can be classified into Median (M) and Mode (Z), and Commercial
Averages are further classified into:
a. Moving Averages,
b. Progressive Average and
c. Composite Average

In this unit, we shall study Mathematical Average and the positional Averages. Let us
start with the Arithmetic mean which is one of the most important mathematical Averages.
Arithmetic Mean
It is the most widely used measure for representing the entire data. Arithmetic Mean or
mean is the number which is obtained by adding the values of all the items of a series and
dividing the total by the number of items.
In case of discrete and continuous series the values of the frequencies are taken into
consideration. There are two types of „Mean‟ (i) Simple Arithmetic Mean and (ii) Weighted
Mean.
Calculation of Simple Arithmetic Mean
Arithmetic mean can be calculated for three different types of series, i.e. individual,
discrete and continuous. For each type of series there are three methods: the direct method, the short-cut
method and the step deviation method. Let us understand them one by one.

1. Individual Series:

(a) Direct Method: A series of individual observations is one in which no frequencies are given. Calculating the arithmetic mean in this case is very simple: add the different values of the data and divide the sum by the total number of items. Symbolically,

$$\bar{X} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{N} \quad \text{or} \quad \bar{X} = \frac{\sum X}{N}$$

Where: $X$ = any observation or variable,
$\bar{X}$ = Arithmetic Mean,
$N$ = Total number of observations and
$\sum X$ = Sum of all observations of X, i.e., $X_1, X_2, X_3, \dots, X_n$.

Steps for above:

(i) Add all values of the variable X and get $\sum X$.
(ii) Find out the total number of items, i.e., N.
(iii) Divide this total by the total number of items, i.e., $\bar{X} = \frac{\sum X}{N}$.
(b) Short-Cut Method: The arithmetic mean can be calculated by using an arbitrary origin or assumed mean. The formula for calculating the arithmetic mean is:

$$\bar{X} = A + \frac{\sum dx}{N}$$

Where: A = Assumed mean
dx = Deviation of items from the assumed mean, i.e. (X - A)

Steps
(i) Take any value of the variable (X) as an assumed mean (A).
(ii) Find the deviations of the X values from the assumed mean (X - A) and denote them as 'dx'.
(iii) Obtain the sum of these deviations, i.e., $\sum dx$.
(iv) Find the total number of items (i.e. N).
(v) Apply the formula $\bar{X} = A + \frac{\sum dx}{N}$.

We use this method when the values of X are large; taking deviations from an assumed mean makes the calculation simpler and less time consuming.
(c) Step Deviation Method: The formula used under this method is:

$$\bar{X} = A + \frac{\sum dx'}{N} \times C$$

Here 'C' is the common factor present in all the deviations.

Steps for this are:
(i) Take some value in the series as the assumed mean (A).
(ii) Take deviations (dx) of X from the assumed mean by applying the formula (X - A).
(iii) Divide (X - A), i.e. dx, by the common factor 'C' to get dx': $dx' = \frac{X - A}{C}$.
(iv) Now apply the formula $\bar{X} = A + \frac{\sum dx'}{N} \times C$.
Example 1:

The daily wages of ten factory workers are given below. Calculate the arithmetic mean by the
direct, short-cut and step deviation methods.

Daily Wages: 30 40 100 60 50 50 70 55 75 60

Let us solve this example by the three different methods:

(a) Direct method: According to the data given above, X = 30, 40, 100, 60, 50, 50, 70, 55, 75,
60, N = 10 and $\sum X = 590$.

Following the formula, $\bar{X} = \frac{\sum X}{N} = \frac{590}{10} = 59$

Arithmetic mean by direct method = 59

(b) Short cut method: Following this method, first we select any of the X values as an
assumed mean (A). Suppose we take 50 as A. Now let us calculate the deviations (dx) of the X
values from the assumed mean:

| X | dx = (X - A) |
|---|---|
| 30 | -20 |
| 40 | -10 |
| 100 | 50 |
| 60 | 10 |
| 50 | 0 |
| 50 | 0 |
| 70 | 20 |
| 55 | 5 |
| 75 | 25 |
| 60 | 10 |

Here N = 10 and $\sum dx = 90$.

$$\bar{X} = A + \frac{\sum dx}{N} = 50 + \frac{90}{10} = 50 + 9 = 59$$

(c) Step deviation method: This method goes one step further than the short cut method, i.e.
dx' is calculated by dividing dx by a common factor:

| X | dx | dx' |
|---|---|---|
| 30 | -20 | -4 |
| 40 | -10 | -2 |
| 100 | 50 | 10 |
| 60 | 10 | 2 |
| 50 | 0 | 0 |
| 50 | 0 | 0 |
| 70 | 20 | 4 |
| 55 | 5 | 1 |
| 75 | 25 | 5 |
| 60 | 10 | 2 |

Here N = 10, A = 50, C = 5, dx = X - A, $dx' = \frac{X - A}{C}$ and $\sum dx' = 18$.

Applying the formula:

$$\bar{X} = A + \frac{\sum dx'}{N} \times C = 50 + \frac{18}{10} \times 5 = 50 + 9 = 59$$
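
The three methods of Example 1 can be checked with a short Python sketch. The function names are illustrative only (they are not part of the course material); the assumed mean A = 50 and the common factor C = 5 are the values used in the worked solution.

```python
def mean_direct(values):
    """Direct method: sum of X divided by N."""
    return sum(values) / len(values)


def mean_short_cut(values, assumed_mean):
    """Short-cut method: A + (sum of dx) / N, where dx = X - A."""
    deviations = [x - assumed_mean for x in values]
    return assumed_mean + sum(deviations) / len(values)


def mean_step_deviation(values, assumed_mean, common_factor):
    """Step deviation method: A + (sum of dx') / N * C, where dx' = (X - A) / C."""
    step_deviations = [(x - assumed_mean) / common_factor for x in values]
    return assumed_mean + sum(step_deviations) / len(values) * common_factor


# Daily wages of the ten workers in Example 1.
wages = [30, 40, 100, 60, 50, 50, 70, 55, 75, 60]

print(mean_direct(wages))                 # direct method
print(mean_short_cut(wages, 50))          # assumed mean A = 50
print(mean_step_deviation(wages, 50, 5))  # A = 50, common factor C = 5
```

All three calls should agree on 59, which illustrates that the choice of A and C affects only the amount of arithmetic, not the result.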

2. Discrete Series: In a discrete series there are frequencies to be considered while
calculating the mean. Here also, the arithmetic mean may be calculated by applying either (a) Direct
Method, (b) Short-Cut Method or (c) Step-Deviation Method.

(a) Direct Method. The formula for calculating the mean is:

$$\bar{X} = \frac{\sum fX}{N}$$

Where: f = frequency,
X = variable,
N = sum of the simple frequencies = $\sum f$.

Steps for the above method are:
(i) Multiply the simple frequency of each item by its respective value and total
the products; the total is denoted as $\sum fX$.
(ii) Find out the sum of the frequencies, i.e., $\sum f$ or N.
(iii) Divide the total obtained ($\sum fX$) by the number of observations (N or $\sum f$) to get the
required Arithmetic Mean.

This method becomes time consuming if either X or f is large or in fractions. In that case
the short cut method is used.
(b) Short-cut Method. According to this method, the formula is

$$\bar{X} = A + \frac{\sum f\,dx}{N}$$

where A = Assumed Mean, dx = (X - A),
f = simple frequency, $\sum f$ or N = Total number of observations.

Steps:
(i) Take any value of the variable as an assumed mean (A).
(ii) Take the deviations of the variable X from the assumed mean (X - A) and denote the
deviations by 'dx'.
(iii) Multiply these deviations (dx) by their respective frequencies (f) and obtain the total,
i.e., $\sum f\,dx$.
(iv) Divide the total obtained in (iii) by the total frequency, i.e., $\frac{\sum f\,dx}{\sum f}$ or $\frac{\sum f\,dx}{N}$.
(v) Put the values in the formula: $\bar{X} = A + \frac{\sum f\,dx}{N}$.
(c) Step-Deviation Method: According to this method the formula is:

$$\bar{X} = A + \frac{\sum f\,dx'}{N} \times C$$

Where: A = Assumed mean,
$dx' = \frac{X - A}{C}$,
C = common factor,
f = Simple frequency,
$\sum f$ = N = Total observations.

Steps under this method are:
(i) Take an assumed mean (A).
(ii) Take the deviation of the variable X from the assumed mean (X - A) and denote it by dx.
Divide dx by the common factor C to get $dx' = \frac{dx}{C} = \frac{X - A}{C}$.
(iii) Now multiply each dx' by its respective frequency to obtain f dx', and sum these up to get
$\sum f\,dx'$.
(iv) Apply the formula $\bar{X} = A + \frac{\sum f\,dx'}{N} \times C$.
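Example 2. Calculate the arithmetic mean from the following data by applying all three
methods:

| Shoe size | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|
| No. of pairs | 2 | 5 | 10 | 4 | 2 | 5 | 2 |

(a) Direct method:

| Shoe Size (X) | Pairs (f) | fX |
|---|---|---|
| 6 | 2 | 12 |
| 7 | 5 | 35 |
| 8 | 10 | 80 |
| 9 | 4 | 36 |
| 10 | 2 | 20 |
| 11 | 5 | 55 |
| 12 | 2 | 24 |
| | N or $\sum f$ = 30 | $\sum fX$ = 262 |

Here N = 30 and $\sum fX$ = 262.

$$\bar{X} = \frac{\sum fX}{N} = \frac{262}{30} = 8.73$$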
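(b) Short Cut Method: Select any X value as the assumed mean and calculate the deviations.
Let 9 be the assumed mean (A).

| Shoe Size (X) | Pairs (f) | dx = (X - A) | f dx |
|---|---|---|---|
| 6 | 2 | 6 - 9 = -3 | -6 |
| 7 | 5 | 7 - 9 = -2 | -10 |
| 8 | 10 | 8 - 9 = -1 | -10 |
| 9 | 4 | 9 - 9 = 0 | 0 |
| 10 | 2 | 10 - 9 = 1 | 2 |
| 11 | 5 | 11 - 9 = 2 | 10 |
| 12 | 2 | 12 - 9 = 3 | 6 |
| | N or $\sum f$ = 30 | | $\sum f\,dx$ = -8 |

On applying the formula:

$$\bar{X} = A + \frac{\sum f\,dx}{N} = 9 + \frac{-8}{30} = 9 - 0.27 = 8.73$$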
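(c) Step Deviation method: Let the common factor (C) be 10.

| Shoe Size (X) | Pairs (f) | dx = (X - A) | dx' = dx/C | f dx' |
|---|---|---|---|---|
| 6 | 2 | 6 - 9 = -3 | -0.3 | -0.6 |
| 7 | 5 | 7 - 9 = -2 | -0.2 | -1.0 |
| 8 | 10 | 8 - 9 = -1 | -0.1 | -1.0 |
| 9 | 4 | 9 - 9 = 0 | 0 | 0 |
| 10 | 2 | 10 - 9 = 1 | 0.1 | 0.2 |
| 11 | 5 | 11 - 9 = 2 | 0.2 | 1.0 |
| 12 | 2 | 12 - 9 = 3 | 0.3 | 0.6 |
| | N or $\sum f$ = 30 | | | $\sum f\,dx'$ = -0.8 |

On applying the formula:

$$\bar{X} = A + \frac{\sum f\,dx'}{N} \times C = 9 + \frac{-0.8}{30} \times 10 = 9 - 0.27 = 8.73$$

All three methods give the same mean, 8.73.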
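
As a cross-check on Example 2, the following minimal Python sketch applies the three discrete-series formulas to the shoe-size data. The function names are illustrative, and A = 9 and C = 10 are simply the values chosen in the worked solution.

```python
def discrete_mean_direct(values, freqs):
    """Direct method: sum(fX) / N."""
    return sum(f * x for x, f in zip(values, freqs)) / sum(freqs)


def discrete_mean_short_cut(values, freqs, assumed_mean):
    """Short-cut method: A + sum(f * dx) / N, with dx = X - A."""
    total = sum(f * (x - assumed_mean) for x, f in zip(values, freqs))
    return assumed_mean + total / sum(freqs)


def discrete_mean_step_deviation(values, freqs, assumed_mean, common_factor):
    """Step deviation method: A + sum(f * dx') / N * C, with dx' = (X - A) / C."""
    total = sum(f * (x - assumed_mean) / common_factor
                for x, f in zip(values, freqs))
    return assumed_mean + total / sum(freqs) * common_factor


shoe_sizes = [6, 7, 8, 9, 10, 11, 12]
pairs_sold = [2, 5, 10, 4, 2, 5, 2]

print(round(discrete_mean_direct(shoe_sizes, pairs_sold), 2))
print(round(discrete_mean_short_cut(shoe_sizes, pairs_sold, 9), 2))
print(round(discrete_mean_step_deviation(shoe_sizes, pairs_sold, 9, 10), 2))
```

Each call should round to 8.73, since the assumed mean and the common factor cancel out algebraically.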
3. Continuous series
In a continuous series, the arithmetic mean may be calculated by applying (a) Direct Method,
(b) Short-cut Method or (c) Step-deviation or Coding Method.

(a) Direct Method: Here the formula remains the same, except that the mid-values of the class
intervals become the X variable.

Steps
I. Find out the mid-value of each class (X): Mid-value = $\frac{L_1 + L_2}{2}$,
where $L_1$ = lower class limit and $L_2$ = upper class limit.
II. Find out N by totalling the frequencies (N = $\sum f$).
III. Multiply each mid-value by its corresponding frequency to get 'fX'.
IV. Find out $\sum fX$.
V. Find out the arithmetic mean by applying the formula $\bar{X} = \frac{\sum fX}{N}$.
(b) Short-cut Method
When the short-cut method is used, we apply the following formula to calculate the
arithmetic mean:

$$\bar{X} = A + \frac{\sum f\,dX}{N}$$

Where: A = assumed mean,
dX = deviations of the mid-points from the assumed mean, i.e., dX = (X - A),
N = total number of observations.

Steps:

(i) Find out the mid-value of each class.
(ii) Take an assumed mean (A) from the mid-values.
(iii) From the mid-value of each class deduct the assumed mean, i.e., (X - A), to find the
deviations (dX).
(iv) Multiply the frequency of each class by its deviation and obtain the
total, i.e., $\sum f\,dX$.
(v) Apply the formula $\bar{X} = A + \frac{\sum f\,dX}{N}$.
(c) Step Deviation or Coding Method: In the step-deviation method, the calculations of the short-cut
method are further simplified. We divide the deviations by a common factor and
then multiply the result by the same common factor. The formula is:

$$\bar{X} = A + \frac{\sum f\,dX'}{N} \times C$$

Where: N = number of items,
$dX' = \frac{X - A}{C}$,
A = assumed mean,
f = simple frequency and
C = common factor.
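
The continuous-series formulas can be sketched in Python as follows. The class intervals and frequencies below are illustrative data chosen for this sketch, and the function names are hypothetical; the short-cut method is the step deviation method with C = 1.

```python
def mid_values(classes):
    """Mid-value of each class: (L1 + L2) / 2."""
    return [(lower + upper) / 2 for lower, upper, _ in classes]


def grouped_mean_direct(classes):
    """Direct method: sum(fX) / N, with X taken as the class mid-value."""
    mids = mid_values(classes)
    freqs = [f for _, _, f in classes]
    return sum(f * x for f, x in zip(freqs, mids)) / sum(freqs)


def grouped_mean_step_deviation(classes, assumed_mean, common_factor=1):
    """Step deviation method: A + sum(f * dX') / N * C, with dX' = (X - A) / C."""
    mids = mid_values(classes)
    freqs = [f for _, _, f in classes]
    total = sum(f * (x - assumed_mean) / common_factor
                for f, x in zip(freqs, mids))
    return assumed_mean + total / sum(freqs) * common_factor


# Illustrative data: (lower limit, upper limit, frequency).
classes = [(10, 15, 4), (15, 20, 6), (20, 25, 12), (25, 30, 8), (30, 35, 2)]

print(grouped_mean_direct(classes))                   # 710 / 32
print(grouped_mean_step_deviation(classes, 22.5))     # short-cut: C = 1
print(grouped_mean_step_deviation(classes, 22.5, 5))  # step deviation: C = 5
```

With these figures $\sum fX$ = 710 and N = 32, so every method should return 22.1875.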
Merits and limitations of Arithmetic Mean

Merits: Arithmetic mean is the most commonly used average in practice, because:
(1) It is very simple to understand and calculate.
(2) Arithmetic mean is affected by the value of each and every item in the series.
(3) Arithmetic mean is defined by a rigid mathematical formula. Whichever method is
used, we get the same result.
(4) Arithmetic mean is amenable to algebraic treatment. In this respect it is better than the median, mode,
geometric mean or harmonic mean.
(5) Arithmetic Mean is relatively stable. It does not fluctuate much when repeated
samples are taken from one and the same universe.
(6) Arithmetic mean is the centre of gravity, balancing the values on either side of it.
(7) Arithmetic mean is a calculated value and is not based on position in the series.

Demerits: Arithmetic mean suffers from the following defects:


1. The value of the arithmetic mean depends on each and every item of the series, so it is
affected by extreme items, whether very small or very large. The impact of
extreme items is greater when the number of items is small.
2. In open-end classes, the mean cannot be calculated without making an assumption
regarding the size of the class interval of the open-end classes. The median and mode,
by contrast, can be computed for open-end classes without any assumption
about the size of the class interval.
3. The arithmetic mean is representative only if the distribution of the variable is
roughly normal. If the distribution is U-shaped, the mean is not likely to serve a useful
purpose. So it is not always a good measure.
4. Sometimes it gives absurd results. For example, an average of 24.75 students per
class is absurd, as students cannot be in fractions.
5. Not useful in qualitative data analysis. The arithmetic mean cannot be used for qualitative
analysis, as qualitative attributes like beauty, honesty etc. are not measurable in terms of
numbers.
6. It can be a non-existent figure. Sometimes the arithmetic mean is a figure which
does not exist in the series, i.e., a fictitious average.
7. The value of the arithmetic mean cannot be determined graphically.
8. It cannot be determined when some items of the series are missing.

Median
Median is a positional average: it is the middle or central-most value of a
distribution when the series is arranged in ascending or descending order. In other words, the median
is the value which divides the series into two equal parts. According to Connor, “The Median is
that value of the variable which divides the group in two equal parts, one part comprising all the
values greater and the other all values less than median”.
If the total number of items is odd, there is no difficulty in locating the
median. If the number of items is even, e.g. 4 or 10, there is no actual value
exactly in the centre of the series. In that case the median is a conventional figure which may not
actually occur in the series, and it is determined in the manner explained below.
Calculation of Median
a. Individual Observations or series:

To find the median of a given series, first arrange the series
in ascending or descending order and then use the following formula:

$$\text{Median } (M) = \text{Size of the } \left(\frac{N+1}{2}\right)\text{th item}$$

Steps to be followed are:

(i) Arrange the data in ascending or descending order.

(ii) Compute (N + 1)/2. In an odd-numbered series (N = 3, 5, 7, 9 etc.), N + 1 is exactly
divisible by 2, so the median is an actual value of the series. In
an even-numbered series, (N + 1)/2 is fractional, so the median has to be worked out from the two
middle items and may not actually occur in the series. The median
is calculated as under:

i. Odd Number Series

If the number of items is odd, the median is the middle value after the items have been
arranged in ascending or descending order of magnitude.

Example 16: Calculate the value of the median from the following figures:

X: 35 45 90 101 124 75 150 175 300

Solution:
First of all, arrange the values in ascending order.

X: 35 45 75 90 101 124 150 175 300 (N = 9)

Calculation of Median

$$M = \text{Size of the } \left(\frac{N+1}{2}\right)\text{th item} = \text{Size of the } \left(\frac{9+1}{2}\right)\text{th item} = \text{Size of the 5th item}$$

The 5th item in the series is 101.
Thus M = 101

ii. Even Number Series

In case of an even number of observations, the median is obtained as the arithmetic mean of the
two middle observations after they are arranged in ascending or descending order of magnitude.

Example 17: Calculate the value of the median from the following figures:

X: 20 35 10 84 56 12 55 28 15 66

Solution: First of all, arrange the values in ascending order.

X: 10 12 15 20 28 35 55 56 66 84

Calculation of Median

$$M = \text{Size of the } \left(\frac{N+1}{2}\right)\text{th item} = \text{Size of the } \left(\frac{10+1}{2}\right)\text{th item} = \text{Size of the 5.5th item}$$

The 5.5th item is the mean of the 5th and 6th items, i.e. $\frac{28 + 35}{2} = 31.5$

Thus M = 31.5
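
The odd and even cases above can be combined in one small Python function. This is a minimal sketch with a hypothetical function name; it follows the (N + 1)/2 rule from the text and averages the two middle items when that position is fractional.

```python
def median_individual(values):
    """Median of an individual series using the (N + 1) / 2-th item rule."""
    ordered = sorted(values)             # arrange in ascending order
    n = len(ordered)
    position = (n + 1) / 2               # 1-based position of the median item
    if position.is_integer():            # odd N: an actual item of the series
        return ordered[int(position) - 1]
    lower = ordered[int(position) - 1]   # item just below the fractional position
    upper = ordered[int(position)]       # item just above it
    return (lower + upper) / 2           # even N: mean of the two middle items


print(median_individual([35, 45, 90, 101, 124, 75, 150, 175, 300]))   # Example 16
print(median_individual([20, 35, 10, 84, 56, 12, 55, 28, 15, 66]))    # Example 17
```

The two calls should reproduce the worked answers, 101 and 31.5.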
b. Discrete Series

A discrete series involves frequencies. To find the median in such a case, the
total frequency has to be divided into two equal parts, and the middle item is located
with the help of the cumulative frequencies.

Steps to be followed are:

(i) Arrange the data in ascending or descending order.

(ii) Calculate the cumulative frequencies.

(iii) Find out the position of the middle item by applying the formula:

$$\text{Median} = \text{Size of the } \left(\frac{N+1}{2}\right)\text{th item}$$

(iv) In the cumulative frequency column, find the total which is either equal to
$\frac{N+1}{2}$ or the next higher than that.

(v) Locate the value of the variable corresponding to this cumulative frequency. This
value of the variable is the median.

This can be made clear with the help of an example.
Example 18: Determine the median from the following data.

| Size | 15 | 20 | 25 | 30 | 35 | 40 | 45 |
|---|---|---|---|---|---|---|---|
| Frequency | 12 | 23 | 45 | 16 | 14 | 55 | 28 |

Solution:
Calculation of Median

| Size (x) | Frequency (f) | Cumulative Frequency (cf) |
|---|---|---|
| 15 | 12 | 12 |
| 20 | 23 | 12 + 23 = 35 |
| 25 | 45 | 35 + 45 = 80 |
| 30 | 16 | 80 + 16 = 96 |
| 35 | 14 | 96 + 14 = 110 |
| 40 | 55 | 110 + 55 = 165 |
| 45 | 28 | 165 + 28 = 193 |

$$M = \text{Size of the } \left(\frac{N+1}{2}\right)\text{th item} = \text{Size of the } \left(\frac{193+1}{2}\right)\text{th item} = \text{Size of the 97th item}$$

The 97th item falls in the cumulative frequency 110 (the first total equal to or higher than 97), and the corresponding size is 35.
Thus Median = 35.
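
The cumulative-frequency search in Example 18 can be sketched in Python. The function name is hypothetical, and the sketch assumes the simple rule stated in the steps above: take the first cumulative frequency that is equal to or next higher than (N + 1)/2.

```python
from itertools import accumulate


def median_discrete(sizes, freqs):
    """Median of a discrete series via cumulative frequencies."""
    pairs = sorted(zip(sizes, freqs))                  # ascending order of size
    cumulative = list(accumulate(f for _, f in pairs))
    position = (cumulative[-1] + 1) / 2                # (N + 1) / 2-th item
    for (size, _), cf in zip(pairs, cumulative):
        if cf >= position:                             # first cf equal to or above it
            return size


sizes = [15, 20, 25, 30, 35, 40, 45]
freqs = [12, 23, 45, 16, 14, 55, 28]
print(median_discrete(sizes, freqs))   # the 97th item
```

For the data of Example 18 the cumulative frequencies are 12, 35, 80, 96, 110, 165, 193, so the search should stop at 110 and return 35.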
c. Continuous Series
In a continuous series the median cannot be located directly. Here
the median lies within a class interval, i.e., between the lower and upper limits of a class. To
find its exact value, we assume that the values in each class are uniformly distributed over the
class interval.

Steps followed are:

(i) Arrange the data in ascending order.
(ii) Calculate the cumulative frequencies.
(iii) Apply the formula: Median = Size of the $\left(\frac{N}{2}\right)$th item.
(iv) To find the class interval containing the median, look at the cumulative
frequency column, find the total which is either equal to $\frac{N}{2}$ or the next higher than
that, and note the class interval corresponding to it.
(v) Once the median class is determined, apply the following formula to
find the exact median value:

$$M = L_1 + \frac{\frac{N}{2} - cf}{f} \times i$$

Where M = Median, $L_1$ = Lower limit of the median class, cf = cumulative frequency of
the class preceding the median class, i.e. the sum of the frequencies of all classes lower than the
median class,
f = simple frequency of the median class,
i = class interval (width) of the median class.

If the series is arranged in descending order, the median is calculated by an alternative
formula:

$$M = L_2 - \frac{\frac{N}{2} - cf}{f} \times i$$

where $L_2$ = upper class limit of the median class and cf is the cumulative frequency of the class preceding the median class in the descending arrangement.

In case of inclusive class intervals, the series should be converted into exclusive class intervals,
so that the true lower limit may be used in the formula.
Example 19: Compute the median from the following series.

| X | 10-15 | 15-20 | 20-25 | 25-30 | 30-35 |
|---|---|---|---|---|---|
| Frequency | 4 | 6 | 12 | 8 | 2 |

Solution:
Calculation of Median

| X | f | cf |
|---|---|---|
| 10-15 | 4 | 4 |
| 15-20 | 6 | 10 |
| 20-25 | 12 | 22 |
| 25-30 | 8 | 30 |
| 30-35 | 2 | 32 |

$$M = \text{Size of the } \left(\frac{N}{2}\right)\text{th item} = \text{Size of the } \left(\frac{32}{2}\right)\text{th item} = \text{Size of the 16th item}$$

The 16th item lies in the 20-25 group (cumulative frequency 22). Therefore
20-25 is the median class. Let us now interpolate the value:
Here $L_1$ = 20, N = 32, cf = 10, i = 5, f = 12

As per the formula,

$$M = L_1 + \frac{\frac{N}{2} - cf}{f} \times i = 20 + \frac{\frac{32}{2} - 10}{12} \times 5 = 20 + \frac{16 - 10}{12} \times 5 = 20 + \frac{6}{12} \times 5 = 22.5$$
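
The interpolation formula for a continuous series can be written as a short Python function. This is a sketch under stated assumptions: the name `median_continuous` is hypothetical, the classes are given in ascending order with exclusive limits, and no adjustment is made for classes with zero frequency.

```python
def median_continuous(classes):
    """Median of a continuous series: L1 + (N/2 - cf) / f * i.

    classes: list of (lower limit, upper limit, frequency) in ascending order.
    """
    n = sum(f for _, _, f in classes)
    half = n / 2                                  # position of the N/2-th item
    cf_before = 0                                 # cf of the classes below the current one
    for lower, upper, f in classes:
        if f > 0 and cf_before + f >= half:       # first cf equal to or above N/2
            width = upper - lower                 # class interval i
            return lower + (half - cf_before) / f * width
        cf_before += f


example_19 = [(10, 15, 4), (15, 20, 6), (20, 25, 12), (25, 30, 8), (30, 35, 2)]
print(median_continuous(example_19))
```

For Example 19 the median class is 20-25 with cf = 10 and f = 12, so the function should return 20 + (6/12) × 5 = 22.5. Because the class width is taken from each class's own limits, the same function also works for unequal class intervals.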
Graphical Location of Median
The median value of a series can also be determined graphically by presenting
the data in the form of ogives. This may be done in two ways:
1. Presenting the data graphically in the form of a 'less than' or a 'more than' ogive.
2. Presenting the data graphically and drawing the 'less than' and 'more than' ogives
simultaneously.

Median through 'Less than' or 'More than' ogives

In this method the frequency distribution is first converted into a 'less than' or 'more
than' cumulative series, and the data are plotted as a 'less than' or 'more than'
ogive. The N/2-th item of the series is located on the Y-axis; from that point a line
perpendicular to the Y-axis is drawn to the right until it cuts the cumulative frequency curve. From the
point of intersection a perpendicular is dropped on the X-axis, and the value at which it meets the X-axis is the median
of the series.
Some specific problems relating to Median:
Some specific problems arising in the computation of the median, with examples based on
these problems, are as follows:

(i) Location of median in an individual series in case of some unknown values: If certain
values are unknown in an individual series, the median can still be calculated provided (a) the unknown
value is not the median value and (b) the sequence or order of the unknown values is known.

Example 20: In a batch of 12 students, 4 students failed in a test. The marks of the 8 students who
passed were:
9 6 7 8 8 9 6 5

What was the median of the marks of all 12 students?
Solution:
The median of the marks of 12 students is required, while only 8 values are known.
However, it is clear that the 4 unknown values come before the 8 known values, because the
unknown values relate to the students who failed and their marks will definitely be less than
the marks of the 8 students who passed. So the values are arranged as under:

O P Q R 5 6 6 7 8 8 9 9

O, P, Q and R are the marks of the students who failed.

M.No. = $\frac{N+1}{2} = \frac{12+1}{2}$ = 6.5th item

6.5th item = mean of the 6th and 7th items = $\frac{6 + 6}{2}$ = 6. So Median = 6
(ii) Zero Frequency in Discrete Series: If in a discrete series the frequencies of one or more items
are zero and, on account of this, the median number lies in two or more cumulative frequencies,
the value corresponding to the first such cumulative frequency is the median.
Example 21: Calculate the value of the median in the following series:

| X | 5 | 6 | 7 | 8 | 10 |
|---|---|---|---|---|---|
| f | 4 | 3 | 0 | 0 | 6 |

Solution:

| X | f | c.f. |
|---|---|---|
| 5 | 4 | 4 |
| 6 | 3 | 7 |
| 7 | 0 | 7 |
| 8 | 0 | 7 |
| 10 | 6 | 13 |

M.No. = $\frac{N+1}{2}$th item = $\frac{13+1}{2}$th item = 7th item

Median No. 7 lies in three cumulative frequencies. The value corresponding to the first such
cumulative frequency is 6. So Median = 6.
(iii) Median No. falling between a cumulative frequency and the next number in a
discrete series: If the median number is a fraction which lies between
a cumulative frequency and the number succeeding it, the median is calculated by
dividing by 2 the sum of the value corresponding to that cumulative frequency and the next value, as shown
in the following example.
Example 22: Find out the median from the following data:

| Marks in Test | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| No. of Students | 9 | 6 | 2 | 2 | 2 | 4 | 3 | 3 | 3 |

Solution: Calculation of Median

| X | f | c.f. |
|---|---|---|
| 2 | 9 | 9 |
| 3 | 6 | 15 |
| 4 | 2 | 17 |
| 5 | 2 | 19 |
| 6 | 2 | 21 |
| 7 | 4 | 25 |
| 8 | 3 | 28 |
| 9 | 3 | 31 |
| 10 | 3 | 34 |

M.No. = $\frac{N+1}{2} = \frac{34+1}{2}$ = 17.5th item

The 17.5th item lies between cumulative frequency 17 and the next number. So Median = $\frac{4 + 5}{2}$ = 4.5

Explanation: In this example, there are 17 items with values from 2 to 4 and 17 items
with values from 5 to 10. The median is the value which divides the series into two equal parts.
So 4.5 is the value which leaves 17 items on each side.
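
Cases (ii) and (iii) can both be handled by treating the (N + 1)/2-th position like the position in an individual series: look up the items on either side of it and average them. The Python sketch below uses hypothetical function names and is only one way of expressing the rule described in the text.

```python
import math
from itertools import accumulate


def item_at(pairs, cumulative, k):
    """Value of the k-th item (1-based) of the series described by (value, f) pairs."""
    for (value, _), cf in zip(pairs, cumulative):
        if cf >= k:          # first cumulative frequency that reaches k
            return value


def median_discrete_general(values, freqs):
    """Discrete-series median that copes with zero frequencies and half positions."""
    pairs = sorted(zip(values, freqs))
    cumulative = list(accumulate(f for _, f in pairs))
    position = (cumulative[-1] + 1) / 2
    low = item_at(pairs, cumulative, math.floor(position))
    high = item_at(pairs, cumulative, math.ceil(position))
    return (low + high) / 2   # equals the item itself when position is whole


# Example 21: zero frequencies; the 7th item is 6.
print(median_discrete_general([5, 6, 7, 8, 10], [4, 3, 0, 0, 6]))
# Example 22: the 17.5th item lies between 4 and 5.
print(median_discrete_general([2, 3, 4, 5, 6, 7, 8, 9, 10],
                              [9, 6, 2, 2, 2, 4, 3, 3, 3]))
```

The two calls should print 6.0 and 4.5, matching Examples 21 and 22. A value with zero frequency can never be returned, because its cumulative frequency only repeats that of the preceding value.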
(iv) Median in an inclusive series: In the case of inclusive class-intervals, the series should be
converted into exclusive class-intervals, so that the true lower limit may be used for $L_1$ in the
formula.
Example 23: Find out the median from the following data:

| Marks | 10-14 | 15-19 | 20-24 | 25-29 | 30-34 |
|---|---|---|---|---|---|
| No. of students | 5 | 8 | 15 | 10 | 4 |

Solution:

| Class Interval | f | c.f. |
|---|---|---|
| 9.5-14.5 | 5 | 5 |
| 14.5-19.5 | 8 | 13 |
| 19.5-24.5 | 15 | 28 |
| 24.5-29.5 | 10 | 38 |
| 29.5-34.5 | 4 | 42 |

Median No. (m) = $\frac{N}{2}$th item = $\frac{42}{2}$th item = 21st item

Median No. 21 lies in c.f. 28. So 19.5-24.5 is the median class. Applying the formula, where m is the median number and c is the cumulative frequency of the class preceding the median class:

$$M = L_1 + \frac{i}{f}(m - c) = 19.5 + \frac{5}{15}(21 - 13) = 19.5 + \frac{5}{15} \times 8 = 22.17$$

(v) Median in Unequal class-intervals: If the class intervals are unequal, the frequencies need not
be adjusted to make the class-intervals equal unless this is specified in the question, and the same
formula can be applied as discussed earlier.

Example 24: Find out the median from the following frequency distribution with unequal
class-intervals:

| Class | f |
|---|---|
| 0-4 | 10 |
| 4-6 | 15 |
| 6-12 | 30 |
| 12-15 | 15 |
| 15-20 | 10 |
| 20-40 | 6 |

Solution:

| Class | f | c.f. |
|---|---|---|
| 0-4 | 10 | 10 |
| 4-6 | 15 | 25 |
| 6-12 | 30 | 55 |
| 12-15 | 15 | 70 |
| 15-20 | 10 | 80 |
| 20-40 | 6 | 86 |

M.No. = $\frac{N}{2} = \frac{86}{2}$ = 43rd item

M.No. 43 lies in c.f. 55. So the median class is 6-12, and by applying the formula to this class we get:

$$M = L_1 + \frac{i}{f}(m - c) = 6 + \frac{6}{30}(43 - 25) = 6 + 3.6 = 9.6$$

(vi) Zero frequency in continuous series: If in a continuous series any frequency is zero
and the median number lies in the c.f. corresponding to the zero frequency, the class interval having
zero frequency is eliminated and its width is divided equally between the preceding and
succeeding class intervals, as explained in the following example:

Example 25: Find out the value of the median from the following frequency distribution:

| Class | f |
|---|---|
| 0-5 | 3 |
| 5-10 | 4 |
| 10-15 | 6 |
| 15-20 | 12 |
| 20-25 | 0 |
| 25-30 | 14 |
| 30-35 | 6 |
| 35-40 | 5 |

Solution:

| Class | f | c.f. |
|---|---|---|
| 0-5 | 3 | 3 |
| 5-10 | 4 | 7 |
| 10-15 | 6 | 13 |
| 15-20 | 12 | 25 |
| 20-25 | 0 | 25 |
| 25-30 | 14 | 39 |
| 30-35 | 6 | 45 |
| 35-40 | 5 | 50 |

M.No. = $\frac{N}{2} = \frac{50}{2}$ = 25th item

The 25th item lies in the c.f. of two class intervals, viz., 15-20 and 20-25. The class interval
having zero frequency (20-25) is eliminated and its width is divided equally between the
class-intervals 15-20 and 25-30, as follows:

| Class | f | c.f. |
|---|---|---|
| 0-5 | 3 | 3 |
| 5-10 | 4 | 7 |
| 10-15 | 6 | 13 |
| 15-22.5 | 12 | 25 |
| 22.5-30 | 14 | 39 |
| 30-35 | 6 | 45 |
| 35-40 | 5 | 50 |

Now M.No. 25 lies in c.f. 25 and its corresponding class is 15-22.5. By applying the
formula:

$$M = L_1 + \frac{i}{f}(m - c) = 15 + \frac{7.5}{12}(25 - 13) = 15 + 7.5 = 22.5$$
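
The adjustment used in Example 25 can be sketched in Python as a pre-processing step followed by the usual interpolation. Both function names are hypothetical. As a simplifying assumption, the sketch merges every isolated interior class with zero frequency, whereas the text applies the adjustment only when the median number falls on such a class.

```python
def merge_zero_frequency_classes(classes):
    """Split each interior zero-frequency class equally between its neighbours."""
    adjusted = [list(c) for c in classes]      # [lower, upper, frequency]
    i = 1
    while i < len(adjusted) - 1:
        lower, upper, f = adjusted[i]
        if f == 0:
            midpoint = (lower + upper) / 2
            adjusted[i - 1][1] = midpoint      # preceding class now ends at the midpoint
            adjusted[i + 1][0] = midpoint      # succeeding class now starts there
            del adjusted[i]
        else:
            i += 1
    return [tuple(c) for c in adjusted]


def median_continuous(classes):
    """Interpolated median: L1 + (N/2 - cf) / f * i."""
    half = sum(f for _, _, f in classes) / 2
    cf_before = 0
    for lower, upper, f in classes:
        if f > 0 and cf_before + f >= half:
            return lower + (half - cf_before) / f * (upper - lower)
        cf_before += f


example_25 = [(0, 5, 3), (5, 10, 4), (10, 15, 6), (15, 20, 12),
              (20, 25, 0), (25, 30, 14), (30, 35, 6), (35, 40, 5)]
merged = merge_zero_frequency_classes(example_25)
print(merged[3], merged[4])          # the widened classes 15-22.5 and 22.5-30
print(median_continuous(merged))
```

After merging, the median class is 15-22.5 with f = 12 and cf = 13, so the result should be 15 + (7.5/12) × 12 = 22.5, as in the worked solution.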

Merits of Median
1. Like an ideal average median is rigidly defined.
2. It is easy to understand and calculate.
3. It can be located by inspection in many cases.
4. It can be calculated even if the values of the extreme items are not known, but the
number of items should be known.
5. Median is not influenced by the values of extreme items. Sometimes it is more
representative than the arithmetic mean. Because the median is a positional average and is not
affected by extreme items, it is very useful in the case of skewed, J-shaped
or inverted J-shaped distributions.
6. Median is best suited to those areas where direct quantitative measurement is not
possible. It is not possible to measure intelligence directly, but it is possible to
arrange a group of persons in ascending or descending order of intelligence and so locate
the person of median intelligence.
7. Median can be computed for a distribution with open-end classes,
whereas the arithmetic mean cannot.

Demerits of Median
1. When there are big variations between the values of different items, then median is
not a representative average of a series.
2. Median is not suitable for further algebraic treatment. For example, we cannot find
out the total value of the items if we know only their number and median, whereas it
is easily obtained from the arithmetic mean.
3. Median, in continuous series, has to be computed through interpolation. It is assumed
that in class interval, frequencies are uniformly spread over, but actually it may not be
true.
4. If big or small items in a series are to receive greater importance then median would
be an unsuitable average.
5. Median is more affected by fluctuation of sampling than the arithmetic average.
6. To put items in the ascending or descending order, sometimes, is not easy.
7. In case of even number series for an ungrouped data, median cannot be determined
accurately, we can only estimate it.
Mode
Mode like median is also a positional measure. The most frequently occurring item of the
series is known as mode. That means the item which is repeated maximum number of times in
the series will be the mode of the series. It helps in determining the popularity of a commodity.
If one value occurs more frequently than any other value, the distribution is called
unimodal. In case two different values have equal and maximum frequencies associated with
them, the distribution is known as bimodal. In a similar manner we can extend the definition to
tri-modal or multimodal distributions. If all values of the series are unique (a case of individual
observations), no mode will exist, or the mode will be indeterminate.
Definition
In the words of Croxton and Cowden, “The mode of a distribution is the value at the
point around which the items tend to be most heavily concentrated. It may be regarded as the
most typical of a series of values”.

Methods of Calculation of Mode

The value of mode can be calculated by the following main methods:


1. Locating the most frequently repeated value in the array by grouping method,
2. Calculation of Mode by interpolation,
3. Locating the mode by graphic method, and
4. Estimating the mode from Mean and Median.
Calculation of Mode
(a) Individual Observations
The value occurring maximum number of times is the modal value. This can be known
by inspection. In case the number of items is large, the series can be converted into discrete or
continuous series, where mode can be found out. In case of individual observations, mode can be
determined by inspection or by converting them into discrete series.
Example 1: Calculate mode from the following data of the marks of the students:
Sr. No.:         1   2   3   4   5   6   7   8   9   10
Marks Obtained:  14  22  34  18  19  34  22  34  14  10
Solution:
By Inspection:
It can be observed that 34 occurs most frequently, that is, 3 times. Hence the modal value is
34 marks. Alternatively, the data can be converted into a discrete series and the
modal value found as follows:
Marks (X) F
10 1
14 2
18 1
19 1
22 2
34 3
From the above array it can be observed that the frequency of 34 is 3, which is highest.
Hence the mode is 34.
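As a quick illustration, the conversion of individual observations into a discrete series can be sketched in Python with `collections.Counter`. The helper name `modes` is an assumption made for this sketch; it returns every value that shares the highest frequency, and an empty list when all values are unique (the indeterminate case described earlier).

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list if every value is unique."""
    counts = Counter(values)          # the discrete series: value -> frequency
    top = max(counts.values())
    if top == 1:
        return []                     # no value repeats, so no mode exists
    return sorted(v for v, c in counts.items() if c == top)

marks = [14, 22, 34, 18, 19, 34, 22, 34, 14, 10]
print(modes(marks))                   # [34]
```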
(b) Discrete Series
In discrete series, mode can be known either by inspection method or by grouping
method. Inspection Method means to look to that value of the series around which the items are
most heavily concentrated. It will be true:
a. If there is a gradual rise or fall in the sequence of frequencies.
b. If the highest frequency and the next highest frequency are not too close i.e.,
difference should be more than four.
c. If maximum frequency occurs in the very beginning or at the very end.
d. If maximum frequency is repeated.

Example 2:
X: 6 7 8 9 10 11
f: 5 10 15 17 14 10

In the above series, it is clear that the modal size is 9. This value 9 has occurred the
maximum number of times, i.e., 17. But an error can be committed if the difference between the
maximum frequency and the frequency preceding or succeeding it is very small and the items are
heavily concentrated on either side. In that case it is desirable to apply the grouping method,
preparing a Grouping Table and an 'Analysis Table' to determine the mode. Let us learn how to
prepare a Grouping Table:

A grouping table has six columns:

Column I: The original frequencies are taken and the maximum frequency is encircled.

Column II: Frequencies are added in twos.

Column III: Leave the first frequency, and add the remaining frequencies in twos.

Column IV: The frequencies are added in threes.

Column V: Leave the first frequency, and add the remaining in threes.

Column VI: Leave the first two frequencies and add the remaining in threes.

In each case take the maximum total and put it in a circle or a box, or underline it. Once the
Grouping Table is prepared, an analysis table is drawn out of it: for each of the six columns, the
values of X that make up the maximum total are marked in the relevant boxes, and the value marked
the greatest number of times is the mode. The whole procedure can be made clear with
the help of Example 2:

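Grouping table:

X     I (f)   II      III     IV      V       VI
6     5       15              30
7     10              25              42
8     15      32                              46
9     17              31      41
10    14      24
11    10

Each total is written against the first item of its group: for example, 15 in column II is
5 + 10 (X = 6 and 7), and 46 in column VI is 15 + 17 + 14 (X = 8, 9 and 10). The maximum in
each column is 17 (I), 32 (II), 31 (III), 41 (IV), 42 (V) and 46 (VI).

Analysis table:

Col. No.   6    7    8    9    10   11
I                         x
II                   x    x
III                       x    x
IV                        x    x    x
V               x    x    x
VI                   x    x    x
Total      0    1    3    6    3    1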

Since the value 9 has occurred maximum number of times i.e. 6, the modal value is 9.
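The grouping and analysis tables can also be reproduced programmatically. The following Python sketch is only an illustration under stated assumptions: the function name `grouping_analysis` is hypothetical, incomplete groups at the end of a column are dropped (as in the table above), and ties for a column maximum mark every tied group.

```python
def grouping_analysis(values, freqs):
    """Count how often each value falls in the maximum group of columns I to VI."""
    n = len(freqs)
    tally = {v: 0 for v in values}
    # (group size, number of leading frequencies left out) for columns I..VI
    columns = [(1, 0), (2, 0), (2, 1), (3, 0), (3, 1), (3, 2)]
    for size, skip in columns:
        # index ranges of the complete groups in this column
        groups = [range(i, i + size) for i in range(skip, n - size + 1, size)]
        totals = [sum(freqs[j] for j in g) for g in groups]
        best = max(totals)
        for g, t in zip(groups, totals):
            if t == best:
                for j in g:
                    tally[values[j]] += 1   # one 'x' in the analysis table
    return tally

x = [6, 7, 8, 9, 10, 11]
f = [5, 10, 15, 17, 14, 10]
result = grouping_analysis(x, f)
print(result)                       # {6: 0, 7: 1, 8: 3, 9: 6, 10: 3, 11: 1}
print(max(result, key=result.get))  # 9
```

The printed tally matches the Total row of the analysis table, so the modal value is 9.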
(c) Continuous Series
In a continuous series, the determination of mode requires one more step than that used in
discrete series. Like discrete series, the modal class i.e., the one with maximum concentration is
found out by the process of grouping. Like median, in case of mode too, we have to interpolate
the value of mode in continuous series. But this method is not exact when the size of the class
interval is changed. In that case the modal class will also change.

The mode is calculated in either of the following two ways:

(i) By adding to the lower limit of the modal class (in case the series is in ascending order):

$$Z = L_1 + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times (L_2 - L_1) \quad \text{or} \quad Z = L_1 + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times C$$

where $\Delta_1 = |f_1 - f_0|$ and $\Delta_2 = |f_1 - f_2|$ (ignoring sign).

(ii) By subtracting from the upper limit of the modal class (in case the series is in descending
order):

$$Z = L_2 - \frac{f_1 - f_2}{2f_1 - f_0 - f_2} \times (L_2 - L_1) \quad \text{or} \quad Z = L_2 - \frac{\Delta_2}{\Delta_1 + \Delta_2} \times C$$

Where: C = Class interval or magnitude

L1 = Lower limit of the modal class

L2 = Upper limit of the modal class

f1 = frequency of the modal class

f0 = frequency of the class preceding the modal class

f2 = frequency of the class succeeding the modal class.

Points to Remember
a. Classes should be converted to exclusive, if they are in inclusive class intervals.
b. Length of classes should be equal.
c. Series should be in ascending or descending order.
d. If series is cumulative, then convert it into continuous series.
e. If first class is the modal class, the f0 will be taken as zero. Similarly if last class is
modal class then f2 is taken zero.
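The two interpolation formulas for the mode give the same answer, which a small Python sketch can confirm. The function name and the figures below (a modal class 20-30 with f1 = 28, f0 = 15 and f2 = 16) are illustrative assumptions rather than an example from the text, and the sketch assumes that f1 is strictly greater than at least one neighbouring frequency, so the denominator is not zero.

```python
def mode_by_interpolation(l1, l2, f0, f1, f2):
    """Return the mode measured from the lower and from the upper class limit."""
    c = l2 - l1                        # class interval C
    d1 = abs(f1 - f0)                  # delta 1, sign ignored
    d2 = abs(f1 - f2)                  # delta 2, sign ignored
    from_lower = l1 + d1 / (d1 + d2) * c
    from_upper = l2 - d2 / (d1 + d2) * c
    return from_lower, from_upper

# Hypothetical modal class 20-30: f1 = 28, preceded by f0 = 15, followed by f2 = 16
z_lower, z_upper = mode_by_interpolation(20, 30, 15, 28, 16)
print(round(z_lower, 2), round(z_upper, 2))   # 25.2 25.2
```

Per point (e) above, passing f0 = 0 or f2 = 0 covers the case where the modal class is the first or the last class.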

If the modal value lies in a class other than the one containing the maximum frequency,
the following method is suggested:

$$Z = L_1 + \frac{f_2}{f_0 + f_2} \times C \quad \text{(Important)}$$
Merits and Limitations of Mode
Merits
We know that an average must possess some ideal qualities. Out of these many ideals,
mode possesses only a few. The main merits of mode are:
(i) It is simple to calculate. This means it can be determined without much mathematical
calculation. In most of the cases, it can be located by inspection.
(ii) It is commonly understood and is used by people in their day to day life. The average
size of garments, shoes, average number of accidents etc. are the common instances.
(iii) Mode is the most common item of a series; it is not an isolated example like the
median. Unlike mean, it cannot be a value which is not found in the series.
(iv) Mode is not affected by the values of extreme items and as such it is preferred over
mean.
(v) For ascertaining mode, it is not necessary to know the value of all the items in a
series. What we need is point of maximum concentration which determines mode.
(vi) Mode can be determined in open-end classes without knowing the class limits.
(vii) Mode can be used to describe qualitative phenomenon.
(viii) Mode can also be determined graphically.

Demerits or Limitations

The main demerits or limitations are:


(i) Mode cannot be determined always. There can be bimodal or trimodal or multimodal
series as well.
(ii) Mode is not capable of algebraic treatment, as mean is. For
instance, from the modal values and sizes of two or more series, we cannot find the mode
of the combined series as we can in case of mean.
(iii) Modal value is not based on each and every item of the series. Even in case of
continuous frequency distribution formula, mode depends on the frequencies of
modal class f1 and the classes preceding (f0) and succeeding (f2).
(iv) Mode is also not rigidly defined. There are different methods of calculating mode, but
all of them do not render the same results. Mode is ill-defined if the maximum
frequency is repeated or if the maximum frequency occurs either in the beginning or
at the end of the distribution or if the distribution is irregular. In such cases, the value
of mode is located by the method of grouping. If the grouping method gives two
values of mode, then it is called bimodal distribution. If grouping results in more than
two modes, it is called multi-modal distribution. In such cases mode can be estimated
only by empirical relation, i.e. Mode = 3 Median – 2 Mean
(v) It is also stated that mode is ill-defined and indeterminate.
As compared to mean, mode is affected to a large extent by the fluctuations of
sampling.
1.3 Dispersion: Range, Inter Quartile Range, Quartile Deviation, Mean Deviation
and Standard Deviation, Coefficient of variation, Lorenz Curve.
Dispersion indicates the measure of the extent to which individual items differ. It indicates lack
of uniformity in the size of items. According to Brooks and Dick “Dispersion or spread is the
degree of the scatter or variation of the variability”. Since measures of dispersion give an average
of differences of various items from an average, they are termed as averages of the second order.

Objectives of Measuring Dispersion

The purposes or objectives to measure dispersion or variation are as follows:


1. To Measure the Reliability of an Average: Dispersion tells us how far an
average is representative of the mass of data. When the dispersion is small, the
average is a typical value in the sense that it is a good estimate of the average
in the universe from which data have been taken.
2. To Serve as a Basis for Control of the Variability: In order to control the
variation or dispersion of a phenomenon it is necessary to determine the nature
and cause of variation. The measurement of inequality in the distribution of
income and wealth requires the measures of variation. Similarly variations in
body temperature, blood pressure etc., are noted for proper diagnosis.
3. To Compare Two or More Series with Regard to their Variability: The study
of dispersion is essential for determining the degree of consistency, uniformity,
reliability etc. A low degree of variation means more uniformity, consistency and
reliability of data, whereas a high degree of variation indicates a lack of uniformity,
consistency and reliability.
4. To facilitate the use of Other Statistical Techniques: The study of dispersion
helps in the application of various statistical tools like correlation, regression,
statistical quality control etc.
Characteristics of a Good Measure of Dispersion

A good measure of dispersion should possess the following characteristics:


i. It should be simple to understand;
ii. It should be easy to calculate;
iii. It should be rigidly defined;
iv. It should be based on each and every item of the distribution;
v. It should be suitable for algebraic and arithmetical manipulation;
vi. It should have sampling stability;
vii. It should not be unduly affected by extreme items.

Measures of Dispersion – Absolute and Relative


Absolute Measures and Relative Measure
The absolute measures of dispersion can be compared with one another only if the two
belong to the same population and are expressed in the same units like Inches, Kilograms,
Rupees etc. Absolute measures of dispersion do not help us if the series are of different
population or units of measurement. In order to make them comparable a measure of relative
dispersion is needed by dividing the absolute measure of dispersion by a measure of central
tendency, say, mean, median, mode etc.

The relative measures of dispersion can be found only by calculating:


A. Positional Measures
• Coefficient of Range
• Coefficient of Quartile Deviation
B. Calculated Measures
• Coefficient of Mean Deviation
• Coefficient of Standard Deviation
• Coefficient of Variation
• Lorenz Curve
Let us study them one by one in detail:
A. Positional Measures

Range

Range is the simplest measure of dispersion. It is defined as the difference between the two
extreme items of the distribution, that is, between the highest and the lowest values of the
series.

Absolute Range = Highest Value – Lowest Value

or R = H.V. – L.V.
The relative measure corresponding to range is called the coefficient of range (also known as
the ratio of the range or the coefficient of scatterness). It is obtained by applying the formula:

$$\text{Coefficient of Range} = \frac{\text{H.V.} - \text{L.V.}}{\text{H.V.} + \text{L.V.}}$$

If the averages of two distributions are close to each other, a comparison of the ranges
shows that the distribution with the smaller range has less dispersion. The average of that
distribution is more typical of the group.

Individual Series:

Example1: Find the range and its coefficient for the following observations:

65, 72, 102, 39, 84, 79, 27, 40, 155 and 60.

Solution:

Calculation of Range

Range = H.V. – L.V. = 155 – 27 = 128

Where: H.V. = Highest Value and L.V. = Lowest Value

$$\text{Coefficient of Range} = \frac{\text{H.V.} - \text{L.V.}}{\text{H.V.} + \text{L.V.}} = \frac{155 - 27}{155 + 27} = \frac{128}{182} = 0.703$$

Discrete Series: In this series range is measured on the basis of smallest and largest value. There
is no effect of frequencies on the measurement.

Example2: Find the Range and Coefficient of Range of the following distribution:

Daily wages : 300 400 500 600 700 800 900 1000
No. of workers: 35 30 20 10 6 3 2 1
Solution:

Calculation of Range and Coefficient of Range

Range = H.V. – L.V. = 1000 – 300 = Rs. 700

Where: H.V. = Highest Value and L.V. = Lowest Value

$$\text{Coefficient of Range} = \frac{\text{H.V.} - \text{L.V.}}{\text{H.V.} + \text{L.V.}} = \frac{1000 - 300}{1000 + 300} = \frac{700}{1300} = 0.54 \text{ approx.}$$
Continuous series: In this series the range is the difference between the upper limit of the
highest class (L) and the lower limit of the lowest class (S). If the series has inclusive intervals
they are changed to exclusive class intervals for finding out the values of L and S.
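The range and its coefficient for Examples 1 and 2 can be verified with a minimal Python sketch. The function name `range_and_coefficient` is an illustrative assumption; for a discrete series only the values are passed, since frequencies do not affect the range.

```python
def range_and_coefficient(highest, lowest):
    """Return (range, coefficient of range) for the given extreme values."""
    r = highest - lowest
    return r, r / (highest + lowest)

# Example 1: individual observations
obs = [65, 72, 102, 39, 84, 79, 27, 40, 155, 60]
r1, c1 = range_and_coefficient(max(obs), min(obs))
print(r1, round(c1, 3))     # 128 0.703

# Example 2: discrete series of daily wages (frequencies are not needed)
wages = [300, 400, 500, 600, 700, 800, 900, 1000]
r2, c2 = range_and_coefficient(max(wages), min(wages))
print(r2, round(c2, 3))     # 700 0.538
```

For a continuous series the same function would be called with the upper limit of the highest class and the lower limit of the lowest class.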

Quartile Deviation

Range is a crude measure because it takes into account only two extreme values, i.e., the
largest and the smallest. The effect of extreme values on range can be avoided if we use the
inter-quartile range, which is equal to the difference between the third and the first quartiles.

Inter-quartile range = Q3 – Q1. But this is not a common measure of dispersion.

Semi Inter-quartile Range or Quartile Deviation

By using quartile deviation, the dependence on extreme items can be avoided. The lower
quarter of the data, i.e., up to Q1, and the upper quarter, i.e., after Q3, are left out of
consideration. The interval between Q1 and Q3 includes the middle 50% of the frequencies. Thus
extreme items at either end of the series cannot influence the value of quartile deviation. It is
only the middle half of the data, i.e., Q3 – Q1, which is needed for calculating quartile deviation.
Quartile deviation is half of the difference between Q3 and Q1 of the series and hence it is also
known as the semi inter-quartile range:

$$\text{Quartile Deviation (Q.D.)} = \frac{Q_3 - Q_1}{2}$$

The quartile deviation gives the average amount by which the two quartiles differ from the
median. For a symmetrical distribution (such as the normal distribution) we will have:

Q3 – M = M – Q1

Where: M = Median, Q1 = Lower quartile and Q3 = Upper quartile

This means M ± Q.D. covers exactly 50% of the items, since 25% of the items are below
Q1 and 25% of the items are above Q3.
Coefficient of Quartile Deviation
Quartile Deviation is an absolute measure of dispersion. Its relative measure is the
coefficient of Quartile Deviation. It is shown as:

$$\text{Coefficient of Quartile Deviation} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$

Coefficient of quartile deviation is studied to compare the degree of variation in different
series.
Example3: Find out the value of quartile deviation, its coefficient from the following data:
Salary (Rs.) 210 680 620 400 310 340 120 160 280 520 870
Solution:
Calculation of Quartile Deviation and its coefficient

The salaries are arranged in ascending order as follows:
120 160 210 280 310 340 400 520 620 680 870

$$Q_1 = \text{Size of the } \left(\frac{N + 1}{4}\right)\text{th item} = \text{Size of the } \left(\frac{11 + 1}{4}\right)\text{th item} = \text{Size of the 3rd item} = 210$$

$$Q_3 = \text{Size of the } \frac{3(N + 1)}{4}\text{th item} = \text{Size of the } \frac{3(11 + 1)}{4}\text{th item} = \text{Size of the } \frac{36}{4}\text{th item} = \text{Size of the 9th item} = 620$$

$$\text{Q.D.} = \frac{Q_3 - Q_1}{2} = \frac{620 - 210}{2} = \frac{410}{2} = \text{Rs. } 205$$

$$\text{Coefficient of Q.D.} = \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{620 - 210}{620 + 210} = \frac{410}{830} = 0.494 \text{ approx.}$$
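The calculation in Example 3 can be sketched in Python as follows. The helper names are illustrative assumptions. The text only deals with whole-number positions; the linear interpolation used here for fractional positions is a common textbook convention that this sketch adopts as an assumption.

```python
def item_at(sorted_values, position):
    """Value at a 1-based position; fractional positions are interpolated linearly."""
    lower = int(position)
    frac = position - lower
    value = sorted_values[lower - 1]
    if frac:
        value += frac * (sorted_values[lower] - value)
    return value

def quartile_deviation(values):
    """Return (Q1, Q3, Q.D., coefficient of Q.D.) for an individual series."""
    data = sorted(values)
    n = len(data)
    q1 = item_at(data, (n + 1) / 4)        # size of the (N+1)/4 th item
    q3 = item_at(data, 3 * (n + 1) / 4)    # size of the 3(N+1)/4 th item
    return q1, q3, (q3 - q1) / 2, (q3 - q1) / (q3 + q1)

salaries = [210, 680, 620, 400, 310, 340, 120, 160, 280, 520, 870]
q1, q3, qd, coeff = quartile_deviation(salaries)
print(q1, q3, qd, round(coeff, 3))         # 210 620 205.0 0.494
```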

Example4: Calculate the Quartile Deviation and its coefficient from the following data:

Marks:            0-10  10-20  20-30  30-40  40-50
No. of Students:  4     15     28     16     7
Solution: Calculation of Quartile Deviation and its coefficient

Marks (x)   No. of Students (f)   Cumulative Frequency (cf)
0-10        4                     4
10-20       15                    19
20-30       28                    47
30-40       16                    63
40-50       7                     70

Q1 = Size of the (N/4)th item = Size of the (70/4)th item = 17.5th item

∴ Q1 class = 10-20

$$Q_1 = L_1 + \frac{\frac{N}{4} - cf}{f} \times C$$

Here L1 = 10, N/4 = 17.5, cf = 4, f = 15, C = 10

$$\therefore Q_1 = 10 + \frac{17.5 - 4}{15} \times 10 = 10 + \frac{135}{15} = 10 + 9 = 19$$

Q3 = Size of the (3N/4)th item = Size of the (3 × 70/4)th item = Size of the 52.5th item

∴ Q3 class = 30-40

$$\therefore Q_3 = L_1 + \frac{\frac{3N}{4} - cf}{f} \times C = 30 + \frac{52.5 - 47}{16} \times 10 = 30 + \frac{55}{16} = 30 + 3.44 = 33.44$$

$$\text{Q.D.} = \frac{Q_3 - Q_1}{2} = \frac{33.44 - 19}{2} = 7.22 \text{ marks}$$

$$\text{Coefficient of Q.D.} = \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{33.44 - 19}{33.44 + 19} = \frac{14.44}{52.44} = 0.275$$
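For a continuous series the quartiles are interpolated within their classes, as in Example 4. A minimal Python sketch follows; the function name `grouped_quantile` and the `(lower, upper, frequency)` layout are assumptions made for illustration. Because the sketch keeps Q3 unrounded (33.4375), its Q.D. differs from the hand calculation only in the third decimal place.

```python
def grouped_quantile(classes, fraction):
    """Interpolated value below which the given fraction of items lies."""
    n = sum(f for _, _, f in classes)
    target = fraction * n                  # N/4 for Q1, 3N/4 for Q3
    cum = 0                                # cf of the classes before the current one
    for lower, upper, f in classes:
        if f > 0 and cum + f >= target:
            return lower + (target - cum) / f * (upper - lower)
        cum += f
    raise ValueError("the distribution has no observations")

marks = [(0, 10, 4), (10, 20, 15), (20, 30, 28), (30, 40, 16), (40, 50, 7)]
q1_marks = grouped_quantile(marks, 0.25)
q3_marks = grouped_quantile(marks, 0.75)
qd_marks = (q3_marks - q1_marks) / 2
coeff_marks = (q3_marks - q1_marks) / (q3_marks + q1_marks)
print(q1_marks, q3_marks)                         # 19.0 33.4375
print(round(qd_marks, 2), round(coeff_marks, 3))  # 7.22 0.275
```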
Range and Quartile Deviation suffer from the limitation that they are based on only two values
of a series. In case of range, the two extremes are taken into account; in case of quartile
deviation, Q1 and Q3 are taken into consideration. Other values of the series are ignored. A
measure which takes the deviation of every item from an average into account is called 'average
deviation', and in this respect it is a better measure.
According to Clark and Schkade, "Average deviation is the average amount of scatter of
the items in a distribution from either the mean or the median, ignoring the signs of the
deviations. The average that is taken of the scatter is an arithmetic mean, which accounts for the
fact that this measure is often called the mean deviation".

Mean Deviation

a. Calculation of Mean Deviation by direct method in Individual Series:

Steps:

(i) Compute mean or median or mode of the series.

(ii) Find the deviations of each item from mean or median or mode and add them, ignoring
plus and minus signs, to obtain ∑|d| (|d| is read 'modulus d').

(iii) Apply the formula:

$$\delta \ (\text{M.D.}) = \frac{\sum |d|}{N}$$

(iv) For the coefficient of M.D., which is a relative measure, divide M.D. by the average from
which the deviations were taken. If deviations are taken from mean:

$$\text{Coefficient of M.D.} = \frac{\text{M.D.}}{\bar{X}}$$

If deviations are taken from mode:

$$\text{Coefficient of M.D.} = \frac{\text{M.D.}}{\text{Mode}}$$

If deviations are taken from median:

$$\text{Coefficient of M.D.} = \frac{\text{M.D.}}{\text{Median}}$$
Example5: The following are the rates charged by a transport company for the various types of
transport services it provides: Rs. 300, 200, 700, 1000, 600, 400, 800. Calculate the mean
deviation and its coefficient.

Rates (Rs.) in ascending order (X)    Deviations from Median = 600, signs ignored |d|
200                                   400
300                                   300
400                                   200
600                                   0
700                                   100
800                                   200
1000                                  400
                                      ∑|d| = 1600

$$\text{Median} = \text{Size of the } \left(\frac{N + 1}{2}\right)\text{th item} = \text{Size of the } \left(\frac{7 + 1}{2}\right)\text{th item} = \text{Size of the 4th item} = 600$$

∴ Median = 600

$$\text{M.D.} = \frac{\sum |d|}{N} = \frac{1600}{7} = \text{Rs. } 228.57$$

$$\text{Coefficient of M.D.} = \frac{\text{M.D.}}{\text{Median}} = \frac{228.57}{600} = 0.38$$
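The direct method of Example 5 translates into a few lines of Python. This is a minimal sketch; the function name `mean_deviation` is an illustrative assumption, and `statistics.median` from the standard library supplies the median.

```python
from statistics import median

def mean_deviation(values, centre):
    """Average of the absolute deviations of the values from the given centre."""
    return sum(abs(x - centre) for x in values) / len(values)

rates = [300, 200, 700, 1000, 600, 400, 800]
med = median(rates)                      # 600
md = mean_deviation(rates, med)
print(med, round(md, 2), round(md / med, 2))   # 600 228.57 0.38
```

Passing the mean or the mode as `centre` gives the mean deviation from those averages in the same way.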

Short-cut Method- The process of calculation of M.D. in individual series by short-cut method:

(i) On the basis of median- (a) The items are arranged in ascending order and the value of the
median is obtained. (b) The values greater than the median (ΣXA) and the values less than the
median (ΣXB) are added separately. (c) Finally, M.D. is obtained by applying the following formula:

$$\text{M.D. (Median) or } \delta_M = \frac{\sum X_A - \sum X_B}{N}$$

(ii) On the basis of mean- (a) Arithmetic mean (X̄) of the series is calculated. (b) The total
of the values greater than the mean (ΣXA) and the total of the values less than the mean (ΣXB) are
obtained. (c) The number of items greater than the mean (NA) and the number of items less than the
mean (NB) are also found out. (d) M.D. is computed by applying the following formula:

$$\text{M.D. (Mean) or } \delta_{\bar{X}} = \frac{\sum X_A - \sum X_B - (N_A - N_B)\bar{X}}{N}$$

Let us solve the above example following this method:

Rates (Rs.) (X) in ascending order    Calculation from Median (M = 600)
200
300                                   ∑XB = 200 + 300 + 400 = 900
400
600                                   (median, left out)
700
800                                   ∑XA = 700 + 800 + 1000 = 2500
1000

Since the median value is 600, it is left out. The median divides the data into two parts, i.e.,
∑XA and ∑XB. The mean deviation will be:

$$\text{M.D. (Median) or } \delta_M = \frac{\sum X_A - \sum X_B}{N} = \frac{2500 - 900}{7} = 228.57$$

If mean is used, then:

$$\bar{X} = \frac{\sum X}{N} = \frac{4000}{7} = 571.43$$

571.43 lies between 400 and 600.

Rates (Rs.) (X) in ascending order    Calculation from Mean (X̄ = 571.43)
200
300                                   ∑XB = 200 + 300 + 400 = 900
400
600
700                                   ∑XA = 600 + 700 + 800 + 1000 = 3100
800
1000

NA = 4 and NB = 3

Applying the formula:

$$\text{M.D. (Mean) or } \delta_{\bar{X}} = \frac{\sum X_A - \sum X_B - (N_A - N_B)\bar{X}}{N} = \frac{3100 - 900 - (4 - 3) \times 571.43}{7} = 232.65$$
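A short Python sketch can confirm that the short-cut formula agrees with the direct method for the same rates. The function names are illustrative assumptions; values equal to the chosen average are left out of both groups, as the median (600) is in the working above.

```python
def md_direct(values, centre):
    """Mean deviation by the direct method."""
    return sum(abs(x - centre) for x in values) / len(values)

def md_shortcut(values, centre):
    """Mean deviation from sums of the values above and below the centre."""
    above = [x for x in values if x > centre]      # items making up sum X_A
    below = [x for x in values if x < centre]      # items making up sum X_B
    correction = (len(above) - len(below)) * centre
    return (sum(above) - sum(below) - correction) / len(values)

rates = [200, 300, 400, 600, 700, 800, 1000]
mean_rate = sum(rates) / len(rates)                # 571.43 approx.
print(round(md_shortcut(rates, 600), 2))           # 228.57
print(round(md_shortcut(rates, mean_rate), 2))     # 232.65
```

When the centre is the median of a series like this one, the counts above and below are equal, the correction term vanishes and the formula reduces to (∑XA − ∑XB)/N.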

b. Computation of Mean Deviation – Discrete Series

In discrete series, the formula is:

$$\text{M.D.} = \frac{\sum f|d|}{N}$$

where |d| denotes deviations from Mean, Median or Mode ignoring ± signs.

Steps to be followed are:

(1) Calculate mean or median or mode.

(2) Take the deviations of the values of the variable from mean or median or mode, ignoring
± signs, and denote them by |d|.

(3) Multiply |d| by the respective frequencies and obtain the total, i.e., ∑f|d|.

(4) Divide the total ∑f|d| by the total number of items, i.e.,

$$\text{M.D.} = \frac{\sum f|d|}{N}$$

(5) For a relative measure, calculate the Coefficient of M.D., i.e.,

$$\text{Coefficient of M.D.} = \frac{\text{M.D.}}{\bar{X} \text{ or Median or Mode}}$$

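c. Calculation of Mean Deviation by direct method – Continuous Series

Calculation of mean deviation in continuous series is just like that in discrete series. Here the
mid-points of the different classes are taken as the values of X. The deviations are taken from
mean, median or mode. The formula for calculating M.D. or coefficient of M.D. is the same as in
case of discrete series:

$$\text{M.D.} = \frac{\sum f|d|}{N} \quad \text{or} \quad \frac{\sum f|d|}{N} \times C \ \text{(when the deviations are measured in units of the class interval } C\text{)}$$

$$\text{Coefficient of M.D.} = \frac{\text{M.D.}}{\bar{X} \text{ or Median}}$$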

Example7: Calculate the mean deviation and its coefficient both from mean and median for the
following data:

Marks:            0-20  20-40  40-60  60-80  80-100
No. of Students:  10    16     30     32     12

Solution:
Calculation of M.D. and its Coefficient from Mean (assumed mean A = 50, class interval C = 20)

Marks     f     Mid-Value X   dx' = (X − A)/C   fdx'   |d| = |X − X̄|   f|d|
0-20      10    10            -2                -20    44              440
20-40     16    30            -1                -16    24              384
40-60     30    50            0                 0      4               120
60-80     32    70            +1                32     16              512
80-100    12    90            +2                24     36              432
N = 100                       ∑fdx' = 20                               ∑f|d| = 1888

$$\bar{X} = A + \frac{\sum fdx'}{N} \times C = 50 + \frac{20}{100} \times 20 = 54$$

$$\text{M.D.} = \frac{\sum f|d|}{N} = \frac{1888}{100} = 18.88 \text{ or 19 marks approx.}$$

$$\text{Coefficient of M.D.} = \frac{\text{M.D.}}{\bar{X}} = \frac{18.88}{54} = 0.35 \text{ approx.}$$

Calculation of M.D. and its Coefficient from Median

Marks     Mid-Value X   f     cf     |d| = |X − M|   f|d|
0-20      10            10    10     46              460
20-40     30            16    26     26              416
40-60     50            30    56     6               180
60-80     70            32    88     14              448
80-100    90            12    100    34              408
                        N or ∑f = 100                ∑f|d| = 1912

Median class = Size of the (N/2)th item = Size of the (100/2)th item, or the 50th item, which falls
in the 40-60 group.

Interpolating for Median:

$$\text{Median} = L_1 + \frac{\frac{N}{2} - cf}{f} \times C = 40 + \frac{50 - 26}{30} \times 20 = 40 + \frac{48}{3} = 40 + 16 = 56$$

$$\text{M.D.} = \frac{\sum f|d|}{N} = \frac{1912}{100} = 19.12 \text{ or 19 marks approx.}$$

$$\text{Coefficient of M.D.} = \frac{\text{M.D.}}{M} = \frac{19.12}{56} = 0.341$$
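The two results of Example 7 can be reproduced with a small Python sketch that works on class mid-values. The function name `grouped_mean_deviation` is an illustrative assumption; the median of 56 is taken from the interpolation above rather than recomputed.

```python
def grouped_mean_deviation(mids, freqs, centre):
    """Mean deviation of a frequency distribution about the given centre."""
    n = sum(freqs)
    return sum(f * abs(x - centre) for x, f in zip(mids, freqs)) / n

mids = [10, 30, 50, 70, 90]            # mid-values of 0-20, 20-40, ..., 80-100
freqs = [10, 16, 30, 32, 12]

mean_marks = sum(f * x for x, f in zip(mids, freqs)) / sum(freqs)   # 54.0
md_mean = grouped_mean_deviation(mids, freqs, mean_marks)
md_median = grouped_mean_deviation(mids, freqs, 56)                 # median from the text

print(mean_marks, md_mean, md_median)                  # 54.0 18.88 19.12
print(round(md_mean / mean_marks, 2), round(md_median / 56, 3))     # 0.35 0.341
```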
Calculation of M.D. by Short-cut Method in Discrete and Continuous Series-

The process of calculation of M.D. by short-cut method in discrete and continuous series is as
follows:

(1) Firstly, the average (Mean or Median) on the basis of which M.D. is to be computed
is calculated.

(2) In case of discrete series the values (X), and in case of continuous series the mid-values,
are multiplied by the respective frequencies (fX).

(3) The total of the products of values or mid-values greater than the average and
their respective frequencies is called ΣfXA, and the total of the products of values or mid-values
less than the average and their respective frequencies is known as ΣfXB.

(4) Then, we obtain ΣfA by adding the frequencies relating to values greater than the
average and ΣfB by adding the frequencies relating to values less than the average.

(5) Finally, the following formula is used:

$$\text{M.D. (Mean)} = \frac{\sum fX_A - \sum fX_B - (\sum f_A - \sum f_B)\bar{X}}{N}$$

$$\text{M.D. (Median)} = \frac{\sum fX_A - \sum fX_B - (\sum f_A - \sum f_B)M}{N}$$

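Example8: From the following marks of 60 students, calculate the Mean Deviation from Mean
and Median:

Marks:            0-10  10-20  20-30  30-40  40-50  50-60
No. of students:  6     7      12     20     10     5

Solution:

Marks     M.V. (X)   f     cf    fX
0-10      5          6     6     30
10-20     15         7     13    105
20-30     25         12    25    300
30-40     35         20    45    700
40-50     45         10    55    450
50-60     55         5     60    275
                     N = 60      ∑fX = 1860

$$\text{Mean} = \frac{\sum fX}{N} = \frac{1860}{60} = 31$$

$$\text{Median No.} = \frac{N}{2} = \frac{60}{2} = 30\text{th item}$$

$$M = 30 + \frac{10}{20}(30 - 25) = 32.5$$

The mid-values greater than both the mean and the median are 35, 45 and 55, so
ΣfXA = 700 + 450 + 275 = 1425 and ΣfA = 20 + 10 + 5 = 35. The mid-values less than both are
5, 15 and 25, so ΣfXB = 30 + 105 + 300 = 435 and ΣfB = 6 + 7 + 12 = 25.

$$\text{M.D. (Mean)} = \frac{\sum fX_A - \sum fX_B - (\sum f_A - \sum f_B)\bar{X}}{N} = \frac{1425 - 435 - (35 - 25) \times 31}{60} = \frac{680}{60} = 11.33$$

$$\text{M.D. (Median)} = \frac{\sum fX_A - \sum fX_B - (\sum f_A - \sum f_B)M}{N} = \frac{1425 - 435 - (35 - 25) \times 32.5}{60} = \frac{665}{60} = 11.08$$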
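Example 8 can be checked with a Python sketch of the short-cut formula for grouped data. The function name `grouped_md_shortcut` is an illustrative assumption, and the median of 32.5 is taken from the working above.

```python
def grouped_md_shortcut(mids, freqs, centre):
    """Short-cut mean deviation for a frequency distribution."""
    pairs = list(zip(mids, freqs))
    n = sum(freqs)
    fx_above = sum(f * x for x, f in pairs if x > centre)   # sum fX_A
    fx_below = sum(f * x for x, f in pairs if x < centre)   # sum fX_B
    f_above = sum(f for x, f in pairs if x > centre)        # sum f_A
    f_below = sum(f for x, f in pairs if x < centre)        # sum f_B
    return (fx_above - fx_below - (f_above - f_below) * centre) / n

mids = [5, 15, 25, 35, 45, 55]
freqs = [6, 7, 12, 20, 10, 5]
mean_60 = sum(f * x for x, f in zip(mids, freqs)) / sum(freqs)   # 31.0

md_from_mean = grouped_md_shortcut(mids, freqs, mean_60)
md_from_median = grouped_md_shortcut(mids, freqs, 32.5)
print(round(md_from_mean, 2), round(md_from_median, 2))          # 11.33 11.08
```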

Merits of Mean Deviation


1. It is relatively simple to calculate and easy to understand. It is very close to arithmetic
mean. In order to give information to persons having no knowledge of statistics the
measure of mean deviation is more useful.
2. It is based on all the items of the series. Any small change in the series
would affect the value of mean deviation.
3. It is less affected by the extreme items as compared to standard deviation.
4. Mean deviation is useful for comparison because the deviations are taken from actual
mean or median or mode.

Demerits
1. In mean deviation plus and minus signs are ignored, which is not justified mathematically.
This limitation makes mean deviation unsuitable for further mathematical treatment.
2. The mean deviation calculated from mode is not reliable, because mode in many
cases is indeterminate. Even the mean deviation calculated from median is not always
reliable. If we use arithmetic mean, then it loses scientific character because the sum
of deviations from mean is greater than the sum of deviations from median when plus
and minus signs are ignored.
3. If mean, median and mode are in fractions then the calculation of mean deviation
becomes cumbersome.
4. It is generally not useful for statistical inferences as it is not a satisfactory measure
when taken from mode or when dealing with a skewed distribution. Theoretically, mean
deviation gives us the best results when deviations are taken from median. But
median is not a satisfactory measure when the distribution has large variations.
5. From mean deviations of different groups of series, it is not possible to find out the
combined mean deviation of all groups taken together. It means, mean deviation is
not capable of further algebraic treatment.
6. It is rarely used for sociological studies.
7. In case of open-end classes mean deviation cannot be calculated.
8. It has a tendency to increase with the sample size, though not in the same ratio.

Mean deviation is nevertheless widely used in preparing common reports where people do not
know much about statistical techniques. It is also useful for small samples where no detailed
analysis is required.
Standard Deviation is one of the best methods of measuring dispersion. Firstly, it takes
into consideration all the values of the variable and, secondly, it does not ignore + or – signs.
Its main purpose is to overcome the zero sum, ∑(X − X̄) = 0. Instead of ignoring the ± signs as in
the calculation of mean deviation, we make all the deviations positive by squaring them. After
squaring, the deviations from the mean no longer give a zero sum but a positive number, and each
deviation contributes to the sum of squares regardless of sign. Then, in order to compensate
for squaring the deviations, the square root is taken. Thus standard deviation is "the square root of
the mean of the deviations squared, or the root-mean square deviation from the mean". Standard
deviation is denoted by the Greek letter sigma (σ).

$$\sigma = \sqrt{\frac{\sum d^2}{N}} \quad \text{or} \quad \sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}, \quad \text{where } \sum d^2 = \sum (X - \bar{X})^2$$

Coefficient of Standard Deviation

In order to compare the variability in two or more series, a relative measure of standard
deviation is calculated. It is called the "Coefficient of Standard Deviation", which is calculated by
dividing the standard deviation (σ) by the mean (X̄) of the data. Symbolically:

$$\text{Coefficient of Standard Deviation (S.D.)} = \frac{\sigma}{\bar{X}}$$
a. Calculation of Standard Deviation – Individual Series

i. Calculation of Standard Deviation from Actual Mean. When deviations are taken
from the actual mean, the formula used is:

$$\sigma = \sqrt{\frac{\sum d^2}{N}}$$

where σ = standard deviation, ∑d² = sum of the squared deviations from the actual mean and
N = total number of observations. Steps to be followed here are:

(i) Calculate the actual mean of the series, i.e., X̄.

(ii) Take the deviations of the items from the mean to find d = (X − X̄).

(iii) Square the deviations and obtain their total, i.e., ∑d².

(iv) Apply the formula:

$$\sigma = \sqrt{\frac{\sum d^2}{N}}$$
ii. Calculation of Standard Deviation from Assumed Mean: Sometimes the actual mean
may come out as a fraction, e.g., 31.123. In that case it becomes difficult to take deviations, and
squaring these deviations is more difficult still. To avoid such complications, we take some
convenient value as the assumed mean (A) and use the formula:

$$\sigma = \sqrt{\frac{\sum dx^2}{N} - \left(\frac{\sum dx}{N}\right)^2}$$

Steps:
(i) Find the total number of items, i.e., N.
(ii) Take deviations from the assumed mean and obtain dx = (X − A). Take the total of these
deviations, i.e., ∑dx.
(iii) Square these deviations and obtain their total, i.e., ∑dx².

(iv) Apply the formula:

$$\sigma = \sqrt{\frac{\sum dx^2}{N} - \left(\frac{\sum dx}{N}\right)^2}$$

(v) Divide the result of step (iv) by X̄; the result is the coefficient of standard
deviation.
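The actual-mean and assumed-mean formulas give the same standard deviation, whichever value is assumed. The Python sketch below illustrates this on the transport rates of Example 5, reused here purely for illustration; the function names and the assumed mean A = 600 are assumptions of the sketch, and `statistics.pstdev` (the population standard deviation, which divides by N) serves as an independent check.

```python
from math import sqrt
from statistics import pstdev

def sd_actual_mean(values):
    """sigma = sqrt(sum(d^2) / N), with d taken from the actual mean."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((x - mean) ** 2 for x in values) / n)

def sd_assumed_mean(values, assumed):
    """sigma = sqrt(sum(dx^2)/N - (sum(dx)/N)^2), with dx = X - A."""
    n = len(values)
    dx = [x - assumed for x in values]
    return sqrt(sum(d * d for d in dx) / n - (sum(dx) / n) ** 2)

rates = [200, 300, 400, 600, 700, 800, 1000]
sigma = sd_actual_mean(rates)
print(round(sigma, 2), round(sd_assumed_mean(rates, 600), 2))   # 265.73 265.73
print(round(sigma / (sum(rates) / len(rates)), 3))              # coefficient of S.D.
```

With A = 600 the deviations are whole numbers (∑dx = −200, ∑dx² = 500000), which is the convenience the assumed-mean method is meant to provide.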

Coefficient of Variation
Coefficient of variation or coefficient of variability is a relative measure of dispersion. It
was developed by Karl Pearson. The coefficient of variation is used in problems where
we want to compare the variability of two or more series. The series which has the greater
coefficient of variation is the more variable, and therefore the less consistent, and
vice versa.
Higher C.V. = Lower consistency, reliability and uniformity
Lower C.V. = Higher consistency, reliability and uniformity

According to Prof. Karl Pearson, "coefficient of variation is the percentage variation in mean,
standard deviation being considered as the total variation in the mean". If we wish to compare
the variability of two or more distributions, we have to compute their coefficients of variation.

$$\text{Coefficient of Variation (C.V.)} = \frac{\sigma}{\bar{X}} \times 100$$
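As an illustration of how the coefficient of variation is used for comparison, the Python sketch below computes it for two hypothetical series with the same mean. The data and the function name are assumptions made for this sketch, and σ is taken as the population standard deviation (`statistics.pstdev`), in line with the formulas above that divide by N.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """C.V. = sigma / mean * 100, with sigma computed by dividing by N."""
    return pstdev(values) / mean(values) * 100

# Hypothetical scores of two players, both averaging 50
player_a = [50, 52, 48, 51, 49]
player_b = [10, 90, 45, 70, 35]

cv_a = coefficient_of_variation(player_a)
cv_b = coefficient_of_variation(player_b)
print(round(cv_a, 2), round(cv_b, 2))      # 2.83 55.5
print("A" if cv_a < cv_b else "B", "is the more consistent series")
```

Although both series have the same mean, the lower C.V. of the first marks it as the more consistent one.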
Variance
By variance we mean the square of standard deviation. The term was first used by R.A.
Fisher in 1918. This measure of variation is amenable to further quantitative analysis. If we are
dealing with a phenomenon affected by a number of variables, variance helps us in
separating the effects of the different factors.

Variance is square of Standard Deviation i.e.,

Variance =  2 or σ  Variance or σ
2

Smaller the values of  , the lesser the variability or greater the consistency and vice-
2

versa.

If deviations are taken from the actual mean, the standard deviation and the variance will be:

$$\sigma = \sqrt{\frac{\sum d^2}{N}}, \qquad \sigma^2 = \frac{\sum d^2}{N}$$
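In case the deviations are taken from the assumed mean, the standard deviation and the variance will be:

$$\sigma = \sqrt{\frac{\sum f\,dx^2}{N} - \left(\frac{\sum f\,dx}{N}\right)^2}, \qquad \sigma^2 = \frac{\sum f\,dx^2}{N} - \left(\frac{\sum f\,dx}{N}\right)^2$$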
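In case of step deviations, where C is the common factor, the standard deviation and the variance will be:

$$\sigma = \sqrt{\frac{\sum f\,dx'^2}{N} - \left(\frac{\sum f\,dx'}{N}\right)^2} \times C, \qquad \sigma^2 = \left[\frac{\sum f\,dx'^2}{N} - \left(\frac{\sum f\,dx'}{N}\right)^2\right] \times C^2$$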

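The step-deviation formula can be checked numerically. The sketch below is illustrative only: the mid-values, frequencies, assumed mean and common factor are hypothetical, and the function names are chosen for this example. It compares the step-deviation variance with the variance obtained from deviations about the actual mean.

```python
def variance_step_deviation(mid_values, freqs, assumed_mean, c):
    """Variance of a frequency distribution by the step-deviation method."""
    n = sum(freqs)
    d = [(x - assumed_mean) / c for x in mid_values]        # dx' = (X - A) / C
    sum_fd = sum(f * di for f, di in zip(freqs, d))
    sum_fd2 = sum(f * di * di for f, di in zip(freqs, d))
    return (sum_fd2 / n - (sum_fd / n) ** 2) * c ** 2

def variance_direct(mid_values, freqs):
    """Variance using deviations from the actual mean, for comparison."""
    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, mid_values)) / n
    return sum(f * (x - mean) ** 2 for f, x in zip(freqs, mid_values)) / n

mid_values = [5, 15, 25, 35, 45]   # hypothetical class mid-values
freqs = [4, 6, 10, 6, 4]           # hypothetical frequencies

var_step = variance_step_deviation(mid_values, freqs, assumed_mean=25, c=10)
var_direct = variance_direct(mid_values, freqs)
print(round(var_step, 3), round(var_direct, 3))
```
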
Lorenz Curve
Lorenz curve is a graphic method of studying dispersion. It is also called a 'Dispersion
curve'. The curve was devised by Dr. Max O. Lorenz; hence it is known as the Lorenz
curve. It is a cumulative frequency curve based on percentages.

Technique of Lorenz Curve-
(1) The values or if class intervals are given, their mid-values are made cumulative. Then
taking the last cumulative total as equal to 100, percentages to different cumulative
values are found out.
(2) Similarly, the frequencies are cumulated and, taking the last cumulative
frequency as equal to 100, percentages of the other cumulative frequencies are found.
(3) Generally, percentages of cumulative values are shown on X-axis and the percentages
of cumulative frequencies on Y-axis. However, it is not a hard and fast rule and they
may be shown on reversed axes also.
(4) Generally, the X-axis runs from 100 to 0 and the Y-axis from 0 to 100.
(5) A line is drawn from 0 of X-axis to the 100 of Y-axis and this straight line is called
the line of equal distribution.
(6) Finally, the points of the percentages of cumulative values and corresponding
cumulative frequency are plotted and the curve drawn by joining these points is called
Lorenz curve.
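
Steps (1) and (2) of the technique, which produce the points to be plotted in step (6), can be sketched as follows. The values and frequencies are hypothetical and the helper name is chosen for this example; the sketch only prepares the cumulative percentages and does not draw the graph.

```python
def cumulative_percentages(items):
    """Cumulate the items and express each cumulative total as a % of the last one."""
    running = 0
    cumulative = []
    for item in items:
        running += item
        cumulative.append(running)
    total = cumulative[-1]          # last cumulative total is taken as 100
    return [100 * c / total for c in cumulative]

values = [10, 20, 30, 40]   # hypothetical mid-values (e.g. income classes)
freqs = [40, 30, 20, 10]    # hypothetical number of persons in each class

x_pct = cumulative_percentages(values)   # percentages of cumulative values
y_pct = cumulative_percentages(freqs)    # percentages of cumulative frequencies

lorenz_points = list(zip(x_pct, y_pct))  # points to plot and join
print(lorenz_points)
```

If every point had equal X and Y percentages, the curve would coincide with the line of equal distribution; the gap between the two percentages at each point is what the curve displays.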
Method of studying Dispersion by Lorenz Curve-
(1) If the Lorenz curve lies on the line of equal distribution, there is no dispersion and the
distribution is proportionately equal.
(2) The greater is the distance between the Lorenz curve and line of equal distribution,
the greater will be the dispersion. Similarly, the nearer the curve, the lesser will be the
dispersion.
(3) If two or more Lorenz curves are drawn on the same graph, the curve that is at a
greater distance from the line of equal distribution will represent greater variability in
the series.

1.4 Skewness: Concept and its Measures.

Skewness is a statistical measure which explains the shape of a distribution. The word
'skewness' means 'lack of symmetry'. In other words, if the frequency distribution on either side of
the central value is not symmetrical, the distribution is said to be skewed.

“Measures of skewness tell us the direction and the extent of skewness. In symmetrical
distribution the mean, median and mode are identical. The more the mean moves away from the
mode, the larger the asymmetry or skewness”.

Skewness is a statistical measure which explains the asymmetrical nature, and its degree,
of the frequency distribution of a series. In the study of skewness two words, 'symmetrical' and
'asymmetrical', are widely used. Hence, it will be appropriate to understand these two terms in
detail.

1. Symmetrical or Normal Distribution: In this distribution frequencies increase and
decrease in a regular order, i.e., the spread of frequencies will be the same on both sides of the
centre point. The following figures illustrate such a distribution:

X: 10 11 12 13 14

f: 4 6 10 6 4

It is clear from this example that the central frequency is 10 and that on both sides of it the
frequencies are 6 and 4 respectively. The main features of a symmetrical distribution are: (a) The
curve prepared on this basis is bell-shaped. (b) The values of mean, median and mode are
identical. (c) The difference between Q3 and M and M and Q1 is equal i.e., Q3 – M = M –Q1. (d)
The skewness is zero.

2. Asymmetrical or Skewed Distribution: In such a distribution there is no uniformity or
regularity in the order of increase and decrease of frequencies. The main features of such a
distribution are: (a) The curve is stretched more to one side than to the other. In other words, it
has a longer tail to one side (left or right). (b) The values of mean, median and mode are not
identical. (c) The median is not equidistant from Q1 and Q3. (d) Skewness exists in
such a distribution.

Types of Skewness

It is worth mentioning that asymmetrical distribution may be of two types:

(i) Positively skewed and (ii) Negatively skewed

(i) Positively Skewed Distribution: A distribution in which more than half of the area
under the curve is to the right side of the mode is a positively skewed distribution. In such a
distribution the mean is greater than the median and the median is greater than the mode
($\bar{X} > M > Z$). The difference between Q3 and the median is greater than the difference between
the median and Q1 (Q3 - M > M - Q1). It can be illustrated by the following table:

X     f
10    10
11    60
12    50
13    40
14    30
15    20
16    10

It is clear from this table and figure shown above that (a) The curve is more tilted to the
right, (b) Mean (12.55) > Median (12) > Mode (11). (c) The values of Q3 and Q1 are 14 and 11.
Thus Q3 - M > M - Q1, i.e., 14 - 12 > 12 - 11.

(ii) Negatively Skewed Distribution: In a negatively skewed distribution, more than half of the
area under the distribution curve is to the left side of the mode. In such a
distribution the mean is less than the median and the median is less than the mode
($\bar{X} < M < Z$). The difference between Q3 and the median is less than the difference
between the median and Q1 (Q3 - M < M - Q1). It is explained in the following
illustration:

X f
10 10
11 20
12 30
13 40
14 50
15 60
16 10
It is clear from the above example that (a) The curve is more tilted to the left side,
(b) Mean (13.45) < Median (14) < Mode (15). (c) Q3 - M = (15 - 14) < M - Q1 = (14 - 12).
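
The averages and quartiles of the two illustrative tables can be worked out with a short Python sketch. It assumes the usual convention for a discrete series, namely that the median is the value of the (N+1)/2-th item and that Q1 and Q3 are the values of the (N+1)/4-th and 3(N+1)/4-th items, located through cumulative frequencies; the function names are chosen for this example.

```python
def nth_item(xs, fs, position):
    """Value of the item at the given position, located via cumulative frequencies."""
    running = 0
    for x, f in zip(xs, fs):
        running += f
        if position <= running:
            return x
    return xs[-1]

def summarise(xs, fs):
    n = sum(fs)
    return {
        "mean": sum(x * f for x, f in zip(xs, fs)) / n,
        "median": nth_item(xs, fs, (n + 1) / 2),
        "mode": xs[fs.index(max(fs))],          # value with the highest frequency
        "q1": nth_item(xs, fs, (n + 1) / 4),
        "q3": nth_item(xs, fs, 3 * (n + 1) / 4),
    }

xs = [10, 11, 12, 13, 14, 15, 16]
positive = summarise(xs, [10, 60, 50, 40, 30, 20, 10])   # first table
negative = summarise(xs, [10, 20, 30, 40, 50, 60, 10])   # second table

for name, s in (("positive", positive), ("negative", negative)):
    print(name, round(s["mean"], 2), s["median"], s["mode"], s["q1"], s["q3"])
```

For the first table the mean (about 12.55) exceeds the median (12), which exceeds the mode (11); for the second the order is reversed, with a mean of about 13.45, a median of 14 and a mode of 15.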

Test of Skewness

The following tests may be applied in order to ascertain whether a distribution is skewed
or not:
1. On the basis of averages:
If the values of mean, median and mode are identical, there is no skewness. The greater
the difference between these three, the greater the skewness in the distribution. If the value
of the mean is greater than the value of the mode, skewness is positive; if it is less, the
skewness is negative.
2. On the basis of deviations:
If the sum of positive deviations from the median is equal to the sum of negative deviations,
there is no skewness. On the contrary, if these sums are not equal, there will be skewness in
the distribution.
3. Distance of quartiles from median:
If the quartiles (Q3 and Q1) are equidistant from median, skewness is zero. If they are not
equidistant, there will be skewness.
4. Frequencies on either side of mode:
If the sums of frequencies on either side of the modal value are equal, the distribution is
symmetrical.
5. Shape of the curve:
When the data are plotted on a graph paper and the curve is bell-shaped, there is no
skewness. If it is tilted either to the left or to the right, there is skewness.

Measures of Skewness

To find out the nature (positive or negative) and extent of skewness in a series, statistical
measures of skewness are calculated. These measures can be absolute or relative.

Absolute measures of skewness-

Absolute measures of skewness explain two things: (i) what is the extent of skewness?
and (ii) whether the skewness is positive or negative? The various absolute measures of skewness are as
follows:

(1) SK = Mean - Mode or $\bar{X} - Z$
(2) SK = 3 (Mean - Median) or $3(\bar{X} - M)$
(3) SK = Q3 + Q1 - 2M

Of the above measures, formulae No. 1 and 2 are also called the 'first measure of skewness'
or the 'position of average method', while formula No. 3 is called the 'second measure of
skewness'. According to formula No. 1:

(a) If Mean = Mode (No skewness)
(b) If Mean > Mode (Positive skewness)
(c) If Mean < Mode (Negative skewness)

Formula (2) is an alternative to formula (1) and is used when the mode is ill-defined.
Formula (3) is based on quartiles and median. It means that

(a) If Q3 + Q1 = 2M (No skewness)
(b) If Q3 + Q1 > 2M (Positive skewness)
(c) If Q3 + Q1 < 2M (Negative skewness)

It is worth mentioning that in practice these absolute measures are not very useful on
account of the following limitations:

1) These measures are expressed in terms of the original units of the data. Hence, they cannot be
compared. For example, if heights of students are given in centimetres and weights in kg,
the absolute measures of skewness will also be in cm and kg, which are not comparable.

2) Even if two or more series have the same units of measurement,
absolute measures may be misleading: the frequency curves may be similarly skewed despite a difference
in the absolute measures, or the curves may differ in shape
even though the difference between the mean and the mode is the same.

Relative measures of skewness-

The relative measures of skewness, also known as coefficients of skewness are very
useful in comparing the skewness between two or more distributions. Moreover, in relative
measures of skewness, the disturbing factor of dispersion or variation is eliminated by dividing
the absolute measure of skewness by a suitable measure of dispersion. Some of the important
relative measures of skewness are as follows:

(1) Karl Pearson's Coefficient of Skewness,
(2) Bowley's Coefficient of Skewness,
(3) Kelley's Coefficient of Skewness.

Karl Pearson's Coefficient of Skewness
It is also known as the 'Pearsonian Coefficient of Skewness'. This formula is based on the
fact that when a distribution drifts away from symmetry, its mean, median and mode tend to
deviate from each other. For the computation of this coefficient the difference between the mean
and the mode is divided by the standard deviation. Symbolically:

$$\text{(I)}\quad J = \frac{\text{Mean} - \text{Mode}}{\text{S.D.}} \quad\text{or}\quad J = \frac{\bar{X} - Z}{\sigma}$$

If in a particular frequency distribution it is difficult to determine the mode precisely, or the
mode is ill-defined, the following alternative formula may be used, in which the mode is replaced by the
median:

$$\text{(II)}\quad J = \frac{3(\text{Mean} - \text{Median})}{\text{S.D.}} \quad\text{or}\quad J = \frac{3(\bar{X} - M)}{\sigma}$$

Technically, these two formulas may be called the 'Pearsonian first order coefficient of
skewness' and the 'Pearsonian second order coefficient of skewness' respectively.

Bowley’s Coefficient of Skewness
The formula of coefficient of skewness propounded by Prof. Bowley is based on quartiles
and thus it is also known as the 'Quartile coefficient of skewness'. It is calculated as follows:

$$J_q = \frac{Q_3 + Q_1 - 2M}{Q_3 - Q_1}$$

The logic behind this measure is the relationship between the values of the quartiles. In a
symmetrical distribution, Q3 and Q1 are equidistant from the median (i.e., Q2) and skewness is zero.
When there is lack of symmetry in a given distribution, (Q3 - M) will not be equal to (M - Q1). If
Q3 - M exceeds M - Q1, skewness is positive because in such a case Q3 is farther
from M than Q1 is. On the other hand, if Q3 - M is less than M - Q1, skewness is negative
because Q1 is farther from M than Q3 is.

Note:

1. The absolute measure related to the quartile coefficient of skewness is known as the second
measure or quartile measure of skewness. Its formula is:

SK = (Q3 - M) - (M - Q1) = Q3 + Q1 - 2M

2. Under the following two circumstances, Bowley's Coefficient of Skewness is considered
better than Karl Pearson's coefficient of skewness: (a) when the mode is ill-defined and
extreme observations are prominent in the series, and (b) when class intervals are not
equal or the distribution is open-ended.

3. Karl Pearson's and Bowley's coefficients of skewness are not comparable. There may
be significant variation between these two coefficients. It is also possible for them to have
opposite signs (one positive and the other negative). However, in the absence of
skewness, both formulae give a coefficient of zero.

4. The limits of Bowley's coefficient of skewness are -1 to +1.

5. An important limitation of Bowley's formula is that it is based on the middle 50% of the items
of the series, while Karl Pearson's formula considers all values of the series.

Kelley’s Coefficient of Skewness:

Kelley's Coefficient of Skewness is based on percentiles as given below:

$$J_{\text{Kelley}} = \frac{P_{90} + P_{10} - 2P_{50}}{P_{90} - P_{10}}$$

This measure is rarely used in practice. However, it has its theoretical attraction. This
measure is also called the 'Percentile Coefficient of Skewness'. It can also be written in terms
of deciles as under:

$$J_{\text{Kelley}} = \frac{D_9 + D_1 - 2M}{D_9 - D_1}$$
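
The relative measures described above translate directly into small functions. The sketch below is illustrative: the function names are chosen for this example, the quartile inputs for Bowley's coefficient are those of the two illustrative skewed distributions given earlier in this section, and the remaining inputs are hypothetical.

```python
def pearson_skewness(mean, mode, sd):
    """Karl Pearson's first coefficient: (Mean - Mode) / S.D."""
    return (mean - mode) / sd

def pearson_skewness_median(mean, median, sd):
    """Karl Pearson's alternative coefficient: 3 (Mean - Median) / S.D."""
    return 3 * (mean - median) / sd

def bowley_skewness(q1, median, q3):
    """Bowley's quartile coefficient: (Q3 + Q1 - 2M) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * median) / (q3 - q1)

def kelley_skewness(p10, p50, p90):
    """Kelley's percentile coefficient: (P90 + P10 - 2 P50) / (P90 - P10)."""
    return (p90 + p10 - 2 * p50) / (p90 - p10)

# Quartiles of the positively and negatively skewed illustrations
print(round(bowley_skewness(q1=11, median=12, q3=14), 3))   # positive sign
print(round(bowley_skewness(q1=12, median=14, q3=15), 3))   # negative sign

# Hypothetical figures for the other two measures
print(pearson_skewness(mean=50, mode=44, sd=12))
print(round(kelley_skewness(p10=10, p50=30, p90=70), 3))
```

Because each coefficient divides an absolute measure by a measure of dispersion, the result is a pure number and can be compared across series measured in different units.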

1.5 Summary
Dispersion describes the variability within a frequency distribution and allows the variability of two or
more series to be compared. A measure of dispersion is useful in judging how representative an
average is and in locating an individual item within the series, and it is capable of further mathematical
treatment. We know that an ideal measure of dispersion should possess the following
properties.

a. An ideal measure should be easy to calculate and simple to understand.
b. It should be rigidly defined.
c. It should be based on each and every item of the series.
d. It should be capable of further mathematical treatment.
e. It should not be affected by the extreme items of the series.
f. It should have sampling stability.

There are five important methods of measuring dispersion: - Range, Q.D, M.D, S.D and
Lorenz curve. Which method to use will depend upon three aspects i.e. Characteristics of each
measurement, nature of data and the objective of measuring variation. After examining the above
qualities of an ideal measure of dispersion and comparing the relative merits and demerits, the
standard deviation seems to be the best measure of dispersion.

1.6 Questions

1. What is meant by dispersion? Explain the different methods of computing dispersion and
their comparative usefulness.
2. What is a Lorenz curve and how do you construct it?
3. How is the combined standard deviation calculated?
4. Find the arithmetic mean and the standard deviation:

Wages:             100   200   300   400   500   600   700   800
No. of workers:      8    10    12    22     6    13     9    10

1.7 Suggested Reading


2. S.L. Aggarwal & S.L. Bhardwaj, “Business Statistics”, New Delhi.
3. Levin and Rubin, “Statistics for Management”, Prentice-Hall of India, New Delhi.
4. Dr. Gupta K. L., “Business Statistics”.

UNIT-II

ANALYSIS OF TIME SERIES AND FORECASTING

Structure

2.1 Objectives
2.2 Introduction
2.3 Time Series: Meaning
2.4 Components of a time series
2.5 Methods of estimating the trends
2.6 Forecasting: Concept and Methods of Forecasting.
2.7 Summary
2.8 Questions
2.9 Suggested Reading
2.1 Objectives

The objective of the lesson is to make you understand:

1. The concept and utility of time series analysis.

2. The different types of trends.

3. Measurements of different types of trends

2.2 Introduction

Time factor is a continuous dynamic affair, and the sequence of fluctuations goes on in
economic, business, political, social and various other areas with chronological movement.
These fluctuations affect various economic aspects such as trade, commerce, industry,
employment, price, etc. Hence, the study and analysis of movements linked with time factor has
become very useful and for this purpose the technique of analysis of time series is used.

2.3 Time Series: Meaning

Time series refers to such a series in which statistical data are presented on the basis of time of
occurrence or in a chronological order. The measurement of time may be year, month, week,
day, hour or even minutes or seconds. Technically, in a time series the data represents the
movement related to time factor.

2.4 Components of a time series

Time series is influenced collectively by a large variety of factors and forces. The effects of these
forces can be classified in some definite categories. These categories are called the components
of time series. The main components of time series may be classified as given below:

Original Data (O or Y)

   1. Long-term Movement or Trend or Secular Trend (T)

   2. Short-time Fluctuations (O - T)

      (a) Regular Short-time Fluctuations (S + C)
            - Seasonal Variations (S)
            - Cyclical Fluctuations (C)

      (b) Irregular or Random Fluctuations (I)

These various components of time series have been examined in detail in the following
paragraphs:

2.4.1 Secular Trend or Long-term Movement or Trend

Sometimes the movement or change is in a particular direction over a very long period, and
this tendency is called the long-term trend. For example, despite the measures taken by the
government in our country, the long-term trend of population is ever increasing. There are
other such phenomena in which the tendency is in one direction only, such as a continuous
increase in prices, a continuous decline in the death rate, etc.

There are two main objectives of measuring secular trend:

1. Knowledge of other components- The first objective of measuring trend is to make it
distinct from the other components, such as short-term fluctuations, seasonal variations,
cyclical fluctuations, etc.

2. Estimation of future- The second and most important objective of measuring trend is to
project the curve into the future as a long-term forecast.

Some important characteristics of secular trend are as follows:

(a) Three Aspects- There may be three aspects of long-term trend: upward trend,
downward trend and stable trend. An upward trend is usually observed in time series
relating to prices, population, etc., while a downward trend is noticed in data on
illiteracy, death rate, etc. The stable trend is found in the pure and natural sciences, such
as the tendency of the body temperature of an individual to remain at 98.6°F.

(b) Different Trends during different periods- It is not necessary that the trend should
be in the same direction throughout the given period. It may be possible that different
tendencies of increase, decrease or stability are observed in different periods of time.
However, the overall tendency may be upward, downward or stable.

(c) Relative Concept- The term 'long period of time' is a relative concept, which is
influenced by the characteristics of the series.

The symbol 'T' is used for denoting the long-term trend in the formulae relating to analysis of
time series.

2.4.2 Regular Short-term Oscillations

Most of the time series are influenced by such factors or forces which repeat themselves
periodically. The variations arising out on account of such regular or periodical repetitions are
called regular short-time oscillations, which may be classified into the following categories:-

(a) Seasonal variations: These are movements which are regular and repetitive in nature and
recur in a periodic sequence within a span of twelve months. The period of recurrence may be a
day, a week, a month or any other span not exceeding a year. Their main causes are as follows:

(i) Climate- The climatic changes or weather conditions play an important role in
seasonal movements. E.g. sale of woolen clothes, cottons, soft drinks, air
conditioners, etc. depends upon climatic changes.

(ii) Customs and conventions- Customs and conventions also have their impact on
seasonal variations. For instance, event managers are in great demand during the marriage
season in India. Similarly, festivals in the community increase the demand for
the various things needed for those events.

(iii) Conditions of specific time- Seasonal variations also occur on account of different
conditions at different times during a year, e.g., the rush in trains and at tourist spots
during school vacations, and the demand for new uniforms when the new session in
schools starts.

Characteristics- The main characteristics of seasonal variations are as follows:-

i. Regular movement – Seasonal variations occur regularly almost at the same time
and about the same proportion within a period of less than one year.

ii. Swings in both directions- These variations may swing to any direction i.e. upward
or downward.

iii. Easy forecast – Seasonal variations can easily be forecast. Various economic and
business activities are operated on the basis of these forecasts. Consumers,
producers and sellers give due consideration to these variations while taking decisions
about their operations. These variations are denoted by the letter 'S' in the analysis of
time series.

(b) Cyclical Fluctuations- Cyclical fluctuations also occur periodically like seasonal
variations, but the period of their recurrence is more than a year. These fluctuations are called
cyclical because they occur in cycles, and in each cycle there are four stages: (i)
Prosperity, (ii) Recession, (iii) Depression and (iv) Recovery. There is no definite period of
cyclical fluctuations. This period may vary from 3 to 10 years. For example, after every three
years there is a tendency towards bumper crops of mangoes, or after every four years the production of
sugarcane reaches its peak, etc.

Cyclical fluctuation is denoted by the symbol 'C'.

2.4.3 Irregular or Random Fluctuations

Irregular or random fluctuations occur accidentally in time series. For instance, a decline in
the demand for food grains in a particular year due to an earthquake and mass deaths, a decrease
in production due to a sudden breakdown of the plant, etc. The following facts are important about
the characteristics of these fluctuations:

(i) No Forecasting- These fluctuations cannot be predicted or forecast.

(ii) No Definite Pattern- Irregular fluctuations have no regular period or time of
occurrence. That is why they are called irregular. They are uncertain.

(iii) Short-term- Generally, they occur as short-term variations, but sometimes their
effect may be so intense that they give rise to new cyclical or other movements.

(iv) Coverage- Irregular fluctuations cover all such variations in a time series, which are
not covered within the gamut of trend, seasonal and cyclical movements.

Irregular fluctuations are denoted by the symbol 'I'.

2.4.4 Difference between Seasonal Variations and Cyclical Fluctuations

The main differences between seasonal variations and cyclical fluctuations are as follows:

1. Period of recurrence- Seasonal variations recur within a period of one year, while
a cycle of cyclical fluctuations is generally completed in 3 to 10 years.

2. Regularity – Seasonal variations are almost regular from the point of view of period and
sequence, while in cyclical fluctuations the order of prosperity, recession, depression and
recovery is definite, but the duration of each stage is not definite and goes on changing
from time to time.

3. Accuracy of Measurement – Seasonal variations can be measured more precisely and
accurately in comparison to cyclical fluctuations.

4. Causes- Seasonal variations occur on account of changes in season or the needs of a particular
time, while cyclical fluctuations take place due to inflation, deflation or a sudden change in
demand.

5. Relationship with business- Seasonal variations occur in a different order in different
businesses, but cyclical fluctuations influence all businesses uniformly.

2.5 Computation of the Time Series

Original data (O) given in time series include four components: (1) Trend or T, (2)
Seasonal Variations or S, (3) Cyclical Fluctuations or C, (4) Irregular Fluctuations or I. The
measurement, analysis and study of these components are called analysis of time series. Broadly,
the following two aspects are covered in analysis of time series.

(1) Identifying or determining the various forces or influences whose interaction produces
the variations in the time series;

(2) Isolating, studying, analyzing and measuring the components independently, i.e., by
holding other things constant.

The measurement of four components of time series is based on following two models:

1. Additive Model – This model is based on the assumption that the sum of the four
components is equal to the original value, i.e., O = T + S + C + I. This model treats each
component as a residual, so that short-term fluctuations (S + C + I) can be found
by deducting the trend (T) from the original data (O), i.e., O – T = S + C + I. Similarly, cyclical and
irregular fluctuations can be found by deducting seasonal variations from short-term
fluctuations, i.e., O – T – S = C + I. If seasonal and cyclical fluctuations are both isolated from
short-term fluctuations (O – T), irregular fluctuations can be measured, i.e., O – T – (S + C) = O – T –
S – C = I.

2. Multiplicative Model- This model assumes the original data to be the product of the four
components, i.e., O = T x S x C x I. In the analysis of time series this model is used in measuring
and isolating short-term fluctuations, i.e.,

$$\frac{O}{T} = S \times C \times I, \qquad \frac{O}{T \times C} = S \times I, \qquad \frac{O}{T \times C \times S} = I$$

It should be remembered that in this model trend is expressed in terms of original data
while other three components are expressed as rate or indices fluctuating above or below unity.
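
The way each model isolates the irregular component can be illustrated with a small Python sketch. All figures below are hypothetical, and the function names are chosen for this example; in the additive case S and C are in the units of the data, while in the multiplicative case they are indices around unity.

```python
def additive_irregular(o, t, s, c):
    """Additive model: I = O - T - S - C."""
    return o - t - s - c

def multiplicative_irregular(o, t, s, c):
    """Multiplicative model: I = O / (T x S x C)."""
    return o / (t * s * c)

# Hypothetical additive case: all components in the units of the data
i_add = additive_irregular(o=120, t=100, s=12, c=5)

# Hypothetical multiplicative case: T in original units, S and C as indices
i_mult = multiplicative_irregular(o=126, t=100, s=1.2, c=1.05)

print(i_add)              # residual left after removing T, S and C
print(round(i_mult, 4))   # an index; a value of 1 means no irregular effect
```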

It is also worth mentioning that in practice a mixture of the additive and multiplicative
models (known as mixed models) may also be adopted. Some of the
mixed models are given below:

O = TCS + I O = T + SCI

O = TC + SI O = T + S + CI

2.6 Methods of estimating the trends

The following are the four important methods which are used in estimating the trend:

1) Free-hand Curve Method

2) Semi-average Method

3) Moving-average Method

4) Method of Least Squares

I Free-hand Curve Method

This method is also called the 'Graphic Method' or 'Curve fitting by inspection'. The
procedure for finding the trend by this method is as follows:

1. First of all, the original values of a time series are plotted on a graph paper and a
historigram is obtained by joining these points.

2. After this a smooth curve is drawn through these points, keeping in view the direction of
fluctuations, in such a way that the curve represents the general tendency of the data.

It is evident that in the free-hand curve method an effort is made to draw a smooth
curve such that seasonal, cyclical and irregular fluctuations in the original data are eliminated.
Here it is important to note that different persons are likely to draw different curves from the
same data. Hence, the following points must be kept in mind while drawing a free-hand smooth
curve: (1) The curve should be smooth. (2) The number of points above the trend curve should be
more or less equal to the number of points below it. (3) If there are cyclical fluctuations in the data, the
cycles above the curve and below the curve should be almost equal.

It is the simplest and most flexible method of studying trends without mathematical
calculations. At the same time, it is highly subjective because there is every possibility of
different persons drawing different curves for the same set of data. It will be dangerous to use it for
forecasting or making predictions.

The conclusion is that, despite being simple and flexible, this method is neither scientific nor
very reliable. So in practice, its use is very limited.

Example:

Show the secular trend from the following data by free-hand curve method:-
Year 1995 1996 1997 1998 1999 2000
Production (Rs.) 60 68 64 70 62 72


[Figure: Production (Rs.) plotted against Year, 1994 to 2001, with the free-hand trend curve]

II Semi-average Method

The procedure of finding out trend by semi-average method is as follows-

1. The first step is that the values of the time series are divided into two equal parts. For
example, if values of six years are given, then the values of the first three years will be kept in the
first part and the values of the remaining three years in the second part. It is important to note
that if the number of values is even, they will be divided exactly into two equal parts, but if the
number of values is odd, the median value (value of the middle year) is left out and the
remaining values are divided into two equal parts. For example, if values of 11 years are given,
then the value of the middle year, i.e., the $\frac{11+1}{2}$ = 6th year, will be left out and the series will be
divided into two equal parts consisting of the first to fifth years and the seventh to eleventh
years.

2. After dividing the given series into two equal parts, we compute the arithmetic mean of the
time-series values for each half separately. These means are called semi-averages.
Alternatively, the median may also be calculated.

3. After making these mathematical calculations we draw a curve by plotting the original
data on graph paper.

4. Both the semi-averages are plotted as points against the middle point of the
respective time periods covered by each part. For example, if there is a period of five years
in each part, the first semi-average is plotted against the middle of the first five years, i.e., the
$\frac{5+1}{2}$ = 3rd year, and the second against the middle of the last five years, i.e., the 8th year.

5. Finally, a straight line is drawn by joining the two points of semi-averages and this line
represents the trend of the data.

Example 1: Fit a trend by the method of semi-averages to the data given below:-

Year                       1992  1993  1994  1995  1996  1997  1998  1999  2000
Production (Rs. crores)       9    12    13    14    17    18    20    22    24

Solution:

Here the number of years is 9. So the middle, i.e. 5th year (1996) will be ignored and the
time series will be divided into two equal parts:-first from 1992 to 1995 and the other from 1997
to 2000.

Year    Production      4-yearly        Semi-average
        (Rs. crores)    semi-total
1992      9
1993     12             48              48/4 = 12
1994     13
1995     14
1996     17             (middle year, left out)
1997     18
1998     20             84              84/4 = 21
1999     22
2000     24

The first semi-average 12 will be plotted in the middle of the year 1993 and 1994.
Similarly, the second semi-average 21 will be plotted against the middle of the year 1998 and
1999.


[Figure: original data and semi-average trend line, Production against Year, 1991 to 2001]

Calculation of Trend Values

Semi-average for the middle of 1998 and 1999 = 21

Semi-average for the middle of 1993 and 1994 = 12

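Difference for 5 years = 9

Annual Increase Rate = 9/5 = 1.8

Each semi-average lies midway between two years, so a year that is 0.5 or 1.5 years away from
it differs from it by 1.8 x 0.5 = 0.9 or 1.8 x 1.5 = 2.7 respectively.

Year    Trend Value            Year    Trend Value
1992    12 - 2.7 = 9.3         1997    21 - 2.7 = 18.3
1993    12 - 0.9 = 11.1        1998    21 - 0.9 = 20.1
1994    12 + 0.9 = 12.9        1999    21 + 0.9 = 21.9
1995    12 + 2.7 = 14.7        2000    21 + 2.7 = 23.7
1996    12 + 4.5 = 16.5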
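
The semi-average calculation for Example 1 can be reproduced with a short Python sketch. The function name and its structure are choices made for this illustration; the data are those of the example.

```python
def semi_average_trend(years, values):
    """Trend values from the straight line joining the two semi-averages."""
    n = len(values)
    half = n // 2                      # middle value is left out when n is odd
    first_years, first_vals = years[:half], values[:half]
    second_years, second_vals = years[n - half:], values[n - half:]

    avg1 = sum(first_vals) / half      # first semi-average
    avg2 = sum(second_vals) / half     # second semi-average
    mid1 = sum(first_years) / half     # time point against which avg1 is plotted
    mid2 = sum(second_years) / half    # time point against which avg2 is plotted

    annual_change = (avg2 - avg1) / (mid2 - mid1)
    return {y: avg1 + annual_change * (y - mid1) for y in years}

years = list(range(1992, 2001))
production = [9, 12, 13, 14, 17, 18, 20, 22, 24]

trend = semi_average_trend(years, production)
for year in years:
    print(year, round(trend[year], 1))
```

Because the result is a straight line, the same function can be evaluated for years outside the data to obtain past or future estimates.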

This method is easy to understand and apply; there is objectivity and certainty in trend
line. The trend line can be extended on either side in order to obtain past or future estimates. But
this method can be used appropriately only when there is linear or approximately linear
relationship between the plotted points. If there are certain extreme values (very low or very
high), they will influence the value of semi-averages and in such a case trend line may not
represent the values correctly.

It is evident that the semi-average method is more appropriate than the free-hand
curve method; even so, it is not very reliable, particularly when there are extreme
values or the relationship is not linear.

(III) Moving-average Method

Moving average method is a simple and flexible device of reducing fluctuations and
obtaining trend values with a fair degree of accuracy. It consists in obtaining a series of moving
averages (arithmetic means) of successive overlapping groups or sections of the time series. For
example, suppose there are six years a, b, c, d, e and f and three-yearly moving averages are to be computed.
It will be done as follows:

$$\frac{a+b+c}{3},\quad \frac{b+c+d}{3},\quad \frac{c+d+e}{3},\quad \frac{d+e+f}{3}$$

The basic question to be decided in this method is the period of the
moving average, i.e., three yearly, four yearly, five yearly, etc. This decision is taken on the basis
of the size of the data and the fluctuations therein. From the point of view of calculation of moving
averages, the questions can be divided into two categories: (1) when the period is odd, and (2) when
the period is even.

1. Odd Period Moving Averages- It means moving averages of odd period of years, i.e., 3,
5, 7, 9, 11…….years. Its procedure can be explained as below on the assumption that three
yearly moving averages are to be calculated.

(i) First of all, three yearly moving totals will be obtained. The total of first three years
will be placed against the centre of three years, i.e., second year.

(ii) After it, total of next three years (second, third and fourth) will be placed against third
year, total of succeeding three years (third, fourth and fifth) will be placed against fourth year
and this process will continue till the value of the last year is included in the total.

(iii) Moving averages will be obtained by dividing each moving total by 3. Note that no moving average is obtained for the first and last years in the case of 3 yearly moving averages, nor for the first two and last two years in the case of 5 yearly moving averages.

Example 2: From the following data, let us obtain trend values by three yearly moving averages:

Year    1995  1996  1997  1998  1999  2000  2001  2002  2003  2004
Sales      4     6     8    12    16    19    24    28    32    40

Solution:

Year    Sales   Three yearly moving total   Three yearly moving average
1995      4
1996      6     4+6+8 = 18                   6.00
1997      8     6+8+12 = 26                  8.67
1998     12     8+12+16 = 36                12.00
1999     16     12+16+19 = 47               15.67
2000     19     16+19+24 = 59               19.67
2001     24     19+24+28 = 71               23.67
2002     28     24+28+32 = 84               28.00
2003     32     28+32+40 = 100              33.33
2004     40
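
The odd-period procedure can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `moving_average` and the use of `None` for the years that have no trend value are assumptions of the example, not part of the method itself.

```python
def moving_average(values, period):
    """Centred simple moving average for an odd period.

    Entries without a full window (the first and last period // 2
    positions) are returned as None.
    """
    if period % 2 == 0:
        raise ValueError("this sketch handles odd periods only")
    half = period // 2
    result = [None] * len(values)
    for i in range(half, len(values) - half):
        # Moving total of the window centred on position i.
        window = values[i - half:i + half + 1]
        result[i] = sum(window) / period
    return result


sales = [4, 6, 8, 12, 16, 19, 24, 28, 32, 40]  # Example 2, 1995-2004
trend = moving_average(sales, 3)
print([None if t is None else round(t, 2) for t in trend])
```

The first and last entries stay empty, which mirrors the fact that no three yearly average can be placed against 1995 or 2004.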

2. Even Period Moving Averages- If the moving average is to be calculated on the basis of an even period, i.e., 2, 4, 6 years, then the averages are calculated after centering the moving totals. Suppose four yearly moving averages are to be calculated; the following procedure would be adopted:-

(i) First of all, four yearly moving totals will be obtained. The first total will be of the first four years, the next total of the four years excluding the first year, and this process will be repeated. The first total will be placed between the second and third year, the second total between the third and fourth year, and so on.

(ii) After it, these moving totals will be centered. For this purpose two period moving totals of the four yearly totals will be obtained, each placed against the year lying between the two totals that are added.

(iii) The two period moving totals will be divided by 8, since each of them is the sum of two four yearly totals.

Example 3: From the data of Example 2, let us obtain the four yearly moving averages and determine the trend values:

Solution:

Year    Sales   Four yearly moving total   Two period moving total (centered)   Moving average*
1995      4
1996      6     4+6+8+12 = 30
1997      8     6+8+12+16 = 42              72                                    9.00
1998     12     8+12+16+19 = 55             97                                   12.125
1999     16     12+16+19+24 = 71           126                                   15.75
2000     19     16+19+24+28 = 87           158                                   19.75
2001     24     19+24+28+32 = 103          190                                   23.75
2002     28     24+28+32+40 = 124          227                                   28.375
2003     32
2004     40

Each four yearly total lies between the year on whose row it is shown and the following year.
* Moving average = Two period moving total (centered) / 8
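
The centering steps for an even period can be sketched as follows. The function name `centred_moving_average` is a hypothetical helper chosen for this illustration; it forms the moving totals, adds adjacent totals and divides by twice the period.

```python
def centred_moving_average(values, period):
    """Centred moving average for an even period (e.g. 4 yearly)."""
    if period % 2 != 0:
        raise ValueError("this sketch handles even periods only")
    # Step (i): moving totals of `period` consecutive values.
    totals = [sum(values[i:i + period])
              for i in range(len(values) - period + 1)]
    # Step (ii): two period moving totals centre the figures on a year.
    centred = [totals[i] + totals[i + 1] for i in range(len(totals) - 1)]
    # Step (iii): divide by 2 * period (8 for a four yearly average).
    half = period // 2
    result = [None] * len(values)
    for j, total in enumerate(centred):
        result[j + half] = total / (2 * period)
    return result


sales = [4, 6, 8, 12, 16, 19, 24, 28, 32, 40]  # Example 2, 1995-2004
trend4 = centred_moving_average(sales, 4)
print(trend4)
```

For a period of 4 the first two and last two years receive no trend value.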

Use of Weights in the calculation of Moving Averages-

Weights may also be used in the calculation of moving averages. Its objective is to assign
different importance to the values of different years. It has been explained in the following
example:-

Example 4: Using the data of Example 2, let us calculate the weighted moving average of period 3, the weights being 1, 3 and 2 respectively:

Solution: Each weighted moving total is divided by the sum of the weights, 1 + 3 + 2 = 6.

Year    Sales   Three yearly weighted moving total   Weighted moving average
1995      4
1996      6     (4x1 + 6x3 + 8x2) = 38                 6.33
1997      8     (6x1 + 8x3 + 12x2) = 54                9.00
1998     12     (8x1 + 12x3 + 16x2) = 76              12.67
1999     16     (12x1 + 16x3 + 19x2) = 98             16.33
2000     19     (16x1 + 19x3 + 24x2) = 121            20.17
2001     24     (19x1 + 24x3 + 28x2) = 147            24.50
2002     28     (24x1 + 28x3 + 32x2) = 172            28.67
2003     32     (28x1 + 32x3 + 40x2) = 204            34.00
2004     40
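
A weighted moving average can be sketched in the same way. The helper name `weighted_moving_average` is an assumption of this example. The sketch divides each weighted total by the sum of the weights, which is the usual convention for a weighted mean and keeps the result on the same scale as the data.

```python
def weighted_moving_average(values, weights):
    """Centred weighted moving average; len(weights) must be odd."""
    n = len(weights)
    if n % 2 == 0:
        raise ValueError("this sketch handles odd periods only")
    half = n // 2
    weight_sum = sum(weights)
    result = [None] * len(values)
    for i in range(half, len(values) - half):
        window = values[i - half:i + half + 1]
        weighted_total = sum(w * v for w, v in zip(weights, window))
        # Dividing by the sum of the weights gives a weighted mean.
        result[i] = weighted_total / weight_sum
    return result


sales = [4, 6, 8, 12, 16, 19, 24, 28, 32, 40]  # Example 2, 1995-2004
weighted = weighted_moving_average(sales, [1, 3, 2])
print([None if w is None else round(w, 2) for w in weighted])
```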

Decision regarding period of Moving Averages

A question that often arises is what the period of the moving average should be. Moving averages smooth the fluctuations in the original data and reduce the intensity of irregular fluctuations; the longer the period of the moving average, the smaller the intensity of the irregular fluctuations that remain. This is the argument for a longer period. At the same time, as the period increases, the trend values move further away from the original values. In practice, therefore, the optimum period of the moving average is the one that coincides with the period of the cycle existing in the time series. Such a period eliminates cyclical variations, minimizes irregular fluctuations and presents the best possible values of trend.

This method is simple to understand and apply since it does not require complex mathematical calculations. It is better than the free hand curve method, and it is more flexible than both the semi-average and least squares methods. If the period of the moving averages coincides with the period of cyclical fluctuations in the data, such fluctuations are automatically eliminated. This method is used not only for determining trend values but also for measuring seasonal, cyclical and irregular variations.

An important limitation of this method is that trend values for some years in the
beginning and some at the end cannot be obtained. For instance, in three yearly moving averages,
the trend value for first and last year and in four yearly moving average trend values for first two
and last two years are not obtained.

Since trend values are not expressed on the basis of functional relationship, this method is
not helpful in forecasting and predicting the values on the basis of time.

(IV) Method of Least Squares

This method is considered as one of the best methods for obtaining trend values. Under
this method the line of best fit is obtained for the time series on the assumption of least squares
by using algebraic equations. This line may be in the form of straight line or parabolic curve.
This method is called the method of least squares because the sum of squares of the deviations of
various points of trend line from original data would be the least as compared to the sums of
squares of the deviations obtained by using any other line.

The determination of trend on the basis of the method of least squares can be categorized into three parts:- (1) Fitting a Straight Line Trend, (2) Fitting a Parabolic or Non-linear Trend and (3) Semi-logarithmic or Exponential Curve.

1. Fitting a Straight Line Trend-

For fitting a straight line trend on the basis of the method of least squares, the following equation is used:-

$$Y_c = a + bX$$

where $Y_c$ is the required trend value and $X$ is the unit of time.

'a' and 'b' are constants. If the first year is taken as origin, then 'a' is the distance between the point of origin (O) and the point where the trend line intersects the Y-axis, i.e., the Y-intercept. If the middle year of the time series is taken as origin, then 'a' is the arithmetic mean of the time series. The constant 'b' indicates the slope of the trend line. It is also called the growth rate or decline rate because it tells the change in the trend value ($Y_c$) for each unit change in time ($X$).

The following two equations are used to determine the values of constants 'a' and 'b':-

$$\sum Y = Na + b\sum X \quad \ldots\ldots (i)$$

$$\sum XY = a\sum X + b\sum X^2 \quad \ldots\ldots (ii)$$

The use of equations in the above form is called the long method of least squares. However, if deviations are taken exactly from the middle year of the time series, $\sum X$ becomes zero and in such a case the above equations reduce to:-

$$a = \frac{\sum Y}{N}, \qquad b = \frac{\sum XY}{\sum X^2}$$

The use of summarized version of equations is called the short method of least squares. If
nothing contrary is mentioned in the questions and it is possible to take deviations exactly from
the middle year, then short method should be preferred because its calculation procedure is very
simple.

a. Least Squares Short Method – The procedure of this method is as follows:-

1. In total 6 columns are drawn, which will be for year, value ($Y$), deviations ($X$), $XY$, $X^2$ and trend values ($Y_c$) respectively.

2. First of all, time deviations are taken for all other years from the exact middle or median year and these deviations are shown in the column of $X$. It should be checked that the sum of deviations is equal to zero, i.e., $\sum X = 0$.

3. Each deviation is squared ($X^2$) and by adding these squares we obtain $\sum X^2$.

4. $\sum XY$ is obtained by adding the products ($XY$) of values ($Y$) and deviations ($X$).

5. The value of constant 'a' is computed by the formula $a = \dfrac{\sum Y}{N}$. Here $N$ stands for the number of years (time units) involved.

6. The value of 'b' is found by applying the formula $b = \dfrac{\sum XY}{\sum X^2}$.

7. Trend values ($Y_c$) are obtained by applying the formula $a + bX$ against each year. It should be checked that the sum of the original values ($\sum Y$) is equal to the sum of the trend values ($\sum Y_c$).

Example 5: Let us fit a straight line trend by the method of least squares to the given data:

Year      2000  2001  2002  2003  2004
Values      22    50    68    75    35

Solution:

Year     Y     X (from 2002)   X^2    XY     Trend Yc = a + bX
2000     22    -2              4      -44    50 + 5.1 x (-2) = 39.8
2001     50    -1              1      -50    50 + 5.1 x (-1) = 44.9
2002     68     0              0        0    50 + 5.1 x 0 = 50.0
2003     75     1              1       75    50 + 5.1 x 1 = 55.1
2004     35     2              4       70    50 + 5.1 x 2 = 60.2
N = 5    250    0              10      51    250.0

$$a = \frac{\sum Y}{N} = \frac{250}{5} = 50, \qquad b = \frac{\sum XY}{\sum X^2} = \frac{51}{10} = 5.1$$
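
The short method can be sketched in Python as below. The function name `fit_linear_trend_short` is hypothetical, and the sketch assumes an odd number of equally spaced years so that an exact middle year exists.

```python
def fit_linear_trend_short(years, values):
    """Least squares straight line with the middle year as origin."""
    n = len(values)
    if n % 2 == 0:
        raise ValueError("this sketch needs an odd number of years")
    middle = years[n // 2]
    x = [year - middle for year in years]        # deviations, sum is 0
    a = sum(values) / n                          # a = sum(Y) / N
    b = (sum(xi * yi for xi, yi in zip(x, values))
         / sum(xi * xi for xi in x))             # b = sum(XY) / sum(X^2)
    trend = [a + b * xi for xi in x]
    return a, b, trend


years = [2000, 2001, 2002, 2003, 2004]           # Example 5
values = [22, 50, 68, 75, 35]
a, b, trend = fit_linear_trend_short(years, values)
print(a, round(b, 2), [round(t, 1) for t in trend])
```

The sum of the trend values equals the sum of the original values, which is the check recommended in step 7.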

b. Least Squares Long Method:- The calculation process in this method is as follows:-

1. First of all, deviations are taken assuming any year or unit of time as origin. There is no restriction in respect of origin in this method. If some instruction is given in the question, it should be followed; otherwise calculations can be simplified by taking the mid-point in time as origin. If we want to avoid negative signs, the points of time or years ($X$) are denoted by natural numbers like 1, 2, 3, 4, ..., which means that the year or unit of time just preceding the first year or unit of time has been assumed as the point of origin. It should be remembered that the choice of the point of origin does not influence the trend values. However, the values of 'a' or 'b' or both may differ.

2. $\sum X$ is obtained by adding the deviations ($X$).

3. Deviations are squared ($X^2$) and their total ($\sum X^2$) is obtained.

4. The deviations ($X$) are multiplied by the respective values ($Y$) and the total is found ($\sum XY$).

5. Original values ($Y$) are added to get $\sum Y$.

6. After these calculations the values of 'a' and 'b' are obtained by solving the following two equations simultaneously:-

$$\sum Y = Na + b\sum X$$

$$\sum XY = a\sum X + b\sum X^2$$

7. The values of 'a' and 'b' are placed in the trend equation ($Y_c = a + bX$) and the trend values are found for the various values of $X$.

Example 6: Let us fit a straight line trend by the long method of least squares to the same data as in Example 5:

Year     Y     X (from 1999)   X^2    XY     Trend* Yc = a + bX
2000     22    1               1       22    34.7 + 5.1 x 1 = 39.8
2001     50    2               4      100    34.7 + 5.1 x 2 = 44.9
2002     68    3               9      204    34.7 + 5.1 x 3 = 50.0
2003     75    4               16     300    34.7 + 5.1 x 4 = 55.1
2004     35    5               25     175    34.7 + 5.1 x 5 = 60.2
N = 5    250   15              55     801

*The values of 'a' and 'b' are calculated by solving the two equations simultaneously:

$$\sum Y = Na + b\sum X \;\ldots (i) \quad\Rightarrow\quad 250 = 5a + 15b$$

$$\sum XY = a\sum X + b\sum X^2 \;\ldots (ii) \quad\Rightarrow\quad 801 = 15a + 55b$$

Multiplying eq. (i) by 3 we get 750 = 15a + 45b. Subtracting this from eq. (ii):

51 = 10b, so b = 5.1

On substituting the value of 'b' in equation (i):

250 = 5a + (15 x 5.1)

250 - 76.5 = 5a

a = 173.5 / 5 = 34.7

Trend equation: $Y_c = a + bX = 34.7 + 5.1X$
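
The simultaneous solution of the two normal equations can also be written in closed form. The sketch below, with the hypothetical name `fit_linear_trend_long`, solves them by Cramer's rule for any choice of origin.

```python
def fit_linear_trend_long(x, y):
    """Solve the two normal equations for a and b (any origin)."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xx = sum(xi * xi for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    # Determinant of the system
    #   sum_y  = n * a     + sum_x  * b
    #   sum_xy = sum_x * a + sum_xx * b
    det = n * sum_xx - sum_x * sum_x
    a = (sum_y * sum_xx - sum_x * sum_xy) / det
    b = (n * sum_xy - sum_x * sum_y) / det
    return a, b


x = [1, 2, 3, 4, 5]               # Example 6: origin at 1999
y = [22, 50, 68, 75, 35]
a, b = fit_linear_trend_long(x, y)
print(round(a, 2), round(b, 2))   # 34.7 5.1
```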

Conversion of Annual Trend Equations into Monthly Trend Equations:- If the annual trend equation is known, it can be converted into a monthly trend equation by dividing the constant 'a' by 12 and the constant 'b' by 144. 'a' is divided by 12 to convert it into a monthly value; 'b' is divided by 12 to convert it into a monthly value and then again by 12 to make it a monthly increment. Thus the annual trend equation $Y = a + bX$ is converted into the monthly trend equation as follows:-

$$Y = \frac{a}{12} + \frac{b}{144}X$$

Shifting the Trend Origin- While computing trend values, a certain year (or unit of time) is assumed as the point of origin. At times it may be necessary to change the origin of the trend equation to some other point in the series. There is no need to make all the calculations again; only the following adjustment is needed to shift the origin of the trend equation:-

$$Y_c = a + b(X + k)$$

where 'k' is the number of time units shifted. If the origin is shifted forward in time, k will be positive; if shifted backward in time, k will be negative.

Example 7: Suppose the origin in the above example is to be shifted from 1999 to 2002, i.e., 3 years forward, so that k = +3:

$$Y_c = a + b(X + k) = 34.7 + 5.1(X + 3) = 34.7 + 5.1X + 15.3$$

$$Y_c = 50 + 5.1X$$
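
Because only the constant term changes, shifting the origin is a one-line adjustment. The helper name `shift_origin` below is an assumption made for illustration.

```python
def shift_origin(a, b, k):
    """Return (a, b) of the trend line after moving the origin k units.

    Expanding a + b * (X + k) gives (a + b * k) + b * X, so only the
    intercept changes; the slope stays the same.
    """
    return a + b * k, b


# Example 7: origin moved from 1999 to 2002, i.e. k = +3.
new_a, new_b = shift_origin(34.7, 5.1, 3)
print(round(new_a, 2), new_b)   # 50.0 5.1
```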

2. Fitting a Parabolic or Non-linear Trend

There may be many such conditions in economic and business fields, in which a straight
line trend may not represent the long term tendency in its reality. In such cases the best
alternative is to try a parabolic curve of definite powers (second, third or fourth etc.). In this
context the equation of second degree parabola or parabolic curve of the second degree is as
follows:-

$$Y = a + bX + cX^2$$

where 'a' is the Y intercept, 'b' is the slope of the curve at the origin and 'c' is the rate of change in the slope. The values of a, b and c can be determined by solving the following three equations:-

$$\sum Y = Na + b\sum X + c\sum X^2 \quad \ldots\ldots (i)$$

$$\sum XY = a\sum X + b\sum X^2 + c\sum X^3 \quad \ldots\ldots (ii)$$

$$\sum X^2Y = a\sum X^2 + b\sum X^3 + c\sum X^4 \quad \ldots\ldots (iii)$$

If the time origin is taken at the middle point of the time series, $\sum X$ and $\sum X^3$ become zero and the above equations reduce to the following:

$$\sum Y = Na + c\sum X^2 \quad \ldots\ldots (i)$$

$$\sum XY = b\sum X^2 \quad \ldots\ldots (ii)$$

$$\sum X^2Y = a\sum X^2 + c\sum X^4 \quad \ldots\ldots (iii)$$

Note: When $\sum X = 0$ because deviations are taken from the middle year, $\sum X^3$ will also be zero.

Example 8: Let us fit a second degree parabola to the following data:

Year     2000  2001  2002  2003  2004
Value      10    12    15    20    24

Solution:

Year     Y     X (from 2002)   X^2   X^3   X^4   XY    X^2Y
2000     10    -2              4     -8    16    -20    40
2001     12    -1              1     -1     1    -12    12
2002     15     0              0      0     0      0     0
2003     20     1              1      1     1     20    20
2004     24     2              4      8    16     48    96
N = 5    81     0              10     0    34     36   168

In the above calculations $\sum X$ and $\sum X^3$ are equal to zero, therefore the simplified equations will be used:

$\sum Y = Na + c\sum X^2$, or 81 = 5a + 10c ......(i)

$\sum XY = b\sum X^2$, or 36 = 10b ......(ii)

$\sum X^2Y = a\sum X^2 + c\sum X^4$, or 168 = 10a + 34c ......(iii)

As per equation (ii), 36 = 10b, or b = 3.6.

Let us solve the two equations (i) and (iii) simultaneously. Multiplying equation (i) by 2:

162 = 10a + 20c ......(i)

168 = 10a + 34c ......(iii)

Subtracting (i) from (iii):

6 = 14c

c = 6/14 = 0.43 (approximately)

Substituting the value of c in equation (i) to find the value of a:

81 = 5a + 10 x 0.4286

81 - 4.286 = 5a

a = 76.714 / 5 = 15.34

$$Y_c = a + bX + cX^2 = 15.34 + 3.6X + 0.43X^2$$

Year     X     Calculation                          Trend value (Yc)
2000     -2    15.34 + 3.6 x (-2) + 0.43 x 4         9.86
2001     -1    15.34 + 3.6 x (-1) + 0.43 x 1        12.17
2002      0    15.34 + 3.6 x 0 + 0.43 x 0           15.34
2003      1    15.34 + 3.6 x 1 + 0.43 x 1           19.37
2004      2    15.34 + 3.6 x 2 + 0.43 x 4           24.26

The trend values add up to 81.00, the same as the sum of the original values.
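
A second degree parabola with the origin at the middle year can be fitted with the sketch below. The function name `fit_parabola_centred` is hypothetical. The sketch assumes symmetric deviations (so that the sums of X and X^3 vanish) and uses the reduced normal equations, with the third equation taken in its full reduced form, sum(X^2 Y) = a * sum(X^2) + c * sum(X^4).

```python
def fit_parabola_centred(x, y):
    """Fit Y = a + bX + cX^2 when sum(X) = sum(X^3) = 0."""
    if sum(x) != 0 or sum(xi ** 3 for xi in x) != 0:
        raise ValueError("deviations must be taken from the middle point")
    n = len(x)
    s2 = sum(xi ** 2 for xi in x)
    s4 = sum(xi ** 4 for xi in x)
    sy = sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sx2y = sum(xi ** 2 * yi for xi, yi in zip(x, y))
    b = sxy / s2                       # from sum(XY) = b * sum(X^2)
    # Solve  sy   = n * a  + s2 * c
    #        sx2y = s2 * a + s4 * c   by Cramer's rule.
    det = n * s4 - s2 * s2
    a = (sy * s4 - s2 * sx2y) / det
    c = (n * sx2y - s2 * sy) / det
    return a, b, c


x = [-2, -1, 0, 1, 2]                  # Example 8, origin at 2002
y = [10, 12, 15, 20, 24]
a, b, c = fit_parabola_centred(x, y)
trend = [a + b * xi + c * xi ** 2 for xi in x]
print(round(a, 2), round(b, 2), round(c, 2))
```

A quick sanity check on any fit of this kind is that the trend values sum to the same total as the original values.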

3. Semi-logarithmic or Exponential Curve:

If the time series is increasing or decreasing by a constant percentage rather than by a constant amount, the use of a semi-logarithmic or exponential curve is considered appropriate. The equation of this curve is:

$$Y = ab^X \quad\text{or}\quad \log Y = \log a + X\log b$$

The following equations are used to find the values of a and b:-

$$\sum \log Y = N\log a + \log b\sum X \quad \ldots\ldots (i)$$

$$\sum (X\log Y) = \log a\sum X + \log b\sum X^2 \quad \ldots\ldots (ii)$$

If the middle year is taken as origin, $\sum X = 0$ and the equations reduce to

$$\sum \log Y = N\log a \quad\text{or}\quad \log a = \frac{\sum \log Y}{N} \quad \ldots (i)$$

$$\sum (X\log Y) = \log b\sum X^2 \quad\text{or}\quad \log b = \frac{\sum (X\log Y)}{\sum X^2} \quad \ldots (ii)$$
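
The reduced equations for the exponential curve can be sketched as follows. The function name `fit_exponential_centred` and the sample series are hypothetical; the data were chosen to double every year so that the expected answer (a = 40, b = 2) is easy to verify by hand.

```python
import math


def fit_exponential_centred(x, y):
    """Fit Y = a * b**X using logs, with sum(X) = 0."""
    if sum(x) != 0:
        raise ValueError("deviations must be taken from the middle year")
    logs = [math.log10(yi) for yi in y]
    log_a = sum(logs) / len(y)                   # log a = sum(log Y) / N
    log_b = (sum(xi * li for xi, li in zip(x, logs))
             / sum(xi * xi for xi in x))         # log b = sum(X log Y) / sum(X^2)
    return 10 ** log_a, 10 ** log_b


x = [-2, -1, 0, 1, 2]
y = [10, 20, 40, 80, 160]    # hypothetical series growing by 100% a year
a, b = fit_exponential_centred(x, y)
print(round(a, 4), round(b, 4))
```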

The method of least squares is one of the best among all the methods. It is completely objective: trend values are calculated on the basis of well-defined mathematical principles and formulae, so there is no possibility of personal bias in this method. The equation establishes a functional relationship between the X and Y series, and through this relationship forecasts can easily be made for future values. The trend line obtained by this method is the line of best fit, because the sum of the positive and negative deviations of the original data from this line is zero and the sum of the squares of these deviations is minimum.

But at the same time this method is tedious and complicated from the point of
mathematical calculation. If the selection of trend equation (linear, parabolic or some other type)
is not proper, it may lead to fallacious results. Prediction in this method is based on long-term
trend and the impact of seasonal, cyclical or irregular variations is ignored.

Measurement of Short-Term Fluctuations

The values of a time series are the combined results of long-term trend and short-term
fluctuations. Thus, if it is assumed that there is no irregular or random fluctuation in data, short
term fluctuations can be obtained by deducting trend values from the original values. The trend
values are obtained either by moving average method or least squares method (already discussed
in detail).

Measurement of Seasonal Variations

The measurement of seasonal variations is of paramount importance and use in the analysis of time series related to the economic and business world. On the one hand, it helps in short-term planning of business activities, and on the other hand, measures can be taken to avoid losses from such variations. The following are the important methods of measuring seasonal variations:-

1. Simple Average or Seasonal Averages or Seasonal Variation Index Method;

2. Seasonal Variation through Moving Averages;

3. Link relative Method;

4. Ratio-to-trend Method;

5. Ratio-to-moving Average Method.

Let us discuss them one by one:

1. Simple Average or Seasonal Averages or Seasonal Variation Index Method: It is the simplest and easiest of all the methods of measuring seasonal variations. This method is used when there is no definite long-term trend in the data. Its calculation process (in case of monthly data) is as follows:-

(1) First of all, data are arranged by years and months. In the first column the 12 months are mentioned and years are given as headings in subsequent columns.

(2) Then, the total of each month for all the years is obtained. For example, if data for 4 years are given, then the values of January for all the 4 years will be added and this process will be repeated for all the months.

(3) The average for each month is obtained by dividing the total by the number of years. Conveniently, we can call them $\bar{X}_1, \bar{X}_2, \bar{X}_3, \ldots, \bar{X}_{12}$.

(4) After this, the average of the monthly averages is obtained by dividing their total by 12. It is called the general average:

$$\bar{X} = \frac{\bar{X}_1 + \bar{X}_2 + \cdots + \bar{X}_{12}}{12}$$

(5) Taking the average of monthly averages (general average) equal to 100, the seasonal (variation) index for each month is calculated by the following formula:-

$$\text{Seasonal (Variation) Index No.} = \frac{\text{Monthly Average of Specific Month}}{\text{General Average}} \times 100$$

For example, the seasonal index no. for the month of January will be calculated as $\dfrac{\bar{X}_1}{\bar{X}} \times 100$.

Note: It should be checked that the sum of the seasonal indices is 1200 for monthly data and 400 for quarterly data.

Example 9: Let us calculate seasonal indices from the following data:

Year   Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
2000    15   16   18   18   23   23   20   28   29   33   33   38
2001    23   22   28   27   31   28   22   28   32   37   34   44
2002    25   25   35   36   36   30   30   34   38   47   41   53

Solution:

Month     2000   2001   2002   Total   M.A.*       S. Index
Jan         15     23     25      63     21           70
Feb         16     22     25      63     21           70
Mar         18     28     35      81     27           90
Apr         18     27     36      81     27           90
May         23     31     36      90     30          100
Jun         23     28     30      81     27           90
Jul         20     22     30      72     24           80
Aug         28     28     34      90     30          100
Sep         29     32     38      99     33          110
Oct         33     37     47     117     39          130
Nov         33     34     41     108     36          120
Dec         38     44     53     135     45          150
Total                           1080    360         1200
Average                           90    G.A. = 30    100

* M.A. = Monthly average, S. Index = Seasonal index number, G.A. = General average

$$\text{Seasonal variation index no.} = \frac{\text{Monthly average}}{\text{General average}} \times 100$$
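
The simple average method reduces to two rounds of averaging. The sketch below uses a hypothetical helper `simple_average_indices` that takes one list per year, with one value per month (or quarter).

```python
def simple_average_indices(rows):
    """Seasonal indices by the simple average method.

    rows: one list per year, each holding one value per season.
    """
    n_years = len(rows)
    n_seasons = len(rows[0])
    # Average of each season (month) across the years.
    season_means = [sum(row[j] for row in rows) / n_years
                    for j in range(n_seasons)]
    # General average = average of the seasonal averages.
    general = sum(season_means) / n_seasons
    return [100 * m / general for m in season_means]


data = [  # Example 9: Jan..Dec for 2000, 2001, 2002
    [15, 16, 18, 18, 23, 23, 20, 28, 29, 33, 33, 38],
    [23, 22, 28, 27, 31, 28, 22, 28, 32, 37, 34, 44],
    [25, 25, 35, 36, 36, 30, 30, 34, 38, 47, 41, 53],
]
indices = simple_average_indices(data)
print([round(i) for i in indices])
```

By construction the indices add up to 100 times the number of seasons, i.e. 1200 for monthly data.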

II. Seasonal Variation through Moving Average

If the original data of the time series are affected by trend, then seasonal variations can be measured by using moving averages. An important advantage of this method is that almost all types of variations (short-term, seasonal and irregular) can be measured by it. This method is based on the additive model of time series. Its process is outlined below:-

1. First of all, moving averages of the data are computed. If the data are on a quarterly basis, then four quarterly, and if they are monthly, then twelve monthly moving averages are calculated. In both cases, the period of the moving average being even, these averages will be centered.

2. From each original value (O), the corresponding moving average figure (T) will be deducted to find the short-term oscillation [O - T = S + C + I].

3. After this a separate table is prepared in which the short-term oscillations are added on the basis of months or quarters and their means are obtained. These means are called seasonal variations.

Note: If the corresponding seasonal variations are deducted from the short-term oscillations, the differences give the irregular fluctuations.

Example 10: From the following data let us calculate the seasonal variations:

Year    Summer   Monsoon   Autumn   Winter
2000      20        80       60      140
2001      35       110       85      175
2002      40       125      100      225
2003      55       175      135      225
2004      62       250      125      310

Solution:

Year   Season      O     M.T    T.C      T        O-T       S.V*
2000   Summer      20
       Monsoon     80
       Autumn      60    300    615     76.88    -16.88    -17.88
       Winter     140    315    660     82.50     57.50     71.75
2001   Summer      35    345    715     89.38    -54.38    -78.85
       Monsoon    110    370    775     96.88     13.12     30.81
       Autumn      85    405    815    101.88    -16.88    -17.88
       Winter     175    410    835    104.38     70.62     71.75
2002   Summer      40    425    865    108.13    -68.13    -78.85
       Monsoon    125    440    930    116.25      8.75     30.81
       Autumn     100    490    995    124.38    -24.38    -17.88
       Winter     225    505   1060    132.50     92.50     71.75
2003   Summer      55    555   1145    143.13    -88.13    -78.85
       Monsoon    175    590   1180    147.50     27.50     30.81
       Autumn     135    590   1187    148.38    -13.38    -17.88
       Winter     225    597   1269    158.63     66.37     71.75
2004   Summer      62    672   1334    166.75   -104.75    -78.85
       Monsoon    250    662   1409    176.13     73.87     30.81
       Autumn     125    747
       Winter     310

O = Original data, M.T = Four quarterly moving totals (each total lies between the season on whose row it is shown and the preceding season), T.C = Total centered (sum of two adjacent moving totals), T = Quarterly moving averages (T.C / 8), O-T = Short-term oscillation, S.V* = Seasonal variation, calculated below by arranging the short-term oscillations season-wise:

Year       Summer     Monsoon    Autumn     Winter
2000                             -16.88      57.50
2001       -54.38      13.12     -16.88      70.62
2002       -68.13       8.75     -24.38      92.50
2003       -88.13      27.50     -13.38      66.37
2004      -104.75      73.87
Total     -315.39     123.24     -71.52     286.99
Average    -78.85      30.81     -17.88      71.75
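
The additive procedure (centered moving average, then O - T, then season-wise means) can be sketched as below. The function names are hypothetical, and the series uses 225 for Winter 2003, the figure that appears in the worked solution. Results are computed from unrounded moving averages, so they may differ from hand calculations in the second decimal place.

```python
def centred_moving_average(values, period):
    """Centred moving average for an even period; None where undefined."""
    totals = [sum(values[i:i + period])
              for i in range(len(values) - period + 1)]
    result = [None] * len(values)
    for j in range(len(totals) - 1):
        result[j + period // 2] = (totals[j] + totals[j + 1]) / (2 * period)
    return result


def additive_seasonal_variation(values, period):
    """Mean of (O - T) for each season, T being the centred average."""
    trend = centred_moving_average(values, period)
    sums = [0.0] * period
    counts = [0] * period
    for i, (o, t) in enumerate(zip(values, trend)):
        if t is None:
            continue              # no trend value at the ends of the series
        sums[i % period] += o - t
        counts[i % period] += 1
    return [s / c for s, c in zip(sums, counts)]


quarters = [20, 80, 60, 140,      # 2000: Summer, Monsoon, Autumn, Winter
            35, 110, 85, 175,     # 2001
            40, 125, 100, 225,    # 2002
            55, 175, 135, 225,    # 2003
            62, 250, 125, 310]    # 2004
seasonal = additive_seasonal_variation(quarters, 4)
print([round(s, 2) for s in seasonal])
```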

(III) Link Relative Method

The procedure for computing seasonal variation (indices) by this method is as follows:-

1. First of all, link relatives (L.R.) for each seasonal figure (monthly or quarterly) are calculated by the following formula:-

$$\text{L.R.} = \frac{\text{Value of Current Season}}{\text{Value of Previous Season}} \times 100$$

2. The average of the link relatives of each season is computed.

3. These averages are converted into chain relatives, taking the chain relative of the first season equal to 100. For this purpose the following formula is used:-

$$\text{Chain Relative (C.R.) of Current Season} = \frac{\text{L.R. of Current Season} \times \text{C.R. of Previous Season}}{100}$$

4. The chain relative of the first season is then calculated again on the basis of the last season, as below:-

$$\text{Calculated C.R. of First Season} = \frac{\text{C.R. of Last Season} \times \text{L.R. of First Season}}{100}$$

5. It should be remembered that theoretically the calculated chain relative of the first season should be 100, but in practice it is usually not exactly 100. In such a case a correction factor is used. If the calculated C.R. of the first season exceeds 100, the correction factor is deducted from the C.R. of the other seasons. On the contrary, if the calculated C.R. of the first season is below 100, the correction factor is added. If the data are quarterly, the adjustment of the correction factor would be as follows:-

(i) The index no. of the first quarter would remain 100 and there would be no adjustment.

(ii) In the second quarter there would be an adjustment of the per season average difference (1d).

(iii) There will be an adjustment of 2d in the third quarter and 3d in the fourth quarter.

The per season average difference (d) is calculated as follows:-

$$d = \frac{\text{Calculated C.R. of First Season} - 100}{4}$$

After making the adjustment of the correction factor, the adjusted chain relatives are prepared.

6. These adjusted chain relatives are converted into seasonal indices, assuming the average of the adjusted C.R. equal to 100.

Note: The total of seasonal indices would be 400 for quarterly data and 1200 for monthly data.

Example 11: From the following data let us calculate the index of seasonal variation through
Link Relative Method:

Year     I     II    III    IV
2000     2     4     6      8
2001     4     4     5      8
2002     4     6     3     12
2003     6     6     8     12
I, II, III, IV are prices in the four quarters.

Solution: First of all, link relatives will be calculated. There will be no computation for the first quarter of 2000, because no previous value is available. The link relatives for the other quarters will be computed as follows:

$$\text{L.R.} = \frac{\text{Price of Current Quarter}}{\text{Price of Previous Quarter}} \times 100$$

On this basis the L.R. of the second quarter of 2000 = (4 x 100)/2 = 200, of the third quarter = (6 x 100)/4 = 150, and this will be repeated till the last quarter of 2003, for which L.R. = (12 x 100)/8 = 150.

Computation of seasonal variation indices on the basis of the Link Relative Method:

Year                            I        II       III      IV
2000                            ----     200      150      133.3
2001                            50       100      125      160
2002                            50       150       50      400
2003                            50       100      133.3    150
Total of L.R.                   150      550      458.3    843.3
Average of L.R.                 50       137.5    114.6    210.8
Chain relatives                 100      137.5    157.6    332.2
Adjusted chain relatives        100      121.0    124.6    282.7
Seasonal variation index no.    63.65    77.02    79.31    179.95

Working:

Chain relatives: II = (100 x 137.5)/100 = 137.5; III = (137.5 x 114.6)/100 = 157.6; IV = (157.6 x 210.8)/100 = 332.2

Adjusted chain relatives: II = 137.5 - 16.5 = 121.0; III = 157.6 - 33.0 = 124.6; IV = 332.2 - 49.5 = 282.7

Seasonal variation index nos.: I = (100 x 100)/157.1 = 63.65; II = (121 x 100)/157.1 = 77.02; III = (124.6 x 100)/157.1 = 79.31; IV = (282.7 x 100)/157.1 = 179.95

Note:
1. In the computation of the average of L.R., the total of L.R. of quarter I has been divided by 3, while the totals of the other quarters have been divided by 4.

2. The chain relative of the first quarter is assumed as 100, while the C.R. of the other quarters has been calculated by the respective formula.

3. The calculated C.R. of the first quarter on the basis of the fourth quarter is (50 x 332.2)/100 = 166.1, and on this basis the correction factor per quarter will be (166.1 - 100)/4 = 16.5.

4. Average of adjusted C.R. = (100 + 121 + 124.6 + 282.7)/4 = 628.3/4 = 157.1

Seasonal index nos. have been calculated with the help of the following formula:

$$\text{Seasonal Index No.} = \frac{\text{Adjusted C.R.}}{\text{Average of Adjusted C.R.}} \times 100$$
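
The six steps of the link relative method can be sketched as follows. The function name `link_relative_indices` is hypothetical. Because the sketch carries full precision through every step, its indices differ from the hand-rounded figures of Example 11 in the second decimal place.

```python
def link_relative_indices(rows):
    """Seasonal indices by the link relative method.

    rows: one list per year, each holding one value per season.
    """
    n = len(rows[0])
    flat = [v for row in rows for v in row]
    # Steps 1-2: link relatives and their season-wise averages.
    sums = [0.0] * n
    counts = [0] * n
    for i in range(1, len(flat)):          # first value has no predecessor
        sums[i % n] += 100 * flat[i] / flat[i - 1]
        counts[i % n] += 1
    avg_lr = [s / c for s, c in zip(sums, counts)]
    # Step 3: chain relatives, first season fixed at 100.
    chain = [100.0]
    for j in range(1, n):
        chain.append(avg_lr[j] * chain[-1] / 100)
    # Steps 4-5: recompute the first season and spread the discrepancy.
    first_again = avg_lr[0] * chain[-1] / 100
    d = (first_again - 100) / n            # negative d means an addition
    adjusted = [chain[j] - j * d for j in range(n)]
    # Step 6: express as percentages of the average adjusted C.R.
    mean_adjusted = sum(adjusted) / n
    return [100 * c / mean_adjusted for c in adjusted]


prices = [[2, 4, 6, 8], [4, 4, 5, 8], [4, 6, 3, 12], [6, 6, 8, 12]]
indices = link_relative_indices(prices)   # Example 11
print([round(i, 2) for i in indices])
```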

(IV) Ratio-to-trend Method

This method is based on multiplicative model. Its calculation process is as follows:-

1. First of all, trend values are obtained by the method of least squares. If the values given in the question are on a quarterly basis, the trend equation should be fitted after computing the arithmetic mean of the values of the four quarters of every year. The rate of change obtained by the calculation of 'b' will be annual, and it is divided by 4 to find the quarterly rate of change. The annual trend value refers to the middle of the year, so half of the quarterly rate is deducted from it to obtain the trend value of the second quarter and added to it to obtain the trend value of the third quarter. After this, the trend values of the other quarters are obtained on the basis of the same quarterly rate of change. If the data are on a monthly basis, the monthly rate of change will be calculated by dividing the annual rate of change by 12.

2. Now trend is eliminated on the basis of the multiplicative model, for which the original data of all seasons (monthly or quarterly) are divided by the corresponding trend values and the quotient is multiplied by 100, i.e., $\dfrac{O \times 100}{T}$. These values are known as 'Ratio-to-Trend' or 'Percentage Trend Value'.

3. To eliminate the cyclical and irregular variations, these values are arranged season-wise for the various years and the arithmetic means of these values are computed. After this the general average of the above means is calculated.

4. Finally, seasonal indices are calculated on the basis of the following formula:-

$$\text{Seasonal Index No.} = \frac{\text{Seasonal Average of Ratio-to-trend}}{\text{General Average of Ratio-to-trend}} \times 100$$

Example 12: Taking the data of the first three years of Example 11, let us calculate seasonal variation indices by the ratio-to-trend method:

Year     I     II    III    IV
2000     2     4     6      8
2001     4     4     5      8
2002     4     6     3     12
I, II, III, IV are prices in the four quarters.

Solution:

Year     Total of four quarters   Quarterly average (Y)   X     X^2    XY       Yc = a + bX
2000     20                       5.00                    -1    1      -5.00    4.875
2001     21                       5.25                     0    0       0       5.500
2002     25                       6.25                     1    1       6.25    6.125
N = 3                             16.50                    0    2       1.25

$$a = \frac{\sum Y}{N} = \frac{16.50}{3} = 5.5, \qquad b = \frac{\sum XY}{\sum X^2} = \frac{1.25}{2} = 0.625$$

b, the annual growth rate, is 0.625, so the quarterly growth rate is 0.625/4 = 0.156 (0.15625).

On the basis of this we can easily find the quarterly trend values. Considering the year 2000, the trend value is 4.875. This is the value for the middle of the year 2000, i.e., the point between the second and third quarters. Half of the quarterly growth rate is 0.078. On this basis the trend value for the second quarter of 2000 would be 4.875 - 0.078 = 4.797 and for the third quarter 4.875 + 0.078 = 4.953. The value for the first quarter of 2000 would be 4.797 - 0.156 = 4.641 and for the last quarter 4.953 + 0.156 = 5.109. Similarly, trend values for the various quarters of the other years can be calculated. These values are tabulated below:

Year     I        II       III      IV
2000     4.641    4.797    4.953    5.109
2001     5.266    5.422    5.578    5.734
2002     5.891    6.047    6.203    6.359

Now let us compute the quarterly values as a percentage of the trend values, (O x 100)/T:

Year               I         II        III       IV
2000               43.09     83.39     121.14    156.59
2001               75.96     73.77      89.64    139.52
2002               67.90     99.22      48.36    188.71
Total              186.95    256.38    259.14    484.82
Average            62.32     85.46      86.38    161.61
Seasonal index     62.99     86.38      87.31    163.34

Average of the ratio-to-trend = (62.32 + 85.46 + 86.38 + 161.61)/4 = 395.77/4 = 98.94

Each seasonal index is the quarterly average divided by 98.94 and multiplied by 100, e.g., for the first quarter (62.32 x 100)/98.94 = 62.99. The four indices total 400.02, i.e., 400 apart from rounding.
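
The ratio-to-trend procedure can be sketched as below. The function name `ratio_to_trend_indices` is hypothetical. The sketch assumes an odd number of years, fits the straight line to the yearly averages of the seasonal values (so that 'a' and 'b' are in the same units), and places each season's trend value at its offset from mid-year. It works with unrounded trend values, so hand calculations may differ slightly in the second decimal place.

```python
def ratio_to_trend_indices(rows):
    """Seasonal indices by the ratio-to-trend method.

    rows: one list per year, each holding one value per season.
    """
    n_years = len(rows)
    n_seasons = len(rows[0])
    if n_years % 2 == 0:
        raise ValueError("this sketch needs an odd number of years")
    # Trend fitted to yearly averages, origin at the middle year.
    yearly_means = [sum(row) / n_seasons for row in rows]
    x = [i - n_years // 2 for i in range(n_years)]
    a = sum(yearly_means) / n_years
    b = (sum(xi * yi for xi, yi in zip(x, yearly_means))
         / sum(xi * xi for xi in x))
    step = b / n_seasons                   # change per season
    season_means = []
    for j in range(n_seasons):
        # Distance of season j from the middle of its year, in seasons.
        offset = j - (n_seasons - 1) / 2
        ratios = [100 * row[j] / (a + b * xi + step * offset)
                  for xi, row in zip(x, rows)]
        season_means.append(sum(ratios) / n_years)
    general = sum(season_means) / n_seasons
    return [100 * m / general for m in season_means]


prices = [[2, 4, 6, 8], [4, 4, 5, 8], [4, 6, 3, 12]]   # 2000-2002
indices = ratio_to_trend_indices(prices)
print([round(i, 2) for i in indices])
```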

(V) Ratio-to-moving Average Method

This method is also known as percentage of moving average method. The steps involved
in the computation of seasonal indices by this method are as follows:-

1. First of all, twelve monthly or four quarterly (as the case may be) centered moving averages are calculated in order to eliminate seasonality from the data. These moving averages represent T x C.

2. The original data of each season (O) are divided by the corresponding moving average (T x C) and this ratio (ratio-to-moving average) is expressed as a percentage:

$$\frac{O}{T \times C} \times 100 = \frac{T \times S \times C \times I}{T \times C} \times 100 = S \times I \times 100$$

3. The ratio-to-moving averages are arranged month-wise or quarter-wise (as the case may be) and season-wise arithmetic means are calculated. In this process irregular fluctuations are eliminated to a great extent.

4. Finally, seasonal variation indices are obtained by taking the general average of the arithmetic means equal to 100.

Example 13: Taking the same figures as that of example 10, let us compute the seasonal
variation indices by ratio-to-moving average method:

Solution: The moving averages have already been calculated in example 10. After this step the
calculations will be as follows:

Year   Season      O     M.T    T.C      T        R-T-M Av
2000   Summer      20
       Monsoon     80
       Autumn      60    300    615     76.88     78.04
       Winter     140    315    660     82.50    169.70
2001   Summer      35    345    715     89.38     39.16
       Monsoon    110    370    775     96.88    113.54
       Autumn      85    405    815    101.88     83.43
       Winter     175    410    835    104.38    167.66
2002   Summer      40    425    865    108.13     36.99
       Monsoon    125    440    930    116.25    107.53
       Autumn     100    490    995    124.38     80.40
       Winter     225    505   1060    132.50    169.81
2003   Summer      55    555   1145    143.13     38.43
       Monsoon    175    590   1180    147.50    118.64
       Autumn     135    590   1187    148.38     90.98
       Winter     225    597   1269    158.63    141.84
2004   Summer      62    672   1334    166.75     37.18
       Monsoon    250    662   1409    176.13    141.94
       Autumn     125    747
       Winter     310

O = Original data, M.T = Moving totals, T.C = Total centered, T = Quarterly moving averages (T.C / 8), R-T-M Av = Ratio-to-moving average (O x 100 / T).

Seasonal variation indices will be calculated after arranging the ratio-to-moving averages season-wise in a new table as given below:

Years                         Summer    Monsoon    Autumn    Winter
2000                                                78.04    169.70
2001                           39.16    113.54      83.43    167.66
2002                           36.99    107.53      80.40    169.81
2003                           38.43    118.64      90.98    141.84
2004                           37.18    141.94
Total                         151.76    481.65     332.85    649.01
Average                        37.94    120.41      83.21    162.25
Seasonal variation indices     37.58    119.28      82.43    160.72

General Average = (37.94 + 120.41 + 83.21 + 162.25)/4 = 403.81/4 = 100.95

Each seasonal variation index is the seasonal average divided by 100.95 and multiplied by 100, e.g., for Summer (37.94 x 100)/100.95 = 37.58.
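
The ratio-to-moving average steps can be sketched in Python as follows. The function names are hypothetical, and the series is the one used in the worked solution of Example 10 (with 225 for Winter 2003). The sketch keeps full precision, so its indices may differ from hand-rounded figures by a few hundredths.

```python
def centred_moving_average(values, period):
    """Centred moving average for an even period; None where undefined."""
    totals = [sum(values[i:i + period])
              for i in range(len(values) - period + 1)]
    result = [None] * len(values)
    for j in range(len(totals) - 1):
        result[j + period // 2] = (totals[j] + totals[j + 1]) / (2 * period)
    return result


def ratio_to_moving_average_indices(values, period):
    """Seasonal indices from percentages of the centred moving average."""
    trend = centred_moving_average(values, period)
    sums = [0.0] * period
    counts = [0] * period
    for i, (o, t) in enumerate(zip(values, trend)):
        if t is None:
            continue
        sums[i % period] += 100 * o / t    # ratio-to-moving average
        counts[i % period] += 1
    season_means = [s / c for s, c in zip(sums, counts)]
    general = sum(season_means) / period
    return [100 * m / general for m in season_means]


quarters = [20, 80, 60, 140, 35, 110, 85, 175, 40, 125, 100, 225,
            55, 175, 135, 225, 62, 250, 125, 310]
indices = ratio_to_moving_average_indices(quarters, 4)
print([round(i, 2) for i in indices])   # Summer, Monsoon, Autumn, Winter
```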
Measurement of Cyclical Fluctuations

Croxton and Cowden have mentioned the following four methods in their book „Applied
General Statistics‟ for the measurement of cyclical fluctuations:-

1. Residual Method 2. Direct Method

3. Harmonic Analysis Method 4. Method of Cyclical Averages

Amongst all these methods, the Residual method is most commonly used. Cyclical
fluctuations can be measured by this method on the basis of both:- additive and multiplicative
models.

According to multiplicative model, first of all trend values (T) and seasonal indices (S)
are calculated. After this original data (O) are divided by trend (T) to obtain SCI and SCI is
divided by seasonal indices (S) to get CI. Then three or five yearly moving weighted averages
are taken of CI values. In three yearly moving averages weights are given as 1, 2, 1, while in five
yearly moving averages weights of 1, 2, 4, 2 and 1 are given. These moving averages express
cyclical fluctuations.

On the basis of additive model short-term fluctuations (S + C+ I) are obtained by


deducting trend values (T) from original data (O). Then C + I is obtained by deducting S from S
+ C + I. If there is no irregular fluctuation in the series C + I will express cyclical fluctuations. If
there are irregular fluctuations, then after eliminating them cyclical fluctuations can be measured.
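The residual method under the multiplicative model may be sketched as follows. This is only an illustration: the function names and the figures used are hypothetical and are not taken from any example in this lesson. Seasonal indices are assumed to be expressed as percentages.

```python
def cyclical_irregular_relatives(original, trend, seasonal_index):
    """CI values as percentages: O is divided by T to get SCI, then by S to get CI."""
    return [o / t / (s / 100) * 100 for o, t, s in zip(original, trend, seasonal_index)]


def weighted_moving_average(values, weights=(1, 2, 1)):
    """Weighted moving average; weights 1, 2, 1 (three items) or 1, 2, 4, 2, 1 (five items)."""
    span = len(weights)
    total_weight = sum(weights)
    return [
        sum(w * v for w, v in zip(weights, values[i:i + span])) / total_weight
        for i in range(len(values) - span + 1)
    ]


# Hypothetical figures, for illustration only.
original = [110, 95, 120, 100, 130]
trend = [100, 102, 104, 106, 108]
seasonal_index = [105, 95, 110, 90, 105]

ci = cyclical_irregular_relatives(original, trend, seasonal_index)
cyclical = weighted_moving_average(ci)   # smooths out I, leaving an estimate of C
print([round(c, 2) for c in cyclical])
```

Averaging with the weights 1, 2, 1 gives the middle item twice the importance of its neighbours, so that irregular movements are smoothed without shifting the position of the cycle.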

Example 14: Let us calculate cyclical variation from the following data using the additive model:

Years   2000   2001   2002   2003   2004
Y       35     45     50     65     85

Solution:
The first step is to calculate the trend values (Yc) by the method of least squares (Yc = a + bX). Cyclical fluctuations are then measured by deducting Yc from Y, on the assumption that there is no seasonal variation in the series. Alternatively, O can be substituted in place of Y and T in place of Yc.
Year Y Yc (Y-Yc) or (O-T)
2000 35 32 +3
2001 45 44 +1
2002 50 56 -6
2003 65 68 -3
2004 85 80 +5
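The trend values and residuals in the above table can be checked with a few lines of Python. This is a minimal sketch; the function name `linear_trend` is an assumption made for illustration. Deviations of the years are taken from the middle year so that ΣX = 0, in which case a = ΣY/N and b = ΣXY/ΣX².

```python
def linear_trend(years, values):
    """Fit Yc = a + bX by least squares, with X measured from the mean of the years."""
    n = len(values)
    middle = sum(years) / n
    x = [year - middle for year in years]            # deviations, so that sum(x) == 0
    a = sum(values) / n                              # a = sum(Y) / N
    b = sum(xi * yi for xi, yi in zip(x, values)) / sum(xi * xi for xi in x)
    return a, b, [a + b * xi for xi in x]


years = [2000, 2001, 2002, 2003, 2004]
y = [35, 45, 50, 65, 85]

a, b, trend = linear_trend(years, y)
residuals = [yi - ti for yi, ti in zip(y, trend)]    # (Y - Yc), i.e. (O - T)
print(a, b)
print(trend)
print(residuals)
```

Here a = 56 and b = 12, which reproduces the Yc column (32, 44, 56, 68, 80) and the residuals (+3, +1, -6, -3, +5).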

• Measurement of Irregular or Random Fluctuations

Irregular fluctuations can also be measured by the residual method, and either the additive or the multiplicative model may be used. According to the additive model, irregular fluctuations can be measured by $I = O - T - S - C$, and according to the multiplicative model by $I = \frac{O}{T \times S \times C}$. In practice, the cycle itself is so erratic and so interwoven with irregular movements that it is impossible to separate them. Hence, in the analysis of time series, trends and seasonal movements are usually measured directly while cyclical and irregular fluctuations are left together, and thus irregular fluctuations are measured as O - T - S or C + I.

Example 15: Let us measure irregular fluctuations in the data given in example 10. There is no cyclical fluctuation in the data.

Solution: Let us look back at example 10, where we have already calculated the short-term oscillations (S+C+I) and seasonal variations (S). Here C+I is obtained by deducting S from S+C+I. As the question states that there are no cyclical fluctuations in the data, the value of C is zero and C+I is itself equal to the irregular fluctuations:

Year Season O O-T S.V I.R


2000 Summer 20
Monsoon 80
Autumn 60 -41.88 -64.88 -23
Winter 140 27.50 45.50 +18

2001 Summer 35 -4.38 -15.60 -11.22
Monsoon 110 78.13 99.56 +21.43
Autumn 85 -61.86 -64.88 -3.02
Winter 175 20.62 45.50 +24.88
2002 Summer 40 -8.13 -15.60 -7.47
Monsoon 125 108.75 99.56 +9.19
Autumn 100 -69.38 -64.88 -4.50
Winter 225 42.50 45.50 +3
2003 Summer 55 -8.12 -15.60 -7.48
Monsoon 175 77.50 99.56 +22.06
Autumn 135 -86.38 -64.88 -21.50
Winter 225 91.37 45.50 +45.87
2004 Summer 62 -41.75 -15.60 -26.15
Monsoon 250 133.87 99.56 +34.31
Autumn 125
Winter 310
O-T = Short term Oscillation, S.V = Seasonal Variation, I.R= Irregular Fluctuation

• Deseasonalisation of Data

There are two objectives of studying seasonal variations: (a) to measure them and (b) to eliminate them from the given series. Elimination of the seasonal effects from the series is termed deseasonalisation of data. If the multiplicative model is assumed, the following formula is used for deseasonalisation:

$\text{Deseasonalised value} = \frac{Y \times 100}{\text{Seasonal Index}}$
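As an illustration of the formula, the sketch below deseasonalises the four quarterly figures of the year 2001 from Example 13, using the seasonal indices obtained there. The function and variable names are assumptions made for this illustration only.

```python
def deseasonalise(value, seasonal_index):
    """Deseasonalised value = (Y x 100) / Seasonal Index (multiplicative model)."""
    return value * 100 / seasonal_index


# Figures of 2001 and the seasonal variation indices from Example 13.
seasons = ["Summer", "Monsoon", "Autumn", "Winter"]
original_2001 = [35, 110, 85, 175]
seasonal_index = [37.57, 119.26, 82.44, 160.70]

deseasonalised_2001 = [
    deseasonalise(y, s) for y, s in zip(original_2001, seasonal_index)
]
for season, value in zip(seasons, deseasonalised_2001):
    print(season, round(value, 2))
```

After deseasonalisation the four quarters lie much closer to one another than the original figures, since the regular seasonal rise and fall has been removed.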

2.7 Forecasting: Concept and Methods of Forecasting.


• Meaning

Business forecasting means estimating future prospects on the basis of a scientific analysis of available knowledge and information relating to past trends, present conditions and expected possibilities of business activities. Different authors define it differently:

1. "Business forecasting is the analysis of statistical data and other economic, political and market information for the purpose of reducing the risks involved in making business decisions and long-range plans".

2. "Business forecasting is the calculation of reasonable probabilities about the future, based on the analysis of all the latest relevant information by tested and logically sound statistical and econometric techniques, as interpreted, modified and applied in terms of an executive's personal judgment and social knowledge of his own business and his own industry and trade".

Is forecasting different from prediction or projection?

Yes, it is. First let us understand the meaning of prediction and projection. A Prediction is an estimate of the future based on past data or information. In simple words, it is a purely mechanical extrapolation, in which present conditions and future possibilities are not taken into account. Telling the future of a person on the basis of his horoscope is a simple and practical example of prediction. A Projection, on the other hand, is a future estimate in which certain numerical assumptions are combined with the past data. For example, in estimating the number of persons in a town, assumptions about the birth and death rates, marriage rate, sex ratio, etc. are also taken into account.

A Forecast is something beyond this, i.e. the adjustment and analysis of past data and
present conditions in a scientific manner, including the experiences, decisions and subjective
approaches of the forecaster. In this sense, forecasting involves use of all our knowledge, from
whatever source, about the situation.

• Nature of Business Forecasting

It is necessary to understand the nature of business forecasting:

1. It is based on cause and effect relationships and therefore involves scientific methods. However, since it involves practical application, it is also an art.

2. It expresses the future possibilities on the basis of past trends and present conditions.

3. No doubt business forecasting is made on numerical terms but at the same time some
subjective factors such as experience of businessman, advice of experts, opinion of
consumers etc., play an important role in making the forecasts more viable.

4. Forecasting works on approximate values and the level of its accuracy depends upon
the choice of assumptions and techniques used.

5. Business forecasting cannot be complete and final in one attempt; it is modified in the light of changed circumstances, policies and conditions, i.e., it is flexible.

• Objectives of Business Forecasting

The objectives of business forecasting may be stated as under:

1. To present information about possible events in future;

2. To provide the base for determination of future policies;

3. To provide proper and logical grounds for managerial decisions in case of uncertainties;

4. To reduce the risk arising out of business fluctuations;

5. To assist in planning and direction of business activities;

6. To estimate the degree of error in forecasting.

• Why is business forecasting important not only for the businessman but also for other groups of society?

Business forecasting is of utmost importance to various groups, as discussed below:

1. Importance for Business: Forecasting provides a very important and essential base for business activities. It helps the business executive in estimating various aspects such as demand for goods, stock and purchase of raw material, weather conditions, fashion trends, the purchasing power of the people and their tastes and preferences, etc. The accuracy of these estimates affects his decisions, which in turn affect the profits and losses of the business in future.

2. Control of Trade Cycles: The trade cycle, in which periods of boom, recession and recovery occur, is an important feature of the business world. The onset of these periods can be estimated on the basis of business forecasting, and precautionary steps may accordingly be taken to minimize the harmful effects of trade cycles.

3. Useful to Society: Business forecasting also helps different segments of society such as consumers, retailers, fashion designers, agriculturists, housewives, etc. to plan their activities properly. Consumers and retailers can determine the purchase and storage of commodities, fashion designers can take decisions in respect of trends and fashion, agriculturists can decide the pattern and volume of crop production on the basis of rainfall and weather conditions, and housewives can plan their budgets and purchases according to weather, trends, income, inflation, etc.

4. Importance to Government: Business forecasting plays a very important role in the determination of the economic policies of the government. The decisions of government relating to control of inflation, credit policy, global trade, etc. are taken on the basis of business forecasting. Government also prepares five-year budgets on the basis of past figures and future estimates.

• Methods or Techniques of Business Forecasting

Business forecasting plays an important role in planning business activities. In recent years a number of scientific statistical techniques have been developed for this purpose. Some of the important methods or techniques of forecasting are as follows:

• Index Number or Business Barometers – Index numbers are regarded as the indicators or barometers of business activity. They are constructed for measuring various changes taking place in business phenomena such as national product, industrial and agricultural production, prices, wages, bank deposits, bank credit, employment, prices of shares, foreign trade, etc. While a general index of business activity is prepared by combining different activities in the fields of production, trade, finance, etc., specific index numbers for a particular industry or business may also be prepared. With the help of these index numbers, changes in business activities are analyzed and possible changes in future are forecast.

• Extrapolation or Mathematical Projection – It is a very simple technique of forecasting, in which the value of a variable at some future point of time is projected on the basis of data pertaining to that variable for a certain period. It is assumed in this technique that the past behavior of the data will be maintained in the future also, i.e., rigidity and consistency in the behavior of the variable are assumed. Evidently, this method can be used only in circumstances where there are no sudden changes or fluctuations. This method is very popular in demand or sales forecasting. Some of the common curves used for this purpose are: Arithmetic trend, Semi-log trend, Modified exponential trend, Logistic curve and Gompertz curve. In selecting the most appropriate curve, both empirical and theoretical considerations are to be taken into account.

• Time Series Analysis – If data for several years are available and the long-term trend, seasonal variations and cyclical fluctuations in the data are visible, the technique of time series analysis can be applied. In this technique the various components of the time series are first separated, and the forecast is then made by putting them back together through the process of synthesis. Forecasts made by this technique may contain errors owing to the effect of non-measurable cyclical fluctuations and sudden changes; even so, it has to be recognized that the scientific analysis of time series provides an important and logical base for forecasting.

• Regression Analysis – Regression analysis has also contributed a lot to business forecasting. In this technique, an effort is made to forecast or estimate on the basis of the nature and degree of the mutual relationship between different variables, such as the relationship between rainfall and agricultural production, expenditure on advertisement and volume of sales, etc. The assumption of this technique is that there is a functional relationship between the variables.

• Econometric Method – The econometric method, developed by combining economic theories, mathematical calculations and statistical procedures, is also being used in business forecasting. In recent years this method has gained popularity on account of the increased use of computer technology. The econometric method assumes that the behavior of the economic system is guided by numerous economic factors, which can be expressed by econometric models built from various simultaneous equations. The values of the constants in these equations are obtained from the analysis of time series and economic indices. On the whole, the econometric method is a process in which efforts are made to investigate and measure the quantitative aspects of the actual operation of economic activities and to forecast certain economic phenomena at a specific level of probability.

• Opinion Poll – It is a subjective technique of forecasting in which the views or opinions of related persons and experts are solicited in respect of possible changes in future and their effects, and the forecast is made on the basis of an analysis of the opinions collected. This technique is very popular in the area of sales management.

• Factor-testing Method – Under this method, a forecast of business conditions is made by a descriptive analysis of the various factors that are going to affect the future. The unfavorable effects of each factor are reviewed and the forecast is made keeping in view the effects of all possible causes and factors expected to occur in future.
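Of the techniques listed above, extrapolation lends itself most readily to a short numerical illustration. The following Python sketch fits a semi-log trend (a straight line to the logarithms of the values) and projects it one period ahead. The sales figures and the function name `semi_log_forecast` are hypothetical and serve only to illustrate the idea.

```python
import math


def semi_log_forecast(values, steps_ahead=1):
    """Fit log Y = a + bX by least squares and extrapolate `steps_ahead` periods."""
    n = len(values)
    x = list(range(n))                                   # time periods 0, 1, 2, ...
    logs = [math.log(v) for v in values]
    x_mean = sum(x) / n
    log_mean = sum(logs) / n
    b = sum((xi - x_mean) * (li - log_mean) for xi, li in zip(x, logs)) / sum(
        (xi - x_mean) ** 2 for xi in x
    )
    a = log_mean - b * x_mean
    return math.exp(a + b * (n - 1 + steps_ahead))


# Hypothetical sales growing by 10% per year.
sales = [100, 110, 121]
forecast = semi_log_forecast(sales)
print(round(forecast, 2))
```

The projection is reliable only so long as the past rate of growth continues, which is exactly the assumption of rigidity and consistency noted under extrapolation.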

2.8 Summary
Time series analysis is an important tool for measuring trends over time. It helps us to study past behavior, to forecast or estimate, and to compare values between time periods. A series contains long-term as well as short-term fluctuations: the original data (O) consist of the long-term trend (T) and short-term fluctuations. The various methods of measuring trend are as follows:

1) Free-hand Curve Method

2) Semi-average Method

3) Moving-average Method

4) Method of Least Squares.

Each one is explained in detail in the lesson.

2.9 Questions
What are time series? How will you analyze them?
1. What is secular trend? Critically examine the various methods of measuring trend.
2. Explain the method of least squares for measuring long-term trend.
3. What is seasonal variation of a time series? Describe the different methods to evaluate it.
4. Write short notes on the following:- (a) Additive and Multiplicative Models; (b) Short-
time Fluctuations; (c) Cyclical Fluctuations.
5. Describe cyclical, seasonal and irregular variations. What do you mean by seasonal
variations indices? Explain.
6. What are the seasonal variations? How will you construct a seasonal index using ratio-to-
trend method? What are the uses and limitations of seasonal indices?
7. The exports made by a table fan manufacturing concern in Eastern Uttar Pradesh during
1991 to 2003 are given as below (in lakhs of rupees):-
Year Export Year Export
1991 351 1998 420
1992 366 1999 450
1993 361 2000 500
1994 362 2001 518
1995 400 2002 540
1996 419 2003 557
1997 410

8. The following table shows the number of salesmen working in a certain concern. Use the
method of least squares to fit a straight line trend and estimate the number of salesmen in
2005:-

Year              2000   2001   2002   2003   2004
No. of Salesmen   28     38     46     40     56

9. From the following figures of output of a sugar factory, fit a linear trend by least squares
and show the trend line on a graph paper. What is the monthly increase in production?

Year                       1994   1995   1996   1997   1998   1999   2000
Production ('000 Qntls.)   12     10     14     11     13     15     16

10. Find trend and short-term fluctuations by the least squares method:-

Year 1992 1993 1994 1995 1996 1997 1998 1999 2000
Y 232 226 220 180 190 168 162 152 144

11. Find out seasonal variation index numbers by using the method of monthly averages from the
following data:-

Month      2002   2003   2004      Month       2002   2003   2004
January    12     15     16        July        16     17     16
February   11     14     15        August      13     12     13
March      10     13     14        September   11     13     10
April      14     16     16        October     10     12     10
May        15     16     16        November    12     13     11
June       15     15     17        December    15     14     15

12. Find seasonal variations by Ratio-to-trend method from the data given below:-

Year Quarters
I II III IV
2000 30 40 36 34
2001 34 52 50 44
2002 40 58 54 48
2003 54 76 68 62
2004 80 92 86 82

2.10 Suggested Reading
• Shukla & Sahai: Business Statistics, Sahitya Bhawan Publication, Agra.
• Elhance D.N.: Fundamentals of Statistics, Kitab Mahal, New Delhi.
• Gupta K.L.: Business Statistics.
• Aggarwal S.L. and Bhardwaj S.L.: Business Statistics.

UNIT - III

CORRELATION AND REGRESSION ANALYSIS:

Structure

3.1 Objectives
3.2 Introduction
3.3 Correlation: Meaning of Correlation
o Utility and Importance of Correlation
o Correlation and Cause and Effect Relationship
o Types of Correlation
o Degree of Correlation
o Methods of Determining Correlation
   • Graphical Method
   • Mathematical Methods
o Probable error
3.4 Regression:
o Meaning and definition
o Utility of Regression Analysis
o Types of Regressions
o Difference between Correlation and Regression
o Regression Lines
o Functions or uses of Regression Lines
o Regression equations
o Regression Coefficient
   • Properties of Regression Coefficients
o Some Important points relating to Regression Analysis
o Standard Error of the Estimate
o Ratio of Variation
3.5 Summary
3.6 Questions
3.7 Suggested Reading
3.1 Objectives

The objective of this lesson is to make you understand:


1. How to measure relationship between two variables through correlation.
2. Utility and importance of correlation.
3. Different types, degrees and methods of correlation.
4. Probable error and its utility
5. The meaning of regression equation and coefficient.
6. Practical application of regression analysis.
7. The use of ratio of variation.

3.2 Introduction

It is seen that some relationship exists among the variables associated with various phenomena in economic, social and scientific areas. For example, prices increase with an increase in the demand for goods; agricultural production is influenced by the level of rainfall; and the ages of husbands and wives tend to move together. Correlation is a statistical technique which denotes such inter-dependence between two variables and measures the degree and direction of the inter-relationship.

3.3 Correlation: Meaning of Correlation

When two quantitative facts having the relationship of cause and effect are varying
simultaneously in the same or in the opposite directions, the measurement of such variations is
called the measure of correlation.

“If two or more quantities vary in sympathy so that movements in the one tend to be
accompanied by corresponding movements in the other, then they are said to be correlated”. (L.
R. Connor)

o Utility and Importance of Correlation

In statistical analysis, the concept and technique of correlation occupies an important place. Measurements of correlation are very useful in the social sciences, and particularly in economic analysis. Correlation is especially important for the following specific purposes:

1. More Reliable Forecasting- The study of correlation helps in reducing the range of
uncertainty associated with decision-making and it leads to more reliable forecasting.

2. Study of Economic Activities- Correlation is also very useful in the analytical study of economic activities. For example, correlation is very useful in studying the impact of a price change on the change in demand, and of variation in the production of cotton on the production of cloth.

3. Estimation of one variable on the basis of another- The concepts of regression and ratio of variation are based on correlation. With the help of these concepts, the probable value of the variable of one series can be reliably estimated on the basis of the value of the variable of the other series, provided the two series are inter-related. For example, if the share prices at the Delhi and Mumbai stock exchanges are inter-related, then the probable price at Mumbai can easily be estimated on the basis of a certain price at Delhi by using regression equations.

4. Useful in Research- The technique of correlation proves very useful in making analyses, drawing conclusions and developing hypotheses and theories in the area of research and investigation.

o Correlation and Cause and Effect Relationship

An important issue related to the correlation technique is whether, on the basis of a quantitative measurement of correlation, a definite conclusion can be drawn about a cause and effect relationship between two series. Generally, it is assumed in correlation that there is a cause and effect relationship between the two series, but this is not necessarily so in every case. In this context, the following situations are worth consideration:-

1. Both the correlated variables are being affected by a third variable or more than one
variable- It is possible that there may be very high degree of coefficient of correlation
because both the related variables are being affected by a third common cause or causes.
For example, the production of wheat and rice are increasing due to certain other
common factors viz., better rainfall, use of high yielding seeds, etc.

2. Both the variables might be mutually affecting each other, so that neither of them can be designated as the cause or the effect. There may be two related variables for which it is difficult to determine which is the cause and which is the effect. For example, if both the price of and the demand for a commodity are increasing, we may conclude either that prices are increasing in anticipation of a future shortage of supply, or that the anticipated price increase is the cause and the increase in demand (in anticipation of future shortage) is the effect.

3. Correlation may be due to pure chance- It may also happen that there is no cause and effect relationship between two series, but a high degree of correlation is obtained by applying the formula. Suppose the coefficient of correlation between the production of milk and the production of cycles in a country is +.9; it does not mean that there is a cause and effect relationship between the two. Such a relation, if calculated, is known as nonsense or spurious correlation.

o Types of Correlation

On the basis of the direction and ratio of change between the relevant variables, the number of series, etc., correlation may be of the following three types:-

1. Positive and Negative Correlation- If the values of the two series deviate in the same direction, i.e., if an increase in the value of one variable results in a corresponding increase in the value of the other variable, or a decrease in the value of one variable results in a corresponding decrease in the value of the other variable, it is called positive correlation. For example, when the price of a good increases, its supply increases too.

On the contrary, when two variables deviate in opposite or inverse directions, it is called negative correlation, e.g., a decrease in the sale of woolen cloth with an increase in temperature, a reduction in the number of cinema-going persons with an increase in the number of television sets, etc.

2. Linear and Curvi-linear Correlation- The distinction between linear and curvi-linear correlation is based upon the ratio of change between two variables. If the ratio of change between two variables is constant, it is known as linear correlation. Thus, if with each 10% increase in price the demand decreases by 5%, there is a linear correlation between price and demand. If the corresponding values of two such series are plotted on a graph paper, a straight line is obtained. Mathematically, this relationship may be expressed as Y = a + bX.

If the corresponding change in two variables is not at a constant rate but at a fluctuating rate, it is said to be curvi-linear or non-linear correlation. For example, if with a 10% increase in price the decrease in demand ranges from 5% to 20%, it will be curvi-linear correlation. Generally, linear correlation is found only in the exact sciences, while curvi-linear correlation is common in economics and the social sciences.

3. Simple, Multiple and Partial Correlation- This distinction is based on the number of variables studied. The correlation between two variables is known as simple correlation. When more than two variables are studied, it is either multiple or partial correlation. For example, if we study the relationship between rainfall, use of fertilizers and agricultural production, it is a case of multiple correlation. On the other hand, if we study the correlation between rainfall and agricultural production, assuming the use of fertilizers to be constant, or between use of fertilizers and agricultural production, assuming rainfall to be constant, it will be a case of partial correlation. It means that in partial correlation, the relationship between two variables is studied by eliminating the effect of the other variables on both.

o Degree of Correlation

The intensity or degree of relationship between two variables is assessed by the quantitative
value of coefficient of correlation. The degree of correlation on the basis of this coefficient is
classified as under:

(1) Perfect Correlation- When the movement in two related variables is in the same direction
and in the same proportion, it is perfect positive correlation. The coefficient of correlation
(r) in this case will be +1. On the other hand, if changes are proportional but in opposite
direction, it will be perfect negative correlation and its calculated value will be –1.

(2) Absence of Correlation- If no interdependence is found between two variables, or there is no relationship between deviations in one variable and the corresponding deviations in the other variable, it is a situation of absence of correlation, and in this case the coefficient of correlation will be zero (0).

(3) Limited Degree of Correlation- A relationship lying between perfect correlation and absence of correlation is a case of limited degree of correlation, which may be positive as well as negative. The coefficient of positive limited degree of correlation ranges between 0 and 1 (>0 but <1), while that of negative limited degree of correlation ranges between 0 and –1. In practice, limited degree of correlation is more common in economic, business and social activities. The limited degree of correlation can be very high, high, moderate or low.

The degree of correlation can easily be explained by the following chart:-

Degree of Correlation
Degree              Positive                     Negative
1. Perfect          +1                           -1
2. Limited
   (a) Very High    Above +.9 and up to +.99     Below -.9 and up to -.99
   (b) High         Above +.75 and up to +.9     Below -.75 and up to -.9
   (c) Moderate     Above +.25 and up to +.75    Below -.25 and up to -.75
   (d) Low          Above 0 and up to +.25       Below 0 and up to -.25
3. Absence          0                            0
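The above chart can be written as a small classification rule. The following Python sketch is only illustrative; the function name `degree_of_correlation` is an assumption, and values of r lying between .99 and 1 (which the chart does not list separately) are treated here as Very High.

```python
def degree_of_correlation(r):
    """Classify a coefficient of correlation r according to the chart above."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)
    if size == 0:
        return "Absence"
    if size == 1:
        degree = "Perfect"
    elif size > 0.9:
        degree = "Very High"     # assumption: values between .99 and 1 also fall here
    elif size > 0.75:
        degree = "High"
    elif size > 0.25:
        degree = "Moderate"
    else:
        degree = "Low"
    direction = "Positive" if r > 0 else "Negative"
    return f"{degree} {direction}"


for r in (1, 0.95, 0.8, -0.5, 0.1, 0):
    print(r, degree_of_correlation(r))
```

The sign of r gives the direction and its absolute size gives the degree, so the two can be read off independently.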
o Methods of Determining Correlation
The important methods of finding out correlation are as follows:-
• Graphical Method
   - Scatter Diagram or Dotogram
   - Correlation Graph
• Mathematical Methods
   - Karl Pearson's Coefficient of Correlation.
   - Spearman's Rank Coefficient of Correlation.
   - Concurrent Deviations Method.
Scatter Diagram or Dotogram

The scatter diagram is a very simple method of showing and estimating the degree and direction of correlation between two variables. In this method, the independent variable (X-series) is shown on the horizontal axis (X-axis) and the dependent variable (Y-series) on the vertical axis (Y-axis) of the graph paper. The given data are plotted on this paper in the form of dots, i.e., we put a dot for each pair of x and y values and thus obtain as many dots as there are pairs of values. By looking at the scatter and direction of the various dots we can form an idea about the nature of correlation.

[Diagrams No. 1 to 5: scatter diagrams with X on the horizontal axis and Y on the vertical axis, showing Perfect Positive (r = +1), Perfect Negative (r = -1), Absence of Correlation, Limited Positive (0 < r < 1) and Limited Negative (-1 < r < 0) correlation respectively.]

Interpretation or study of Scatter Diagram-

After viewing the scatter diagram the nature and degree of correlation is interpreted as
given below:-

(1) Perfect Positive Correlation- If all the points lie on a straight line rising from the lower left-hand corner to the upper right-hand corner, it means that there is perfect positive correlation between two variables. (Diagram No. 1)

(2) Perfect Negative Correlation- If all the points lie on a straight line falling from the upper
left-hand corner to the lower right-hand corner, it shows that there is perfect negative
correlation between two variables (Diagram No. 2)

(3) Absence of Correlation- If the plotted points are scattered in a haphazard manner and no
definite trend is clear; it will be a case of absence of correlation. (Diagram No. 3).

(4) Limited Correlation- The limited correlation may be either positive or negative. If the
trend of points is of rising from left to right, it is limited positive correlation. (Diagram
No. 4). On the contrary, if the trend is of falling from left to right, there is limited
negative correlation. (Diagram No. 5).

The scatter diagram is simple and attractive method of finding out correlation between
two variables, because mathematical calculations are not required in it. It readily enables us to
form a rough idea about the nature (perfect or limited) and direction (positive or negative) of
relationship between two variables.

An important limitation of this method is that it presents only a visual picture of correlation, i.e., it tells only whether there is correlation or not. The degree of correlation cannot be measured exactly or precisely.

Correlation Graph

Graphs can also be used to study correlation between two series. In this method two curves are drawn by measuring time, place or serial number on the X-axis and the two related variables on the Y-axis. If the values of both variables are of almost similar size, both are measured on the left-hand vertical axis. If there is a wide difference in the values or in their limits, then the vertical axes on both sides are used to represent the two variables separately.

Study of Correlation Graph- On the basis of correlation graph; the conclusions about
correlation are drawn as follows:- (1) If both the curves drawn on the graph are moving in the
same direction (either upward or downward) correlation is said to be positive. (2) If the curves
are moving in the opposite directions, correlation is said to be negative. (3) If there is erratic
fluctuation in the curves showing no similarity, then there may be absence of correlation or a low
degree of correlation.

Example 1: Prepare a correlation graph on the basis of following data and comment about the
correlation between marks of X and Y:-

S.No. 1 2 3 4 5 6 7 8 9 10
X 56 42 72 36 63 47 55 49 38 42
Y 147 125 160 118 149 128 150 145 115 140

Solution: Correlation Graph

[Correlation graph: serial numbers 1 to 10 on the X-axis; one variable measured on the left-hand vertical scale (0 to 80) and the other on the right-hand vertical scale (0 to 180).]

Check Your Progress


1. State whether the following statements are true or false. In case of false statement give the
correct statement:-
(i) Positive value of correlation coefficient between X and Y implies that if X decreases, Y
tends to increase.
(ii) Negative correlation in two series means that if the value of one of the variables decreases,
the other would also decrease.
(iii) The coefficient of rank correlation has the same limits as the Pearsonian coefficient of
correlation.
(iv) The coefficient of correlation is independent of change of scale and point of origin.
(v) For computing the coefficient of correlation in grouped data, the class interval width for
both the variables must be identical.

Karl-Pearson’s Coefficient of Correlation

This method for measuring the intensity of correlation was propounded by the famous biologist and statistician Karl Pearson. Since it explains both the degree and direction of correlation very clearly and precisely, it is widely used. The coefficient of correlation is denoted by 'r' and it is calculated on the basis of the mean and standard deviation. "It (the coefficient of correlation (r) of two variables) is obtained by dividing the sum of the products of the corresponding deviations of the various items of two series from their respective means by the product of their standard deviations and the number of pairs of observations." Symbolically, it is represented as:-

$r = \frac{\sum dxdy}{N \times S.D._x \times S.D._y}$ or $r = \frac{\sum dxdy}{N \sigma_x \sigma_y}$

This formula can be expressed on the basis of co-variance also:-

$r = \frac{\text{Co-variance}}{\sigma_x \sigma_y}$ (since Co-variance $= \frac{\sum dxdy}{N}$)

Or $r = \frac{\text{Co-variance of } x \text{ and } y}{\sqrt{(\text{Variance of } X)(\text{Variance of } Y)}}$

Example 2:

From the values given below, find out Karl Pearson's coefficient of correlation:-

(a) $\sum d_x d_y = 150$, $N = 9$, $\sigma_x = 5.8$, $\sigma_y = 3.2$

(b) Co-variance = 9, $\sigma_x = 4$, $\sigma_y = 7.5$

(c) Co-variance of x and y = 12, Variance of x = 16, Variance of y = 13.7

Solution:

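(a) $r = \dfrac{\sum d_x d_y}{N \, \sigma_x \, \sigma_y} = \dfrac{150}{9 \times 5.8 \times 3.2} = \dfrac{150}{167.04} = 0.898$

(b) $r = \dfrac{\text{Co-variance}}{\sigma_x \, \sigma_y} = \dfrac{9}{4 \times 7.5} = 0.3$

(c) $r = \dfrac{\text{Co-variance of } x \text{ and } y}{\sqrt{(\text{Variance of } X)(\text{Variance of } Y)}} = \dfrac{12}{\sqrt{16 \times 13.7}} = \dfrac{12}{\sqrt{219.2}} = \dfrac{12}{14.805} = 0.8105$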

Direct Method for the calculation of r in Individual or Ungrouped series-

If arithmetic means of both the series are whole numbers, the direct method is simple
and appropriate. The following process is involved in it:-

(1) Calculate the arithmetic means $\bar{X} = \frac{\sum X}{N}$ and $\bar{Y} = \frac{\sum Y}{N}$
of X and Y series respectively.

(2) Take the deviations of the values in X series from $\bar{X}$ and write them under the
column headed by dx $(X - \bar{X})$. Similarly, take the deviations of the values in Y
series from $\bar{Y}$ and write them under the column headed by dy $(Y - \bar{Y})$. As a
check, the totals of the columns headed by dx and dy must be zero when the direct method
is used.

(3) Multiply the respective deviations of both series (dx and dy) and write the products under the
last column headed by dxdy. The sum of this column is denoted as $\sum d_x d_y$.

(4) Find out d²x by squaring the deviations in X series (dx) and d²y by squaring the
deviations in Y series (dy). Total both these columns to obtain $\sum d_x^2$ and $\sum d_y^2$.

(5) Generally, following 7 columns are drawn to show the above mentioned
calculations:-

X      dx      d²x      Y      dy      d²y      dxdy

If years, roll numbers, etc., are given in the question, an additional column before the
column X may be drawn to show these years, roll numbers, etc.

(6) To find out coefficient of correlation on the basis of original formula of Karl Pearson,
standard deviations of both the series are to be calculated from the following formulae:-

$$\text{S.D.}_x \text{ or } \sigma_x = \sqrt{\frac{\sum d_x^2}{N}} \qquad \text{S.D.}_y \text{ or } \sigma_y = \sqrt{\frac{\sum d_y^2}{N}}$$

(7) Finally, the following formula is used:-

$$r = \frac{\sum d_x d_y}{N \, \sigma_x \, \sigma_y} \qquad \text{.......(i)}$$

This process can be simplified and, instead of calculating standard deviations, the various
values are put directly into the formula as given below:-

$$r = \frac{\sum d_x d_y}{N \sqrt{\dfrac{\sum d_x^2}{N}} \times \sqrt{\dfrac{\sum d_y^2}{N}}} \qquad \text{.......(ii)}$$

From the point of view of calculation it can further be simplified as under:-

$$r = \frac{\sum d_x d_y}{\sqrt{\sum d_x^2 \times \sum d_y^2}} \qquad \text{.......(iii)}$$

Note: The basis of all these formulae (i), (ii) and (iii) is the same and so the answer will also be
the same. However, the third formula is the simplest from the point of view of calculation work.

Example 3:

Calculate the coefficient of correlation between X and Y from the following data:-
X 24 26 32 33 35 30
Y 15 20 22 24 27 24

Solution: Calculation of Coefficient of Correlation


X          dx     d²x         Y          dy     d²y         dxdy
24         -6     36          15         -7     49          42
26         -4     16          20         -2     4           8
32         +2     4           22         0      0           0
33         +3     9           24         +2     4           6
35         +5     25          27         +5     25          25
30         0      0           24         +2     4           0
ΣX = 180   0      Σd²x = 90   ΣY = 132   0      Σd²y = 86   Σdxdy = 81

$$\bar{X} = \frac{\sum X}{N} = \frac{180}{6} = 30 \qquad \bar{Y} = \frac{\sum Y}{N} = \frac{132}{6} = 22$$

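$$\text{S.D.}_x = \sqrt{\frac{\sum d_x^2}{N}} = \sqrt{\frac{90}{6}} = 3.87$$

$$\text{S.D.}_y = \sqrt{\frac{\sum d_y^2}{N}} = \sqrt{\frac{86}{6}} = 3.79$$

$$r = \frac{\sum d_x d_y}{N \, \sigma_x \, \sigma_y} = \frac{81}{6 \times 3.87 \times 3.79} = \frac{81}{88} = 0.92$$

On the basis of alternative and simple formula-

$$r = \frac{\sum d_x d_y}{\sqrt{\sum d_x^2 \times \sum d_y^2}} = \frac{81}{\sqrt{90 \times 86}} = \frac{81}{\sqrt{7{,}740}} = \frac{81}{87.98} = 0.92$$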
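
The direct method of Example 3 can also be checked with a few lines of code. The following is only an illustrative sketch in Python; the function name `pearson_direct` is an assumption made for this illustration and is not part of the lesson. It takes deviations from the actual means and applies formula (iii).

```python
from math import sqrt

def pearson_direct(xs, ys):
    """Karl Pearson's r using deviations from the actual means (formula iii)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]          # deviations of X from its mean
    dy = [y - mean_y for y in ys]          # deviations of Y from its mean
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    sum_dx2 = sum(a * a for a in dx)
    sum_dy2 = sum(b * b for b in dy)
    return sum_dxdy / sqrt(sum_dx2 * sum_dy2)

# Data of Example 3
X = [24, 26, 32, 33, 35, 30]
Y = [15, 20, 22, 24, 27, 24]

r_direct = pearson_direct(X, Y)
print(round(r_direct, 2))  # 0.92
```

The printed value agrees with the answer obtained by hand above.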

Short-cut Method for the calculation of r in Individual or Ungrouped Series

When the arithmetic mean of one or both the series is not a whole number, the short-cut
method may be used to avoid complex calculations. The deviations in this method are taken
from an assumed mean. If the mean of any one of the series is a whole number, deviations in that
series may be taken either from the actual or from an assumed mean. However, deviations from
the actual mean in that series further simplify calculations. In calculating coefficient of
correlation by short-cut method, the following steps are taken:-

(1) First of all, some convenient values are taken as assumed means in series X (Ax) and
series Y (Ay). Convenient value means a whole number approximately nearer to the
arithmetic mean.

(2) Deviations are found out in both the series from the assumed means, which are
denoted as dx (X-Ax) in series X and dy (Y-Ay) in Y series. The deviations are
added up to obtain $\sum d_x$ and $\sum d_y$.

(3) The corresponding deviations are multiplied (dx × dy) and the sum of such
products is denoted as $\sum d_x d_y$.

(4) The deviations (dx and dy) are also squared and their totals $\sum d_x^2$ and
$\sum d_y^2$ are obtained.

(5) After making these calculations, any of the following formulae may be applied.

First formula:

$$r = \frac{\sum d_x d_y - N(\bar{X} - A_x)(\bar{Y} - A_y)}{N \, \sigma_x \, \sigma_y}$$

In this formula, standard deviations of both series are calculated separately.

Second formula:

$$r = \frac{\sum d_x d_y - N\left(\dfrac{\sum d_x}{N}\right)\left(\dfrac{\sum d_y}{N}\right)}{N \sqrt{\dfrac{\sum d_x^2}{N} - \left(\dfrac{\sum d_x}{N}\right)^2} \times \sqrt{\dfrac{\sum d_y^2}{N} - \left(\dfrac{\sum d_y}{N}\right)^2}}$$

Third formula:

$$r = \frac{\sum d_x d_y - \dfrac{\sum d_x \cdot \sum d_y}{N}}{\sqrt{\sum d_x^2 - \dfrac{(\sum d_x)^2}{N}} \times \sqrt{\sum d_y^2 - \dfrac{(\sum d_y)^2}{N}}}$$

Fourth formula:

$$r = \frac{N\sum d_x d_y - \sum d_x \cdot \sum d_y}{\sqrt{N\sum d_x^2 - (\sum d_x)^2} \times \sqrt{N\sum d_y^2 - (\sum d_y)^2}}$$

Note: The basis of all these formulae is the same, so the answers will also be the same. However,
from the point of view of calculation, the fourth formula is the easiest.

Example 4:

Calculate Karl-Pearson’s Coefficient of Correlation between X and Y:-


X 42 44 58 55 89 98 66
Y 56 49 53 58 65 76 58

Solution: Calculation of Karl-Pearson’s Coefficient of Correlation


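X      dx from 65    d²x      Y      dy from 60    d²y     dxdy
42     -23           529      56     -4            16      92
44     -21           441      49     -11           121     231
58     -7            49       53     -7            49      49
55     -10           100      58     -2            4       20
89     +24           576      65     +5            25      120
98     +33           1,089    76     +16           256     528
66     +1            1        58     -2            4       -2
Total  -3            2,785           -5            475     1,038

$$r = \frac{N\sum d_x d_y - \sum d_x \cdot \sum d_y}{\sqrt{N\sum d_x^2 - (\sum d_x)^2} \times \sqrt{N\sum d_y^2 - (\sum d_y)^2}}$$

$$= \frac{1{,}038 \times 7 - (-3 \times -5)}{\sqrt{[2{,}785 \times 7 - (-3)^2]\,[475 \times 7 - (-5)^2]}}$$

$$= \frac{7{,}266 - 15}{\sqrt{19{,}486 \times 3{,}300}} = \frac{7{,}251}{8{,}018.965} = 0.9042$$

The same result can be obtained with logarithms:

$$r = \frac{7{,}251}{\sqrt{19{,}486 \times 3{,}300}} = \text{A.L.}\left[\log 7{,}251 - \tfrac{1}{2}(\log 19{,}486 + \log 3{,}300)\right]$$

$$= \text{A.L.}\left[3.8604 - \tfrac{1}{2}(4.2897 + 3.5185)\right]$$

$$= \text{A.L.}\,[3.8604 - 3.9041] = \text{A.L.}\ \bar{1}.9563 = 0.9042$$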
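
The short-cut calculation of Example 4 can be reproduced in code as a check. The sketch below, in Python, applies the fourth formula; the function name `pearson_shortcut` and its parameters `ax` and `ay` (the assumed means) are names chosen for this illustration only.

```python
from math import sqrt

def pearson_shortcut(xs, ys, ax, ay):
    """Karl Pearson's r from deviations taken from assumed means (fourth formula)."""
    n = len(xs)
    dx = [x - ax for x in xs]              # deviations from assumed mean of X
    dy = [y - ay for y in ys]              # deviations from assumed mean of Y
    s_dx, s_dy = sum(dx), sum(dy)
    s_dxdy = sum(a * b for a, b in zip(dx, dy))
    s_dx2 = sum(a * a for a in dx)
    s_dy2 = sum(b * b for b in dy)
    numerator = n * s_dxdy - s_dx * s_dy
    denominator = sqrt((n * s_dx2 - s_dx ** 2) * (n * s_dy2 - s_dy ** 2))
    return numerator / denominator

# Data of Example 4 with assumed means 65 and 60
X = [42, 44, 58, 55, 89, 98, 66]
Y = [56, 49, 53, 58, 65, 76, 58]

r_shortcut = pearson_shortcut(X, Y, 65, 60)
print(round(r_shortcut, 4))  # 0.9042

# A different choice of assumed means gives the same coefficient.
r_other = pearson_shortcut(X, Y, 50, 50)
```

Since the correction terms in the formula remove the effect of the assumed means, any convenient whole numbers may be chosen for them.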

Various points to be remembered while calculating Karl Pearson's Coefficient of Correlation-

(1) Assumed means are given in the question- If assumed means are given in the question,
deviations will have to be taken from these assumed means and the question will be
solved by short-cut method.

(2) Calculation of coefficient of correlation after finding out the missing values- In such
questions, the actual means are given. So missing values can be calculated on the basis of
these actual means.

(3) Preparation of series X and Y on the basis of data given- There may be questions in
which values of series X and Y is not given clearly. In such a situation it is identified that
between which two characteristics, correlation is to be calculated and on that basis, series
are prepared.

(4) Calculation of coefficient of correlation, when various calculations are already given- It
is also possible that in place of two series, various calculations are given and on that basis
coefficient of correlation is to be calculated. In such a case, appropriate formula is used
on the basis of information given in the question.

(5) Product Moment Correlation- If the values of items in two series are comparatively small,
their correlation can be found out by the product moment method. In this method,
deviations are not taken but the values are used directly. Thus, there will be only five
columns as below:-

X      X²      Y      Y²      XY

Formula:

$$r = \frac{N\sum XY - \sum X \cdot \sum Y}{\sqrt{N\sum X^2 - (\sum X)^2} \times \sqrt{N\sum Y^2 - (\sum Y)^2}}$$

Karl Pearson's Coefficient of Correlation in Grouped Series or Cross Classification Table-

If the values of two variables are grouped and the frequencies of different groups are
given in the form of classified series, coefficient of correlation can be calculated in such series.
This kind of classified series is called 'Correlation Table' or 'Bivariate Frequency Table'. In this
table cell frequencies and total frequencies of two related discrete or continuous series are
presented in such a manner that their relationship may be exhibited. There will be various
cells in the correlation table and each cell refers to the common frequency of X and Y series. This
may be made clearer by the following correlation table:

Example 5:-
Y \ X     0-10    10-20    20-30    30-40    40-50    Total
0-10      3       7        2        -        -        12
10-20     -       5        8        1        -        14
20-30     -       -        4        6        2        12
30-40     -       -        -        5        4        9
40-50     -       -        -        -        3        3
Total     3       12       14       12       9        50

The formula for calculating coefficient of correlation is the same as it is in individual series
except for the adjustment of frequencies (f) in the formula as given below:-

$$r = \frac{\sum f d_x d_y - \dfrac{\sum f d_x \cdot \sum f d_y}{N}}{\sqrt{\sum f d_x^2 - \dfrac{(\sum f d_x)^2}{N}} \times \sqrt{\sum f d_y^2 - \dfrac{(\sum f d_y)^2}{N}}}$$

$$\text{or} \quad r = \frac{N\sum f d_x d_y - \sum f d_x \cdot \sum f d_y}{\sqrt{N\sum f d_x^2 - (\sum f d_x)^2} \times \sqrt{N\sum f d_y^2 - (\sum f d_y)^2}}$$

The process of calculation of coefficient of correlation in grouped series is as follows:

(1) If the difference between values or class-intervals given in the series is equal and total
frequencies are given in the correlation table, four additional columns and four additional rows
are drawn in addition to the columns and rows given in the question. Out of these four additional
columns, one column is for the deviations (dy), which is drawn just after the column of
class-intervals. The three remaining columns are drawn at the end for fdy, fd²y and fdxdy
respectively. Similarly, four additional rows are used for dx, fdx, fd²x and fdxdy
respectively. If class-intervals differ in series X and Y, step deviations may still be taken
and the difference in class-intervals will not affect the formula.

(2) If class-intervals are equal, step deviations are taken by taking convenient mid-values or
values as assumed means and these deviations are denoted as dx and dy in X and Y series
respectively.

(3) The frequencies of series X are multiplied with the respective deviations (dx) and the
products fdx are added to obtain $\sum f d_x$. Each fdx is multiplied with dx and the products
thus obtained (fd²x) are added to obtain $\sum f d_x^2$.

(4) Similarly, the frequencies of series Y are multiplied by dy and the product fdy is again
multiplied by dy to find out fd²y. Both these products are added to obtain $\sum f d_y$ and
$\sum f d_y^2$ respectively.

(5) For the calculation of $\sum f d_x d_y$, each cell frequency is multiplied by the 'dx' given
above the cell and the 'dy' given to the left of the cell, which gives the value f × dx × dy
for that cell.

(6) The 'fdxdy' values of the cells are totalled both row-wise and column-wise. This total is
denoted as $\sum f d_x d_y$.

Solution:
(The figure in brackets in each cell is f × dx × dy for that cell.)

Y \ X            0-10     10-20    20-30    30-40    40-50    f       fdy     fd²y    fdxdy
         dx →    -2       -1       0        1        2
0-10   dy = -2   3 (12)   7 (14)   2 (0)    -        -        12      -24     48      26
10-20  dy = -1   -        5 (5)    8 (0)    1 (-1)   -        14      -14     14      4
20-30  dy = 0    -        -        4 (0)    6 (0)    2 (0)    12      0       0       0
30-40  dy = 1    -        -        -        5 (5)    4 (8)    9       9       9       13
40-50  dy = 2    -        -        -        -        3 (12)   3       6       12      12
f                3        12       14       12       9        N = 50  Σfdy = -23  Σfd²y = 83
fdx              -6       -12      0        12       18       Σfdx = 12
fd²x             12       12       0        12       36       Σfd²x = 72
fdxdy            12       19       0        4        20       Σfdxdy = 55

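Applying the values in the formula:

$$r = \frac{N\sum f d_x d_y - \sum f d_x \cdot \sum f d_y}{\sqrt{N\sum f d_x^2 - (\sum f d_x)^2} \times \sqrt{N\sum f d_y^2 - (\sum f d_y)^2}}$$

$$r = \frac{55 \times 50 - (12 \times -23)}{\sqrt{(72 \times 50 - 144) \times (83 \times 50 - 529)}} = \frac{2{,}750 + 276}{\sqrt{3{,}456 \times 3{,}621}} = \frac{3{,}026}{\sqrt{12{,}514{,}176}} = 0.8554$$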
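
The grouped-series steps can be sketched in code as a check on the table work. The Python sketch below is illustrative only: the names `grouped_sums` and `grouped_r` are assumptions for this illustration, blank cells of the correlation table are entered as 0, rows stand for the Y classes and columns for the X classes, and the step deviations are those used in the solution above.

```python
from math import sqrt

# Cell frequencies of Example 5 (rows: Y classes, columns: X classes).
table = [
    [3, 7, 2, 0, 0],   # Y 0-10
    [0, 5, 8, 1, 0],   # Y 10-20
    [0, 0, 4, 6, 2],   # Y 20-30
    [0, 0, 0, 5, 4],   # Y 30-40
    [0, 0, 0, 0, 3],   # Y 40-50
]
dx = [-2, -1, 0, 1, 2]   # step deviations of the X classes
dy = [-2, -1, 0, 1, 2]   # step deviations of the Y classes

def grouped_sums(table, dx, dy):
    """Return N, sum fdx, sum fdy, sum fd2x, sum fd2y and sum fdxdy."""
    row_totals = [sum(row) for row in table]          # frequencies of Y classes
    col_totals = [sum(col) for col in zip(*table)]    # frequencies of X classes
    n = sum(row_totals)
    s_fdx = sum(f * d for f, d in zip(col_totals, dx))
    s_fdy = sum(f * d for f, d in zip(row_totals, dy))
    s_fdx2 = sum(f * d * d for f, d in zip(col_totals, dx))
    s_fdy2 = sum(f * d * d for f, d in zip(row_totals, dy))
    s_fdxdy = sum(f * dx[j] * dy[i]
                  for i, row in enumerate(table)
                  for j, f in enumerate(row))
    return n, s_fdx, s_fdy, s_fdx2, s_fdy2, s_fdxdy

def grouped_r(table, dx, dy):
    """Karl Pearson's r for a bivariate frequency table."""
    n, s_fdx, s_fdy, s_fdx2, s_fdy2, s_fdxdy = grouped_sums(table, dx, dy)
    numerator = n * s_fdxdy - s_fdx * s_fdy
    denominator = sqrt((n * s_fdx2 - s_fdx ** 2) * (n * s_fdy2 - s_fdy ** 2))
    return numerator / denominator

sums = grouped_sums(table, dx, dy)
r_grouped = grouped_r(table, dx, dy)
print(sums)                  # (50, 12, -23, 72, 83, 55)
print(round(r_grouped, 4))   # 0.8554
```

The six totals printed first are the same as those at the foot and side of the solution table, which makes it easy to locate an arithmetic slip in manual work.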

Assumptions of Karl Pearson’s Coefficient of Correlation-

Karl Pearson's coefficient of correlation is based on the following assumptions:-

(1) Linear Relationship- It is assumed that there is a linear relationship between two series.
In other words, if the pairs of items of both the series are plotted on a graph paper, the
plotted points will form a straight line.

(2) Normality- The correlated variables are affected by a large number of independent factors
so that they acquire normality.

(3) Cause and Effect Relationship- It is also assumed that there is cause and effect
relationship between the factors affecting the values of two series. If there is no such
relationship, coefficient, if calculated, will be a non-sense correlation.

Mathematical Properties of the Coefficient of Correlation

(1) Mathematical limits of coefficient - The value of Karl Pearson's coefficient of correlation
lies between ±1. It cannot be greater than +1 or less than -1 in any case.

(2) No effect of change in origin or scale- Karl Pearson's coefficient of correlation is
independent of change of origin or scale of X and Y variables. Change of origin means
that a constant is subtracted from all values and change of scale means that all values
of X and Y series are multiplied or divided by some constant.

(3) Geometric mean of regression coefficients- Coefficient of correlation is equal to the
geometric mean of regression coefficients i.e., $r = \sqrt{b_{xy} \times b_{yx}}$.
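
Property (2) can be illustrated numerically. The Python sketch below uses the data of Example 3 and the product moment formula given earlier in this lesson; the function name `product_moment_r` and the constants 20, 2, 10 and 5 are arbitrary choices made for this illustration, with the scale constants assumed to be positive.

```python
from math import sqrt

def product_moment_r(xs, ys):
    """Karl Pearson's r computed directly from the values (product moment formula)."""
    n = len(xs)
    s_x, s_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    s_x2 = sum(x * x for x in xs)
    s_y2 = sum(y * y for y in ys)
    return (n * s_xy - s_x * s_y) / sqrt((n * s_x2 - s_x ** 2) * (n * s_y2 - s_y ** 2))

# Data of Example 3
X = [24, 26, 32, 33, 35, 30]
Y = [15, 20, 22, 24, 27, 24]
r_original = product_moment_r(X, Y)

# Change of origin and scale: subtract a constant, then divide or multiply by a constant.
U = [(x - 20) / 2 for x in X]
V = [(y - 10) * 5 for y in Y]
r_transformed = product_moment_r(U, V)

print(round(r_original, 4), round(r_transformed, 4))  # both 0.9207
```

Both series give the same coefficient, as the property states.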

o Probable error
Probable error (P.E.) is an important measure to determine the limits of coefficient of correlation
and to assess the reliability of the value of coefficient. The formula for calculating probable error
of Karl Pearson's coefficient of correlation is as follows:-

$$\text{P.E.} = \frac{0.6745\,(1 - r^2)}{\sqrt{N}}$$

Note: In order to simplify the process of calculation, 2/3 may be used in place of 0.6745 and this
substitution does not affect the accuracy of the result significantly.

There are two main uses of probable error of coefficient of correlation:-

(1) Determination of limits of coefficient of correlation- Probable error of coefficient of
correlation can determine the minimum and maximum limits within which the coefficient
of correlation of the total universe, or of other groups selected from the same universe at
random, will fall with a probability of 50%. These limits are determined on the basis of
r ± P.E. i.e., maximum limit = r + P.E. and minimum limit = r - P.E.

(2) Interpretation of coefficient of correlation- Probable error does help in the interpretation
of coefficient of correlation.

Example 6:

(a) If coefficient of correlation of two variables is 0.9 and number of pairs of items is 16,
calculate the probable error.

(b) When the number of items (N) is 25 and the value of coefficient of correlation (r) is 0.7,
find the limits within which coefficient of correlation lies for another sample from the same
universe.

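Solution:

(a) $\text{P.E.} = \dfrac{0.6745\,(1 - r^2)}{\sqrt{N}} = \dfrac{0.6745\,(1 - 0.9^2)}{\sqrt{16}} = \dfrac{0.1282}{4} = 0.032$

(b) $\text{P.E.} = \dfrac{\frac{2}{3}(1 - r^2)}{\sqrt{N}} = \dfrac{\frac{2}{3}(1 - 0.7^2)}{\sqrt{25}} = \dfrac{\frac{2}{3} \times 0.51}{5} = 0.068$

Maximum limit = r + P.E. = 0.7 + 0.068 = 0.768

Minimum limit = r - P.E. = 0.7 - 0.068 = 0.632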
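
The arithmetic of Example 6 can be verified with a short sketch in Python. The function name `probable_error` and its `factor` parameter are names assumed for this illustration; the factor defaults to 0.6745 and may be set to 2/3 as allowed by the note above.

```python
from math import sqrt

def probable_error(r, n, factor=0.6745):
    """Probable error of Karl Pearson's r: factor * (1 - r^2) / sqrt(N)."""
    return factor * (1 - r ** 2) / sqrt(n)

# Part (a): r = 0.9, N = 16
pe_a = probable_error(0.9, 16)
print(round(pe_a, 3))            # 0.032

# Part (b): r = 0.7, N = 25, using 2/3 in place of 0.6745
pe_b = probable_error(0.7, 25, factor=2 / 3)
lower, upper = 0.7 - pe_b, 0.7 + pe_b
print(round(pe_b, 3))            # 0.068
print(round(lower, 3), round(upper, 3))  # 0.632 0.768
```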

Coefficient of Correlation and Standard Error

In modern statistics, the use of standard error is considered better than probable error.
Standard error of coefficient of correlation is computed by the following formula:-

$$\text{S.E. of } r = \frac{1 - r^2}{\sqrt{N}} \qquad \text{or} \qquad \text{P.E. of } r = \tfrac{2}{3} \times \text{S.E. of } r \ \text{ or } \ 0.6745 \times \text{S.E. of } r$$

On the basis of standard error, the limits of coefficient of correlation are determined as
r ± 3 S.E.

Spearman's Rank Difference Method

This method was devised in the year 1904 by Charles Edward Spearman, a psychologist
in Britain. This method is also known as 'Ranking Method' and is considered more appropriate
in finding out correlation between those qualitative facts which cannot be measured
quantitatively but can be placed in an order, such as honesty, beauty, intelligence, etc. For
example, we have to find out correlation between honesty and beauty of 10 persons. These facts
are not measurable quantitatively. But the persons can be ranked in order and, on the basis of
such order, coefficient of correlation can be calculated by Spearman's Rank Difference method.

The formula for the calculation of coefficient of correlation on the basis of Spearman's
Rank Difference method is as follows:-

$$r_r = 1 - \frac{6\sum D^2}{N(N^2 - 1)}$$

Note :

1. 'rr' has been used for Rank Correlation. Technically, Greek letter ρ (rho) is used for it.

2. ΣD² = Sum of the squares of the differences of ranks of each pair of observations.

3. Spearman's rank difference coefficient of correlation also comes within the limits of –1
to +1 and it can be interpreted in the same way as that of Karl Pearson's method.

The various problems of computation of coefficient of correlation by Spearman's Rank
Difference method may be discussed as under:-

(i) When ranks of items are given in the question-

Example 7: Ten competitors in a debate contest are ranked by three judges in the
following order. Use the rank correlation coefficient to determine which pair of judges has the
nearest approach to common tastes in judgments-
1stJudge 1 6 5 10 3 2 4 9 7 8
2ndJudge 3 5 8 4 7 10 2 1 6 9
3rdJudge 6 4 9 8 1 2 3 10 5 7
Solution:
To find out which pair of judges has the nearest approach to common tastes in
judgments, we will calculate three sets of coefficient of correlation between the rankings of :-
(1) 1st and 2nd Judge (R1 and R2)
(2) 2nd and 3rd Judge (R2 and R3)
(3) 1st and 3rd Judge (R1 and R3)

R1    R2    R3    D1 = R1-R2    D1²    D2 = R2-R3    D2²    D3 = R1-R3    D3²
1     3     6     -2            4      -3            9      -5            25
6     5     4     1             1      1             1      2             4
5     8     9     -3            9      -1            1      -4            16
10    4     8     6             36     -4            16     2             4
3     7     1     -4            16     6             36     2             4
2     10    2     -8            64     8             64     0             0
4     2     3     2             4      -1            1      1             1
9     1     10    8             64     -9            81     -1            1
7     6     5     1             1      1             1      2             4
8     9     7     -1            1      2             4      1             1
Total                           200                  214                  60

$$r_r = 1 - \frac{6\sum D^2}{N(N^2 - 1)}$$

(R1 and R2) $r_r = 1 - \dfrac{6 \times 200}{10\,(100 - 1)} = 1 - 1.212 = -0.212$

(R2 and R3) $r_r = 1 - \dfrac{6 \times 214}{10\,(100 - 1)} = 1 - 1.297 = -0.297$

(R1 and R3) $r_r = 1 - \dfrac{6 \times 60}{10\,(100 - 1)} = 1 - 0.364 = +0.636$

These answers reveal that there is positive correlation only between the judgments of the first
and third judges. Therefore, this pair of judges has the nearest approach to common tastes.
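
The three coefficients of Example 7 can be checked with the following Python sketch. The function name `spearman_from_ranks` is an assumption for this illustration; it applies the rank difference formula to ranks that are already given and contain no ties.

```python
def spearman_from_ranks(ranks_a, ranks_b):
    """Spearman's rank correlation from two lists of ranks (no tied ranks)."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Rankings of Example 7
judge1 = [1, 6, 5, 10, 3, 2, 4, 9, 7, 8]
judge2 = [3, 5, 8, 4, 7, 10, 2, 1, 6, 9]
judge3 = [6, 4, 9, 8, 1, 2, 3, 10, 5, 7]

r12 = spearman_from_ranks(judge1, judge2)
r23 = spearman_from_ranks(judge2, judge3)
r13 = spearman_from_ranks(judge1, judge3)
print(round(r12, 3), round(r23, 3), round(r13, 3))  # -0.212 -0.297 0.636
```

Only the first and third judges show a positive coefficient, which agrees with the conclusion drawn above.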

(ii) When ranks are not given-

If quantitatively measurable variables are given in the question, the coefficient of
correlation can still be calculated by Spearman's method. In this process ranks are assigned to the
values given in both the series. These ranks may be assigned in ascending or descending order
but generally descending order is preferred, in which the highest value is given rank 1 and the
other values are given ranks accordingly. After assigning the ranks, the difference of ranks of X (Rx)
and ranks of Y (Ry) is calculated (Rx-Ry) and it is shown under the column headed by D. These
differences are squared (D²), ΣD² is obtained by totalling these squares and finally the
formula is applied.

Example 8:

Calculate rank-difference coefficient of correlation from the following data:-


X 75 88 95 70 60 80 81 50
Y 120 134 150 115 110 140 142 100

Solution:
X      Rx     Y       Ry     D (Rx-Ry)    D²
75     5      120     5      0            0
88     2      134     4      -2           4
95     1      150     1      0            0
70     6      115     6      0            0
60     7      110     7      0            0
80     4      140     3      1            1
81     3      142     2      1            1
50     8      100     8      0            0
                             ΣD = 0       ΣD² = 6

$$r_r = 1 - \frac{6\sum D^2}{N(N^2 - 1)} = 1 - \frac{6 \times 6}{8\,(64 - 1)} = 1 - 0.071 = 0.929$$

(iii) When some values or ranks are equal-

Sometimes it may happen that there is more than one item with the same value. In such a
case equal rank or tied rank is assigned to all items of the same value and there are two
methods of assigning equal ranks:-

(1) Bracket Rank Method- In this method all items of the same value are assigned the rank
which is given to the first item of that value. The next value is then given the rank which
it would have been assigned if the items of the same value had differed. For example, if
ranks are to be assigned to the values 12, 15, 11, 15, 18 and 15, then
rank 1st will be given to 18, rank 2nd to all the three items of 15, rank 5th to 12 and rank
6th to 11. Here the rank 5th is given to 12 because if the previous three items (15) were
not equal, then their ranks would have been 2nd, 3rd and 4th in place of 2nd.

(2) Average Rank Method- In this method average rank is assigned to the items of the same
value. “In other words, where two or more items are to be ranked equal, the rank assigned
for purposes of calculating coefficient of correlation is the average of the ranks which
these individuals would have got had they differed slightly from each other”. For
example, in the values 12, 15, 11, 15, 18 and 15 rank 3rd will be assigned to the value 15.
The basis is that rank 1st will be assigned to 18. After that the value 15 has come thrice
and if separate ranks had been assigned they would have been 2nd, 3rd and 4th.
But due to equal value, equal rank is to be assigned and that will be the average of 2, 3
and 4 i.e. (2 + 3 + 4)/3 = 3rd. After this, ranks 5th and 6th will be assigned to 12 and 11
respectively. In practice this method is more popular.

It is important that where equal ranks are assigned to same values, an adjustment is made
in the formula. This adjustment consists of adding $\frac{1}{12}(m^3 - m)$ to the value of ΣD², where 'm'
is the number of times an item is repeated. This adjustment or correction factor is added for each
repeated value in both the series. In such a case, the modified formula may be written as below:-

$$r_r = 1 - \frac{6\left[\sum D^2 + \frac{1}{12}(m^3 - m) + \frac{1}{12}(m^3 - m) + \dots\right]}{N(N^2 - 1)}$$

Example 9: Calculate coefficient of correlation between X and Y by the method of rank


differences:-

X 48 33 40 9 16 16 65 24 16 57
Y 13 13 24 6 15 4 20 9 6 19

Solution:
X      Rx     Y      Ry      Rx-Ry = D    D²
48     3      13     5.5     -2.5         6.25
33     5      13     5.5     -0.5         0.25
40     4      24     1       3.0          9.00
9      10     6      8.5     1.5          2.25
16     8      15     4       4.0          16.00
16     8      4      10      -2.0         4.00
65     1      20     2       -1.0         1.00
24     6      9      7       -1.0         1.00
16     8      6      8.5     -0.5         0.25
57     2      19     3       -1.0         1.00
                             ΣD = 0       ΣD² = 41

In this question, in X series, the value 16 is repeated thrice. The common rank given to
the value 16 is 8th, which is the average of the ranks 7, 8 and 9 (i.e., (7+8+9)/3) which these
values would have assumed if they were different. Here m = 3, so the correction factor to be added
for this value will be $\frac{1}{12}(3^3 - 3)$. In series Y, the values 13 and 6 are each repeated twice.
The average rank for the value 6 is 8.5th i.e., (8+9)/2, while for the value 13 it is 5.5th i.e., (5+6)/2.
In both these cases, the correction factor will be $\frac{1}{12}(2^3 - 2)$.

Thus, the rank correlation is:

$$r_r = 1 - \frac{6\left[\sum D^2 + \frac{1}{12}(m^3 - m) + \frac{1}{12}(m^3 - m) + \frac{1}{12}(m^3 - m)\right]}{N(N^2 - 1)}$$

$$= 1 - \frac{6\left[41 + \frac{1}{12}(3^3 - 3) + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(2^3 - 2)\right]}{10\,(10^2 - 1)}$$

$$= 1 - \frac{6\,[41 + 2 + 0.5 + 0.5]}{990} = 1 - \frac{264}{990} = 1 - 0.267 = 0.733$$

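
The average rank method and the correction for tied values can be sketched in code as follows. This Python sketch is illustrative only; the function names `average_ranks` and `spearman_with_ties` are assumptions for this illustration. Ranks are assigned in descending order, tied values share the average of the ranks they would otherwise occupy, and (m³ - m)/12 is added for every repeated value in either series.

```python
def average_ranks(values):
    """Rank in descending order (largest value = rank 1); ties get the average rank."""
    ordered = sorted(values, reverse=True)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1       # first rank occupied by this value
        count = ordered.count(v)           # number of times the value occurs
        rank_of[v] = first + (count - 1) / 2
    return [rank_of[v] for v in values]

def spearman_with_ties(xs, ys):
    """Spearman's rank correlation with the (m^3 - m)/12 correction for ties."""
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    correction = 0.0
    for series in (xs, ys):
        for v in set(series):
            m = series.count(v)
            if m > 1:                      # only repeated values need a correction
                correction += (m ** 3 - m) / 12
    return 1 - 6 * (sum_d2 + correction) / (n * (n ** 2 - 1))

# Data of Example 9
X = [48, 33, 40, 9, 16, 16, 65, 24, 16, 57]
Y = [13, 13, 24, 6, 15, 4, 20, 9, 6, 19]

ranks_x = average_ranks(X)
ranks_y = average_ranks(Y)
r_ties = spearman_with_ties(X, Y)
print(ranks_x)            # [3.0, 5.0, 4.0, 10.0, 8.0, 8.0, 1.0, 6.0, 8.0, 2.0]
print(round(r_ties, 3))   # 0.733
```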
Spearman's Rank Difference Method is simpler to understand and easier to calculate as
compared to Karl Pearson's method. In this method the sum of differences of ranks (ΣD) is
always equal to zero, which provides a check on calculation. This method can be used in case of
irregular distribution also. If actual data is not given and only ranks are given, even then
correlation can be calculated by this method. Last but not the least, this method is the only way of
studying correlation between qualitative facts which cannot be measured in figures but can be
arranged in serial order.

On the other hand, this method cannot be used in a two-way frequency table or bivariate
frequency distribution. It is appropriate when the number of items is limited, say up to 30. Hence,
it is considered that in case of more than 30 items, this method should be used only when ranks are
clearly given in the question.

Concurrent Deviation Method

Concurrent deviation method is the simplest method of studying correlation. It is
generally used when the main objective of study is to find out the direction of
correlation, because in its calculation the magnitude of deviation is ignored and only the direction
of deviation (increase or decrease i.e., positive or negative) is taken into account. The process of
calculating coefficient of correlation by concurrent deviation method is as follows:-

(1) First of all, deviation signs are marked in both the series on the basis of direction of
change. These signs are marked on the basis of comparison of each value with its
preceding value. In this process, no sign is marked for the first value because it has no
preceding value. Now the second value is compared with the first value. If the second
value is higher, put a sign of plus (+), if it is lower, put a sign of minus (-) and if
it is the same, put a sign of equal (=). Similarly, the third value will be compared with the
second value and this process continues for all other values. It may be noted that only the
sign of deviation is to be marked and the magnitude of deviation is not shown. These
deviations are put in the column of dx and dy for the series of X and Y respectively.

(2) Deviation signs of both the series are multiplied and the sign of such multiplication is put
in the last column. The value of C (concurrent) is determined on the basis of positive
signs of dxdy.

(3) Finally, the following formula is used:-

$$r_c = \pm\sqrt{\pm\left(\frac{2C - n}{n}\right)}$$

Note : 1. The coefficient by concurrent deviation method also falls within the limits of –1
to +1.

2. n = N – 1, where N is the total number of pairs of observations in X and Y series.

3. The plus and minus signs given in this formula should be carefully noted. If the value of
$\frac{2C - n}{n}$ is negative, its square root cannot be calculated. In this case the negative
value is multiplied by the minus sign inside the root, which makes it positive so that the
square root can be taken, but the final answer will be negative. If $\frac{2C - n}{n}$ is
positive, then the negative signs are ignored and the answer will be a positive value.

4. Same signs when multiplied give a positive sign and two different signs when multiplied
give a negative sign.

Example 10: Find the coefficient of concurrent deviations for the following data:-

X 65 40 35 75 75 80 35 20 80 80 50
Y 60 55 50 56 30 70 40 35 80 80 75
Solution:
X      dx     Y      dy     dxdy
65            60
40     -      55     -      +
35     -      50     -      +
75     +      56     +      +
75     =      30     -      -
80     +      70     +      +
35     -      40     -      +
20     -      35     -      +
80     +      80     +      +
80     =      80     =      +
50     -      75     -      +
                            C = 9

Here N = 11, so n = N - 1 = 10.

$$r_c = \pm\sqrt{\pm\left(\frac{2C - n}{n}\right)} = \pm\sqrt{\pm\left(\frac{2 \times 9 - 10}{10}\right)}$$

$$= +\sqrt{+\left(\frac{18 - 10}{10}\right)} = \sqrt{\frac{8}{10}} = \sqrt{0.80} = +0.894$$

Concurrent Deviation Method is simple to compute and easy to understand. The fact is
that it is simplest of all the methods. Moreover, if the number of pairs of observations is very
large, this method is very convenient. This method is very useful in studying the correlation
between the item-values having short-term fluctuations.

On the other hand in this method only the direction of change is considered and the
quantum of change is ignored. It gives only rough indication about correlation because it does
not differentiate between small and big deviations. For example, whether the item value
increases from 100 to 1000 or from 100 to 101, there will be no difference and only a sign of
plus will be marked in both the cases.
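
The procedure of the concurrent deviation method can be sketched in code using the data of Example 10. The Python sketch below is illustrative only; the function names `direction` and `concurrent_deviation` are assumptions for this illustration. A pair of deviations is counted as concurrent when both series move in the same direction, including the case where both remain unchanged.

```python
from math import sqrt

def direction(previous, current):
    """+1 for a rise, -1 for a fall and 0 for no change from the preceding value."""
    return (current > previous) - (current < previous)

def concurrent_deviation(xs, ys):
    """Return C, n and the coefficient of concurrent deviations."""
    dx = [direction(a, b) for a, b in zip(xs, xs[1:])]
    dy = [direction(a, b) for a, b in zip(ys, ys[1:])]
    n = len(dx)                                        # n = N - 1
    c = sum(1 for a, b in zip(dx, dy) if a == b)       # concurrent deviations
    ratio = (2 * c - n) / n
    r_c = sqrt(abs(ratio))
    # The sign of the answer follows the sign of (2C - n) / n.
    return c, n, (r_c if ratio >= 0 else -r_c)

# Data of Example 10
X = [65, 40, 35, 75, 75, 80, 35, 20, 80, 80, 50]
Y = [60, 55, 50, 56, 30, 70, 40, 35, 80, 80, 75]

c, n, r_c = concurrent_deviation(X, Y)
print(c, n, round(r_c, 3))  # 9 10 0.894
```

As in the manual method, only the direction of each change enters the calculation, so a rise from 100 to 101 counts exactly the same as a rise from 100 to 1000.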

3.4 Regression: Meaning

The credit of using regression technique in statistics for the first time goes to British
Biometrician Sir Francis Galton who used this term in 1877, while studying the relationship
between the height of fathers and sons. The technique of 'correlation' is used to measure
statistical relationship which provides information regarding the degree and the direction of
relationship between two related series. But, the technique of regression analysis is required, if
the value of one series (variable) is given and the value of the other series is to be predicted.

o Utility of Regression Analysis


The technique of regression analysis was initiated for the first time in biometrics, but now it has
become an important statistical tool almost in all natural and social sciences. In brief, the utility
of regression analysis can be expressed under following heads:-

1. Analysis of cause and effect – The relationship of cause and effect between two or more
variables can be analyzed with the help of regression analysis.

2. Determination of rate of change in variables- The change in the value of one variable
can be determined from regression coefficients if there is change of a unit in the value of
the other variable. For instance, if regression coefficient of X on Y (bxy) is 0.9, it means that
there will be a change of 0.9 in the value of X, if there is a change of one unit in the value
of Y.

3. Estimation of values- Regression analysis provides estimates of values of the dependent
variable on the basis of values of the independent variable in the areas of social,
economic and business activities. For example, the price at one place on the basis of
price at another place, production on the basis of rainfall, etc., can be estimated with the help
of regression equations.

4. Utility in Business – A businessman has interest in estimating production, demand, sales,
profit and prices, etc., and he can do this with the help of regression analysis.

5. Calculation of coefficient of correlation- The coefficient of correlation between two
variables can also be measured with the help of regression analysis.

o Types of Regressions
• Simple Regression- If regression analysis is based only on two variables, it is called
simple regression. (Only this type of regression has been discussed in this lesson).
• Multiple Regression- If more than two variables are studied at a time in regression
analysis, it is known as multiple regression.
• Linear Regression- In case of linear regression the value of the dependent variable
changes at a constant rate for a unit change in the value of the independent variable. This
constant change may be in terms of absolute amount or percentage.
• Curvi-linear or Non-linear Regression – If the regression line is not a straight line but a
smoothed curve, regression is termed as curvi-linear or non-linear.

In regression analysis there are two types of variables: independent and dependent. The
variable whose value is known and which influences the values or is used for prediction is
called the independent variable, and it is also termed as 'Regressor', 'Predictor' or
'Explanator'. The variable which is predicted is called the dependent variable and it is also termed
as 'Regressed', 'Predicted' or 'Explained'.

o Difference between Correlation and Regression


Commonly both techniques, i.e. Correlation and Regression, are used to determine the degree and
direction of relationship between two or more variables. The choice of one or the other technique
depends on the purpose of study. If one has to know the degree and direction of relationship,
correlation is used. On the other hand, if one has to predict the value of the dependent variable for
a given value of the independent variable, the regression technique will be required. However, there
are certain basic differences in these two techniques, which can be discussed under following heads:-

1. Degree and Nature of Relationship – Correlation coefficient is a measure of degree and


direction of co-variation between two variables i.e., whether the variables under study
move in the same direction or in reverse direction and what is the degree (coefficient of
correlation) of their co-variation. On the contrary, regression tells us about the relative
movement in the variables so that the value of one variable may be predicted on the basis
of the value of the other variable.

2. Relationship of Cause and Effect- After finding out coefficient of correlation, in many
cases it may not be ascertained as to which of the two is cause and which one is effect.
But in regression analysis one variable is clearly assumed as a cause and the other as its
effect. There will be no difference in the value of coefficient of correlation whether it is
found out between x and y or between y and x. However, there will be difference in
regression equations and regression coefficients of x on y and y on x.

3. Limits of Coefficient – The coefficient of correlation varies between  1, but it is not


necessary for regression coefficients. However, the products of both regression
coefficients cannot be more than 1.

4. Independent of change of scale and origin- The correlation coefficient is independent of
change of scale and origin, whereas the regression coefficients are independent of change
of origin but not of scale.

5. Nature of Coefficient – The correlation coefficient is a relative measure of linear relationship
and is independent of the units of measurement, while regression coefficients are absolute
measures linked with the units of measurement.

6. Area of Use – Correlation analysis has limited application as it is confined only to the
study of linear relationship between the variables. On the contrary, regression analysis
has much wider application because it studies linear as well as non-linear relationship.
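Point 4 above can be checked numerically. The sketch below is only an illustration: the helper name `corr_and_coeffs` and the small data set are assumptions made for the demonstration, not part of the lesson. It shifts the origin of X and then changes its scale, and compares r, bxy and byx each time.

```python
from math import sqrt

def corr_and_coeffs(x, y):
    """Return (r, bxy, byx) computed from deviations about the actual means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy), sxy / syy, sxy / sxx

# Illustrative data (hypothetical, chosen only for this check)
x = [1, 4, 5, 7, 8]
y = [6, 5, 9, 3, 7]

r, bxy, byx = corr_and_coeffs(x, y)

# Change of origin only: add 100 to every X
r_o, bxy_o, byx_o = corr_and_coeffs([a + 100 for a in x], y)

# Change of scale: multiply every X by 10
r_s, bxy_s, byx_s = corr_and_coeffs([10 * a for a in x], y)

print("r:  ", r, r_o, r_s)        # unchanged in all three cases
print("bxy:", bxy, bxy_o, bxy_s)  # unchanged by origin, multiplied by 10 by scale
print("byx:", byx, byx_o, byx_s)  # unchanged by origin, divided by 10 by scale
```

The correlation coefficient stays the same in every case, while the regression coefficients survive the change of origin but not the change of scale.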

o Regression Lines
Regression lines are the lines of best fit expressing the mutual average relationship between two
series. These lines give the best estimate of one variable for any given value of the other variable.

In case of two variables there are always two lines of regression. If the variables are
denoted by X and Y, we shall have two regression lines, one of X on Y and the other of Y on X.
The regression line of X on Y gives the most probable values of X for any known value of Y
and the regression line of Y on X gives the most probable values of Y for known values of X.

Why are there two regression lines? There are two reasons:-

1. In regression analysis no single variable is permanently dependent or independent. Either
of the two may be assumed as independent and the other as dependent. If the value of X is given
and the value of Y is to be predicted, then X will be independent and Y dependent. On the
contrary, if the value of X is to be predicted on the basis of a given value of Y, then Y will be
independent and X dependent. In such a situation one line cannot solve the problem, and two
different lines are drawn, each on the assumption that one of the variables is independent.

2. Assumption of the line of best fit- Regression lines are drawn as lines of best fit. The
concept of the line of best fit is based on the principle of least squares, which
stipulates that the sum of squares of the deviations of the observed values from the line shall
be minimum. The deviations of the points from the line of best fit can be measured in two
ways: horizontally, i.e., parallel to the x-axis, and vertically, i.e., parallel to the y-axis. To
minimize each of these totals of squared deviations separately, two regression lines are
needed. The regression line of X on Y is drawn in such a way that it minimizes
the total of squares of the horizontal deviations, whereas the regression line of Y on X
minimizes the total of squares of the vertical deviations.
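The least-squares argument can be illustrated with a short sketch. The data and the helper names (`least_squares`, `vertical_ss`, `horizontal_ss`) are assumptions made for this illustration only. Each line is fitted by least squares and then both lines are scored on vertical and on horizontal squared deviations.

```python
def least_squares(u, v):
    """Fit v = a + b*u, minimizing squared deviations measured along v."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    b = sum((p - mu) * (q - mv) for p, q in zip(u, v)) / sum((p - mu) ** 2 for p in u)
    return mv - b * mu, b

# Illustrative data (hypothetical)
x = [1, 4, 5, 7, 8]
y = [6, 5, 9, 3, 7]

a_yx, b_yx = least_squares(x, y)  # Y on X:  Y = a_yx + b_yx * X
a_xy, b_xy = least_squares(y, x)  # X on Y:  X = a_xy + b_xy * Y

def vertical_ss(a, b):
    """Sum of squared vertical deviations from the line Y = a + b*X."""
    return sum((q - (a + b * p)) ** 2 for p, q in zip(x, y))

def horizontal_ss(a, b):
    """Sum of squared horizontal deviations from the line X = a + b*Y."""
    return sum((p - (a + b * q)) ** 2 for p, q in zip(x, y))

# Re-express each line in the other form so both can be scored both ways
xy_as_y = (-a_xy / b_xy, 1 / b_xy)  # X on Y rewritten as Y = ...
yx_as_x = (-a_yx / b_yx, 1 / b_yx)  # Y on X rewritten as X = ...

v_yx = vertical_ss(a_yx, b_yx)
v_xy = vertical_ss(*xy_as_y)
h_xy = horizontal_ss(a_xy, b_xy)
h_yx = horizontal_ss(*yx_as_x)

print("vertical:  ", v_yx, "(Y on X) vs", v_xy, "(X on Y)")
print("horizontal:", h_xy, "(X on Y) vs", h_yx, "(Y on X)")
```

For these data the Y on X line gives the smaller vertical total and the X on Y line gives the smaller horizontal total, which is why neither line can replace the other unless the correlation is perfect.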

Exception of one regression line – Generally, there are two regression lines. However, if there
is either perfect positive or perfect negative correlation between the two variables ($r = \pm 1$),
there will be only one line of regression because in such a case both the lines coincide. This
situation can be explained as below:-

Equation of Y on X:

$$Y - \bar{Y} = \pm\frac{\sigma_y}{\sigma_x}(X - \bar{X}) \quad \text{because } r = \pm 1$$

$$\frac{Y - \bar{Y}}{\sigma_y} = \pm\frac{X - \bar{X}}{\sigma_x} \qquad \ldots(1)$$

Equation of X on Y:

$$X - \bar{X} = \pm\frac{\sigma_x}{\sigma_y}(Y - \bar{Y})$$

$$\frac{X - \bar{X}}{\sigma_x} = \pm\frac{Y - \bar{Y}}{\sigma_y} \qquad \ldots(2)$$

It is evident that equation (2) is the same as equation (1), and due to this fact both
regression lines coincide in case of perfect correlation.

o Functions or uses of Regressions Lines


The technique of regression is very useful and serves certain specific functions. It helps to
estimate the degree and direction of correlation and to estimate the best value of the unknown
(dependent) variable for given values of the independent variable.

It helps to compute the mean values in the following manner. If, from the point where
both the regression lines cut each other, a perpendicular is drawn on the X-axis, the mean value
of X is obtained, and if from that point a horizontal line is drawn to the Y-axis, the mean value
of Y is obtained.

The ratio of variation at a particular point can also be found out with the help of
regression lines.

o Regression equations
Regression equations are algebraic expressions of the regression lines. As there are two
regression lines, there are two regression equations. These equations are:

1. Regression Equation of X on Y

This equation describes the variation in the values of X for the given changes in Y.
Hence, this equation is used for estimating the value of X for the given value of Y.

a) Direct method:

This equation is expressed as follows:-

$$X - \bar{X} = r\,\frac{\sigma_x}{\sigma_y}(Y - \bar{Y})$$

Where, $X$ = Value of X variable to be predicted

$\bar{X}$ = Arithmetic mean of X series

$r$ = Correlation coefficient of X and Y series

$\sigma_x$ = Standard Deviation of X series

$\sigma_y$ = Standard Deviation of Y series

$Y$ = that value of Y variable, corresponding to which the value of X variable is to be predicted

$\bar{Y}$ = Arithmetic mean of Y series.

Example 1:

Obtain the regression equations from the following data:


| | Mean | Standard Deviation |
|---|---|---|
| X | 25 | 100 |
| Y | 50 | 75 |

Coefficient of Correlation between X and Y series = +0.8

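Solution: The information given in the question is as follows:

$\bar{X} = 25$, $\bar{Y} = 50$, $\sigma_x = 100$ and $\sigma_y = 75$

Regression Equation of X on Y:

$$X - \bar{X} = r\,\frac{\sigma_x}{\sigma_y}(Y - \bar{Y})$$

$$X - 25 = 0.8 \times \frac{100}{75}(Y - 50)$$

$$X = 1.067(Y - 50) + 25$$

$$X = 1.067Y - 53.33 + 25 \qquad \left(\text{since } \tfrac{16}{15} \times 50 = 53.33\right)$$

$$X = 1.067Y - 28.33$$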

The above formula is convenient only when the S.D. and correlation coefficient are either
given in the question or can be calculated easily. If both the X and Y series are given in the
question, then the short-cut method is easier for calculating the value of $r\,\frac{\sigma_x}{\sigma_y}$.

b) Short cut method:

i) If deviations are taken from actual means, then $\frac{\sum d_x d_y}{\sum d_y^2}$ can be used
in place of $r\,\frac{\sigma_x}{\sigma_y}$. The equation is expressed as follows:-

$$X - \bar{X} = \frac{\sum d_x d_y}{\sum d_y^2}(Y - \bar{Y})$$

Example 2:

Calculate the regression equation of X on Y from the following data:

X 1 4 5 7 8
Y 6 5 9 3 7

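Solution:

| X | $d_x = X - \bar{X}$ | $d_x^2$ | Y | $d_y = Y - \bar{Y}$ | $d_y^2$ | $d_x d_y$ |
|---|---|---|---|---|---|---|
| 1 | -4 | 16 | 6 | 0 | 0 | 0 |
| 4 | -1 | 1 | 5 | -1 | 1 | 1 |
| 5 | 0 | 0 | 9 | 3 | 9 | 0 |
| 7 | 2 | 4 | 3 | -3 | 9 | -6 |
| 8 | 3 | 9 | 7 | 1 | 1 | 3 |
| $\sum X = 25$, $\bar{X} = 5$ | | $\sum d_x^2 = 30$ | $\sum Y = 30$, $\bar{Y} = 6$ | | $\sum d_y^2 = 20$ | $\sum d_x d_y = -2$ |

Equation of X on Y:

$$X - \bar{X} = \frac{\sum d_x d_y}{\sum d_y^2}(Y - \bar{Y})$$

$$X - 5 = \frac{-2}{20}(Y - 6)$$

$$X = -0.1(Y - 6) + 5$$

$$X = -0.1Y + 0.6 + 5$$

$$X = -0.1Y + 5.6$$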

ii) If deviations have been taken from assumed means, then
$\frac{N\sum d_x d_y - \sum d_x \sum d_y}{N\sum d_y^2 - (\sum d_y)^2}$ can be used in place of
$r\,\frac{\sigma_x}{\sigma_y}$.

The equation is expressed as follows:-

$$X - \bar{X} = \frac{N\sum d_x d_y - \sum d_x \sum d_y}{N\sum d_y^2 - (\sum d_y)^2}(Y - \bar{Y})$$

Example 3: Obtain the regression equation of X on Y from the following data:


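| X | 10 | 15 | 20 | 25 | 30 | 35 |
|---|---|---|---|---|---|---|
| Y | 12 | 23 | 46 | 50 | 55 | 62 |

Solution:

| X | $d_x = X - A$ | $d_x^2$ | Y | $d_y = Y - A$ | $d_y^2$ | $d_x d_y$ |
|---|---|---|---|---|---|---|
| 10 | -10 | 100 | 12 | -34 | 1156 | 340 |
| 15 | -5 | 25 | 23 | -23 | 529 | 115 |
| 20 | 0 | 0 | 46 | 0 | 0 | 0 |
| 25 | 5 | 25 | 50 | 4 | 16 | 20 |
| 30 | 10 | 100 | 55 | 9 | 81 | 90 |
| 35 | 15 | 225 | 62 | 16 | 256 | 240 |
| $\sum X = 135$, $\bar{X} = 22.5$ | $\sum d_x = 15$ | $\sum d_x^2 = 475$ | $\sum Y = 248$, $\bar{Y} = 41.33$ | $\sum d_y = -28$ | $\sum d_y^2 = 2038$ | $\sum d_x d_y = 805$ |

Assumed mean (A) for X is 20 and for Y is 46. The assumed means are used only for computing the
deviations; the regression equation itself is written with the actual means $\bar{X} = 22.5$ and
$\bar{Y} = 41.33$.

Equation of X on Y:

$$X - \bar{X} = \frac{N\sum d_x d_y - \sum d_x \sum d_y}{N\sum d_y^2 - (\sum d_y)^2}(Y - \bar{Y})$$

$$X - 22.5 = \frac{805 \times 6 - (15 \times -28)}{2038 \times 6 - (-28)^2}(Y - 41.33)$$

$$X - 22.5 = \frac{4830 + 420}{12228 - 784}(Y - 41.33)$$

$$X - 22.5 = \frac{5250}{11444}(Y - 41.33)$$

$$X = 0.4588(Y - 41.33) + 22.5$$

$$X = 0.4588Y - 18.96 + 22.5$$

$$X = 0.4588Y + 3.54$$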

2. Regression Equation of Y on X-

This equation is used for estimating the value of Y for a given value of X. The
equations are:-

a) Direct Method:

$$Y - \bar{Y} = r\,\frac{\sigma_y}{\sigma_x}(X - \bar{X})$$

Where: $Y$ = value of Y to be predicted,

$X$ = given value of X for which the value of Y is to be predicted.

Example 4: Taking the same data as in Example 1, let us calculate the regression equation of Y
on X:

$$Y - \bar{Y} = r\,\frac{\sigma_y}{\sigma_x}(X - \bar{X})$$

Solution: Given information: $\bar{X} = 25$, $\bar{Y} = 50$, $\sigma_x = 100$ and $\sigma_y = 75$

$$Y - 50 = 0.8 \times \frac{75}{100}(X - 25)$$

$$Y = 0.60(X - 25) + 50$$

$$Y = 0.6X - 15 + 50$$

$$Y = 0.6X + 35$$
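The direct method for both lines can be sketched in a few lines of code. The function name `regression_from_summary` is an assumption made for illustration; the inputs are the means, standard deviations and r of Examples 1 and 4. With the slope carried to more decimal places, the intercept of the X on Y line comes to about -28.33.

```python
def regression_from_summary(mean_x, mean_y, sd_x, sd_y, r):
    """Direct method: return ((bxy, intercept_x), (byx, intercept_y)).

    X on Y:  X = intercept_x + bxy * Y
    Y on X:  Y = intercept_y + byx * X
    """
    bxy = r * sd_x / sd_y
    byx = r * sd_y / sd_x
    # Both lines pass through the point of means (mean_x, mean_y)
    return (bxy, mean_x - bxy * mean_y), (byx, mean_y - byx * mean_x)

# Data of Examples 1 and 4
(bxy, ax), (byx, ay) = regression_from_summary(25, 50, 100, 75, 0.8)

print(f"X on Y: X = {bxy:.3f}Y + ({ax:.2f})")
print(f"Y on X: Y = {byx:.3f}X + ({ay:.2f})")
```

Writing each intercept as "mean minus slope times the other mean" is the same step as expanding the bracket in the worked solution.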

This method is used only when the S.D. and correlation coefficient are either given in the
question or can be calculated easily. If both the X and Y series are given in the question, then
the short-cut method is easier for calculating the value of $r\,\frac{\sigma_y}{\sigma_x}$.

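b) Short cut method: If both series are given in the question, then:

i) $\frac{\sum d_x d_y}{\sum d_x^2}$ can be used in place of $r\,\frac{\sigma_y}{\sigma_x}$ if
deviations are taken from the actual arithmetic means. The equation used then is as follows:

$$Y - \bar{Y} = \frac{\sum d_x d_y}{\sum d_x^2}(X - \bar{X})$$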

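Example 5: Using the same data as in Example 2, let us compute the regression equation
of Y on X:

Solution: Equation of Y on X:

$$Y - \bar{Y} = \frac{\sum d_x d_y}{\sum d_x^2}(X - \bar{X})$$

$$Y - 6 = \frac{-2}{30}(X - 5)$$

$$Y = -0.067(X - 5) + 6$$

$$Y = -0.067X + 0.33 + 6 \qquad \left(\text{since } \tfrac{2}{30} \times 5 = 0.33\right)$$

$$Y = -0.067X + 6.33$$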
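The short-cut method with deviations from the actual means can be sketched as follows for the data of Examples 2 and 5. The function name `regression_lines` and the returned dictionary keys are assumptions made for illustration.

```python
def regression_lines(x, y):
    """Short-cut method using deviations from the actual means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    sum_dxdy = sum(p * q for p, q in zip(dx, dy))
    bxy = sum_dxdy / sum(q * q for q in dy)  # slope of X on Y
    byx = sum_dxdy / sum(p * p for p in dx)  # slope of Y on X
    return {
        "bxy": bxy,
        "x_intercept": mean_x - bxy * mean_y,  # X = x_intercept + bxy * Y
        "byx": byx,
        "y_intercept": mean_y - byx * mean_x,  # Y = y_intercept + byx * X
    }

# Data of Examples 2 and 5
lines = regression_lines([1, 4, 5, 7, 8], [6, 5, 9, 3, 7])
for name, value in lines.items():
    print(name, round(value, 3))
```

Keeping the slope as the exact fraction -2/30 until the last step gives an intercept of about 6.33 for the Y on X line.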

ii) $\frac{N\sum d_x d_y - \sum d_x \sum d_y}{N\sum d_x^2 - (\sum d_x)^2}$ can be used in place of
$r\,\frac{\sigma_y}{\sigma_x}$ if deviations are taken from assumed means. The equation used then
is as follows:

$$Y - \bar{Y} = \frac{N\sum d_x d_y - \sum d_x \sum d_y}{N\sum d_x^2 - (\sum d_x)^2}(X - \bar{X})$$

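Example 6: Using the same data as in Example 3, let us compute the regression equation
of Y on X:

Solution: Equation of Y on X:

$$Y - \bar{Y} = \frac{N\sum d_x d_y - \sum d_x \sum d_y}{N\sum d_x^2 - (\sum d_x)^2}(X - \bar{X})$$

Here $N = 6$, and the actual means $\bar{X} = 22.5$ and $\bar{Y} = 41.33$ are used in the equation:

$$Y - 41.33 = \frac{805 \times 6 - (15 \times -28)}{475 \times 6 - (15)^2}(X - 22.5)$$

$$Y - 41.33 = \frac{4830 + 420}{2850 - 225}(X - 22.5)$$

$$Y - 41.33 = \frac{5250}{2625}(X - 22.5)$$

$$Y = 2(X - 22.5) + 41.33$$

$$Y = 2X - 45 + 41.33$$

$$Y = 2X - 3.67$$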
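The assumed-mean formula can be checked with a short sketch on the data of Examples 3 and 6. The function name `coeffs_from_assumed_means` is an assumption made for illustration. The sketch takes N as the number of pairs (6 here), recovers the actual means as A + Σd/N, and writes each line through those actual means; it also shows that a different choice of assumed means gives the same coefficients.

```python
def coeffs_from_assumed_means(x, y, a_x, a_y):
    """Return (bxy, byx, mean_x, mean_y) using deviations from assumed means."""
    n = len(x)
    dx = [v - a_x for v in x]
    dy = [v - a_y for v in y]
    s_dx, s_dy = sum(dx), sum(dy)
    num = n * sum(p * q for p, q in zip(dx, dy)) - s_dx * s_dy
    bxy = num / (n * sum(q * q for q in dy) - s_dy ** 2)
    byx = num / (n * sum(p * p for p in dx) - s_dx ** 2)
    # Actual means recovered from the assumed means
    return bxy, byx, a_x + s_dx / n, a_y + s_dy / n

x = [10, 15, 20, 25, 30, 35]
y = [12, 23, 46, 50, 55, 62]

bxy, byx, mean_x, mean_y = coeffs_from_assumed_means(x, y, 20, 46)
other = coeffs_from_assumed_means(x, y, 25, 50)  # a different pair of assumed means

x_intercept = mean_x - bxy * mean_y  # X = x_intercept + bxy * Y
y_intercept = mean_y - byx * mean_x  # Y = y_intercept + byx * X

print(round(bxy, 4), round(x_intercept, 2))
print(round(byx, 4), round(y_intercept, 2))
```

Because the coefficients do not depend on which assumed means are chosen, the assumed means can be picked purely for arithmetic convenience.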

Example 7: Given are the following data for two tests:

| | Accountancy | Statistics |
|---|---|---|
| Mean | 75 | 70 |
| Standard Deviation | 6 | 8 |

Coefficient of correlation = 0.72. Calculate (a) both the regression equations, and (b) the
marks of a student in Statistics if his marks in Accountancy are 65.

Solution: Let marks in Accountancy be X and marks in Statistics be Y.

(a) Regression equation of X on Y:

$$X - \bar{X} = r\,\frac{\sigma_x}{\sigma_y}(Y - \bar{Y})$$

$$X - 75 = 0.72 \times \frac{6}{8}(Y - 70)$$

$$X = 0.54(Y - 70) + 75$$

$$X = 0.54Y - 37.80 + 75$$

$$X = 0.54Y + 37.20$$

Regression equation of Y on X:

$$Y - \bar{Y} = r\,\frac{\sigma_y}{\sigma_x}(X - \bar{X})$$

$$Y - 70 = 0.72 \times \frac{8}{6}(X - 75)$$

$$Y = 0.96(X - 75) + 70$$

$$Y = 0.96X - 72 + 70$$

$$Y = 0.96X - 2$$

(b) We have to find the marks in Statistics (Y), which is the dependent variable, when the marks
in Accountancy (X), which is the independent variable, are given as 65. Therefore the value of
Y will be found with the help of the regression equation of Y on X, which is $Y = 0.96X - 2$.

$$Y = 0.96 \times 65 - 2 = 60.40$$

Therefore, marks in Statistics will be 60.4 when the marks in Accountancy are 65.

o Regression Coefficient
Regression coefficient is an algebraic measurement of the slope of the regression line. This
coefficient indicates that if there is a unit change in the values of one variable, then what will be
the average change in the values of other variable. As there are two regression equations, so there
are two regression coefficients also.

1. Regression Coefficient of X on Y ($b_{xy}$)-

This coefficient represents the change in the value of variable X for a unit change in the
value of the variable Y. It can be calculated on the basis of different alternative formulae
discussed as under:

a. When the Coefficient of Correlation ($r$) and both Standard Deviations ($\sigma_x$ and $\sigma_y$)
are given or have been calculated:

$$b_{xy} = r\,\frac{\sigma_x}{\sigma_y}$$

b. When X and Y series are given and deviations are taken from actual means in both the
series:-

$$b_{xy} = \frac{\sum d_x d_y}{\sum d_y^2}$$

c. When X and Y series are given and deviations have been taken from assumed mean
either in one or in both series:

$$b_{xy} = \frac{N\sum d_x d_y - \sum d_x \sum d_y}{N\sum d_y^2 - (\sum d_y)^2}$$

d. When in place of deviations the 'Product Moment Method' is used:-

$$b_{xy} = \frac{N\sum XY - \sum X \sum Y}{N\sum Y^2 - (\sum Y)^2}$$

2. Regression Coefficient of Y on X ($b_{yx}$)-

This coefficient represents the change in the value of variable Y for a unit change in the
value of variable X. Different alternative formulae on the basis of calculation process are as
follows:-

(a) $$b_{yx} = r\,\frac{\sigma_y}{\sigma_x}$$

(b) $$b_{yx} = \frac{\sum d_x d_y}{\sum d_x^2}$$

(c) $$b_{yx} = \frac{N\sum d_x d_y - \sum d_x \sum d_y}{N\sum d_x^2 - (\sum d_x)^2}$$

(d) $$b_{yx} = \frac{N\sum XY - \sum X \sum Y}{N\sum X^2 - (\sum X)^2}$$

Example 8: Calculate from the following data: (a) Two regression coefficients, (b) Two
regression equations, (c) Coefficient of correlation, (d) Most likely value of X when Y is 39 and
(e) Most likely value of Y when X is 30.

| X | 25 | 28 | 35 | 32 | 31 | 36 | 29 | 38 | 34 | 32 |
|---|---|---|---|---|---|---|---|---|---|---|
| Y | 43 | 46 | 49 | 41 | 36 | 32 | 31 | 30 | 33 | 39 |

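Solution:

| X | $d_x = X - \bar{X}$ | $d_x^2$ | Y | $d_y = Y - \bar{Y}$ | $d_y^2$ | $d_x d_y$ |
|---|---|---|---|---|---|---|
| 25 | -7 | 49 | 43 | 5 | 25 | -35 |
| 28 | -4 | 16 | 46 | 8 | 64 | -32 |
| 35 | 3 | 9 | 49 | 11 | 121 | 33 |
| 32 | 0 | 0 | 41 | 3 | 9 | 0 |
| 31 | -1 | 1 | 36 | -2 | 4 | 2 |
| 36 | 4 | 16 | 32 | -6 | 36 | -24 |
| 29 | -3 | 9 | 31 | -7 | 49 | 21 |
| 38 | 6 | 36 | 30 | -8 | 64 | -48 |
| 34 | 2 | 4 | 33 | -5 | 25 | -10 |
| 32 | 0 | 0 | 39 | 1 | 1 | 0 |
| $\sum X = 320$, $\bar{X} = 32$ | | $\sum d_x^2 = 140$ | $\sum Y = 380$, $\bar{Y} = 38$ | | $\sum d_y^2 = 398$ | $\sum d_x d_y = -93$ |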

(a) Two regression coefficients:

i. Regression coefficient of X on Y:

$$b_{xy} = \frac{\sum d_x d_y}{\sum d_y^2} = \frac{-93}{398} = -0.234$$

ii. Regression coefficient of Y on X:

$$b_{yx} = \frac{\sum d_x d_y}{\sum d_x^2} = \frac{-93}{140} = -0.664$$

(b) Two regression equations:

i. Regression Equation of X on Y:

$$X - \bar{X} = \frac{\sum d_x d_y}{\sum d_y^2}(Y - \bar{Y})$$

$$X - 32 = -0.234(Y - 38)$$

$$X = -0.234Y + 8.89 + 32$$

$$X = -0.234Y + 40.89$$

ii. Regression Equation of Y on X:

$$Y - \bar{Y} = \frac{\sum d_x d_y}{\sum d_x^2}(X - \bar{X})$$

$$Y - 38 = -0.664(X - 32)$$

$$Y = -0.664X + 21.25 + 38$$

$$Y = -0.664X + 59.25$$

(c) Coefficient of correlation

$$r = \pm\sqrt{b_{xy} \times b_{yx}}$$

$$r = -\sqrt{(-0.234) \times (-0.664)} = -0.394$$

The negative sign is taken because both regression coefficients are negative.

(d) Most likely value of X when Y is 39

Regression Equation of X on Y: $X = -0.234Y + 40.89$

$$X = -0.234 \times 39 + 40.89 = 31.76$$

(e) Most likely value of Y when X is 30

Regression Equation of Y on X: $Y = -0.664X + 59.25$

$$Y = -0.664 \times 30 + 59.25 = 39.33$$

o Properties of Regression Coefficients


The regression coefficients possess some important properties, which are as follows:-

1. Same Sign – Both the regression coefficients will have the same algebraic sign, i.e.,
either both of these will be positive or both negative. It is never possible that one of the
regression coefficients is negative and the other positive.

2. Value of Coefficients- Both the regression coefficients cannot be numerically greater than
one. It means that if one of the regression coefficients is greater than one, the other must be
less than one, because the product of the two regression coefficients cannot be more than one.
In other words, $b_{xy} \times b_{yx} \le 1$.

3. Calculation of Correlation Coefficient: If both regression coefficients are known, then the
correlation coefficient can be calculated by using the following formula:-

$$r = \pm\sqrt{b_{xy} \times b_{yx}}$$

It indicates that the correlation coefficient is the geometric mean of the two regression
coefficients.

4. Same sign as that of regression coefficients: It means that if the regression coefficients
have a negative sign, then the correlation coefficient will also be negative, and if the
regression coefficients have a positive sign, then r will also be positive.

5. Mean and coefficient of correlation- The arithmetic mean of the two regression coefficients
will be numerically equal to or greater than the coefficient of correlation. Symbolically, for
positive coefficients,

$$\frac{b_{xy} + b_{yx}}{2} \ge r$$

6. Regression coefficients are independent of change of origin but not of scale.
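These properties can be verified numerically. The sketch below uses the data of Example 8; the variable names are illustrative. It takes the sign of r from the regression coefficients (property 4) and compares the arithmetic mean with r in absolute terms, since both coefficients are negative for these data.

```python
from math import sqrt, copysign

# Data of Example 8
x = [25, 28, 35, 32, 31, 36, 29, 38, 34, 32]
y = [43, 46, 49, 41, 36, 32, 31, 30, 33, 39]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
s_dxdy = sum((a - mx) * (b - my) for a, b in zip(x, y))
bxy = s_dxdy / sum((b - my) ** 2 for b in y)
byx = s_dxdy / sum((a - mx) ** 2 for a in x)

same_sign = bxy * byx >= 0                     # property 1
product_at_most_one = bxy * byx <= 1           # property 2
r = copysign(sqrt(bxy * byx), bxy)             # properties 3 and 4
mean_dominates = abs(bxy + byx) / 2 >= abs(r)  # property 5, in absolute terms

print(round(bxy, 3), round(byx, 3), round(r, 3))
print(same_sign, product_at_most_one, mean_dominates)
```

The value of r obtained this way agrees with the value computed directly as Σdxdy divided by the square root of Σdx² × Σdy².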

o Some Important points relating to Regression Analysis


1. Calculation of Means, Correlation Coefficient, S.D., etc. on the basis of Regression
Equations- Both the lines of regression intersect each other at the point of their mean values
($\bar{X}$ and $\bar{Y}$). In other words, the mean values ($\bar{X}$ and $\bar{Y}$) can be obtained
at the point of intersection of both regression lines, as shown in the following diagram:

[Diagram: the two regression lines (Y on X and X on Y) intersecting at the point of means]

It is clear from this diagram that if two regression equations are given in the question,
both means ($\bar{X}$ and $\bar{Y}$) can be calculated by solving these equations as simultaneous
equations. If it is not clear in the question which equation is X on Y and which is Y on X, then
any one equation is assumed to be X on Y and the other Y on X, and the values of $b_{xy}$ and
$b_{yx}$ are calculated. If the product of $b_{xy}$ and $b_{yx}$ is not greater than 1, our
assumption is correct; if it is greater than 1, the assumption must be reversed.

Example 9: Find the mean of the variables X and Y and correlation coefficient from the
following information:

Regression equation of Y on X: 6Y – 8X – 100 = 0

Regression equation of X on Y: 2Y – 4X – 40 = 0

Solution:

(a) Mean of the variables X and Y

Regression equation of Y on X: 6Y – 8X – 100 = 0 ……….(i)

Regression equation of X on Y: 2Y – 4X – 40 = 0 ……… (ii)

OR 6Y – 8X = 100 ……..(i)

2Y – 4X = 40 …….. (ii)

We will solve them as simultaneous equations. Multiply equation (ii) by 2 and subtract it
from equation (i):

$$6Y - 8X = 100 \qquad \ldots(i)$$

$$4Y - 8X = 80 \qquad \ldots(ii) \times 2$$

$$2Y = 20, \quad \text{or } Y = 10$$

On substituting the value of Y in equation (i):

$$6 \times 10 - 8X = 100$$

$$-8X = 100 - 60 = 40$$

$$X = -5$$

Hence $\bar{X} = -5$ and $\bar{Y} = 10$.

(b) Correlation coefficient: In order to find the correlation coefficient we first have
to find the values of $b_{xy}$ and $b_{yx}$ on the basis of the equations.

Y on X:

$$6Y - 8X = 100$$

$$6Y = 8X + 100$$

$$Y = \frac{8}{6}X + \frac{100}{6}$$

$$b_{yx} = 1.33$$

X on Y:

$$2Y - 4X = 40$$

$$-4X = -2Y + 40$$

$$X = 0.5Y - 10$$

$$b_{xy} = 0.5$$

$$r = \pm\sqrt{b_{xy} \times b_{yx}}$$

$$r = \sqrt{\tfrac{8}{6} \times 0.5} = \sqrt{0.667} = +0.816$$

The product $b_{xy} \times b_{yx} = 0.667$ does not exceed 1, so the assignment of the two equations
is acceptable; r is positive because both regression coefficients are positive.
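The procedure of Example 9 can be sketched in code. The helper names (`solve`, `identify`) and the representation of each equation as coefficients (a, b, c) of aX + bY = c are assumptions made for illustration. The sketch solves the two equations for the means and then tries both labelings, keeping the one whose product of regression coefficients does not exceed 1.

```python
from fractions import Fraction
from math import sqrt, copysign

# Each equation is written as a*X + b*Y = c
eq1 = (Fraction(-8), Fraction(6), Fraction(100))  # 6Y - 8X = 100
eq2 = (Fraction(-4), Fraction(2), Fraction(40))   # 2Y - 4X = 40

def solve(e1, e2):
    """Solve the pair of equations by Cramer's rule; returns (mean_x, mean_y)."""
    (a1, b1, c1), (a2, b2, c2) = e1, e2
    det = a1 * b2 - a2 * b1
    return (c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det

def identify(e1, e2):
    """Try both labelings; return (bxy, byx) for the one with 0 <= bxy*byx <= 1."""
    for y_on_x, x_on_y in ((e1, e2), (e2, e1)):
        byx = -y_on_x[0] / y_on_x[1]  # slope when the equation is solved for Y
        bxy = -x_on_y[1] / x_on_y[0]  # slope when the equation is solved for X
        if 0 <= bxy * byx <= 1:
            return bxy, byx
    raise ValueError("no valid labeling of the two equations")

mean_x, mean_y = solve(eq1, eq2)
bxy, byx = identify(eq1, eq2)
r = copysign(sqrt(float(bxy * byx)), float(bxy))

print(mean_x, mean_y)        # -5 10
print(bxy, byx, round(r, 3))
```

Passing the equations in the opposite order gives the same coefficients, because the labeling with a product of 1.5 is rejected.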

2. Calculation of Correlation Coefficient on the basis of Standard Deviations- If the values
of standard deviations and number of items are given in the question, the following formula is
used for the calculation of correlation coefficient:-

$$r = \frac{\sum d_x d_y}{N \times \sigma_x \times \sigma_y}$$

It is also possible that $b_{xy}$ and $b_{yx}$ are calculated on the basis of information given
in the question and then r is obtained as $\sqrt{b_{xy} \times b_{yx}}$.

3. Graphic Presentation of Regression Lines- The regression lines are drawn by calculating,
with the help of the regression equations, the values of one variable for given values of the
other. The values so calculated are plotted on the graph to get both the lines of regression,
i.e., X on Y and Y on X.

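4. Regression calculation in Bivariate Grouped Series: In this case the values of
$\sum f d_x d_y$, $\sum f d_x$, $\sum f d_y$, $\sum f d_x^2$ and $\sum f d_y^2$ are obtained with
the help of the correlation table as discussed in the last lesson. Regression equations can be
developed by calculating $\bar{X}$, $\bar{Y}$, $b_{xy}$ and $b_{yx}$ with the help of the following
formulae:

$$\bar{X} = A_x + \frac{\sum f d_x}{N} \times i_x \qquad \bar{Y} = A_y + \frac{\sum f d_y}{N} \times i_y$$

$$b_{xy} = \frac{i_x}{i_y} \times \frac{N\sum f d_x d_y - \sum f d_x \sum f d_y}{N\sum f d_y^2 - (\sum f d_y)^2}$$

$$b_{yx} = \frac{i_y}{i_x} \times \frac{N\sum f d_x d_y - \sum f d_x \sum f d_y}{N\sum f d_x^2 - (\sum f d_x)^2}$$

Here $A_x$ and $A_y$ are the assumed means and $i_x$ and $i_y$ are the class intervals of the X and
Y series.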

Example 10:

The following table gives the marks obtained by 50 students in Statistics and Economics.
Find the two regression equations. Also estimate the marks in Statistics of a student who secured
20 marks in Economics:-

| Marks in Economics | Marks in Statistics: 20-25 | 25-30 | 30-35 | Total |
|---|---|---|---|---|
| 16-20 | 10 | 15 | - | 25 |
| 20-24 | 6 | 10 | 4 | 20 |
| 24-28 | - | - | 5 | 5 |
| Total | 16 | 25 | 9 | 50 |

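Solution:

Let the marks in Statistics be denoted by X and marks in Economics by Y.

| Y | $d_y$ | X: 20-25 ($d_x = -1$) | 25-30 ($d_x = 0$) | 30-35 ($d_x = +1$) | Total (f) | $f d_y$ | $f d_y^2$ | $f d_x d_y$ |
|---|---|---|---|---|---|---|---|---|
| 16-20 | -1 | 10 (10) | 15 (0) | - | 25 | -25 | 25 | 10 |
| 20-24 | 0 | 6 (0) | 10 (0) | 4 (0) | 20 | 0 | 0 | 0 |
| 24-28 | +1 | - | - | 5 (5) | 5 | 5 | 5 | 5 |
| Total (f) | | 16 | 25 | 9 | N = 50 | $\sum f d_y = -20$ | $\sum f d_y^2 = 30$ | $\sum f d_x d_y = 15$ |
| $f d_x$ | | -16 | 0 | 9 | $\sum f d_x = -7$ | | | |
| $f d_x^2$ | | 16 | 0 | 9 | $\sum f d_x^2 = 25$ | | | |
| $f d_x d_y$ | | 10 | 0 | 5 | $\sum f d_x d_y = 15$ | | | |

(Figures in brackets are the values of $f d_x d_y$ for each cell.)

The values obtained on the basis of the above table are as follows:-

$\sum f d_x d_y = 15$, $\sum f d_x = -7$, $\sum f d_y = -20$, $\sum f d_x^2 = 25$, $\sum f d_y^2 = 30$,
$N = 50$, $i_x = 5$, $i_y = 4$, $A_x = 27.5$, $A_y = 22$.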

$$\bar{X} = A_x + \frac{\sum f d_x}{N} \times i_x = 27.5 + \frac{-7}{50} \times 5 = 27.5 - 0.7 = 26.8$$

$$\bar{Y} = A_y + \frac{\sum f d_y}{N} \times i_y = 22 + \frac{-20}{50} \times 4 = 22 - 1.6 = 20.4$$

$$b_{xy} = \frac{i_x}{i_y} \times \frac{N\sum f d_x d_y - \sum f d_x \sum f d_y}{N\sum f d_y^2 - (\sum f d_y)^2}
= \frac{5}{4} \times \frac{15 \times 50 - (-7 \times -20)}{30 \times 50 - (-20)^2}
= \frac{5}{4} \times \frac{610}{1100} = 0.69$$

$$b_{yx} = \frac{i_y}{i_x} \times \frac{N\sum f d_x d_y - \sum f d_x \sum f d_y}{N\sum f d_x^2 - (\sum f d_x)^2}
= \frac{4}{5} \times \frac{15 \times 50 - (-7 \times -20)}{25 \times 50 - (-7)^2}
= \frac{4}{5} \times \frac{610}{1201} = 0.406$$

Equation of X on Y:

$$X - \bar{X} = b_{xy}(Y - \bar{Y})$$

$$X - 26.8 = 0.69(Y - 20.4)$$

$$X = 0.69Y - 14.076 + 26.8$$

$$X = 0.69Y + 12.724$$

Equation of Y on X:

$$Y - \bar{Y} = b_{yx}(X - \bar{X})$$

$$Y - 20.4 = 0.406(X - 26.8)$$

$$Y = 0.406X - 10.88 + 20.4$$

$$Y = 0.406X + 9.52$$

Marks in Statistics of a student who secured 20 marks in Economics:-

$$X = 0.69Y + 12.724$$

$$X = 0.69 \times 20 + 12.724 = 13.8 + 12.724 = 26.524$$
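The grouped-series calculation of Example 10 can be sketched as follows. The layout of the frequency table as a list of rows and all variable names are assumptions made for illustration. Step deviations are taken from the assumed means 27.5 and 22 with class intervals 5 and 4, as in the solution above.

```python
# Frequencies: rows are Economics classes (Y), columns are Statistics classes (X)
freq = [
    [10, 15, 0],  # Y: 16-20
    [6, 10, 4],   # Y: 20-24
    [0, 0, 5],    # Y: 24-28
]
dx = [-1, 0, 1]         # step deviations of the X classes 20-25, 25-30, 30-35
dy = [-1, 0, 1]         # step deviations of the Y classes 16-20, 20-24, 24-28
ix, iy = 5, 4           # class intervals
a_x, a_y = 27.5, 22     # assumed means (mid-points of the middle classes)

cells = [(f, dx[j], dy[i]) for i, row in enumerate(freq) for j, f in enumerate(row)]
n = sum(f for f, _, _ in cells)
s_fdx = sum(f * u for f, u, _ in cells)
s_fdy = sum(f * v for f, _, v in cells)
s_fdx2 = sum(f * u * u for f, u, _ in cells)
s_fdy2 = sum(f * v * v for f, _, v in cells)
s_fdxdy = sum(f * u * v for f, u, v in cells)

mean_x = a_x + s_fdx / n * ix
mean_y = a_y + s_fdy / n * iy
num = n * s_fdxdy - s_fdx * s_fdy
bxy = ix / iy * num / (n * s_fdy2 - s_fdy ** 2)
byx = iy / ix * num / (n * s_fdx2 - s_fdx ** 2)

# Estimated marks in Statistics (X) for 20 marks in Economics (Y)
x_at_20 = mean_x + bxy * (20 - mean_y)

print(mean_x, mean_y, round(bxy, 3), round(byx, 3), round(x_at_20, 2))
```

Because the sketch keeps bxy unrounded, the estimate differs from the hand calculation (which uses 0.69) only in the third decimal place.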

o Standard Error of the Estimate


The estimated values of X and Y calculated with the help of regression equations may or
may not be equal to the actual values. The error, or the difference between the actual and the
estimated values, can be measured by the 'Standard Error of Estimate'. It is the measure of
reliability of the estimating equations and is calculated as follows:

$$S_{xy} = \sqrt{\frac{\sum (X - X_c)^2}{N}} \qquad S_{yx} = \sqrt{\frac{\sum (Y - Y_c)^2}{N}}$$

Here X and Y are the actual values and $X_c$ and $Y_c$ are the estimated values respectively. If
$r$, $\sigma_x$ and $\sigma_y$ are given, then the following formula may be used:

$$S_{xy} = \sigma_x\sqrt{1 - r^2} \qquad S_{yx} = \sigma_y\sqrt{1 - r^2}$$

If the Standard Error of Estimate is zero, then there is no variation between the actual values
and the estimated values. The smaller the standard error, the better the estimated values
represent the actual values; the larger the standard error, the less representative the
estimated values are of the actual values.

Example 11: Find the standard error of estimates from the following:

$\sigma_x = 3.5$, $\sigma_y = 2.82$ and $r = 0.34$

Solution:

$$S_{xy} = \sigma_x\sqrt{1 - r^2} = 3.5\sqrt{1 - (0.34)^2} = 3.5 \times 0.94 = 3.29$$

$$S_{yx} = \sigma_y\sqrt{1 - r^2} = 2.82\sqrt{1 - (0.34)^2} = 2.82 \times 0.94 = 2.65$$
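Both forms of the standard error of estimate can be sketched in code. The function names are assumptions made for illustration. The first part reproduces Example 11; the second part uses a small hypothetical data set to show that the residual form and the $\sigma_y\sqrt{1 - r^2}$ form agree when the estimates come from the Y on X line.

```python
from math import sqrt

def se_from_r(sd, r):
    """Standard error of estimate from a standard deviation and r."""
    return sd * sqrt(1 - r ** 2)

def se_from_values(actual, estimated):
    """Standard error of estimate from actual and estimated values."""
    n = len(actual)
    return sqrt(sum((a - e) ** 2 for a, e in zip(actual, estimated)) / n)

# Example 11
s_xy = se_from_r(3.5, 0.34)
s_yx = se_from_r(2.82, 0.34)
print(round(s_xy, 2), round(s_yx, 2))  # 3.29 2.65

# Consistency check on hypothetical data
x = [1, 4, 5, 7, 8]
y = [6, 5, 9, 3, 7]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
byx = sxy / sxx
y_est = [my + byx * (a - mx) for a in x]  # estimates from the Y on X line

direct = se_from_values(y, y_est)
via_r = se_from_r(sqrt(syy / n), sxy / sqrt(sxx * syy))
print(direct, via_r)
```

The two routes give the same value because the population form of the standard deviation (division by N) is used in both.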

o Ratio of Variation
Although two series may have almost perfect correlation, yet the proportionate changes in them
may be different from each other. For instance, if it is assumed that there is perfect positive
correlation between rainfall and agricultural production, it does not mean that with an
increase of 10% in rainfall, agricultural production will also increase by 10%. Similarly, if there
is negative correlation between price and demand, i.e., with an increase in price, demand
decreases, it is not necessary that with an increase of 20% in price, demand will also
decrease by 20%. Hence, to know the proportional changes in two series another measure, known as
'Ratio of Variation', is calculated.

“Ratio of variation is the arithmetic average of the ratio of the percentage deviations
from the mean in the relative series (Y) as compared to those in the subject (X).” In other words,
the ratio of variation answers the question: if the subject (X) changes by 1%, by what
percentage would the relative (Y) change?

Calculation of Ratio of Variation-

There are two methods of computing ratio of variation:-

a. Algebraic Method.

b. Graphical Method or Galton Graph

3.5 Summary

The coefficient of correlation is one of the most useful tools of statistical analysis in every
discipline. It not only helps in understanding economic behaviour but also locates the variables
on which others depend. Correlation can successfully be utilized to forecast and plan for the
future. In this lesson we learnt how to calculate correlation by graphical as well as
mathematical methods. We also studied the relation of the probable error to the coefficient of
correlation.

We also learnt that regression analysis is a step further than correlation analysis.
Correlation tells us only the direction (positive or negative) and the degree of the
relationship between two variables; how much one variable changes for a given change in the
other can be found only by regression analysis. The regression equations help us to compute the
extent of change in one variable due to a given change in the other one. The coefficient of
regression gives the slope of the regression line. There are always two lines of regression, one
of X on Y and the other of Y on X. The standard error of estimate measures the difference between
the actual and the estimated values, and the ratio of variation measures the proportional change
in one series relative to the other.

3.6 Questions

1. What is meant by correlation? Does it always signify cause and effect relationship
between the two variables? Explain with one illustration.

2. Distinguish between positive and negative correlation. Explain the interpretation of
correlation with the help of scatter diagrams.

3. Write short notes on the following:- (i) Simple, Multiple and Partial Correlation, (ii)
Linear and Curvilinear Correlation, (iii) Degree of Correlation.

4. What is probable error of coefficient of correlation? How can the correlation be discussed
with the help of probable error?

5. Define Regression. Why are there two regression lines? Under what conditions can there
be only one line?

6. Explain the terms Correlation and Regression and their utility in economic analysis.
Explain by an example.

7. What is Ratio of Variation?

8. What do you mean by regression coefficient? What are the uses of regression analysis?

3.7 Suggested Reading

1. Hooda, R. P., “Statistics for Business and Economics”, Macmillan, New Delhi.

2. Ya-lun Chou, “Statistical Analysis with Business and Economics Applications”, Holt,
Rinehart & Winston, New York.

3. Gupta, K.L., “Business Statistics”, Navyug Sahitya Sadan, Agra, 2006.

4. Agarwal, S.L. and Bhardwaj, S.L., “Business Statistics”, Kalyani Publishers, Ludhiana, 2002.

UNIT - IV

INDEX NUMBER & PROBABILITY

Structure

4.1 Objectives

4.2 Index Number Meaning, Types and Uses;

o Methods of constructing Price and Quantity indices


o Problems Relating to Methods of Base Year
o Problems in constructing index numbers.
4.3 THEORY OF PROBABILITY:
o Meaning and definitions of Probability

o Types of Probability

o Basic concept of Probability

o Methods to Use in Solving Probability Problems

o Approaches of Assigning Probabilities

4.4 Summary

4.5 Questions

4.6 Suggested Reading

4.1 Objectives

The objective of the lesson is to make you understand:

1. The meaning and types of index number,

2. Use and application of index number,

3. The concept of cost of living index number and wholesale price index number,

4. Meaning and concept of Probability, and

5. Different events and computation of probabilities in different conditions.

4.2 Index Number Meaning:

The technique of index number was developed by Bishop Fleetwood in 1707. But the actual
credit goes to G.R. Carli, who applied the device to study the behaviour of change in price level
in the year 1750. It is a statistical technique developed for measuring related variables, and it
provides representative figures as regards the combined changes in the variables and their
effects.

It is a specialized average through which changes in data related to time or space are presented
in relative or comparative form.

o Meaning

“Index number is a single ratio which measures the combined change of several variables
between two different times, places or situations”.

“Index numbers are devices for measuring differences in the magnitude of a group of
related variables” Croxton and Cowden.

Conclusively, it can be said that index number is a specialized average expressed in terms
of percentage, which measures the general trend of relative changes in data on the basis of
different times, places or other variables.

o Characteristics of Index Numbers

The main characteristics of index numbers are as follows:-

1. Relative Measurement – The basic characteristic of an index number is that it is a relative
or comparative measurement, because changes can be measured only in relative or
comparative terms. For example, if it is said that the price index number of the year 2003 in
comparison with 2000 is 160, it means that prices have increased by 60% during this
period.

2. Specialized Average- An index number is a specialized average. A simple average requires
that all figures be in the same units. If certain figures are in kilograms, some in litres
and some in metres, a simple average cannot be calculated, but this limitation does not apply
to index numbers. An index number is expressed in terms of percentage, though generally
the sign of percentage is not used with it.

3. Measurement of Changes not capable of Direct Measurement- Generally, the
technique of index number is used in measuring such mixed and complicated changes
which cannot be measured directly, such as price level, cost of living, changes in
economic activities, etc.

4. Measurement of Common Characteristics of a Group of Items- An index number
expresses the common characteristics of a group of items. For example, if the price index is
increasing, it does not mean that the price of every commodity is increasing. It is also
possible that prices of some items are increasing and of some other items are
decreasing, but the general trend is of increasing prices.

5. Comparison on the basis of Time or Place- It measures the relative changes either on
the basis of time or on the basis of place. For example, in a cost of living index number, the
cost of living of a certain group of persons for two different years, or the cost of living of two
different groups for the same year, may be compared.

6. Universal Use- The technique of index numbers was developed for the measurement of
changes in prices, but today it measures changes in production, trade, economic activities,
productivity, etc.

o Utility or Importance of Index Numbers

The index number has become an important and useful tool for measurement and analysis of
changes in economic and business activities. The uses and importance of index numbers can be
studied under the following heads:-

(I) General Importance-

(1) Simplification of Complicated Facts- For example, the change in business activities in a
country is a complicated fact, which includes changes in various segments such as
industry, business, banking, transport, etc. But it can easily be simplified by an index
number.

(2) Comparative Study- With the help of index numbers, units of different nature are
converted into a meaningful and simple numerical value, which makes comparison easy.

(3) Measurement of Trends- The trends in different economic phenomena such as industrial
production, foreign trade, national income, balance of payments, etc. can easily be
measured and understood on the basis of index numbers. Future trends can also be
estimated with their help.

(4) Helpful in Policy Formulation- Index numbers are useful to individuals, business
enterprises, government and its organizations in the determination of policies and
formulation of plans.

(II) Specific Importance-

(1) Measurement of purchasing power of Money- Index numbers are useful in obtaining the
intrinsic worth of money as compared with its nominal worth. For example, if the price
index number of the current year is 400, it means that the purchasing power of a rupee is only
25 paise as compared to that of the base year.

(2) Basis for determination of Salary, D.A. etc.- Cost of living index numbers are helpful
in determination of salary, dearness allowance, etc., of employees.

(3) Useful to Business Community- According to M. M. Blair, “Index numbers are the signs
and guide-posts along the business highway that indicate to the businessman how he
should drive or manage his affairs”. In fact, a businessman uses various indices for making
decisions relating to quantity of production, selling price, rate of profit, storage of goods,
etc.

(4) Useful to Politicians and Government- Index numbers are very useful to politicians also,
because with the help of index numbers:- (a) the economic condition of the country can be
analyzed and (b) economic policies of the government can be evaluated critically and
constructively.

(5) Deflating the Prices- Index numbers are highly useful in deflating, i.e., in knowing the
real value of nominal data. For example, index numbers are used for converting
national income from current prices to the prices of some base year or for changing money
income into real income.

(6) Useful in other Areas- On the basis of index numbers investors prepare plans for
investing their wealth, speculators determine the direction of trade, banks determine the
rate of interest, insurance companies take decisions about the rate of premium, and social
reformers evaluate the progress and changes in society.
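The purchasing-power and deflating uses described above amount to dividing by the price index. The sketch below is illustrative: the function names and the income figure of 50,000 at an index of 160 are hypothetical, while the index of 400 comes from the example in point (1).

```python
def purchasing_power(price_index, base=100):
    """Value of one unit of money relative to the base year."""
    return base / price_index

def deflate(nominal_value, price_index, base=100):
    """Convert a nominal (current-price) value into base-year prices."""
    return nominal_value * base / price_index

# Price index of 400: one rupee buys what 0.25 rupee (25 paise) bought in the base year
print(purchasing_power(400))  # 0.25

# Hypothetical money income of 50,000 when the price index is 160
print(deflate(50000, 160))    # 31250.0
```

The same division converts national income at current prices into national income at base-year prices.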

o Limitations of Index Numbers

(1) Based on Samples-Index numbers are generally based on sample and the accuracy of the
result will depend on the proper size and method of sampling. If the items included in the
sample do not represent the universe properly, index number cannot present the true
position of the problem under study.

(2) Indicator of Average or Approximate Trend-Index numbers indicate only average or


approximate trend of changes. Hence, they should be interpreted keeping this limitation
into consideration.

147
(3) Limitations of Construction-There may be confusion on account of errors or lack of
precaution in the construction of index numbers. Such errors may arise in the selection of
base year, determination of weights, use of average and formula, etc.

(4) Impact of Specific Objectives-The index number calculated for one objective may not
serve another objective. For example, wholesale price index numbers cannot be used in
place of cost of living index numbers.

(5) Ignorance of Change in Qualitative Facts-If there are changes in qualitative facts,
index numbers may not express them properly.

 Types of Index Numbers

Index numbers are classified on the basis of the phenomena whose changes they measure. In
the economic and business sphere, index numbers can be classified into the following five types:

(1) Price Index Numbers-Price index numbers measure the changes in prices or in the
price level. They are further divided into two sub-parts:- (a) Wholesale Price Index
Numbers and (b) Retail or Cost of Living Price Index Numbers.

(2) Index Numbers of Physical Quantities or Quantity Index Numbers-These index
numbers are prepared to measure the increase or decrease in physical quantities
(produced, consumed or traded), such as index numbers of agricultural production,
industrial production, etc.

(3) Total Value Index Numbers-These are intended to study the change in the total value
(Quantity x Price) in the current year in comparison to the base year, e.g., index numbers of sales.

(4) On the basis of number of Commodities- If the index number is constructed on the
basis of the price of one commodity, it is known as a 'Simple Index Number'. On the other
hand, if it is constructed for a group of commodities, it is known as a 'composite or
aggregate' index number.

(5) Special Purpose Index Numbers-Index numbers may be prepared for specific purposes
also such as index numbers of national income, growth rate, productivity, etc.

o Methods of constructing Price and Quantity Index Numbers


The various methods of constructing index numbers can be explained by the following
chart:

Methods of Constructing Price Index Numbers

(I) Unweighted
    (1) Simple Aggregative Method
    (2) Simple Average of Price Relatives Method
(II) Weighted
    (1) Weighted Aggregative Method
    (2) Weighted Average of Price-Relatives Method

(I) Unweighted Index Numbers-
In unweighted index numbers no weight is assigned to various items and it is assumed
that all the items have equal weight or equal relative importance. These index numbers
can be further divided in two categories on the basis of construction techniques:
(1) Simple Aggregative Method- In this method the prices of all commodities in the base year
and in the current year are added separately, and the totals are denoted as Σp0 and Σp1
respectively. The total of the current year (Σp1) is divided by the total of the base year
(Σp0) and the quotient is multiplied by 100. Symbolically:-

$$P_{01} = \frac{\sum p_1}{\sum p_0} \times 100$$

Evidently, this method expresses the aggregate price of all commodities in the current
year as a percentage of the aggregate price in the base year.
(2) Simple Average of Price Relatives Method- In this method, first of all the price relatives
(P.R.) of the various items included in the index are obtained. For this purpose the current
year's price of each commodity is divided by the price of the base year and the quotient is
multiplied by 100, i.e. P.R. = (p1 / p0) x 100. The index number is obtained by dividing the
total of price relatives by the number of commodities (N). Symbolically-

$$P_{01} = \frac{\sum \left( \frac{p_1}{p_0} \times 100 \right)}{N} \quad \text{or} \quad P_{01} = \frac{\sum \text{P.R.}}{N}$$
Example 10: From the following data, construct simple index numbers for 2004 taking 2000 as
base year by (a) simple aggregative method, and (b) simple average of price relatives method:

Items         A     B     C     D     E
Price 2000    12    25    10    5     6
Price 2004    15    20    12    10    15

Solution:

Items    Price 2000 (p0)    Price 2004 (p1)    Calculation      P.R.
A        12                 15                 15/12 x 100      125
B        25                 20                 20/25 x 100      80
C        10                 12                 12/10 x 100      120
D        5                  10                 10/5 x 100       200
E        6                  15                 15/6 x 100       250
Total    Σp0 = 58           Σp1 = 72                            ΣP.R. = 775

(a) Simple Aggregative method:

$$P_{01} = \frac{\sum p_1}{\sum p_0} \times 100 = \frac{72}{58} \times 100 = 124.14$$

(b) Simple average of price relatives method:

$$P_{01} = \frac{\sum \text{P.R.}}{N} = \frac{775}{5} = 155$$
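
The two unweighted methods of Example 10 can be checked with a short Python sketch. This is only an illustration; the function names `simple_aggregative` and `simple_average_of_relatives` are our own and do not come from the text.

```python
# Prices from Example 10 (base year 2000, current year 2004).
p0 = [12, 25, 10, 5, 6]
p1 = [15, 20, 12, 10, 15]


def simple_aggregative(p0, p1):
    """Total of current-year prices as a percentage of total base-year prices."""
    return sum(p1) / sum(p0) * 100


def simple_average_of_relatives(p0, p1):
    """Arithmetic mean of the individual price relatives (p1 / p0 x 100)."""
    relatives = [cur / base * 100 for base, cur in zip(p0, p1)]
    return sum(relatives) / len(relatives)


print(round(simple_aggregative(p0, p1), 2))           # 124.14
print(round(simple_average_of_relatives(p0, p1), 2))  # 155.0
```

The two methods give quite different answers here because the simple average of relatives treats the large percentage rises in the cheap items D and E as equally important as the changes in the costlier items.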
(II) Weighted Index Numbers-
Weighted Index numbers are those index numbers in which comparative or relative
importance is assigned to different commodities and on account of this adjustment these index
numbers are considered more rational and logical. From the view of construction process, these
index numbers may also be divided into following two types:
(1) Weighted Aggregative Method- In this method appropriate weights are assigned to the
various commodities to reflect their relative importance. If these weights are based on the
actual quantity (of consumption or production) the symbol 'q' is used for them, and if the
weights are assigned on estimates, the symbol 'w' is appropriate. The process of construction
of this index number is as follows:-
(i) First of all, the prices of the current year (p1) are multiplied by the quantity weights (q0)
of the base year and their total (Σp1q0) is obtained.
(ii) The prices of the base year (p0) are multiplied by the quantity weights (q0) of the base
year and they are also totaled (Σp0q0). The formula for the construction of the index number is:-

$$P_{01} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100 \quad \text{or} \quad P_{01} = \frac{\sum p_1 w}{\sum p_0 w} \times 100$$

Example 11: From the following data prepare a price index number for 2004 based upon 2000
by weighted aggregative expenditure method:

Items    q0    p0    p1
A        10    13    25
B        25    27    35
C        5     38    50
D        40    3     6
E        35    25    25

Solution:

Items    q0    p0    p1    p0q0            p1q0
A        10    13    25    130             250
B        25    27    35    675             875
C        5     38    50    190             250
D        40    3     6     120             240
E        35    25    25    875             875
Total                      Σp0q0 = 1990    Σp1q0 = 2490

$$P_{01} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100 = \frac{2490}{1990} \times 100 = 125.13$$
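
As a hedged illustration, the weighted aggregative calculation of Example 11 can be sketched in Python. The helper name `weighted_aggregative` is an assumption made for this sketch only.

```python
# Data from Example 11: base-year quantities and prices for 2000 and 2004.
q0 = [10, 25, 5, 40, 35]
p0 = [13, 27, 38, 3, 25]
p1 = [25, 35, 50, 6, 25]


def weighted_aggregative(p0, p1, weights):
    """Sum(p1 * w) as a percentage of Sum(p0 * w), with w fixed for both years."""
    numerator = sum(p * w for p, w in zip(p1, weights))
    denominator = sum(p * w for p, w in zip(p0, weights))
    return numerator / denominator * 100


print(round(weighted_aggregative(p0, p1, q0), 2))  # 125.13
```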

(2) Weighted Average of Price Relatives Method- In this method the price relatives (P.R.) for
the current year are calculated on the basis of the prices of the base year, i.e.
P.R. = (p1 / p0) x 100. These price relatives are multiplied by the respective weights of the
items (P.R. x W = WPR). These products are added up (ΣWPR) and divided by the sum of the
weights (ΣW). Symbolically-

$$\text{Weighted Index No.} = \frac{\sum WPR}{\sum W}$$

Note : 1. If weights are clearly given in the question, they are used. But if quantities (q0) of the
base year are given, the value weights are computed by multiplying the quantity (q0) by the price
(p0) of each commodity in the base year.
2. Some authors symbolize the value weight by 'V' in place of 'W'.
Example 12: Compute the weighted index number from the following group index numbers
and weights:

Group            Index No.    Weight
Food             210          40
Fuel             350          15
Clothing         125          30
Rent             500          25
Entertainment    150          10

Solution:

Group            Index No.    Weight      WPR
Food             210          40          8400
Fuel             350          15          5250
Clothing         125          30          3750
Rent             500          25          12500
Entertainment    150          10          1500
Total                         ΣW = 120    ΣWPR = 31400

$$\text{Weighted Index No.} = \frac{\sum WPR}{\sum W} = \frac{31400}{120} = 261.67$$
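
The weighted average of relatives in Example 12 can be verified with the following minimal Python sketch. The dictionary layout and the function name are illustrative assumptions, not part of the text.

```python
# Group index numbers (price relatives) and weights from Example 12.
groups = {
    "Food": (210, 40),
    "Fuel": (350, 15),
    "Clothing": (125, 30),
    "Rent": (500, 25),
    "Entertainment": (150, 10),
}


def weighted_average_of_relatives(groups):
    """Sum(W x P.R.) divided by Sum(W)."""
    total_wpr = sum(index * weight for index, weight in groups.values())
    total_w = sum(weight for _, weight in groups.values())
    return total_wpr / total_w


print(round(weighted_average_of_relatives(groups), 2))  # 261.67
```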

Quantity Index Numbers

The quantity index numbers express the comparative position of changes in the physical
quantity of goods produced, sold or consumed. The process of construction of these index
numbers is similar to that of price index numbers. However, the symbol 'q' is used in place of 'p'.
Quantity index numbers can also be simple or weighted. Weights in these index numbers are
given in terms of prices. The formulae of quantity index numbers are as follows:-

(A) Unweighted Quantity Index Numbers-

(1) Simple Aggregative Method:

$$Q_{01} = \frac{\sum q_1}{\sum q_0} \times 100$$

(2) Simple Average of Relatives Method:

$$Q_{01} = \frac{\sum \left( \frac{q_1}{q_0} \times 100 \right)}{N}$$

(B) Weighted Quantity Index Numbers-

(i) Weighted Aggregative Index Numbers:

(1) Laspeyre's Method:

$$Q_{01} = \frac{\sum q_1 p_0}{\sum q_0 p_0} \times 100$$

(2) Paasche's Method:

$$Q_{01} = \frac{\sum q_1 p_1}{\sum q_0 p_1} \times 100$$

(3) Dorbish and Bowley's Method:

$$Q_{01} = \frac{1}{2} \left[ \frac{\sum q_1 p_0}{\sum q_0 p_0} + \frac{\sum q_1 p_1}{\sum q_0 p_1} \right] \times 100$$

(4) Marshall-Edgeworth's Method:

$$Q_{01} = \frac{\sum q_1 p_0 + \sum q_1 p_1}{\sum q_0 p_0 + \sum q_0 p_1} \times 100$$

(5) Fisher's Method:

$$Q_{01} = \sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}} \times 100$$

(ii) Weighted Average of Relatives:

$$Q_{01} = \frac{\sum \left[ \left( \frac{q_1}{q_0} \times 100 \right) \times W \right]}{\sum W}, \quad \text{where } W = q_0 p_0$$
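
The weighted aggregative quantity formulae can be compared side by side in a small Python sketch. The text gives no numerical example for them, so the prices and quantities below are hypothetical figures chosen for illustration, and the helper `total` is our own.

```python
from math import sqrt

# Hypothetical data for three commodities (NOT taken from the text).
p0 = [4, 6, 5]    # base-year prices
q0 = [10, 8, 6]   # base-year quantities
p1 = [5, 9, 6]    # current-year prices
q1 = [12, 6, 9]   # current-year quantities


def total(a, b):
    """Sum of products, e.g. total(q1, p0) is Sum(q1 * p0)."""
    return sum(x * y for x, y in zip(a, b))


laspeyres = total(q1, p0) / total(q0, p0) * 100   # base-year prices as weights
paasche = total(q1, p1) / total(q0, p1) * 100     # current-year prices as weights
dorbish_bowley = (laspeyres + paasche) / 2        # arithmetic mean of the two
marshall_edgeworth = (
    (total(q1, p0) + total(q1, p1)) / (total(q0, p0) + total(q0, p1)) * 100
)
fisher = sqrt(laspeyres * paasche)                # geometric mean of the two

for name, value in [
    ("Laspeyres", laspeyres),
    ("Paasche", paasche),
    ("Dorbish-Bowley", dorbish_bowley),
    ("Marshall-Edgeworth", marshall_edgeworth),
    ("Fisher", fisher),
]:
    print(f"{name}: {value:.2f}")
```

Fisher's index is the geometric mean of the Laspeyres and Paasche indices, so it always lies between the two.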

o Problems Relating to Methods of Base Year


Generally, in price index numbers changes in various years are measured on the basis of a
base year. There are two methods of base year: (I) Fixed-base method and (II) Chain-base
method.
(I) Fixed Base Method:
In this method the base year remains fixed. It means that index numbers of all other years are
prepared on the basis of a particular year or period. The fixed base can also be of two types:
(1) Single year fixed-base and (2) Multi-year average base.
(1) Single year fixed-base – In this method any normal year or period is selected as the base
year. The price of the base year is denoted as p0 and the prices of other years as p1, and the
following formula is used for the calculation of index numbers or price relatives (P.R.):

$$\text{Index Number or Price Relative (P.R.)} = \frac{p_1}{p_0} \times 100$$

Note: Prices of other years may also be denoted as p1, p2, p3, p4, etc., and in such a case the
formulae will be

$$\frac{p_1}{p_0} \times 100, \quad \frac{p_2}{p_0} \times 100, \quad \frac{p_3}{p_0} \times 100, \quad \frac{p_4}{p_0} \times 100 \quad \text{respectively.}$$
Example 1. Prepare index numbers for different years taking 1998 as base year:

Year      1998    1999    2000    2001    2002    2003
Prices    24      28      30      35      40      42

Solution:

Year    Price    Calculation      Index No. (P.R.)
1998    24       -                100
1999    28       28 x 100 / 24    116.67
2000    30       30 x 100 / 24    125
2001    35       35 x 100 / 24    145.83
2002    40       40 x 100 / 24    166.67
2003    42       42 x 100 / 24    175

The index number of the base year is represented by 100. Since the base year is 1998, its
current price and its base year price are the same, i.e. 24; therefore its index number is 100.
(2) Multi-year Average Base-When there is difficulty in selecting a particular year as the base
year, the average price of a few years is taken as the base price and this average price is
expressed as p0.
Example 2: Calculate the price relatives taking the average price from 2001 to 2004 as base from
the following data:

Year      2001    2002    2003    2004    2005    2006
Prices    40      44      50      58      60      65

Solution:

$$p_0 = \frac{\text{Total of prices from 2001 to 2004}}{4} = \frac{40 + 44 + 50 + 58}{4} = 48$$

Year    Price    Calculation      Index No. (P.R.)
2001    40       40 x 100 / 48    83.33
2002    44       44 x 100 / 48    91.67
2003    50       50 x 100 / 48    104.17
2004    58       58 x 100 / 48    120.83
2005    60       60 x 100 / 48    125
2006    65       65 x 100 / 48    135.42

(II) Chain Base Method
In this method the price relative (also known as link relative) for every current year is
calculated on the basis of the price of the immediately preceding year. For example, if index
numbers are to be constructed from 2000 to 2004, then for 2000 the year 1999 will be taken as
base; similarly 2000 will be the base for 2001, 2001 for 2002, and so on.

Example 3: Taking the same data as above, prepare chain base index numbers:
Solution:

Year    Price    Calculation      Index No. (P.R.)
2001    40       -                100
2002    44       44 x 100 / 40    110
2003    50       50 x 100 / 44    113.64
2004    58       58 x 100 / 50    116
2005    60       60 x 100 / 58    103.45
2006    65       65 x 100 / 60    108.33
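
The link relatives of Example 3 can be reproduced with a brief Python sketch. The function name `link_relatives` is an assumption made here for illustration.

```python
# Prices from Example 3 (same data as Example 2).
prices = {2001: 40, 2002: 44, 2003: 50, 2004: 58, 2005: 60, 2006: 65}


def link_relatives(prices):
    """Each year's price as a percentage of the immediately preceding year's price."""
    years = sorted(prices)
    result = {years[0]: 100.0}  # the first year has no predecessor
    for prev, cur in zip(years, years[1:]):
        result[cur] = prices[cur] / prices[prev] * 100
    return result


for year, value in link_relatives(prices).items():
    print(year, round(value, 2))
```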

Example 4: The following table gives the average wholesale price of three commodities for the
years 2003 to 2007. Compute index numbers by the chain base method:

Commodities    2003    2004    2005    2006    2007
A              5       8       10      12      15
B              10      14      18      20      22
C              15      18      20      22      24

Solution:

Com.    2003         2004         2005         2006         2007
        P     LR     P     LR     P     LR     P     LR     P     LR
A       5     100    8     160    10    125    12    120    15    125
B       10    100    14    140    18    128    20    111    22    110
C       15    100    18    120    20    111    22    110    24    109
TLR           300          420          364          341          344
ALR           100          140          121.3        113.6        114.6

P = Price, LR = Link Relative, TLR = Total Link Relative, ALR = Average Link Relative.

Chain Indices Chained to a Common Base:
The average of link relatives establishes the relationship between two adjoining years. In
other words, it indicates the percentage change in the next year in comparison to the previous year.
These link relatives can be chained together. This is done by multiplying the average link
relative of the current year by the chain index of the previous year and dividing the product by
100, as shown in the following formula. The results are known as chain indices chained to a
common base.

$$\text{Chain Index No. for Current Year} = \frac{\text{Chain Index No. of Previous Year} \times \text{Average Link Relative of Current Year}}{100}$$
Example 5: Taking the above data let us calculate the Chain Base Index No. with 2003 as base
year:
Solution:

Com.        2003         2004         2005         2006         2007
            P     LR     P     LR     P     LR     P     LR     P     LR
A           5     100    8     160    10    125    12    120    15    125
B           10    100    14    140    18    128    20    111    22    110
C           15    100    18    120    20    111    22    110    24    109
TLR               300          420          364          341          344
ALR               100          140          121.3        113.6        114.6
CIB 2003          100          140          169.8        192.89       221.05

Calculation of CIB 2003:
2003: 100
2004: 100 x 140 / 100 = 140
2005: 140 x 121.3 / 100 = 169.8
2006: 169.8 x 113.6 / 100 = 192.89
2007: 192.89 x 114.6 / 100 = 221.05

P = Price, LR = Link Relative, TLR = Total Link Relative, ALR = Average Link Relative, CIB
2003 = Chain Indices Based on 2003
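
The chaining step can be sketched in Python as follows. This is only an illustration: the function name `chain_indices` is assumed, the average link relatives are taken from the table as printed, and intermediate values are not rounded, so the last digits may differ slightly from a hand calculation that rounds at each step.

```python
# Average link relatives (ALR) from the table above.
alr = {2003: 100, 2004: 140, 2005: 121.3, 2006: 113.6, 2007: 114.6}


def chain_indices(alr):
    """Chain each year's ALR onto the previous year's chain index."""
    years = sorted(alr)
    chained = {years[0]: 100.0}  # the first year is the common base
    for prev, cur in zip(years, years[1:]):
        chained[cur] = chained[prev] * alr[cur] / 100
    return chained


for year, value in chain_indices(alr).items():
    print(year, round(value, 2))
```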

(III) Base Conversion


Base Conversion may be of two types: - (1) from fixed base to chain base and (2) from
chain base to fixed base.
(1) From Fixed Base to Chain Base- For such conversion the chain index for the first year will
be 100 and for subsequent years the following formula is to be used:-

$$\text{Current Year's Chain Base Index No.} = \frac{\text{Current Year's Fixed Base Index No.}}{\text{Previous Year's Fixed Base Index No.}} \times 100$$

Example 6: From the Fixed Base Index Numbers given below, prepare chain base index
numbers:

Year         2000    2001    2002    2003    2004    2005
Index No.    150     175     200     180     195     220

Solution:

Year    Fixed Base I. No.    Conversion         Chain Base I. No.
2000    150                  -                  100
2001    175                  175 x 100 / 150    116.67
2002    200                  200 x 100 / 175    114.29
2003    180                  180 x 100 / 200    90
2004    195                  195 x 100 / 180    108.33
2005    220                  220 x 100 / 195    112.82
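
A minimal Python sketch of the fixed-base to chain-base conversion in Example 6 is given below; the function name `fixed_to_chain` is our own assumption.

```python
# Fixed base index numbers for 2000 to 2005 from Example 6.
fixed = [150, 175, 200, 180, 195, 220]


def fixed_to_chain(fixed):
    """Each fixed base index as a percentage of the previous year's fixed base index."""
    chain = [100.0]  # the first year is taken as 100
    for prev, cur in zip(fixed, fixed[1:]):
        chain.append(cur / prev * 100)
    return chain


print([round(x, 2) for x in fixed_to_chain(fixed)])
# [100.0, 116.67, 114.29, 90.0, 108.33, 112.82]
```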

(2) From Chain Base to Fixed Base- In this conversion, the fixed base index for the first year
will be taken the same as the chain base index. However, if it is given specifically in the question
that the first year is to be taken as base, the index number for the first year will be 100. For
calculating the indices for other years the following formula will be used:-

$$\text{Current Year's Fixed Base Index No.} = \frac{\text{Current Year's Chain Index No.} \times \text{Previous Year's Fixed Base Index No.}}{100}$$

Example 7: From the Chain base index numbers given below, prepare fixed base index numbers:

Year         2000    2001    2002    2003    2004    2005
Index No.    200     175     225     250     210     180

Solution:

Year    Chain Base I. No.    Conversion              Fixed Base I. No.
2000    200                  -                       200
2001    175                  200 x 175 / 100         350
2002    225                  350 x 225 / 100         787.5
2003    250                  787.5 x 250 / 100       1968.75
2004    210                  1968.75 x 210 / 100     4134.38
2005    180                  4134.38 x 180 / 100     7441.88
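
The reverse conversion of Example 7 can be sketched in the same way. The function name and the `first_is_base` switch are assumptions introduced for this illustration; intermediate values are kept unrounded.

```python
# Chain base index numbers for 2000 to 2005 from Example 7.
chain = [200, 175, 225, 250, 210, 180]


def chain_to_fixed(chain, first_is_base=False):
    """Multiply each chain index by the previous fixed base index and divide by 100."""
    # The first fixed index equals the first chain index unless the
    # question says the first year itself is the base (then it is 100).
    fixed = [100.0 if first_is_base else float(chain[0])]
    for link in chain[1:]:
        fixed.append(fixed[-1] * link / 100)
    return fixed


print(chain_to_fixed(chain))
# [200.0, 350.0, 787.5, 1968.75, 4134.375, 7441.875]
```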

(IV) Base Shifting
The problem of base shifting of an index number arises on account of the following two
reasons-
(1) When the base year has become too old- When the base year has become too old
and is far away from the current year, it becomes unsuitable for meaningful comparison.
Hence, the old base year is shifted to some new one around which the prices are
fluctuating in recent years.
(2) When two or more index numbers which have different base years are to be
compared-Thus, if it is desired to compare prices in India with the prices in U.S.A., both
index numbers must have a common base year. Suppose the base year is 1985 in the case
of U.S.A. and 1990 in the case of India. Then, either the base year of U.S.A. will be
shifted to 1990 or the base year of India will be shifted to 1985.
There are two methods of base shifting –
(1) Direct or Reconstruction Method-In this method the prices of the new base year are
taken as 100, the prices of all other years are converted into price relatives, and all the
index numbers are constructed afresh.
(2) Indirect or Short Method-In this method the index number of the new base year is
taken as 100 and all other old index numbers are changed on the basis of the following
formula: -

$$\text{New Base Index No.} = \frac{\text{Old Index No. of Current Year}}{\text{Old Index No. of New Base Year}} \times 100$$

The second method is very easy and convenient. But it gives correct results only when
the simple geometric mean has been used in the construction of the index number. If the arithmetic
mean has been used in the construction of the index number, then the second method can give only
an approximate idea of the correct index number. However, in general practice, the second
method is used more widely.
Example 8: Shift the base year from 1999 to 2002 in the following series of index numbers:

Year      1999    2000    2001    2002    2003    2004    2005
I. No.    100     115     130     135     150     160     165

Solution:

Year    I. No.    Base Shifting      I. No. (2002 = 100)
1999    100       100 x 100 / 135    74.07
2000    115       115 x 100 / 135    85.19
2001    130       130 x 100 / 135    96.30
2002    135       135 x 100 / 135    100.00
2003    150       150 x 100 / 135    111.11
2004    160       160 x 100 / 135    118.52
2005    165       165 x 100 / 135    122.22
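
The indirect (short) method of base shifting used in Example 8 amounts to one division per year, as the following Python sketch shows. The function name `shift_base` is an illustrative assumption.

```python
# Index numbers with base 1999 = 100, from Example 8.
series = {1999: 100, 2000: 115, 2001: 130, 2002: 135, 2003: 150, 2004: 160, 2005: 165}


def shift_base(series, new_base_year):
    """Express every index as a percentage of the old index of the new base year."""
    base_value = series[new_base_year]
    return {year: value / base_value * 100 for year, value in series.items()}


shifted = shift_base(series, 2002)
for year, value in shifted.items():
    print(year, round(value, 2))
```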

(V) Splicing-
By splicing of index numbers we mean combining two or more series of overlapping index
numbers to obtain a single series of index numbers on a common base. It is generally done when
a series of old index numbers with an old base is discontinued and a new series of index numbers
is being started with a new base.
There may be two types of splicing:-
(1) Splicing of new index numbers to old index numbers-In this type of splicing the old
series of index numbers is kept as it is, but the new series is converted to splice it with the
old series. It is also called forward splicing and for this purpose the following formula is
used:-

$$\text{Spliced Index No.} = \frac{\text{Index No. of Current Year} \times \text{Old Index No. of New Base Year}}{100}$$

(2) Splicing of old index numbers to new index numbers-It is called backward splicing. In
this type, the new series is kept as it is but the old series is changed on the basis of the
following formula to splice it with the new series:

$$\text{Spliced Index No.} = \frac{\text{Index No. of Current Year}}{\text{Old Index No. of New Base Year}} \times 100$$
Example 9: Splice the following two index number series, continuing series A forward and
series B backward:

Year        1996    1997    1998    1999    2000    2001
Series A    100     120     150
Series B                    100     110     120     150

Solution:

Year    Series A    Series B    F.S                S.S     B.S                S.S
1996    100                                        100     100 x 100 / 150    66.67
1997    120                                        120     120 x 100 / 150    80
1998    150         100         100 x 150 / 100    150     150 x 100 / 150    100
1999                110         110 x 150 / 100    165                        110
2000                120         120 x 150 / 100    180                        120
2001                150         150 x 150 / 100    225                        150

S.S = Spliced Series, F.S = Forward Splicing, B.S = Backward Splicing
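
Both kinds of splicing in Example 9 can be sketched in Python. The function names and the dictionary representation of the two series are assumptions made for this illustration.

```python
# Series A (old base) and Series B (new base, 1998 = 100) from Example 9.
series_a = {1996: 100, 1997: 120, 1998: 150}
series_b = {1998: 100, 1999: 110, 2000: 120, 2001: 150}
OVERLAP_YEAR = 1998  # the year present in both series


def splice_forward(old, new, overlap_year):
    """Keep the old series; rescale the new series onto the old base."""
    spliced = dict(old)
    for year, value in new.items():
        spliced[year] = value * old[overlap_year] / 100
    return spliced


def splice_backward(old, new, overlap_year):
    """Keep the new series; rescale the old series onto the new base."""
    spliced = {year: value * 100 / old[overlap_year] for year, value in old.items()}
    spliced.update(new)
    return spliced


forward = splice_forward(series_a, series_b, OVERLAP_YEAR)
backward = splice_backward(series_a, series_b, OVERLAP_YEAR)
print({y: round(v, 2) for y, v in forward.items()})
print({y: round(v, 2) for y, v in backward.items()})
```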

o Problems in constructing index numbers.

Before constructing index numbers, careful thought must be given to the following problems:

1. The purpose of the index.

At the very outset the purpose of constructing the index must be clearly decided: what the
index is to measure and why. There is no all-purpose index. Every index is of limited and
particular use. Thus, a price index that is intended to measure consumers' prices must not
include wholesale prices. And if such an index is intended to measure the cost of living of
poor families, great care should be taken not to include goods ordinarily used by middle class
and upper-income groups.

2. Selection of a base period.

Whenever index numbers are constructed a reference is made to some base period. The base
period of an index number (also called the reference period) is the period against which
comparisons are made.

(I) THE BASE PERIOD SHOULD BE A NORMAL ONE.

The period that is selected as base should be normal, i.e., it should be free from abnormalities
like wars, earthquakes, famines, booms, depressions, etc. However, at times it is really difficult
to select a year which is normal in all respects: a year which is normal in one respect may be
abnormal in another.

(II) THE BASE PERIOD SHOULD NOT BE TOO DISTANT IN THE PAST.

It is desirable to have an index based on a fairly recent period, since comparison with a familiar
set of circumstances is more helpful than comparison with vaguely remembered conditions.

(III) FIXED BASE OR CHAIN BASE.

While selecting the base a decision has to be made as to whether the base shall remain fixed or
not, i.e., whether we have a fixed base or chain base index.

3. Selection of number of items.

The items included in an index should be determined by the purpose for which the index is
constructed. Every item cannot be included while constructing an index number and hence one
has to select a sample. It is also necessary to decide the grade or quality of the items to be
included in the index. Index numbers will give wrong results if one set of qualities is included
at one time and another set at another time.

4. Price quotations.

After the commodities have been selected, the next problem is to obtain price quotations for
these commodities. It is a well known fact that prices of many commodities vary from place to
place and even from shop to shop in the same market. It is impracticable to obtain price
quotations from all the places where a commodity is dealt in. A selection must be made of
representative places and persons. These places should be those which are well known for
trading in that particular commodity.

5. Choice of an Average.

Theoretically speaking, the geometric mean is the best average in the construction of index
numbers because of the following reasons: (i) in the construction of index numbers we are
concerned with ratios of relative changes, and the geometric mean gives equal weight to equal
ratios of change; (ii) the geometric mean is less susceptible to major variations as a result of
violent fluctuations in the values of the individual items; and (iii) index numbers calculated by
using this average are reversible and, therefore, base shifting is easily possible. The geometric
mean index always satisfies the time reversal test.

6. Selection of appropriate weights.

The problem of selecting suitable weights is quite important and at the same time quite difficult
to decide. The term 'weight' refers to relative importance, and hence it is necessary to devise
some suitable method whereby the varying importance of the different items may be taken into
account. This is done by allocating weights. There are two methods of assigning weights:
(i) implicit, and (ii) explicit. In the former case no specific weights are assigned, whereas in
the latter case specific weights are assigned to the various items. It may be pointed out here
that no index is unweighted in the strict sense of the term, as weights implicitly enter the
unweighted indices: we are giving equal importance to all the items and hence the weights are
unity. It is, therefore, necessary to adopt some suitable method of weighting so that arbitrary
and haphazard weights may not affect the results.

7. Selection of an appropriate formula.

A large number of formulae have been devised for constructing the index. The problem very
often is that of selecting the most appropriate formula. The choice of the formula would depend
not only on the purpose of the index but also on the data available.

4.3 THEORY OF PROBABILITY:

In our day to day life, we come across many uncertain events. We wake up in the
morning and check the weather report. The statement could be 'there is a 60% chance of rain
today'. This statement implies that rain is more likely than dry weather.
How probable is an event? We generally infer this by repeated observation of such events and
their long-term patterns. Probability is the branch of mathematics devoted to the study of such
events.
o Meaning and definitions of Probability

Probability is an important and complex field of study. Fortunately, only a few basic
issues in probability theory are essential for understanding statistics at the level covered in this
book. These basic issues are covered in this chapter.

A probability provides a quantitative description of the likely occurrence of a particular
event. Probability is conventionally expressed on a scale from 0 to 1; a rare event has a
probability close to 0, a very common event has a probability close to 1. A few definitions of
probability are as follows:

According to Laplace, “Probability is the ratio of the number of favorable events to the total
number of equally likely events.”

According to Connor, “Probability is an attitude of mind towards uncertain events.” “It is
the degree of likelihood that something will happen”. Probabilities are expressed as fractions
(1⁄2, 1⁄4, 3⁄4), as decimals (.5, .25, .75), or as percentages (50%, 25%, 75%), and always lie
between 0 and 1. For example, a probability of 0 means that something can never happen and a
probability of 1 means that something will always happen.

Modern definition: The modern definition starts with a set called the sample space,
which relates to the set of all possible outcomes in the classical sense, denoted by Ω. It is then
assumed that for each element x in Ω, an intrinsic "probability" value f(x) is attached, which
satisfies the following properties:

$$0 \le f(x) \le 1 \ \text{for all } x \in \Omega, \qquad \sum_{x \in \Omega} f(x) = 1$$

That is, the probability function f(x) lies between zero and one for every value of x in the
sample space Ω, and the sum of f(x) over all values x in the sample space Ω is exactly equal to 1.
An event is defined as any subset E of the sample space Ω. The probability of the event E is
defined as

$$P(E) = \sum_{x \in E} f(x)$$

So, the probability of the entire sample space is 1, and the probability of the null event is 0. It
can be expressed as:

$$P(\Omega) = 1, \qquad P(\varnothing) = 0$$

For example, the probability of drawing a spade from a pack of 52 well-shuffled playing
cards is 13/52 = 1/4 = 0.25 since

Event E = 'a spade is drawn'; the number of outcomes corresponding to E = 13 (spades);
the total number of outcomes = 52 (cards).

When tossing a coin, we assume that the results 'heads' or 'tails' each have equal
probabilities of 0.5.

Here is a more complex example. You throw 2 dice. What is the probability that the sum
of the two dice will be 6? To solve this problem, list all the possible outcomes. There are 36 of
them since each die can come up one of six ways. The 36 possibilities are shown below.

Die 1   Die 2   Total      Die 1   Die 2   Total      Die 1   Die 2   Total
1       1       2          3       1       4          5       1       6
1       2       3          3       2       5          5       2       7
1       3       4          3       3       6          5       3       8
1       4       5          3       4       7          5       4       9
1       5       6          3       5       8          5       5       10
1       6       7          3       6       9          5       6       11
2       1       3          4       1       5          6       1       7
2       2       4          4       2       6          6       2       8
2       3       5          4       3       7          6       3       9
2       4       6          4       4       8          6       4       10
2       5       7          4       5       9          6       5       11
2       6       8          4       6       10         6       6       12

You can see that 5 of the 36 possibilities total 6. Therefore, the probability is 5/36.

If you know the probability of an event occurring, it is easy to compute the probability
that the event does not occur. If P(A) is the probability of Event A, then 1 - P(A) is the
probability that the event does not occur. For the last example, the probability that the total is 6 is
5/36. Therefore, the probability that the total is not 6 is 1 - 5/36 = 31/36.
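
The counting argument for the two dice can be checked by enumerating the 36 outcomes in Python. This is a small sketch using the standard `itertools` and `fractions` modules; the variable names are our own.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of throwing two dice.
outcomes = list(product(range(1, 7), repeat=2))

# Outcomes whose total is 6: (1,5), (2,4), (3,3), (4,2), (5,1).
favourable = [pair for pair in outcomes if sum(pair) == 6]

p_six = Fraction(len(favourable), len(outcomes))
p_not_six = 1 - p_six  # complement rule: P(not A) = 1 - P(A)

print(p_six, p_not_six)  # 5/36 31/36
```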
o Types of Probability

Three types of probabilities are discussed below:

1. Marginal Probability: A marginal probability is usually calculated by dividing some subtotal
by the whole. For example, the probability of a person wearing glasses is calculated by dividing
the number of people wearing glasses by the total number of people. Marginal probability is
denoted P(X), where X is some event.

2. Union Probability: A union probability is denoted by P(X or Y), where X and Y are two
events. P(X or Y) is the probability that X will occur or that Y will occur or that both X and Y
will occur. The probability of a person wearing glasses or having blond hair is an example of
union probability. All people wearing glasses are included in the union, along with all blondes
and all blond people who wear glasses.

3. Joint Probability: A joint probability is denoted by P(X and Y). To become eligible for the
joint probability, both events X and Y must occur. The probability that a person is blond
and wears glasses is an example of joint probability.

o Basic concept of Probability

Basic concepts of Probability: Before proceeding further, some basic concepts from
probability theory should be reviewed.

Outcome: An outcome is the result of an experiment or other situation involving
uncertainty. The set of all possible outcomes of a probability experiment is called a sample
space.

Unions & Intersections: An element qualifies for the union of X and Y if it is in either X or
Y or in both X and Y. For example, if X = {2, 8, 14, 18} and Y = {4, 6, 8, 10, 12}, then the union
X ∪ Y = {2, 4, 6, 8, 10, 12, 14, 18}. The key word indicating the union of two or more events is
'or'.
An element qualifies for the intersection of X and Y if it is in both X and Y. For example, if
X = {2, 8, 14, 18} and Y = {4, 6, 8, 10, 12}, then the intersection X ∩ Y = {8}. The key word
indicating the intersection of two or more events is 'and'.

Sample Space: The sample space is an exhaustive list of all the possible outcomes of an
experiment. Each possible result of such an experiment is represented by one and only one point
in the sample space, which is usually denoted by S. For example:

Experiment: rolling a die once. Sample space S = {1, 2, 3, 4, 5, 6}

Experiment: tossing a coin. Sample space S = {Heads, Tails}

Experiment: measuring the height (in cm) of a girl on her first day at school.
Sample space S = the set of all positive real numbers

Experiment (E): An experiment is any well-defined action that may result in a number
of outcomes. For example, the rolling of dice can be considered an experiment. An experiment is
a process of observation that leads to a single outcome that cannot be predicted with certainty.
For example:

1. Toss a coin.

2. Pull a card from a deck.

3. Select an account for auditing.

Consider an experiment that consists of the rolling of a six-sided die. The numbers on
each side of the die are the possible outcomes. Accordingly, the sample space is
S = {1, 2, 3, 4, 5, 6}.

Let A be the event of rolling a 3, 4 or 6 (A = {3, 4, 6}) and let B be the event of rolling a
2, 3 or 5 (B = {2, 3, 5}).

1. The union of A and B is: A ∪ B = {2, 3, 4, 5, 6}

2. The intersection of A and B is: A ∩ B = {3}

3. The complement of A is: A' = {1, 2, 5}

Suppose we toss a coin three times. This experiment has three steps: the first toss, the
second toss and the third toss. Each step has two outcomes: a head and a tail. Thus,

Total outcomes for three tosses of a coin = 2 x 2 x 2 = 8

The eight outcomes for this experiment are

HHH, HHT, HTH, HTT, THH, THT, TTH, and TTT
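
The set operations and the list of coin-toss outcomes above map directly onto Python's built-in `set` type and `itertools.product`. The following is a brief sketch for illustration only.

```python
from itertools import product

# Rolling a six-sided die once.
S = {1, 2, 3, 4, 5, 6}   # sample space
A = {3, 4, 6}            # event: rolling a 3, 4 or 6
B = {2, 3, 5}            # event: rolling a 2, 3 or 5

print(sorted(A | B))  # union:        [2, 3, 4, 5, 6]
print(sorted(A & B))  # intersection: [3]
print(sorted(S - A))  # complement:   [1, 2, 5]

# Tossing a coin three times: 2 x 2 x 2 = 8 outcomes.
tosses = ["".join(t) for t in product("HT", repeat=3)]
print(len(tosses), tosses)
```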

Event: An event is any collection of outcomes of an experiment. Formally, any subset of
the sample space is an event. Any event which consists of a single outcome in the sample space
is called an elementary or simple event. Events which consist of more than one outcome are
called compound events. Set theory is used to represent relationships among events.
An event is denoted by E; it is any subset of S and refers to the outcomes of S that we are
interested in. For example:

E = Head

E = Spades

E = Account in error

In general, if A and B are two events in the sample space S, then

A ∪ B (A union B) = 'either A or B occurs or both occur'

A ∩ B (A intersection B) = 'both A and B occur'

A ⊆ B (A is a subset of B) = 'if A occurs, so does B'

A' or Ā = 'event A does not occur'

∅ (the empty set) = an impossible event

S (the sample space) = an event that is certain to occur

For example, experiment: rolling a die once -

Sample space S = {1,2,3,4,5,6}

Events A = 'score < 4' = {1,2,3}

B = 'score is even' = {2,4,6}

C = 'score is 7' = ∅

A ∪ B = 'the score is < 4 or even or both' = {1,2,3,4,6}

A ∩ B = 'the score is < 4 and even' = {2}

A' or Ā = 'event A does not occur' = {4,5,6}

There are different types of events in the theory of probability, if you want to understand
the theory of probability then the understanding these events are must. These events are
discussed one by one with the example:

Mutually Exclusive Events: Two events A and B are said to be mutually exclusive if it
is impossible for them to occur simultaneously (A B = C). In such cases, the expression for
the union of these two events reduces to the following, since the probability of the intersection of
these events is defined as zero.

167
Two events are mutually exclusive (or disjoint) if it is impossible for them to occur
together. Formally, two events A and B are mutually exclusive if and only if

If two events are mutually exclusive, they cannot be independent and vice versa. In case of
mutually exclusive event, the happening of event will be only one and question of “or” will be
there.

Examples: Experiment rolling a die once

Sample space S = {1,2,3,4,5,6}

Events A = 'observe an odd number' = {1,3,5}

B = 'observe an even number' = {2,4,6}

A ∩ B = ∅ (the empty set), so A and B are mutually exclusive.

Those events that cannot happen together are called mutually exclusive events. For
example, in the toss of a single coin, the events of heads and tails are mutually exclusive. The
probability of two mutually exclusive events occurring at the same time is zero. See the
following figure:

Working rule of addition theorem of probability:

(i) A ∪ B denotes the event of occurrence of at least one of the events 'A' or 'B'.

(ii) A ∩ B denotes the event of occurrence of both the events 'A' and 'B'.

(iii) P(A ∪ B) or P(A+B) denotes the probability of occurrence of at least one of the events 'A' or
'B'.

(iv) P(A ∩ B) or P(AB) denotes the probability of occurrence of both the events 'A' and 'B'.

Ex-1: The probability that a contractor will get a contract is 2/3 and the probability that
he will get another contract is 5/9. If the probability of getting at least one contract is 4/5, what
is the probability that he will get both the contracts?

Sol.: Here P(A) = 2/3, P(B) = 5/9

P(A ∪ B) = 4/5, P(A ∩ B) = ?

By addition theorem of Probability:

P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

4/5 = 2/3 + 5/9 - P(A ∩ B)

or 4/5 = 11/9 - P(A ∩ B)

or P(A ∩ B) = 11/9 - 4/5 = (55-36) / 45

P(A ∩ B) = 19/45
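The arithmetic of Ex-1 can be reproduced with exact fractions. This is a small illustrative sketch; the variable names are assumptions made here, not notation from the text.

```python
from fractions import Fraction

p_a = Fraction(2, 3)        # P(A): first contract
p_b = Fraction(5, 9)        # P(B): second contract
p_a_or_b = Fraction(4, 5)   # P(A or B): at least one contract

# Addition theorem rearranged: P(A and B) = P(A) + P(B) - P(A or B)
p_a_and_b = p_a + p_b - p_a_or_b

print(p_a_and_b)
```

Using `Fraction` avoids rounding, so the result can be compared directly with the hand calculation.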

Ex-2: Two cards are drawn at random from a pack of 52 cards. Find the probability that both
cards are red or both are queens.

Sol.: Let S = Sample space.

A = The event that the two cards drawn are red.

B = The event that the two cards drawn are queens.

A ∩ B = The event that the two cards drawn are red queens.

n(S) = 52C2, n(A) = 26C2, n(B) = 4C2

n(A ∩ B) = 2C2

P(A) = n(A) / n(S) = 26C2 / 52C2 , P(B) = n(B) / n(S) = 4C2 / 52C2

P(A ∩ B) = n(A ∩ B) / n(S) = 2C2 / 52C2

P(A ∪ B) = ?

We have P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

= 26C2 / 52C2 + 4C2 / 52C2 - 2C2 / 52C2

= (26C2 + 4C2 - 2C2) / 52C2

= (13 x 25 + 2 x 3 - 1) / (26 x 51)

= 330 / 1326

P(A ∪ B) = 55/221
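As a cross-check on Ex-2, the same probability can be obtained both from the combination formula and by listing every pair of cards. The sketch below is illustrative only; the card representation and helper names are assumptions made here.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

# Build a 52-card pack as (rank, suit) pairs; hearts and diamonds are red.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(r, s) for r in ranks for s in suits]

def is_red(card):
    return card[1] in ("hearts", "diamonds")

def is_queen(card):
    return card[0] == "Q"

pairs = list(combinations(deck, 2))  # all 52C2 unordered pairs
favourable = sum(
    1 for a, b in pairs
    if (is_red(a) and is_red(b)) or (is_queen(a) and is_queen(b))
)

p_enumerated = Fraction(favourable, len(pairs))
p_formula = Fraction(comb(26, 2) + comb(4, 2) - comb(2, 2), comb(52, 2))

print(p_enumerated, p_formula)
```

Both routes give the same fraction, which illustrates why the pair of red queens must be subtracted once: it is counted in both A and B.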

Ex-3: A bag contains 6 white and 4 red balls. Two balls are drawn at random. What is the
chance that they will be of the same colour?

Sol.: Let S = Sample space

A = the event of drawing 2 white balls.

B = the event of drawing 2 red balls.

A ∪ B = The event of drawing 2 white balls or 2 red balls,

i.e. the event of drawing 2 balls of the same colour.

n(S) = 10C2 = 45

n(A) = 6C2 = 15

n(B) = 4C2 = 6

P(A) = n(A) / n(S) = 15/45 = 1/3

P(B) = n(B) / n(S) = 6/45 = 2/15

Since A and B are mutually exclusive,

P(A ∪ B) = P(A) + P(B)

= 1/3 + 2/15 = (5+2) / 15

P(A ∪ B) = 7/15
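Ex-3 can also be verified by brute force, listing all pairs of balls and counting those of one colour. This is a hedged sketch with illustrative names, not a method prescribed by the text.

```python
from fractions import Fraction
from itertools import combinations

balls = ["white"] * 6 + ["red"] * 4

# Every unordered pair of distinct balls (10C2 pairs), identified by position
pairs = list(combinations(range(len(balls)), 2))
same_colour = sum(1 for i, j in pairs if balls[i] == balls[j])

p_same = Fraction(same_colour, len(pairs))
print(p_same)
```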

Ex-4: For a post three persons 'A', 'B' and 'C' appear in the interview. The
probability of 'A' being selected is twice that of 'B' and the probability of 'B' being selected is
thrice that of 'C'. What are the individual probabilities of A, B, C being selected?

Sol.: Let 'E1', 'E2', 'E3' be the events of selection of A, B, and C respectively.

Let P(E3) = x

P(E2) = 3 P(E3) = 3x

and P(E1) = 2 P(E2) = 2 x 3x = 6x

As there are only 3 candidates, one of A, B or C is certain to be selected, so

P(E1 ∪ E2 ∪ E3) = 1

and E1, E2, E3 are mutually exclusive.

P(E1 ∪ E2 ∪ E3) = P(E1) + P(E2) + P(E3)

1 = 6x + 3x + x

10x = 1 or x = 1/10

P(E3) = 1/10, P(E2) = 3/10 and P(E1) = 6/10 = 3/5

Independent Events: Two events are independent if the occurrence of one of the events
gives us no information about whether or not the other event will occur; that is, the events have
no influence on each other.

Some other examples of independent events are:

• Landing on heads after tossing a coin AND rolling a 5 on a single 6-sided die.

• Choosing a marble from a jar AND landing on heads after tossing a coin.

• Choosing a 3 from a deck of cards, replacing it, AND then choosing an ace as the second card.

• Rolling a 4 on a single 6-sided die, AND then rolling a 1 on a second roll of the die.

To find the probability of two independent events that occur in sequence, find the
probability of each event occurring separately, and then multiply the probabilities. This
multiplication rule is defined symbolically below. Note that multiplication is represented by
AND.

In probability theory we say that two events, A and B, are independent if the probability
that they both occur is equal to the product of the probabilities of the two individual events, i.e.

P(A ∩ B) = P(A) × P(B)

The idea of independence can be extended to more than two events. For example, A, B
and C are independent if:

a. A and B are independent; A and C are independent and B and C are independent
(pairwise independence);

b. P(A ∩ B ∩ C) = P(A) × P(B) × P(C)

If two events with non-zero probabilities are independent then they cannot be mutually exclusive
(disjoint), and vice versa.

Example-1: A man and a woman each have a pack of 52 playing cards. Each draws a
card from his/her pack. Find the probability that they each draw the ace of clubs.

We define the events:

A = the man draws the ace of clubs, with P(A) = 1/52

B = the woman draws the ace of clubs, with P(B) = 1/52

Clearly events A and B are independent so:

P(A ∩ B) = P(A) × P(B) = 1/52 × 1/52 = 1/2704 ≈ 0.00037

That is, there is a very small chance that the man and the woman will both draw the ace of clubs.

Example-2: If you toss a coin twice, what is the probability that it will come up
heads both times? Event A is that the coin comes up heads on the first toss and Event B is that the
coin comes up heads on the second toss. Since both P(A) and P(B) equal 1/2, the probability that
both events occur is 1/2 x 1/2 = 1/4.

Example-3: If you flip a coin and roll a six-sided die, what is the probability that the coin
comes up heads and the die comes up 1? Since the two events are independent, the probability is
simply the probability of a head (which is 1/2) times the probability of the die coming up 1
(which is 1/6). Therefore, the probability of both events occurring is 1/2 x 1/6 = 1/12.

Example-4: You draw a card from a deck of cards, put it back, and then draw another
card. What is the probability that the first card is a heart and the second card is black? Since there
are 52 cards in a deck, and 13 of them are hearts, the probability that the first card is a heart is
13/52 = 1/4. Since there are 26 black cards in the deck, the probability that the second card is
black is 26/52 = 1/2. The probability of both events occurring is therefore 1/4 x 1/2 = 1/8.
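The multiplication rule for independent events can be checked by listing a joint sample space. The sketch below uses the coin-and-die setting of Example-3; the helper `prob` and the variable names are assumptions introduced here for illustration.

```python
from fractions import Fraction
from itertools import product

# Joint sample space: one coin flip and one die roll, 12 equally likely outcomes
outcomes = list(product("HT", range(1, 7)))

def prob(event):
    """Classical probability: favourable outcomes / total outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

p_head = prob(lambda o: o[0] == "H")
p_one = prob(lambda o: o[1] == 1)
p_both = prob(lambda o: o[0] == "H" and o[1] == 1)

# For independent events the joint probability equals the product
print(p_both, p_head * p_one)
```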

If Events A and B are independent, the probability that either Event A or Event B occurs is:

P(A or B) = P(A) + P(B) - P(A and B)

where, because the events are independent, P(A and B) = P(A) x P(B).

In this discussion, when we say "A or B occurs" we include three possibilities:

1. A occurs and B does not occur


2. B occurs and A does not occur
3. Both A and B occur

This use of the word "or" is technically called inclusive or because it includes the case in
which both A and B occur. If we included only the first two cases, then we would be using an
exclusive or.

(Optional) We can derive the law for P(A-or-B) from our law about P(A-and-B). The
event "A-or-B" can happen in any of the following ways:

1. A-and-B happens

2. A-and-not-B happens

3. not-A-and-B happens.

The simple event A can happen if either A-and-B happens, or A-and-not-B happens.
Similarly, the simple event B happens if either A-and-B happens or not-A-and-B happens. P(A)
+ P(B) is therefore P(A-and-B) + P(A-and-not-B) + P(A-and-B) + P(not-A-and-B) whereas P(A-
or-B) is P(A-and-B) + P(A-and-not-B) + P(not-A-and-B). We can make these two sums equal by
subtracting one occurrence of P(A-and-B) from the first. Hence,

P(A-or-B) = P(A) + P(B) - P(A-and-B).

Example-5: If you flip a coin two times, what is the probability that you will get a head
on the first flip or a head on the second flip (or both)? Letting Event A be a head on the first flip
and Event B be head on the second flip then P(A) = 1/2, P(B) = 1/2, and P(A and B) = 1/4.
Therefore,

P(A or B) = 1/2 + 1/2 - 1/4 = 3/4.
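The difference between the inclusive and the exclusive "or" in Example-5 can be seen by listing the four outcomes of two flips. This is an illustrative sketch; the names used are assumptions made here.

```python
from fractions import Fraction
from itertools import product

flips = list(product("HT", repeat=2))  # HH, HT, TH, TT

def prob(event):
    """Classical probability over the four equally likely outcomes."""
    return Fraction(sum(1 for f in flips if event(f)), len(flips))

# Inclusive or: head on first flip, on second flip, or on both
p_inclusive = prob(lambda f: f[0] == "H" or f[1] == "H")

# Exclusive or: head on exactly one of the two flips
p_exclusive = prob(lambda f: (f[0] == "H") != (f[1] == "H"))

print(p_inclusive, p_exclusive)
```

The inclusive value matches the formula P(A) + P(B) - P(A and B); the exclusive value is smaller because the outcome HH is left out.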

Example-6: If you throw a six-sided die and then flip a coin, what is the probability that
you will get either a 6 on the die or a head on the coin flip (or both)? Using the formula,

P(6 or head) = P(6) + P(head) - P(6 and head)


= (1/6) + (1/2) - (1/6) (1/2)
= 7/12

An alternate approach to computing this value is to start by computing the probability of
not getting either a 6 or a head. Then subtract this value from 1 to compute the probability of
getting a 6 or a head. Although this is a complicated method, it has the advantage of being
applicable to problems with more than two events. Here is the calculation in the present case.
The probability of not getting either a 6 or a head can be recast as the probability of

(not getting a 6) and (not getting a head).

This follows because if you did not get a 6 and you did not get a head, then you did not
get a 6 or a head. The probability of not getting a six is 1 - 1/6 = 5/6.

The probability of not getting a head is 1 - 1/2 = 1/2. The probability of not getting a six
and not getting a head is 5/6 x 1/2 = 5/12. This is therefore the probability of not getting a 6 or a
head. The probability of getting a six or a head is therefore (once again) 1 - 5/12 = 7/12.

If you throw a die three times, what is the probability that one or more of your throws
will come up with a 1? That is, what is the probability of getting a 1 on the first throw OR a 1 on
the second throw OR a 1 on the third throw? The easiest way to approach this problem is to
compute the probability of not getting a 1 on the first throw and not getting a 1 on the second
throw and not getting a 1 on the third throw.

The answer will be 1 minus this probability. The probability of not getting a 1 on any of
the three throws is 5/6 x 5/6 x 5/6 = 125/216. Therefore, the probability of getting a 1 on at least
one of the throws is 1 - 125/216 = 91/216.
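The complement approach for the three-throw problem can be sketched as follows, with a brute-force count over all 216 outcomes as a cross-check. The variable names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

# Complement rule: P(at least one 1) = 1 - P(no 1 on any throw)
p_no_one = Fraction(5, 6) ** 3
p_at_least_one = 1 - p_no_one

# Cross-check by listing every sequence of three throws
throws = list(product(range(1, 7), repeat=3))
with_a_one = sum(1 for t in throws if 1 in t)
p_enumerated = Fraction(with_a_one, len(throws))

print(p_no_one, p_at_least_one, p_enumerated)
```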

Summary: The probability of two or more independent events occurring in sequence can
be found by computing the probability of each event separately, and then multiplying the results
together.

Dependent events: In some cases one event is dependent on another; that is, two or more
events are said to be dependent if the occurrence or nonoccurrence of one of the events affects
the probabilities of occurrence of any of the others.

Consider two or more dependent events. If p1 is the probability of a first event; p2
the probability that after the first happens, the second will occur; p3 the probability that after the
first and second have happened, the third will occur; etc., then the probability that all events will
happen in the given order is the product p1 × p2 × p3 × ...

Example-1: A box contains 3 white marbles and 4 black marbles. What is the probability of
drawing 2 black marbles and 1 white marble in succession without replacement?

Solution: On the first draw the probability of drawing a black marble is

P1 = 4/7

On the second draw the probability of drawing a black marble is

P2 = 3/6 or 1/2

On the third draw the probability of drawing a white marble is

P3 = 3/5

Therefore, the probability of drawing 2 black marbles and then 1 white marble, in that order, is

P = P1 × P2 × P3 = 4/7 × 1/2 × 3/5 = 6/35
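The product rule for dependent events can be written as a short routine that updates the box after each draw. This is a minimal sketch; the function `sequence_probability` and its arguments are hypothetical names introduced here for illustration.

```python
from fractions import Fraction

def sequence_probability(counts, sequence):
    """Probability of drawing the colours in `sequence`, in that order,
    without replacement, from a box described by `counts`."""
    remaining = dict(counts)           # copy so the caller's dict is unchanged
    total = sum(remaining.values())
    p = Fraction(1)
    for colour in sequence:
        p *= Fraction(remaining[colour], total)  # conditional probability of this draw
        remaining[colour] -= 1                   # the marble is not replaced
        total -= 1
    return p

p = sequence_probability({"white": 3, "black": 4}, ["black", "black", "white"])
print(p)
```

Each factor is the probability of the next draw given the draws already made, which is exactly the p1, p2, p3 of the text.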

Example-2: Slips numbered 1 through 9 are placed in a box. If 2 slips are drawn, without
replacement, what is the probability that

1. both are odd?

2. both are even?

Solution: 1. The probability that the first is odd is P1 = 5/9

and the probability that the second is odd is P2 = 4/8

Therefore, the probability that both are odd is P = P1 × P2 = 5/9 × 4/8 = 5/18

2. The probability that the first is even is P1 = 4/9

and the probability that the second is even is P2 = 3/8

Therefore, the probability that both are even is P = P1 × P2 = 4/9 × 3/8 = 1/6

A second method of solution involves the use of combinations.

1. A total of 9 slips are taken 2 at a time and 5 odd slips are taken 2 at a time; therefore,

P = 5C2 / 9C2 = 10/36 = 5/18

2. A total of 9C2 choices and 4 even slips are taken 2 at a time; therefore,

P = 4C2 / 9C2 = 6/36 = 1/6
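The two methods for the slips problem (sequential multiplication and combinations) should agree. The sketch below compares them with exact fractions; the variable names are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

odd, even, total = 5, 4, 9   # slips 1..9: five odd, four even

# Method 1: multiply the conditional probabilities of successive draws
p_odd_seq = Fraction(odd, total) * Fraction(odd - 1, total - 1)
p_even_seq = Fraction(even, total) * Fraction(even - 1, total - 1)

# Method 2: favourable pairs divided by all pairs
p_odd_comb = Fraction(comb(odd, 2), comb(total, 2))
p_even_comb = Fraction(comb(even, 2), comb(total, 2))

print(p_odd_seq, p_odd_comb, p_even_seq, p_even_comb)
```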

Exhaustive Events: A list of collectively exhaustive events contains all possible
elementary events for an experiment. For example, for the die-tossing experiment, the set of
events consists of 1, 2, 3, 4, 5, and 6. The set is collectively exhaustive because it includes all
possible outcomes. Thus, all sample spaces are collectively exhaustive.

Equally Likely events: Two or more events which have an equal probability of
occurrence are said to be equally likely, i.e. if, on taking into account all the conditions, there
is no reason to expect any one of the events in preference to the others. Equally likely
events may be elementary or compound events.

Examples: 1. In the experiment of tossing a coin, where

i. A : the event of getting a "HEAD" and

ii. B : the event of getting a "TAIL"

Events "A" and "B" are said to be equally likely events
[Both the events have the same chance of occurrence].

2. In the experiment of throwing a die: Where

i. A : the event of getting 1

ii. B : the event of getting 2

iii. ...

iv. ...

v. F : the event of getting 6

Events "A", "B", "C", "D", "E", "F" are said to be equally likely events
[All these events have the same chance of occurrence.]

vi. M : the event of getting an even number

vii. N : the event of getting an odd number

The two compound events "M" and "N" are said to be equally likely.

Not Equally Likely: Where

viii. P : the event of getting an odd number {1, 3, 5}

ix. Q : the event of getting 6

The two events "P" and "Q" cannot be said to be equally likely.

If a pair of dice is thrown, there are 36 ways for the dice to fall (shown in the body of the
diagram), and all of them are equally likely.
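The 36 outcomes for a pair of dice can be listed directly. The following sketch also tallies the sum of the two dice, as one illustration (an addition made here, not taken from the text) that compound events built from equally likely outcomes need not be equally likely themselves.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))   # (first die, second die)
p_each = Fraction(1, len(rolls))               # every ordered pair is equally likely

# Count how many ordered pairs give each total
sum_counts = Counter(a + b for a, b in rolls)
p_sum = {total: count * p_each for total, count in sorted(sum_counts.items())}

for total, p in p_sum.items():
    print(total, p)
```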

Example-1: When one die is tossed, the possible outcomes are just the values {1, 2, 3, 4, 5, 6}.
If the die is fair, then the six outcomes are equally likely, and so we can write:

P (1) = 1/6 P (2) = 1/6 P (3) = 1/6

P (4) = 1/6 P (5) = 1/6 P (6) = 1/6

So, what is the probability that you will get an even number when you roll one die? Simply

P (even number) = P (2 or 4 or 6) = P (2) + P (4) + P (6) = 1/6 + 1/6 + 1/6 = 0.5,

since the outcomes 1, 2, 3, 4, 5 and 6 are mutually exclusive simple events.

Complementary Events: The complement of event E, denoted by Ē, is the set of outcomes
in S that are not included in event E.

Examples: E = Head; Ē = Tail

E = Spades; Ē = not a spade

E = Account in error; Ē = Account not in error

The complement of an event A is the set of all outcomes in the sample space that are not
included in the outcomes of event A. The complement of event A is represented by Ā (read as A
bar).

Rule: Given the probability of an event, the probability of its complement can be found
by subtracting the given probability from 1.

P(Ā) = 1 - P(A)

The complement of an event such as A consists of all outcomes not included in A. For example, if
in rolling a die, event A is getting an odd number, the complement of A is getting an even
number. Thus, the complement of event A contains whatever portion of the sample space that
event A does not contain. See the following figure:

If A is an event, and A' is the complementary event,

p(A) + p(A') = 1 or p(A') = 1 - p(A)

Example: A spinner has 4 equal sectors colored yellow, blue, green and red. What is the
probability of landing on a sector that is not red after spinning this spinner?

Solution: Sample Space: {yellow, blue, green, red}

The probability of each outcome in this experiment is one fourth. The probability of
landing on a sector that is not red is the same as the probability of landing on all the other colors
except red.

Probability (not red) = 1/4 +1/4 + 1/4 = 3/4

The probability of an event is the measure of the chance that the event will occur as a
result of the experiment. The probability of an event A, symbolized by P(A), is a number
between 0 and 1, inclusive, that measures the likelihood of an event in the following way:

• If P(A) > P(B) then event A is more likely to occur than event B.

• If P(A) = P(B) then events A and B are equally likely to occur.

• If event A is impossible, then P(A) = 0.

• If event A is certain, then P(A) = 1.

The complement of event A is Ā, and P(Ā) = 1 - P(A).

Example-2: Suppose one throws an ordinary six-sided die eight times. What is the probability
that one sees a "1" at least once?

Solution: It may be tempting to say that

P (["1" on 1st trial] or ["1" on 2nd trial] or ... or ["1" on 8th trial])

= P ("1" on 1st trial) + P ("1" on 2nd trial) + ... + P ("1" on 8th trial)

= 1/6 + 1/6 + ... + 1/6

= 8/6 = 1.3333... (WRONG ANSWER)

That cannot be right because a probability cannot be more than 1. The technique is wrong
because the eight events whose probabilities were added are not mutually exclusive.

Instead one may find the probability of the complementary event and subtract it from 1, thus:

P (at least one "1") = 1 − P (no "1"s)

= 1 − P ([no "1" on 1st trial] and [no "1" on 2nd trial] and ... and [no "1" on 8th trial])

= 1 − P (no "1" on 1st trial) × P (no "1" on 2nd trial) × ... × P (no "1" on 8th trial)

= 1 − (5/6) × (5/6) × ... × (5/6)

= 1 − (5/6)^8

= 0.7674...
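The contrast between the tempting (wrong) sum and the complement rule can be reproduced numerically. This is a small illustrative sketch; the variable names are assumptions made here.

```python
from fractions import Fraction

throws = 8
p_one = Fraction(1, 6)

# Naive approach: adding probabilities of events that are NOT mutually exclusive
naive_sum = throws * p_one          # exceeds 1, so it cannot be a probability

# Complement rule: 1 - P(no "1" in any of the eight independent throws)
p_at_least_one = 1 - (1 - p_one) ** throws

print(float(naive_sum), float(p_at_least_one))
```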

o Methods to Use in Solving Probability Problems

Many methods can be used in solving probability problems. These
methods include tree diagrams, laws of probability, sample spaces, insight, and contingency
tables. Because of the individuality and variety of probability problems, some approaches apply
more readily in certain cases than in others. There is no best method for solving all probability
problems.
Three laws of probability are discussed in this lecture note: the additive law, the multiplication
law, and the conditional law.

1. The Additive Law:

A. General Rule of Addition:

When two events are not mutually exclusive, that is, they can both happen at the same
time, then: P(X or Y) = P(X) + P(Y) - P(X and Y)

For example, what is the probability that a card chosen at random from a deck of cards
will either be a king or a heart?

P (King or Heart) = P(X or Y) = 4/52 + 13/52 - 1/52 = 16/52 ≈ 30.77%
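The king-or-heart calculation can be confirmed by counting cards in a full deck. The sketch below is illustrative; the deck representation is an assumption made here.

```python
from fractions import Fraction

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(r, s) for r in ranks for s in suits]

kings = sum(1 for r, s in deck if r == "K")
hearts = sum(1 for r, s in deck if s == "hearts")
king_of_hearts = sum(1 for r, s in deck if r == "K" and s == "hearts")

# General rule of addition versus a direct count of "king or heart"
p_rule = Fraction(kings + hearts - king_of_hearts, len(deck))
p_count = Fraction(sum(1 for r, s in deck if r == "K" or s == "hearts"), len(deck))

print(p_rule, float(p_rule))
```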

B. Special Rule of Addition:

When two or more events are mutually exclusive, that is, they cannot happen at the same
time, then: P(X or Y) = P(X) + P(Y)

For example, suppose we have a machine that inserts a mixture of beans, broccoli, and
other types of vegetables into a plastic bag. Most of the bags contain the correct weight, but
because of slight variation in the size of the beans and other vegetables, a package might be
slightly underweight or overweight. A check of many packages in the past indicate that:

Weight            Event    No. of Packages    Probability

Underweight       X        100                0.025

Correct weight    Y        3600               0.900

Overweight        Z        300                0.075

Total                      4000               1.000

What is the probability of selecting a package at random and having the package be
underweight or overweight? Since the events are mutually exclusive, a package cannot be
underweight and overweight at the same time. The answer is: P(X or Z) = P(X) + P(Z) = 0.025 + 0.075 = 0.1

2. The Multiplication Law:

A. General Rule of Multiplication:

When two or more events are to happen together or in succession, and the events are dependent,
then the general rule of multiplication is used to find the joint probability:
P(X and Y) = P(X) . P (Y|X)

For example, suppose there are 10 marbles in a bag, and 3 are defective. Two marbles are
to be selected, one after the other without replacement. What is the probability of selecting a
defective marble followed by another defective marble?

Probability that the first marble selected is defective: P(X) = 3/10

Probability that the second marble selected is defective, given that the first was defective:
P(Y|X) = 2/9

P(X and Y) = (3/10) . (2/9) = 6/90 = 1/15 ≈ 7%

This means that if this experiment were repeated 100 times, in the long run about 7 experiments
would result in defective marbles on both the first and second selections. Another example is
selecting one card at random from a deck of cards and finding the probability that the card is an 8
and a diamond. P(8 and diamond) = P(8) . P(diamond|8) = (4/52) . (1/4) = 1/52, which is the same as
P(diamond and 8) = P(diamond) . P(8|diamond) = (13/52) . (1/13) = 1/52.
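The defective-marble example can be checked by listing every ordered pair of draws without replacement. This is a hedged sketch; the labels and variable names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import permutations

marbles = ["defective"] * 3 + ["good"] * 7

# Every ordered way of drawing two different marbles: 10 x 9 = 90
ordered_draws = list(permutations(range(len(marbles)), 2))
both_defective = sum(
    1 for i, j in ordered_draws
    if marbles[i] == "defective" and marbles[j] == "defective"
)

p_enumerated = Fraction(both_defective, len(ordered_draws))
p_rule = Fraction(3, 10) * Fraction(2, 9)   # P(X) . P(Y|X)

print(p_enumerated, p_rule, round(float(p_rule), 4))
```

The exact value is 1/15, which is about 6.67%; the 7% quoted in the text is this value rounded to the nearest whole percent.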

B. Special Rule of Multiplication:


When two or more events are to happen together or in succession, and the events are independent,
then the special rule of multiplication is used to find the joint probability:
P(X and Y) = P(X) . P(Y)

If two coins are tossed, what is the probability of getting a tail on the first coin and a tail
on the second coin?

P (T and T) = (1/2). (1/2) = 1/4 = 25%. This can be shown by listing all of the possible
outcomes: T T, or T H, or H T, or H H. Games of chance in casinos, such as roulette and craps,
consist of independent events. The next occurrence on the die or wheel should have nothing to
do with what has already happened.

3. The Conditional Law:

Conditional probabilities are based on knowledge of one of the variables. The conditional
probability of an event, such as X, occurring given that another event, such as Y, has occurred is
expressed as:
P (X|Y) = P(X and Y) / P(Y) = {P(X). P (Y|X)} / P(Y)

Note that when using the conditional law of probability, you always divide the joint
probability by the probability of the event after the word given. Thus, to get P(X given Y), you
divide the joint probability of X and Y by the unconditional probability of Y. In other words, the
above equation is used to find the conditional probability for any two dependent events. When
two events, such as X and Y, are independent their conditional probability is calculated as
follows:

P(X|Y) = P(X) and P(Y|X) = P(Y)

For example, if a single card is selected at random from a deck of cards, what is the
probability that the card is a king given that it is a club?

P (king given club) = P (X|Y) = {P(X) . P (Y|X)} / P(Y)

P(Y) = 13/52, and P (king and club) = P(X) . P (Y|X) = (4/52) . (1/4) = 1/52, thus

P (king given club) = P (X|Y) = (1/52) / (13/52) = 1/13

Note that this example can be solved conceptually without the use of equations. Since it
is given that the card is a club, there are only 13 clubs in the deck. Of the 13 clubs, only 1 is a
king. Thus P (king given club) = 1/13.
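Both routes to P(king given club), the conditional law and the direct count among the clubs, can be sketched in a few lines. The deck representation and names below are assumptions made for illustration.

```python
from fractions import Fraction

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(r, s) for r in ranks for s in suits]

# Conditional law: P(X|Y) = P(X and Y) / P(Y)
p_club = Fraction(sum(1 for r, s in deck if s == "clubs"), len(deck))
p_king_and_club = Fraction(
    sum(1 for r, s in deck if r == "K" and s == "clubs"), len(deck)
)
p_king_given_club = p_king_and_club / p_club

# Conceptual route: restrict the sample space to the 13 clubs
clubs = [(r, s) for r, s in deck if s == "clubs"]
p_direct = Fraction(sum(1 for r, s in clubs if r == "K"), len(clubs))

print(p_king_given_club, p_direct)
```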

o Approaches of Assigning Probabilities

There are three approaches of assigning probabilities, as follows:

1. Classical Approach: Classical probability is predicated on the assumption that the outcomes
of an experiment are equally likely to happen. The classical probability utilizes rules and laws. It
involves an experiment. The following equation is used to assign classical probability:

P(X) = Number of favorable outcomes / Total number of possible outcomes

Note that we can apply the classical probability when the events have the same chance of
occurring (called equally likely events), and the set of events are mutually exclusive and
collectively exhaustive.

2. Relative Frequency Approach: Relative probability is based on cumulated historical data.
The following equation is used to assign this type of probability:

P(X) = Number of times an event occurred in the past/ Total number of opportunities for
the event to occur

Note that relative probability is not based on rules or laws but on what has happened in
the past. For example, your company wants to decide on the probability that its inspectors are
going to reject the next batch of raw materials from a supplier. Data collected from your
company record books show that the supplier had sent your company 80 batches in the past, and
inspectors had rejected 15 of them. By the method of relative probability, the probability of the
inspectors rejecting the next batch is 15/80, or 0.19. If the next batch is rejected, the relative
probability for the subsequent shipment would change to 16/81 = 0.20.
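The relative frequency calculation for the supplier example can be sketched as below. The function name `relative_frequency` is a hypothetical helper introduced here for illustration.

```python
from fractions import Fraction

def relative_frequency(occurrences, opportunities):
    """Past occurrences of the event divided by past opportunities for it."""
    return Fraction(occurrences, opportunities)

# 15 of the 80 batches received so far were rejected
p_reject = relative_frequency(15, 80)

# If the next batch is also rejected, both counts increase by one
p_reject_updated = relative_frequency(15 + 1, 80 + 1)

print(round(float(p_reject), 2), round(float(p_reject_updated), 2))
```

The estimate changes as new data arrives, which is the defining feature of this approach.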

3. Subjective Approach: The subjective probability is based on personal judgment,
accumulation of knowledge, and experience. For example, medical doctors sometimes assign
subjective probabilities to the length of life expectancy for people having cancer. Weather
forecasting is another example of subjective probability.

4.4 Summary

In the lesson we learn that the statistical device which measures related variables and provides
representative figures as regards the combined changes in the variables is called index numbers.
There are different types of index numbers namely the price index numbers, Index Numbers of
Physical Quantities or Quantity Index Numbers, Total Value Index Numbers, on the basis of
number of Commodities and Special Purpose Index Numbers. The index numbers can be
calculated on the basis of fixed base as well as chain base. The index numbers can be weighted
or unweighted. In this part we learn about the unweighted index numbers, whereas in the next
lesson we shall learn in detail about the weighted index numbers.

Emergence of Probability is a history of the early development of the very concept of
mathematical probability. A probability provides a quantitative description of the likely
occurrence of a particular event. A probability is conventionally expressed on a scale
from 0 to 1. In the lesson we discussed in detail the various concepts related to probability. In the
next lesson we shall discuss the topic in further depth.

4.5 Questions
1. Distinguish between simple and weighted index numbers. Explain the methods of
Weighted Aggregative and Weighted Average of Price Relatives using arithmetic mean.
2. What points are considered in the selection of (a) base year, (b) average, and (c) weights
in the construction of an index number?
3. What do you understand by probability?
4. What do you mean by sample space?
5. What is an experiment?
6. Define event in context with probability.

4.6 Suggested Reading


1. Shukla & Sahai: Business Statistics, Sahitya Bhawan Publication, Agra.
2. Elhance D.N.: Fundamentals of Statistics, Kitab Mahal, New Delhi.
