
Introduction

• In simple terms, association can be understood as a
connection or relationship.
• We have built our professions in an environment where
understanding the association between many entities has
become vital.
• Consciously or otherwise, we work with many variables in our
daily routine and strive to understand the association between
many of them.

bschool.cms.ac.in
Introduction

• Let’s take a few examples:
• Is there any association between the number of hours an employee logs
in every day and their productivity?
• Is there any association between the demand for my product/service
and the Covid-19 pandemic outbreak?
• As people age, will their maturity improve?
• Will the end-to-end travel time decrease if the frequency of public
transport increases?

Introduction

• Also, there are other questions like:
• If the USD exchange value decreases, will the demand for my
product/service decrease? If yes, to what extent?
• Will the attrition rate in my organization increase if no
performance-based bonuses are given? If yes, by how much?

Introduction

• We can find the answers to the above questions by measuring
the association between the variables.
• This means we need to identify the following details:
• Is there any association between the selected variables?
• If yes, then in what direction – positive or negative?
• If yes, then how strong is the relationship?
• What is the magnitude of the relationship?

Introduction

• The most popular measures of association are correlation analysis
and regression analysis.
• Correlation analysis helps in understanding the strength and
direction of the association, whereas regression analysis helps in
understanding the magnitude of the association between given
variables.
• Correlation analysis can be conducted by measuring the coefficient
of correlation between the variables.

Introduction

• Karl Pearson’s coefficient of correlation can be used to
measure the strength and direction of the association.
• Spearman’s rank correlation coefficient can be used to
measure the association when the data is ordinal.

Introduction

• Regression analysis can be conducted to understand the
magnitude of the relationship between the variables.
• This can be done by arriving at a linear equation (also known as
modelling) that states the relation between a dependent variable
and independent variable(s).
• In this module, the above measures of association will be
covered.

Comparative Analysis

• We live in an environment full of variables.
• Some we understand and most we don’t.
• The environment we progress in demands a thorough
understanding of these variables.
• One of the major requirements in understanding these variables
is knowing how they are associated.

Comparative Analysis

• There is a need to answer the questions:
• How are the variables around us related to each other?
• Are they really related to each other?
• If yes, how strong is their relationship, and in which direction are
they related?

Correlation Analysis

• Often, an analysis of data concerning two or more
quantitative variables is needed to look for any statistical
relationship or association between them.
• Knowledge of such a relationship is important for making
inferences in a given situation.

Correlation Analysis

• Consider an example:
• Typically, in the summer as the temperature increases people are
thirstier.
• Consider the two numerical variables, temperature and water
consumption.
• We would expect the higher the temperature, the more water a
given person would consume.
• Thus we would say that in the summer, temperature and water
consumption are positively correlated.

Correlation Analysis
• Consider another example:
• For seven random summer days, a person recorded the
temperature and their water consumption, during a three-hour
period spent outside.
Temperature (F)    Water Consumption (Ounces)
75                 16
83                 20
85                 25
85                 27
92                 32
97                 48
99                 48
Correlation Analysis
• The graph below helps visualize what appears to be a somewhat
linear relationship between temperature and the amount of
water one drinks.

Correlation Analysis

• Similarly, we come across various examples in our daily life, like:
• Budget on ration and number of visitors / special occasions at home
• Family income and expenditure on luxury items
• Frequency of smoking and lung damage
• Age and sign legibility distance
• No. of occupants in a hotel and its water / electricity consumption
• The list is never ending.
• Hence, correlation can be defined as “a measure of association
between two numerical variables”.

Significance of Measuring Correlation

• Correlation analysis contributes to the understanding of
economic behavior, aids in locating the critically important
variables on which others depend, may reveal to the economist
the connections by which disturbances spread and suggest to
him the paths through which stabilizing forces may become
effective.
• The effect of correlation is to reduce the range of uncertainty of
our prediction. Predictions based on correlation analysis tend to
be more reliable and closer to reality.

Correlation Coefficient

• The sample correlation coefficient, ‘r’, measures the direction and
the strength of the linear association between two paired
numerical variables.
• It varies between +1 and -1 (-1 ≤ r ≤ +1)
• The values can be interpreted as mentioned in the tables in the
next slide.
• Direction of the Association: The association can be either
positive or negative.

Correlation Coefficient
• Positive Correlation: As the ‘X’ variable increases so does the ‘Y’
variable.
r value    Positive Correlation Interpretation
+1         Perfect positive linear relationship
0.9        Strong positive association
0.5        Moderate positive association
0.25       Weak positive association
0          No linear relationship

Example: In the summer, as the temperature increases, so does thirst.


Correlation Coefficient
• Negative Correlation: As the ‘X’ variable increases, the ‘Y’ variable
decreases.
r value    Negative Correlation Interpretation
-1         Perfect negative linear relationship
-0.9       Strong negative association
-0.5       Moderate negative association
-0.25      Weak negative association

Example: As the price of an item increases, the number of items sold decreases.
Correlation Coefficient

• If ‘r’ equals zero, then there is no linear association between the
two variables.
• The closer ‘r’ is to one (in magnitude) the stronger the linear
association.
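To see these interpretations on actual numbers, the following is a minimal Python sketch that applies the deviation-from-mean form of Pearson's coefficient to the seven-day temperature and water consumption data recorded earlier. The function name `pearson_r` is illustrative and not part of the original material.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and of squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Seven summer days: temperature (F) and water consumption (ounces)
temperature_f = [75, 83, 85, 85, 92, 97, 99]
water_oz = [16, 20, 25, 27, 32, 48, 48]

r = pearson_r(temperature_f, water_oz)
print(f"r = {r:.2f}")
```

The result is close to +1, which agrees with the strong positive association suggested by the data.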

Measures of Correlation

• The degree of relationship between the two variables can be
measured using the following methods:
a. Scatter diagram.
b. Karl Pearson’s coefficient of correlation.
c. Spearman’s rank correlation coefficient.

Scatter Diagram

• It is a graphical presentation of bivariate data.
• Here one variable (X) is taken along the x-axis and the other
variable (Y) is taken along the y-axis, and each pair of (X, Y)
values is represented by a point on the graph.
• A rough estimate of the correlation can be obtained from the
following scatter diagrams.

Scatter Diagram
• If the points lie exactly on an upward-sloping straight line, the variables
are said to be perfectly positively correlated.
• If the points are clustered around an upward-sloping line, the variables are
positively correlated.

Scatter Diagram
• If the points lie exactly on a downward-sloping straight line, the variables
are said to be perfectly negatively correlated.
• If the points are clustered around a downward-sloping line, the variables are
negatively correlated.
• If the points are spread all over the graph with no pattern, the variables are not correlated.
Karl Pearson’s Correlation Coefficient

• It is a mathematical measure based on covariance and variances.
• Covariance is a descriptive measure of the linear association
between two variables.
• Covariance describes the extent to which a change in one
variable (x) is paired with a comparable change in another
variable (y).

Properties of Karl Pearson’s Correlation Coefficient
• The value of r does not depend upon the units of measurement.
• The value of r does not depend upon which variable is labelled ‘X’
and which is labelled ‘Y’
• Correlation coefficient lies between -1 and 1. A positive value of r
means a positive linear relationship, a negative value means a
negative linear relationship
• If r = ±1, then all the points of the scatter diagram lie exactly on a
straight line and the correlation is said to be positive perfect if r =
+1 and negative perfect if r = -1.
• ‘r’ measures only the linear relationship between ‘X’ and ‘Y’

Karl Pearson’s Correlation Coefficient
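
A commonly used computational form of Karl Pearson’s coefficient is sketched here. It is the standard textbook formula and is assumed, not confirmed, to be the one intended on this slide. It uses the column totals ΣX, ΣY, ΣX², ΣY² and ΣXY that appear in the working tables of the practice problems, with n the number of paired observations.

$$
r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}}
$$

Dividing numerator and denominator by $n^2$ shows that this is the covariance of X and Y divided by the product of their standard deviations.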

PRACTICE :
Numerical Problems

Karl Pearson’s Correlation Coefficient
• A travel and leisure magazine provides an annual list of the 500
best hotels in India. The magazine provides a rating for each
hotel along with a brief description that includes the size of the
hotel, amenities and the cost per night for a double room. A
sample of 12 of the top-rated hotels in India is as follows:

Hotel                       Location                  No. of Rooms   Cost/night (Rs. ’00)
Cubs Trail Resort           Kanha, MP                 220            499
Seasons Resort and Spa      Cochin, Kerala            727            340
Buffalo Inn                 Coorg, Karnataka          285            585
Swasti Heritage Hotel       Udaipur, Rajasthan        273            495
Tiger Den                   Jim Corbett, Uttarakhand  145            495
Snowden Spa and resorts     Dharmashala, HP           213            279
Sun & Sand Beach Resort     Panjim, Goa               398            279
Sand Stone Beach Resort     Mahabalipuram, TN         343            455
Snow View Towers            Gangtok, Sikkim           250            595
Six Seasons Beach Resort    Vizag, AP                 414            367
Golden Sands                Mapusa, Goa               400            675
Chiru Towers                Hyderabad, Telangana      700            420
Problem 1
Questions:
a. Develop a scatter diagram with the number of rooms on the
horizontal axis and the cost per night on the vertical axis. Does
there appear to be a relationship between the number of rooms
and the cost per night? Discuss.
b. What is the correlation coefficient? What does it tell you about
the relationship between the number of rooms and the cost per
night for a double room? Does this appear reasonable? Discuss.

The data points on the scatter diagram do not follow any pattern. They
cluster around neither a positive slope nor a negative slope.
Hence, there appears to be no relationship between the number of rooms
and the cost per night per room.

X Y X² Y² XY
220 499 48400 249001 109780
727 340 528529 115600 247180
285 585 81225 342225 166725
273 495 74529 245025 135135
145 495 21025 245025 71775
213 279 45369 77841 59427
398 279 158404 77841 111042
343 455 117649 207025 156065
250 595 62500 354025 148750
414 367 171396 134689 151938
400 675 160000 455625 270000
700 420 490000 176400 294000
4368 5484 1959026 2680322 1921817
Solution 1
Since r = -0.29, there is a weak negative correlation between the number of
rooms and the cost of a room per night.
This appears reasonable, as it is consistent with the scatter diagram, which shows no clear pattern.
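
The value of r can be reproduced from the column totals of the working table. The following is a small Python sketch, assuming the standard computational (sum-based) form of Pearson's coefficient; the function name `pearson_r_from_totals` is illustrative only.

```python
from math import sqrt

def pearson_r_from_totals(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Pearson's r from the totals of the X, Y, X^2, Y^2 and XY columns."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Column totals from the Problem 1 working table (12 hotels)
r = pearson_r_from_totals(
    n=12,
    sum_x=4368,       # total number of rooms
    sum_y=5484,       # total cost per night
    sum_x2=1959026,
    sum_y2=2680322,
    sum_xy=1921817,
)
print(round(r, 2))  # -0.29
```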

Problem 2
A newly appointed finance secretary receives feedback from his team in a
review meeting about the rising unemployment in the country. Coming from
a science background, he decides to take various parameters to understand
the real reason behind the rise in the unemployment rate. One of the
parameters he selects is industrial production. He seeks data on the
industrial production index and the number of unemployed people between 2012
and 2019 from his team.
He gets the following table that gives indices of industrial production and
number of registered unemployed people (in lakh). He decides to use the
correlation analysis to understand the relationship between the given data.
Use the Karl Pearson’s Coefficient of Correlation analysis to find out what the
finance secretary discovers from the given data.
Problem 2

Year 2012 2013 2014 2015 2016 2017 2018 2019

Index of Production 100 102 104 107 105 112 103 99

Number Unemployed 15 12 13 11 12 12 19 26

X Y X² Y² XY
100 15 10000 225 1500
102 12 10404 144 1224
104 13 10816 169 1352
107 11 11449 121 1177
105 12 11025 144 1260
112 12 12544 144 1344
103 19 10609 361 1957
99 26 9801 676 2574
832 120 86648 1984 12388

Solution 2
Since r = -0.619, there is a moderate negative correlation between the Index of
Production and Unemployment.
This means that as the index of production increases, unemployment tends to
decrease moderately.

Problem 3
A financial analyst wanted to find out whether inventory turnover influences
any company’s earnings per share (in percent). A random sample of 7
companies listed in a stock exchange was selected and the following data was
recorded for each. Find the strength of association between inventory turnover
and earnings per share. Interpret this finding to the analyst.

Problem 3
Company    Inventory Turnover (no. of times)    Earnings per share (percent)
A          4                                    11
B          5                                    9
C          7                                    13
D          8                                    7
E          6                                    13
F          3                                    8
G          5                                    8
X Y X² Y² XY

4 11 16 121 44

5 9 25 81 45

7 13 49 169 91

8 7 64 49 56

6 13 36 169 78

3 8 9 64 24

5 8 25 64 40

38 69 224 717 378

Solution 3
Since r = 0.134, there is a weak positive correlation between inventory turnover
and earnings per share.
This means that as the inventory turnover increases, earnings per share
increase only slightly.

Problem 4
A nutritionist well-known for her nutritional prescriptions to pregnant women
wishes to estimate the association between gestational age and infant birth
weight in order to enhance her prescriptions. For this, a small study is
conducted involving 10 infants to investigate the association between
gestational age at birth, measured in weeks, and birth weight, measured in
grams. Calculate the association and give recommendations to the nutritionist.

Problem 4
Infant ID Gestational Age (In Weeks) Birth Weight (In Grams)
1 35 1895
2 36 2030
3 29 1440
4 40 2835
5 36 3090
6 42 3827
7 40 3260
8 37 2690
9 41 3285
10 38 2920
X Y X² Y² XY

35 1895 1225 3591025 66325

36 2030 1296 4120900 73080

29 1440 841 2073600 41760

40 2835 1600 8037225 113400

36 3090 1296 9548100 111240

42 3827 1764 14645929 160734

40 3260 1600 10627600 130400

37 2690 1369 7236100 99530

41 3285 1681 10791225 134685

38 2920 1444 8526400 110960

374 27272 14116 79198104 1042114


Solution 4
Since r = 0.89, there is a strong positive correlation between gestational age
and birth weight.
It can be inferred that as the gestational age increases, the birth weight also
tends to increase.

Problem 5
The success of a shopping center can be represented as a function of the
distance (in miles) from the center of the population and the number of clients
(in hundreds of people) who will visit. The data is given in the table below.
Calculate the linear correlation coefficient.

No. Customers 8 7 6 4 2 1

Distance 15 19 25 23 34 40

Association Analysis of Ranked Order
• At times we need to measure the strength of the relationship between
variables using data which can be trusted only to the extent of its rank
ordering.
• The rank correlation coefficient may be used in many situations, for which the
conventional correlation coefficient is unsuitable.
• Spearman's rank correlation coefficient is a measure of rank correlation
(statistical dependence between the rankings of two variables).

Association Analysis of Ranked Order
• When the given pairs of observations in the data set are not ranked, ranks
are assigned by taking either the highest or the lowest value as 1, using the
same convention for both variables.
• While ranking the observations in this way, we may come across more than
one observation with the same value.
• In such a case, the rank assigned to each of these observations is the
average of the ranks they would have received had they differed from
each other.
• For example, if two observations tie for fourth place, they would otherwise
occupy ranks 4 and 5, so each is assigned the rank 4.5. The next rank would
then be 6 and not 5.
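
The averaging rule for tied observations can be sketched in Python as follows. This is a minimal illustration, not a prescribed procedure: the function name `average_ranks` and the sample scores are hypothetical, and rank 1 is given to the highest value by default.

```python
def average_ranks(values, descending=True):
    """Rank values (1 = highest by default); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend 'end' over the run of values tied with the one at 'pos'
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Sorted positions pos..end (0-based) correspond to ranks pos+1..end+1
        shared_rank = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks

# Hypothetical scores in which two observations tie for fourth place
scores = [90, 80, 70, 60, 60, 50]
print(average_ranks(scores))  # [1.0, 2.0, 3.0, 4.5, 4.5, 6.0]
```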

Spearman’s Rank Correlation Coefficient
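
The standard textbook form of Spearman’s coefficient is sketched here. It is assumed, not confirmed, to be the formula intended on this slide. D is the difference between the two ranks of an observation and n is the number of paired observations. This simple form is exact only when there are no tied ranks.

$$
\rho = 1 - \frac{6\sum D^2}{n\left(n^2 - 1\right)}
$$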

Problem 1
In one of the recruitment drives, Tata Motors Limited (TML) decided that they
would select a group of employees for skill based training on the basis of
aptitude tests. On completion of training, the quality of their work is assessed
and they are again ranked as follows, where ‘X’ denotes aptitude ranking and ‘Y’
denotes quality ranking. Calculate the rank correlation and comment on the
selection of employees.

X 2 1 3 7 6 8 4 5 10 9

Y 3 2 1 8 4 9 5 6 10 7

Solution 1
X Y D D²
2 3 -1 1
1 2 -1 1
3 1 2 4
7 8 -1 1
6 4 2 4
8 9 -1 1
4 5 -1 1
5 6 -1 1
10 10 0 0
9 7 2 4
18
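
The coefficient itself can be computed from the two rankings. The following Python sketch assumes the standard no-ties formula, 1 − 6ΣD² / (n(n² − 1)); the function name `spearman_rho` is illustrative only.

```python
def spearman_rho(rank_x, rank_y):
    """Spearman's rank correlation for two rankings without ties."""
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))

# Rankings from Problem 1: aptitude (X) and quality of work (Y)
aptitude_rank = [2, 1, 3, 7, 6, 8, 4, 5, 10, 9]
quality_rank = [3, 2, 1, 8, 4, 9, 5, 6, 10, 7]

rho = spearman_rho(aptitude_rank, quality_rank)
print(round(rho, 3))  # 0.891
```

A value this close to +1 suggests that the aptitude ranking agrees closely with the later quality ranking.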

Problem 2
The following table provides data about the percentage of students who have qualified for a
scholarship offered by the state universities and their CGPA scores. Calculate the Spearman’s
Rank Correlation between the two and interpret the result.
State University % of Students qualified for scholarship % of students scoring above 8.5 CGPA
Bangalore 14 54
Delhi 7 64
Mumbai 27 44
Jaipur 33 32
Kolkata 38 37
Raipur 16 68
Vishakhapatnam 5 62
Trichy 8 43
Bhopal 29 49
Cuttack 18 52

X Rank X Y Rank Y D²
14 7 54 4 9
7 9 64 2 49
27 4 44 7 9
33 2 32 10 64
38 1 37 9 64
16 6 68 1 25
5 10 62 3 49
8 8 43 8 0
29 3 49 6 9
18 5 52 5 0
278

Problem 3
Following a long-standing tradition, the Department of Horticulture, Karnataka
organized its annual Republic Day flower show at Lalbagh Botanical Garden,
Bengaluru from 17th to 28th January 2020. As is the practice, the best theme would
be awarded. To judge the display of various flowers, a panel comprising three judges
was appointed. There were eight participants who were ranked by the panel based on
mutually agreed criteria. The panel’s rankings are as follows:

Participant No. 1 2 3 4 5 6 7 8
Judge 1 4 5 2 1 6 8 7 3
Judge 2 3 2 6 8 1 5 7 4
Judge 3 1 5 3 6 8 7 4 2

Using Spearman’s rank correlation coefficient, name the two judges among the three
who have the closest views regarding the display of flowers.
Correlation between Judge 1 and Judge 2

X (Judge 1) Y (Judge 2) D²

4 3 1
5 2 9
2 6 16
1 8 49
6 1 25
8 5 9
7 7 0
3 4 1
110

Correlation between Judge 2 and Judge 3

Y (Judge 2) Z (Judge 3) D²

3 1 4
2 5 9
6 3 9
8 6 4
1 8 49
5 7 4
7 4 9
4 2 4
92

Correlation between Judge 3 and Judge 1

Z (Judge 3) X (Judge 1) D²

1 4 9
5 5 0
3 2 1
6 1 25
8 6 4
7 8 1
4 7 9
2 3 1
50

Since the correlation between Judge 1 and Judge 3 is the highest, they have the closest
views regarding the display of flowers.
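
The three pairwise comparisons can be reproduced with a short Python sketch. It assumes the standard no-ties Spearman formula; the names `spearman_rho`, `judges` and `closest_pair` are illustrative only.

```python
from itertools import combinations

def spearman_rho(rank_x, rank_y):
    """Spearman's rank correlation for two rankings without ties."""
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))

# Rankings given by each judge to participants 1 to 8
judges = {
    "Judge 1": [4, 5, 2, 1, 6, 8, 7, 3],
    "Judge 2": [3, 2, 6, 8, 1, 5, 7, 4],
    "Judge 3": [1, 5, 3, 6, 8, 7, 4, 2],
}

# Rank correlation for every pair of judges
results = {
    pair: spearman_rho(judges[pair[0]], judges[pair[1]])
    for pair in combinations(judges, 2)
}
for (first, second), rho in results.items():
    print(f"{first} vs {second}: rho = {rho:.3f}")

# The pair with the largest coefficient has the closest views
closest_pair = max(results, key=results.get)
print("Closest views:", closest_pair)
```

Only the Judge 1 and Judge 3 pair yields a positive coefficient (about 0.40); the other two pairs are negative.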
Problem 4
From the Covid-19 data compiled by the Ministry of Health and Family Welfare, India,
the following data is selected to measure the association between the number
of active cases, the number of cured cases and the number of deaths. Using
Spearman’s rank correlation coefficient, name the two variables among the three
that have the closest association.

Problem 4
State No. of active cases No. of cured cases No. of deaths

Karnataka 405 426 31


Gujarat 5248 2780 513
Bihar 364 377 6
Rajasthan 1611 2264 113
Kerala 26 489 6
Madhya Pradesh 1817 1747 221
Assam 29 34 2
Punjab 1678 168 31

X (Active Cases) Rank X Y (Cured Cases) Rank Y Z (Deaths) Rank Z

405 5 426 5 31 4

5248 1 2780 1 513 1

364 6 377 6 6 6

1611 4 2264 2 113 3

26 8 489 4 6 6

1817 2 1747 3 221 2

29 7 34 8 2 8

1678 3 168 7 31 4

X Y D²

5 5 0
1 1 0
6 6 0
4 2 4
8 4 16
2 3 1
7 8 1
3 7 16
38

Y Z D²

5 4 1
1 1 0
6 6 0
2 3 1
4 6 4
3 2 1
8 8 0
7 4 9
16

Z X D²
4 5 1
1 1 0
6 6 0
3 4 1
6 8 4
2 2 0
8 7 1
4 3 1
8

Since the correlation between the no. of active cases and the no. of deaths is the highest,
they have the closest association.
Problem 5
The following data corresponds to the scores of a student of MBA at Jain
University in continuous assessment in 2nd Semester. His mentor wishes to
know if there is any association between the marks scored by the student in
two subjects. Use Spearman’s rank correlation analysis to measure the
association and interpret the result.

X 78 42 90 24 73 80 81 62 65 42

Y 84 51 92 43 75 54 86 54 54 43

X Rank X Y Rank Y D²
78 4 84 3 1
42 8 51 8 0
90 1 92 1 0
24 10 43 9 1
73 5 75 4 1
80 3 54 5 4
81 2 86 2 0
62 7 54 5 4
65 6 54 5 1
42 8 43 9 1
13

Problem 6
TVS Motor Company is about to launch their new 100 CC stylish scooter targeted at
the youth. As part of the testing processes, they decide to invite the general public to
test drive the scooters in order to evaluate its mileage. For this experiment, the
company selects two youths as test drivers from two different colleges in Bengaluru
and Chennai. Each driver is supposed to travel a distance on 9 random routes and
record observations. The observations are as follows. Use Spearman’s rank correlation
analysis to measure the association between two drivers and interpret the result.

X 41 49 52 35 41 42 30 50 48

Y 51 44 44 47 49 51 28 39 22

X Rank X Y Rank Y D²
41 6 51 1 25
49 3 44 5 4
52 1 44 5 16
35 8 47 4 16
41 6 49 3 9
42 5 51 1 16
30 9 28 8 1
50 2 39 7 25
48 4 22 9 25
137

Business Prediction Models
• Irrespective of the sector, organizations are in a race
to predict their future, be it in terms of
opportunities or challenges.
• They are looking for ways to predict (forecast).
• Among the oldest forecasting strategies, the most
common is to forecast from historical data.
• As the data grew, the need to analyse it using relevant tools
came up.
• Statistical tools were the result.

Business Prediction Models
• Among the many statistical forecasting tools is regression
analysis.
• Others include simulation techniques, exponential smoothing, etc.
• Regression analysis is one of the most tried and tested, popular
statistical tools for forecasting.

Regression Analysis
• It was Sir Francis Galton who first used the term regression as a
statistical concept in 1877.
• He made a statistical study that showed that the height of children
born to tall parents tends to ‘regress’ towards the mean height of
population.
• Galton used the term regression as a statistical technique to
predict one variable (the height of children) from another variable
(the height of parents).
• This is called ‘regression’ or ‘simple regression’ confined to
bivariate data.
Regression Analysis
• Regression analysis tells us how one variable is related to another
by providing an equation that allows us to use the known value of
one or more variables, to estimate the unknown value of the
remaining variable.
• A statistical model is a set of mathematical formulae and
assumptions which describe a real world situation.

Regression Analysis
• Regression analysis is a mathematical measure, which helps to
determine the probable form of the relationship between variables
and it is used to predict or estimate the value of one variable,
corresponding to a given value of another variable.
• The variable being predicted is called dependent variable and
variable used to predict the value of dependent variable is called
independent variable.

Regression Analysis
• The simplest type of regression analysis involving one
independent variable and one dependent variable in which the
relationship between the variables is approximated by a straight
line is called linear regression.
• Regression analysis involving two or more independent variables
is called multiple regression analysis.
• The relationship between two variables is quantified by
representing the line of best fit as a mathematical equation known
as regression equation.
• In other words, the linear relationship between two variables can
be described by a straight line, which is known as regression line.
Regression Analysis - LEAST SQUARE METHOD

The goal is to minimize the sum of the squares of the errors of the data
points, where the error for the i-th point is $E_i = Y_i - (a + bX_i)$. This also minimizes the mean squared error.
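
The minimization can be written out as a short derivation. This is the standard least-squares argument, given as a sketch using the same symbols a (intercept) and b (slope) as the error expression above. The quantity to minimize is

$$
S(a, b) = \sum_{i=1}^{n} \left(Y_i - a - bX_i\right)^2 .
$$

Setting the partial derivatives of S with respect to a and b to zero gives the two normal equations:

$$
\sum Y = na + b\sum X, \qquad \sum XY = a\sum X + b\sum X^2 .
$$

Solving this pair of equations for a and b gives the line of best fit.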
Regression Equations
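
The two regression equations are sketched here in their standard textbook form, which is assumed, not confirmed, to be what this slide presented. The symbols $b_{yx}$ and $b_{xy}$ (the regression coefficients of Y on X and of X on Y) are conventional notation.

$$
Y - \bar{Y} = b_{yx}\,(X - \bar{X}), \qquad b_{yx} = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - \left(\sum X\right)^2}
$$

$$
X - \bar{X} = b_{xy}\,(Y - \bar{Y}), \qquad b_{xy} = \frac{n\sum XY - \sum X \sum Y}{n\sum Y^2 - \left(\sum Y\right)^2}
$$

The first equation is used to estimate Y from a known X, and the second to estimate X from a known Y.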

Regression Coefficients

Properties of Regression Coefficients
• The correlation coefficient is the geometric mean of the
regression coefficients.
• The arithmetic mean of the regression coefficients is greater than or
equal to the correlation coefficient.
• Regression coefficients are independent of change of origin but
not of scale.
• If one of the regression coefficients is greater than unity, the other
must be less than unity.
• Both regression coefficients have the same sign, either
positive or negative.

Problem 1
• The government of India is announcing plenty of reforms during this
pandemic period. As part of this activity, it has asked the
Ministry of Commerce and Industry to model the relationship between
import and export values of the electronics sector in the country. The ministry has
gathered data from 2013-14 to 2018-19 from DGCIS for the
prediction. Use regression analysis to model the relation of imports
on exports and vice versa for the electronics sector. Also predict the imports for
the year 2020-21 given that the exports will be USD 11 billion.

Year                      2013-14  2014-15  2015-16  2016-17  2017-18  2018-19
Imports (US $ Billion)    32       36       40       42       51       55
Exports (US $ Billion)    8        6        6        7        7        9
X Y X² Y² XY

32 8 1024 64 256

36 6 1296 36 216

40 6 1600 36 240

42 7 1764 49 294

51 7 2601 49 357

55 9 3025 81 495

256 43 11310 315 1858
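
Both regression lines and the requested prediction can be obtained from these column totals. The following Python sketch assumes the standard least-squares formulas for slope and intercept, with X as imports and Y as exports; the function name `regression_line` is illustrative only.

```python
def regression_line(n, sum_pred, sum_resp, sum_pred2, sum_pred_resp):
    """Least-squares (intercept, slope) for: response = intercept + slope * predictor."""
    slope = (n * sum_pred_resp - sum_pred * sum_resp) / (n * sum_pred2 - sum_pred ** 2)
    intercept = sum_resp / n - slope * sum_pred / n
    return intercept, slope

# Column totals from the working table: X = imports, Y = exports (US $ billion)
n, sum_x, sum_y, sum_x2, sum_y2, sum_xy = 6, 256, 43, 11310, 315, 1858

# Regression of exports (Y) on imports (X)
a_yx, b_yx = regression_line(n, sum_x, sum_y, sum_x2, sum_xy)
# Regression of imports (X) on exports (Y)
a_xy, b_xy = regression_line(n, sum_y, sum_x, sum_y2, sum_xy)

# Imports are predicted from exports, so the X-on-Y line is used
predicted_import = a_xy + b_xy * 11

print(f"Y = {a_yx:.3f} + {b_yx:.4f} X")
print(f"X = {a_xy:.3f} + {b_xy:.4f} Y")
print(f"Predicted imports when exports = 11: {predicted_import:.2f}")
```

Under these assumptions the predicted imports come to roughly USD 55.8 billion. Since exports of 11 lie outside the observed range of 6 to 9, the estimate is an extrapolation and should be read with caution.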

Problem 2
• Ralison Appliances Pvt. Ltd. manufactures different types of electrical
appliances in India. It has been using radio (FM) for advertising its products.
The following table shows the amounts of radio time and the number of
electrical appliances sold over seven weeks. Fit linear equations of radio time
on the number of electrical appliances sold and vice-versa. Also calculate the
sales when the radio time is 24 minutes.

Radio Time (Minutes) 25 18 32 21 35 28 30


No. of appliances sold 16 11 20 15 26 32 20

X Y X² Y² XY

25 16 625 256 400

18 11 324 121 198

32 20 1024 400 640

21 15 441 225 315

35 26 1225 676 910

28 32 784 1024 896

30 20 900 400 600

189 140 5323 3102 3959


Multiple Linear Regression
• Multiple linear regression (MLR), also known simply as multiple
regression, is a statistical technique that uses several explanatory
variables to predict the outcome of a response variable.
• The goal is to model the linear relationship between the
explanatory (independent) variables and response (dependent)
variable.
• In essence, multiple regression is the extension of ordinary
least-squares regression that involves more than one explanatory
variable.

Multiple Linear Regression
• A simple linear regression is a function that allows an analyst or
statistician to make predictions about one variable based on the
information that is known about another variable.
• Linear regression can only be used when one has two continuous
variables—an independent variable and a dependent variable.
• The independent variable is the parameter that is used to calculate
the dependent variable or outcome.
• A multiple regression model extends to several explanatory
variables.

Multiple Linear Regression
• For example, an analyst may want to know how the movement of
the market affects the price of Exxon Mobil (XOM).
• In this case, his linear equation will have the value of the S&P 500
index as the independent variable, or predictor, and the price of
XOM as the dependent variable.

Multiple Linear Regression
• In reality, there are multiple factors that predict the outcome of an
event.
• The price movement of Exxon Mobil, for example, depends on
more than just the performance of the overall market.
• Other predictors such as the price of oil, interest rates, and the
price movement of oil futures can affect the price of XOM and stock
prices of other oil companies.
• To understand a relationship in which more than two variables are
present, a multiple linear regression is used.

Multiple Linear Regression
• Multiple linear regression is used to determine a mathematical
relationship among a number of random variables.
• In other terms, it examines how multiple independent variables
are related to one dependent variable.
• Once each of the independent factors has been determined to
predict the dependent variable, the information on the multiple
variables can be used to create an accurate prediction on the level
of effect they have on the outcome variable.
• The model creates a relationship in the form of a straight line
(linear) that best approximates all the individual data points.
Multiple Linear Regression
A two-variable multiple linear regression equation is given as:

$Y = a + b_1 X_1 + b_2 X_2$
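
The coefficients a, b1 and b2 of such an equation can be estimated by least squares. The following Python sketch solves the three normal equations with Cramer's rule. The function names and the toy data are hypothetical: the data were generated exactly from Y = 1 + 2·X1 + 3·X2 purely for illustration and do not come from the problems in this module.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_two_predictors(x1, x2, y):
    """Least-squares estimates (a, b1, b2) for Y = a + b1*X1 + b2*X2."""
    n = len(y)
    s1, s2, sy = sum(x1), sum(x2), sum(y)
    s11 = sum(u * u for u in x1)
    s22 = sum(v * v for v in x2)
    s12 = sum(u * v for u, v in zip(x1, x2))
    s1y = sum(u * w for u, w in zip(x1, y))
    s2y = sum(v * w for v, w in zip(x2, y))
    # Normal equations written as lhs * [a, b1, b2] = rhs
    lhs = [[n, s1, s2], [s1, s11, s12], [s2, s12, s22]]
    rhs = [sy, s1y, s2y]
    d = det3(lhs)
    coeffs = []
    for col in range(3):
        # Cramer's rule: replace one column of lhs with rhs
        m = [row[:] for row in lhs]
        for r in range(3):
            m[r][col] = rhs[r]
        coeffs.append(det3(m) / d)
    return tuple(coeffs)

# Hypothetical toy data generated from Y = 1 + 2*X1 + 3*X2
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [9, 8, 19, 18, 29]

a, b1, b2 = fit_two_predictors(x1, x2, y)
print(a, b1, b2)
```

Because the toy data contain no noise, the fit recovers the generating coefficients. With real data the fitted line would only approximate the observations.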

Problem 3
• People in the aerospace industry believe the cost of a space project is a
function of the weight of the major object being sent into space. Use the
following data to develop a regression model to predict the cost of a space
project by the weight of the space object.

Weight (Tons) 2 3 1 1.5 1.25 2 2.5

Cost ($ millions) 53.5 185 7 24 34 110 104

Problem 4
• The editor-in-chief of Bangalore Mirror has been trying to convince the
paper’s owner to improve the working conditions in the press room. He is
convinced that the noise level, when the presses are running, creates
unhealthy levels of tension and anxiety. He recently had a psychologist
conduct a test during which pressmen were placed in rooms with varying
levels of noise and then given a test to measure mood and anxiety levels. The
following table shows the index of their degrees of nervousness and the level
of noise to which they were exposed (5 is low and 10 is high). Develop the
estimating equations. Also predict the degree of nervousness that we might
expect when the noise level is 7.5.

Noise Level 7.0 6.5 5.5 6.0 8.0 8.5 6.0 6.5
Degree of Nervousness 23 38 45 36 16 18 39 41

Problem 5
• As part of a study on transportation safety, the Karnataka State Government
collected data on the number of fatal accidents per 1000 licenses and the
percentage of licensed drivers under the age of 21 in 10 cities of the state.
The data are tabulated below. Fit linear equations to these data.

Fatal accidents per 1000 licenses            2.6  3.8  0.8  1.2  0.6  1  2.8  1.4  1.8  2
Percent of licensed drivers under 21 years   17   18   8    13   6    9  16   12   9    10

Problem 6
• A researcher at Jain University wished to investigate whether there is any
relationship between atmospheric temperature (in °C) and the number of
Covid-19 cases. The researcher collected the following data from 12 randomly
selected states in India. Model the relation between the temperature and the
number of Covid-19 cases. Also estimate the number of Covid-19 cases when the
temperature is 8 °C.

State            Average Temperature (°C)   No. of Covid-19 Cases
HP               17                         38
J&K              18                         540
UP               33                         2055
Delhi            32.5                       5894
Chhattisgarh     35                         29
West Bengal      30.5                       1818
Karnataka        27                         535
Andhra Pradesh   32                         952
Tamil Nadu       34                         8423
Gujarat          38                         6906
Goa              30                         10
Rajasthan        37                         2522

Problem 7
• Hyundai Motor India Ltd recently held 3-day roadside exhibitions to
introduce its new model of the Creta. The number of sales personnel
employed at each of a sample of 10 exhibitions and the number of cars
booked at each one are given below. Using these data, regress the number
of cars booked on the number of salesmen and obtain the regression
equation. Also estimate the number of cars booked if 10 salesmen are
employed at an exhibition.

No. of Salesmen 5 8 6 8 9 3 5 4 6 6

No. of Cars booked 132 160 148 156 168 102 142 98 152 142

Problem 8
• ITI Limited recorded data showing the experience of machine operators and
their performance rating as given by the number of good parts turned out per
100 pieces. Obtain the regression equation of performance rating on
experience. Use this equation to estimate the probable performance if an
operator has 7 years of experience.

Operator 1 2 3 4 5 6 7 8

Experience (Years) 16 12 18 4 3 10 5 12

Performance Rating 87 88 89 68 78 80 75 82
Basic terminologies
•POPULATION
•SAMPLE
•CENSUS
•SAMPLING
•STATISTIC
•PARAMETER

Population

The aggregate of all the elements, sharing some common set of
characteristics, that comprises the universe for the purpose of the
marketing research problem.

Census

A complete enumeration of the elements of a population.

Parameter

A population parameter is a numerical characteristic of the population
(for example, its mean), computed from the information collected through
a census.

Sample

A subgroup of the elements of the population selected for participation
in the study.

Sampling

The process of choosing a sample is called sampling.

Statistic

The numerical information collected from the sample is called a statistic.

What is a good sample?

A good sample is one that gives as much information as possible about
the whole universe.

When is a census appropriate?
•A census is appropriate if the population size itself is
quite small
•If the cost of making an incorrect decision is high, then
a census is more appropriate
•If the sampling errors are high, then a census may be
more appropriate than a sample

Advantages of Census

•The data is very accurate and reliable
•The data collected becomes a good database for future studies

Disadvantages of Census
•Costly process
•Time consuming
•Large amount of manpower and effort required
•Handling of a huge volume of collected data
•Maintenance of the database

Sampling is preferred when:
•The population is very large
•Quick results are required
•The study involves destruction of the elementary units under study
•The cost of surveying the whole population is prohibitive
•Handling large volumes of data is difficult

Advantages of Sampling

•Less time consumed
•Less money spent
•Possible to give attention to different characteristics of the
elementary units

Reasons for Taking a Census

• Eliminate the possibility that a random sample is not representative of
the population.
• The person authorizing the study is uncomfortable with sample
information.

Reasons for Sampling
• Sampling can save money.
• Sampling can save time.
• For given resources, sampling can broaden the scope of the data set.
• Because the research process is sometimes destructive, the sample
can save product.
• If accessing the population is impossible, sampling is the only option.

Sampling : Design and procedures
• The sampling design process

1. Define the target population
2. Determine the sampling frame
3. Select sampling technique(s)
4. Determine the sample size
5. Execute the sampling process

Target population

The collection of elements or objects that possess the information
sought by the researcher.

Sampling frame

A representation of the elements of the target population.
It consists of a list or set of directions for identifying the target
population.

Sample size

The number of elements to be included in the study.
Determining the sample size is complex and involves several qualitative
and quantitative considerations.

Qualitative factors

•The importance of the decision
•The nature of research
•The number of variables
•The nature of analysis
•Sample sizes in similar studies
•Incidence rates
•Completion rates
•Resource constraints

Numericals

What should the sample size be for a population of 600000 elements,
with the precision set at the 95% confidence level?

Formula
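
A minimal sketch, assuming the intended formula is Yamane's simplified formula n = N / (1 + N e^2), with the precision e taken as 1 minus the confidence level (0.05 for 95%, 0.10 for 90%). Both the choice of formula and that reading of "precision" are assumptions and are not confirmed by the slides; the function name is hypothetical.

```python
import math


def yamane_sample_size(population, precision):
    """Sample size n = N / (1 + N * e^2), rounded up to a whole element."""
    return math.ceil(population / (1 + population * precision ** 2))


# Assumed reading: 95% confidence -> e = 0.05, 90% confidence -> e = 0.10
print(yamane_sample_size(600_000, 0.05))
print(yamane_sample_size(100_000, 0.10))
```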

Numericals

What should the sample size be for a population of 100000 elements,
with the precision set at the 90% confidence level?

Numericals
Maggi Omega is worried about its reduced sales
and has therefore decided to conduct a survey with
a precision level of ±5 and a confidence
level of 95%. The standard deviation of the
population is known to be 55. Determine the
sample size.

Data available

Formula
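
A minimal sketch, assuming the intended formula is the standard sample size formula for estimating a mean with known population standard deviation, n = (z * sigma / D)^2, where D is the precision level and z is the standard normal value for the chosen confidence level (about 1.96 for 95%). The function name is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(sigma, precision, confidence):
    """n = (z * sigma / D)^2, rounded up to a whole element."""
    # Two-sided z value, e.g. about 1.96 for a 95% confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / precision) ** 2)


print(sample_size_for_mean(sigma=55, precision=5, confidence=0.95))
print(sample_size_for_mean(sigma=39, precision=4, confidence=0.95))
```

Rounding up rather than to the nearest integer keeps the achieved precision at least as tight as requested.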

Numericals
At a confidence level of 95% and a precision
level of ±4, determine the sample size for a survey
given that the standard deviation of the population
is 39.

Numericals

The statistical sample size was determined to be 456 in a population of
4500. The constraint is to have a sample of less than 10% of the total
population. Statistically obtain the corrected sample size.

Formula
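
A minimal sketch, assuming the intended formula is the standard finite population correction n_c = n * N / (N + n - 1), which is usually applied when the computed sample exceeds 10% of the population. The function name is hypothetical.

```python
import math


def corrected_sample_size(n, population):
    """Finite population correction: n_c = n * N / (N + n - 1)."""
    return n * population / (population + n - 1)


for n, N in [(456, 4500), (1256, 10000)]:
    nc = corrected_sample_size(n, N)
    # Show the exact value and the value rounded up to a whole element
    print(n, N, round(nc, 2), math.ceil(nc))
```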

Numericals

The statistical sample size was determined to be 1256 in a population of
10000. The constraint is to have a sample of less than 10% of the total
population. Statistically obtain the corrected sample size.

Sampling techniques

Sampling techniques may be broadly classified as
•Probability sampling
•Non-probability sampling

Probability Sampling

A probability sample is one for which the inclusion or exclusion of any
individual element of the population depends upon the application of
probability methods and not on personal judgement.

Non-probability sampling

Non-probability sampling is a procedure for selecting a sample without
the use of probability or randomization.

Non-probability sampling

1. Convenience sampling
2. Judgmental sampling
3. Quota sampling
4. Snowball sampling

Types of probability sampling
•Simple random sampling
•Systematic sampling
•Stratified random sampling
•Cluster sampling
•Multistage sampling
•Area sampling
•Multiphase sampling

Simple random sampling
A probability sampling technique in which each
element in the population has a known and equal
probability of selection.
Every element is selected independently of every other
element and the sample is drawn by a random
procedure from a sampling frame.
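
A minimal sketch of drawing a simple random sample from a sampling frame, using Python's standard random module. The frame of 600 numbered elements and the function name are hypothetical, chosen only for illustration.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n distinct elements; every element has the same chance of selection."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(frame, n)


frame = list(range(1, 601))  # hypothetical frame: element IDs 1..600
sample = simple_random_sample(frame, 60, seed=42)
print(len(sample), sorted(sample)[:5])
```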

Systematic sampling

A probability sampling technique in which the sample is chosen by
selecting a random starting point and then picking every i-th element
in succession from the sampling frame.
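
A minimal sketch of systematic sampling, assuming the sampling interval i is taken as the frame size divided by the sample size (rounded down) and the starting point is drawn at random from the first interval. The function name and the frame are hypothetical.

```python
import random


def systematic_sample(frame, n, seed=None):
    """Pick a random start, then every i-th element of the frame."""
    if not 0 < n <= len(frame):
        raise ValueError("sample size must be between 1 and the frame size")
    rng = random.Random(seed)
    interval = len(frame) // n       # sampling interval i
    start = rng.randrange(interval)  # random start within the first interval
    return [frame[start + k * interval] for k in range(n)]


frame = list(range(1, 601))  # hypothetical frame: element IDs 1..600
print(systematic_sample(frame, 60, seed=7)[:5])
```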

Stratified sampling

A probability sampling technique that uses a two-step process to
partition the population into subpopulations, or strata.
Elements are selected from each stratum by a random procedure.


The strata should be mutually exclusive and collectively exhaustive, in
that every population element should be assigned to one and only one
stratum and no population elements should be omitted.

Classification of stratified sampling
Stratified sampling is broadly classified as proportional stratified
sampling and disproportional stratified sampling.
Proportional stratified sampling is further classified as directly
proportional stratified sampling and inversely proportional stratified
sampling.

Directly proportional stratified sampling
Assume that a researcher is evaluating customer satisfaction
for a beverage that is consumed by a total of 600 people.
Among the 600 people, 400 are brand loyal and 200 are
variety seeking

Consumer type     Group size   10% sample
Brand loyal       400          40
Variety seeking   200          20
Total             600          60
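
The allocation in this table can be sketched as below: each stratum contributes the same fraction of its own size. The function name is hypothetical; the group sizes are the ones from the example.

```python
def proportional_allocation(strata_sizes, fraction):
    """Sample size per stratum = fraction * stratum size."""
    return {name: round(size * fraction) for name, size in strata_sizes.items()}


groups = {"Brand loyal": 400, "Variety seeking": 200}
allocation = proportional_allocation(groups, 0.10)
print(allocation, sum(allocation.values()))
```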

Inversely proportional stratified sampling
Assume that among the 600 consumers in the
population, 200 are heavy drinkers and 400 are light
drinkers. If a researcher values the opinion of the heavy
drinkers more than that of the light drinkers, more
people will have to be sampled from the heavy drinkers
group. In such instances, inversely proportional
stratified sampling can be used.

If a sample size of 60 is desired, a 10 percent inversely
proportional stratified sample is drawn as follows:

Consumer type    Group size   10% sample
Heavy drinkers   200          40
Light drinkers   400          20
Total            600          60
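
The slides do not state the allocation rule behind this table. One reading that reproduces it, offered here only as an assumption, is to split the total sample in proportion to 1 / group size, so the smaller group receives the larger share. The function name is hypothetical.

```python
def inverse_allocation(strata_sizes, total_sample):
    """Split total_sample across strata with weights proportional to 1 / size."""
    weights = {name: 1 / size for name, size in strata_sizes.items()}
    total_weight = sum(weights.values())
    return {name: round(total_sample * w / total_weight)
            for name, w in weights.items()}


groups = {"Heavy drinkers": 200, "Light drinkers": 400}
print(inverse_allocation(groups, 60))
```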

Cluster sampling
First, the target population is divided into mutually exclusive and
collectively exhaustive subpopulations called clusters.
Then, a random sample of clusters is selected based on a probability
sampling technique such as simple random sampling.
For each selected cluster, either all the elements are included in the
sample or a sample of elements is drawn probabilistically.
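
The two variants described above can be sketched as follows: with per_cluster left unset, every element of each chosen cluster is included; otherwise a random subsample is drawn within each chosen cluster. The cluster names, sizes and function name are hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, per_cluster=None, seed=None):
    """Select clusters at random, then take all or some of their elements."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        members = clusters[name]
        if per_cluster is None:
            sample.extend(members)  # include every element of the cluster
        else:
            sample.extend(rng.sample(members, per_cluster))  # subsample
    return chosen, sample


# Hypothetical population: 10 blocks of 20 households each
clusters = {f"block_{i}": [f"block_{i}_hh_{j}" for j in range(20)]
            for i in range(10)}
chosen, everyone = cluster_sample(clusters, 3, seed=1)
_, some = cluster_sample(clusters, 3, per_cluster=5, seed=1)
print(chosen, len(everyone), len(some))
```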

Stratified v/s Cluster sampling

Stratified sampling
• Homogeneity within group
• Heterogeneity between groups
• All groups are included
• Sampling efficiency improved by increasing accuracy at a faster rate
than cost

Cluster sampling
• Homogeneity between groups
• Heterogeneity within group
• Random selection of groups
• Sampling efficiency improved by decreasing cost at a faster rate
than accuracy
