
Module 4: Measures of Association

Module No.: 4
Topic: Measures of Association
No. of Hours: 6
No. of Sessions: 5
Session numbers: 16 to 20
Session 16: Comparative Analysis
Session 17: Association Analysis of Ranked Order
Session 18: Business Prediction Models
Session 19: Analytical Linear Regression
Session 20: Multiple Linear Regression
Learning Outcomes: Use correlation to identify variables that may be used in model building, and perform simple linear regression to find association relationships.
Skillset acquired: Ability to identify and understand association between business indicators and apply this to functional, semi-structured decision making including, but not limited to, forecasting.

In simple terms, association can be understood as a connection or relationship. We have built our professions in an environment where understanding the association between many entities has become vital. Consciously or otherwise, we work with many variables in our daily routine and strive to understand the association between them. Let us take a few examples:
• Is there any association between the number of hours an employee logs in every day and their productivity?
• Is there any association between the demand for my product/service and the Covid-19 pandemic outbreak?
• As people age, will their maturity improve?
• Will the end-to-end travel time decrease if the frequency of public transport increases?
Also, there are other questions like,
• If the USD exchange value decreases, will the demand for my product/service decrease? If yes, to what extent?
• Will the attrition rate in my organization increase if no performance-based bonuses are given? If yes, by how much?
We can answer the above questions by measuring the association between the variables. This means we need to identify the following details:
• Is there any association between the selected variables?
• If yes, in what direction: positive or negative?
• If yes, how strong is the relationship?
• What is the magnitude of the relationship?
The most popular measures of association are correlation analysis and regression analysis.
Correlation analysis helps in understanding the strength and direction of the association
whereas regression analysis helps in understanding the magnitude of the association
between given variables.
Correlation analysis can be conducted by measuring coefficient of correlation between the
variables. Karl Pearson’s coefficient of correlation can be used to measure the strength and
direction of the association. Spearman’s rank correlation coefficient can be used to measure
the association when the data is ordinal.
Regression analysis can be conducted to understand the magnitude of relationship between
the variables. This can be done by arriving at a linear equation (also known as modeling)
that states the relation between a dependent variable and independent variable(s).
This module deals with the above measures of association.
Session 16: Comparative Analysis

Pedagogy: Interactive book and board problem solving using Activity Manual
Special Activity designed: NO
Software: MS Excel
Preparatory material:
• Correlation Basics - https://www.youtube.com/watch?v=n5AmAUgZdlc
• Perfect Correlation Street - https://www.youtube.com/watch?v=OCKs_WdNN8Q
Detailed Syllabus: Correlation analysis: significance of measuring correlation, correlation and causation. Karl Pearson’s coefficient of correlation.
Post session practice:
• Application of Correlation in Investment Analysis - https://www.gurufocus.com/news/1091866/correlation-market-crashes-and-investment-returns
• DANCING STATISTICS: CORRELATION THROUGH DANCE - https://www.youtube.com/watch?v=VFjaBh12C6s

We live in an environment full of variables. Some we understand; most we do not. Progress in this environment demands the best possible understanding of these variables, and a major part of that understanding is how they are associated. We need to answer the questions:
• How are the variables around us related to each other?
• Are they really related to each other?
• If yes, how strong is their relationship, and in which direction are they related?
This session addresses these questions through an important statistical concept used in relationship analysis: correlation analysis. By applying this concept, one can understand the relationship between the variables of interest.
Introduction to Correlation Analysis:

Often, an analysis of data concerning two or more quantitative variables is needed to look for any statistical relationship or association between them. Knowledge of such a relationship is important for making inferences in a given situation.
Let us consider an example. Typically, in the summer as the temperature increases people
are thirstier. Consider the two numerical variables, temperature and water consumption.
We would expect the higher the temperature, the more water a given person would
consume. Thus we would say that in the summer, temperature and water consumption are
positively correlated.
For seven random summer days, a person recorded the temperature and their water
consumption, during a three-hour period spent outside.
Temperature (F)    Water Consumption (Ounces)
75 16
83 20
85 25
85 27
92 32
97 48
99 48

A scatter plot of these values (figure not reproduced here) helps visualize what appears to be a somewhat linear relationship between temperature and the amount of water one drinks.

Similarly, we come across various examples in our daily life like,


 Budget on ration and number of visitors / special occasions at home
 Family income and expenditure on luxury items
 Frequency of smoking and lung damage
 Age and sign legibility distance
 No. of occupants in a hotel and its water / electricity consumption
The list is never ending.
Hence, correlation can be defined as “a measure of association between two numerical
variables”.

Significance of measuring correlation:


 Correlation analysis contributes to the understanding of economic behavior, aids in
locating the critically important variables on which others depend, may reveal to the
economist the connections by which disturbances spread and suggest to him the
paths through which stabilizing forces may become effective.
- W.A Neiswanger
 The effect of correlation is to reduce the range of uncertainty of our prediction. The
prediction based on correlation analysis will be more reliable and near to reality.
- Tippett
 In economic theory we come across several types of variables which show some kind
of relationship. For example, there exists a relationship between price, supply and
quantity demanded; convenience, amenities, and service standards are related to
customer retention; yield of a crop related to quantity of fertilizer applied, type of
soil, quality of seeds, rainfall, and so on. Correlation analysis helps in quantifying
precisely the degree of association and direction of such relationships
 Correlations are useful in the areas of healthcare such as determining the validity
and reliability of clinical measures or in expressing how health problems are related
to certain biological or environmental factors. For example, correlation coefficient
can be used to determine the degree of inter-observer reliability for two doctors
who are assessing a patient’s disease.

Sample Correlation Coefficient, ‘r’, measures the direction and the strength of the linear
association between two numerically paired variables. It varies between +1 and -1. The
values can be interpreted as mentioned in the tables below.
Direction of the Association: The association can be either positive or negative.
Positive Correlation: as the ‘X’ variable increases so does the ‘Y’ variable.

r value    Positive Correlation Interpretation
+1         Perfect positive linear relationship
+0.9       Strong Positive Association
+0.5       Moderate Positive Association
+0.25      Weak Positive Association
0          No linear relationship

Example: In the summer, as the temperature increases, so does thirst.


If ‘r’ equals zero, then there is no linear association between the two variables. 

Negative Correlation: as the ‘X’ variable increases, the ‘Y’ variable decreases.

r value Negative Correlation Interpretation


-1 Perfect Negative linear relationship
-0.9 Strong Negative Association
-0.5 Moderate Negative Association
-0.25 Weak Negative Association

The closer r is to one (in magnitude) the stronger the linear association.  
Example: As the price of an item increases, the number of items sold decreases.

Measures of correlation:
The degree of relationship between the two variables can be measured using the following methods:
a. Scatter diagram.
b. Karl Pearson’s coefficient of correlation.
c. Spearman’s Rank correlation coefficient.

Scatter Diagram
It is a graphical presentation of bivariate data. One variable (X) is taken along the x-axis and the other variable (Y) along the y-axis, and each pair of (X, Y) values is represented by a point on the graph. A rough estimate of correlation can be obtained by reading the scatter diagram as follows:
• If the points lie exactly on a straight line with a positive slope (a line moving in the upward direction), the variables are said to be perfectly positively correlated.
• If the points are clustered around a line with a positive slope, the variables are positively correlated.

• If the points lie exactly on a straight line with a negative slope (a line moving in the downward direction), the variables are said to be perfectly negatively correlated.
• If the points are clustered around a line with a negative slope, the variables are negatively correlated.

• If the points are spread all over the graph with no pattern, the variables are not correlated.

KARL PEARSON’S COEFFICIENT OF CORRELATION


It is a mathematical measure based on covariance and variances. Covariance is a descriptive
measure of the linear association between two variables. Covariance describes the extent to
which a change in one variable (x) is paired with a comparable change in another variable
(y).
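
As a minimal sketch of the covariance idea, the snippet below computes the sample covariance for the temperature and water consumption data given earlier in this session. The function name `sample_covariance` is an illustrative choice, and dividing by N − 1 (sample covariance) is an assumption that matches the (N − 1) appearing in the alternate formula for r. The Celsius conversion is added only to show that covariance, unlike r, changes with the units of measurement.

```python
# Temperature (F) and water consumption (ounces) from the seven-day example.
temperature = [75, 83, 85, 85, 92, 97, 99]
water = [16, 20, 25, 27, 32, 48, 48]


def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by N - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


cov_f = sample_covariance(temperature, water)

# The same days expressed in Celsius: the covariance shrinks by a factor of 5/9.
temperature_c = [(t - 32) * 5 / 9 for t in temperature]
cov_c = sample_covariance(temperature_c, water)

print(f"Covariance (F, ounces): {cov_f:.2f}")
print(f"Covariance (C, ounces): {cov_c:.2f}")
```

A positive covariance agrees with the direction seen in the data (hotter days, more water), but its size depends on the units, which is why the correlation coefficient is preferred for judging strength.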

PROPERTIES OF PEARSON’S CORRELATION COEFFICIENT


 The value of r does not depend upon the units of measurement.
 The value of r does not depend upon which variable is labelled ‘X’ and which is
labelled ‘Y’
 Correlation coefficient lies between -1 and 1. A positive value of r means a positive
linear relationship, a negative value means a negative linear relationship
 If r = ±1, then all the points of the scatter diagram lie exactly on a straight line and
the correlation is said to be positive perfect if r = +1 and negative perfect if r = -1.
 ‘r’ measures only the linear relationship between ‘X’ and ‘Y’
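
The properties above can be checked numerically. The sketch below reuses the temperature and water data from this session; the helper `pearson_r`, the ounce-to-millilitre factor, and the small curved data set (y = x²) are illustrative assumptions and not part of the activity manual.

```python
import math

temperature_f = [75, 83, 85, 85, 92, 97, 99]
water_oz = [16, 20, 25, 27, 32, 48, 48]


def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r_original = pearson_r(temperature_f, water_oz)

# Units: converting F to C and ounces to millilitres leaves r unchanged.
temperature_c = [(t - 32) * 5 / 9 for t in temperature_f]
water_ml = [w * 29.57 for w in water_oz]  # approximate conversion factor
r_converted = pearson_r(temperature_c, water_ml)

# Labels: swapping which variable is called X and which Y leaves r unchanged.
r_swapped = pearson_r(water_oz, temperature_f)

# Linearity: these constructed points follow y = x**2 exactly, yet r is zero,
# because r only measures the linear part of a relationship.
curve_x = [-2, -1, 0, 1, 2]
curve_y = [x ** 2 for x in curve_x]
r_curve = pearson_r(curve_x, curve_y)

print(round(r_original, 4), round(r_converted, 4), round(r_swapped, 4), r_curve)
```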

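Formulae:

Direct Method:

$$ r = \frac{N \sum XY - \sum X \sum Y}{\sqrt{\left[N \sum X^2 - \left(\sum X\right)^2\right]\left[N \sum Y^2 - \left(\sum Y\right)^2\right]}} $$

Alternate Method:

$$ r = \frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y} = \frac{\sum xy}{(N-1)\,\sigma_x \sigma_y} $$

$$ r = \frac{\sum xy}{\sqrt{\sum x^2 \times \sum y^2}} $$

where $x = X - \bar{X}$ and $y = Y - \bar{Y}$.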
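
To see that the direct and alternate methods agree, the following sketch applies both to the temperature and water data from the start of this session. The variable names are illustrative assumptions, and the standard deviations are taken as sample standard deviations (divisor N − 1), which is what makes the covariance form consistent with the (N − 1) in the formula.

```python
import math

X = [75, 83, 85, 85, 92, 97, 99]   # temperature (F)
Y = [16, 20, 25, 27, 32, 48, 48]   # water consumption (ounces)
N = len(X)

# Direct method: uses only the raw sums.
sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x * x for x in X)
sum_y2 = sum(y * y for y in Y)
r_direct = (N * sum_xy - sum_x * sum_y) / math.sqrt(
    (N * sum_x2 - sum_x ** 2) * (N * sum_y2 - sum_y ** 2)
)

# Alternate method: uses deviations x = X - mean(X), y = Y - mean(Y).
mean_x, mean_y = sum_x / N, sum_y / N
dx = [x - mean_x for x in X]
dy = [y - mean_y for y in Y]
sum_dxdy = sum(a * b for a, b in zip(dx, dy))
r_deviation = sum_dxdy / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Covariance form: Cov(x, y) / (sd_x * sd_y), all with divisor N - 1.
sd_x = math.sqrt(sum(a * a for a in dx) / (N - 1))
sd_y = math.sqrt(sum(b * b for b in dy) / (N - 1))
r_covariance = (sum_dxdy / (N - 1)) / (sd_x * sd_y)

print(f"direct={r_direct:.4f} deviation={r_deviation:.4f} covariance={r_covariance:.4f}")
```

All three values coincide, and the result is close to +1, which matches the strong positive association suggested by the data.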

[N1] A travel and leisure magazine provides an annual list of the 500 best hotels in the
world. The magazine provides a rating for each hotel along with a brief description that
includes the size of the hotel, amenities and the cost per night for a double room. A sample
of 12 of the top-rated hotels in India is as follows:

Hotel                       Location                    No. of Rooms   Cost/night (Rs. ’00)
Cubs Trail Resort           Kanha, MP                   220            499
Seasons Resort and Spa      Cochin, Kerala              727            340
Buffalo Inn                 Coorg, Karnataka            285            585
Swasti Heritage Hotel       Udaipur, Rajasthan          273            495
Tiger Den                   Jim Corbett, Uttarakhand    145            495
Snowden Spa and resorts     Dharmashala, HP             213            279
Sun & Sand Beach Resort     Panjim, Goa                 398            279
Sand Stone Beach Resort     Mahabalipuram, TN           343            455
Snow View Towers            Gangtok, Sikkim             250            595
Six Seasons Beach Resort    Vizag, AP                   414            367
Golden Sands                Mapusa, Goa                 400            675
Chiru Towers                Hyderabad, Telangana        700            420

Questions:
a. Develop a scatter diagram with the number of rooms on the horizontal axis and the
cost per night on the vertical axis. Does there appear to be a relationship between
the number of rooms and the cost per night? Discuss.
b. What is the sample correlation coefficient? What does it tell you about the
relationship between the number of rooms and the cost per night for a double
room? Does this appear reasonable? Discuss.

[N2] A newly appointed finance secretary receives feedback from his team in a review meeting about rising unemployment in the country. Coming from a science background, he decides to examine various parameters to understand the real reason behind the rise in the unemployment rate. One of the parameters he selects is industrial production. He asks his team for data on the industrial production index and the number of unemployed people between 2012 and 2019.
He receives the following table of indices of industrial production and the number of registered unemployed people (in lakh), and decides to use correlation analysis to understand the relationship in the data.
Use Karl Pearson’s coefficient of correlation to find out what the finance secretary discovers from the given data.

Year 2012 2013 2014 2015 2016 2017 2018 2019


Index of Production 100 102 104 107 105 112 103 99
Number Unemployed 15 12 13 11 12 12 19 26

[N3] A financial analyst wanted to find out whether inventory turnover influences any
company’s earnings per share (in percent). A random sample of 7 companies listed in a
stock exchange was selected and the following data was recorded for each. Find the
strength of association between inventory turnover and earnings per share. Interpret this
finding to the analyst.

Company    Inventory Turnover (no. of times)    Earnings per share (percent)
A 4 11
B 5 9
C 7 13
D 8 7
E 6 13
F 3 8
G 5 8

[N4] A nutritionist well-known for her nutritional prescriptions to pregnant women wishes
to estimate the association between gestational age and infant birth weight in order to
enhance her prescriptions. For this, a small study is conducted involving 10 infants to
investigate the association between gestational age at birth, measured in weeks, and birth
weight, measured in grams. Calculate the association and give recommendations to the
nutritionist.

Infant ID Gestational Age (In Weeks) Birth Weight (In Grams)


1 35 1895
2 36 2030
3 29 1440
4 40 2835
5 36 3090
6 42 3827
7 40 3260
8 37 2690
9 41 3285
10 38 2920

[N5] The success of a shopping center can be represented as a function of the distance (in
miles) from the center of the population and the number of clients (in hundreds of people)
who will visit. The data is given in the table below:
No. of clients (hundreds)  8   7   6   4   2   1
Distance (miles)           15  19  25  23  34  40
Calculate the linear correlation coefficient.

[N6] As part of a manpower planning exercise by an organization, the following data is collected on the number of units of output and manpower usage per week for a product during the year 2019. Calculate and comment on the association between the output and the manpower used.

Output (no. of units)  60   48   35   30   55   40   80   70
Man hours used         650  450  250  500  550  380  750  700

[N7] A company is introducing a job evaluation scheme in which all jobs are graded by
points for skill, responsibility, and so on. Monthly pay scales (Rs. In ‘000) are then drawn up
according to the number of points allocated and other factors such as experience and local
conditions. To date the company has applied this scheme to 9 jobs. Find out if there is any
association between the monthly pay scales and the number of points and comment on the
result.

Job A B C D E F G H I
Points 5 25 7 19 10 12 15 28 16
Pay (Rs. ‘000) 3.0 5.0 3.25 6.5 5.5 5.6 6.0 7.2 6.1

[N8] An advertising consultant decides to identify the association between the amount spent by his clients on advertisements and the sales they achieved. For this, he selects 10 major brands of beer and gathers the following data on their media expenditures (Rs. million) and shipments. Assist the consultant in identifying the association and provide your recommendation based on the results.

Brand    Media Expenditure (Rs. in Million)    Shipments (in ‘000 cases)
Budweiser 120 36
Bud Light 69 21
Miller Lite 100 16
Coors Light 77 13
Busch 9 8
Natural Light 1 7
Miller Genuine Draft 21 6
Miller High Life 2 4
Busch Light 5 4
Milwaukee’s Best 2 5

[N9] The McDonald’s Corporation is the leading global foodservice retailer with more than
30,000 local restaurants serving nearly 50 million people in more than 119 countries each
day. This global presence, in addition to its consistency in food offerings and restaurant
operations, makes McDonald’s a unique and attractive setting for economists to make salary
and price comparisons around the world. Because the Big Mac hamburger is a standardized
hamburger produced and sold in virtually every McDonald’s around the world, the
Economist, a weekly newspaper focusing on international politics and business news and
opinion, as early as 1986 was compiling information about Big Mac prices as an indicator of
exchange rates. Building on this idea, researchers Ashenfelter and Jurajda proposed
comparing wage rates across countries and the price of a Big Mac hamburger. Shown below
are Big Mac prices and net hourly wage figures (in U.S. dollars) for 27 countries. Note that
net hourly wages are based on a weighted average of 12 professions.
Country Big Mac Price (U.S $) Net Hourly Wage (U.S $)
Argentina 1.42 1.70
Australia 1.86 7.80
Brazil 1.48 2.05
Britain 3.14 12.30
Canada 2.21 9.35
Chile 1.96 2.80
Czech Republic 1.96 2.40
Denmark 4.09 14.40
Source: TB 2 Pg 465
a. Is there a relationship between the price of a Big Mac and the net hourly wages of
workers around the world? If so, how strong is the relationship?
b. Is it possible to develop a model to predict or determine the net hourly wage of a
worker around the world by the price of a Big Mac hamburger in that country? If so,
how good is the model?
c. If a model can be constructed to determine the net hourly wage of a worker around
the world by the price of a Big Mac hamburger, what would be the predicted net
hourly wage of a worker in a country if the price of a Big Mac hamburger was $3.00?

[N10] A list of the highest-grossing Indian films can be found at “https://en.wikipedia.org/wiki/List_of_highest-grossing_Indian_films”. Take the top ten films and find Karl Pearson’s coefficient of correlation between the budget of the film and the worldwide gross. What do you infer from the result obtained?

[N11] In 2020 the world witnessed one of its worst pandemics, the coronavirus disease COVID-19. The number of positive cases and deaths reported is updated regularly on Worldometer, “https://www.worldometers.info/coronavirus/”. Consider the 12 worst-affected nations and calculate Karl Pearson’s coefficient of correlation between the number of positive cases and the number of deaths. Comment on the result.
Session 17: Association Analysis of Ranked Order

Pedagogy: Engaging through Interactive problem solving
Special Activity designed: NO
Software: MS Excel
Preparatory material: Correlation Animated - https://www.youtube.com/watch?v=Kz6-clwb8AA
Detailed Syllabus: Spearman’s Rank Correlation
Post session practice:
1. Numericals from Activity manual
2. Rank Correlation in Bit Coin Analysis - https://bitcoinist.com/no-bitcoin-and-gold-prices-are-not-correlated/

After understanding the Karl Pearson’s coefficient of correlation, we move forward to learn
another method of arriving at correlation coefficient between paired data of given variables
– Spearman’s Rank method. Spearman's rank correlation coefficient is a measure of rank
correlation (statistical dependence between the rankings of two variables).

SPEARMAN’S RANK CORRELATION:

At times we need to measure the strength of the monotonic relationship between variables using data which can be trusted only to the extent of its rank ordering. The rank correlation coefficient may be used in many such situations, for which the conventional correlation coefficient is unsuitable.

Formulae:

$$ R = 1 - \frac{6 \sum D^2}{N^3 - N} $$

With Rank Repetition:

Correction factor:

$$ C.F. = \frac{m_1^3 - m_1}{12} + \frac{m_2^3 - m_2}{12} + \frac{m_3^3 - m_3}{12} + \dots $$

$$ R = 1 - \frac{6\left\{\sum D^2 + C.F.\right\}}{N^3 - N} $$

Where
• $D_i = R_{X_i} - R_{Y_i}$ and N = no. of observations
• C.F. = Correction Factor
• $m_i$ = number of times the i-th repeated observation occurs
When the given pairs of observations in the data set are not ranked, ranks are assigned by giving rank 1 to either the highest or the lowest value, using the same rule for both variables.
Source: http://www.brainkart.com/article/Spearman---s-Rank-Correlation-Coefficient_39249/#:~:text=Interpretation,the%20stronger%20the%20monotonic%20relationship.

While ranking the observations as described above, we may find that two or more observations are equal. In such a case, each of the tied observations is assigned the average of the ranks they would have received had they differed from each other. For example, if two observations are tied at fourth place, they would otherwise occupy ranks 4 and 5, so the average rank (4 + 5) / 2 = 4.5 is assigned to both. If three observations are tied across ranks 3, 4 and 5, each is assigned (3 + 4 + 5) / 3 = 4.
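
The ranking rule and the rank-repetition formula can be sketched in a few lines. The example reuses the temperature and water data from Session 16, because it happens to contain one tie in each variable (85 appears twice, 48 appears twice). The function names are illustrative assumptions, and ranks are assigned with the lowest value as rank 1.

```python
def average_ranks(values):
    """Rank with 1 = lowest; tied values share the average of their positions."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1             # first position the value occupies
        count = ordered.count(v)                 # how many observations share it
        ranks.append(first + (count - 1) / 2)    # average of first .. first+count-1
    return ranks


def correction_factor(values):
    """Sum of (m^3 - m) / 12 over every value repeated m > 1 times."""
    return sum(
        (values.count(v) ** 3 - values.count(v)) / 12
        for v in set(values)
        if values.count(v) > 1
    )


def spearman_r(xs, ys):
    """Spearman's R using the rank-repetition formula from this session."""
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    cf = correction_factor(xs) + correction_factor(ys)
    return 1 - 6 * (sum_d2 + cf) / (n ** 3 - n)


temperature = [75, 83, 85, 85, 92, 97, 99]
water = [16, 20, 25, 27, 32, 48, 48]

print(average_ranks(temperature))   # the two 85s share rank 3.5
print(average_ranks(water))         # the two 48s share rank 6.5
print(round(spearman_r(temperature, water), 4))
```

When no value is repeated, the correction factor is zero and the function reduces to the basic formula R = 1 − 6ΣD² / (N³ − N).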

[N1] In one of the recruitment drives, Tata Motors Limited (TML) decided that they would
select a group of employees for skill based training on the basis of aptitude tests. On
completion of training, the quality of their work is assessed and they are again ranked as
follows, where ‘X’ denotes aptitude ranking and ‘Y’ denotes quality ranking. Calculate the
rank correlation and comment on the selection of employees.

X 2 1 3 7 6 8 4 5 10 9
Y 3 2 1 8 4 9 5 6 10 7

[N2] The following table provides data about the percentage of students who have qualified
for a scholarship offered by the state universities and their CGPA scores. Calculate the
Spearman’s Rank Correlation between the two and interpret the result.

State University    % of students qualified for scholarship    % of students scoring above 8.5 CGPA
Bangalore 14 54
Delhi 7 64
Mumbai 27 44
Jaipur 33 32
Kolkata 38 37
Raipur 16 68
Vishakhapatnam 5 62
Trichy 8 43
Bhopal 29 49
Cuttack 18 52

[N3] Following a tradition of many years, the Department of Horticulture, Karnataka organized its annual Republic Day flower show at Lalbagh Botanical Garden, Bengaluru from 17th to 28th January 2020. As is the practice, the best theme would be awarded. To judge the display of various flowers, a panel comprising three judges was appointed. There were eight participants, who were ranked by the panel based on mutually agreed criteria. The panel’s rankings are as follows:

Participant No. 1 2 3 4 5 6 7 8
Judge 1 4 5 2 1 6 8 7 3
Judge 2 3 2 6 8 1 5 7 4
Judge 3 1 5 3 6 8 7 4 2

Using Spearman’s rank correlation coefficient, name two among the three judges who have
closer views regarding the display of flowers.

[N4] From the Covid-19 data compiled by the Ministry of Health and Family Welfare, India, the following data is selected to measure the association between the number of active cases, the number of cured cases and the number of deaths. Using Spearman’s rank correlation coefficient, name the two among the three variables which have the closest association.

State    No. of active cases    No. of cured cases    No. of deaths
Karnataka 405 426 31
Gujarat 5248 2780 513
Bihar 364 377 6
Rajasthan 1611 2264 113
Kerala 26 489 6
Madhya Pradesh 1817 1747 221
Assam 29 34 2
Punjab 1678 168 31
Data Source: https://www.mohfw.gov.in/ as on 12 May 2020.

[N5] The following data corresponds to the scores of a student of MBA at Jain University in
continuous assessment in 2nd Semester. His mentor wishes to know if there is any
association between the marks scored by the student in two subjects. Use Spearman’s rank
correlation analysis to measure the association and interpret the result.

X 78 42 90 24 73 80 91 62 65 42
Y 84 51 92 43 75 54 86 54 54 43

[N6] TVS Motor Company is about to launch their new 100 CC stylish scooter targeted at the youth. As part of the testing process, they decide to invite the general public to test drive the scooter in order to evaluate its mileage. For this experiment, the company selects two youths as test drivers from two different colleges in Bengaluru and Chennai. Each driver is supposed to travel on 9 random routes and record observations. The observations are as follows:
X 41 49 52 35 41 42 30 50 48
Y 51 44 44 47 49 51 28 39 22

Use Spearman’s rank correlation analysis to measure the association between the two drivers and interpret the result.
Session 18: Business Prediction Models

Pedagogy: Interactive book and board problem solving
Special Activity designed: NO
Software: NO
Preparatory material: Regression Analysis Animated Explainer for Harvard Business Review - https://www.youtube.com/watch?v=XYU8WT86R6o
Detailed Syllabus: Regression analysis: need for regression, advantages of regression analysis, and types of regression models.
Post session practice:
1. Numericals from Activity manual
2. Forecasting Models - https://www.logisticsmanager.com/logistics-manager-analysis-forecasting-and-sop-big-data-big-decisions/

INTRODUCTION

It is a human tendency to want to know what might happen in the future. Organizations are no different; after all, they are the product of human intelligence. Irrespective of sector, organizations are in a race to predict their future, be it in terms of opportunities or challenges, and they keep looking for ways to do so.
On the other hand, another set of organizations tries to take maximum advantage of this appetite for prediction. They combine older strategies with the IT explosion to arrive at the best possible prediction.
Among the oldest prediction strategies (from now on, let us use the more appropriate corporate term, forecasting), the most common is to forecast from historical data. As the data grew, so did the need to analyse it with relevant tools, and statistical tools were the result. Regression analysis is one of many such statistical forecasting tools; others include simulation and exponential smoothing.
Regression analysis is one of the most tried and tested, popular statistical tools for forecasting.
REGRESSION ANALYSIS

It was Sir Francis Galton who first used the term regression as a statistical concept in 1877.
He made a statistical study that showed that the height of children born to tall parents
tends to ‘regress’ towards the mean height of population. Galton used the term regression
as a statistical technique to predict one variable (the height of children) from another
variable (the height of parents). This is called ‘regression’ or ‘simple regression’ confined to
bivariate data.
In many business decisions it is necessary to predict the value of unknown variables.
Regression analysis tells us how one variable is related to another by providing an equation
that allows us to use the known value of one or more variables, to estimate the unknown
value of the remaining variable.
A statistical model is a set of mathematical formulas and assumptions which describe a real-world situation. In this sense, simple linear regression and multiple regression are both statistical models. A statistical model tries to capture the systematic behaviour of the given data, leaving out those factors that cannot be foreseen or predicted. These factors are the errors.
A good statistical model is one which provides as large a systematic component as possible, minimising errors.
As a first step, we choose a particular model, say a linear regression model, for describing the relationship between the two variables. As a second step, we work out the estimates of the model parameters on the basis of random sample data. The third step is to consider the errors, called residuals, that arise when the model is fitted to the data. When we are convinced that the residuals contain only pure randomness, we consider our model appropriate for its intended purpose, which is usually to make predictions.

Regression analysis is a mathematical measure which helps to determine the probable form of the relationship between variables, and it is used to predict or estimate the value of one variable corresponding to a given value of another variable. The variable being predicted is called the dependent variable, and the variable used to predict the value of the dependent variable is called the independent variable.
The simplest type of regression analysis involving one independent variable and one
dependent variable in which the relationship between the variables is approximated by a
straight line is called linear regression.
Regression analysis involving two or more independent variables is called multiple
regression analysis.
The relationship between two variables is quantified by representing the line of best fit as a
mathematical equation known as regression equation. In other words, the linear
relationship between two variables can be described by a straight line, which is known as
regression line.

The equation for a straight line is Y = a + bX, where ‘Y’ is the dependent variable, ‘X’ is the independent variable, ‘a’ is the Y-intercept, which is the point at which the regression line crosses the Y-axis (the vertical axis), and ‘b’ is the slope of the regression line. It should be noted that the values of both ‘a’ and ‘b’ will remain constant for any given straight line.

The most commonly used regression lines are straight lines whose equations are,
$Y = a_1 + b_1 X$ (Equation 1)
$X = a_2 + b_2 Y$ (Equation 2)
In equation 1, X is the independent variable (I.V) and Y is the dependent variable (D.V); $a_1$ is the Y-intercept (the point at which the line crosses the Y-axis) and $b_1$ is the regression coefficient or slope.
In equation 2, Y is the I.V and X is the D.V; $a_2$ is the X-intercept (the point at which the line crosses the X-axis) and $b_2$ is the regression coefficient. Plotted with X on the horizontal axis, this line has slope $1/b_2$.

LEAST SQUARE METHOD

This method provides an estimated regression equation that minimizes the sum of squares of the deviations between the observed values of the dependent variable (Y) and its estimated values ($\hat{Y}$). This is the least squares criterion for choosing the equation that provides the best fit.
Our objective is to fit a straight line Y = a + bX to the given points. Ideally the line would pass through every point, but this is not possible in most cases. So we look for the line that keeps all the points, taken together, as close to it as possible.

The goal is to minimize the sum of the squares of the errors of the data points, where the error for the i-th point is
$E_i = Y_i - (a + bX_i)$. Minimizing this sum also minimizes the Mean Square Error.
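
A minimal sketch of the least squares criterion, using the temperature and water data from Session 16 as X and Y. The slope is computed with the standard closed form b = (NΣXY − ΣXΣY) / (NΣX² − (ΣX)²) and the intercept as a = Ȳ − bX̄. The helper `sum_squared_errors` and the size of the nudges are illustrative assumptions; they are there only to show that moving away from the fitted a and b makes the sum of squared errors larger.

```python
X = [75, 83, 85, 85, 92, 97, 99]   # temperature (F), independent variable
Y = [16, 20, 25, 27, 32, 48, 48]   # water consumption (ounces), dependent variable
N = len(X)

sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x * x for x in X)

# Least squares estimates of the slope and intercept.
b = (N * sum_xy - sum_x * sum_y) / (N * sum_x2 - sum_x ** 2)
a = sum_y / N - b * (sum_x / N)


def sum_squared_errors(intercept, slope):
    """Sum of E_i^2 where E_i = Y_i - (intercept + slope * X_i)."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(X, Y))


residuals = [y - (a + b * x) for x, y in zip(X, Y)]
best_sse = sum_squared_errors(a, b)

print(f"Fitted line: Y = {a:.3f} + {b:.3f} X")
print(f"Sum of squared errors at the fit: {best_sse:.3f}")
print(f"After nudging the slope:          {sum_squared_errors(a, b + 0.05):.3f}")
print(f"After nudging the intercept:      {sum_squared_errors(a + 1.0, b):.3f}")
```

The residuals of the fitted line add up to zero (apart from rounding), which is a quick check that the line has been computed correctly.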

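REGRESSION EQUATIONS
Regression equation of y on x is

$$ (Y - \bar{Y}) = b_{yx}\,(X - \bar{X}) $$

$$ b_{yx} = \frac{N \sum XY - \sum X \sum Y}{N \sum X^2 - \left(\sum X\right)^2} $$

Regression equation of x on y is

$$ b_{xy} = \frac{N \sum XY - \sum X \sum Y}{N \sum Y^2 - \left(\sum Y\right)^2} $$

$$ (X - \bar{X}) = b_{xy}\,(Y - \bar{Y}) $$

$$ (X - \bar{X}) = r\left(\frac{\sigma_x}{\sigma_y}\right)(Y - \bar{Y}) $$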
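REGRESSION COEFFICIENTS
The factor $r\left(\frac{\sigma_y}{\sigma_x}\right)$, which represents the increment in the value of the dependent variable y corresponding to a unit change in the value of the independent variable x, is known as the regression coefficient of y on x and is denoted by $b_{yx}$. Similarly, $b_{xy}$ is the regression coefficient of x on y.

$$ b_{yx} = r\left(\frac{\sigma_y}{\sigma_x}\right) $$

$$ b_{xy} = r\left(\frac{\sigma_x}{\sigma_y}\right) $$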

Properties of Regression Coefficients:


• The correlation coefficient is the geometric mean of the regression coefficients, r = ±√(byx × bxy), and takes the sign that the two coefficients share.
• The arithmetic mean of the regression coefficients is greater than or equal to the correlation coefficient (in magnitude).
• Regression coefficients are independent of change of origin but not of scale.
• If one of the regression coefficients is greater than unity (in magnitude), the other must be less than unity.
• Both the regression coefficients will have the same sign, either positive or negative.
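
These properties can be verified on data. The sketch below again uses the temperature and water figures from Session 16; the helper `coefficients`, the shift constants (80 and 20) and the scale factor (10) are illustrative assumptions chosen only to demonstrate change of origin and change of scale.

```python
import math

X = [75, 83, 85, 85, 92, 97, 99]   # temperature (F)
Y = [16, 20, 25, 27, 32, 48, 48]   # water consumption (ounces)


def coefficients(xs, ys):
    """Return (b_yx, b_xy, r) computed from the raw-sum formulas."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    numerator = n * sxy - sx * sy
    den_x = n * sxx - sx ** 2
    den_y = n * syy - sy ** 2
    return numerator / den_x, numerator / den_y, numerator / math.sqrt(den_x * den_y)


b_yx, b_xy, r = coefficients(X, Y)

# Geometric mean: r squared equals the product of the two coefficients.
product = b_yx * b_xy

# Change of origin: subtracting constants leaves both coefficients unchanged.
b_yx_shift, b_xy_shift, _ = coefficients([x - 80 for x in X], [y - 20 for y in Y])

# Change of scale: dividing X by 10 multiplies b_yx by 10 and divides b_xy by 10.
b_yx_scale, b_xy_scale, _ = coefficients([x / 10 for x in X], Y)

print(f"b_yx={b_yx:.4f} b_xy={b_xy:.4f} r={r:.4f} r^2={r * r:.4f} product={product:.4f}")
print(f"arithmetic mean of coefficients = {(b_yx + b_xy) / 2:.4f}")
```

For this data one coefficient is above 1 and the other below 1, both are positive, and their arithmetic mean exceeds r, in line with the list above.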
Session 19: Analytical Linear Regression

Pedagogy: Interactive book and board problem solving
Special Activity designed: NO
Software: MS Excel
Preparatory material: Linear Regression: Fun and Easy - https://www.youtube.com/watch?v=CtKeHnfK5uA
Detailed Syllabus: Simple Linear Regression
Post session practice:
1. Numericals from Activity manual
2. IIMR prediction on Covid-19 using Regression Analysis - https://theprint.in/india/iim-study-predicts-1-5-lakh-covid-19-cases-by-may-first-week-as-result-of-tablighi-event/397810/
3. Simple Linear Regression - https://www.youtube.com/watch?v=nHf8PdSvbJ0

Introduction
In Session 18, we covered the theory of regression analysis. However, it is difficult to
appreciate the application of regression analysis, or of any statistical tool for that matter,
unless we apply it to data and see how it works. In this session, we shall apply regression
analysis to data sets from multiple sectors and understand what the data has to tell us.

[N1] The Government of India has been announcing many reforms during this pandemic period.
As part of this activity, it has asked the Ministry of Commerce and Industry to model the
relationship between the import and export values of the electronics sector in the country.
The ministry has gathered data for 2013-14 to 2018-19 from DGCIS for the prediction. Use
regression analysis to model the relation of imports on exports and vice versa. Also predict
the imports for the year 2020-21, given that exports will be USD 11 Billion.

Year 2013-14 2014-15 2015-16 2016-17 2017-18 2018-19
Imports (US $ Billion) 32 36 40 42 51 55
Exports (US $ Billion) 8 6 6 7 7 9
Data Source: https://commerce.gov.in/
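
One way to work through N1 is sketched below in Python rather than MS Excel. It fits both regression lines to the import and export figures from the table and evaluates the imports-on-exports line at exports of USD 11 Billion. The function name `slope_and_intercept` and the variable names are assumptions made for this illustration.

```python
def slope_and_intercept(xs, ys):
    """Return (a, b) for the least-squares line ys = a + b * xs."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


# Figures from the N1 table (US $ Billion), 2013-14 to 2018-19.
imports = [32, 36, 40, 42, 51, 55]
exports = [8, 6, 6, 7, 7, 9]

# Regression of imports on exports (exports is the independent variable).
a_ie, b_ie = slope_and_intercept(exports, imports)
# Regression of exports on imports (imports is the independent variable).
a_ei, b_ei = slope_and_intercept(imports, exports)

predicted_imports = a_ie + b_ie * 11

print(f"Imports = {a_ie:.2f} + {b_ie:.2f} * Exports")
print(f"Exports = {a_ei:.2f} + {b_ei:.3f} * Imports")
print(f"Predicted imports at exports of 11: {predicted_imports:.2f}")
```

With these figures the sketch gives approximately Imports = 18.20 + 3.41 × Exports, and so predicted imports of roughly USD 55.76 Billion. Exports of 11 lie outside the observed range of 6 to 9, so this estimate is an extrapolation and should be read with caution.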
[N2] Ralison Appliances Pvt. Ltd. manufactures different types of electrical appliances in
India. It has been using radio (FM) for advertising its products. The following table shows the
amounts of radio time and the number of electrical appliances sold over seven weeks. Fit
linear equations of radio time on the number of electrical appliances sold and vice-versa.
Also calculate the sales when the radio time is 24 minutes.

Radio Time (Minutes) 25 18 32 21 35 28 30
No. of appliances sold 16 11 20 15 26 32 20

[N3] The editor-in-chief of Bangalore Mirror has been trying to convince the paper’s owner
to improve the working conditions in the press room. He is convinced that the noise level,
when the presses are running, creates unhealthy levels of tension and anxiety. He recently
had a psychologist conduct a test during which pressmen were placed in rooms with varying
levels of noise and then given a test to measure mood and anxiety levels. The following
table shows the index of their degrees of nervousness and the level of noise to which they
were exposed (5 is low and 10 is high).

Noise Level 7.0 6.5 5.5 6.0 8.0 8.5 6.0 6.5
Degree of Nervousness 23 38 45 36 16 18 39 41

Develop the estimating equations. Also predict the degree of nervousness that we might
expect when the noise level is 7.5.

[N4] As part of a study on transportation safety, the Karnataka State Government collected
data on the number of fatal accidents per 1000 licenses and the percentage of licensed drivers
under the age of 21 in 10 cities of the state. The data is tabulated below:

Fatal accidents per 1000 licenses 2.6 3.8 0.8 1.2 0.6 1 2.8 1.4 1.8 2
Percent of licensed drivers under 21 years 17 18 8 13 6 9 16 12 9 10
Fit linear equations to the above data.

[N5] A researcher at Jain University wished to investigate whether there is any relationship
between atmospheric temperature (in °C) and the number of Covid-19 cases. In this regard,
the researcher collected the following data from 12 randomly selected states in India. Model
the relation between the temperature and the Covid-19 cases. Also estimate the number of
Covid-19 cases when the temperature is 8°C.

State  Average Temperature (°C)  No. of Covid-19 Cases
HP 17 38
J&K 18 540
UP 33 2055
Delhi 32.5 5894
Chhattisgarh 35 29
West Bengal 30.5 1818
Karnataka 27 535
Andhra Pradesh 32 952
Tamil Nadu 34 8423
Gujarat 38 6906
Goa 30 10
Rajasthan 37 2522
Data Source: www.accuweather.com and www.hindustantimes.com

[N6] Hyundai Motor India Ltd has recently held 3-day road-side exhibits on the introduction
of its new model of Creta. The number of sales personnel employed at each of a sample of
10 exhibitions and the number of cars booked at each one are given as follows:

No. of Salesmen 5 8 6 8 9 3 5 4 6 6
No. of Cars booked 132 160 148 156 168 102 142 98 152 142

Using these data, regress the number of cars booked on the number of salesmen and obtain
the regression equation. Also estimate the number of cars booked if 10 salesmen are
employed at an exhibition.

[N7] ITI Limited recorded data showing the experience of machine operators and their
performance rating as given by the number of good parts turned out per 100 pieces.

Operator 1 2 3 4 5 6 7 8
Experience (Years) 16 12 18 4 3 10 5 12
Performance Rating 87 88 89 68 78 80 75 82

Obtain the regression equation of performance rating on experience. Use this equation to
estimate the probable performance if an operator has 7 years of experience.

[N8] People in the aerospace industry believe the cost of a space project is a function of the
weight of the major object being sent into space. Use the following data to develop a
regression model to predict the cost of a space project by the weight of the space object.

Weight (Tons) 2 3 1 1.5 1.25 2 2.5
Cost ($ millions) 53.5 185 7 24 34 110 104
Session 20: Multiple Linear Regression

Pedagogy: Discussion and illustration using real-time examples
Special Activity designed: Correlation and Regression through Fitness and Games – CA 3
Software: NO
Preparatory material: Multiple Linear Regression, The Very Basics - https://www.youtube.com/watch?v=dQNpSa-bq4M
Detailed Syllabus: Concepts of Multiple Linear Regression, Numerical of Curve Fitting in Quadratic and Exponential Methods.
Post session practice: Application of Multiple Linear Regression - https://explorable.com/multiple-regression-analysis

Introduction
People have learnt the hard way that most outcomes are the result not of one cause but of many:
several causes lead to one single effect. Expressed in terms of dependent and independent
variables, a dependent variable may depend not on just one independent variable but on many.
For example, the sale of a particular product (dependent variable) depends not only on its
price but also on quality, features, competition, alternate products, market conditions, etc.
(independent variables). In order to model this relationship, we use multiple linear regression
analysis.

Multiple Linear Regression


Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique
that uses several explanatory variables to predict the outcome of a response variable. The goal of
multiple linear regression (MLR) is to model the linear relationship between the explanatory
(independent) variables and response (dependent) variable.

In essence, multiple regression is the extension of ordinary least-squares (OLS) regression that
involves more than one explanatory variable.

A simple linear regression is a function that allows an analyst or statistician to make predictions
about one variable based on the information that is known about another variable. Linear regression
can only be used when one has two continuous variables—an independent variable and a
dependent variable. The independent variable is the parameter that is used to calculate the
dependent variable or outcome. A multiple regression model extends to several explanatory
variables.
For example, an analyst may want to know how the movement of the market affects the price of
Exxon Mobil (XOM). In this case, his linear equation will have the value of the S&P 500 index as the
independent variable, or predictor, and the price of XOM as the dependent variable.
In reality, there are multiple factors that predict the outcome of an event. The price movement of
Exxon Mobil, for example, depends on more than just the performance of the overall market. Other
predictors such as the price of oil, interest rates, and the price movement of oil futures can affect
the price of XOM and stock prices of other oil companies. To understand a relationship in which
more than two variables are present, a multiple linear regression is used.

Multiple linear regression (MLR) is used to determine a mathematical relationship among a number
of random variables. In other terms, MLR examines how multiple independent variables are related
to one dependent variable. Once each of the independent factors has been determined to predict
the dependent variable, the information on the multiple variables can be used to create an accurate
prediction on the level of effect they have on the outcome variable. The model creates a relationship
in the form of a straight line (linear) that best approximates all the individual data points.

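A multiple linear regression equation with two independent variables is given as:

Y = a + b1X1 + b2X2

where a is the intercept and b1 and b2 are the regression coefficients of X1 and X2.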
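
To make the two-predictor equation concrete, the sketch below estimates a, b1 and b2 by least squares, solving the two normal equations written in terms of deviations from the means. The data are made up so that Y = 1 + 2·X1 + 3·X2 holds exactly, which makes the result easy to check. The function name `fit_two_predictors` and the data are assumptions for illustration.

```python
def fit_two_predictors(x1, x2, y):
    """Return (a, b1, b2) for the least-squares fit Y = a + b1*X1 + b2*X2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    # Sums of squares and cross-products of deviations from the means.
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
    s2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))
    # Normal equations: s11*b1 + s12*b2 = s1y and s12*b1 + s22*b2 = s2y.
    det = s11 * s22 - s12 ** 2
    if abs(det) < 1e-12:
        raise ValueError("X1 and X2 are perfectly collinear; no unique fit")
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    a = my - b1 * m1 - b2 * m2
    return a, b1, b2


# Hypothetical data constructed so that Y = 1 + 2*X1 + 3*X2 exactly.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [1 + 2 * u + 3 * v for u, v in zip(x1, x2)]

a, b1, b2 = fit_two_predictors(x1, x2, y)
print(f"Y = {a:.2f} + {b1:.2f}*X1 + {b2:.2f}*X2")
```

With more than two independent variables the same idea is usually handled with matrix methods or a spreadsheet or statistics package rather than by hand.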

CURVE FITTING
Curve fitting is the process of constructing a curve, or mathematical function that has the best fit to a
series of data points, possibly subject to constraints. Fitted curves can be used as an aid for data
visualization, to infer values of a function where no data are available, and to summarize the
relationships among two or more variables.
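
As one hedged illustration of curve fitting, the sketch below fits an exponential curve Y = a·e^(bX) by taking logarithms, which turns the problem into a straight-line fit of ln(Y) on X. This is only one common approach and it minimizes squared errors on the log scale, not on the original scale. The data are generated from a = 2 and b = 0.5 so that the result can be verified. The function name `fit_exponential` is an assumption.

```python
import math


def fit_exponential(xs, ys):
    """Fit Y = a * exp(b * X) via a straight-line fit of ln(Y) on X."""
    log_ys = [math.log(y) for y in ys]  # requires every Y to be positive
    n = len(xs)
    sum_x, sum_ly = sum(xs), sum(log_ys)
    sum_xly = sum(x * ly for x, ly in zip(xs, log_ys))
    sum_xx = sum(x * x for x in xs)
    # Least-squares slope and intercept of ln(Y) = ln(a) + b*X.
    b = (n * sum_xly - sum_x * sum_ly) / (n * sum_xx - sum_x ** 2)
    ln_a = (sum_ly - b * sum_x) / n
    return math.exp(ln_a), b


# Hypothetical data generated from Y = 2 * exp(0.5 * X).
xs = [0, 1, 2, 3, 4]
ys = [2 * math.exp(0.5 * x) for x in xs]

a, b = fit_exponential(xs, ys)
print(f"Y = {a:.3f} * exp({b:.3f} * X)")
```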
CA3: CORRELATION AND REGRESSION THROUGH FITNESS AND GAMES

Assessment 3:
Syllabus coverage: Module 4
Timeline: Session 22
Assessment marks: 10
Objective: This activity is aimed at evaluating the ability of students to identify and establish
association between variables. This is conducted based on the activity 4 – Correlation and Regression
through fitness and games.
Debriefing duration: 10 minutes

Project duration: 40 minutes

Activity Space: Class rooms, corridors and empty spaces meant for students’ activities

Submission style: A report consisting of customized covering sheet, data set, solution,
interpretation, color photograph of the group holding the project, conclusion.

Evaluation: Respective professors will evaluate based on the rubrics mentioned below.

ASSESSMENT RUBRICS
Completion of Activity 2 marks
Data compilation and preparation 2 marks
Data analysis (calculation) 2 marks
Drawing inference (interpretation) 2 marks
Viva-voce 2 marks
TOTAL 10 Marks
