BSADM Module 4 Session 17 22 KSR
Module No. 4
Topic Measures of Association
No. of Hours 6
No. of Sessions 5
Session numbers 16 to 20
Session 16 Comparative Analysis
Session 17 Association Analysis of Ranked Order
Session 18 Business Prediction Models
Session 19 Analytical Linear Regression
Session 20 Multiple Linear Regression
Learning Outcomes: Use correlation to identify variables that may be used in model building and
performing simple linear regression for finding association relationship
Skillset acquired: Ability to identify and understand association between business indicators
and apply this to functional, semi-structured decision making including, but
not limited to, forecasting
We live in an environment full of variables. Some we understand; most we do not. To operate
well in this environment, we need to understand these variables as fully as possible, and a
major part of that understanding is how they are associated. We need to answer the
following questions:
How are the variables around us related to each other?
Are they really related to each other?
If yes, how strong is their relationship, and in which direction are they related?
This session addresses these questions through an important statistical concept used in
relationship analysis: Correlation Analysis. By applying this concept, one will be able to
understand the relationship between the variables of interest.
Introduction to Correlation Analysis:
Often, an analysis of data concerning two or more quantitative variables is needed to
look for any statistical relationship or association between them. Knowledge of such a
relationship is important for making inferences in a given situation.
Let us consider an example. Typically, in the summer as the temperature increases people
are thirstier. Consider the two numerical variables, temperature and water consumption.
We would expect the higher the temperature, the more water a given person would
consume. Thus we would say that in the summer, temperature and water consumption are
positively correlated.
For seven random summer days, a person recorded the temperature and their water
consumption, during a three-hour period spent outside.
Temperature (F) Water Consumption (Ounces)
75 16
83 20
85 25
85 27
92 32
97 48
99 48
The graph below helps visualize what appears to be a somewhat linear relationship between
temperature and the amount of water one drinks.
Sample Correlation Coefficient, ‘r’, measures the direction and the strength of the linear
association between two numerically paired variables. It varies between +1 and -1. The
values can be interpreted as mentioned in the tables below.
Direction of the Association: The association can be either positive or negative.
Positive Correlation: as the ‘X’ variable increases so does the ‘Y’ variable.
Negative Correlation: as the ‘X’ variable increases, the ‘Y’ variable decreases.
Example: As the price of an item increases, the number of items sold decreases.
The closer r is to one (in magnitude), the stronger the linear association.
Measures of correlation:
The degree of relationship between two variables can be measured using the following
methods:
a. Scatter diagram.
b. Karl Pearson’s coefficient of correlation.
c. Spearman’s Rank correlation coefficient.
Scatter Diagram
It is a graphical presentation of bivariate data. One variable (X) is taken along the x-axis
and the other variable (Y) is taken along the y-axis, and each pair of (X, Y) values is
represented by a point on the graph. A rough estimate of correlation can be obtained
from the following scatter diagram patterns.
If the points lie exactly on a line with a positive slope (a line moving in the upward
direction), the variables are said to be perfectly positively correlated.
If the points are clustered around a line with a positive slope, the variables are
positively correlated.
If the points lie exactly on a line with a negative slope (a line moving in the downward
direction), the variables are said to be perfectly negatively correlated.
If the points are clustered around a line with a negative slope, the variables are
negatively correlated.
If the points are spread all over the graph, the variables are not correlated.
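Formulae:
Direct Method:
$$r = \frac{N \sum XY - \sum X \sum Y}{\sqrt{N \sum X^2 - (\sum X)^2}\,\sqrt{N \sum Y^2 - (\sum Y)^2}}$$
Alternate Method:
$$r = \frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y} = \frac{\sum xy}{(N-1)\,\sigma_x \sigma_y}$$
$$r = \frac{\sum xy}{\sqrt{\sum x^2 \times \sum y^2}}$$
where $x = X - \bar{X}$ and $y = Y - \bar{Y}$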
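As an illustration, the two methods can be checked against each other on the temperature and water consumption figures from the introductory example. This is a minimal sketch in Python; the function names `pearson_direct` and `pearson_deviation` are chosen here for illustration and are not part of the module.

```python
from math import sqrt

# Temperature (F) and water consumption (ounces) from the seven-day example.
temperature = [75, 83, 85, 85, 92, 97, 99]
water = [16, 20, 25, 27, 32, 48, 48]


def pearson_direct(xs, ys):
    """Direct method: uses the raw sums of X, Y, XY, X^2 and Y^2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


def pearson_deviation(xs, ys):
    """Alternate method: uses deviations of each value from its mean."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    return sum_dxdy / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


r_direct = pearson_direct(temperature, water)
r_deviation = pearson_deviation(temperature, water)
print(f"r (direct method)    = {r_direct:.4f}")
print(f"r (alternate method) = {r_deviation:.4f}")
```

Both methods give r of about 0.96 for this data, which agrees with the strong positive association described in the example.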
[N1] A travel and leisure magazine provides an annual list of the 500 best hotels in the
world. The magazine provides a rating for each hotel along with a brief description that
includes the size of the hotel, amenities and the cost per night for a double room. A sample
of 12 of the top-rated hotels in India is as follows:
Questions:
a. Develop a scatter diagram with the number of rooms on the horizontal axis and the
cost per night on the vertical axis. Does there appear to be a relationship between
the number of rooms and the cost per night? Discuss.
b. What is the sample correlation coefficient? What does it tell you about the
relationship between the number of rooms and the cost per night for a double
room? Does this appear reasonable? Discuss.
[N2] A newly appointed finance secretary receives feedback from his team in a review
meeting about the rising unemployment in the country. Coming from a science
background, he decides to examine various parameters to understand the real reason behind
the rise in the unemployment rate. One of the parameters he selects is industrial
production. He asks his team for data on the industrial production index and the number of
unemployed people between 2012 and 2019.
He gets the following table, which gives indices of industrial production and the number of
registered unemployed people (in lakh). He decides to use correlation analysis to
understand the relationship in the given data.
Use Karl Pearson’s coefficient of correlation to find out what the finance
secretary discovers from the given data.
[N3] A financial analyst wanted to find out whether inventory turnover influences a
company’s earnings per share (in percent). A random sample of 7 companies listed on a
stock exchange was selected and the following data was recorded for each. Find the
strength of association between inventory turnover and earnings per share. Interpret this
finding for the analyst.
[N4] A nutritionist well-known for her nutritional prescriptions to pregnant women wishes
to estimate the association between gestational age and infant birth weight in order to
enhance her prescriptions. For this, a small study is conducted involving 10 infants to
investigate the association between gestational age at birth, measured in weeks, and birth
weight, measured in grams. Calculate the association and give recommendations to the
nutritionist.
[N5] The success of a shopping center can be represented as a function of the distance (in
miles) from the center of the population and the number of clients (in hundreds of people)
who will visit. The data is given in the table below:
No. Customer 8 7 6 4 2 1
Distance 15 19 25 23 34 40
Calculate the linear correlation coefficient.
[N7] A company is introducing a job evaluation scheme in which all jobs are graded by
points for skill, responsibility, and so on. Monthly pay scales (Rs. in ‘000) are then drawn up
according to the number of points allocated and other factors such as experience and local
conditions. To date the company has applied this scheme to 9 jobs. Find out if there is any
association between the monthly pay scales and the number of points and comment on the
result.
Job A B C D E F G H I
Points 5 25 7 19 10 12 15 28 16
Pay (Rs. ‘000) 3.0 5.0 3.25 6.5 5.5 5.6 6.0 7.2 6.1
[N8] An advertising consultant decides to identify the association between the amount
spent by his clients on advertisements and the sales they achieved. For this, he selects 10
major brands of beer. The following data show the media expenditures (Rs. Million) and the
shipments. Assist the consultant in identifying the association and provide your
recommendation based on the results.
[N9] The McDonald’s Corporation is the leading global foodservice retailer with more than
30,000 local restaurants serving nearly 50 million people in more than 119 countries each
day. This global presence, in addition to its consistency in food offerings and restaurant
operations, makes McDonald’s a unique and attractive setting for economists to make salary
and price comparisons around the world. Because the Big Mac hamburger is a standardized
hamburger produced and sold in virtually every McDonald’s around the world, the
Economist, a weekly newspaper focusing on international politics and business news and
opinion, as early as 1986 was compiling information about Big Mac prices as an indicator of
exchange rates. Building on this idea, researchers Ashenfelter and Jurajda proposed
comparing wage rates across countries and the price of a Big Mac hamburger. Shown below
are Big Mac prices and net hourly wage figures (in U.S. dollars) for 27 countries. Note that
net hourly wages are based on a weighted average of 12 professions.
Country Big Mac Price (U.S $) Net Hourly Wage (U.S $)
Argentina 1.42 1.70
Australia 1.86 7.80
Brazil 1.48 2.05
Britain 3.14 12.30
Canada 2.21 9.35
Chile 1.96 2.80
Czech Republic 1.96 2.40
Denmark 4.09 14.40
Source: TB 2 Pg 465
a. Is there a relationship between the price of a Big Mac and the net hourly wages of
workers around the world? If so, how strong is the relationship?
b. Is it possible to develop a model to predict or determine the net hourly wage of a
worker around the world by the price of a Big Mac hamburger in that country? If so,
how good is the model?
c. If a model can be constructed to determine the net hourly wage of a worker around
the world by the price of a Big Mac hamburger, what would be the predicted net
hourly wage of a worker in a country if the price of a Big Mac hamburger was $3.00?
[N11] In 2020 the world witnessed one of its worst pandemics, the coronavirus disease
COVID-19. The number of positive cases and deaths reported is updated
regularly on Worldometer (https://www.worldometers.info/coronavirus/). Consider
the 12 worst-affected nations and calculate Karl Pearson’s coefficient of correlation
between the number of positive cases and the number of deaths. Comment on the result.
Session 17: Association Analysis of Ranked Order
After understanding Karl Pearson’s coefficient of correlation, we move on to
another method of arriving at a correlation coefficient for paired data: Spearman’s Rank
method. Spearman's rank correlation coefficient is a measure of rank
correlation (statistical dependence between the rankings of two variables).
At times we need to measure the strength of the relationship between variables using
data which can be trusted only to the extent of its rank ordering. The rank correlation
coefficient may be used in many situations for which the conventional correlation
coefficient is unsuitable.
Formulae:
$$R = 1 - \frac{6 \sum D^2}{N^3 - N}$$
With Rank Repetition:
Correction factor:
$$CF = \frac{m_1^3 - m_1}{12} + \frac{m_2^3 - m_2}{12} + \frac{m_3^3 - m_3}{12} + \dots$$
$$R = 1 - \frac{6\left\{\sum D^2 + CF\right\}}{N^3 - N}$$
Where
$D_i = R_{X_i} - R_{Y_i}$ and N = no. of observations
CF = Correction Factor
$m_i$ = Number of times an observation is repeated
When the given pairs of observations in the data set are not ranked, ranks are assigned
by taking either the highest or the lowest value as rank 1, using the same convention for both variables.
Source: http://www.brainkart.com/article/Spearman---s-Rank-Correlation-Coefficient_39249/#:~:text=Interpretation,the%20stronger%20the%20monotonic%20relationship.
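The ranking step and the basic formula can be sketched in Python as follows. The scores below are hypothetical (they are not taken from this module), no value is repeated, and the helper names `ranks_descending` and `spearman_no_ties` are illustrative only. The highest value is given rank 1 for both variables.

```python
def ranks_descending(values):
    """Rank distinct values, giving rank 1 to the highest value."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_no_ties(xs, ys):
    """R = 1 - 6 * sum(D^2) / (N^3 - N); valid only when no value repeats."""
    n = len(xs)
    rank_x = ranks_descending(xs)
    rank_y = ranks_descending(ys)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


# Hypothetical scores of five candidates in two tests (illustration only).
test_a = [86, 72, 95, 60, 78]
test_b = [80, 75, 90, 58, 70]

print("Ranks in test A:", ranks_descending(test_a))
print("Ranks in test B:", ranks_descending(test_b))
rank_correlation = spearman_no_ties(test_a, test_b)
print(f"Spearman's R = {rank_correlation:.2f}")
```

Here only two candidates swap adjacent ranks, so the sum of squared rank differences is 2 and R works out to 0.9.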
While attempting to rank the observations as mentioned above, we may come across a
situation where more than one observation has the same value. In such a case, the rank
assigned to each of these observations is the average of the ranks which they
would have received had they differed from each other. For example, if two
observations are ranked equal at fourth place, then the average rank of (4 + 5) / 2 = 4.5 is
assigned to both observations. If three observations are ranked equal at fourth place, then
the rank would be (4 + 5 + 6) / 3 = 5.
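The averaging of tied ranks and the correction factor can be sketched in Python as below. The sketch reuses the temperature and water consumption figures from the earlier correlation example, since both columns contain one repeated value; the lowest value is taken as rank 1, and the helper names are illustrative assumptions rather than part of the module.

```python
from collections import Counter

temperature = [75, 83, 85, 85, 92, 97, 99]
water = [16, 20, 25, 27, 32, 48, 48]


def average_ranks(values):
    """Rank 1 = lowest value; tied values share the average of their ranks."""
    ordered = sorted(values)
    ranks = {}
    for value, count in Counter(ordered).items():
        first = ordered.index(value) + 1        # first rank the tie occupies
        ranks[value] = first + (count - 1) / 2  # mean of first .. first+count-1
    return [ranks[v] for v in values]


def correction_factor(values):
    """Sum of (m^3 - m) / 12 over every value repeated m > 1 times."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)


def spearman_with_ties(xs, ys):
    """R = 1 - 6 * (sum(D^2) + CF) / (N^3 - N), with CF summed over both variables."""
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    cf = correction_factor(xs) + correction_factor(ys)
    return 1 - 6 * (sum_d2 + cf) / (n ** 3 - n)


print("Temperature ranks:", average_ranks(temperature))
print("Water ranks:      ", average_ranks(water))
rank_correlation = spearman_with_ties(temperature, water)
print(f"Spearman's R = {rank_correlation:.4f}")
```

For this data the sum of squared rank differences is 1 and each repeated pair contributes 0.5 to the correction factor, giving R of about 0.964.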
[N1] In one of the recruitment drives, Tata Motors Limited (TML) decided that they would
select a group of employees for skill based training on the basis of aptitude tests. On
completion of training, the quality of their work is assessed and they are again ranked as
follows, where ‘X’ denotes aptitude ranking and ‘Y’ denotes quality ranking. Calculate the
rank correlation and comment on the selection of employees.
X 2 1 3 7 6 8 4 5 10 9
Y 3 2 1 8 4 9 5 6 10 7
[N2] The following table provides data about the percentage of students who have qualified
for a scholarship offered by the state universities and their CGPA scores. Calculate the
Spearman’s Rank Correlation between the two and interpret the result.
[N3] Following a long-standing tradition, the Department of Horticulture, Karnataka
organized its annual Republic Day flower show at Lalbagh Botanical Garden, Bengaluru from
17th to 28th January 2020. As is the practice, the best theme would be awarded. To judge the
display of various flowers, a panel comprising three judges was appointed. There were
eight participants who were ranked by the panel based on mutually agreed criteria. The
panel’s rankings are as follows:
Participant No. 1 2 3 4 5 6 7 8
Judge 1 4 5 2 1 6 8 7 3
Judge 2 3 2 6 8 1 5 7 4
Judge 3 1 5 3 6 8 7 4 2
Using Spearman’s rank correlation coefficient, name two among the three judges who have
closer views regarding the display of flowers.
[N4] From the Covid-19 data compiled by the Ministry of Health and Family Welfare, India, the
following data is selected to measure the association between the number of active cases,
the number of cured cases and the number of deaths. Using Spearman’s rank correlation
coefficient, name the two among the three variables which have the closest association.
[N5] The following data corresponds to the scores of a student of MBA at Jain University in
continuous assessment in 2nd Semester. His mentor wishes to know if there is any
association between the marks scored by the student in two subjects. Use Spearman’s rank
correlation analysis to measure the association and interpret the result.
X 78 42 90 24 73 80 91 62 65 42
Y 84 51 92 43 75 54 86 54 54 43
[N6] TVS Motor Company is about to launch their new 100 CC stylish scooter targeted at the
youth. As part of the testing processes, they decide to invite the general public to test drive
the scooters in order to evaluate its mileage. For this experiment, the company selects two
youths as test drivers from two different colleges in Bengaluru and Chennai. Each driver is
supposed to travel a distance on 9 random routes and record observations. The
observations are as follows:
X 41 49 52 35 41 42 30 50 48
Y 51 44 44 47 49 51 28 39 22
Use Spearman’s rank correlation analysis to measure the association between the two drivers
and interpret the result.
Session 18: Business Prediction Models
INTRODUCTION
It is a human tendency to want to know what might happen in the future.
Organizations are no different. After all, they are the result of human intelligence.
Irrespective of the sector, organizations are in a race to predict their own future,
be it in terms of opportunities or challenges. They are looking for ways to make such
predictions.
On the other hand, there is another set of organizations doing their utmost to
take advantage of this appetite for prediction. They combine the old strategies with the
IT explosion to arrive at the best possible prediction.
Looking at some of the oldest prediction (from now onwards, let us use the more appropriate
corporate term, forecasting) strategies, the most common one is to forecast from historic
data. As the data grew, the need arose to analyse it using relevant tools.
Statistical tools were the result. Among the many statistical forecasting tools is
regression analysis. Others include simulation techniques, exponential smoothing, etc.
Regression analysis is one of the most tried and tested, popular statistical tools for forecasting.
REGRESSION ANALYSIS
It was Sir Francis Galton who first used the term regression as a statistical concept in 1877.
He made a statistical study that showed that the height of children born to tall parents
tends to ‘regress’ towards the mean height of population. Galton used the term regression
as a statistical technique to predict one variable (the height of children) from another
variable (the height of parents). This is called ‘regression’ or ‘simple regression’ confined to
bivariate data.
In many business decisions it is necessary to predict the value of unknown variables.
Regression analysis tells us how one variable is related to another by providing an equation
that allows us to use the known value of one or more variables, to estimate the unknown
value of the remaining variable.
A statistical model is a set of mathematical formulas and assumptions which describe a
real-world situation. In this sense, simple linear regression as well as multiple regression are
statistical models. A statistical model tries to capture the systematic behaviour of the given
data, leaving out those factors that cannot be foreseen or predicted. These factors are the
errors.
A good statistical model is one which provides as large a systematic component as possible,
minimising errors.
As a first step, we choose a particular model, say a linear regression model, for describing
the relationship between the two variables. As a second step, we work out the estimates of
the model parameters on the basis of random sample data. The third step is to consider the
errors, called residuals, that arise when the model is fitted to the data. When we are
convinced that the residuals contain only pure randomness, we consider our model
appropriate for its intended purpose, which is usually to make predictions.
The equation for a straight line is Y = a + bX, where ‘Y’ is the dependent variable, ‘X’ is the
independent variable, ‘a’ is the Y-intercept, which is the point at which the regression line
crosses the Y-axis (the vertical axis) and ‘b’ is the slope of the regression line. It should be
noted that the values of both ‘a’ and ‘b’ will remain constant for any given straight line.
The most commonly used regression lines are straight lines whose equations are,
$Y = a_1 + b_1 X$ (Equation 1)
$X = a_2 + b_2 Y$ (Equation 2)
In equation 1, X is the independent variable (I.V) and Y is the dependent variable (D.V), $a_1$ is
the Y intercept (it is the point at which the line crosses the Y-axis) and $b_1$ is the regression
coefficient or slope.
In equation 2, Y is the I.V and X is the D.V, $a_2$ is the X intercept (it is the point at which the
line crosses the X-axis) and $b_2$ is the regression coefficient, so that the slope of this line on
the usual X-Y plot is $\frac{1}{b_2}$.
The estimated regression equation is the one that minimizes the sum of squares of
deviations between the observed values of the D.V (Y) and the estimated values of the
dependent variable ($\hat{Y}$). This is the least squares criterion for choosing the equation that
provides the best fit.
Our objective is to fit a straight line Y = a + bX to all the given points. Ideally, the line
of best fit would pass through every point, but this is not possible
in most cases. So we look for the line that lies as close as possible to all the points
taken together.
The goal is to minimize the sum of the squares of the errors of the data points, where
$E_i = Y_i - (a + bX_i)$. This also minimizes the Mean Square Error.
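The least squares idea can be illustrated with a short Python sketch on the temperature and water consumption data from the correlation example. The function names `fit_line` and `sum_squared_errors` are hypothetical. The sketch estimates a and b, then shows that moving the intercept or the slope away from the fitted values increases the sum of squared errors.

```python
# Temperature (F) as X and water consumption (ounces) as Y.
temperature = [75, 83, 85, 85, 92, 97, 99]
water = [16, 20, 25, 27, 32, 48, 48]


def fit_line(xs, ys):
    """Least squares estimates of a and b in Y = a + bX."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b


def sum_squared_errors(a, b, xs, ys):
    """Sum of E_i^2, where E_i = Y_i - (a + b * X_i)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


a, b = fit_line(temperature, water)
residuals = [y - (a + b * x) for x, y in zip(temperature, water)]

sse_best = sum_squared_errors(a, b, temperature, water)
sse_shifted = sum_squared_errors(a + 1.0, b, temperature, water)  # intercept moved
sse_tilted = sum_squared_errors(a, b + 0.1, temperature, water)   # slope moved

print(f"Fitted line: Y = {a:.3f} + {b:.3f} X")
print(f"Sum of squared errors at the fitted line: {sse_best:.3f}")
print(f"After shifting the intercept by 1:        {sse_shifted:.3f}")
print(f"After increasing the slope by 0.1:        {sse_tilted:.3f}")
```

The residuals of the fitted line sum to zero (up to rounding), which is one quick check that the estimates were computed correctly.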
REGRESSION EQUATIONS
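Regression equation of y on x is
$$(Y - \bar{Y}) = b_{yx}\,(X - \bar{X})$$
$$b_{yx} = \frac{N \sum XY - \sum X \sum Y}{N \sum X^2 - (\sum X)^2}$$
Regression equation of x on y
$$b_{xy} = \frac{N \sum XY - \sum X \sum Y}{N \sum Y^2 - (\sum Y)^2}$$
$$(X - \bar{X}) = r\left(\frac{\sigma_x}{\sigma_y}\right)(Y - \bar{Y})$$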
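A minimal Python sketch of both regression equations follows, again using the temperature (X) and water consumption (Y) data from the correlation example. The function names are illustrative assumptions; the coefficients are computed from the raw sums, and each line is written in its deviation-from-mean form.

```python
temperature = [75, 83, 85, 85, 92, 97, 99]   # X
water = [16, 20, 25, 27, 32, 48, 48]         # Y


def regression_coefficients(xs, ys):
    """Return (b_yx, b_xy) from the raw sums of X, Y, XY, X^2 and Y^2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    b_yx = numerator / (n * sum_x2 - sum_x ** 2)
    b_xy = numerator / (n * sum_y2 - sum_y ** 2)
    return b_yx, b_xy


mean_x = sum(temperature) / len(temperature)
mean_y = sum(water) / len(water)
b_yx, b_xy = regression_coefficients(temperature, water)


def predict_y(x):
    """Regression of y on x: (Y - mean_y) = b_yx * (X - mean_x)."""
    return mean_y + b_yx * (x - mean_x)


def predict_x(y):
    """Regression of x on y: (X - mean_x) = b_xy * (Y - mean_y)."""
    return mean_x + b_xy * (y - mean_y)


print(f"b_yx = {b_yx:.4f}, b_xy = {b_xy:.4f}")
print(f"Estimated water consumption at 90 F: {predict_y(90):.2f} ounces")
print(f"Estimated temperature for 30 ounces: {predict_x(30):.2f} F")
```

The y-on-x line is the one to use when estimating Y from a known X, and the x-on-y line when estimating X from a known Y; both lines pass through the point of means.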
REGRESSION COEFFICIENTS
The factor $r\left(\frac{\sigma_y}{\sigma_x}\right)$, which represents the increment in the value of the dependent variable y
corresponding to a unit change in the value of the independent variable x, is known as the
regression coefficient of y on x and is denoted by $b_{yx}$.
$$b_{yx} = r\left(\frac{\sigma_y}{\sigma_x}\right)$$
$$b_{xy} = r\left(\frac{\sigma_x}{\sigma_y}\right)$$
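The link between the regression coefficients and the correlation coefficient can be verified numerically. The sketch below uses Python's standard `statistics` module with sample standard deviations on the temperature and water consumption data from the correlation example; it also checks a consequence of the two formulas, namely that the product of the two regression coefficients equals $r^2$.

```python
from statistics import mean, stdev

temperature = [75, 83, 85, 85, 92, 97, 99]   # X
water = [16, 20, 25, 27, 32, 48, 48]         # Y

n = len(temperature)
mean_x, mean_y = mean(temperature), mean(water)

# Sum of products of deviations from the means.
sum_dxdy = sum((x - mean_x) * (y - mean_y) for x, y in zip(temperature, water))

# Sample standard deviations (divisor N - 1), matching r = sum(xy) / ((N - 1) * sx * sy).
sigma_x, sigma_y = stdev(temperature), stdev(water)
r = sum_dxdy / ((n - 1) * sigma_x * sigma_y)

b_yx = r * sigma_y / sigma_x   # regression coefficient of y on x
b_xy = r * sigma_x / sigma_y   # regression coefficient of x on y

print(f"r    = {r:.4f}")
print(f"b_yx = {b_yx:.4f}")
print(f"b_xy = {b_xy:.4f}")
print(f"b_yx * b_xy = {b_yx * b_xy:.4f}, r^2 = {r * r:.4f}")
```

Because the standard deviations cancel in the product, r can be recovered from the two regression coefficients as the square root of their product, taking the sign the coefficients share.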
Session 19: Analytical Linear Regression
Introduction
In the previous session, we understood the theory of regression analysis. Unfortunately, it is
difficult to understand the application of regression analysis, or for that matter any statistical
tool, unless we apply it to data and see how it works. In this
session, we shall apply regression analysis to data sets from multiple sectors
and understand what the data has to tell us.
[N1] The government of India is announcing plenty of reforms during this pandemic period.
As part of this activity, it has assigned the Ministry of Commerce and Industry to
model the relationship between the import and export values of the electronics sector in the
country. The ministry has gathered data for 2013-14 to 2018-19 from DGCIS for
this purpose. Use regression analysis to model the relationship of import on export, and
vice versa, for the electronics data. Also predict the import for the year 2020-21 given that the
export will be USD 11 Billion.
[N3] The editor-in-chief of Bangalore Mirror has been trying to convince the paper’s owner
to improve the working conditions in the press room. He is convinced that the noise level,
when the presses are running, creates unhealthy levels of tension and anxiety. He recently
had a psychologist conduct a test during which pressmen were placed in rooms with varying
levels of noise and then given a test to measure mood and anxiety levels. The following
table shows the index of their degrees of nervousness and the level of noise to which they
were exposed (5 is low and 10 is high).
Noise Level 7.0 6.5 5.5 6.0 8.0 8.5 6.0 6.5
Degree of Nervousness 23 38 45 36 16 18 39 41
Develop estimating equations. Also predict the degrees of nervousness that we might
expect when the noise level is 7.5
Fatal Accidents per 1000 licenses 2.6 3.8 0.8 1.2 0.6 1 2.8 1.4 1.8 2
Percent of licensed drivers under 21 years 17 18 8 13 6 9 16 12 9 10
Fit linear equations to the above data.
[N6] Hyundai Motor India Ltd has recently held 3-day road-side exhibits on the introduction
of its new model of Creta. The number of sales personnel employed at each of a sample of
10 exhibitions and the number of cars booked at each one are given as follows:
No. of Salesmen 5 8 6 8 9 3 5 4 6 6
No. of Cars booked 132 160 148 156 168 102 142 98 152 142
Using these data, regress the number of cars booked on the number of salesmen and obtain
the regression equation. Also estimate the number of cars booked if 10 salesmen are
employed at an exhibition.
[N7] ITI Limited recorded data showing the experience of machine operators and their
performance rating as given by the number of good parts turned out per 100 pieces.
Operator 1 2 3 4 5 6 7 8
Experience (Years) 16 12 18 4 3 10 5 12
Performance Rating 87 88 89 68 78 80 75 82
Obtain the regression equation of performance rating on experience. Use this equation to
estimate the probable performance if an operator has 7 years of experience.
[N8] People in the aerospace industry believe the cost of a space project is a function of the
weight of the major object being sent into space. Use the following data to develop a
regression model to predict the cost of a space project by the weight of the space object.
Session 20: Multiple Linear Regression
Introduction
People have learnt the hard way that most outcomes are the result of not one reason but
many. Many causes lead to one single effect. Expressed in terms of dependent
and independent variables, a dependent variable may not depend on only one independent
variable. It can depend on many. For example, the sale of a particular product (dependent
variable) depends not only on its price, but also on its quality, features, competition, alternate
products, market conditions (independent variables), etc. In order to model this relationship, we use
multiple linear regression analysis.
In essence, multiple regression is the extension of ordinary least-squares (OLS) regression that
involves more than one explanatory variable.
A simple linear regression is a function that allows an analyst or statistician to make predictions
about one variable based on the information that is known about another variable. Linear regression
can only be used when one has two continuous variables—an independent variable and a
dependent variable. The independent variable is the parameter that is used to calculate the
dependent variable or outcome. A multiple regression model extends to several explanatory
variables.
For example, an analyst may want to know how the movement of the market affects the price of
Exxon Mobil (XOM). In this case, his linear equation will have the value of the S&P 500 index as the
independent variable, or predictor, and the price of XOM as the dependent variable.
In reality, there are multiple factors that predict the outcome of an event. The price movement of
Exxon Mobil, for example, depends on more than just the performance of the overall market. Other
predictors such as the price of oil, interest rates, and the price movement of oil futures can affect
the price of XOM and stock prices of other oil companies. To understand a relationship in which
more than two variables are present, a multiple linear regression is used.
Multiple linear regression (MLR) is used to determine a mathematical relationship among a number
of random variables. In other terms, MLR examines how multiple independent variables are related
to one dependent variable. Once the independent factors that help predict the dependent variable
have been identified, the information on these variables can be used to estimate
the level of effect each has on the outcome variable. The model creates a linear relationship
(a straight line in the single-predictor case) that best approximates all the individual data points.
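As a rough illustration of how such a model is estimated, the sketch below fits Y = b0 + b1 X1 + b2 X2 by ordinary least squares in plain Python, by solving the normal equations. The data are hypothetical and constructed so that sales = 50 - 2 × price + 3 × advertising exactly, and the function names are illustrative; in practice a statistics package would normally be used instead.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def fit_multiple_regression(predictors, ys):
    """Least squares coefficients [b0, b1, ..., bk] via the normal equations."""
    design = [[1.0] + list(row) for row in predictors]  # leading 1 = intercept
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical observations: (price, advertising spend) for six periods.
observations = [(10, 4), (12, 5), (9, 3), (15, 8), (11, 6), (14, 4)]
# Hypothetical sales built from a known rule so the fit can be checked.
sales = [50 - 2 * price + 3 * adv for price, adv in observations]

coefficients = fit_multiple_regression(observations, sales)
b0, b1, b2 = coefficients
print(f"sales = {b0:.2f} + ({b1:.2f}) * price + ({b2:.2f}) * advertising")
```

Because the hypothetical sales figures follow the rule exactly, the fitted coefficients recover 50, -2 and 3; with real data the fit would leave non-zero residuals.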
CURVE FITTING
Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a
series of data points, possibly subject to constraints. Fitted curves can be used as an aid for data
visualization, to infer values of a function where no data are available, and to summarize the
relationships among two or more variables.
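One common way to fit a non-linear curve is to transform it into a straight line and apply least squares. The Python sketch below fits y = a·exp(bx) by regressing ln(y) on x. This is only one of several curve-fitting approaches; it assumes all y values are positive, and it minimizes squared errors on the log scale rather than on the original scale. The data are hypothetical and generated from a = 2 and b = 0.5, and the function name `fit_exponential` is illustrative.

```python
from math import exp, log


def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by least squares on ln(y) = ln(a) + b * x."""
    log_ys = [log(y) for y in ys]  # requires every y > 0
    n = len(xs)
    mean_x = sum(xs) / n
    mean_log_y = sum(log_ys) / n
    s_xy = sum((x - mean_x) * (ly - mean_log_y) for x, ly in zip(xs, log_ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    a = exp(mean_log_y - b * mean_x)
    return a, b


# Hypothetical data generated from y = 2 * exp(0.5 * x).
xs = [0, 1, 2, 3, 4]
ys = [2 * exp(0.5 * x) for x in xs]

a, b = fit_exponential(xs, ys)
print(f"Fitted curve: y = {a:.3f} * exp({b:.3f} * x)")

# Use the fitted curve to infer a value where no data are available.
print(f"Estimated y at x = 2.5: {a * exp(b * 2.5):.3f}")
```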
CA3: CORRELATION AND REGRESSION THROUGH FITNESS AND GAMES
Assessment 3:
Syllabus coverage: Module 4
Timeline: Session 22
Assessment marks: 10
Objective: This activity is aimed at evaluating the ability of students to identify and establish
association between variables. This is conducted based on the activity 4 – Correlation and Regression
through fitness and games.
Debriefing duration: 10 minutes
Activity Space: Class rooms, corridors and empty spaces meant for students’ activities
Submission style: A report consisting of customized covering sheet, data set, solution,
interpretation, color photograph of the group holding the project, conclusion.
Evaluation: Respective professors will evaluate based on the rubrics mentioned below.
ASSESSMENT RUBRICS
Completion of Activity 2 marks
Data compilation and preparation 2 marks
Data analysis (calculation) 2 marks
Drawing inference (interpretation) 2 marks
Viva-voce 2 marks
TOTAL 10 Marks