
Business Analytics II

Prof. Shweta Sharma


Scatter Plot
Ice Cream Sales vs Temperature

Let us plot Ice Cream Sales against Temperature.

The local ice cream shop keeps track of how much ice cream they sell versus the noon temperature on that day. Here are their figures for the last 12 days:

Temperature °C    Ice Cream Sales
14.2°             $215
16.4°             $325
11.9°             $185
15.2°             $332
18.5°             $406
22.1°             $522
19.4°             $412
25.1°             $614
23.4°             $544
18.1°             $421
22.6°             $445
17.2°             $408
Interpolation

• is where we find a
value inside our set of
data points.
Extrapolation

• is where we find a
value outside our set of
data points.
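As a rough illustration of the difference, the sketch below fits a least-squares straight line to the ice cream figures and reads estimates off it. The straight-line model and the two query temperatures (21° and 29°) are assumptions chosen for illustration; they do not come from the slides.

```python
# Ice cream figures from the slide: noon temperature (°C) and sales ($).
temps = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
sales = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(temps, sales)


def estimate_sales(temp):
    """Return (estimate, kind); kind says whether temp lies inside the observed range."""
    inside = min(temps) <= temp <= max(temps)
    kind = "interpolation" if inside else "extrapolation"
    return intercept + slope * temp, kind


for t in (21.0, 29.0):  # hypothetical query temperatures
    value, kind = estimate_sales(t)
    print(f"{t}° -> about ${value:.0f} ({kind})")
```

The extrapolated estimate deserves more caution than the interpolated one, because no observation supports the line beyond 25.1°.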
Correlation
• Correlation is a statistical measure that expresses the extent to
which two variables are linearly related
• meaning they change together at a constant rate
• Correlation is Positive when the values increase together, and
• Correlation is Negative when one value decreases as the other
increases
Correlation
Pearson Correlation
• 2 continuous variables
• Linear relationship
• Association between height and weight
• Measures the degree of linear
association between two interval scaled
variables.
• Analysis of the relationship between
two quantitative outcomes like height
and weight
Examples
1. The consumption of ice-cream increases during the summer months. There is a strong correlation between temperature and the sales of ice-cream units. In this particular example there is also a causal relationship, as extreme summers do push the sale of ice-creams up.
2. Ice-cream sales also have a strong correlation with shark attacks. As we can see very clearly here, the shark attacks are most definitely not caused by ice-creams. So, there is no causation here.
The table below demonstrates how to interpret the size (strength) of a correlation coefficient
• The ages and weights of 6 different people are given below for the calculation of the value of the Pearson R

SR NO    AGE (X)    WEIGHT (Y)
1        40         78
2        21         70
3        25         60
4        31         55
5        38         80
6        47         66
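The calculation for this table can be sketched as follows. This is a minimal sketch using the standard computational formula for Pearson's r; the function and variable names are illustrative, not from the slides.

```python
from math import sqrt

# Age (X) and weight (Y) of the 6 people in the table.
ages = [40, 21, 25, 31, 38, 47]
weights = [78, 70, 60, 55, 80, 66]


def pearson_r(xs, ys):
    """Pearson r via the raw sums: n, sum x, sum y, sum xy, sum x^2, sum y^2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(ages, weights)
print(f"Pearson r = {r:.2f}")
```

For these six people r comes out at roughly 0.35, a weak positive linear association between age and weight.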
Spearman's rank correlation
• It is a nonparametric measure of rank correlation (statistical dependence between
the rankings of two variables). It assesses how well the relationship between two
variables can be described using a monotonic function.
• Pooja is participating in a beauty pageant
• Overall, there were 7 participants in the beauty pageant
• Two judges were judging the participants, and each participant was ranked by the respective judges based on the scores
• Scores and ranks given to beauty contestants by respective judges
• Pooja wanted to see how the ranks given by the respective judges are related
• Pooja used the concept of correlation given by Spearman, called Spearman’s correlation

JUDGE 1     JUDGE 2     RANK 1 (BASED ON       RANK 2 (BASED ON
(SCORES)    (SCORES)    SCORES OF JUDGE 1)     SCORES OF JUDGE 2)
85          87          7                      6
95          84          2                      7
88          88          5                      5
86          95          6                      1
92          89          4                      4
97          91          1                      2
93          90          3                      3
• Table to calculate Spearman’s rank correlation coefficient

JUDGE 1     JUDGE 2     RANK 1 (BASED ON       RANK 2 (BASED ON       D = R1 - R2    D^2
(SCORES)    (SCORES)    SCORES OF JUDGE 1)     SCORES OF JUDGE 2)
85          87          7                      6                      1              1
95          84          2                      7                      -5             25
88          88          5                      5                      0              0
86          95          6                      1                      5              25
92          89          4                      4                      0              0
97          91          1                      2                      -1             1
93          90          3                      3                      0              0
                                                                      Total          52

$$\rho = 1 - \frac{6 \sum D^2}{n(n^2 - 1)} = 1 - \frac{6 \times 52}{7(7^2 - 1)} = 0.07143$$
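The rank-difference calculation can be checked with a short sketch. It assumes there are no tied scores (true for these two judges), which is the case the shortcut formula 1 - 6ΣD²/(n(n² - 1)) is meant for; the helper names are illustrative.

```python
# Scores given by the two judges to the 7 participants.
judge1 = [85, 95, 88, 86, 92, 97, 93]
judge2 = [87, 84, 88, 95, 89, 91, 90]


def ranks_descending(scores):
    """Rank 1 goes to the highest score. Assumes no tied scores."""
    ordered = sorted(scores, reverse=True)
    return [ordered.index(s) + 1 for s in scores]


def spearman_rho(xs, ys):
    """Spearman's rho from squared rank differences (no-ties formula)."""
    rank1, rank2 = ranks_descending(xs), ranks_descending(ys)
    n = len(xs)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank1, rank2))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


rho = spearman_rho(judge1, judge2)
print(f"rho = {rho:.5f}")  # rho = 0.07143
```

A value this close to 0 says the two judges' rankings are almost unrelated.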
Linear Regression
Define
• Modeling and establishing the relationship between one
dependent variable and one independent variable is known as
Simple Linear Regression.
Find the linear regression equation for the following set of data:

x    2    4    6    8
y    3    7    5    10
Construct the
following table
Calculation
Calculate predicted y
• Substitute in the equation
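The table, the calculation and the predicted values for this data set can be sketched in a few lines. The sketch assumes the line is written as y = a + bx, with a the intercept and b the slope, and uses the usual least-squares formulas; the variable names are illustrative.

```python
xs = [2, 4, 6, 8]
ys = [3, 7, 5, 10]

# Column totals of the working table.
n = len(xs)
sum_x = sum(xs)                                # 20
sum_y = sum(ys)                                # 25
sum_xy = sum(x * y for x, y in zip(xs, ys))    # 144
sum_x2 = sum(x * x for x in xs)                # 120

# Least-squares slope (b) and intercept (a) of y = a + b*x.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

# Substitute each x into the equation to get the predicted y.
predicted = [a + b * x for x in xs]

print(f"y = {a:.2f} + {b:.2f}x")
print([round(p, 2) for p in predicted])
```

With these four points the line works out to y = 1.5 + 0.95x.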
• State Bank of India recently established a new policy of linking its savings account interest rate to the Repo rate. The auditor of State Bank of India wants to conduct an independent analysis of the bank's interest rate decisions, to check whether the rate was changed whenever the Repo rate changed. A summary of the Repo rate and the bank’s savings account interest rate that prevailed in those months is given below.
• The auditor of State Bank has approached you to conduct an analysis and provide a presentation on the same in the next meeting. Use the regression formula to determine whether the bank’s rate changed as and when the Repo rate was changed.
Excel Question

• We are going to do a simple linear regression in Excel. What we have is a list of average monthly rainfall for the last 24 months in column B, which is our independent variable (predictor), and the number of umbrellas sold in column C, which is the dependent variable.
Interpret regression analysis output
• Multiple R:
• It is the Correlation Coefficient that measures the strength of a linear relationship between two variables. The correlation coefficient can be any value between -1 and 1, and its absolute value indicates the relationship strength. The larger the absolute value, the stronger the relationship:
• 1 means a perfect positive linear relationship
• -1 means a perfect negative linear relationship
• 0 means no linear relationship at all
• R Square:
• It is the Coefficient of Determination, which is used as an indicator of the goodness of fit. It shows what proportion of the variation in the dependent variable is accounted for by the regression line. The R2 value is calculated from the total sum of squares (the sum of the squared deviations of the original data from the mean) and the residual sum of squares: R2 = 1 - (residual sum of squares / total sum of squares).
• For example, R2 is 0.91 (rounded to 2 digits), which is fairly good. It means that 91% of the variation in the dependent variable (y-values) is explained by the independent variable (x-values). Generally, R Squared of 95% or more is considered a good fit.
• Adjusted R Square. It is the R square adjusted for the number of independent variables in the model. You will want to use this value instead of R square for multiple regression analysis.
• Standard Error. It is another goodness-of-fit measure that shows the precision of your regression analysis: the smaller the number, the more certain you can be about your regression equation. While R2 represents the percentage of the dependent variable's variance that is explained by the model, Standard Error is an absolute measure that shows the average distance that the data points fall from the regression line.
• Observations. It is simply the number of observations in your model.
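To make these summary statistics concrete, the sketch below recomputes them by hand for the small x/y data set from the earlier linear regression example (the rainfall data itself is not reproduced in these slides). The formulas are the standard ones for least-squares regression; treating Multiple R as the square root of R Square holds for a single predictor, and the variable names are illustrative.

```python
from math import sqrt

xs = [2, 4, 6, 8]
ys = [3, 7, 5, 10]
n = len(xs)   # Observations
k = 1         # number of independent variables

# Fit the least-squares line y = intercept + slope * x.
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
slope = sxy / sxx
intercept = mean_y - slope * mean_x
predicted = [intercept + slope * x for x in xs]

# Total and residual sums of squares.
ss_total = sum((y - mean_y) ** 2 for y in ys)
ss_residual = sum((y - p) ** 2 for y, p in zip(ys, predicted))

r_square = 1 - ss_residual / ss_total
multiple_r = sqrt(r_square)
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / (n - k - 1)
standard_error = sqrt(ss_residual / (n - k - 1))

print(f"Multiple R        {multiple_r:.4f}")
print(f"R Square          {r_square:.4f}")
print(f"Adjusted R Square {adjusted_r_square:.4f}")
print(f"Standard Error    {standard_error:.4f}")
print(f"Observations      {n}")
```

With only four points, Adjusted R Square falls well below R Square, which shows how strongly the adjustment penalises small samples.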
The second part of the output is Analysis of
Variance (ANOVA):
• The Significance F value gives an idea of how reliable
(statistically significant) your results are.
• If Significance F is less than 0.05 (5%), your model is OK.
• If it is greater than 0.05, you'd probably better choose
another independent variable.
Regression analysis output: coefficients
Y = Rainfall Coefficient * x + Intercept
Equipped with a and b values rounded to three decimal places, it turns into:
Y=0.45*x-19.074
For example, with the average monthly rainfall equal to 82 mm, the umbrella
sales would be approximately 17.8:
0.45*82-19.074=17.8
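The same substitution can be wrapped in a small helper. The coefficient and intercept are the rounded values quoted above; the function name and constant names are hypothetical.

```python
RAINFALL_COEFFICIENT = 0.45   # slope, rounded value from the regression output
INTERCEPT = -19.074           # intercept, rounded value from the regression output


def predict_umbrellas(rainfall_mm):
    """Estimated umbrella sales for a given average monthly rainfall in mm."""
    return RAINFALL_COEFFICIENT * rainfall_mm + INTERCEPT


print(round(predict_umbrellas(82), 1))  # 17.8
```

Because the intercept is negative, the equation gives negative sales for low rainfall (below roughly 42 mm), so it should only be trusted near the range of the observed data.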
Regression analysis output: residuals

• Estimated: 17.8 (calculated above)
• Actual: 15 (row 2 of the source data)
• Why's the difference? Because independent variables are never perfect predictors of the dependent variables. And the residuals can help you understand how far away the actual values are from the predicted values.
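For the single observation quoted here, the residual can be worked out as below. The sketch assumes the usual convention residual = actual - predicted, so a negative residual means the model over-estimated.

```python
actual = 15                          # umbrellas actually sold (row 2 of the source data)
predicted = 0.45 * 82 - 19.074       # estimate from the regression equation
residual = actual - predicted        # assumed convention: actual minus predicted

print(f"predicted = {predicted:.1f}, residual = {residual:.1f}")
```

Here the residual is about -2.8: the equation predicted roughly 2.8 more umbrellas than were actually sold that month.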
