
Demystifying Data-Driven Strategies and Policies with Excel
(GFQR 1046)

Predicting Pregnancies of Our Customers I: Regression Model

The Topic this Week: Regression Model

• The Target Story Revisited


• Simple Linear Regression Model
• Multiple Linear Regression Model
Revisiting The Target Story
• A high school girl in the US became pregnant, but her father did not know about it at first
• One day he browsed the girl’s mobile phone and found that Target was sending her coupons for baby goods
• The father got furious and rushed to a Target store in Minnesota to argue with its manager, “My daughter
got this in the mail! She’s still in high school, and you’re sending her coupons for baby clothes and cribs?
Are you trying to encourage her to get pregnant?”
• The store subsequently checked with the HQ and found that the girl had received the baby goods coupons
because the data analytics team had predicted that she had a high probability of being pregnant
• The store manager called the father to apologize, but the father apologized instead, “I had a talk with my
daughter. It turns out there’s been some activities in my house I haven’t been completely aware of ……
She’s due in August. I owe you an apology.”
Questions to Think About

• How can we predict which of our customers are pregnant?

• What problems could arise when we send baby-goods coupons to customers we believe to be pregnant? How can we address those problems?

Photos Source: https://www.youtube.com/watch?v=f2Kji24833Y
Simple Linear Regression Model
Regression Model
• Regression is a set of statistical models that explore the possible dependence of one variable (the response
variable, denoted by Y) on one or more other variables (the predictor variables, denoted by X)
What Can We Do with Regression Model?
• Establishing and quantifying the relation between the response variable and a predictor variable

• Making predictions on the response variable


Types of Regression Model
• Simple linear regression model
o Only 1 predictor variable

• Multiple linear regression model
o More than 1 predictor variable
Simple Linear Regression Model
Y = 𝛼 + βX + μ
• Y: Response variable
• X: Predictor variable
• 𝛼 & β: Coefficients
• 𝛼 + βX: Prediction on Y
• μ: Prediction error
Basic Concepts
• Y (the response variable) is the variable whose value we are interested in predicting

• We predict Y by linking it with another variable X (the predictor variable), which we believe affects the value of Y
based on some theoretical considerations
Basic Concepts
• If we can collect data for both X and Y, we can apply the regression methodology to estimate 𝛼 & β (i.e. the
coefficients of the regression model)

• On top of estimation, the regression methodology also includes tools for evaluating whether our estimates for 𝛼 & β and the
whole model are valid
Basic Concepts
• If the estimates for 𝛼 & β and the whole model are found to
be valid, we are ready to make predictions on Y
What Does “Linear” Mean?
• Both 𝛼 & β enter the model with power 1 (i.e. the model is linear in its coefficients)

• If we plot the model in a graph, it is simply a straight line

• Estimating a simple linear regression model therefore means fitting a straight line to the data in a proper way
Simple Linear Regression - An Example
• A firm selling consumer goods plans to commence sales in a
new city

• It wants to have a good prediction of the sales it can achieve in this new market
[Figure: Selling in 100 Cities Now (cities 1 to 100), described by Y = 𝛼 + βX + μ. City 101: the firm plans to sell in a new city. Product sales in this new market?]
Key Steps of Simple Linear Regression
• Identifying the relevant X
• Collecting and exploring the data for X and Y
• Estimating 𝛼 & β
• Testing the validity of the model
• Using the model for making prediction
Identifying the Relevant X
• The firm believes the sales of its products are driven by its
advertising spending

• The more it spends on advertising, the more sales it would expect for its products (i.e. positive relation)

• Y = Product sales; X = Advertising spending

Collecting Data for X & Y
• The firm draws a SAMPLE of 10 cities in which it is selling now and collects last year's data on
product sales and advertising spending

• This data set is called the training data, which is composed of 10 observations and 2 variables
Exploring the Data - Advertising Spending

• Advertising spending is measured in million $

• It ranges from $11.7 mil (City H) to $55.7 mil (City E)

• Average = $41.6 mil


Exploring the Data - Product Sales

• Product sales is measured in million units

• It ranges from 3.8 mil units (City B) to 7.6 mil units (City E)

• Average = 5.6 mil units


Exploring the Data - Scatter Plot

• Positive relation between advertising spending & product sales

• Both are related linearly (i.e. a straight line is a good description of their relation)
Model Specification
• Product Sales = 𝛼 + β(Advertising Spending) + μ

• Both 𝛼 & β are expected to be positive (Why?)


Estimating Regression Model - The Least Squares Method
• Fit any straight line to the data, which represents one of the many possible models

• Find the vertical distances between all the points & the straight line, which are the errors

• Find the sum of squared errors (i.e. square all the errors and sum them up)

• The least squares method dictates that the best line fitted to the data should minimize the sum
of squared errors
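
The least squares idea above can be sketched in a few lines of Python. This is only an illustration: the function names and the five data points are hypothetical (they are NOT the firm's 10 cities), and the closed-form formulas used are the standard ones for a single predictor.

```python
def fit_simple_ols(xs, ys):
    """Return (alpha, beta) of the line that minimises the sum of squared errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: co-variation of X and Y divided by the variation of X
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta = sxy / sxx
    # The fitted line always passes through the point of averages
    alpha = y_bar - beta * x_bar
    return alpha, beta


def sum_squared_errors(alpha, beta, xs, ys):
    """Square each vertical distance between a point and the line, then add up."""
    return sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

alpha, beta = fit_simple_ols(xs, ys)
sse_best = sum_squared_errors(alpha, beta, xs, ys)
# Any other straight line, e.g. the same slope shifted upwards, has a larger sum
sse_other = sum_squared_errors(alpha + 0.5, beta, xs, ys)
print(alpha, beta, sse_best, sse_other)
```

Comparing `sse_best` with `sse_other` shows the defining property of the fitted line: moving it in any way increases the sum of squared errors.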
Estimating Regression Model in Excel
1. Go to “Data”

2. Press “Data Analysis” button and choose “Regression”
Estimating Regression Model in Excel

3. Select the data for product sales as the Y range

4. Select the data for advertising spending as the X range

5. Choose to display the output in a new worksheet and press “OK”
Estimation Results
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.82297009
R Square             0.677279769
Adjusted R Square    0.63693974
Standard Error       0.716440178
Observations         10

ANOVA
             df   SS            MS         F          Significance F
Regression   1    8.617707776   8.617708   16.78927   0.00344979
Residual     8    4.106292224   0.513287
Total        9    12.724

            Coefficients   Standard Error   t Stat     P-value    Lower 95%    Upper 95%    Lower 95.0%   Upper 95.0%
Intercept   2.866328588    0.713829783      4.015423   0.003866   1.22023416   4.51242302   1.22023416    4.51242302
X1          0.066642754    0.01626436       4.097472   0.00345    0.02913707   0.10414844   0.02913707    0.10414844

(X1 = Advertising Spending. The "Coefficients" column gives the estimated 𝛼 and β; the "t Stat" column gives the t statistics.)
Estimation Results
• Estimated 𝛼 = 2.8663; Estimated β = 0.0666

• Predicted Product Sales = 2.8663 + 0.0666(Advertising Spending)


Interpreting Estimated 𝛼 = 2.8663
• 𝛼 represents the value of Y when X = 0

• In the context of our example, it means that even if the firm does not spend any money on advertising (i.e. X = 0), it is still
predicted to achieve 2.8663 million units of sales

• Its positive sign is in line with our initial expectation


Interpreting the Estimated β = 0.0666
• β represents how much Y would change when X increases by
1 unit

• In the context of our example, it means that when the firm spends $1 million more on advertising, the predicted product sales
increase by 0.0666 million units, or 66,600 units

• Its positive sign is in line with our initial expectation
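
Both interpretations can be checked numerically with the estimated equation. The sketch below uses the rounded estimates from the slides; the function name and the spending levels of $30 million and $31 million are hypothetical choices for illustration.

```python
ALPHA = 2.8663  # estimated intercept (million units)
BETA = 0.0666   # estimated slope (million units per $1 million of advertising)


def predict_sales(advertising_mil):
    """Predicted product sales in million units, given advertising in $ million."""
    return ALPHA + BETA * advertising_mil


# Intercept: the prediction when advertising spending is zero
baseline = predict_sales(0.0)

# Slope: the change in the prediction when spending rises by $1 million
# (30 and 31 are arbitrary; the difference is the same at any starting level)
extra = predict_sales(31.0) - predict_sales(30.0)

print(f"Sales with no advertising: {baseline:.4f} million units")
print(f"Extra sales per $1 million: {extra:.4f} million units")
```

Because the model is a straight line, the gain from one more unit of X does not depend on where X starts.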


Things to Check Before Using the Model to Make Prediction
• Statistical significance tests of the estimated 𝛼 & β
• R Square
• Others
Statistical Significance Test for Estimated β
• The estimated β = 0.0666 is based on a sample of 10 cities

• Using another sample or even the data of all cities for estimation, we may generate other results (i.e. estimation
based on sample data is subject to sampling error)
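
Sampling error can be illustrated with a small simulation. Everything in this sketch is an assumption made for illustration: the "population" of 100 cities is randomly generated, and the true coefficients (3.0 and 0.05) and the noise level are hypothetical, not the firm's data.

```python
import random


def estimate_slope(xs, ys):
    """Least squares slope for a single predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population of 100 cities with a known "true" relation
TRUE_ALPHA, TRUE_BETA = 3.0, 0.05
population = []
for _ in range(100):
    x = random.uniform(10, 60)                             # advertising, $ million
    y = TRUE_ALPHA + TRUE_BETA * x + random.gauss(0, 0.7)  # sales, million units
    population.append((x, y))

# Draw several different samples of 10 cities and re-estimate the slope each time
estimates = []
for _ in range(5):
    sample = random.sample(population, 10)
    xs = [city[0] for city in sample]
    ys = [city[1] for city in sample]
    estimates.append(estimate_slope(xs, ys))

print([round(b, 4) for b in estimates])
```

Each sample gives a somewhat different estimate of β even though the underlying relation is the same, which is why a significance test is needed before trusting any single estimate.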
Statistical Significance Test for Estimated β
• In particular, while the estimated β = 0.0666, it is still possible
that the true β = 0 (i.e. the β value based on all cities)

• Statistical significance test for β: To test whether we can reject the hypothesis “true β = 0” (i.e. null hypothesis)
Statistical Significance Test for Estimated β
• If null hypothesis is true, there is no relation between
advertising spending and product sales

• Condition for rejecting null hypothesis: Absolute Value of t-Statistic > Critical t Value
Statistical Significance Test for Estimated β
• t-Statistic for β = 4.0975 (from Excel worksheet)
• Absolute value of t-Statistic = |4.0975| = 4.0975
• Critical t value = T.INV.2T(0.05,8) = 2.306
o 0.05: Significance level
o 8: Degrees of freedom = No. of observations (10) - No. of coefficients estimated (2)

• As 4.0975 > 2.306, the null hypothesis that “true β = 0” can be rejected
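
The test can be reproduced from two numbers in the Excel output, since the "t Stat" column is the coefficient divided by its standard error. In this sketch the critical value 2.306 is taken from the slide rather than computed, and the variable names are hypothetical.

```python
coefficient = 0.066642754   # estimated beta (X1 row, "Coefficients")
standard_error = 0.01626436 # X1 row, "Standard Error"
t_critical = 2.306          # T.INV.2T(0.05, 8), as quoted on the slide

# t statistic = estimate / standard error
t_stat = coefficient / standard_error

# Reject "true beta = 0" when |t| exceeds the critical value
reject_null = abs(t_stat) > t_critical

# The same critical value gives the 95% interval reported by Excel
lower_95 = coefficient - t_critical * standard_error
upper_95 = coefficient + t_critical * standard_error

print(f"t statistic = {t_stat:.4f}, reject null: {reject_null}")
print(f"95% interval for beta: {lower_95:.4f} to {upper_95:.4f}")
```

The interval matches the "Lower 95%" and "Upper 95%" columns up to rounding and does not contain 0, which is another way of seeing the same test result.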
R2 - Explanatory Power of the Model
• R2: Proportion of Y’s variation that can be explained by the regression model

• The higher the R2, the greater the proportion of Y’s variation
is explained by the model, the better its explanatory power
R2 - Explanatory Power of the Model
• R2 for our example is 0.6773 (i.e. 67.73% of the variation of
the product sales can be explained by our model)

• We may improve our model further by including more predictor variables
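
The R2 figure can be traced back to the ANOVA table in the Excel output: it is the regression sum of squares divided by the total sum of squares. A minimal sketch, with hypothetical variable names:

```python
# "SS" column of the ANOVA table for the simple regression
ss_regression = 8.617707776  # variation of Y explained by the model
ss_residual = 4.106292224    # variation left unexplained (sum of squared errors)
ss_total = 12.724            # total variation of Y

# Proportion of Y's variation explained by the model
r_squared = ss_regression / ss_total

# Equivalent form: one minus the unexplained proportion
r_squared_alt = 1 - ss_residual / ss_total

print(f"R2 = {r_squared:.4f} (alternative form: {r_squared_alt:.4f})")
```

The residual sum of squares here is the same quantity that the least squares method minimizes, so a better-fitting line means a higher R2.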
Multiple Linear Regression Model
Product Sales Again
• New predictor variable: GDP of a city

• Again the firm wants to predict its product sales in a new market. The advertising budget for
this market is $30m and its GDP is expected to be $2.8 trillion
[Figure: Selling in 100 Cities Now (cities 1 to 100). City 101: the firm plans to sell in a new city. Product sales in this new market?]
Specifying the Model
• Product Sales = 𝛼 + β1(Advertising Spending) + β2(GDP) + μ

• All coefficients are expected to be positive
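
With two predictors the least squares coefficients no longer have a one-line formula, but they can be found by solving the so-called normal equations. The sketch below is one way to do this in plain Python and is not a description of how Excel computes its results. The function name and the toy data are hypothetical; the data were constructed so that Y = 1 + 2(X1) + 3(X2) exactly.

```python
def fit_multiple_ols(x_rows, ys):
    """Return [alpha, beta1, beta2, ...] by solving the normal equations."""
    # Add a leading 1 to every row so that the intercept is estimated too
    rows = [[1.0] + list(r) for r in x_rows]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(rows, ys))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(a[idx][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [v - factor * p for v, p in zip(a[row], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


# Hypothetical toy data built from Y = 1 + 2*X1 + 3*X2 with no error term
x_rows = [[1, 1], [2, 1], [3, 2], [4, 3], [5, 5]]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in x_rows]

coefficients = fit_multiple_ols(x_rows, ys)
print([round(c, 6) for c in coefficients])
```

Because the toy data contain no error, the fit recovers the coefficients used to build them; with real data the estimates would only approximate the true values.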


Estimation Results
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.955813532
R Square             0.913579508
Adjusted R Square    0.888887939
Standard Error       0.396342986
Observations         10

ANOVA
             df   SS            MS         F          Significance F
Regression   2    11.62438566   5.812193   36.99965   0.00018974
Residual     7    1.099614338   0.157088
Total        9    12.724

            Coefficients   Standard Error   t Stat     P-value    Lower 95%      Upper 95%    Lower 95.0%   Upper 95.0%
Intercept   0.761431039    0.62243651       1.223307   0.260792   -0.710397428   2.23325951   -0.7103974    2.23325951
X1          0.049889688    0.009778604      5.101923   0.001397   0.026766963    0.07301241   0.02676696    0.07301241
X2          1.094593808    0.250196202      4.374942   0.003254   0.502973802    1.68621381   0.5029738     1.68621381

(X1 = Advertising Spending; X2 = GDP)
Estimation Results
• Estimated 𝛼 = 0.7614, β1 = 0.0499 & β2 = 1.0946

• Predicted Product Sales = 0.7614 + 0.0499(Advertising Spending) + 1.0946(GDP)
Interpreting Estimated 𝛼 = 0.7614
• In multiple linear regression, 𝛼 represents the value of Y
when ALL X = 0

• It means that even if the firm does not spend any money on advertising and GDP = $0, it is still predicted to achieve 0.7614 million
units of sales

• It has a positive sign, consistent with initial expectation


Interpreting Estimated β1 = 0.0499
• With more than 1 predictor variable, the interpretation of
each β is slightly different

• βi represents how much Y would change when Xi increases by 1 unit, while holding other predictor variables constant
Interpreting Estimated β1 = 0.0499
• In our context, β1 = 0.0499 means that, holding GDP constant, if
the firm spends $1 million more on advertising, the predicted
product sales increase by 0.0499 million units, or
49,900 units

• It has a positive sign, consistent with our expectation that advertising spending drives product sales positively
Interpreting Estimated β2 = 1.0946
• Similarly, β2 = 1.0946 means that, holding advertising spending
constant, if GDP increases by $1 trillion, the predicted product sales
increase by 1.0946 million units

• It also has a positive sign, consistent with our expectation that a higher GDP should drive product sales positively
Statistical Significance Test for Estimated β1
• Null hypothesis: True β1 = 0
• t-Statistic for estimated β1 = 5.1019
• Absolute value of t-Statistic = 5.1019
• Critical t value (5% significance level & 7 degrees of freedom) = 2.365

• As 5.1019 > 2.365, the null hypothesis that “true β1 = 0” can be rejected
Statistical Significance Test for Estimated β2
• Null hypothesis: True β2 = 0
• t-Statistic for estimated β2 = 4.3749
• Absolute value of t-Statistic = 4.3749
• Critical t value (5% significance level & 7 degrees of freedom) = 2.365

• As 4.3749 > 2.365, the null hypothesis that “true β2 = 0” can be rejected
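
The same rule can be applied to every row of the coefficient table at once. The sketch below takes the coefficients and standard errors from the Excel output and the critical value 2.365 from the slides; the dictionary layout and names are hypothetical. It also covers the intercept, whose reported t statistic (1.2233) is below the critical value, so by the same rule “true 𝛼 = 0” cannot be rejected.

```python
T_CRITICAL = 2.365  # 5% significance level, 7 degrees of freedom (from the slides)

# (coefficient, standard error) for each row of the Excel output
estimates = {
    "Intercept": (0.761431039, 0.62243651),
    "X1 (Advertising Spending)": (0.049889688, 0.009778604),
    "X2 (GDP)": (1.094593808, 0.250196202),
}

results = {}
for name, (coefficient, standard_error) in estimates.items():
    t_stat = coefficient / standard_error  # same as Excel's "t Stat" column
    reject_null = abs(t_stat) > T_CRITICAL
    results[name] = (t_stat, reject_null)
    print(f"{name}: t = {t_stat:.4f}, reject 'true value = 0': {reject_null}")
```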
R2
• By adding GDP to the model, R2 is raised from 0.6773 to 0.9136

• It means the explanatory power of our model has improved, as it can now explain 91.36% of the variation of product sales
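
Since R2 cannot fall when a predictor is added, the Excel output also reports an Adjusted R Square, which penalizes extra predictors. The sketch below applies the standard adjustment formula to the two R Square values from the Excel outputs; the function name is hypothetical.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Adjust R2 for the number of predictors used by the model."""
    unexplained = 1 - r_squared
    penalty = (n_observations - 1) / (n_observations - n_predictors - 1)
    return 1 - unexplained * penalty


# R Square values reported by Excel, both based on 10 observations
adj_simple = adjusted_r_squared(0.677279769, 10, 1)    # advertising only
adj_multiple = adjusted_r_squared(0.913579508, 10, 2)  # advertising and GDP

print(f"Adjusted R2, simple model:   {adj_simple:.4f}")
print(f"Adjusted R2, multiple model: {adj_multiple:.4f}")
```

Both results match the "Adjusted R Square" lines of the two outputs. The adjusted figure also rises, which suggests the improvement is not merely a by-product of having one more variable.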
Making Prediction with the Model
Predicted Product Sales of the New Market
= 0.7614 + 0.0499(Advertising Spending) + 1.0946(GDP)
= 0.7614 + 0.0499(30) + 1.0946(2.8)
= 0.7614 + 1.497 + 3.0649
= 5.3233 million units

Assumptions for new market:
• Advertising budget: $30m
• GDP: $2.8t
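
The calculation above can be wrapped in a small function so that other scenarios are easy to try. The sketch uses the rounded estimates from the slides; the function and variable names are hypothetical.

```python
ALPHA = 0.7614   # estimated intercept (million units)
BETA_1 = 0.0499  # per $1 million of advertising spending
BETA_2 = 1.0946  # per $1 trillion of GDP


def predict_sales(advertising_mil, gdp_trillion):
    """Predicted product sales in million units."""
    return ALPHA + BETA_1 * advertising_mil + BETA_2 * gdp_trillion


# New market: $30m advertising budget and $2.8t GDP
prediction = predict_sales(30.0, 2.8)
print(f"Predicted sales: {prediction:.4f} million units")

# Holding GDP constant, $1 million more advertising adds beta1 to the prediction
advertising_effect = predict_sales(31.0, 2.8) - predict_sales(30.0, 2.8)
print(f"Effect of $1 million more advertising: {advertising_effect:.4f} million units")
```

The second calculation mirrors the "holding other predictor variables constant" interpretation of β1: only advertising spending changes between the two predictions.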
