Predicting Pregnancies of Our Customers I - Regression Model

Demystifying Data-Driven
Strategies and Policies with Excel

(GFQR 1046)
Predicting Pregnancies of Our Customers I:
Regression Model
The Topic this Week: Regression Model
• The Target Story Revisited
• Simple Linear Regression Model
• Multiple Linear Regression Model
Revisiting The Target Story
• A high school girl in the US became pregnant, but her father did not know at first
• One day he browsed the girl’s mobile phone and found that Target was sending her coupons for baby goods
• The father got furious and rushed to a Target store in Minnesota to argue with its manager, “My daughter
got this in the mail! She’s still in high school, and you’re sending her coupons for baby clothes and cribs?
Are you trying to encourage her to get pregnant?”
• The store subsequently checked with the HQ and found that the girl received the baby goods coupons as
the data analytics team predicted she had a high probability of being pregnant
• The store manager called the father to apologize, but the father apologized instead, “I had a talk with my
daughter. It turns out there’s been some activities in my house I haven’t been completely aware of ……
She’s due in August. I owe you an apology.”
Questions to Think About
• How can we predict which of our customers are pregnant?
• What problems may arise when we send baby-goods coupons to
those we believe are pregnant? How can we address those
problems?
Photos Source: https://www.youtube.com/watch?v=f2Kji24833Y
Simple Linear
Regression Model
Regression Model
• Regression refers to a set of statistical models that
explore the possible dependence of one variable (the response
variable, denoted by Y) on one or more other variables
(the predictor variables, denoted by X)
What Can We Do with Regression Model?
• Establishing and quantifying the relation between the
response variable and a predictor variable
• Making prediction on the response variable

Types of Regression Model
• Simple linear regression model
o Only 1 predictor variable
• Multiple linear regression model
o More than 1 predictor variable
Simple Linear Regression Model
Y = 𝛼 + βX + μ
• Y: Response variable
• 𝛼 & β: Coefficients
• X: Predictor variable
• 𝛼 + βX: Prediction on Y
• μ: Prediction error
Basic Concepts
• Y (the response variable) is the variable whose value we are
interested in predicting
• We predict Y by linking it with another variable X (the
predictor variable) which we believe will affect the value of Y
based on some theoretical considerations
Basic Concepts
• If we can collect data for both X and Y, we can apply the
regression methodology to estimate 𝛼 & β (i.e. the
coefficients of the regression model)
• On top of estimation, the regression methodology also
includes tools for evaluating whether our estimates for 𝛼 & β
and the whole model are valid
Basic Concepts
• If the estimates for 𝛼 & β and the whole model are found to
be valid, we are ready to make predictions of Y
What Does “Linear” Mean?
• Both 𝛼 & β enter the model to the power of 1 (i.e. the model is
linear in its coefficients)
• If we plot the model in a graph, it is simply a straight line
• Estimating a simple linear regression model is therefore
fitting a straight line to the data in a proper way
Simple Linear Regression - An Example
• A firm selling consumer goods plans to commence sales in a
new city
• It wants to have a good prediction about the sales it can
achieve in this new market
[Figure: the firm is selling in 100 cities now (City 1 to City 100), modelled by Y = 𝛼 + βX + μ. City 101: plans to sell in a new city. Product sales in this new market?]
Key Steps of Simple Linear Regression
• Identifying the relevant X
• Collecting and exploring the data for X and Y
• Estimating 𝛼 & β
• Testing the validity of the model
• Using the model for making prediction
Identifying the Relevant X
• The firm believes the sales of its products are driven by its
advertising spending
• The more it spends on advertising, the more sales it would
expect for its products (i.e. positive relation)
• Y = Product sales; X = Advertising spending

Collecting Data for X & Y
• The firm draws a SAMPLE of 10 cities in which it currently
sells and collects last year’s data on product sales and
advertising spending
• This data set is called the training data; it is composed of
10 observations and 2 variables
Exploring the Data - Advertising Spending
• Advertising spending is measured in million $
• It ranges from $11.7 mil (City H) to $55.7 mil (City E)
• Average = $41.6 mil

Exploring the Data - Product Sales
• Product sales are measured in million units
• They range from 3.8 mil units (City B) to 7.6 mil units (City E)
• Average = 5.6 mil units

Exploring the Data - Scatter Plot
• Positive relation between advertising spending & product sales
• The two are linearly related (i.e. a straight line is a
good description of their relation)
Model Specification
• Product Sales = 𝛼 + β(Advertising Spending) + μ
• Both 𝛼 & β are expected to be positive (Why?)

Estimating Regression Model - The Least Square Method
• Fit any straight line to the data, which represents one of the
many possible models
• Find the vertical distances between all the points & the
straight line, which are the errors
• Find the sum of squared errors (i.e. square all the errors and
sum them up)
• The least square method dictates that the best line fitted to the
data should minimize the sum of squared errors
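The least square idea can be sketched in a few lines of Python. The slope and intercept formulas below are the standard closed-form solution that minimizes the sum of squared errors for a straight line. The function names and the four data points are made up for illustration; they are not the firm's 10-city sample.

```python
def fit_simple_ols(xs, ys):
    """Return (alpha, beta) minimizing the sum of squared errors for Y = alpha + beta*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of X and Y divided by the variation of X
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    # Intercept: the fitted line passes through (mean_x, mean_y)
    alpha = mean_y - beta * mean_x
    return alpha, beta


def sum_squared_errors(xs, ys, alpha, beta):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data (NOT the data from the slides)
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [3.1, 4.9, 7.2, 8.8]

alpha_hat, beta_hat = fit_simple_ols(toy_x, toy_y)
best_sse = sum_squared_errors(toy_x, toy_y, alpha_hat, beta_hat)
# A different line (here, a slightly steeper one) gives a larger sum of squared errors
other_sse = sum_squared_errors(toy_x, toy_y, alpha_hat, beta_hat + 0.1)
print(alpha_hat, beta_hat, best_sse, other_sse)
```

Comparing `best_sse` with `other_sse` shows the point of the method: moving away from the fitted line increases the sum of squared errors.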
Estimating Regression Model in Excel
1. Go to “Data”
2. Press the “Data Analysis” button and choose “Regression”
Estimating Regression Model in Excel
3. Enter the data for product sales as the Y range
4. Enter the data for advertising spending as the X range
5. Choose to display the output in a new worksheet and press “OK”
Estimation Results
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.82297009
R Square            0.677279769
Adjusted R Square   0.63693974
Standard Error      0.716440178
Observations        10

ANOVA
             df   SS            MS         F          Significance F
Regression   1    8.617707776   8.617708   16.78927   0.00344979
Residual     8    4.106292224   0.513287
Total        9    12.724

            Coefficients   Standard Errors   t Stat     P-value    Lower 95%    Upper 95%    Lower 95.0%   Upper 95.0%
Intercept   2.866328588    0.713829783       4.015423   0.003866   1.22023416   4.51242302   1.22023416    4.51242302
X1          0.066642754    0.01626436        4.097472   0.00345    0.02913707   0.10414844   0.02913707    0.10414844

(Slide callouts highlight R Square, the estimated 𝛼 and β in the Coefficients column, and the t statistics in the t Stat column.)
Estimation Results
• Estimated 𝛼 = 2.8663; Estimated β = 0.0666
• Predicted Product Sales = 2.8663 + 0.0666(Advertising Spending)
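As a rough illustration, the estimated equation can be wrapped in a small Python function. The coefficients are the rounded estimates above; the function name `predict_sales` and the three budgets are hypothetical, chosen inside the sampled range of $11.7 mil to $55.7 mil.

```python
# Estimated coefficients from the Excel output (rounded to 4 decimals)
ALPHA_HAT = 2.8663
BETA_HAT = 0.0666


def predict_sales(advertising_mil):
    """Predicted product sales (million units) for advertising spending in $ million."""
    return ALPHA_HAT + BETA_HAT * advertising_mil


# Hypothetical advertising budgets, in $ million
for budget in (20, 30, 40):
    print(budget, round(predict_sales(budget), 4))
```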

Interpreting Estimated 𝛼 = 2.8663
• 𝛼 represents the value of Y when X = 0
• In the context of our example, it means that even if the firm
does not spend any money on advertising (i.e. X = 0), it is
still predicted to achieve 2.8663 million units of sales
• Its positive sign is in line with our initial expectation

Interpreting the Estimated β = 0.0666
• β represents how much Y would change when X increases by
1 unit
• In the context of our example, it means that when the firm
spends $1 million more on advertising, product sales are
predicted to increase by 0.0666 million units, or 66,600 units
• Its positive sign is in line with our initial expectation

Things to Check Before Using the Model to
Make Prediction
• Statistical significance tests of the estimated 𝛼 & β
• R Square
• Others
Statistical Significance Test for Estimated β
• The estimated β = 0.0666 is based on a sample of 10 cities
• Using another sample, or even the data of all cities, for
estimation, we may obtain different results (i.e. estimation
based on sample data is subject to sampling error)
• In particular, although the estimated β = 0.0666, it is still
possible that the true β (i.e. the β value for all cities) = 0
• Statistical significance test for β: To test whether we can
reject the hypothesis “true β = 0” (i.e. the null hypothesis)
• If the null hypothesis is true, there is no relation between
advertising spending and product sales
• Condition for rejecting the null hypothesis:
Absolute Value of t-Statistic > Critical t Value
Statistical Significance Test for Estimated β
• t-Statistic for β = 4.0975 (from Excel worksheet)
• Absolute value of t-Statistic = |4.0975| = 4.0975
• Critical t value = T.INV.2T(0.05,8) = 2.306
o 0.05: Significance level
o 8: Degrees of freedom = No. of observations (10) -
No. of coefficients estimated (2)
• As 4.0975 > 2.306, the null hypothesis that “true β = 0”
can be rejected
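The test above can be reproduced with a short Python sketch. It assumes the usual definition of the t-statistic for the null hypothesis “true β = 0” (estimated coefficient divided by its standard error) and takes the critical value 2.306 as quoted from T.INV.2T rather than recomputing it. Variable names are illustrative.

```python
# Values copied from the Excel regression output for the advertising coefficient
beta_hat = 0.066642754
standard_error = 0.01626436

n_observations = 10
n_coefficients = 2  # alpha and beta
degrees_of_freedom = n_observations - n_coefficients

# Critical value quoted in the text: T.INV.2T(0.05, 8) = 2.306
critical_t = 2.306

# t-statistic = estimated coefficient / its standard error
t_stat = beta_hat / standard_error
reject_null = abs(t_stat) > critical_t

print(degrees_of_freedom, round(t_stat, 4), reject_null)
```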
R2 - Explanatory Power of the Model
• R2: Proportion of Y’s variation that can be explained by the
regression model
• The higher the R2, the greater the proportion of Y’s variation
explained by the model and the better its explanatory power
R2 - Explanatory Power of the Model
• R2 for our example is 0.6773 (i.e. 67.73% of the variation of
the product sales can be explained by our model)
• We may improve our model further by including more
predictor variables
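The R Square figure can be traced back to the ANOVA table of the Excel output. The sketch below recomputes it as the regression sum of squares divided by the total sum of squares. It also recomputes Adjusted R Square with the standard adjustment formula, which the slides report but do not discuss. Variable names are illustrative.

```python
# Sums of squares from the ANOVA table of the simple regression output
ss_regression = 8.617707776
ss_residual = 4.106292224
ss_total = ss_regression + ss_residual  # reported as 12.724 in the table

n_observations = 10
n_predictors = 1

# R Square: share of Y's total variation explained by the model
r_square = ss_regression / ss_total

# Adjusted R Square penalizes the number of predictors in the model
adjusted_r_square = 1 - (1 - r_square) * (n_observations - 1) / (
    n_observations - n_predictors - 1
)

print(round(r_square, 4), round(adjusted_r_square, 4))
```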
Multiple Linear Regression Model
Product Sales Again
• New predictor variable: GDP of a city
• Again, the firm wants to predict its product sales in a new
market. The advertising budget for this market is $30m and
its GDP is expected to be $2.8 trillion
[Figure: the firm is selling in 100 cities now (City 1 to City 100), modelled by Y = 𝛼 + βX + μ. City 101: plans to sell in a new city. Product sales in this new market?]
Specifying the Model
• Product Sales = 𝛼 + β1(Advertising Spending) + β2(GDP) + μ
• All coefficients are expected to be positive

Estimation Results
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.955813532
R Square            0.913579508
Adjusted R Square   0.888887939
Standard Error      0.396342986
Observations        10

ANOVA
             df   SS            MS         F          Significance F
Regression   2    11.62438566   5.812193   36.99965   0.00018974
Residual     7    1.099614338   0.157088
Total        9    12.724

            Coefficients   Standard Errors   t Stat     P-value    Lower 95%      Upper 95%    Lower 95.0%   Upper 95.0%
Intercept   0.761431039    0.62243651        1.223307   0.260792   -0.710397428   2.23325951   -0.7103974    2.23325951
X1          0.049889688    0.009778604       5.101923   0.001397   0.026766963    0.07301241   0.02676696    0.07301241
X2          1.094593808    0.250196202       4.374942   0.003254   0.502973802    1.68621381   0.5029738     1.68621381
Estimation Results
• Estimated 𝛼 = 0.7614, β1 = 0.0499 & β2 = 1.0946
• Predicted Product Sales = 0.7614 + 0.0499(Advertising
Spending) + 1.0946(GDP)
Interpreting Estimated 𝛼 = 0.7614
• In multiple linear regression, 𝛼 represents the value of Y
when ALL X=0
• It means that even if the firm does not spend any money on
advertising and GDP = $0, predicted sales are still 0.7614
million units
• It has a positive sign, consistent with initial expectation

Interpreting Estimated β1 = 0.0499
• With more than 1 predictor variable, the interpretation of
each β is slightly different
• βi represents how much Y would change when Xi increases by
1 unit, while holding the other predictor variables constant
• In our context, β1 = 0.0499 means that, holding GDP constant,
if the firm spends $1 million more on advertising, product
sales are predicted to increase by 0.0499 million units, or
49,900 units
• It has a positive sign, consistent with our expectation that
advertising spending drives product sales positively
• Similarly, β2 = 1.0946 means that, holding advertising spending
constant, if GDP increases by $1 trillion, product sales are
predicted to increase by 1.0946 million units
• It also has a positive sign, consistent with our expectation
that a higher GDP should drive product sales positively
Statistical Significance Test for Estimated β1
• Null hypothesis: True β1 = 0
• t-Statistic for estimated β1 = 5.1019
• Absolute value of t-Statistic = 5.1019
• Critical t value (5% significance level & 7 degrees of
freedom) = 2.365
• As 5.1019 > 2.365, the null hypothesis that “true β1 = 0”
can be rejected
Statistical Significance Test for Estimated β2
• Null hypothesis: True β2 = 0
• t-Statistic for estimated β2 = 4.3749
• Absolute value of t-Statistic = 4.3749
• Critical t value (5% significance level & 7
degrees of freedom) = 2.365
• As 4.3749 > 2.365, the null hypothesis that
“true β2 = 0” can be rejected
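Both tests follow the same recipe, so they can be run in one loop. The sketch below recomputes each t-statistic as coefficient divided by standard error, using the figures in the Excel output, and compares it with the quoted critical value of 2.365. The intercept row is included only for comparison; the slides test β1 and β2 only. The dictionary layout and labels are illustrative.

```python
# (coefficient, standard error) pairs copied from the multiple regression output
estimates = {
    "Intercept": (0.761431039, 0.62243651),
    "X1 (Advertising Spending)": (0.049889688, 0.009778604),
    "X2 (GDP)": (1.094593808, 0.250196202),
}

# Critical t value quoted in the text: 5% significance level, 7 degrees of freedom
critical_t = 2.365

results = {}
for name, (coefficient, standard_error) in estimates.items():
    # t-statistic for the null hypothesis that the true coefficient is 0
    t_stat = coefficient / standard_error
    results[name] = (t_stat, abs(t_stat) > critical_t)
    print(name, round(t_stat, 4), results[name][1])
```

By this rule the intercept, with a t-statistic of about 1.22, does not pass the test at the 5% level, which agrees with its P-value of 0.260792 in the output.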
R2
• By adding GDP to the model, R2 is raised from 0.6773 to 0.9136
• It means the explanatory power of our model has improved
as it can now explain 91.36% of the variation of product sales
Making Prediction with the Model
Assumptions for the new market:
• Advertising budget: $30m
• GDP: $2.8t
Predicted Product Sales of the New Market
= 0.7614 + 0.0499(Advertising Spending) + 1.0946(GDP)
= 0.7614 + 0.0499(30) + 1.0946(2.8)
= 0.7614 + 1.497 + 3.0649
= 5.3233 million units
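The same arithmetic can be checked with a minimal Python sketch. It uses the rounded coefficient estimates and the new-market assumptions stated above; the function and constant names are hypothetical.

```python
# Rounded estimates from the multiple regression output
ALPHA_HAT = 0.7614
BETA_ADVERTISING = 0.0499  # per $1 million of advertising spending
BETA_GDP = 1.0946          # per $1 trillion of GDP


def predict_sales(advertising_mil, gdp_trillion):
    """Predicted product sales in million units."""
    return ALPHA_HAT + BETA_ADVERTISING * advertising_mil + BETA_GDP * gdp_trillion


# New market: $30m advertising budget, $2.8t GDP
new_market_sales = predict_sales(advertising_mil=30, gdp_trillion=2.8)
print(round(new_market_sales, 4))
```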
