
Quantitative Methods for Business
COMM5005
Lecture 8
Statistics Flow Chart

[Flow chart mapping the course's statistics topics to lectures:]
• Descriptive statistics (Lecture 5): moments of variables (first moment: mean; second moment: variance; third moment: skewness) and the bi-variate distribution
• Probability and distributions of variables (Lecture 6): binomial distribution, uniform distribution, normal distribution
• Inferential statistics: sampling distributions (Lecture 6); confidence interval estimation and hypothesis testing (Lecture 7); simple linear regression and multiple linear regression (Lecture 8)

Lecture 8 topics

In this lecture we will cover:


• Simple linear regression
• Introduction to multiple regression
Objectives

• Conduct a simple regression and interpret the meaning of the regression coefficients
• Use regression analysis to make predictions and inferences
• Evaluate the assumptions of regression analysis
• Construct a multiple regression model and analyse model output
• Incorporate categorical variables into a regression model
Readings

Sections of Berenson, M. et al., 5th ed., will help you to understand this week's topics more clearly.

Chapter          Name                                  Pages
12.1–12.5        Simple linear regression              455–476
13.1–13.4, 13.6  Introduction to multiple regression   504–519, 525–534
1. Simple linear regression
Motivating example:
Does the number of COVID-19 cases in Sydney affect consumer demand in Sydney? If so, how large is the impact? What is the impact of an increase of 100 cases in a week?

What do you need to answer these questions?

• Data on weekly consumer expenditure and the number of COVID-19 cases for the last 6 months in Sydney.
• A regression model to conduct regression analysis.
Regression analysis
• Regression analysis is used to:
• predict the value of a dependent variable (Y) based on the
value of at least one independent variable (X)
• explain the impact of changes in an independent variable on
the dependent variable
• Dependent variable (Y): the variable we wish to predict or
explain (response variable)
• Independent variable (X): the variable used to explain the
dependent variable (explanatory variable)
Introduction to regression analysis

Simple linear regression:


• Only one independent variable (X)
• Relationship between X and Y is described by a linear
function
• Changes in Y are assumed to be caused by changes in X
Types of linear relationships

Examples of types of relationships found in scatter diagrams


Simple linear regression model

Population model: Yi = β0 + β1Xi + εi, where β0 is the population Y-intercept, β1 is the population slope coefficient and εi is the random error term.
Prediction line

• The simple linear regression equation, Ŷi = b0 + b1Xi, provides an estimate of the population regression line.
• The individual random error terms ei have a mean of zero.


Estimation: Least squares method
• b0 and b1 are the values that minimise the sum of the squared differences between the actual values (Yi) and the predicted values (Ŷi):

min Σ(Yi - Ŷi)² = min Σ(Yi - (b0 + b1Xi))²

• b0 is the estimated average value of Y when the value of X is zero
• b1 is the estimated change in the average value of Y as a result of a one-unit change in X
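
This minimisation has a closed-form solution: b1 = Σ(Xi - X̄)(Yi - Ȳ) / Σ(Xi - X̄)² and b0 = Ȳ - b1X̄. As a minimal sketch beyond the slides (assuming numpy is available):

```python
import numpy as np

def least_squares(x, y):
    """Closed-form least-squares estimates (b0, b1) for y ≈ b0 + b1*x."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1
```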
Simple linear regression example (1 of 6)

• A manager of a local computer games store wishes to:


• examine the relationship between weekly sales (Y) and the number of
customers making purchases (X) over a 10 week period; and
• use the results of that examination to predict future weekly sales
Simple linear regression example (2 of 6)

Weekly sales in $1,000s (Y)   Number of customers (X)
245                           1400
312                           1600
279                           1700
308                           1875
199                           1100
219                           1550
405                           2350
324                           2450
319                           1425
255                           1700

[Scatter diagram: weekly sales ($1,000s) against number of customers (0–3,000)]
Simple linear regression example (3 of 6)
Simple linear regression example (4 of 6)
Simple linear regression example (5 of 6)

Weekly sales = 98.24833 + 0.10977 (customers)

b0 is the estimated average value of Y when the value of X is zero (if X = 0 is in the range of observed X values).
• Here, for no customers, b0 = 98.2483, which appears nonsensical. However, the intercept simply indicates that, over the sample selected, the portion of weekly sales not explained by the number of customers is $98,248.33. Also note that X = 0 is outside the range of observed values.
b1 measures the estimated change in the average value of Y as a result of a one-unit change in X.
• Here, b1 = 0.10977 tells us that weekly sales increase, on average, by 0.10977 × $1,000 = $109.77 for each additional customer.
Simple linear regression example (6 of 6)

• Predict the weekly sales for the local store for 2,000 customers:

Weekly sales = 98.25 + 0.1098 × 2000
             = 98.25 + 219.60
             = 317.85

• The predicted weekly sales for the local computer games store for 2,000 customers is 317.85 ($1,000s) = $317,850.
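
As an illustrative check (not part of the slides; assumes numpy is available), fitting the line to the data from slide 2 of 6 reproduces these figures:

```python
import numpy as np

customers = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
sales = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])  # $1,000s

b1, b0 = np.polyfit(customers, sales, deg=1)  # returns slope first, then intercept
print(f"b0 = {b0:.5f}, b1 = {b1:.5f}")        # b0 = 98.24833, b1 = 0.10977

y_hat = b0 + b1 * 2000                        # prediction for 2,000 customers
print(f"{y_hat:.2f}")                         # ≈ 317.78 ($1,000s); the slide's 317.85
                                              # uses the rounded coefficients 98.25 and 0.1098
```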
Measures of variation
• Total variation is made up of two parts:

SST = SSR + SSE

SST = Σ(Yi - Ȳ)²   (total sum of squares): measures the variation of the Yi values around their mean Ȳ
SSR = Σ(Ŷi - Ȳ)²   (regression sum of squares): explained variation attributable to the relationship between X and Y
SSE = Σ(Yi - Ŷi)²  (error sum of squares): variation attributable to factors other than the relationship between X and Y
Coefficient of determination, r2
The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable. It is also called r-squared and is denoted r²:

r² = SSR / SST = regression sum of squares / total sum of squares

Note: 0 ≤ r² ≤ 1
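
As an illustrative sketch (not in the slides; assumes numpy), the decomposition and r² for the games-store example:

```python
import numpy as np

customers = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
sales = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

b1, b0 = np.polyfit(customers, sales, deg=1)
fitted = b0 + b1 * customers

sst = np.sum((sales - sales.mean()) ** 2)  # total sum of squares
sse = np.sum((sales - fitted) ** 2)        # error sum of squares
ssr = sst - sse                            # regression sum of squares
print(f"r2 = {ssr / sst:.3f}")             # ≈ 0.581: about 58% of the variation in
                                           # weekly sales is explained by customers
```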
Assumptions
Use the acronym LINE:
Linearity
• The underlying relationship between X and Y is linear

Independence of errors
• Error values are statistically independent

Normality of error
• Error values (ε) are normally distributed for any given value of X

Equal variance (homoscedasticity)


• The probability distribution of the errors has constant variance
Residual analysis for linearity
Residual analysis for independence
Residual analysis for normality
A normal probability plot of the residuals can be used to check for normality.
Residual analysis for equal variance (homoscedasticity)
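
The residual plots from these slides are not reproduced in this text; a minimal sketch of how such plots can be produced (assuming numpy, scipy and matplotlib are available):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

customers = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
sales = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

b1, b0 = np.polyfit(customers, sales, deg=1)
residuals = sales - (b0 + b1 * customers)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(customers, residuals)    # no pattern => linearity and equal variance look OK
ax1.axhline(0, linestyle="--")
ax1.set(xlabel="Number of customers (X)", ylabel="Residual")
stats.probplot(residuals, plot=ax2)  # roughly straight line => normality looks OK
plt.show()
```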
Now, please click the link in Moodle to complete the MyExperience survey.



2. Multiple regression model

Idea: examine the linear relationship between one dependent variable (Y) and two or more independent variables (Xi).

Multiple regression model with k independent variables:

Yi = β0 + β1X1i + β2X2i + … + βkXki + εi

where β0 is the Y-intercept, β1 … βk are the population slopes and εi is the random error.


Multiple Regression Equation (1 of 2)
The coefficients of the multiple regression model are estimated using sample data.

Multiple regression equation with k independent variables:

Ŷi = b0 + b1X1i + b2X2i + … + bkXki

where Ŷi is the estimated (or predicted) value of Y, b0 is the estimated intercept and b1 … bk are the estimated slope coefficients.


We will always use Excel to obtain the regression slope coefficients and other
regression summary measures
Multiple Regression Equation (2 of 2)
Two-variable model: Sales = b0 + b1 (Price) + b2 (Advertising)

A distributor of frozen dessert pies wants to evaluate factors thought to influence demand.
• Dependent variable: pie sales (units per week)
• Independent variables: price (in $) and advertising ($100s)
• Data are collected for 15 weeks

Week   Pie Sales   Price ($)   Advertising ($100s)
1      350         5.50        3.3
2      460         7.50        3.3
3      350         8.00        3.0
4      430         8.00        4.5
5      350         6.80        3.0
6      380         7.50        4.0
7      430         4.50        3.0
8      470         6.40        3.7
9      450         7.00        3.5
10     490         5.00        4.0
11     340         7.20        3.5
12     300         7.90        3.2
13     440         5.90        4.0
14     450         5.00        3.5
15     300         7.00        2.7
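
We will use Excel in this course, but as an illustrative cross-check (assuming numpy is available) the same least-squares estimates can be computed directly; this should reproduce the output on the next slide:

```python
import numpy as np

price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40,
                  7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
advertising = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7,
                        3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])
sales = np.array([350, 460, 350, 430, 350, 380, 430, 470,
                  450, 490, 340, 300, 440, 450, 300])

X = np.column_stack([np.ones_like(price), price, advertising])  # intercept column first
b, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(b)  # ≈ [306.526, -24.975, 74.131]
```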
Multiple Regression Output

Sales = 306.526 - 24.975(Price) + 74.131(Advertising)

Regression Statistics
Multiple R          0.72213
R Square            0.52148
Adjusted R Square   0.44172
Standard Error      47.46341
Observations        15

ANOVA        df   SS          MS          F         Significance F
Regression   2    29460.027   14730.010   6.53861   0.01201
Residual     12   27033.306   2252.776
Total        14   56493.333

              Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
Intercept     306.52619      114.25389        2.68285    0.01993   57.58835    555.46404
Price         -24.97509      10.83213         -2.30565   0.03979   -48.57626   -1.37392
Advertising   74.13096       25.96732         2.85478    0.01449   17.55303    130.70888


The multiple regression equation

Sales = 306.526 - 24.975(Price) + 74.131(Advertising)

• b1 = -24.975: sales will decrease, on average, by 24.975 pies per week for each $1 increase in selling price, keeping all other factors fixed.
• b2 = 74.131: sales will increase, on average, by 74.131 pies per week for each $100 increase in advertising, keeping all other factors fixed.
Using the equation to make predictions
Predict sales for a week in which the selling price is $5.50 and advertising is $350:

Sales = 306.526 - 24.975(Price) + 74.131(Advertising)
      = 306.526 - 24.975(5.50) + 74.131(3.5)
      = 428.62

Predicted sales is 428.62 pies. Note: advertising is in $100s, so $350 means that X2 = 3.5.
Coefficient of multiple determination
Reports the proportion of total variation in Y explained by all X variables taken together:

r² = SSR / SST = regression sum of squares / total sum of squares
Coefficient of multiple determination

From the regression output above, R Square = 0.52148: about 52.1% of the variation in pie sales is explained by the variation in price and advertising.
Adjusted r2 (1 of 3)
• r2 never decreases when a new X variable is added to the
model
• this can be a disadvantage when comparing models

• What is the net effect of adding a new variable?


• we lose a degree of freedom when a new X variable is
added
• did the new X variable add enough explanatory power to
offset the loss of one degree of freedom?
Adjusted r2 (2 of 3)
Adjusted r² shows the proportion of variation in Y explained by all X variables, adjusted for the number of X variables used:

r²adj = 1 - (1 - r²) × [(n - 1) / (n - k - 1)]

(where n = sample size, k = number of independent variables)
• Penalises excessive use of unimportant independent variables
• Smaller than r²
• Useful for comparing models
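
A quick arithmetic check of this formula against the pie-sales output (plain Python; values taken from the Excel output):

```python
n, k, r2 = 15, 2, 0.52148          # from the regression output
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(r2_adj, 5))            # 0.44173, matching Excel's 0.44172 up to rounding
```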
Adjusted r2 (3 of 3)

From the regression output above, Adjusted R Square = 0.44172: about 44.2% of the variation in pie sales is explained by the model, after adjusting for the number of X variables and the sample size.
Is the model significant?
• F Test for Overall Significance of the Model
• Shows if there is a linear relationship between all of the X variables
considered together and Y

Hypotheses:
H0: β1 = β2 = … = βk = 0 (no linear relationship)
H1: at least one βi ≠ 0 (at least one independent variable
affects Y)
F test for overall significance (1 of 3)
Test statistic:

F = MSR / MSE = (SSR / k) / (SSE / (n - k - 1))

where MSR is the mean square of the regression and MSE is the mean square of the error, and F has:
• k degrees of freedom in the numerator, and
• (n - k - 1) degrees of freedom in the denominator
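
An illustrative recomputation of the F test for the pie-sales model from the ANOVA sums of squares (assumes scipy is available):

```python
from scipy import stats

ssr, sse, n, k = 29460.027, 27033.306, 15, 2   # from the ANOVA table
msr, mse = ssr / k, sse / (n - k - 1)
F = msr / mse
p = stats.f.sf(F, k, n - k - 1)                # upper-tail probability
print(f"F = {F:.4f}, p = {p:.5f}")             # F ≈ 6.5386, p ≈ 0.01201
```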
F test for overall significance (2 of 3)

From the ANOVA table in the regression output: F = MSR / MSE = 14730.010 / 2252.776 = 6.53861, with Significance F (the p-value) = 0.01201.
F test for overall significance (3 of 3)

Since the p-value (0.01201) is less than α = 0.05, we reject H0 and conclude that at least one independent variable affects Y: the model is significant.
Are individual variables significant?

We want to examine if there is a linear relationship between the


variable Xj and Y.

Hypotheses:
H0: βj = 0 (no linear relationship)
H1: βj ≠ 0 (linear relationship does exist)

Use t tests of individual variable slopes (between Xj and Y)


Are individual variables significant?
H0: βj = 0 (no linear relationship)
H1: βj ≠ 0 (linear relationship does exist between Xj and Y)

Test statistic:

t = (bj - 0) / Sbj

where the degrees of freedom for this statistic are n - k - 1, and Sbj is the standard error of the estimate bj.
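
For example, the t statistic and p-value for Price can be recomputed from the reported coefficient and standard error (an illustrative sketch; assumes scipy):

```python
from scipy import stats

b_price, se_price, df = -24.97509, 10.83213, 12   # from the output; df = n - k - 1
t = b_price / se_price
p = 2 * stats.t.sf(abs(t), df)                    # two-tailed p-value
print(f"t = {t:.4f}, p = {p:.4f}")                # t ≈ -2.3057, p ≈ 0.0398
```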
Are individual variables significant?

From the regression output above:
• t-value for Price: t = -2.306, with p-value 0.0398
• t-value for Advertising: t = 2.855, with p-value 0.0145
Are individual variables significant?
From the Excel output:

H0: βj = 0 versus H1: βj ≠ 0

              Coefficients   Standard Error   t Stat     P-value
Price         -24.97509      10.83213         -2.30565   0.03979
Advertising   74.13096       25.96732         2.85478    0.01449

d.f. = 15 - 2 - 1 = 12; with α = 0.05, tα/2 = 2.1788.

Check: the t statistic for each variable falls in the rejection region (|t| > 2.1788; both p-values < 0.05).
Conclusion: reject H0 for each variable. There is evidence that both Price and Advertising affect pie sales at α = 0.05.
Confidence Interval Estimate for the Slope
Confidence interval for the population slope βj:

bj ± t(n-k-1) × Sbj

where t(n-k-1) is the critical t value with (n - k - 1) d.f.; here, t has (15 - 2 - 1) = 12 d.f.

              Coefficients   Standard Error
Intercept     306.52619      114.25389
Price         -24.97509      10.83213
Advertising   74.13096       25.96732

Example: form a 95% confidence interval for the effect of changes in price (X1) on pie sales:
-24.975 ± (2.1788)(10.832), so the interval is (-48.576, -1.374).
(This interval does not contain zero, so price has a significant effect on sales.)
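
The same interval can be reproduced directly (an illustrative sketch; assumes scipy):

```python
from scipy import stats

b, se, df = -24.97509, 10.83213, 12       # Price coefficient, its SE, n - k - 1
t_crit = stats.t.ppf(0.975, df)           # ≈ 2.1788 for a 95% interval
print(b - t_crit * se, b + t_crit * se)   # ≈ (-48.576, -1.374)
```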
Confidence Interval Estimate for the Slope
Confidence interval for the population slope βj: the Excel output also reports the interval endpoints:

              Coefficients   Standard Error   …   Lower 95%   Upper 95%
Intercept     306.52619      114.25389        …   57.58835    555.46404
Price         -24.97509      10.83213         …   -48.57626   -1.37392
Advertising   74.13096       25.96732         …   17.55303    130.70888

Weekly sales are estimated to fall by between 1.37 and 48.58 pies for each $1 increase in the selling price.
Using dummy variables (1 of 3)
A dummy variable is a categorical explanatory variable with two levels:
• yes or no, on or off, male or female
• coded as 0 or 1
• if there are more than two levels, the number of dummy variables needed is the number of levels minus 1
Using dummy variables (2 of 3)
Example (with 2 levels):

Ŷ = b0 + b1X1 + b2X2

Let: Y = pie sales
X1 = price
X2 = holiday (dummy variable)
(X2 = 1 if a holiday occurred during the week)
(X2 = 0 if there was no holiday that week)
Using dummy variables (3 of 3)

Ŷ = b0 + b1X1 + b2(1) = (b0 + b2) + b1X1   (Holiday)
Ŷ = b0 + b1X1 + b2(0) = b0 + b1X1          (No holiday)

The two prediction lines have different intercepts but the same slope: b0 + b2 for holiday weeks (X2 = 1) and b0 for non-holiday weeks (X2 = 0). If H0: β2 = 0 is rejected, then 'Holiday' has a significant effect on pie sales.
[Chart: Y (sales) against X1 (price), showing the two parallel lines]
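
A minimal sketch of fitting this dummy-variable model (assuming numpy; the holiday indicator below is hypothetical, invented purely for illustration):

```python
import numpy as np

price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40,
                  7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
sales = np.array([350, 460, 350, 430, 350, 380, 430, 470,
                  450, 490, 340, 300, 440, 450, 300])
holiday = np.array([0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0])  # hypothetical 0/1 coding

X = np.column_stack([np.ones_like(price), price, holiday])
b, *_ = np.linalg.lstsq(X, sales, rcond=None)
# b[0]        : intercept for non-holiday weeks
# b[0] + b[2] : intercept for holiday weeks (same slope b[1] on price)
```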
Notices
• The final online test will be held on 5 December 2022 from 9.00 am to 1.00 pm. See the Moodle site for details.
• Topics examined in the final exam will be mostly statistics (80% of questions), as well as financial mathematics and calculus.
• This is an open-book test, so the textbook and notes are allowed. You can also use Excel and any calculator you want.
• The last eLearning tutorial and online quiz #3 will be released this week.
• The last seminar will be next week (in week 10).
• The group assignment is due at 6.00 pm on Friday 18 November.
• Finally, I am available for consultation until the final exam.
