
QMT 109: Business Research

and Statistical Methods


Wilson Gan
Lecture Eleven

Reference for Definitions and Formulas: Bowerman et al. (2009)


Simple Linear Regression
Analysis
PART I: Introduction
Our running example: Fuel Consumption
Case
 Cities in the USA use natural gas to heat their homes when
temperatures are low. As is obvious for most of you, natural gas
consumption is highest during the winter. Assume that we are
analysts in a management consulting firm hired by a natural gas
company serving a small city to predict weekly natural gas demand
based on average temperature so they can prepare their supply chain
accordingly. They have provided us with eight weeks’ worth of data on
the city’s average temperature and corresponding weekly fuel
consumption.
 The natural gas firm will tolerate an error rate of at most 10 percent,
meaning that the difference between our prediction and actual usage
should not exceed 10 percent. This is because the natural gas firm
will have to pay gas transmission companies if it over- or
under-predicts demand.
In SRM, we attempt to fit a straight line to
establish a linear relationship
 The dependent (or response) variable is the variable we wish to
understand or predict
 The independent (or predictor) variable is the variable we will use
to understand or predict the dependent variable
 Regression analysis is a statistical technique that uses observed data
to relate the dependent variable to one or more independent
variables. The objective of regression analysis is to build a regression
model (or predictive equation) that can be used to describe, predict
and control the dependent variable on the basis of the independent
variable

Remember, correlation is not the same as causation!
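
To make the idea concrete, here is a minimal Python sketch of fitting a simple linear regression with scipy. The temperature and consumption values below are made up for illustration only and are not the case data.

```python
import numpy as np
from scipy import stats

# Illustration-only data: average hourly temperature (x) and weekly
# fuel consumption (y) for eight weeks -- not the actual case data.
temp = np.array([28.0, 32.5, 39.0, 45.9, 50.8, 55.0, 58.1, 62.5])
fuel = np.array([12.4, 12.3, 10.8, 9.4, 9.5, 8.9, 8.0, 7.5])

# Fit the least squares line: fuel = b0 + b1 * temp
fit = stats.linregress(temp, fuel)
print(f"intercept b0 = {fit.intercept:.3f}")
print(f"slope     b1 = {fit.slope:.4f}")
```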


Let us examine the resulting regression
line…

When we say that μy|x is related to x by a straight line, we mean that the mean
weekly fuel consumptions corresponding to the different average hourly temperatures
lie on a straight line.
Perhaps using this simpler rendition…
And the general form of the regression
equation…

 The simple linear regression model is y = β0 + β1x + ε, so the mean value of the
dependent variable y when the independent variable equals x is μy|x = β0 + β1x
 β0 is the y-intercept; the mean of y when x is 0
 β1 is the slope; the change in the mean of y per unit change in x
 ε is an error term that describes the effect on y of all factors other
than x
 Remember that we do not know the real values of the regression
parameters β0 and β1. In the fuel consumption case, the resulting
equation is just an estimate based on our sample:
 ŷ = 15.838 – 0.128x
 b0 is the estimate of β0 and b1 is the estimate of β1
The Simple Linear Regression Model
Illustrated
We use the least squares method to
estimate the regression line

 The least squares point estimates of the regression parameters are

b1 = SSxy / SSxx,  where SSxy = Σ(xi – x̄)(yi – ȳ) and SSxx = Σ(xi – x̄)²

b0 = ȳ – b1x̄
Exercise: Least Squares Point Estimates for
Fuel Consumption Case
THIS AREA HAS INTENTIONALLY BEEN LEFT BLANK
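
As one way of working the exercise, the sketch below computes the least squares point estimates directly from the formulas above. The data values are again made up for illustration, so the resulting b0 and b1 will not match the case answer.

```python
import numpy as np

# Illustration-only data (not the case data)
x = np.array([28.0, 32.5, 39.0, 45.9, 50.8, 55.0, 58.1, 62.5])  # avg hourly temp
y = np.array([12.4, 12.3, 10.8, 9.4, 9.5, 8.9, 8.0, 7.5])       # weekly fuel use

ss_xy = np.sum((x - x.mean()) * (y - y.mean()))  # SSxy
ss_xx = np.sum((x - x.mean()) ** 2)              # SSxx
b1 = ss_xy / ss_xx                               # slope estimate
b0 = y.mean() - b1 * x.mean()                    # intercept estimate
print(f"b0 = {b0:.3f}, b1 = {b1:.4f}")
```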

 Suppose the weather forecasting service predicts that the average
hourly temperature next week will be 42 °F (5.6 °C). Estimate next
week’s fuel consumption. Your answer will be the same as:
 The point estimate of the mean fuel consumption for all weeks that
have an average hourly temperature of 42 °F
 The point prediction of the amount of fuel consumed in a single
week that has an average hourly temperature of 42 °F
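 Using the estimated equation from the case, the point estimate is
ŷ = 15.838 – 0.128(42) = 15.838 – 5.376 ≈ 10.46 units of fuel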
Beware the dangers of extrapolation!
Sample Regression Output in Excel
MSE and the SE are point estimates for the
variance and SD of the error term populations
 The residuals are the differences between the actual and predicted values, and
the Sum of Squared Errors (SSE) is the sum of the squared residuals:

SSE = Σ ei² = Σ (yi – ŷi)²

 The goal of the least squares method is to produce the equation of the line that
minimizes SSE
 The Mean Squared Error (MSE) is the point estimate of the residual variance σ²:

s² = MSE = SSE / (n – 2)

 The Standard Error (s) is the point estimate of the residual standard deviation σ:

s = √MSE = √(SSE / (n – 2))
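
Continuing the illustration-only fit from the earlier sketches, these quantities can be computed as follows.

```python
import numpy as np

# Illustration-only data and least squares fit (not the case data)
x = np.array([28.0, 32.5, 39.0, 45.9, 50.8, 55.0, 58.1, 62.5])
y = np.array([12.4, 12.3, 10.8, 9.4, 9.5, 8.9, 8.0, 7.5])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)        # e = y - y_hat
sse = np.sum(residuals ** 2)         # sum of squared errors
mse = sse / (len(y) - 2)             # point estimate of sigma^2
s = np.sqrt(mse)                     # standard error, point estimate of sigma
print(f"SSE = {sse:.3f}, MSE = {mse:.3f}, s = {s:.3f}")
```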
We can use hypothesis testing to determine
whether a significant association exists
 A regression model is not likely to be useful unless there is a
significant relationship between x and y
 To test significance, we use the null hypothesis: H0: β1 = 0
 Versus the alternative hypothesis: Ha: β1 ≠ 0
 Note: The test will only be valid if SRM assumptions hold.
Alternative Hypothesis: Ha: β1 ≠ 0
Reject H0 if: |t| > tα/2, that is, t > tα/2 or t < –tα/2
P-value: twice the area under the t distribution to the right of |t|

t = b1 / sb1,  where sb1 = s / √SSxx

 Remember: In this case, degrees of freedom = n – 2
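
A quick sketch of this test in Python on the illustration-only data; scipy.stats.linregress reports the slope’s standard error and the two-sided p-value directly, so the t statistic is simply b1 divided by its standard error.

```python
import numpy as np
from scipy import stats

x = np.array([28.0, 32.5, 39.0, 45.9, 50.8, 55.0, 58.1, 62.5])  # illustration only
y = np.array([12.4, 12.3, 10.8, 9.4, 9.5, 8.9, 8.0, 7.5])

fit = stats.linregress(x, y)
t_stat = fit.slope / fit.stderr      # t = b1 / s_b1
print(f"t = {t_stat:.2f}, two-sided p-value = {fit.pvalue:.4f}")
# Reject H0: beta1 = 0 at level alpha if the p-value is below alpha
```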
The same approach applies when testing the significance of the
y-intercept β0
 To test significance, we use the null hypothesis: H0: β0 = 0
 Versus the alternative hypothesis: Ha: β0 ≠ 0
 Note: The test will only be valid if SRM assumptions hold.
 If the intercept is not statistically significant, we can drop it from the
model (unless common sense dictates otherwise)
Alternative Hypothesis: Ha: β0 ≠ 0
Reject H0 if: |t| > tα/2, that is, t > tα/2 or t < –tα/2
P-value: twice the area under the t distribution to the right of |t|

t = b0 / sb0,  where sb0 = s √(1/n + x̄² / SSxx)

 Remember: In this case, degrees of freedom = n – 2
F Test checks the significance of the overall
regression relationship between x and y
 For simple regression, this is
another way to test the null
hypothesis
H0: β1 = 0
 Very useful in multiple
regression
 Test statistic based on the F
distribution. Reject H0 if F(model)
> Fα or p-value < α
 Fα is based on 1 numerator and n – 2
denominator degrees of freedom

F(model) = (Explained variation) / [(Unexplained variation) / (n – 2)]
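
A sketch of the F test on the illustration-only data, built from the explained and unexplained variation; for simple regression the resulting F equals the square of the slope’s t statistic.

```python
import numpy as np
from scipy import stats

x = np.array([28.0, 32.5, 39.0, 45.9, 50.8, 55.0, 58.1, 62.5])  # illustration only
y = np.array([12.4, 12.3, 10.8, 9.4, 9.5, 8.9, 8.0, 7.5])
fit = stats.linregress(x, y)
y_hat = fit.intercept + fit.slope * x

explained = np.sum((y_hat - y.mean()) ** 2)      # explained variation
unexplained = np.sum((y - y_hat) ** 2)           # unexplained variation (SSE)
n = len(y)
f_model = explained / (unexplained / (n - 2))
p_value = stats.f.sf(f_model, 1, n - 2)          # area to the right of F(model)
print(f"F = {f_model:.2f}, p-value = {p_value:.4f}")
```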
And we can apply the concept of the
confidence interval for β1 and β0 as well…
 Note that the confidence intervals here apply to β0 and β1 separately
 100(1 – α)% confidence interval for β1: [b1 ± tα/2 sb1]
 100(1 – α)% confidence interval for β0: [b0 ± tα/2 sb0]
 Of the two, the confidence interval for β1 is the more useful one to know
 Note that a different formula is used to determine the confidence interval for
the estimate of the mean value, or the point prediction of an individual value,
of the dependent variable
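
A minimal sketch of the confidence interval for β1 on the illustration-only data; scipy.stats.linregress supplies sb1 as fit.stderr.

```python
import numpy as np
from scipy import stats

x = np.array([28.0, 32.5, 39.0, 45.9, 50.8, 55.0, 58.1, 62.5])  # illustration only
y = np.array([12.4, 12.3, 10.8, 9.4, 9.5, 8.9, 8.0, 7.5])
fit = stats.linregress(x, y)

n = len(x)
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)     # t_{alpha/2} for a 95% interval
lower = fit.slope - t_crit * fit.stderr
upper = fit.slope + t_crit * fit.stderr
print(f"95% CI for beta1: [{lower:.4f}, {upper:.4f}]")
```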
Confidence and Prediction Intervals
 The point on the regression line corresponding to a particular value x0
of the independent variable x is ŷ = b0 + b1x0. It is unlikely that this
value will equal the mean value of y when x equals x0
 Therefore, we need to place bounds on how far the predicted value
might be from the actual value
 We can do this by calculating a confidence interval for the mean value
of y and a prediction interval for an individual value of y

CONFIDENCE INTERVAL: [ŷ ± tα/2 s √(Distance value)]

PREDICTION INTERVAL: [ŷ ± tα/2 s √(1 + Distance value)]

where Distance value = 1/n + (x0 – x̄)² / SSxx

 Remember: In this case, degrees of freedom = n – 2
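
A sketch of both intervals at a hypothetical value x0 = 42, again on the illustration-only data, following the distance-value formulas above.

```python
import numpy as np
from scipy import stats

x = np.array([28.0, 32.5, 39.0, 45.9, 50.8, 55.0, 58.1, 62.5])  # illustration only
y = np.array([12.4, 12.3, 10.8, 9.4, 9.5, 8.9, 8.0, 7.5])
fit = stats.linregress(x, y)

n = len(x)
x0 = 42.0
y_hat0 = fit.intercept + fit.slope * x0
s = np.sqrt(np.sum((y - (fit.intercept + fit.slope * x)) ** 2) / (n - 2))
dist = 1 / n + (x0 - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)  # distance value
t_crit = stats.t.ppf(0.975, df=n - 2)

half_ci = t_crit * s * np.sqrt(dist)        # confidence interval half-width
half_pi = t_crit * s * np.sqrt(1 + dist)    # prediction interval half-width (wider)
print(f"95% CI for mean y at x0: [{y_hat0 - half_ci:.2f}, {y_hat0 + half_ci:.2f}]")
print(f"95% PI for a single y at x0: [{y_hat0 - half_pi:.2f}, {y_hat0 + half_pi:.2f}]")
```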
Difference between Confidence Interval and
Prediction Interval Illustrated

The prediction interval will always be wider than the confidence interval
The Simple Coefficient of Determination and
Correlation gauges usefulness of regression
 The coefficient of determination, r2, is the proportion of the total
variation in the n observed values of the dependent variable that is
explained by the simple linear regression model

 Note: Total Variation = Explained Variation + Unexplained Variation


 Unexplained Variation = SSE

 The simple coefficient of correlation measures the strength of the linear


relationship between y and x and is denoted by r
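
A short sketch computing r² from the variation decomposition on the illustration-only data; it matches the square of the correlation reported by linregress.

```python
import numpy as np
from scipy import stats

x = np.array([28.0, 32.5, 39.0, 45.9, 50.8, 55.0, 58.1, 62.5])  # illustration only
y = np.array([12.4, 12.3, 10.8, 9.4, 9.5, 8.9, 8.0, 7.5])
fit = stats.linregress(x, y)
y_hat = fit.intercept + fit.slope * x

total = np.sum((y - y.mean()) ** 2)          # total variation
unexplained = np.sum((y - y_hat) ** 2)       # unexplained variation (SSE)
r_squared = (total - unexplained) / total    # explained / total
print(f"r^2 = {r_squared:.3f}  (check: {fit.rvalue ** 2:.3f})")
```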
Simple Linear Regression
Analysis
PART II: Correlation
Different Values of the Correlation
Coefficient Illustrated

The value of the simple correlation coefficient (r) is not the slope of the least
squares line. The latter is b1.
Note: Coefficient of correlation is related to
covariance
 A measure of the strength of a linear relationship is the
covariance sxy:

sxy = Σi=1..n (xi – x̄)(yi – ȳ) / (n – 1)

 The sample correlation coefficient (r) is a measure of the strength
of the relationship that does not depend on the magnitude of
the data:

r = sxy / (sx sy)
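
The same relationship in a short sketch on the illustration-only data: the covariance is computed first, then scaled by the two sample standard deviations to give r.

```python
import numpy as np

x = np.array([28.0, 32.5, 39.0, 45.9, 50.8, 55.0, 58.1, 62.5])  # illustration only
y = np.array([12.4, 12.3, 10.8, 9.4, 9.5, 8.9, 8.0, 7.5])

s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)   # sample covariance
r = s_xy / (x.std(ddof=1) * y.std(ddof=1))                      # sample correlation
print(f"covariance = {s_xy:.3f}, r = {r:.3f}")
```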
Spearman vs. Pearson Correlation
 Both are measures of correlation (r)
 Both measure the statistical relationship between two variables
 Both have a range from -1 (perfect negative relationship) to 1 (perfect
positive relationship) with a 0 value meaning no relationship
 Pearson Product Moment Correlation is a parametric measure
 Used for interval and ratio variables
 Assumes that the population for each variable is normally distributed
 Spearman Correlation is a non-parametric measure
 Sometimes called Spearman’s rho
 Used for small data sets (remember your CLT)
 Used for ordinal variables
 No assumption about population distribution
 In a way, Spearman is the more “robust” measure
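
Both coefficients are available in scipy; a minimal sketch on the illustration-only data:

```python
import numpy as np
from scipy import stats

x = np.array([28.0, 32.5, 39.0, 45.9, 50.8, 55.0, 58.1, 62.5])  # illustration only
y = np.array([12.4, 12.3, 10.8, 9.4, 9.5, 8.9, 8.0, 7.5])

pearson_r, pearson_p = stats.pearsonr(x, y)      # parametric, interval/ratio data
spearman_rho, spearman_p = stats.spearmanr(x, y) # rank-based, non-parametric
print(f"Pearson  r   = {pearson_r:.3f} (p = {pearson_p:.4f})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.4f})")
```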
Formulas for the Spearman and Pearson
correlation
PEARSON’S CORRELATION:

r = Σ(xi – x̄)(yi – ȳ) / √[ Σ(xi – x̄)² · Σ(yi – ȳ)² ]

SPEARMAN’S CORRELATION:

rs = 1 – 6Σd² / [n(n² – 1)],  where d is the difference in
ranks of each pair

You do not need to bother with these formulas if you will be using statistical
software to compute correlation
Sample Calculation for Spearman’s
Correlation
(Worked table of paired ranks and squared rank differences, d², not reproduced here)

Source: Bowerman (2009)


Appendix: Testing the Significance of the
Population Correlation Coefficient
 The simple correlation coefficient (r) measures the linear relationship
between the observed values of x and y from the sample while the
population correlation coefficient (ρ) measures the linear
relationship between all possible combinations of observed values of
x and y
 Can actually be used for both Pearson and Spearman
 r is an estimate of ρ
 We can test to see if the correlation is significant using the
hypotheses H0: ρ = 0 versus Ha: ρ ≠ 0, with test statistic

t = r √(n – 2) / √(1 – r²)

 With degrees of freedom = n – 2
 This test will give the same results as the t test for the significance of
the slope coefficient b1
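
A sketch of this test on the illustration-only data, computing the t statistic from r and comparing it against the t distribution with n – 2 degrees of freedom.

```python
import numpy as np
from scipy import stats

x = np.array([28.0, 32.5, 39.0, 45.9, 50.8, 55.0, 58.1, 62.5])  # illustration only
y = np.array([12.4, 12.3, 10.8, 9.4, 9.5, 8.9, 8.0, 7.5])

r, _ = stats.pearsonr(x, y)
n = len(x)
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value
print(f"r = {r:.3f}, t = {t_stat:.2f}, p = {p_value:.4f}")
```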
Simple Linear Regression
Analysis
PART III: Regression Assumptions
Note the assumptions that govern the SRM
1. Mean of Zero
At any given value of x, the population of potential error term
values has a mean equal to zero
2. Constant Variance Assumption
Homoscedasticity. At any given value of x, the population of
potential error term values has a variance that does not depend on
the value of x. The different populations of potential error term
values corresponding to different values of x have equal variances
(σ2).
3. Normality Assumption
At any given value of x, the population of potential error term
values has a normal distribution
4. Independence Assumption
Any one value of the error term ε is statistically independent of any
other value of ε
SRM Model Assumptions Illustrated

 Residuals (e) are defined as the difference between the observed


value of y and the predicted value of y; e = y – ŷ
 Note that e is the point estimate of ε
How do we check that the regression
assumptions hold?
 Checks of regression assumptions are performed by analyzing the
regression residuals
 Residuals versus independent variable
 Residuals versus predicted y’s
 Residuals in time order (if the response is a time series)
 The residuals should look like they have been randomly and
independently selected from normally distributed populations having
mean zero and variance σ2
 Furthermore, the different error terms will be statistically
independent.
 With any real data, assumptions will not hold exactly. Mild
departures do not affect our ability to make statistical inferences.
 In checking assumptions, we are looking for pronounced departures
from the assumptions.
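
One common way to run these checks is to plot the residuals directly; a minimal matplotlib sketch on the illustration-only data:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.array([28.0, 32.5, 39.0, 45.9, 50.8, 55.0, 58.1, 62.5])  # illustration only
y = np.array([12.4, 12.3, 10.8, 9.4, 9.5, 8.9, 8.0, 7.5])
fit = stats.linregress(x, y)
y_hat = fit.intercept + fit.slope * x
residuals = y - y_hat

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].scatter(x, residuals)          # residuals vs independent variable
axes[0].axhline(0)
axes[0].set_title("Residuals vs x")
axes[1].scatter(y_hat, residuals)      # residuals vs predicted y's
axes[1].axhline(0)
axes[1].set_title("Residuals vs predicted")
axes[2].hist(residuals)                # rough check of the normality assumption
axes[2].set_title("Residual histogram")
plt.tight_layout()
plt.show()
```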
Backup: Interpreting Residual Plots

Panels illustrated: null plot; nonlinearity; heteroscedasticity;
heteroscedasticity; time-based dependence; event-based dependence;
normal histogram; nonlinearity and heteroscedasticity


Checking for constant variance of the
residuals
Checking for normal distribution of
residuals

Not so good

Good
Source: http://people.duke.edu/~rnau/testing.htm
Autocorrelation and time series data
 The independence assumption is frequently violated in time series data.
Time-ordered error terms may be autocorrelated. There are two
types of autocorrelation, as seen in a plot of residuals in time order…
 Positive autocorrelation:
 Residual plot has a cyclical pattern
 positive error term in time period i tends to be followed by another
positive value in i+k (some later period); negative error term in
time period i tends to be followed by another negative value in i+k
 Negative autocorrelation:
 Residual plot has an alternating pattern
 positive error term in time period i tends to be followed by a
negative value in i+k (some later period); negative error term in
time period i tends to be followed by a positive value in i+k

Source: Bowerman (2009)


Positive and Negative Autocorrelation

POSITIVE AUTOCORRELATION NEGATIVE AUTOCORRELATION

 Note: When residuals are independent of each other, the residual plot
should display a random pattern
Source: Bowerman (2009)
The Durbin-Watson Test for Autocorrelation

Decision regions for the Durbin-Watson statistic d on its 0-to-4 scale:
 d < dL: reject H0 (positive autocorrelation)
 dL ≤ d ≤ dU: inconclusive
 dU < d < 4 – dU: do not reject H0 (no autocorrelation)
 4 – dU ≤ d ≤ 4 – dL: inconclusive
 d > 4 – dL: reject H0 (negative autocorrelation)

 Estimating the Durbin-Watson test statistic:

d = Σt=2..n (et – et–1)² / Σt=1..n et²
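
A sketch of the statistic itself; the residuals below are made up purely to show the calculation.

```python
import numpy as np

# Hypothetical time-ordered residuals -- illustration only
e = np.array([0.5, 0.3, 0.4, -0.2, -0.5, -0.3, 0.1, 0.4])

# Durbin-Watson statistic: values near 2 suggest no autocorrelation,
# well below 2 suggest positive, well above 2 suggest negative autocorrelation
d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(f"Durbin-Watson d = {d:.2f}")
```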


What’s the effect of potential violations of
the regression assumptions?
Violation: Non-linear relationship
 Effect on regression model: Low fit. Predictions unreliable.
 Potential fix: Nonlinear transformation
Violation: Autocorrelation (Serial Correlation)
 Effect on regression model: Parameter estimates unaffected. Hypothesis tests and confidence intervals unreliable.
 Potential fix: Seasonality adjustment. Autoregressive time series models.
Violation: Heteroscedasticity
 Effect on regression model: Parameter estimates unaffected. Hypothesis tests and confidence intervals unreliable.
 Potential fix: Data transformation (log). ARCH models.
Violation: Non-normal errors
 Effect on regression model: Parameter estimates unaffected. Hypothesis tests and confidence intervals unreliable.
 Potential fix: Should not be a problem for large sample sizes, though it may be a symptom of a different violation
 Note: Since this is meant to be introductory, this is not a
comprehensive discussion
