
CUHK

Chapter 8
Residual Analysis

HU, Qinlu
Email: qinlu.hu@link.cuhk.edu.hk
Date: 2024.03.18

DSME 2021
CV

Name: Qinlu Hu
PhD Candidate, Department of Decisions, Operations and Technology, CUHK
Business School, qinlu.hu@link.cuhk.edu.hk

Major: Management Information Systems

Research interests:
two-sided online platforms
CONTENTS

01 Introduction
02 Regression Residuals
03 Detecting Lack of Fit
04 Detecting Unequal Variance
05 Check the Normality Assumption
06 Detecting Outliers and Identifying Influential Observations
07 Detecting Residual Correlations: The Durbin-Watson Test
8.1 Introduction

The validity of many of the inferences associated with a regression analysis depends on the error term ε satisfying certain assumptions.

There are four assumptions:

• ε is normally distributed;
• ε has a mean of 0;
• the variance σ² of ε is constant;
• all pairs of error terms are uncorrelated.

Based on these assumptions, least squares regression analysis produces reliable statistical tests and confidence intervals.

8.1 Introduction

What if these assumptions are not satisfied?

What will happen?

• If ε is not normally distributed -- this is usually fine for large samples.
  • The Central Limit Theorem supports this assumption for large sample sizes: the sampling distributions of the least squares estimators are approximately normal regardless of the shape of the distribution of the error term in the population.
• If ε does not have a mean of 0:
  • A nonzero mean of the error terms would indicate that the model is biased. (Estimation problem)
• If the variance σ² is not constant:
  • Homoscedasticity (equal variance) is crucial for the reliability of standard errors, confidence intervals, and hypothesis tests. (Inference problem)
• If pairs of error terms are correlated:
  • Correlated errors affect inference: standard errors, confidence intervals, and hypothesis tests.

Based on these assumptions, least squares regression analysis produces reliable statistical tests and confidence intervals. Violations of these assumptions may lead to inefficiency of the OLS estimators and incorrect inferences.

8.1 Introduction

How can we know whether these assumptions are satisfied? How do we detect violations?

• ε is normally distributed --- check the normality assumption
• ε has a mean of 0 --- detect lack of fit
• the variance σ² is constant --- detect unequal variance
• all pairs of error terms are uncorrelated --- detect residual correlation

This chapter provides both graphical tools and statistical tests that will aid in identifying significant departures from the assumptions.

8.2 Regression Residuals

• The definition of Regression Residuals:

• Consider the model

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$

• Use the data to obtain the least squares estimates $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$.

• The regression residual is the observed value of the dependent variable minus the predicted value:

$\hat{\varepsilon} = y - \hat{y} = y - (\hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k)$
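A minimal sketch of computing these residuals in Python with statsmodels, using simulated data (the data, variable names, and model below are illustrative, not taken from the chapter's Colab notebooks):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Illustrative data: y depends linearly on two predictors plus random error
rng = np.random.default_rng(8)
n = 100
df = pd.DataFrame({"x1": rng.uniform(0, 10, n), "x2": rng.uniform(0, 5, n)})
y = 2 + 1.5 * df["x1"] - 0.8 * df["x2"] + rng.normal(0, 1, n)

# Least squares fit of y = b0 + b1*x1 + b2*x2
X = sm.add_constant(df)
fit = sm.OLS(y, X).fit()

# Residual = observed y minus predicted y (also available directly as fit.resid)
residuals = y - fit.fittedvalues
print(residuals.head())
```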

8.2 Regression Residuals

• Properties of Regression Residuals:

(1) The mean of the residuals is equal to 0. This property follows from the fact that the sum of the differences between the observed y-values and their least squares predicted values ŷ is equal to 0.

$\sum_{i=1}^{n} \hat{\varepsilon}_i = \sum_{i=1}^{n} (y_i - \hat{y}_i) = 0$

(2) The standard deviation of the residuals is equal to the standard deviation of the fitted regression model, s.

$\sum_{i=1}^{n} (\hat{\varepsilon}_i - 0)^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = SSE, \qquad s = \sqrt{\frac{SSE}{n - (k+1)}}$
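A quick numerical check of both properties with simulated data and statsmodels (illustrative; n = 100 observations and k = 2 predictors):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n, k = 100, 2
X = sm.add_constant(rng.uniform(0, 10, size=(n, k)))
y = X @ np.array([2.0, 1.5, -0.8]) + rng.normal(0, 1, n)

fit = sm.OLS(y, X).fit()
resid = fit.resid

# Property (1): the residuals sum to 0 (up to floating-point error)
print(resid.sum())

# Property (2): s computed from SSE matches the model's estimated standard deviation
sse = np.sum(resid**2)
s = np.sqrt(sse / (n - (k + 1)))
print(s, np.sqrt(fit.mse_resid))   # the two values agree
```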
8.2 Regression Residuals

• Examples:
• Google Colab: https://colab.research.google.com
• Python tutorial: https://colab.research.google.com/drive/1LBD-pZPYm_GopWQDv3THOp5waicPq_-S?usp=sharing
• Examples for the chapter: https://colab.research.google.com/drive/12IEGdJbkKcCM3AZGpizcjqgMOyeDzKUx?usp=sharing

8.3 Detecting Lack of Fit
• Definition of Lack of Fit:
• True model:

$y = E(y) + \varepsilon, \qquad E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$

so that

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon, \qquad E(\varepsilon) = 0$

• Mis-specified model (e.g., some variables are dropped):

$E_m(y) \neq E(y), \qquad E(\varepsilon_m) \neq 0$
8.3 Detecting Lack of Fit
• Detecting Model Lack of Fit with Residuals:

1. Residual plot: residuals (y-axis) versus an independent variable (x-axis).

2. Residual plot: residuals (y-axis) versus the predicted values (x-axis).

In each plot, look for a trend, dramatic changes in variability, and/or more than 5% of residuals that lie outside 2s of 0. Any of these patterns indicates a problem with model fit. (A plotting sketch follows.)
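A plotting sketch with matplotlib and statsmodels (illustrative data; a straight line is deliberately fit to a curved relationship so the residual plots show a trend). The dashed lines mark ±2s around 0, where s is the model's estimated standard deviation:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 100
x = rng.uniform(0, 10, n)
y = 2 + 1.5 * x + 0.4 * x**2 + rng.normal(0, 2, n)   # true relation is curved

# Deliberately mis-specified straight-line fit to illustrate lack of fit
fit = sm.OLS(y, sm.add_constant(x)).fit()
resid = fit.resid
s = np.sqrt(fit.mse_resid)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Plot 1: residuals versus the independent variable
axes[0].scatter(x, resid)
axes[0].set_xlabel("x"); axes[0].set_ylabel("residual")

# Plot 2: residuals versus the predicted values
axes[1].scatter(fit.fittedvalues, resid)
axes[1].set_xlabel("predicted y"); axes[1].set_ylabel("residual")

# Reference lines at 0 and +/- 2s; look for trends or many points outside the band
for ax in axes:
    ax.axhline(0, color="black")
    ax.axhline(2 * s, linestyle="--")
    ax.axhline(-2 * s, linestyle="--")
plt.show()
```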

8.3 Detecting Lack of Fit
• Detecting Model Lack of Fit with Residuals:

Partial regression residual plot: partial residuals (y-axis) versus $x_j$ (x-axis).

1. The partial residual plot is used for models with more than one independent variable. The partial residual for $x_j$ adds the estimated contribution of $x_j$ back to the ordinary residual, $\hat{\varepsilon}^{*} = \hat{\varepsilon} + \hat{\beta}_j x_j$, so the plot reveals the trend between y and $x_j$ after adjusting for the other predictors.

As in ordinary residual plots, look for a trend, dramatic changes in variability, and/or more than 5% of residuals that lie outside 2s of 0. Any of these patterns indicates a problem with model fit.

We can use the partial residual plot to find the trend between y and x1, for example (see the sketch below).
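A sketch of a partial residual plot for one predictor, computed directly from a statsmodels fit (illustrative data and names; statsmodels also offers built-in component-plus-residual plotting):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 100
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 2 + 0.3 * x1**2 - 0.8 * x2 + rng.normal(0, 1, n)  # y is curved in x1

# Fit the (linear) model y = b0 + b1*x1 + b2*x2
X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

# Partial residual for x1: ordinary residual plus the fitted contribution of x1
b1 = fit.params[1]
partial_resid_x1 = fit.resid + b1 * x1

# Plot partial residuals against x1; a curved pattern suggests y needs a nonlinear term in x1
plt.scatter(x1, partial_resid_x1)
plt.xlabel("x1")
plt.ylabel("partial residual")
plt.show()
```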

8.3 Detecting Lack of Fit

• Examples:
• Google Colab:
• Python tutorial: https://colab.research.google.com/drive/1LBD-pZPYm_GopWQDv3THOp5waicPq_-S?usp=sharing
• Examples for the chapter: https://colab.research.google.com/drive/12IEGdJbkKcCM3AZGpizcjqgMOyeDzKUx?usp=sharing

8.4 Detecting Unequal Variances
• Definition of unequal variance:

• One of the assumptions necessary for the validity of regression inferences is that the error term ε have constant variance σ² for all levels of the independent variable(s).
• Variances that satisfy this property are called homoscedastic.
• Unequal variances for different settings of the independent variable(s) are
said to be heteroscedastic.

8.4 Detecting Unequal Variances
• When data fail to be homoscedastic, the reason is often that the variance of
the response y is a function of its mean E(y).
• Examples:
1. If the response y is a count that has a Poisson distribution, the variance
will be equal to the mean E(y).

8.4 Detecting Unequal Variances
• When data fail to be homoscedastic, the reason is often that the variance of
the response y is a function of its mean E(y).
• Examples:
2. If the response y is a binomial proportion (the proportion of successes based on $n_i$ trials), the variance will be equal to:

$Var(y_i) = \frac{p_i(1-p_i)}{n_i} = \frac{E(y_i)[1-E(y_i)]}{n_i}$

8.4 Detecting Unequal Variances
• When data fail to be homoscedastic, the reason is often that the variance of
the response y is a function of its mean E(y).
• Examples:
3. If the response y follows a multiplicative model, the variance will be equal to:

$Var(y) = [E(y)]^2 \sigma^2$

Poisson Distribution Formula

$P(X = x \mid \lambda) = \frac{e^{-\lambda}\,\lambda^{x}}{x!}$

where:
x = number of events in an area of opportunity
λ = expected number of events
e = base of the natural logarithm (2.71828...)

Poisson Distribution Formula

• Mean: $\mu = \lambda$
• Variance and Standard Deviation: $\sigma^2 = \lambda$, $\sigma = \sqrt{\lambda}$

where λ = expected number of events
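A quick check with scipy (illustrative) that the Poisson mean and variance both equal λ:

```python
from scipy import stats

lam = 4.0
dist = stats.poisson(mu=lam)
print(dist.mean(), dist.var())   # both equal lambda = 4.0
print(dist.pmf(2))               # P(X = 2) = e^{-4} * 4**2 / 2!
```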

Binomial Distribution Formula

$P(X = x) = \binom{n}{x} p^{x} (1-p)^{n-x}$

where:
n = the number of experiments (trials)
x = the number of successes: 0, 1, 2, …
p = probability of success in a single experiment

Binomial Distribution Formula

• Mean: $\mu = np$
• Variance and Standard Deviation: $\sigma^2 = np(1-p)$, $\sigma = \sqrt{np(1-p)}$

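A similar scipy check (illustrative) of the binomial mean and variance formulas:

```python
from scipy import stats

n, p = 20, 0.3
dist = stats.binom(n=n, p=p)
print(dist.mean(), n * p)            # mean = np
print(dist.var(), n * p * (1 - p))   # variance = np(1 - p)
```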
Multiplicative Model Formula
• The random error component has been assumed to be additive in all the models considered so far:

$y = E(y) + \varepsilon$

• Another useful type of model is the multiplicative model. In this model, the response is written as the product of its mean and the random error component:

$y = [E(y)]\,\varepsilon$

• The variance of this response grows proportionally to the square of the mean:

$Var(y) = [E(y)]^2 \sigma^2$
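A small simulation (illustrative; the error is taken as 1 plus mean-zero noise, so E(ε) = 1 and Var(ε) = σ²) showing that the variance of y grows with the square of its mean under the multiplicative model:

```python
import numpy as np

rng = np.random.default_rng(8)
sigma = 0.3

# Multiplicative model: y = E(y) * eps, with E(eps) = 1 and Var(eps) = sigma^2
for mean_y in (10, 50, 250):
    eps = 1 + rng.normal(0, sigma, 100_000)
    y = mean_y * eps
    # Simulated variance is close to the theoretical [E(y)]^2 * sigma^2
    print(mean_y, round(y.var(), 1), (mean_y**2) * sigma**2)
```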

8.4 Detecting Unequal Variances
• Solution:
• Variance-stabilizing transformations
• When the variance of y is a function of its mean, we can often satisfy the
least squares assumption of homoscedasticity by transforming the
response to some new response that has a constant variance.
• For example, if the response y is a count that follows a Poisson distribution, the square-root transform can be shown to have approximately constant variance. Consequently, if the response is a Poisson random variable, we would let

$y^{*} = \sqrt{y}$

and fit the model

$y^{*} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$

• This model will approximately satisfy the least squares assumption of homoscedasticity. (A simulation sketch follows.)
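A simulation sketch of the square-root transformation for a Poisson-type response, using statsmodels (illustrative data): comparing the residual spread for small versus large x, the untransformed fit is clearly heteroscedastic while the transformed fit is roughly constant:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 500
x = rng.uniform(0, 10, n)
y = rng.poisson(lam=5 + 4 * x)          # count response: Var(y) = E(y), which grows with x

X = sm.add_constant(x)
fit_raw = sm.OLS(y, X).fit()            # model the raw counts
fit_sqrt = sm.OLS(np.sqrt(y), X).fit()  # model y* = sqrt(y)

# Residual standard deviation for small x versus large x under each model
low, high = x < 5, x >= 5
print("raw  :", fit_raw.resid[low].std(), fit_raw.resid[high].std())    # spread grows with x
print("sqrt :", fit_sqrt.resid[low].std(), fit_sqrt.resid[high].std())  # roughly constant
```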

8.4 Detecting Unequal Variances

• Examples:
• Google Colab:
• Python tutorial: https://colab.research.google.com/drive/1LBD-pZPYm_GopWQDv3THOp5waicPq_-S?usp=sharing
• Examples for the chapter: https://colab.research.google.com/drive/12IEGdJbkKcCM3AZGpizcjqgMOyeDzKUx?usp=sharing

