05 16 Simple Regression 2


Introduction to Econometrics

Ekki Syamsulhakim
Undergraduate Program
Department of Economics
Universitas Padjadjaran
Last Week
• Chapter 2 – Wooldridge
– Simple regression model
• Definition
• Zero Conditional Mean Assumption
• Derivation of OLS estimates
Today
• Chapter 2 – Wooldridge
– Simple regression model continues
• Goodness of fit
• Interpretation of simple regression parameters

• Chapter 3 – Wooldridge
– Multiple regression model
– Omitted Variable Bias and Multiple Regression Model
– Gauss Markov Assumptions
OLS estimates of \hat{\beta}_0 and \hat{\beta}_1

• Can be easily calculated using software

• Excel example
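For reference, these are the closed-form estimators derived last week (the standard OLS formulas, in Wooldridge's notation):

\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}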
OLS estimates of \hat{\beta}_0 and \hat{\beta}_1
Once we have determined the OLS intercept and
slope estimates, we form the OLS regression
line (Sample Regression Function / SRF):

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x   [2.23]
Fitted Values and Residuals
• We assume that the intercept and slope estimates, \hat{\beta}_0 and \hat{\beta}_1, have been obtained for the given sample of data.
• Given \hat{\beta}_0 and \hat{\beta}_1, we can obtain the fitted value \hat{y}_i for each observation.
Fitted Values and Residuals
• By definition, each fitted value \hat{y}_i is on the OLS regression line.
• The OLS residual associated with observation i, \hat{u}_i, is the difference between y_i and its fitted value: \hat{u}_i = y_i - \hat{y}_i.
– If \hat{u}_i is positive, the line underpredicts y_i;
– if \hat{u}_i is negative, the line overpredicts y_i.
• STATA example (sysuse auto)
Fitted Values and Residuals
clear
sysuse auto                                   // load Stata's built-in 1978 automobile data
reg price mpg                                 // simple regression of price on mpg
predict pricehat                              // fitted values (default option: xb)
predict resid, residuals                      // OLS residuals
browse price pricehat resid                   // inspect y, the fitted values, and the residuals
scatter price mpg || line pricehat mpg, sort  // data points with the OLS regression line
Algebraic Properties of OLS Statistics
1. The sum, and therefore the sample average, of the OLS residuals is zero.
Mathematically,

\sum_{i=1}^{n} \hat{u}_i = 0

remember that the residuals are defined by

\hat{u}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i

In other words, the OLS estimates \hat{\beta}_0 and \hat{\beta}_1 are chosen
to make the residuals add up to zero (for any data set).
Algebraic Properties of OLS Statistics

2. The sample covariance between the regressors and the OLS residuals is zero.

This follows from the first-order condition

\sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0

then \sum_{i=1}^{n} x_i \hat{u}_i = 0 (why?),

making the sample covariance between x_i and \hat{u}_i equal to zero.
Algebraic Properties of OLS Statistics

3. The point (\bar{x}, \bar{y}) is always on the OLS regression line. In other words, if we take equation (2.23) and plug in \bar{x} for x, then the predicted value is \bar{y}.
Algebraic Properties of OLS Statistics
• Writing each y_i as its fitted value plus its residual provides another way to interpret an OLS regression. For each i, write

y_i = \hat{y}_i + \hat{u}_i   (2.32)

• From property (1), the average of the residuals is zero; equivalently, the sample average of the fitted values, \bar{\hat{y}}, is the same as the sample average of the y_i, or \bar{\hat{y}} = \bar{y}.
Algebraic Properties of OLS Statistics
• Further, properties (1) and (2) can be used to show that the sample covariance between \hat{y}_i and \hat{u}_i is zero.

• Thus, we can view OLS as decomposing each y_i into two parts, a fitted value and a residual. The fitted values and residuals are uncorrelated in the sample.
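These algebraic properties are easy to verify numerically; a minimal Stata check, continuing the sysuse auto example from earlier:

sysuse auto, clear
quietly reg price mpg
predict pricehat               // fitted values
predict resid, residuals       // OLS residuals
summarize resid                // mean of the residuals is (numerically) zero
correlate mpg resid            // regressor and residuals are uncorrelated
correlate pricehat resid       // fitted values and residuals are uncorrelated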
decomposition by OLS
Define the total sum of squares (SST), the
explained sum of squares (SSE), and the
residual sum of squares (SSR) (also known as
the sum of squared residuals), as follows:

SST \equiv \sum_{i=1}^{n} (y_i - \bar{y})^2   (2.33)
decomposition by OLS
• Define the total sum of squares (SST), the
explained sum of squares (SSE), and the
residual sum of squares (SSR) (also known as
the sum of squared residuals), as follows:

SSE \equiv \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2   (2.34)
decomposition by OLS
• Define the total sum of squares (SST), the explained
sum of squares (SSE), and the residual sum of
squares (SSR) (also known as the sum of squared
residuals), as follows:

SSR \equiv \sum_{i=1}^{n} \hat{u}_i^2   (2.35)
and
SST = SSE + SSR   (2.36)
Fitted values and residuals
• We want to know whether the model can explain the variation in the dependent variable through the variation in the explanatory variable
– SSR measures the part of the variation in the dependent variable left unexplained by the model
• We want to measure what proportion of the variation in the dependent variable is accounted for by the variation in the explanatory variable
– SSE measures the part of the variation in the dependent variable explained by the model
Important Notes
• Some words of caution about SST, SSE, and
SSR are in order. There is no uniform
agreement on the names or abbreviations for
the three quantities defined in equations.
• The total sum of squares is called either SST or
TSS, so there is little confusion here.
Important Notes
• Unfortunately, the explained sum of squares is
sometimes called the “regression sum of
squares.”
– If this term is given its natural abbreviation, it can
easily be confused with the term “residual sum of
squares.”
– Some regression packages refer to the explained
sum of squares as the “model sum of squares.”
Important Notes
• To make matters even worse, the residual sum
of squares is often called the “error sum of
squares.”
• This is especially unfortunate because, as we
will see in Section 2.5, the errors and the
residuals are different quantities.
• We prefer to use the abbreviation SSR to
denote the sum of squared residuals, because
it is more common in econometric packages.
Model’s goodness of fit
• Do we have a good or bad model?
– It is often useful to compute a number that summarizes
how well the OLS regression line fits the data.

• How can we conclude that the model is acceptable?

• We use R² and the SE of the regression to check the goodness of fit of the model
– R² can only be compared across models with the same dependent variable
R²

• We want to know whether the model can explain the variation in the dependent variable through the variation in the explanatory variable

• We want to measure what proportion of the variation in the dependent variable is accounted for by the variation in the explanatory variable
R²

• Assuming that the total sum of squares, SST, is not equal to zero (which is true except in the very unlikely event that all the y_i equal the same value), we can divide (2.36) by SST to get

1 = SSE/SST + SSR/SST

• The R-squared of the regression, sometimes called the coefficient of determination, is defined as

R^2 \equiv \frac{SSE}{SST} = 1 - \frac{SSR}{SST}
R²

• Coefficient of “determination”
• Useful for comparing models that use the same dependent variable
R²

• From (2.36), the value of R² is always between zero and one, because SSE can be no greater than SST
• When interpreting R², we usually multiply it by 100 to change it into a percent: 100·R² is the percentage of the sample variation in y that is explained by x.
R²

• If the data points all lie on the same line, OLS provides a perfect fit to the data.
– In this case, R² = 1.
– A value of R² that is nearly equal to zero indicates a poor fit of the OLS line: very little of the variation in the y_i is captured by the variation in the \hat{y}_i (which all lie on the OLS regression line)
R²

• In fact, it can be shown that R² is equal to the square of the sample correlation coefficient between y_i and \hat{y}_i
• This is where the term “R-squared” came
from.
– (The letter R was traditionally used to denote an
estimate of a population correlation coefficient,
and its usage has survived in regression analysis.)
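A quick numerical check of this fact, again a sketch using the auto example from earlier:

sysuse auto, clear
quietly reg price mpg
predict pricehat                 // fitted values
display e(r2)                    // R-squared reported by regress
quietly correlate price pricehat
display r(rho)^2                 // squared sample correlation between y and y-hat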
Some extensions about R²
R² issues

• R² in microeconometric analyses tends to be lower than in time-series regressions
– At the individual level, many factors may play a large role in determining responses
– These factors are often unobserved, leading to a low goodness of fit of the model

• Read: Wooldridge, Introductory… 2002, p40


Adjusted R²
• It is inevitable that R² will increase as we add more and more independent variables
• Another measure, namely the adjusted R² (written \bar{R}^2), tackles this issue
• \bar{R}^2 “imposes a penalty for adding additional independent variables into a model”

We may come back to the R² and \bar{R}^2 issue after discussing inference
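For reference, the usual definition of the adjusted R² (a standard formula, with n observations and k slope parameters):

\bar{R}^2 = 1 - \frac{SSR/(n-k-1)}{SST/(n-1)} = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}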
Standard Error of Regression
(may be discussed more later)
• It is an estimate of the typical magnitude of the error produced by the model (the standard deviation of the residuals)
• The closer the SE of the regression is to zero, the better the model fits
STATA example
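A minimal sketch of where this appears in Stata output: the standard error of the regression is reported as “Root MSE” and stored in e(rmse).

sysuse auto, clear
reg price mpg            // the "Root MSE" in the output header is the SE of the regression
display e(rmse)          // the same quantity, retrieved from the stored results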
INTERPRETATION OF SIMPLE
REGRESSION PARAMETERS
OLS – Simple Regression Model
• Suppose the true relationship is
  y = \beta_0 + \beta_1 x + u
• Then, using the OLS method, we obtain
  \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x
Interpretation of the intercept
• \hat{\beta}_0: The model predicts [dependent variable] to be \hat{\beta}_0 [units of dependent variable] if [independent variable] is zero [units of independent variable]

• Example
Interpretation of the intercept
• \hat{\beta}_0: The model predicts [dependent variable] to be \hat{\beta}_0 [units of dependent variable] if [independent variable] is zero [units of independent variable]

• Should we keep \hat{\beta}_0?
• Regression through the origin?
Interpretation of the slope

• \hat{\beta}_1: A one [unit of independent variable] increase in [independent variable] [increases/decreases] [dependent variable] by \hat{\beta}_1 [units of dependent variable]

• If x is continuous
• If the relationship is linear
• Example
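A sketch of how these templates read in practice, using the auto example from earlier (the coefficient values come from the regression output and are not quoted here):

sysuse auto, clear
reg price mpg
// Intercept: the model predicts price to be _b[_cons] dollars when mpg is zero
//            (not economically meaningful here, since mpg = 0 lies outside the data)
// Slope: a one mile-per-gallon increase in mpg changes price by _b[mpg] dollars
display _b[_cons]
display _b[mpg]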
OMITTED VARIABLE BIAS &
MULTIPLE REGRESSION
Multiple Regression Model : Introduction

• Suppose the true relationship is
  y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u
• What are our OLS estimates \hat{\beta}_1 and \hat{\beta}_2?

Multiple Regression Model : Introduction

• Suppose the true relationship is
  y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u
• But we omit x2 and estimate:
  y = \beta_0 + \beta_1 x_1 + v, where v = \beta_2 x_2 + u
• We will have Omitted Variable Bias


Omitted Variable Bias

By taking the expected value of \tilde{\beta}_1 from the misspecified regression:

E(\tilde{\beta}_1) = \beta_1 + \beta_2 \tilde{\delta}_1

where \tilde{\delta}_1 is the slope from regressing x_2 on x_1. Similar for \tilde{\beta}_0.
Bias of \tilde{\beta}_1 when x_2 is omitted
• Direction of bias: the bias \beta_2 \tilde{\delta}_1 is positive when \beta_2 and Corr(x_1, x_2) have the same sign, and negative when they have opposite signs
• But the size of the bias also matters


Example: OVB (AAY)
Other Sources of Bias
• Sources of bias:
– Omitted variable bias
– Endogeneity bias
– Simultaneity bias
– Selection bias
Multiple Regression Analysis :
Reasons
• There is no way an economic variable is
influenced by ONLY ONE variable
– Except, occasionally, in rare cases

• Because multiple regression models can


accommodate many explanatory variables
that may be correlated, we can hope to infer
causality in cases where simple regression
analysis would be misleading.
Multiple Regression Analysis :
Reasons
• Multiple regression analysis can be used to
build better models for predicting the
dependent variable.
• An additional advantage of multiple regression
analysis is that it can incorporate fairly general
functional form relationships.
– In the simple regression model, only one function of
a single explanatory variable can appear in the
equation. As we will see, the multiple regression
model allows for much more flexibility.
Multiple Regression Analysis :
Example

• Example
Interpretation of
estimated coefficients
• \hat{\beta}_j: A one [unit of independent variable] increase in [independent variable] [increases/decreases] [dependent variable] by \hat{\beta}_j [units of dependent variable], holding the other variables constant

• If x_j is continuous
• If the relationship is linear
• Example
Assumptions of S/MLR

1. LR 1: Linear in Parameters

2. LR 2: Random Sampling

3. LR 3: Zero Conditional Mean

4. SLR 4: Sample Variation in the Independent Variable
   MLR 4: No Perfect Collinearity

5. LR 5: Homoskedasticity
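For reference, a minimal formal statement of these assumptions for the simple model (standard notation; the multiple-regression versions replace the single x with x_1, …, x_k):

LR 1 (linear in parameters): y = \beta_0 + \beta_1 x + u
LR 2 (random sampling): \{(x_i, y_i) : i = 1, \dots, n\} is a random sample from the population model
LR 3 (zero conditional mean): E(u \mid x) = 0
SLR 4 (sample variation): \sum_{i=1}^{n} (x_i - \bar{x})^2 > 0
LR 5 (homoskedasticity): Var(u \mid x) = \sigma^2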
The primary drawback in using simple regression analysis for empirical
work is that it is very difficult to draw ceteris paribus conclusions
about how x affects y : the key assumption, SLR.3—that all other
factors affecting y are uncorrelated with x—is often unrealistic

Multiple regression analysis is more amenable to ceteris paribus


analysis because it allows us to explicitly control for many other factors
which simultaneously affect the dependent variable.
\hat{\beta}_1 = \frac{Cov(x_{1i}, y_i)}{Var(x_{1i})}
• For example, suppose we want to estimate the
effect of campaign spending on campaign
outcomes.

• For simplicity, assume that each election has two


candidates. Let voteA be the percent of the vote for
Candidate A, let expendA be campaign
expenditures by Candidate A, let expendB be
campaign expenditures by Candidate B, and let
totexpend be total campaign expenditures;
Consider the model voteA = \beta_0 + \beta_1 expendA + \beta_2 expendB + \beta_3 totexpend + u. Can we interpret \beta_3?

This model violates assumption MLR.4 because, by definition, totexpend has an exact linear relationship with expendA and expendB (totexpend = expendA + expendB).
Solution → Drop totexpend from the model.

Example
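We do not have the campaign-spending data at hand, but perfect collinearity is easy to reproduce with any dataset; a minimal sketch using the auto data (Stata automatically omits one of the collinear regressors):

sysuse auto, clear
gen lengthft = length/12       // length in feet: an exact linear function of length
reg price length lengthft      // one regressor is omitted because of collinearity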
Remember…LR1
• We are doing a LINEAR regression model
– SLR 1 → Linear in parameters
– We cannot estimate a model in which the parameters enter non-linearly, because \beta_0 and \beta_1 are then not linearly related to y
– For such a model we must use a non-linear regression method

• But we can use a linear regression model even though the variables themselves enter non-linearly
Functional Form
• OLS can be used for relationships that are not
strictly linear in x and y by using nonlinear
functions of x and y – will still be linear in the
parameters
– Can use quadratic forms of x
– Can take the natural log of x, y or both
– Can use interactions of x variables



Quadratic Models
• For a model of the form y = \beta_0 + \beta_1 x + \beta_2 x^2 + u, we can’t interpret \beta_1 alone as measuring the change in y with respect to x; we need to take \beta_2 into account as well, since

\Delta\hat{y} \approx (\hat{\beta}_1 + 2\hat{\beta}_2 x)\,\Delta x, \text{ so } \frac{\Delta\hat{y}}{\Delta x} \approx \hat{\beta}_1 + 2\hat{\beta}_2 x



More on Quadratic Models
• Suppose that the coefficient on x is positive and the coefficient on x² is negative
• Then y is increasing in x at first, but will eventually turn around and be decreasing in x

For \hat{\beta}_1 > 0 and \hat{\beta}_2 < 0, the turning point will be at

x^* = \left| \hat{\beta}_1 / (2\hat{\beta}_2) \right|
More on Quadratic Models
• Suppose that the coefficient on x is negative and the coefficient on x² is positive
• Then y is decreasing in x at first, but will eventually turn around and be increasing in x

For \hat{\beta}_1 < 0 and \hat{\beta}_2 > 0, the turning point will be at

x^* = \left| \hat{\beta}_1 / (2\hat{\beta}_2) \right|,

which is the same as when \hat{\beta}_1 > 0 and \hat{\beta}_2 < 0


Quadratic Models: Example
• Suppose the relationship between Income (million
Rp) and age (year) is positive and linear
• We hence believe that as age increases by 1 year, income increases according to the coefficient of our regression.
– If \hat{\beta}_1 = 1.5, then as the age of an individual increases by one year, income increases on average by 1.5 million Rp
– The income of a person aged 78 would then be predicted to exceed the income of a person aged 40 (is this sensible?)
• OR, any other examples
Detecting model misspecification
• We can use a residual plot to check
– whether we have omitted important variable(s)
– whether we should incorporate non-linearity in our independent variable(s)
• A residual plot is a scatter plot of the residuals against a particular independent variable
• This method is sensitive to several things, such as
– the number of observations
– outliers
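A minimal Stata sketch of a residual plot, continuing the auto example (rvpplot is the built-in residual-versus-predictor plot available after regress):

sysuse auto, clear
quietly reg price mpg
predict resid, residuals
scatter resid mpg, yline(0)    // residuals plotted against the regressor
rvpplot mpg, yline(0)          // equivalent built-in residual-versus-predictor plot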
Outlier(s)
• Outliers are values of a variable that lie far above or below the average of that variable
• Before conducting any regression analysis,
always check your data for outlier(s)
– That’s why plotting a scatterplot is always
important

– Example
Quadratic Models: Example
• Now let’s go back to our discussion on Income
and age
• According to theory, there should be an inverted-U shaped relationship between the two variables
• Hence we need to specify a quadratic model:
  income = \beta_0 + \beta_1 age + \beta_2 age^2 + u
Quadratic Models: Example
• Let’s use Wooldridge data “SMOKE” in GRETL
– Number of cigarettes smoked each day depends
on age
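For anyone working in Stata rather than GRETL, a sketch of the same exercise; it assumes the user-written bcuse command is installed and that the SMOKE file contains the variables cigs and age, as in Wooldridge's data:

* ssc install bcuse              // run once, if bcuse is not yet installed
bcuse smoke, clear               // fetch Wooldridge's SMOKE dataset
gen agesq = age^2
reg cigs age agesq               // cigarettes per day, quadratic in age
display "turning point (age): " -_b[age]/(2*_b[agesq])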
Logarithmic Models
• Log model → natural log, even though it is written “log”

• Why do we need logs of the variable(s)?


– Theoretical issues, needs functional transformation
– Efficiency issues

• Double log (or log – log model)


• Semi log model(s)
Functional Transformation
• Some theoretical economic models involve non-linearity
– Cobb-Douglas production function:

Q = A K^{\alpha} L^{\beta}

we can transform this to a linear form:

\ln Q = \ln A + \alpha \ln K + \beta \ln L

Functional Transformation

The estimated version would be:

\widehat{\ln Q} = \hat{\beta}_0 + \hat{\alpha}\,\ln K + \hat{\beta}\,\ln L, \quad \text{where } \hat{\beta}_0 \text{ estimates } \ln A

The estimated coefficients \hat{\alpha} and \hat{\beta} are elasticities


Functional Transformation
• Consider
  y = \beta_0 + \beta_1 x + u   (1)
and
  \ln y = \beta_0 + \beta_1 x + u   (2)

• Y is not the same as Log Y

• We cannot use R² to compare (1) and (2), as the ANOVA statistics (the sums of squares) are totally different
Interpretation of Log (or…=ln) Models
• If the model is ln(y) = b0 + b1ln(x) + u
b1 is the elasticity of y with respect to x

• If the model is ln(y) = b0 + b1x + u


100·b1 is approximately the percentage change in y given a 1 unit change in x → 2 ways of computing it (accuracy issue)

• If the model is y = b0 + b1ln(x) + u


b1 is approximately the change in y for a 100 percent change in x



hprice2.dta
• Price of a house ($) is a function of pollution (nitrogen oxide in the air, measured in parts per 100 million) and the number of rooms

• Interpretation of the estimated coefficients
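A sketch of this regression in Stata; it assumes the hprice2 file can be fetched with bcuse and contains the log variables lprice and lnox, as in Wooldridge's data:

bcuse hprice2, clear
reg lprice lnox rooms
// _b[lnox]:  the elasticity of price with respect to nox
//            (a 1% increase in nox changes price by roughly _b[lnox] percent)
// _b[rooms]: 100*_b[rooms] is approximately the percent change in price
//            for one additional room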
Accuracy: semi-elasticity
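In a log-level model \ln y = \beta_0 + \beta_1 x + u, the approximation \%\Delta y \approx 100\,\hat{\beta}_1\,\Delta x is accurate only when \hat{\beta}_1\,\Delta x is small; the exact percentage change (a standard result) is

\%\Delta\hat{y} = 100\left[\exp(\hat{\beta}_1 \Delta x) - 1\right]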
