05 16 Simple Regression 2


Introduction to Econometrics

Ekki Syamsulhakim
Undergraduate Program
Department of Economics
Universitas Padjadjaran
Last Week
• Chapter 2 – Wooldridge
– Simple regression model
• Definition
• Zero Conditional Mean Assumption
• Derivation of OLS estimates
Today
• Chapter 2 – Wooldridge
– Simple regression model continues
• Goodness of fit
• Interpretation of simple regression parameters

• Chapter 3 – Wooldridge
– Multiple regression model
– Omitted Variable Bias and Multiple Regression Model
– Gauss Markov Assumptions
OLS estimates of \hat{\beta}_0 and \hat{\beta}_1

• Can be easily calculated using software

• Excel example
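For reference, these are the closed-form estimators derived last week (the standard OLS formulas, in Wooldridge's notation):

\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}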
OLS estimates of \hat{\beta}_0 and \hat{\beta}_1
Once we have determined the OLS intercept and
slope estimates, we form the OLS regression
line (Sample Regression Function / SRF):

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x   [2.23]
Fitted Values and Residuals
• We assume that the intercept and slope estimates, \hat{\beta}_0 and \hat{\beta}_1, have been obtained for the given sample of data.
• Given \hat{\beta}_0 and \hat{\beta}_1, we can obtain the fitted value \hat{y}_i for each observation.
Fitted Values and Residuals
• By definition, each fitted value \hat{y}_i is on the OLS regression line.
• The OLS residual associated with observation i, \hat{u}_i, is the difference between y_i and its fitted value: \hat{u}_i = y_i - \hat{y}_i.
– If \hat{u}_i is positive, the line underpredicts y_i;
– if \hat{u}_i is negative, the line overpredicts y_i.
• STATA example (sysuse auto)
Fitted Values and Residuals
clear
sysuse auto                                   // load Stata's built-in 1978 automobile data
reg price mpg                                 // simple regression of price on mpg
predict pricehat                              // fitted values (default option: xb)
predict resid, residuals                      // OLS residuals
browse price pricehat resid                   // inspect y, the fitted values, and the residuals
scatter price mpg || line pricehat mpg, sort  // data points with the OLS regression line
Algebraic Properties of OLS Statistics
1. The sum, and therefore the sample average, of the OLS residuals is zero.
Mathematically,

\sum_{i=1}^{n} \hat{u}_i = 0

remember that the residuals are defined by

\hat{u}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i

In other words, the OLS estimates \hat{\beta}_0 and \hat{\beta}_1 are chosen
to make the residuals add up to zero (for any data set).
Algebraic Properties of OLS Statistics

2. The sample covariance between the regressors and the OLS residuals is zero.

This follows from the first-order condition

\sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0

then \sum_{i=1}^{n} x_i \hat{u}_i = 0 (why?),

making the sample covariance between x_i and \hat{u}_i equal to zero.
Algebraic Properties of OLS Statistics

3. The point (\bar{x}, \bar{y}) is always on the OLS regression line. In other words, if we take equation (2.23) and plug in \bar{x} for x, then the predicted value is \bar{y}.
Algebraic Properties of OLS Statistics
• Writing each y_i as its fitted value plus its residual provides another way to interpret an OLS regression. For each i, write

y_i = \hat{y}_i + \hat{u}_i   (2.32)

• From property (1), the average of the residuals is zero; equivalently, the sample average of the fitted values, \bar{\hat{y}}, is the same as the sample average of the y_i, or \bar{\hat{y}} = \bar{y}.
Algebraic Properties of OLS Statistics
• Further, properties (1) and (2) can be used to show that the sample covariance between \hat{y}_i and \hat{u}_i is zero.

• Thus, we can view OLS as decomposing each y_i into two parts, a fitted value and a residual. The fitted values and residuals are uncorrelated in the sample.
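These algebraic properties are easy to verify numerically; a minimal Stata check, continuing the sysuse auto example from earlier:

sysuse auto, clear
quietly reg price mpg
predict pricehat               // fitted values
predict resid, residuals       // OLS residuals
summarize resid                // mean of the residuals is (numerically) zero
correlate mpg resid            // regressor and residuals are uncorrelated
correlate pricehat resid       // fitted values and residuals are uncorrelated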
decomposition by OLS
Define the total sum of squares (SST), the
explained sum of squares (SSE), and the
residual sum of squares (SSR) (also known as
the sum of squared residuals), as follows:

SST \equiv \sum_{i=1}^{n} (y_i - \bar{y})^2   (2.33)
decomposition by OLS
• Define the total sum of squares (SST), the
explained sum of squares (SSE), and the
residual sum of squares (SSR) (also known as
the sum of squared residuals), as follows:

SSE \equiv \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2   (2.34)
decomposition by OLS
• Define the total sum of squares (SST), the explained
sum of squares (SSE), and the residual sum of
squares (SSR) (also known as the sum of squared
residuals), as follows:

SSR \equiv \sum_{i=1}^{n} \hat{u}_i^2   (2.35)
and
SST = SSE + SSR   (2.36)
Fitted values and residuals
• We want to know whether the model can explain the variation in the dependent variable through the variation in the explanatory variable
– SSR measures the part of the variation in the dependent variable left unexplained by the model
• We want to measure what proportion of the variation in the dependent variable is accounted for by the variation in the explanatory variable
– SSE measures the part of the variation in the dependent variable explained by the model
Important Notes
• Some words of caution about SST, SSE, and
SSR are in order. There is no uniform
agreement on the names or abbreviations for
the three quantities defined in equations.
• The total sum of squares is called either SST or
TSS, so there is little confusion here.
Important Notes
• Unfortunately, the explained sum of squares is
sometimes called the “regression sum of
squares.”
– If this term is given its natural abbreviation, it can
easily be confused with the term “residual sum of
squares.”
– Some regression packages refer to the explained
sum of squares as the “model sum of squares.”
Important Notes
• To make matters even worse, the residual sum
of squares is often called the “error sum of
squares.”
• This is especially unfortunate because, as we
will see in Section 2.5, the errors and the
residuals are different quantities.
• We prefer to use the abbreviation SSR to
denote the sum of squared residuals, because
it is more common in econometric packages.
Model’s goodness of fit
• Do we have a good or bad model?
– It is often useful to compute a number that summarizes
how well the OLS regression line fits the data.

• How can we conclude that the model is acceptable?

• We use R² and the SE of the regression to check the goodness of fit of the model
– R² can only be compared across models with the same dependent variable
R²

• We want to know whether the model can explain the variation in the dependent variable through the variation in the explanatory variable

• We want to measure what proportion of the variation in the dependent variable is accounted for by the variation in the explanatory variable
R²

• Assuming that the total sum of squares, SST, is not equal to zero (which is true except in the very unlikely event that all the y_i equal the same value), we can divide (2.36) by SST to get

1 = SSE/SST + SSR/SST

• The R-squared of the regression, sometimes called the coefficient of determination, is defined as

R^2 \equiv \frac{SSE}{SST} = 1 - \frac{SSR}{SST}
R²

• Coefficient of “determination”
• Useful for comparing models that use the same dependent variable
R²

• From (2.36), the value of R² is always between zero and one, because SSE can be no greater than SST
• When interpreting R², we usually multiply it by 100 to change it into a percent: 100·R² is the percentage of the sample variation in y that is explained by x.
R²

• If the data points all lie on the same line, OLS provides a perfect fit to the data.
– In this case, R² = 1.
– A value of R² that is nearly equal to zero indicates a poor fit of the OLS line: very little of the variation in the y_i is captured by the variation in the \hat{y}_i (which all lie on the OLS regression line)
R²

• In fact, it can be shown that R² is equal to the square of the sample correlation coefficient between y_i and \hat{y}_i
• This is where the term “R-squared” came
from.
– (The letter R was traditionally used to denote an
estimate of a population correlation coefficient,
and its usage has survived in regression analysis.)
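A quick numerical check of this fact, again a sketch using the auto example from earlier:

sysuse auto, clear
quietly reg price mpg
predict pricehat                 // fitted values
display e(r2)                    // R-squared reported by regress
quietly correlate price pricehat
display r(rho)^2                 // squared sample correlation between y and y-hat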
Some extensions about R²
R² issues

• R² in microeconometric analyses tends to be lower than in time-series regressions
– At the individual level, many factors may play a large role in determining responses
– These factors are often unobserved, leading to a low goodness of fit of the model

• Read: Wooldridge, Introductory… 2002, p40


Adjusted R²
• It is inevitable that R² will increase as we add more and more independent variables
• Another measure, namely the adjusted R² (written \bar{R}^2), tackles this issue
• \bar{R}^2 “imposes a penalty for adding additional independent variables into a model”

We may come back to the R² and \bar{R}^2 issue after discussing inference
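For reference, the usual definition of the adjusted R² (a standard formula, with n observations and k slope parameters):

\bar{R}^2 = 1 - \frac{SSR/(n-k-1)}{SST/(n-1)} = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}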
Standard Error of Regression
(may be discussed more later)
• It is an estimate of the typical magnitude of the error produced by the model (the standard deviation of the residuals)
• The closer the SE of the regression is to zero, the better the model fits
STATA example
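A minimal sketch of where this appears in Stata output: the standard error of the regression is reported as “Root MSE” and stored in e(rmse).

sysuse auto, clear
reg price mpg            // the "Root MSE" in the output header is the SE of the regression
display e(rmse)          // the same quantity, retrieved from the stored results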
INTERPRETATION OF SIMPLE
REGRESSION PARAMETERS
OLS – Simple Regression Model
• Suppose the true relationship is
  y = \beta_0 + \beta_1 x + u
• Then, using the OLS method, we obtain
  \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x
Interpretation of the intercept
• \hat{\beta}_0: The model predicts [dependent variable] to be \hat{\beta}_0 [units of dependent variable] if [independent variable] is zero [units of independent variable]

• Example
Interpretation of the intercept
• \hat{\beta}_0: The model predicts [dependent variable] to be \hat{\beta}_0 [units of dependent variable] if [independent variable] is zero [units of independent variable]

• Should we keep \hat{\beta}_0?
• Regression through the origin?
Interpretation of the slope

• \hat{\beta}_1: A one [unit of independent variable] increase in [independent variable] [increases/decreases] [dependent variable] by \hat{\beta}_1 [units of dependent variable]

• If x is continuous
• If the relationship is linear
• Example
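A sketch of how these templates read in practice, using the auto example from earlier (the coefficient values come from the regression output and are not quoted here):

sysuse auto, clear
reg price mpg
// Intercept: the model predicts price to be _b[_cons] dollars when mpg is zero
//            (not economically meaningful here, since mpg = 0 lies outside the data)
// Slope: a one mile-per-gallon increase in mpg changes price by _b[mpg] dollars
display _b[_cons]
display _b[mpg]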
OMITTED VARIABLE BIAS &
MULTIPLE REGRESSION
Multiple Regression Model : Introduction

• Suppose the true relationship is
  y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u
• What are our OLS estimates \hat{\beta}_1 and \hat{\beta}_2?

Multiple Regression Model : Introduction

• Suppose the true relationship is
  y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u
• But we omit x2 and estimate:
  y = \beta_0 + \beta_1 x_1 + v, where v = \beta_2 x_2 + u
• We will have Omitted Variable Bias


Omitted Variable Bias

By taking the expected value of \tilde{\beta}_1 from the misspecified regression:

E(\tilde{\beta}_1) = \beta_1 + \beta_2 \tilde{\delta}_1

where \tilde{\delta}_1 is the slope from regressing x_2 on x_1. Similar for \tilde{\beta}_0.
Bias of \tilde{\beta}_1 when x_2 is omitted
• Direction of bias: the bias \beta_2 \tilde{\delta}_1 is positive when \beta_2 and Corr(x_1, x_2) have the same sign, and negative when they have opposite signs
• But the size of the bias also matters


Example: OVB (AAY)
Other Sources of Bias
• Sources of bias:
– Omitted variable bias
– Endogeneity bias
– Simultaneity bias
– Selection bias
Multiple Regression Analysis :
Reasons
• There is no way an economic variable is
influenced by ONLY ONE variable
– Except, occasionally, in rare cases

• Because multiple regression models can


accommodate many explanatory variables
that may be correlated, we can hope to infer
causality in cases where simple regression
analysis would be misleading.
Multiple Regression Analysis :
Reasons
• Multiple regression analysis can be used to
build better models for predicting the
dependent variable.
• An additional advantage of multiple regression
analysis is that it can incorporate fairly general
functional form relationships.
– In the simple regression model, only one function of
a single explanatory variable can appear in the
equation. As we will see, the multiple regression
model allows for much more flexibility.
Multiple Regression Analysis :
Example

• Example
Interpretation of
estimated coefficients
• \hat{\beta}_j: A one [unit of independent variable] increase in [independent variable] [increases/decreases] [dependent variable] by \hat{\beta}_j [units of dependent variable], holding the other variables constant

• If x_j is continuous
• If the relationship is linear
• Example
Assumptions of S/MLR

1. LR 1: Linear in Parameters

2. LR 2: Random Sampling

3. LR 3: Zero Conditional Mean

4. SLR 4: Sample Variation in the Independent Variable
   MLR 4: No Perfect Collinearity

5. LR 5: Homoskedasticity
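For reference, a minimal formal statement of these assumptions for the simple model (standard notation; the multiple-regression versions replace the single x with x_1, …, x_k):

LR 1 (linear in parameters): y = \beta_0 + \beta_1 x + u
LR 2 (random sampling): \{(x_i, y_i) : i = 1, \dots, n\} is a random sample from the population model
LR 3 (zero conditional mean): E(u \mid x) = 0
SLR 4 (sample variation): \sum_{i=1}^{n} (x_i - \bar{x})^2 > 0
LR 5 (homoskedasticity): Var(u \mid x) = \sigma^2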
The primary drawback in using simple regression analysis for empirical
work is that it is very difficult to draw ceteris paribus conclusions
about how x affects y : the key assumption, SLR.3—that all other
factors affecting y are uncorrelated with x—is often unrealistic

Multiple regression analysis is more amenable to ceteris paribus


analysis because it allows us to explicitly control for many other factors
which simultaneously affect the dependent variable.
\hat{\beta}_1 = \frac{Cov(x_{1i}, y_i)}{Var(x_{1i})}
• For example, suppose we want to estimate the
effect of campaign spending on campaign
outcomes.

• For simplicity, assume that each election has two


candidates. Let voteA be the percent of the vote for
Candidate A, let expendA be campaign
expenditures by Candidate A, let expendB be
campaign expenditures by Candidate B, and let
totexpend be total campaign expenditures;
Consider the model voteA = \beta_0 + \beta_1 expendA + \beta_2 expendB + \beta_3 totexpend + u. Can we interpret \beta_3?

This model violates assumption MLR.4 because, by definition, totexpend has an exact linear relationship with expendA and expendB (totexpend = expendA + expendB).
Solution → Drop totexpend from the model.

Example
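We do not have the campaign-spending data at hand, but perfect collinearity is easy to reproduce with any dataset; a minimal sketch using the auto data (Stata automatically omits one of the collinear regressors):

sysuse auto, clear
gen lengthft = length/12       // length in feet: an exact linear function of length
reg price length lengthft      // one regressor is omitted because of collinearity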
Remember…LR1
• We are doing a LINEAR regression model
– SLR 1 → Linear in parameters
– We cannot estimate a model in which the parameters enter non-linearly, because \beta_0 and \beta_1 are then not linearly related to y
– For such a model we must use a non-linear regression method

• But we can use a linear regression model even though the variables themselves enter non-linearly
Functional Form
• OLS can be used for relationships that are not
strictly linear in x and y by using nonlinear
functions of x and y – will still be linear in the
parameters
– Can use quadratic forms of x
– Can take the natural log of x, y or both
– Can use interactions of x variables



Quadratic Models
• For a model of the form y = \beta_0 + \beta_1 x + \beta_2 x^2 + u, we can’t interpret \beta_1 alone as measuring the change in y with respect to x; we need to take \beta_2 into account as well, since

\Delta\hat{y} \approx (\hat{\beta}_1 + 2\hat{\beta}_2 x)\,\Delta x, \text{ so } \frac{\Delta\hat{y}}{\Delta x} \approx \hat{\beta}_1 + 2\hat{\beta}_2 x



More on Quadratic Models
• Suppose that the coefficient on x is positive and the coefficient on x² is negative
• Then y is increasing in x at first, but will eventually turn around and be decreasing in x

For \hat{\beta}_1 > 0 and \hat{\beta}_2 < 0, the turning point will be at

x^* = \left| \hat{\beta}_1 / (2\hat{\beta}_2) \right|
More on Quadratic Models
• Suppose that the coefficient on x is negative and the coefficient on x² is positive
• Then y is decreasing in x at first, but will eventually turn around and be increasing in x

For \hat{\beta}_1 < 0 and \hat{\beta}_2 > 0, the turning point will be at

x^* = \left| \hat{\beta}_1 / (2\hat{\beta}_2) \right|,

which is the same as when \hat{\beta}_1 > 0 and \hat{\beta}_2 < 0


Quadratic Models: Example
• Suppose the relationship between Income (million
Rp) and age (year) is positive and linear
• We hence believe that as age increases by 1 year, income increases according to the coefficient of our regression.
– If \hat{\beta}_1 = 1.5, then as the age of an individual increases by one year, income increases on average by 1.5 million Rp
– The income of a person aged 78 would then be predicted to exceed the income of a person aged 40 (is this sensible?)
• OR, any other examples
Detecting model misspecification
• We can use a residual plot to check
– whether we have omitted important variable(s)
– whether we should incorporate non-linearity in our independent variable(s)
• A residual plot is a scatter plot of the residuals against a particular independent variable
• This method is sensitive to several things, such as
– the number of observations
– outliers
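A minimal Stata sketch of a residual plot, continuing the auto example (rvpplot is the built-in residual-versus-predictor plot available after regress):

sysuse auto, clear
quietly reg price mpg
predict resid, residuals
scatter resid mpg, yline(0)    // residuals plotted against the regressor
rvpplot mpg, yline(0)          // equivalent built-in residual-versus-predictor plot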
Outlier(s)
• Outliers are values of a variable that lie far above or below the average of that variable
• Before conducting any regression analysis,
always check your data for outlier(s)
– That’s why plotting a scatterplot is always
important

– Example
Quadratic Models: Example
• Now let’s go back to our discussion on Income
and age
• According to theory, there should be an inverted-U shaped relationship between the two variables
• Hence we need to specify a quadratic model:
  income = \beta_0 + \beta_1 age + \beta_2 age^2 + u
Quadratic Models: Example
• Let’s use Wooldridge data “SMOKE” in GRETL
– Number of cigarettes smoked each day depends
on age
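For anyone working in Stata rather than GRETL, a sketch of the same exercise; it assumes the user-written bcuse command is installed and that the SMOKE file contains the variables cigs and age, as in Wooldridge's data:

* ssc install bcuse              // run once, if bcuse is not yet installed
bcuse smoke, clear               // fetch Wooldridge's SMOKE dataset
gen agesq = age^2
reg cigs age agesq               // cigarettes per day, quadratic in age
display "turning point (age): " -_b[age]/(2*_b[agesq])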
Logarithmic Models
• Log model → natural log, even though it is written “log”

• Why do we need logs of the variable(s)?


– Theoretical issues, needs functional transformation
– Efficiency issues

• Double log (or log – log model)


• Semi log model(s)
Functional Transformation
• Some theoretical economic models involve non-linearity
– Cobb-Douglas production function:

Q = A K^{\alpha} L^{\beta}

we can transform this to a linear form:

\ln Q = \ln A + \alpha \ln K + \beta \ln L

Functional Transformation

The estimated version would be:

\widehat{\ln Q} = \hat{\beta}_0 + \hat{\alpha}\,\ln K + \hat{\beta}\,\ln L, \quad \text{where } \hat{\beta}_0 \text{ estimates } \ln A

The estimated coefficients \hat{\alpha} and \hat{\beta} are elasticities


Functional Transformation
• Consider
  y = \beta_0 + \beta_1 x + u   (1)
and
  \ln y = \beta_0 + \beta_1 x + u   (2)

• Y is not the same as Log Y

• We cannot use R² to compare (1) and (2), as the ANOVA statistics (the sums of squares) are totally different
Interpretation of Log (or…=ln) Models
• If the model is ln(y) = b0 + b1ln(x) + u
b1 is the elasticity of y with respect to x

• If the model is ln(y) = b0 + b1x + u


100·b1 is approximately the percentage change in y given a 1 unit change in x → 2 ways of computing it (accuracy issue)

• If the model is y = b0 + b1ln(x) + u


b1 is approximately the change in y for a 100 percent change in x



hprice2.dta
• Price of a house ($) is a function of pollution (nitrogen oxide in the air, measured in parts per 100 million) and the number of rooms

• Interpretation of the estimated coefficients
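A sketch of this regression in Stata; it assumes the hprice2 file can be fetched with bcuse and contains the log variables lprice and lnox, as in Wooldridge's data:

bcuse hprice2, clear
reg lprice lnox rooms
// _b[lnox]:  the elasticity of price with respect to nox
//            (a 1% increase in nox changes price by roughly _b[lnox] percent)
// _b[rooms]: 100*_b[rooms] is approximately the percent change in price
//            for one additional room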
Accuracy: semi-elasticity
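In a log-level model \ln y = \beta_0 + \beta_1 x + u, the approximation \%\Delta y \approx 100\,\hat{\beta}_1\,\Delta x is accurate only when \hat{\beta}_1\,\Delta x is small; the exact percentage change (a standard result) is

\%\Delta\hat{y} = 100\left[\exp(\hat{\beta}_1 \Delta x) - 1\right]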
