
To remember

1. If you add more variables and the existing coefficients change, the reason might be that adding variables reduces the omitted variable bias. If adding a variable makes the existing coefficients change, this means that the added variable is correlated with the existing x and with y.
2. If the causal effect doesn't hold, we probably violate one of the Gauss-Markov assumptions.
3. Including a variable that doesn't change (doesn't vary) is a violation of the third Gauss-Markov assumption, that there has to be sample variation in x.
4. The annual growth based on quarterly data = [1 + _b(time)]^4 − 1 (see the sketch after this list).
5. By using the option vce(cluster) in Stata, we obtain standard errors that are robust to heteroskedasticity and to correlation within clusters.
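
A minimal Stata sketch of points 4 and 5, assuming a quarterly dataset with a log outcome lny, a time trend variable time and a cluster identifier id (all hypothetical names):

* hypothetical quarterly growth regression with cluster-robust standard errors
regress lny time, vce(cluster id)     // _b[time] = average quarterly growth rate
display (1 + _b[time])^4 - 1          // implied annual growth rate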

Terms and expressions


Precision refers to the variance of an estimator: higher precision means smaller variance.

Biased = when the average of the estimators doesn’t equal the true population parameter.

Unbiased: When the estimated beta is unbiased, it means that the expected value of the estimator is equal to the true population parameter, E(\hat{\beta}_1) = \beta_1.

Asymptotic = properties of an estimator when n tends to infinity

Consistent = when the estimate is close to the true population value with very high probability as n grows

Restricted model = a model in which we have assumed (under H0) that several of the
estimators are equal to zero. The restricted model always has fewer parameters than the
unrestricted model.

Robust standard errors = the standard errors, and hence the variance, are robust to
heteroskedasticity. A technique to obtain unbiased standard errors of the OLS coefficients
under heteroskedasticity.

Cluster data = when there are subsamples within the data that are related to each other. For
example, data for test scores in a school, those scores might be correlated with classroom
because classrooms share the same teacher.

Cluster robust standard errors = to be used in panel data. When error terms uit are
correlated within clusters but independent across clusters, then regular standard errors, which
assume independence between all observations, will be incorrect. Cluster-robust standard
errors are designed to allow for correlation between observations within cluster.

Unit root = a feature of a stochastic process, such as random walks, that may cause problems with statistical inference in time series. If there are d unit roots, the process will have to be differenced d times in order to become stationary.

When there is a correlation between u and xj we say that xj is endogenous. If this is not the case, xj is exogenous.
Validation can be helpful to avoid overfitting (capturing noise by adding too many
variables/terms) and to assess predictive performance of the model.

Definitions and explanations


Homoskedasticity: Homoskedasticity means that the error term u has the same variance given any value of the explanatory variable. The assumption of homoskedasticity may be written as Var(u | x) = \sigma^2, which also implies that Var(y | x) = \sigma^2. If this does not hold and Var(u | x) does depend on x, the error term exhibits heteroskedasticity. If this assumption is violated, we have heteroskedasticity; the estimated results will then still be unbiased but not efficient.
Homoskedasticity may be tested by looking at a plot of residuals and fitted values from the regression. If we see a pattern in the residual plot this indicates heteroskedasticity, and if there is no pattern it is an indication of homoskedasticity.
“Analyze the plot”
If there is a sign of a pattern in the residuals, this might be a sign of heteroskedasticity, which should be formally tested. For instance, the researcher may want to conduct a Breusch-Pagan or White test. The researcher may also want to use robust standard errors in the case of heteroskedasticity, which is a technique to obtain valid standard errors.
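
A minimal Stata sketch of this workflow, using the built-in auto data as a stand-in example:

sysuse auto, clear
quietly regress price weight mpg
rvfplot                    // residuals-vs-fitted plot: look for fanning out or curvature
estat hettest              // Breusch-Pagan test
estat imtest, white        // White test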

Panel data
Fixed effect estimator: The fixed effect estimator allows for a correlation between ai and x. FE is a stepwise form of OLS. In the first step you subtract the individual-specific mean over time from every value in the regression. Then this regression can be estimated by OLS, and the estimator will be unbiased, since ai is constant across time and therefore will no longer be a part of the regression. When using the fixed effect estimator, explanatory variables that don't vary over time can't be identified. The FE estimators also have a lot of parameters to estimate, resulting in a high consumption of degrees of freedom.

1. Interpret the coefficients and their significance


2. If possible, see if the relevant coefficient in the RE estimate is closer to the FE or the OLS coefficient.

Simple linear regression


OLS
The simple linear regression is defined as:
y=β 0 + β 1 x +u

u = error term, unobserved factors


\beta_0, \beta_1 = the population parameters we want to estimate. These are the true parameters, and we use data to estimate them. We can use the estimates to say something about the population parameters under some assumptions.

We define the residuals as the difference between actual and fitted values of yi.

\hat{u}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i

The OLS estimator chooses \hat{\beta}_0 and \hat{\beta}_1 to minimize the sum of squared residuals:

\sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} ( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i )^2

FOC:

\hat{\beta}_0: \; -2 \sum_{i=1}^{n} ( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i ) = 0
\hat{\beta}_1: \; -2 \sum_{i=1}^{n} x_i ( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i ) = 0

This yields:

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\text{sample covariance}(x_i, y_i)}{\text{sample variance}(x_i)}
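
A minimal Stata sketch of the closed-form slope and intercept, using the built-in auto data as an assumed example:

sysuse auto, clear
quietly correlate price weight, covariance
matrix C = r(C)                          // sample covariance matrix of (y, x)
display C[2,1]/C[2,2]                    // beta1_hat = cov(x,y)/var(x)
quietly summarize price
scalar ybar = r(mean)
quietly summarize weight
scalar xbar = r(mean)
display ybar - (C[2,1]/C[2,2])*xbar      // beta0_hat = ybar - beta1_hat*xbar
regress price weight                     // matches the OLS estimates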

Goodness of fit, R2
R2 tells us something about how well the explanatory variable explains the dependent variable, that is, how well changes in x explain changes in y. R2 measures the fraction of total variation that is explained by the regression and is always a value between 0 and 1. A higher value indicates a better fit.
You should be careful when comparing R2 across models and samples. A high R2 does not mean that there is a causal relationship.

Total sum of squares (sample variation):

SST = \sum_{i=1}^{n} ( y_i - \bar{y} )^2

Explained sum of squares (part that is explained by x):

SSE = \sum_{i=1}^{n} ( \hat{y}_i - \bar{y} )^2

Residual sum of squares (part that is unexplained by x):

SSR = \sum_{i=1}^{n} \hat{u}_i^2

SST = SSE + SSR

R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}
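
A quick Stata check of the decomposition, assuming the auto data:

sysuse auto, clear
quietly regress price weight
display e(r2)                            // R-squared reported by regress
display e(mss)/(e(mss) + e(rss))         // SSE/SST
display 1 - e(rss)/(e(mss) + e(rss))     // 1 - SSR/SST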
A good estimator
An estimator is good if it is
• Unbiased: E(\hat{\beta}_0) = \beta_0 and E(\hat{\beta}_1) = \beta_1, meaning that the average of the estimated values equals the true population parameters.
• Efficient: An estimator is efficient relative to another estimator if it has a smaller variance.
Under the following 5 assumptions, the OLS estimator is unbiased and efficient. These are called the Gauss-Markov assumptions.

Gauss-Markov Assumptions
1. Linear in parameters
The dependent variable must be related to the parameters and the error term linearly. The model has to be linear in the parameters only, so it is possible to use non-linear functions of x and y.

The definitions of independent and dependent variables (level-log, log-level, etc.) don’t affect
the mechanisms of calculating the estimates but do affect the size and hence the interpretation
of the estimated coefficients.

2. Random sampling
The data has to be drawn from a random sample of the population.

3. Sample variation in x (no perfect multicollinearity in the case of multiple regression)


The sample outcomes of the explanatory variable (xi) are not all equal to the same value. This
is necessary to figure out how changes in x will affect y.

Multiple regression: None of the explanatory variables are constant, and there are no exact linear relationships among the explanatory variables. If there exists such an exact linear combination of the other explanatory variables, the model suffers from perfect collinearity, and it cannot be estimated by OLS. This assumption allows the explanatory variables to correlate, they just cannot be perfectly correlated. If two of the variables are highly correlated, this may lead to large variances for the OLS slope estimators.

NB! This assumption has nothing to do with the error term.

4. Zero conditional mean


The error term, u, has an expected value of zero given any value of the explanatory variables:
E(u | x) = 0
This assumption is crucial to interpret the relation between x and y in a causal way. It enables us to think about the impact of changes in x keeping everything else equal.
u can affect y, but not x. If this assumption is violated, we get an endogeneity problem.
If E(u | x) = 0, then E(u) = 0.

This assumption can fail in the following ways:

• Misspecification: when the functional relationship between the explained and explanatory variables is misspecified in the equation, e.g. if we forget to add the quadratic term of age when we also include age, or when we add a variable in levels when it should be in logs
• Omitting an important factor that is correlated with any of the other explanatory variables
• Measurement errors
If this assumption is violated, xj is said to be an endogenous explanatory variable.

5. Homoskedasticity
The error term u has the same variance given any value of the explanatory variable:
Var(u | x) = \sigma^2
This implies that Var(y | x) = \sigma^2. If this does not hold and Var(u | x) does depend on x, the error term exhibits heteroskedasticity.
Homoskedasticity may be tested by looking at a plot of residuals and fitted values from the
regression.

Implications
• Under assumptions 1-4, the OLS estimator is unbiased, and \hat{\beta}_1 represents the causal effect of x on y in the population.
• Under assumptions 1-5 the OLS estimator is also efficient.

Under the 5 assumptions, we have that:

Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}

Var(\hat{\beta}_0) = \frac{\sigma^2 \, n^{-1} \sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}

These formulas rely on the unknown error variance, \sigma^2. We can use the estimated residuals to estimate \sigma^2:

\hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^{n} \hat{u}_i^2

We divide by n-2 instead of n because we have used 2 degrees of freedom when estimating \hat{\beta}_0 and \hat{\beta}_1. The smaller the estimated variance, the more precise (efficient) is the OLS estimator.

The variance of \hat{\beta}_1 is affected in the following ways if we increase n:

• We have more residuals to sum, so \sum_{i=1}^{n} \hat{u}_i^2 increases, which pushes \hat{\sigma}^2 up
• At the same time we divide by a larger n, which pushes \hat{\sigma}^2 down, so \hat{\sigma}^2 stays close to \sigma^2
• \sum_{i=1}^{n} (x_i - \bar{x})^2 increases because the squared deviations are summed over a larger n, so Var(\hat{\beta}_1) decreases

Interpretations
Estimator: formula for guessing a population parameter from sample data
Estimate: the actual value taken by an estimator in a specific sample
“An increase in x by 1 unit/% is associated with an increase/decrease in y by …. Units/%,
everything else equal”.

Level-level:
y=β 0 + β 1 x +u
∆ y =β 1 ∆ x

A one unit change in x leads to a β 1 unit change in y.

Log-level:
ln ( y )= β0 + β 1 x+u
% ∆ y ≈ (100∗β1 )∆ x

A one unit change in x leads to approximately a (100 * β 1)% change in y.



Level-log:
y = \beta_0 + \beta_1 \ln(x) + u

\Delta y \approx \left( \frac{\beta_1}{100} \right) \% \Delta x

A 1 % change in x leads to approximately a \beta_1 / 100 unit change in y. Percentage points?

Log-log:
\ln(y) = \beta_0 + \beta_1 \ln(x) + u

\% \Delta y \approx \beta_1 \, \% \Delta x

A 1 % change in x leads to approximately a \beta_1 % change in y. \beta_1 is the elasticity of y with respect to x.

The demand elasticity is defined as:

\varepsilon = \frac{\partial q}{\partial p} \cdot \frac{p}{q}

which is the elasticity for an equation q = q(p).

The price elasticity, for an equation p = p(q), will be:

\frac{1}{\varepsilon} = \frac{\partial p}{\partial q} \cdot \frac{q}{p}

If you have y = price and x = quantity, to get the price elasticity you take 1/\beta_1. This is the normal way of interpreting elasticities and makes the interpretation more intuitive.
Interpret, multiply, divide
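
A minimal Stata sketch of the log-log (elasticity) interpretation, assuming the auto data:

sysuse auto, clear
gen lnprice = ln(price)
gen lnweight = ln(weight)
regress lnprice lnweight      // the slope is the elasticity of price with respect to weight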

Binary variables
A binary variable takes either the value 0 or 1.
Then E(y | x = 0) = \beta_0 and E(y | x = 1) = \beta_0 + \beta_1.
This regression allows the mean value of y to vary with the value of x.
\beta_1 = E(y | x = 1) - E(y | x = 0)

Multiple regression model


In a multiple linear regression, we incorporate more explanatory factors into the model. This
gives a more reliable estimate of y. The multiple linear regression:
y=β 0 + β 1 x 1 + β 2 x 2 +…+ β k x k +u

Why use multiple regression?


There are several motivations to use a multiple regression:
• Controlling for several factors
• Estimating non-linear relationships (the model must be linear in the parameters, not in the variables)
• Better goodness of fit
o More independent variables can explain more of the variation in y and give a higher R2 (goodness of fit). R2 is still not a good criterion for deciding whether or not to include a variable in a regression.
o Testing joint hypotheses on parameters
• Testing e.g. whether \beta_1 > \beta_2

Beta in multiple regression


For a sample with n observations, we want to estimate the \beta_j:

\min \sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} ( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i,1} - ... - \hat{\beta}_k x_{i,k} )^2

The first order conditions give k+1 linear equations with k+1 unknowns. Solving this system of equations gives formulas for the OLS estimators \hat{\beta}_0, \hat{\beta}_1, ..., \hat{\beta}_k.

\hat{\beta}_j measures how, on average, y changes in our sample when x_j increases by one unit while holding all other independent variables constant. It shows the partial effect, the ceteris paribus effect, of x_j on y for j = 1,...,k. The ceteris paribus interpretation results from the linearity of the regression.
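
A small Stata illustration of this partial-effect idea (the Frisch-Waugh result), assuming the auto data:

sysuse auto, clear
regress price weight mpg            // slope on weight, holding mpg fixed
quietly regress weight mpg
predict w_resid, residuals          // the part of weight not explained by mpg
regress price w_resid               // same slope on weight as in the multiple regression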

A good estimator
1. Linear in parameters
The model relating y to the x_j can be written as follows:
y=β 0 + β 1 x 1 + β 2 x 2 +…+ β k x k +u
The population model must be linear in the parameters only. We can use non-linear functions
of xi.

2. Random sampling
The data is drawn from a random sample of the population.

3. No perfect collinearity
None of the variables are constant, and there are no exact linear relationships among the independent variables. This assumption is violated if there exist (a, b) such that x_j = a + b x_l. This will for instance be the case if one variable is a multiple of another, if one variable is the sum of others, and when variables are shares or exclusive categories you cannot include all shares/categories. STATA will drop one variable if you try to include two variables that are perfectly collinear.

4. Zero conditional mean


The error term, u, has an expected value of zero given any value of the explanatory variables. The error term needs to be uncorrelated with each explanatory variable in the model.
E(u | x_1, ..., x_k) = 0

When this assumption is violated we say that x is endogenous and we have an endogeneity
problem and a problem of identification.
5. Homoskedasticity
The error term u has the same variance given any value of the explanatory variables.
Var(u | x_1, ..., x_k) = \sigma^2

Implications
Under assumption 1-4 the OLS estimators are unbiased, so that E ( β^ j )=β j j=0,1 , … , k ,and we
can interpret the regression causally. This means that the average of the estimated values
equals the true population parameters.

If assumption 1-5 are satisfied, the OLS estimators, ^β j , are the best linear unbiased estimators
(BLUEs) of β j . OLS estimates will have the smallest variance (be most efficient) and be more
precise than other potential estimators.

BLUEs = Best Linear Unbiased Estimators

!Omitted variables
It might happen that we omit a variable that actually belongs in the population model. The model we should estimate would be as follows:
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u
But, omitting a relevant variable, e.g. x_2, out of ignorance, we end up estimating the following model (the tilde emphasizes that it is an underspecified model):
\tilde{y} = \tilde{\beta}_0 + \tilde{\beta}_1 x_1
There is a simple relationship between \tilde{\beta}_1 and \hat{\beta}_1:
\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_2 \tilde{\delta}_1
where \tilde{\delta}_1 is the slope coefficient from regressing x_2 on x_1. The difference between \tilde{\beta}_1 and \hat{\beta}_1 is the omitted variable bias: \hat{\beta}_2 \tilde{\delta}_1.

\tilde{\beta}_1 (short) = \hat{\beta}_1 (long, true) + \hat{\beta}_2 (long, true, x_2 on y) \cdot \tilde{\delta}_1 (x_2 on x_1)

Omitted variable bias: \hat{\beta}_2 \tilde{\delta}_1

E(\tilde{\beta}_1) = \beta_1 if, and only if, either:
• \hat{\beta}_2 = 0 (the partial effect of x_2 is zero), or
• \tilde{\delta}_1 = 0 (x_1 and x_2 are uncorrelated)

The direction of the bias depends on the sign of \beta_2 and the sign of Corr(x_1, x_2).
The omitted variable bias occurs when we have a variable that impacts the outcome and is correlated with the independent variable of interest.
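
A minimal Stata sketch that verifies the relationship between the "short" and "long" regressions in-sample, assuming the auto data:

sysuse auto, clear
quietly regress price weight mpg      // "long" regression
scalar b1_long = _b[weight]
scalar b2_long = _b[mpg]
quietly regress price weight          // "short" regression omitting mpg
scalar b1_short = _b[weight]
quietly regress mpg weight            // delta1: slope of the omitted x2 on x1
scalar delta1 = _b[weight]
display b1_short
display b1_long + b2_long*delta1      // identical: the bias term is b2_long*delta1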

Precision/variance of OLS
If we have k regressors, then:

Var(\hat{\beta}_j) = \frac{\sigma^2}{SST_j (1 - R_j^2)}, \quad j = 0, 1, 2, ..., k

R_j^2 is the R^2 from regressing x_j on all the other explanatory variables (y is not included). Since \sigma^2 is unknown, we need an unbiased estimator for it. Under assumptions 1-5, the following estimator is an unbiased estimator of \sigma^2:

\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} \hat{u}_i^2}{n-k-1} = \frac{SSR}{n-k-1}

n-k-1 is the degrees of freedom, which is [the number of observations - the number of estimated parameters].

Adding a new variable may affect the variance of the estimated beta in the following ways:
• Increase the numerator (top) through increasing \hat{\sigma}^2, since k increases by 1 when we add a new variable.
• Decrease the numerator (top) through decreasing \hat{\sigma}^2, since SSR decreases given that we have one less variable that is unobserved.
• Affect R_j^2 and thus decrease the denominator (bottom) if the added explanatory variable is correlated with the existing explanatory variable (x).

\hat{\beta}_j is more precise when
• There are more observations
• The variance of the error term is lower
• There is higher variation in the x_j's
• There is less correlation between the independent variables
o Less of a problem with multicollinearity

!Multicollinearity
Multicollinearity = high (but not perfect) correlation between two or more independent
variables.
Multicollinearity is not a violation of any assumptions, and the OLS estimators are still
BLUEs.

Multicollinearity causes a large variance of the estimators and therefore large standard errors, which makes statistical inference more difficult. To solve the problem of multicollinearity you can collect more data, redefine the research question or try to drop a variable, even though this might lead to an omitted variable bias (a tradeoff between precision and bias).

On the other hand, when there are variables that are not highly correlated with our independent variable of interest (x), we may want to include them in the regression even when excluding them would not cause omitted variable bias: it may reduce the variance in the error term and hence increase precision.
→ STATA interpretation: In a regression, omitting a highly correlated variable that appears to have zero effect (coefficient not statistically significant) on the dependent variable (y) will not severely bias the estimate, and we may therefore want to exclude it. If two variables are highly correlated, including both will diminish the precision of the OLS estimates. Excluding one of them, given that they are both significant, will on the other hand bias the estimates due to the omission (omitted variable bias). In this case, you would still like to include both variables and accept the lower precision.

!Multiple Regression Analysis: Inference


Under assumption 1-5, the distribution of ^β j can take any shape. Statements of statistical
inference require knowledge of the full sampling distribution of ^β j .

6. Normality assumption
We now assume that the unobserved error is normally distributed in the population:

u \sim Normal(0, \sigma^2)

We say that the error term is independently and identically distributed. The form and the variance of the distribution do not depend on any of the explanatory variables.

y | x \sim Normal(\beta_0 + \beta_1 x_1 + ... + \beta_k x_k, \sigma^2)

This implies that

\hat{\beta}_j \sim Normal(\beta_j, Var(\hat{\beta}_j))

\frac{\hat{\beta}_j - \beta_j}{sd(\hat{\beta}_j)} \sim Normal(0, 1)

The relationship between the normal (N) and standard normal (Z) distributions gives:

z = \frac{\hat{\beta}_j - \beta_j}{sd(\hat{\beta}_j)} \sim Normal(0, 1)

Since the population parameter is unknown, we hypothesize its value (H0) and use this value in the expression for z. \sigma is unknown in the expression for sd(\hat{\beta}_j) and is replaced with the unbiased estimator \hat{\sigma} = \sqrt{\frac{SSR}{n-k-1}}. This gives the test statistic

t = \frac{\hat{\beta}_j - \beta_j}{se(\hat{\beta}_j)} \sim t_{n-k-1}

Under this assumption we can hypothesize about the unknown population value of β j and use
statistical inference to test such hypothesis.

Implications
Under A1-A6 we have a classical linear model (CLM) assumptions. Under the CLM
assumptions, the OLS estimators have a stronger efficiency property than they would under
the Gauss-Markov assumptions.

Hypothesis testing
Possible conclusions:
 Reject H0
 Fail to reject H0
o We should never “accept” the H0.

1. Specify H0 and H1
2. Determine the significance level \alpha. The significance level determines the "rejection area": the area of the test statistic's distribution with probability mass exactly equal to \alpha.
3. Rejection rule: reject H0 when the test statistic is larger than some critical value c, which is determined by the chosen alternative hypothesis (one-sided or two-sided test) and the significance level. This means the probability of rejecting a "true" H0 is small.

We also need to consider degrees of freedom in a regression, df = n – k – 1

Relationship between t and F statistic


There are different types of test-statistics.

If H0 involves only one restriction, the test statistic is t-distributed.

If H0 involves several restrictions, the test statistic is F-distributed. The F-statistic should be used when H0 involves the hypothesis that more than one population parameter takes a certain value at the same time: H0: \beta_3 = 0, \beta_4 = 0, \beta_5 = 0.

The F-statistic for testing exclusion of a single variable (q = 1) is equal to the square of the corresponding t-statistic. t^2_{n-k-1} has an F_{1, n-k-1} distribution, so the two approaches give exactly the same outcome provided that the alternative in the t-test is two-sided. The t-statistics are easier to obtain and can be used to test against one-sided alternatives, so when testing a single parameter we therefore normally use the t-statistic.

The t-statistic:

t = \frac{\hat{\beta}_j - \beta_j|_{H_0}}{se(\hat{\beta}_j)} \sim t_{n-k-1}

A linear combination of parameters gives the test statistic:

t = \frac{(\hat{\beta}_1 + \hat{\beta}_2) - (\beta_1 + \beta_2)|_{H_0}}{se(\hat{\beta}_1 + \hat{\beta}_2)}

Note that se(\hat{\beta}_1 + \hat{\beta}_2) \neq se(\hat{\beta}_1) + se(\hat{\beta}_2). Since

Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)

we have

se(\hat{\beta}_1 + \hat{\beta}_2) = \sqrt{ se(\hat{\beta}_1)^2 + se(\hat{\beta}_2)^2 + 2 s_{12} }

where s_{12} is an estimate of Cov(\hat{\beta}_1, \hat{\beta}_2).
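
A minimal Stata sketch of a linear-combination test and a joint F-test, assuming the auto data:

sysuse auto, clear
quietly regress price weight length mpg
lincom weight + length        // t-test of H0: b_weight + b_length = 0, with the correct se
test weight length mpg        // F-test of 3 exclusion restrictions (joint significance)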

More about the F-statistic


When H0 is that all the betas in a group are equal to zero, this is equivalent to testing whether that group of variables has no effect on the dependent variable. If we want to test whether q = 3 of the k = 5 variables really should be included in the model, the null hypothesis would be:

H0: \beta_3 = 0, \beta_4 = 0, \beta_5 = 0 (the variables are jointly insignificant)

The null hypothesis then involves 3 exclusion restrictions, q = 3, and says that the 3 variables have no effect on y. The alternative hypothesis is that H0 is not true, and that at least one of the variables has an effect on y.

Given that H0 is true, then we have what is called a restricted model:


y=β 0 + β 1 x 1 + β 2 x 2 +u

The model without restrictions on the population parameters is called the unrestricted model. The restricted model always has fewer parameters than the unrestricted model.

Adding variables to a model reduces the part of the total variation in y that the model cannot explain: SSR_unrestricted < SSR_restricted. The F-test is based on the idea of comparing SSR_ur and SSR_r, and if SSR_r is "sufficiently higher" than SSR_ur, then we reject H0 and the restricted model.

"Sufficiently higher" is reasonable to define by a relative change:

\frac{SSR_r - SSR_{ur}}{SSR_{ur}}

We need to correct for degrees of freedom in the residuals:

F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n-k-1)} \sim F_{q, n-k-1}

q = df_r - df_{ur}
df_{ur} = n - k - 1

This test statistic will always be positive; if you get a negative F you have done something wrong.

When F is large, we reject H0. If we reject H0 we say that x_{k-q+1}, ..., x_k are jointly significant at the chosen significance level. This test does not tell us which of the variables has a partial effect on y. If the null is not rejected, the variables are jointly insignificant. Tables of the F-distribution tell us when a calculated F-statistic is above the critical value for the chosen significance level.

The R^2 version of the F-statistic:

F = \frac{(R^2_{ur} - R^2_r)/q}{(1 - R^2_{ur})/(n-k-1)}

R^2_{ur} = unrestricted model R^2 (the model with the most variables)
R^2_r = restricted model R^2 (the model with the fewest variables)
n = number of observations
q = the number of variables that are assumed to be 0 under H0 (numerator degrees of freedom)
k = total number of variables in the unrestricted model

since SSR_r = SST(1 - R^2_r) and SSR_ur = SST(1 - R^2_{ur})

P-values:
P-value is the largest significance level at which we would fail to reject H0. Given that H0 is
true, what is the probability that we observe a value of t that is as large as the value of our test
statistic? A low p-value is evidence against H0. How we compute p-values depends on H1
(one- or two-tailed alternative hypothesis).

p-value = P(T > t) (one-sided alternative)
p-value = P(|T| > |t|) = 2 P(T > |t|) (two-sided alternative)

https://www.statology.org/t-score-p-value-calculator/

https://www.statology.org/how-to-calculate-a-p-value-from-a-t-test-by-hand/

"The p-value is the probability of getting a coefficient of a given magnitude if we assume that the null hypothesis, stating that the value should be equal to 0, is true (hence there is no correlation between y and x). Since we here find the p-value to be so small, we choose to reject the null hypothesis, because the probability of observing a coefficient as large as -0.047 under the null is so small."

Type I and II errors


Type I error: rejecting a null hypothesis when it is true.
Type II error: failing to reject a null hypothesis when it is false.
The significance level \alpha gives us the probability of a Type I error.

Confidence intervals
The opposite of the rejection area under a 2-sided t-test. The critical value of the t-test for a
given significance level gives the probability mass that we want to allocate to the tails of
distribution, where the tails of the distribution represent “unlikely” values of the population
parameter under H0. A confidence interval represents a range of “likely” values of the
population parameter.

The confidence interval for the unknown \beta_j is given by:

\hat{\beta}_j \pm t_c \cdot se(\hat{\beta}_j)

where t_c is the critical value of the t-distribution that corresponds to a 2-tailed test with significance level \alpha. The t-distribution has degrees of freedom equal to df = (n-k-1), where k is the number of explanatory variables included in the model.
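
A minimal Stata sketch computing a 95 % confidence interval by hand, assuming the auto data:

sysuse auto, clear
quietly regress price weight mpg
scalar tc = invttail(e(df_r), 0.025)       // critical value with df = n-k-1
display _b[weight] - tc*_se[weight]        // lower bound
display _b[weight] + tc*_se[weight]        // upper bound (matches the regress output)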
Asymptotic properties of OLS estimators
Asymptotic properties of OLS estimators are properties of the estimator when n goes towards infinity (when n is large enough). When n tends to infinity OLS is consistent, meaning that there is a very high probability that the estimate is close to the true population value. Consistency is a minimum requirement for sensible estimators. Under assumptions A1-A4 the OLS estimator \hat{\beta}_j is consistent for all j. Under assumptions A1-A5 the estimator is asymptotically normally distributed, so when n is large we can conduct hypothesis testing as usual even when the normality assumption does not hold. When n is large we can also replace the zero conditional mean assumption by Cov(x_j, u) = 0, which is a weaker assumption.

Qualitative information
Qualitative information and dummies
Qualitative information can have 2 (binary) or more groups. We use a separate dummy variable, D, for each of the categories, except one. The category left out of the regression is called the base/reference/benchmark group. We interpret our coefficients against this excluded base group. You cannot include a dummy variable for all categories → dummy variable trap.

Dummy affecting the intercept of a regression equation:


y=β 0 + β 1 x + δD+u

Dummy affecting both slope and intercept of a regression equation:


y=β 0 + β 1 x + δD+ γ (D∗x)+u
Using a dummy variable allows the mean value of y to differ between the two groups:
\delta = E(y | D = 1, x) - E(y | D = 0, x)

We can also include an interaction term, where the dummy is interacted with another variable, as in the equation above.
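
A minimal Stata sketch of a dummy shifting the intercept and an interaction shifting both intercept and slope, assuming the auto data (foreign is a 0/1 dummy there):

sysuse auto, clear
regress price weight i.foreign            // dummy shifts the intercept
regress price c.weight##i.foreign         // dummy shifts both intercept and slope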

Heteroskedasticity
The homoskedasticity assumption explained previously says that the error term u has the same variance given any values of the explanatory variables:
Var(u | x_1, ..., x_k) = \sigma^2
The most general form of heteroskedasticity can be written as:
Var(u_i | x_i) = \sigma_i^2
A special case of heteroskedasticity is when the variance is a function of the explanatory variables:
Var(u | x) = \sigma^2 h(x)
Homoskedasticity was not used when showing that the OLS estimators are unbiased (to be unbiased, A1-A4 have to be fulfilled). Hence, heteroskedasticity does not result in biased estimators, and heteroskedasticity does not threaten the causal interpretation of the point estimates. In the case of heteroskedasticity, the interpretation of R2 is also unaffected.

Homoskedasticity was used for calculating the variance of the OLS estimators. With
heteroskedasticity, the variance-formulas that have been derived are invalid. We do not know
how precise our estimators are, and we do not know how to test hypotheses statistically. The
Gauss-Markov theorem that OLS is BLUE no longer holds.

To detect heteroskedasticity there are two tests that can be performed:

• The Breusch-Pagan test (BP test)
• The White test

The Breusch-Pagan test

Reg u_hat …variables
F-test where the null hypothesis is that all the coefficients = 0
H0: Assuming homoskedasticity, Var(u | x_1, ..., x_k) = \sigma^2.
Given the zero conditional mean assumption, this is equivalent to:
H0: E(u^2 | x_1, ..., x_k) = E(u^2) = \sigma^2.
This means that the squared error terms should be uncorrelated with all the explanatory variables if H0 is true. This is what the BP test tests.

If there is heteroskedasticity, \hat{u}^2 could be any function of the explanatory variables:

\hat{u}^2 = \delta_0 + \delta_1 x_1 + \delta_2 x_2 + ... + \delta_k x_k + \nu

The null hypothesis of homoskedasticity translates to:
H0: \delta_1 = \delta_2 = ... = \delta_k = 0
This can be tested with an F-test for the overall significance of the regression.

The Breusch-Pagan test tests only for a linear relationship between the squared residuals and the independent variables. Given that the sample is relatively large, the standard errors should not change much if we decide to correct for heteroskedasticity while the homoskedasticity assumption was in fact correct.
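
A minimal Stata sketch of the BP test, both built-in and "by hand", assuming the auto data (the rhs and fstat options request the version based on all regressors and an F-statistic):

sysuse auto, clear
quietly regress price weight mpg
estat hettest, rhs fstat           // built-in Breusch-Pagan test on the regressors
predict uhat, residuals
gen uhat2 = uhat^2
regress uhat2 weight mpg           // "by hand": the overall F-test is the BP test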

Dealing with heteroskedasticity

In general, the variance of \hat{\beta}_1 is given by:

Var(\hat{\beta}_1) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sigma_i^2}{SST_x^2}

When \sigma_i^2 = \sigma^2 for all observations i, the variance formula reduces to the familiar \sigma^2 / SST_x.

When \sigma_i^2 varies between observations, the usual calculations of standard errors are biased. A valid estimator of Var(\hat{\beta}_1) that is robust to heteroskedasticity of any form is:

\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 \hat{u}_i^2}{SST_x^2}
Robust standard errors are a technique to obtain unbiased standard errors of the OLS
coefficients under heteroskedasticity. When estimating this in STATA the estimated
coefficients will be the same, and only the standard errors will differ.

The reason for not using robust standard errors in all cases (regardless of heteroskedasticity or
not) is that when the sample size is “small” and the homoskedasticity assumption holds, t-
statistics using “robust” standard errors are not always close to the appropriate t-distribution
and could throw off inference. Given that the sample is relatively large, the standard errors
should not change much if we decide to correct for heteroskedasticity, while the
homoskedasticity assumption was correct.
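
A minimal Stata sketch showing that only the standard errors change, assuming the auto data:

sysuse auto, clear
regress price weight mpg                  // usual (homoskedasticity-based) standard errors
regress price weight mpg, vce(robust)     // heteroskedasticity-robust standard errors, same coefficients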

Alternative ways of dealing with violations of the homoskedasticity assumption:

• If the form of heteroskedasticity is known, we can transform the model to get rid of it, using for example Weighted Least Squares (WLS) or Generalized Least Squares (GLS).
• If the form of heteroskedasticity is not known, estimate it and then transform the model using Feasible GLS (FGLS).
• The point of using WLS or GLS when there is evidence of heteroskedasticity is the efficiency concern (the OLS estimators are no longer BLUEs).
Prediction and model specification
Models with quadratic terms

y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + u

If \beta_1 > 0 and \beta_2 < 0 the graph has a parabolic (inverse U) shape. The turning point can be calculated as follows:

x^* = \left| \frac{\beta_1}{2 \beta_2} \right|
!Prediction about average y
The population model:
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u
The OLS estimation:
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2
Predicting the average outcome of y when the independent variables take certain values means estimating:
\theta_0 = E(y | x_1 = c_1, x_2 = c_2) = \beta_0 + \beta_1 c_1 + \beta_2 c_2
The estimator is:
\hat{\theta}_0 = \hat{\beta}_0 + \hat{\beta}_1 c_1 + \hat{\beta}_2 c_2

A 95 % confidence interval for \theta_0 is:

\hat{\theta}_0 \pm t_c \cdot se(\hat{\theta}_0)

t_c is the critical value of the t-distribution with n-k-1 df, corresponding to a 2-tailed test with the chosen significance level. If \alpha = 5 % and n is large, t_c = 1.96.

A problem is that se(\hat{\theta}_0) is not part of the regular output from a regression, as
se(\hat{\theta}_0) \neq se(\hat{\beta}_0) + se(\hat{\beta}_1) c_1 + se(\hat{\beta}_2) c_2
We can transform the variables so that the regression output gives the right standard error:
y = \theta_0 + \beta_1 (x_1 - c_1) + \beta_2 (x_2 - c_2) + u
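
A minimal Stata sketch of this centering trick, assuming the auto data and arbitrarily chosen values c_1 = 3000 and c_2 = 20:

sysuse auto, clear
gen w_c = weight - 3000          // x1 - c1
gen m_c = mpg - 20               // x2 - c2
regress price w_c m_c            // the intercept is now theta0_hat, reported with its correct se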

!Prediction about average y when the outcome is ln(y)

If the model is

\ln(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u

assumptions A1-A6 imply that:

E(\ln(y) | x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 and Var(\ln(y) | x) = \sigma^2

To get the model in terms of y we take the "anti-log":

y = e^{\ln(y)} = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + u}

We need to be aware of the following:

E(y) = E(e^{\ln(y)}) \neq e^{E(\ln(y))}

Given A1-A6 we have that:

E(y | x) = e^{\sigma^2/2} \cdot e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}

where e^{\sigma^2/2} is the expected value of e^u under A6. So to predict y we have:

\hat{y} = e^{\hat{\sigma}^2/2} \cdot e^{\widehat{\ln(y)}} = e^{\hat{\sigma}^2/2 + \widehat{\ln(y)}}

where \hat{\sigma}^2 is the unbiased estimator of \sigma^2. With ln(y), this is the correct prediction rule.

An unbiased estimator of \sigma^2 is the mean squared error, \hat{\sigma}^2 = \frac{SSR}{n-k-1}.
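
A minimal Stata sketch of this prediction rule, assuming the auto data:

sysuse auto, clear
gen lnprice = ln(price)
quietly regress lnprice weight mpg
predict lnp_hat, xb                                 // fitted ln(y)
gen price_hat = exp(e(rmse)^2/2) * exp(lnp_hat)     // exp(sigma2_hat/2)*exp(ln(y)_hat)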

Comparing R2 between models with ln(y) and y


You cannot compare R2 values of models with different dependent variables, even though the controls are the same. Instead, we want to measure how much of the total variation in y is explained by the predicted y from the ln(y) model, using the sample correlation between y and the predicted y from the ln(y) model. In general, R2 equals the square of the sample correlation coefficient between y_i and \hat{y}_i.

The following regression procedure produces the sample correlation between y and the predicted y from the log(y) model.
1. Estimate the log(y) model and obtain the fitted values \widehat{\ln(y)}_i.
2. Generate a variable \hat{m}_i = e^{\widehat{\ln(y)}_i}.
3. Regress y on \hat{m}_i without an intercept. The fitted values for y from this regression represent the predicted y from the log(y) model. These fitted values we denote \tilde{y}.
4. Find the sample correlation between \tilde{y} and y in the sample, Corr(\tilde{y}, y).
5. Square this and compare with the R2 of the model with y in levels.

Corrected or adjusted R2
R2 will always increase as we add more independent variables. This is because SSR will
always decrease. To compare goodness of fit for nested models of this type we have to use the
corrected R2. Corrected R2 penalizes for adding additional independent variables.

\bar{R}^2 = 1 - \frac{SSR/(n-k-1)}{SST/(n-1)}
Outliers
If the model has outliers, this may result in none of the explanatory variables becoming statistically significant. One solution is to drop outliers, e.g. by dropping the top 5 % and bottom 5 % of the distribution (problem: ad hoc). We can also winsorize the variable.

Specification issues
When there is a correlation between u and xj we say that xj is endogenous. If this is not the case, xj is exogenous.

This is a violation of the zero conditional mean assumption, E(u | x) = 0: we have a specification problem when this assumption doesn't hold. Issues that can cause such endogeneity/specification problems are:
 Omitted variables
 Functional form misspecification
 Measurement errors
 Simultaneity
 Reverse causality

Functional form misspecification:


May be caused if the correct x variables are included in the model, but in the wrong functional
form. This can be tested by testing one model against another with a different functional form
of one (or more) variables in initial model. H0 then is that the initial model is correct, which
also satisfies the zero conditional mean assumption.

One way to do this is to use nested and non-nested models:


Initial model: y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u (1)
Possible alternative: y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_2^2 + u (2)
We then say that model 1 is nested in model 2. An additional alternative could be:
y = \beta_0 + \beta_1 \ln(x_1) + \beta_2 x_2 + u (3). This model is not nested in (1) or (2), since it isn't possible to remove something from (1) or (2) to get to (3).

Nested models
To test between nested models, we can do a RESET test. H0 is that (1), the initial model, is the "correct" model that satisfies the zero conditional mean assumption. H1 is that a model which includes additional non-linearities of the x variables is better. There are many potential nonlinearities that are possible. The implication of H0 is that no nonlinear combination of the x variables should matter. The RESET test is based on adding polynomials of the fitted values to the original regression. The square of the fitted values would pick up several forms of nonlinearity between the explanatory variables. Based on this, when doing RESET we estimate:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \delta_1 \hat{y}^2 + \delta_2 \hat{y}^3 + u

The reformulated null hypothesis is then H0: \delta_1 = \delta_2 = 0. Then we conduct the F-test for testing 2 exclusion restrictions.

Rejecting one model with RESET does not give a clear indication as to what should be the
correct model. Sometimes when RESET is used to discriminate between 2 different functional
forms, the test may fail to reject both of them. The conclusion might change with different
numbers of non-linear terms. We have no clear indication of what is best to do if we conclude
that both models we have tested are misspecified.
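
A minimal Stata sketch of RESET, both built-in and "by hand", assuming the auto data (note that estat ovtest uses powers of the fitted values and may include more powers than the two used below):

sysuse auto, clear
quietly regress price weight mpg
estat ovtest                         // built-in RESET test
predict yhat, xb
gen yhat2 = yhat^2
gen yhat3 = yhat^3
regress price weight mpg yhat2 yhat3
test yhat2 yhat3                     // F-test of the 2 exclusion restrictions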

Non-nested models
To choose between non-nested models, we can use the Davidson-MacKinnon test. This can be
used to test 2 non-nested models with the same dependent variables against each other.

One alternative is testing:


y=β 0 + β 1 x 1 + β 2 x 2 +u
against
y=β 0 + β 1 log ⁡(x 1)+ β 2 log ⁡( x 2 )+u .
If model 1 is correct, then the fitted value \hat{y} from model 2 shouldn't be significant when added to model 1. Thus, a 2-sided t-test on the coefficient in front of \hat{y} should provide an answer. H0: Model 1 is "correct" and model 2 is not (H0: \theta_1 = 0). Thus, we prefer model 1 over model 2 if we cannot reject H0, i.e. \theta_1 is not significantly different from zero.
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \theta_1 \hat{y} + u
Another alternative is that if model 2 is correct, then the fitted value \tilde{y} from model 1 should not be significant when added to model 2. H0 is that model 2 is "correct" and model 1 is not (H0: \theta_2 = 0). Thus, we prefer model 2 over model 1 if we cannot reject H0 (\theta_2 is not significantly different from zero).
y = \beta_0 + \beta_1 \log(x_1) + \beta_2 \log(x_2) + \theta_2 \tilde{y} + error

A drawback with the Davidson-MacKinnon test is that it is possible that both models are rejected. Then we have no clear indication of what to do. It is also possible that none of the models is rejected, and then we need to think about other criteria such as economic intuition, R2, etc. If one of the models is rejected, this does not mean that the other model is correct. Rejection could be due to a variety of functional form problems. The test offers insight into a specific setting, testing between two models, and says nothing about all possible other models. It is always important to think about model specification as well and not rely solely on misspecification tests to guide model choice.

Panel data
Panel data is when we measure the same units in at least two periods, so we get two
dimensions. We get one cross sectional dimension N and a time series dimension T. Panel
data is also called longitudinal data. The advantages of panel data are as follows:
 Increase the sample size
 Able to reduce multicollinearity problems because of
o Variation between cross-sections
o Variation over time
 Able to control for unobserved effects better than in cross-sections or time-series.
 Able to build dynamic models

We will in general consider that N is large and that Ti is relatively small. This is typically the case in panels of individuals, households or firms. Panel data is more than a sequence of cross-sections over time: in a sequence of cross-sections we don't necessarily observe the same individuals over time, while in panel data we follow individuals for several periods of time. A balanced panel is when Ti = T for every i. An unbalanced panel is when Ti can be different for each i. Unbalanced panels can be such that individuals have different starting periods.

The unobserved, specific effect: ai


With panel data we can control for unobservables, ai, which can be correlated with the
regressors and are individual-specific. The term ai picks up all unobserved individual specific
effects that are constant over time.

An example is illustrated below:


y_{it} = \beta_0 + \beta_1 l_{it} + \beta_2 k_{it} + v_{it}, with v_{it} = (a_i + u_{it})
a_i represents here the amount of fixed input that is unobservable. This fixed input is a complement to the firm's labor and capital, l and k. With fixed input we mean that it is constant over the sample period. a_i could be thought of as a proxy for differences in productivity, i.e. how much more output a firm produces if it is given input k*.

Another example:
RD_{it} = \beta_0 + \beta_1 RD_{it-1} + v_{it}, with v_{it} = (a_i + u_{it})
The term \beta_1 RD_{it-1} captures the fact that previous investment in R&D reduces the cost (marginal or fixed) of investment today. Thus, we have constructed a dynamic model where the dependent variable and some of the explanatory variables are from different periods. If Cov(RD_{it-1}, a_i) > 0, OLS will provide an upward biased estimate of \beta_1. We can control for unobservables which are correlated with the regressors and are constant over time for each individual.

uit = idiosyncratic error

ai, the individual unobserved specific effect may mean different things.
If subscript i denotes an individual:
 Ability
 Motivation
 Family background
If subscript i denotes a firm:
 Efficiency
If subscript i denotes a country
 political system
 citizens’ attitude
If subscript i denotes a market
 local demand
 willingness to pay

If Corr(x_{it}, a_i) ≠ 0, then OLS is biased and inconsistent: the distribution of the estimated coefficient does not become more and more tight around the true parameter as the sample size grows. The correlation between x_{it} and a_i violates the zero conditional mean assumption, resulting in a biased estimate. Because of this we need to get rid of a_i, or at least remove it from the error term. If we ignore the existence of a_i when estimating the model, we could think of this as an omitted variable bias problem.

!Omitted variable bias problem


Assume that the true model is:
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u
where x_2 is the omitted variable. But instead, we use the following model:
y = \beta_0 + \tilde{\beta}_1 x_1 + v, where v = u + \beta_2 x_2
This may result in a wrong estimate \tilde{\beta}_1 compared to the true population parameter when we ignore the unobserved, individual-specific effect a_i. \tilde{\beta}_1 is picking up the effect of both x_1 and a_i. In a production function we may get a too large \tilde{\beta}_1, since this parameter is picking up the effect of both capital input and efficiency, which both increase output.

               Corr(x1, x2) > 0    Corr(x1, x2) < 0
\beta_2 > 0    Positive bias       Negative bias
\beta_2 < 0    Negative bias       Positive bias

Positive bias means that the estimated coefficient is too large. Negative bias means that the estimated coefficient is too small.
Estimation methods for removing ai
The occurrence of a_i in the error term v_{it} causes the bias when estimating the panel data model with OLS. There are several ways to remove a_i and still find an estimate of the original slope coefficient \beta_1.

Fixed effect estimators


The main idea with the fixed effect estimators is that it allows for ai to correlate with x.
First differencing (FD)
t = 2:
y_{i2} = \beta_0 + \beta_1 x_{i2} + \delta_2 D^{t=2} + (a_i + u_{i2})
y_{i1} = \beta_0 + \beta_1 x_{i1} + \delta_2 \cdot 0 + (a_i + u_{i1})
y_{i2} - y_{i1} = (\beta_0 - \beta_0) + \beta_1 (x_{i2} - x_{i1}) + \delta_2 (1 - 0) + (a_i - a_i) + (u_{i2} - u_{i1})
\Delta y_i = \beta_1 \Delta x_i + \delta_2 \cdot 1 + \Delta u_i
\Delta u_i is uncorrelated with \Delta x_i if u_{i1} is uncorrelated with x_{i2} and x_{i1}, and u_{i2} is uncorrelated with x_{i2} and x_{i1}. This is also called strict exogeneity. Since a_i has disappeared from the first-difference equation, it cannot create problems. Thus, we can estimate FD equations by OLS. There is no constant term, so we force the OLS regression to go through the origin.

If we only have dummies as variables:


This estimation allows us to see if there is a bigger change in y between time t and t-1 when the dummy = 1, compared to the change over the same time periods when the dummy = 0. In other words, this approach studies whether each i has relatively more y between t-1 and t when the dummy equals 1 in t, compared to the same difference between t-1 and t when the dummy equals 0.

A more general expression (t = 2, ..., T):

\Delta y_{it} = \beta_1 \Delta x_{it1} + ... + \beta_K \Delta x_{itK} + \delta_2 \Delta D^{t=2} + ... + \delta_T \Delta D^{t=T} + \Delta u_{it}
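
A minimal Stata sketch of FD estimation using time-series operators, assuming the built-in Grunfeld panel:

webuse grunfeld, clear
xtset company year
regress D.invest D.mvalue D.kstock       // first-difference estimator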

Least square dummy variable approach (LSDV)


It is possible to remove the individual specific effects by including (N-1) individual dummies:

y_{it} = \beta_0 + \beta_1 x_{it1} + ... + \beta_K x_{itK} + a_2 D^{i=2} + ... + a_N D^{i=N} + \delta_2 D^{t=2} + ... + \delta_T D^{t=T} + u_{it}

If we were to include N dummies, then the intercept would give perfect multicollinearity problems (we see that D^{i=1} is not included). Now the a_i are identified and estimated. Thus, we have taken a_i out of the error term, and we can estimate the model with OLS without violating the zero conditional mean assumption.
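
A minimal Stata sketch showing that LSDV and the within (FE) estimator give the same slope estimates, assuming the built-in Grunfeld panel:

webuse grunfeld, clear
xtset company year
regress invest mvalue kstock i.company i.year    // LSDV: dummies for firms and years
xtreg invest mvalue kstock i.year, fe            // within/FE estimator, same slopes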

The within group estimator (WG)/Fixed effect estimator (FE)


FE can be understood as a stepwise OLS.
The fixed effects transformation:

y^*_{it} = y_{it} - \bar{y}_i
x^*_{it} = x_{it} - \bar{x}_i

Then we can run the following regression:

y^*_{it} = \beta_1 x^*_{it} + u^*_{it}

Below this is illustrated with equations for one observational unit i (T = 2).

y_{i2} = \beta_0 + \beta_1 x_{i2} + a_i + u_{i2}
y_{i1} = \beta_0 + \beta_1 x_{i1} + a_i + u_{i1}

Averaging over the two periods:

\frac{1}{T}(y_{i1} + y_{i2}) = \frac{1}{T}\left( (\beta_0 + \beta_1 x_{i1} + a_i + u_{i1}) + (\beta_0 + \beta_1 x_{i2} + a_i + u_{i2}) \right)

\bar{y}_i = \frac{1}{T}(\beta_0 + \beta_0) + \frac{\beta_1}{T}(x_{i1} + x_{i2}) + \frac{1}{T}(a_i + a_i) + \frac{1}{T}(u_{i1} + u_{i2})

\bar{y}_i = \beta_0 + \beta_1 \bar{x}_i + a_i + \bar{u}_i

Then we do the FE transformation:

y^*_{it} = y_{it} - \bar{y}_i
        = (\beta_0 + \beta_1 x_{it} + a_i + u_{it}) - (\beta_0 + \beta_1 \bar{x}_i + a_i + \bar{u}_i)
        = (\beta_0 - \beta_0) + \beta_1 (x_{it} - \bar{x}_i) + (a_i - a_i) + (u_{it} - \bar{u}_i)
        = \beta_1 x^*_{it} + u^*_{it}

Again, we see that we have removed a_i, so it cannot create any problems. Thus, we can estimate the within-group transformation by OLS. Also here there is no constant term, so we have to estimate it with OLS where we force the regression line to go through the origin.

We are only using the variation within each group, not between groups → within group estimator.

To show that this is equal to the FD estimator when T = 2:

y^*_{it} = y_{it} - \bar{y}_i
y_{i2} - \bar{y}_i = (\beta_0 + \beta_1 x_{i2} + a_i + u_{i2}) - (\beta_0 + \beta_1 \bar{x}_i + a_i + \bar{u}_i)
y_{i2} - \bar{y}_i = \beta_1 (x_{i2} - \bar{x}_i) + (u_{i2} - \bar{u}_i)

With T = 2, \bar{y}_i = \frac{y_{i1} + y_{i2}}{2}, and similarly for \bar{x}_i and \bar{u}_i:

y_{i2} - \frac{y_{i1} + y_{i2}}{2} = \beta_1 \left( x_{i2} - \frac{x_{i1} + x_{i2}}{2} \right) + \left( u_{i2} - \frac{u_{i1} + u_{i2}}{2} \right)

\frac{2 y_{i2} - y_{i1} - y_{i2}}{2} = \beta_1 \left( \frac{2 x_{i2} - x_{i1} - x_{i2}}{2} \right) + \left( \frac{2 u_{i2} - u_{i1} - u_{i2}}{2} \right)

\frac{y_{i2} - y_{i1}}{2} = \beta_1 \left( \frac{x_{i2} - x_{i1}}{2} \right) + \left( \frac{u_{i2} - u_{i1}}{2} \right)

y_{i2} - y_{i1} = \beta_1 (x_{i2} - x_{i1}) + (u_{i2} - u_{i1})

which is the first-difference equation.

It works perfectly fine if you include a dummy as well. Then:

y_{i2} - \bar{y}_i = (\beta_0 + \beta_1 x_{i2} + \delta \cdot 1 + a_i + u_{i2}) - \left( \beta_0 + \beta_1 \bar{x}_i + \frac{\delta}{2} + a_i + \bar{u}_i \right)

\frac{y_{i2} - y_{i1}}{2} = \beta_1 \left( \frac{x_{i2} - x_{i1}}{2} \right) + \frac{\delta}{2} + \left( \frac{u_{i2} - u_{i1}}{2} \right)

y_{i2} - y_{i1} = \beta_1 (x_{i2} - x_{i1}) + \delta + (u_{i2} - u_{i1})
STATA interpretation: Pooled OLS with cluster-robust standard errors can be used because the clustering recognizes that even though the individuals are independently drawn, there might be correlation over time within a given individual.

FE or FD?
LSDV approach and within group estimators (FE) are always identical.

FD and LSDV, as well as FE, give the same estimates when T = 2. To achieve this, the FE estimation must include a dummy variable for the second time period to be identical to the FD estimates that include an intercept. With T = 2, FD has the advantage of being straightforward to implement, and it is easy to compute heteroskedasticity-robust statistics after FD estimation. The error variance estimates are also numerically identical.

When T≥ 3, FE and FD are not the same. Both are unbiased under their assumptions and
consistent. For large N and small T, the choice between FE and FD hinges on the relative
efficiency of the estimators, and this is determined by the serial correlation in the
idiosyncratic errors u_{it}. When the u_{it} are serially uncorrelated, FE is more efficient than FD. So whether to use FD or FE depends on the potential serial correlation in the error term, and on measurement errors. The best thing may be to use both estimation approaches and compare the results.

The LSDV approach and the within group estimator (FE) are always identical. In LSDV there are a lot of coefficients to be estimated, consuming degrees of freedom. When using WG/FE we are only using the within-group variation in the data, while it sometimes would be nice to take advantage of some of the between-group variation. The FE estimator is also not able to give coefficient estimates for variables that do not vary over time. It might therefore be useful to use another estimator that allows for this and does not consume as many degrees of freedom.

Fixed effects vs. random effects


Both fixed effect models and random effect models allow for an unobserved individual
specific effect.

Fixed effect estimators: (WG, FD, LDSV) allow for correlation between the explanatory
variables x and ai. But if there is no variation over time in one of the explanatory variables the
effect of this non-time varying explanatory variable cannot be identified. The fixed effect
estimators are also “consuming” degrees of freedom, having a lot of parameters that need to
be estimated.

Random effect estimator: (RE) gives the possibility to estimate the effect of non-time-
varying explanatory variables and still take account of ai. Neither does it “consume”
parameters as the fixed effect estimators do (RE is more efficient than the FE-estimator). But
the RE model cannot be used if there is correlation between x and the ai.

The choice: To choose between fixed effect estimators and random effect estimators, we can
base this decision on a) theory or b) testing (Hausman test).
The Hausman test is as follows:

m = \hat{q}_1' \left[ VC(\hat{q}_1) \right]^{-1} \hat{q}_1 \sim \chi^2_{df=k}

\hat{q}_1 = \hat{\beta}_{FE} - \hat{\beta}_{RE}

VC(\hat{q}_1) = VC(\hat{\beta}_{FE}) - VC(\hat{\beta}_{RE})

VC(.) = variance-covariance matrix

H0: \hat{\beta}_{FE} \cong \hat{\beta}_{RE}. RE is unbiased only if x_{it} is uncorrelated with a_i, while FE is unbiased even if x_{it} is correlated with a_i. H0 holds when the two coefficient vectors \hat{\beta}_{FE} and \hat{\beta}_{RE} are similar. If x_{it} is uncorrelated with a_i, we prefer to use RE, since RE is more efficient than FE.

If we reject H0 we use FE (because \hat{\beta}_{FE} \neq \hat{\beta}_{RE}, and thus x_{it} is correlated with a_i).

If we only have one slope coefficient, \beta, that we want to compare between RE and FE, we get the following setup:

m = (\hat{\beta}_{FE} - \hat{\beta}_{RE})' \left[ VC(\hat{q}_1) \right]^{-1} (\hat{\beta}_{FE} - \hat{\beta}_{RE}) = \frac{(\hat{\beta}_{FE} - \hat{\beta}_{RE})^2}{VC(\hat{q}_1)}

Compare this with the squared t-statistic:

t^2 = \frac{(\hat{\beta} - \beta_{H_0})^2}{[st.dev(\hat{\beta})]^2} = \frac{(\hat{\beta} - \beta_{H_0})^2}{var(\hat{\beta})}

t = \frac{\hat{\beta} - \beta_{H_0}}{st.error(\hat{\beta})}
The numerator (top) of m is what we “want to test” and the denominator (bottom) is a kind of
weighting.
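
A minimal Stata sketch of the FE/RE comparison and the Hausman test, assuming the built-in Grunfeld panel:

webuse grunfeld, clear
xtset company year
quietly xtreg invest mvalue kstock, fe
estimates store fe
quietly xtreg invest mvalue kstock, re
estimates store re
hausman fe re          // rejecting H0 suggests x is correlated with a_i, so use FE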

A comment about FE vs. RE: When ai affect the outcome yit linearly, we have seen that we
can eliminate ai from the specification through some linear transformation (WG/FD/LSDV).
However, if ai affects the outcome yit nonlinearly, it isn’t easy to find a transformation to
eliminate ai. Such a nonlinear model is for instance the binary choice model where the
observed outcome yit takes the value of either 1 or 0. Then the only alternative might be a RE-
specification (it doesn’t help to include one dummy for each individual as we did for LSDV).

Random effect vs. pooled OLS


If FE and RE are the same, we have concluded that we rather prefer RE instead of FE because
it is more efficient. But why choose RE and not OLS?

y_{it} = x_{it} \beta_1 + \beta_0 + (a_i + u_{it})
y_{it-1} = x_{it-1} \beta_1 + \beta_0 + (a_i + u_{it-1})
v_{it} = \rho v_{it-1} + \varepsilon_{it}
(a_i + u_{it}) = \rho (a_i + u_{it-1}) + \varepsilon_{it}

We see that the existence of ai, both on the LHS and RHS of the last equation, induces
correlation between the residuals in period t and t-1 for individual i, i.e. between vit and vit-1.
We know that autocorrelation, correlation over time in the residuals, will affect the efficiency
(the st.errors of our coefficient estimates). Thus, we can no longer be sure about the
correctness of the st.errors. RE is taking into account this dependency between the residuals
for individual i, and thus we prefer RE over pooled OLS for panel data sets.
The difference-in-difference estimator (DiD)
To be used when we have cross-sections that are sampled before and after a treatment.

This is one of the most used models to analyze the effect of policy changes when one has a group that is potentially affected by a treatment and a control group that is not affected by the policy change. The treatment group experiences an exogenous change (treatment), while the control group doesn't. At the same time, we control for an underlying trend not caused by the policy change.

It is important to be sure that the effect of the policy intervention is really caused by the
intervention itself, and not by something else. Whether individuals are allocated into
treatment or control group should be random (random assignment). The conditional
independence assumption, or unconfoundedness, says that conditional upon the covariates
xi, selection into treatment is unrelated to the potential gain from the treatment.

Using DiD we rely on the common trend assumption: that the lines for the treatment and control group are parallel before the policy change/treatment. But there might be a difference in the trends already before the policy change, and in that case we cannot assume that differences in y_{it} are caused by the treatment. [This assumption is therefore often violated. It can be violated if people have the possibility to change from the control to the treatment group (they want the treatment), or if the groups have different long-run trends. One way to support this assumption is to provide trends for the different control/treatment groups before the treatment takes place, to show that they have a common trend, to conduct placebo tests, etc. → just come up with something fitting if you are asked to]

Calculating DiD looking at mean values:


We can compute the DiD estimator (δ^ 1) in two ways:
 Computing the differences in averages between the treatment and control group in
each period, and then difference the results over time
δ^ 1=( y 2 ,T − y 2 , C )−( y 1 , T − y1 , C )
Where T = treatment group and C = control group, and the numbers (1,2) indicate
before/after treatment, e.g. the time periods.
 Compute the changes in averages over time for each of the treatment and control
groups, and then difference these changes.
δ^ 1=( y 2 ,T − y 1 , T )− ( y 2 ,C − y1 , C )
The DiD estimator can be given the interpretation as an average treatment effect.

Calculating DiD through a linear regression model:

Letting dT be a dummy taking the value 1 for those in treatment group T, and d2 being a dummy variable for the second (post-policy-change) time period, the equation we estimate is as follows:

y = \beta_0 + \delta_0 d2 + \beta_1 dT + \delta_1 (d2 \cdot dT) + u_{ijt}

\delta_1 measures the effect of the policy and is the same as the DiD estimator.
u_{ijt} → i = individuals, j = different types of control groups, e.g. zones, schools, etc., t = time
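
A minimal Stata sketch of this regression, with hypothetical variables y (outcome), d2 (post-period dummy), dT (treatment-group dummy) and id (cluster identifier):

* the coefficient on the interaction term is the DiD estimate of the treatment effect
regress y i.d2##i.dT, vce(cluster id)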

Which method to use:


The DiD may be calculated when looking at mean values, or in a linear regression model
setting. The latter approach might be better because:
 Gives us statistical significance
 We may also control for that the characteristics of the individuals/observations (also
called confounding factors in the treatment literature) in the treatment group and in
the control group might be different.

Time dummies
We can construct time dummies that take the value 1 if the observations are from a certain
time period, zero otherwise.

The endogeneity problem in short


When the zero conditional mean assumption doesn't hold (Cov(x, u) ≠ 0), we can get an endogeneity problem. This means that x and the error term u are related, and that causal interpretation of OLS is problematic.

There are several ways to solve this problem, but one way is to use an instrumental variable (IV) estimation instead of the endogenous variable. Using IV estimation means finding an instrument z that has a direct effect on x, but no direct effect on y, and is uncorrelated with u. There are two conditions that z has to satisfy: it has to be exogenous (cannot be tested) and relevant (can be tested with an F-test). If the two conditions are met, we can obtain the IV estimate with the two-stage least squares (2SLS) method.

Endogeneity problem
When the zero conditional mean assumption holds, the OLS results can be interpreted in a
causal way. If we on the other hand think that x is correlated with the error term u, the
variable is endogenous, and we have an endogeneity problem.

The endogeneity problem may occur in the following situations:


 Measurement error
 Simultaneity
 Omitted relevant factors
 Reverse causality

Definitions:
Weak instrument: When the correlation between x and z is small

Solutions to the endogeneity problem


 Include additional control variables (reduce omitted variable bias)
o If there are omitted factors which are correlated with x and also impact y,  is
biased and we cannot interpret our results causally.
o In practice it is often difficult to believe that we have included all such factors,
making causal interpretation of OLS estimations challenging.
 Use panel data if possible and hope that the removal of the unobserved fixed effect by
FE or FD estimation solves the problem.
o Limited availability of panel data, time-invariant x variables, and simultaneity
issues may complicate this.
 Use an instrumental variable instead of suspected endogenous variable
o  This is what we will look closer at
o IV-estimation can overcome potential endogeneity problems
o IV estimates can be interpreted in a causal way, while OLS cannot in the case
of endogenous explanatory variables.
o The instrumental variable estimator gives the causal effect if we have at least
one valid instrumental variable.

Instrumental variables (IV) estimation


The endogeneity problem means that we cannot separate the effect on y of changes in x from
the impact of changes in u – it is difficult to say whether y changes because u changes, or
solely because x changes. By identifying a variable z that affects x, but not directly y, we can
overcome this problem. Estimating the relationship between z and x helps us overcome the
endogeneity problem and get the causal effect of x on y. To do this we use instrumental
variables (IV) estimation.

The data on z has to satisfy two conditions:


1. Exogeneity
a. z is exogenous, Cov(z,u) = 0, i.e. it is not correlated with the error term.
b. This cannot be tested, so can only be discussed
c. That z is exogenous also means that z has no direct effect on y, only an indirect
effect via x. If z directly affects y, it should appear in the population model,
and will then not serve as an instrument.
2. Relevance
a. Z has to have a direct effect on x.
b. This can be tested since both x and z are observable.
i. This condition is satisfied when the F-statistic in the first-stage
regression is larger than 10. If F < 10 the instrument is weak and we can
get imprecise estimates, or biased ones when the sample size is also
small.
The IV estimator will be consistent for β_1, but biased in finite samples. If an estimator β̂_1 is
consistent, the distribution of β̂_1 converges to β_1 as N goes to infinity. The finite-sample bias
of the IV estimator can be large, especially if the correlation between z and x is small or the
sample size is small.

IV-estimation: 2SLS
If the two conditions for z is satisfied (exogeneity and relevance), we can estimate the IV
estimator by conducting the following steps:
1. Regress x on z. This is called the first stage:
x = π_0 + π_1 z + v
The predicted x̂ is the part of x that can be explained by z. Because z is uncorrelated
with the error term u in the structural model, so is x̂.
2. The second stage is to replace x with x̂ in the original equation, and estimate this by
OLS:
y = β_0 + β_1 x̂ + u
The coefficient in front of x̂ is the IV estimate of β_1.
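A minimal Stata sketch of 2SLS (hypothetical variable names: y, x for the suspected endogenous regressor, z for the instrument); the two manual stages are shown for intuition, while ivregress does both steps in one command and reports correct standard errors:
* first stage: regress x on z and save the fitted values
reg x z
predict xhat, xb
* second stage: regress y on the fitted values (note: these standard errors are not correct)
reg y xhat
* in practice, estimate both stages in one command
ivregress 2sls y (x = z)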

OLS and IV estimates can give very different results. One possible explanation is that either
the OLS or the IV estimates are biased. IV estimation may for example have problems with
small samples or weak instruments, making the estimate consistent but not unbiased. If z is in
fact not exogenous, the IV estimator is not consistent, and we are not solving the endogeneity
problem – we may even make it worse.

Variance of the IV estimator:


The variance of the IV estimator is computed based on assumption of homoskedasticity.
Given homoskedasticity and the exogeneity and relevance conditions, we can construct an F-
test similarly to OLS.

F-tests are used when we compare the variances or proportions of two populations.
The F-test for equality of variances is as follows, where s_1^2 and s_2^2 are the sample variances:
F = s_1^2 / s_2^2 ~ F(n_1 − 1, n_2 − 1)

Asymptotic variance of the OLS estimator:
Var(β̂_{1,OLS}) ≈ σ_u^2 / (n σ_x^2)
And the asymptotic variance of the IV estimator:
Var(β̂_{1,IV}) ≈ σ_u^2 / (n σ_x^2 ρ_{x,z}^2)
with
σ_u^2 = Var(u)
σ_x^2 = Var(x)
ρ_{x,z} = Corr(x, z)

If Var(β̂_{1,OLS}) < Var(β̂_{1,IV}), then OLS is more efficient than the IV estimation; from the
formulas above this is always the case, since ρ_{x,z}^2 ≤ 1.
As a rule of thumb, the standard error of the IV estimator is about 1/|ρ_{x,z}| times as large as the
standard error for OLS, where ρ_{x,z} is the sample correlation between x_i and z_i. This we can
think of as the cost of doing IV when we could be doing OLS. When x is exogenous, we
therefore prefer OLS to IV since it gives a more efficient estimate. If ρ_{x,z} is small, the IV
standard error becomes large, and a small ρ_{x,z} also means that z is a weak instrument. If
z = x, then ρ_{x,z}^2 = 1, and we get the OLS variance.

That the variance is asymptotic means that the formula is an approximation that becomes
accurate as N goes to infinity; as N grows, the variance shrinks and the estimator moves
towards its true value.

Using additional instruments:


When we have more than one instrument for one endogenous variable x, the standard error for
the IV estimate gets smaller. It is always useful to use more instruments if we have them. But
if some of the extra instruments we use to increase efficiency are in fact not exogenous, the
IV estimator is not consistent.

Testing the relevance condition


The relevance condition says that z has to have a direct effect on x. A small correlation
between the instrument z and the explanatory variable can lead to a large bias in the
estimates. Even with large sample sizes the 2SLS estimator can be biased and have a
distribution that is very different from the standard normal if there is a weak correlation
between z and x.

To test whether the relevance condition holds, we can look at the first-stage F statistic for
exclusion of the instrumental variables. If F > 10, then the instrument is relevant in explaining
x. If F < 10, the relationship between z and x is considered weak, and we have a weak
instrument. This implies large standard errors, and the IV estimator can be biased, especially
if the sample size is very small.
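A sketch of checking this in Stata after a 2SLS estimation (same hypothetical y, x, z as above); estat firststage reports the first-stage regression and its F statistic, which can be compared with the rule of thumb F > 10:
ivregress 2sls y (x = z)
estat firststage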

When the conditions for a valid instrument are not met


1. Exogeneity condition
a. When this condition is met, Cov(z,u) = 0
b. If this condition is not met, it means that we are not solving the endogeneity
problem. The bias in the IV estimators can become large, even larger than with
the OLS estimation. The endogeneity problem may get worse using IV than
OLS.
c. In general, if an instrument doesn’t satisfy the exogeneity condition, we should
never use IV estimation. Therefore, it is important to discuss this condition, even
though we fundamentally never know if it holds.
2. Relevance condition
a. If cov(z,x) = 0 we cannot conduct an IV estimation.
b. When z has a small impact on endogenous x, this is also a problem (weak
instrument). The consequences are large standard errors, and IV estimates can
be biased (particularly if the sample size is small). If we have a slightly weak
instrument, we can always conduct an IV estimation and discuss the
consequences of this. With a very weak instrument, the instrument is often
useless.
!!Testing for endogenous variable
We cannot test if z is exogenous, but we can test for whether the suspected endogenous
variable x in fact is endogenous. If we find that it is endogenous, we should use IV estimation.

H0: x is exogenous
 Regress x on z (OLS), i.e. the first stage.
 Save the residuals from the first-stage regression, call them v̂.
 Estimate y = β_0 + β_1 x + δ v̂ + u
 The H0 that x is exogenous is equivalent to H0: δ = 0.
 If we reject H0, there is an endogeneity problem, and we use IV.
 The test requires a valid instrument. You should not rely solely on the result of this
endogeneity test in determining if we need to use IV estimation or not.
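A sketch of this test in Stata, using the hypothetical names y, x, z; the manual version follows the steps above, and estat endogenous gives the built-in Durbin/Wu–Hausman version after 2SLS:
* by hand: save the first-stage residuals and add them to the structural equation
reg x z
predict vhat, residuals
reg y x vhat
* a significant coefficient on vhat (reject H0: delta = 0) indicates that x is endogenous
* built-in alternative:
ivregress 2sls y (x = z)
estat endogenous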

Including additional exogenous variables


Adding additional exogenous x’s can be thought of as adding “instruments for themselves”.
When they are omitted, they end up in the error term u. If we omit relevant exogenous
variables, the exogeneity condition is less likely to hold. There is a similar tradeoff as in
multiple OLS estimation: bias vs. precision. We should not include additional x’s which are
endogenous or suspected to be.

!!Overidentification
When we have more instruments than endogenous variables, we have overidentification.

If we have 2 instruments for 1 endogenous variable, we have 1 overidentifying restriction.


If we have 4 instruments for 2 endogenous variables, we have 2 overidentifying restrictions.

Number of instruments − number of endogenous explanatory variables = number of overidentifying restrictions

When we have more instruments than endogenous variables, we can test whether some of the
instruments are uncorrelated with the error term u in the population. Assume we have two
instruments, z1 and z2 for one endogenous variable x:
 We could get two different IV estimates, one using z1 as an instrument for x, and one
using z2 as an instrument for x.
 If both instruments are exogenous, they should give similar IV estimates. If the two
IV estimates are very different, one or both of the instruments are endogenous and
should not be used as instruments.
Testing whether overidentifying restrictions are exogenous, means comparing different IV
estimates based on using different instruments. With more than two instruments, comparing
many different IV estimates is cumbersome. The intuition for the test is that if all instruments
are exogenous, the residuals from the 2SLS estimation should be uncorrelated with q linear
functions of the instruments, where q is the number of overidentifying restrictions.

How to test for overidentification:


 Save û from the second-stage estimation
 Regress û on all instruments and obtain R2.
 Construct the test statistic nR^2, where nR^2 ≈ χ^2_q (chi-square distributed with q
degrees of freedom). This is the Sargan test.
 H0: all instruments are exogenous
 H1: at least one of the instruments is not exogenous
 Reject H0 if nR2 is larger than the chosen critical value
 Remember that we never “accept” the null hypothesis, we simply fail to reject it. We
can never conclude that these instruments satisfy the exogeneity condition.
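A sketch of the overidentification test in Stata, assuming two hypothetical instruments z1 and z2 for one endogenous x; estat overid reports the Sargan (and score) statistics after 2SLS:
ivregress 2sls y (x = z1 z2)
* H0: all instruments are exogenous
estat overid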

Time series
An advantage of time-series data is that we can build dynamic models that rely on observations
from previous periods. It also enables us to distinguish between short-term and long-term effects.

Dynamic characteristics in time series may be captured in several different ways:


 Contemporaneous and lagged explanatory variables
 Lagged dependent variable
 Lags in the error term

Static model
y_t = α + δ_0 z_t + u_t
A model is static if the change in the explanatory variable gives an immediate change in the
dependent variable.

Distributed lag models (DL models)

Finite distributed lag model:


y_t = α + δ_0 z_t + δ_1 z_{t−1} + δ_2 z_{t−2} + u_t
A finite number of lags of the explanatory variables.

Infinite distributed lag model:


y_t = α + δ_0 z_t + δ_1 z_{t−1} + δ_2 z_{t−2} + … + u_t

y_t = α + Σ_{τ=0}^{∞} δ_τ z_{t−τ} + u_t
y_t depends on the indefinite past.

Autoregressive distributed lag model (ARDL)


y_t = α + δ_0 z_t + γ y_{t−1} + u_t
Where y_{t−1} is called the lagged dependent variable. This model portrays the time path of the
dependent variable in relation to its past value(s). This way of modelling is useful if we think
there are habits or persistence in preferences.

y_t = α + δ_0 z_t + δ_1 z_{t−1} + … + δ_q z_{t−q} + γ_1 y_{t−1} + … + γ_p y_{t−p} + u_t


This is called an ARDL(p,q) model, where p is the number of lags of the dependent variable,
and q is the number of lags of the additional explanatory variable, z.

Multipliers
When the explained variable in a model is dependent on lagged variables x, a permanent
increase in x will lead to the increase being spread over several periods. The long-run (LR)
multiplier shows the aggregate effect of an increase.
the long run multiplier=∑ δ t
The short-run multiplier is the coefficient for the specific time-lag, e.g. δ 1, since this gives the
effect that you see immediately.

The LR multiplier in a finite distributed lag model with a lagged dependent variable:
y_t = α + δ_0 z_t + δ_1 z_{t−1} + δ_2 z_{t−2} + γ y_{t−1} + u_t
Assume we have reached a steady state, where y_t = y_{t−1} = y and z_t = z_{t−1} = z_{t−2} = z.
Then
y(1 − γ) = α + (δ_0 + δ_1 + δ_2) z + u_t
y = α/(1 − γ) + [(δ_0 + δ_1 + δ_2)/(1 − γ)] z + u_t/(1 − γ)
Where (δ_0 + δ_1 + δ_2)/(1 − γ) is the long-run multiplier.
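As a worked numerical example (with made-up coefficients): if δ_0 = 0.4, δ_1 = 0.2, δ_2 = 0.1 and γ = 0.5, the impact (short-run) multiplier is 0.4, while the long-run multiplier is (0.4 + 0.2 + 0.1)/(1 − 0.5) = 0.7/0.5 = 1.4, so a permanent one-unit increase in z eventually raises y by 1.4 units.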

OLS on time-series data


Given that the Gauss-Markov assumptions hold, the OLS estimators will be the best linear
unbiased estimators (BLUE).
 Linear in parameters
 Zero conditional mean
o For each t, the expected value of the error ut, given the explanatory variable for
all time periods, is zero.
o This condition may be violated if we have omitted variables or measurement
errors in some of the regressors.
 No perfect collinearity
o No independent variable is constant, nor a perfect combination of the others
 Homoskedasticity
o The variance of u_t is the same for all t: x and u_t are independent, and Var(u_t)
is constant over time.
 No serial/autocorrelation
o Corr(ut,us)=0 for all t ≠s

Adding the normality assumption as well, we get the classical linear model assumptions:
u_t independent of X, and u_t ~ N(0, σ^2)

(The errors u_t are independent of X and are independently and identically distributed as
Normal(0, σ^2).)

Under the 6 CLM Assumptions, the OLS estimators are normally distributed, conditional on
X. Further, under the null hypothesis, each t statistic has a t distribution, and each F statistic
has an F distribution. The construction of confidence intervals is also valid. Then everything
we have learned about estimation and inference for cross-sectional regressions will apply
directly to time-series regressions.

Functional forms in time-series

Dummy variables:
C_t = α_0 + δ_0 Y_t + δ_1 Y_{t−1} + η D_{t≥2015} + u_t
D_{t≥2015} = 1 if t ≥ 2015, 0 otherwise

This can for instance be used to check whether consumption is higher (or lower) in the years
from 2015 onwards than in the years before 2015.
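A sketch of constructing such a dummy in Stata, assuming hypothetical variables C (consumption), Y (income) and a time variable year:
tsset year
* dummy equal to 1 from 2015 onwards, 0 before
gen D2015 = (year >= 2015)
reg C Y L.Y D2015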

Trends and seasons:


Many series have a trend.
y_t = α_0 + α_1 t + u_t (linear time trend)
log(y_t) = β_0 + β_1 t + u_t (exponential time trend)

β_1 = Δlog(y_t)/Δt ≈ [(y_t − y_{t−1})/y_{t−1}]/Δt = relative change (growth) per time unit

Unobserved, trending factors that affect yt might also be correlated with the explanatory
variables. Ignoring this will lead to spurious regression. Including a time trend in the
regression model eliminates the problem.
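A sketch of including a linear time trend in Stata (hypothetical variables y and x, data assumed to be sorted by time):
* generate a linear trend and add it as a regressor
gen t = _n
reg y x t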

By taking away the trend (detrend), we are only using variation relative to the trend lines. The
R2 of the detrended series reflects how well the explanatory variable(s) explains the dependent
variable net of the effect of the time trend.

We can de-seasonalize time series as well. When analyzing time series, we should control for
trends and seasonality. If there are underlying common trends that are not taken into account,
the results are biased and can even give the opposite coefficient sign relative to the true effect.
Moreover, the R2 could be exaggeratedly large even though the true R2 might be minimal.

Autocorrelation
Autocorrelation might affect the efficiency of the estimate but should in theory not lead to
biased coefficient results. However, for small samples, autocorrelation might have
implications even for the sign of the slope coefficients. With autocorrelation it can be shown
that β̂_1 is still unbiased, E(β̂_1) = β_1, but the standard errors of this estimator will no longer be
valid. This implies that the least squares estimates are unbiased but inefficient in the presence
of autocorrelation.

Autocorrelation is a sign of dynamic mis-specification.

One assumption when working with time series and cross-sectional data is that u_i and u_j are
independent for all i ≠ j (the error terms are uncorrelated):
Cov(u_i, u_j) = E{(u_i − E(u_i))(u_j − E(u_j))} = E(u_i u_j) = 0
If this assumption is violated, we have lags in the error term  autocorrelation.
!!!First-order autoregressive error model (AR(1)-model)
An AR(1)-model is a simple model with a lag in the error term. This model also captures the
dynamic characteristic in time-series data.

y_t = β_0 + β_1 x_t + u_t,  t = 1, 2, …, T
u_t = ρ u_{t−1} + e_t
e_t ~ N(0, σ_e^2)
E(e_t e_s) = 0 for all t ≠ s

Manipulating (multiply the one-period-lagged equation by ρ):

y_t = β_0 + β_1 x_t + u_t
ρ y_{t−1} = ρ β_0 + ρ β_1 x_{t−1} + ρ u_{t−1}

Subtracting the second equation from the first (quasi-differencing):
y_t − ρ y_{t−1} = β_0 − ρ β_0 + β_1 x_t − ρ β_1 x_{t−1} + (u_t − ρ u_{t−1})
y_t − ρ y_{t−1} = β_0(1 − ρ) + β_1(x_t − ρ x_{t−1}) + (u_t − ρ u_{t−1})
y_t = β_0(1 − ρ) + β_1(x_t − ρ x_{t−1}) + ρ y_{t−1} + (u_t − ρ u_{t−1})
where (u_t − ρ u_{t−1}) = e_t.
A model like this, with a lagged dependent variable (y_{t−1}), a contemporaneous variable (x_t), a
lagged variable (x_{t−1}) and a well-behaved residual e_t (since we assumed E(e_t e_s) = 0 for all
t ≠ s), is also an ARDL(p,q) model.

 What exactly do I achieve by doing this, what does it mean for autocorrelation, and what am
I actually doing here?

Detecting autocorrelation
Durbin Watson Test
d = Σ_{t=2}^{T} (û_t − û_{t−1})^2 / Σ_{t=2}^{T} û_t^2
  = [Σ_{t=2}^{T} û_t^2 + Σ_{t=2}^{T} û_{t−1}^2 − 2 Σ_{t=2}^{T} û_t û_{t−1}] / Σ_{t=2}^{T} û_t^2
  = Σ_{t=2}^{T} û_t^2 / Σ_{t=2}^{T} û_t^2 + Σ_{t=2}^{T} û_{t−1}^2 / Σ_{t=2}^{T} û_t^2 − 2 Σ_{t=2}^{T} û_t û_{t−1} / Σ_{t=2}^{T} û_t^2
  ≈ 1 + 1 − 2 Σ_{t=2}^{T} (ρ̂ û_{t−1} + ê_t) û_{t−1} / Σ_{t=2}^{T} û_t^2
  ≈ 1 + 1 − 2 ρ̂ Σ_{t=2}^{T} û_{t−1} û_{t−1} / Σ_{t=2}^{T} û_t^2
  ≈ 1 + 1 − 2 ρ̂ = 2(1 − ρ̂)
ρ̂ is the sample first-order coefficient of autocorrelation. Since −1 < ρ̂ < 1, d must be such that
0 < d < 4 (ρ̂ = 1 gives d = 0, ρ̂ = −1 gives d = 4). The DW test is often inconclusive, meaning
that there are values of d where one cannot say whether autocorrelation is a problem or not.
The DW test also only works for AR(1) processes.

Another way of testing autocorrelation


Here is another way of testing for AR(1) serial correlation in the error term that does not
require strictly exogenous regressors.

The zero conditional mean assumption implies that the K explanatory variables in period t are
independent of the error term in the same period, and also that the error term is independent of
the explanatory variables from all other periods:
E(u_t | X) = 0, t = 1, …, T
E(u_t | x_{1t}, …, x_{Kt}) = 0, t = 1, …, T
E(u_t | x_{1τ}, …, x_{Kτ}) = 0, τ ≠ t
Anything that causes the unobservables at time t, u_t, to be correlated with any of the explanatory
variables in any time period causes the zero conditional mean assumption to fail. Then
the regressors are not strictly exogenous. It is quite common in time series that one or more of
the regressors are not strictly exogenous; below is a test that does not require strictly
exogenous regressors.

1. Run OLS:
y_t = β_0 + β_1 x_{1t} + … + β_k x_{kt} + u_t
2. Save the residuals
û_t
3. Run OLS:
û_t = δ + ρ û_{t−1} + θ_1 x_{1t} + … + θ_k x_{kt} + v_t
4. Do a t-test for ρ̂.
H0: ρ = 0 (no autocorrelation)

!Correlogram
This we will come back to in highly persistent time series.

Consequences of autocorrelation
With autocorrelation it can be shown that β̂_1 is still unbiased, E(β̂_1) = β_1, but the standard
errors of this estimator will no longer be valid. This implies that the least squares estimates are
unbiased but inefficient in the presence of autocorrelation.

Correcting for autocorrelation


There are different ways of correcting for this efficiency problem that autocorrelation causes.

When the structure of the autocorrelation is known


y_t = β_0 + β_1 x_t + u_t
u_t = ρ u_{t−1} + e_t
e_t ~ N(0, σ_e^2)
If we know ρ, we can do the following transformation:
y_t = β_0 + β_1 x_t + u_t
ρ y_{t−1} = ρ β_0 + ρ β_1 x_{t−1} + ρ u_{t−1}
Subtracting:
y_t − ρ y_{t−1} = β_0(1 − ρ) + β_1(x_t − ρ x_{t−1}) + (u_t − ρ u_{t−1})
Where e_t = u_t − ρ u_{t−1}. We can now use OLS on the transformed model, i.e., regress
y_t − ρ y_{t−1} on x_t − ρ x_{t−1}. This is called the quasi-difference data method.

When the structure of autocorrelation is unknown


This is called the Cochrane-Orcutt procedure.
y t =β 0+ β1 x t +ut
ut =ρ ut −1 + et
1. Estimate
y_t = β_0 + β_1 x_t + u_t
2. Get an estimate ρ̂ from
û_t = δ + ρ û_{t−1} + v_t
3. Transform the variables:
y*_t = y_t − ρ̂ y_{t−1}
x*_t = x_t − ρ̂ x_{t−1}
4. Estimate
y*_t = β*_0 + β_1 x*_t + (u_t − ρ̂ u_{t−1})
5. Repeat steps 2-4 until the estimated ρ̂ differs very little from one round to the next.
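In Stata, the iterative Cochrane–Orcutt procedure is available through the prais command; a sketch with hypothetical y, x and time variable t:
tsset t
* Cochrane-Orcutt estimation of y on x with AR(1) errors
prais y x, corc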

We have now estimated a static model, but where we have allowed the shocks u_t to be
serially correlated and to follow an AR(1) process. We have estimated y_t = β_0 + β_1 x_t + u_t where
the shocks u_t are AR(1), u_t = ρ u_{t−1} + e_t, and e_t is white noise, e_t ~ N(0, σ_e^2). We have thus
found estimates of β_1 and ρ.

Common factor approach


We may also estimate the static model with serially correlated shocks using a common factor
approach.

y_t = β_0 + β_1 x_t + u_t,  t = 1, 2, …, T
u_t = ρ u_{t−1} + e_t
Which may be transformed as follows:
y_t = β_0 + β_1 x_t + u_t
ρ y_{t−1} = ρ β_0 + ρ β_1 x_{t−1} + ρ u_{t−1}
Subtracting:
y_t = β_0(1 − ρ) + β_1 x_t − ρ β_1 x_{t−1} + ρ y_{t−1} + (u_t − ρ u_{t−1})
The coefficient on x_{t−1} is (minus) the product of two other coefficients: the coefficient on x_t
(β_1) and the coefficient on y_{t−1} (ρ). Thus there is a common factor, since the slope
coefficients depend on only two deep parameters, β_1 and ρ.

Or as the unrestricted model:

y_t = π_0 + π_1 x_t + π_2 x_{t−1} + π_3 y_{t−1} + (u_t − ρ u_{t−1})
Where H0: π_2 = −π_1·π_3
If we cannot reject H0, that indicates that the estimated model y_t = β_0 + β_1 x_t + u_t with
u_t = ρ u_{t−1} + e_t may be the true model.

STATA: In Stata you will get the coefficients β̂_1 and ρ̂, thus enabling you to calculate π_2.
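A sketch of testing the common factor restriction via the unrestricted ARDL regression in Stata (hypothetical y, x and time variable t); testnl tests H0: π_2 = −π_1·π_3:
tsset t
* unrestricted model with x, lagged x and lagged y
reg y x L.x L.y
* nonlinear test of the common factor restriction
testnl _b[L.x] = -_b[x]*_b[L.y]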

!!Restrictions on the coefficients


A third way to analyze a static model with autocorrelation in the shocks ut is to put restrictions
on the coefficients.
This is a non-linear regression method, i.e. it is non-linear in the parameters. The restricted
form:
y_t = β_0(1 − ρ) + β_1 x_t − ρ β_1 x_{t−1} + ρ y_{t−1} + (u_t − ρ u_{t−1})

 What does one actually do here, what does one look at, and what does it mean to put
restrictions on the coefficients?
Stationarity
In order to predict a time series, it has to have some attributes that are constant over time.
Thus, the time series needs to be stationary. A time series is stationary if:
E(y_t) = μ (constant mean)
Var(y_t) = σ^2 (constant variance)
Cov(y_t, y_{t−s}) = γ_s (the covariance depends on s, not on t)

When non-stationary time series are used in a regression model the results may indicate a
significant relationship even when there is none, which may lead to a spurious regression.
Thus, if we suspect one or all of the series used in a regression to be non-stationary we should
be careful since we can obtain spurious regression results.
