
EC3304 Econometrics II

Textbook: Stock and Watson's Introduction to Econometrics


Topics:
Ch. 2 and 3: Review of probability, statistics, and matrix algebra
Ch. 10: Panel data
Ch. 11: Binary dependent variables
Ch. 12: Instrumental variables
Ch. 13: Experiments and quasi-experiments
Ch. 14: Intro to time series and forecasting
Ch. 15: Estimation of dynamic causal effects
Ch. 16: Additional topics in time series
Assessment:
1. Tutorials and class participation 20%
2. Midterm exam 20%
3. Final exam 50%
Office hours:
Wednesday 3pm-5pm AS2, #05-02
Chapter 2 Review of Probability
Section 2.1 Random variables and probability distributions
The probability of an outcome is the proportion of the time that an outcome occurs in
the long run
Ex. whether your computer crashes while writing a term paper
The computer crashing is a random outcome that isn't known beforehand
Its probability tells us the likelihood of this outcome
Random variable: numerical summary of a random outcome
A discrete random variable takes on only discrete values like 0, 1, 2, ...
Ex. binary random variable: 1 if employed, 0 if unemployed
A continuous random variable takes on a continuum of possible values. Ex. income
Probability density function (PDF) of a discrete random variable
a function that returns the probability for each possible outcome of a random variable
Cumulative distribution function (CDF)
function that returns the probability that a random variable is less than or equal to a
particular value
In tabular form:
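A minimal Python sketch of such a table (the crash counts and probabilities below are illustrative, not the textbook's numbers); it builds the CDF by cumulating the PDF:

import numpy as np

# Hypothetical discrete distribution: number of computer crashes
outcomes = np.array([0, 1, 2, 3, 4])
pmf = np.array([0.80, 0.10, 0.06, 0.03, 0.01])  # probabilities sum to 1

cdf = np.cumsum(pmf)  # CDF: Pr(Y <= y) for each outcome
for y, p, c in zip(outcomes, pmf, cdf):
    print(f"Pr(Y = {y}) = {p:.2f}   Pr(Y <= {y}) = {c:.2f}")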
The probability density function also summarizes a continuous random variable
similar to the discrete case in that it returns a number (not the probability!) for each possible value of the random variable, but it has to be interpreted differently
the area under the PDF between any two values is the probability that the random variable falls between these two values
The cumulative distribution function (CDF) (or cumulative probability distribution) is defined just as for a discrete random variable
Section 2.2 Expected values, mean, and variance
The expected value of a random variable, Y, denoted E(Y) or \mu_Y, is a weighted average of all possible values of Y
For a discrete random variable with k possible outcomes and probability function \Pr(Y = y):
E(Y) = \sum_{j=1}^{k} y_j \Pr(Y = y_j)
For a continuous random variable with probability density function f(y):
E(Y) = \int_{-\infty}^{\infty} y f(y) \, dy
Intuitively, the expectation can be thought of as the long-run average of a random variable
over many repeated trials or occurrences
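A quick simulation sketch of this intuition (using the illustrative crash distribution from above): the average over many draws approaches E(Y):

import numpy as np

rng = np.random.default_rng(0)
outcomes = np.array([0, 1, 2, 3, 4])
pmf = np.array([0.80, 0.10, 0.06, 0.03, 0.01])

expected = np.sum(outcomes * pmf)           # E(Y) = sum_j y_j Pr(Y = y_j)
draws = rng.choice(outcomes, size=100_000, p=pmf)
print(expected, draws.mean())               # long-run average is close to E(Y)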
The variance and standard deviation measure the dispersion or spread of a proba-
bility distribution
For a discrete random variable:
\sigma_Y^2 = \mathrm{var}(Y) = E[(Y - \mu_Y)^2] = \sum_{j=1}^{k} (y_j - \mu_Y)^2 \Pr(Y = y_j)
For a continuous random variable:
\sigma_Y^2 = \mathrm{var}(Y) = E[(Y - \mu_Y)^2] = \int_{-\infty}^{\infty} (y - \mu_Y)^2 f(y) \, dy
The standard deviation is the square root of the variance, denoted \sigma_Y
It is easier to interpret the SD since it has the same units as Y
The units of the variance are the squared units of Y
Linear functions of a random variable have convenient properties
Suppose
Y = a + bX,
where a and b are constants
The expectation of Y is
\mu_Y = a + b\mu_X
The variance of Y is
\sigma_Y^2 = b^2 \sigma_X^2
The standard deviation of Y is
\sigma_Y = |b|\,\sigma_X
A random variable is standardized by the formula
Z = \frac{X - \mu_X}{\sigma_X},
which can be written as
Z = aX + b,
where a = 1/\sigma_X and b = -\mu_X/\sigma_X.
The expectation of Z is
\mu_Z = a\mu_X + b = (\mu_X/\sigma_X) - (\mu_X/\sigma_X) = 0
The variance of Z is
\sigma_Z^2 = a^2 \sigma_X^2 = \sigma_X^2 / \sigma_X^2 = 1
Thus, the random variable Z has a mean of zero and a variance of 1
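A short sketch verifying this on simulated draws (the parameters mu_X = 5, sigma_X = 3 are illustrative):

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=3.0, size=100_000)  # X with mu_X = 5, sigma_X = 3

z = (x - x.mean()) / x.std()  # Z = (X - mu_X) / sigma_X
print(z.mean(), z.var())      # approximately 0 and 1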
Section 2.3 Two random variables
The joint probability density function of two random variables is the probability
that the random variables simultaneously take on certain values
X is a binary variable equal to 1 if it rains and 0 if it does not
Y is a binary variable equal to 1 if the commute is short and 0 if it is long
The marginal probability density function of Y is simply the PDF of Y
The term is used in the context of multiple random variables when referring to just one of the random variables
The distribution of Y given that X takes on a specific value is called the conditional distribution of Y given X
Conditional probabilities are defined by \Pr(Y = y \mid X = x) = \dfrac{\Pr(Y = y, X = x)}{\Pr(X = x)}
Consider the computer crashing example. Now suppose you are randomly assigned to a
computer in the computer lab.
The conditional expectation (or mean) of Y given X is the mean of the conditional
distribution of Y given X
It is computed as
E(Y \mid X = x) = \sum_{j=1}^{k} y_j \Pr(Y = y_j \mid X = x)
Two random variables are independent if knowing the outcome of one of the random
variables provides no information about the other
Mathematically:
Pr (Y = y | X = x) = Pr (Y = y)
Another useful definition of independence is
Pr (Y = y, X = x) = Pr (Y = y) Pr (X = x)
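A sketch of the rain/commute example as a joint probability table (the probabilities are made up for illustration); it computes the marginals, a conditional distribution, and checks independence:

import numpy as np

# Hypothetical joint PMF: rows index X (rain), columns index Y (short commute)
joint = np.array([[0.07, 0.63],   # Pr(X=0, Y=0), Pr(X=0, Y=1)
                  [0.15, 0.15]])  # Pr(X=1, Y=0), Pr(X=1, Y=1)

px = joint.sum(axis=1)                 # marginal PMF of X
py = joint.sum(axis=0)                 # marginal PMF of Y
cond_y_given_rain = joint[1] / px[1]   # Pr(Y = y | X = 1)
print(py, cond_y_given_rain)

# Independent iff Pr(Y=y, X=x) = Pr(Y=y) Pr(X=x) for all x, y
print(np.allclose(joint, np.outer(px, py)))  # False for this table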
Covariance measures the extent to which two random variables move together:
\sigma_{XY} \equiv \mathrm{cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = \sum_{i=1}^{k} \sum_{j=1}^{l} (x_i - \mu_X)(y_j - \mu_Y) \Pr(X = x_i, Y = y_j)
The covariance depends on the units of measurement
Correlation does not
Correlation is the covariance divided by the standard deviations of X and Y:
\rho_{XY} \equiv \mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}
The correlation is constrained to the values:
-1 \le \mathrm{corr}(X, Y) \le 1
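A sketch on simulated data showing that covariance depends on units while correlation does not (the data-generating process is illustrative):

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100_000)
y = 2.0 * x + rng.normal(size=100_000)   # Y moves with X by construction

cov_xy = np.cov(x, y)[0, 1]                      # sample covariance
corr_xy = cov_xy / (x.std(ddof=1) * y.std(ddof=1))
print(cov_xy, corr_xy)

# Rescaling X changes the covariance but leaves the correlation unchanged
print(np.cov(100 * x, y)[0, 1], np.corrcoef(100 * x, y)[0, 1])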
Section 2.4 The normal, chi-squared, Student t, and F distributions
We will concern ourselves mainly with the normal distribution, which we will see over and over
A normally distributed random variable is continuous and can take on any value
Its PDF has the familiar bell-shaped graph
We say X is normally distributed with mean \mu and variance \sigma^2, written as X \sim N(\mu, \sigma^2)
Mathematically, the PDF of a normal random variable X with mean \mu and variance \sigma^2 is
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
If X \sim N(\mu, \sigma^2), then aX + b \sim N(a\mu + b, a^2\sigma^2)
A good relationship to memorize: standardizing gives Z = (X - \mu)/\sigma \sim N(0, 1)
A random variable Z such that Z \sim N(0, 1) has the standard normal distribution
The standard normal PDF is denoted by \phi(z) and is given by
\phi(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} z^2\right)
The standard normal CDF is denoted by \Phi(z)
In other words:
\Pr(Z \le c) = \Phi(c)
Ex. Suppose X \sim N(3, 4) and we want to know \Pr(X \le 1)
We compute the probability by standardizing and then looking the probability up in a textbook table:
\Pr(X \le 1) = \Pr(X - 3 \le 1 - 3) = \Pr\left(\frac{X - 3}{2} \le -1\right) = \Pr(Z \le -1) = \Phi(-1) = 0.159
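Rather than a textbook table, the same probability can be computed with scipy; a sketch of the example above:

from scipy.stats import norm

# X ~ N(3, 4), i.e. mean 3 and variance 4 (standard deviation 2)
print(norm.cdf(1, loc=3, scale=2))   # Pr(X <= 1) = 0.1587
print(norm.cdf(-1))                  # Phi(-1), same value after standardizing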
Section 2.5 Random sampling and the distribution of the sample average
A population is any well-defined group of subjects such as individuals, firms, cities, etc
We would like to know something about the population but cannot survey each subject
Ex. wages of the working population
Instead we use random sampling
Random sampling: n objects are selected at random from a population with each
member of the population having an equally likely chance of being selected
If we obtain the wages of 500 randomly chosen people from the working population, then we have a random sample of wages from the population of all working people
The observations are random, reflecting the fact that many different outcomes are possible
If we sample 500 different people twice, then the wages for each sample will differ
Formally, let Y be a random variable that we want to learn about
Y has some unknown, underlying distribution
A sample of n observations drawn from the same population is denoted Y_1, Y_2, \ldots, Y_n, where Y_i denotes the i-th observation
Each Y_i has the same marginal distribution as Y, so we say Y_1, Y_2, \ldots, Y_n are independently and identically distributed (i.i.d.)
The sample mean \bar{Y} is
\bar{Y} = \frac{1}{n}(Y_1 + Y_2 + \cdots + Y_n) = \frac{1}{n} \sum_{i=1}^{n} Y_i
Since the n observations are random, so is the sample mean, and as such it has a sampling distribution
The mean of \bar{Y} is
E(\bar{Y}) = E\left(\frac{1}{n} \sum_{i=1}^{n} Y_i\right) = \frac{1}{n} \sum_{i=1}^{n} E(Y_i) = \frac{1}{n}\, n\, \mu_Y = \mu_Y
The variance of \bar{Y} is
\sigma_{\bar{Y}}^2 \equiv \mathrm{var}(\bar{Y}) = \mathrm{var}\left(\frac{1}{n} \sum_{i=1}^{n} Y_i\right) = \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{var}(Y_i) = \frac{1}{n^2}\, n\, \sigma_Y^2 = \frac{\sigma_Y^2}{n}
(since Y_1, Y_2, \ldots, Y_n are i.i.d., \mathrm{cov}(Y_i, Y_j) = 0 for i \ne j)
These results hold whatever the underlying distribution of Y is
If Y \sim N(\mu_Y, \sigma_Y^2), then \bar{Y} \sim N(\mu_Y, \sigma_Y^2/n)
(because the sum of independent, normally distributed random variables is normally distributed)
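A simulation sketch of these results (illustrative parameters): the mean of many sample averages is close to mu_Y and their variance is close to sigma_Y^2 / n:

import numpy as np

rng = np.random.default_rng(3)
mu_y, sigma_y, n = 10.0, 4.0, 50

# Draw 20,000 samples of size n and compute each sample mean
samples = rng.normal(mu_y, sigma_y, size=(20_000, n))
ybar = samples.mean(axis=1)

print(ybar.mean(), mu_y)             # E(Ybar) = mu_Y
print(ybar.var(), sigma_y**2 / n)    # var(Ybar) = sigma_Y^2 / n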
Section 2.6 Large-sample approximations to sampling distributions
There are two approaches to characterizing sampling distributions: exact or approximate
The exact approach means deriving the exact sampling distribution for any value of n (the finite-sample distribution)
The approximate approach means using an approximation that can work well for large n (the asymptotic distribution)
The key concepts are the Law of Large Numbers and the Central Limit Theorem
The Law of Large Numbers states that under certain conditions \bar{Y} will be close to \mu_Y with very high probability when n is large
If Y_i, i = 1, \ldots, n are i.i.d. with E(Y_i) = \mu_Y and if \mathrm{var}(Y_i) < \infty, then \bar{Y} \xrightarrow{p} \mu_Y
The Central Limit Theorem states that under certain conditions the distribution of \bar{Y} is well approximated by a normal distribution when n is large
Suppose that Y_1, \ldots, Y_n are i.i.d. with E(Y_i) = \mu_Y and \mathrm{var}(Y_i) = \sigma_Y^2, where 0 < \sigma_Y^2 < \infty. Then, for large n,
\frac{\bar{Y} - \mu_Y}{\sigma_Y / \sqrt{n}} \stackrel{a}{\sim} N(0, 1)
This is a very convenient property that is used over and over in econometrics
A slightly different presentation of the CLT that is often useful:
\sqrt{n}\,(\bar{Y} - \mu_Y) \stackrel{a}{\sim} N(0, \sigma_Y^2)
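A sketch of the CLT with a decidedly non-normal population (exponential draws, chosen for illustration): the standardized sample mean behaves like N(0, 1):

import numpy as np

rng = np.random.default_rng(4)
n, reps = 200, 20_000
mu_y, sigma_y = 1.0, 1.0   # mean and sd of an Exponential(1) population

ybar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
z = (ybar - mu_y) / (sigma_y / np.sqrt(n))   # (Ybar - mu_Y) / (sigma_Y / sqrt(n))
print(z.mean(), z.std())                     # approximately 0 and 1
print(np.mean(np.abs(z) <= 1.96))            # approximately 0.95, as for N(0, 1)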
Chapter 3 Review of statistics
Statistics is about learning something about a population from a sample drawn from that population
Estimation, hypothesis testing, and confidence intervals
Section 3.1 Estimation of the population mean
An estimator is a function of the sample data
An estimate is the numerical value of the estimator actually computed using data
Suppose we want to learn the mean of Y, \mu_Y
How do we do it? We come up with an estimator for \mu_Y
Naturally, we could use the sample average \bar{Y}
We could also use, say, the first observation, Y_1
There are many possible estimators of \mu_Y, so we need some criteria to decide which ones are good
Some desirable properties of estimators
Let \hat{\mu}_Y denote some estimator of \mu_Y
Unbiasedness: E(\hat{\mu}_Y) = \mu_Y
Consistency: \hat{\mu}_Y \xrightarrow{p} \mu_Y
Efficiency: let \hat{\mu}_Y and \tilde{\mu}_Y be unbiased estimators; \hat{\mu}_Y is more efficient than \tilde{\mu}_Y if \mathrm{var}(\hat{\mu}_Y) < \mathrm{var}(\tilde{\mu}_Y)
\bar{Y} is unbiased, E(\bar{Y}) = \mu_Y, and consistent, \bar{Y} \xrightarrow{p} \mu_Y
Efficiency? \mathrm{var}(\bar{Y}) = \sigma_Y^2 / n < \mathrm{var}(Y_1) = \sigma_Y^2
\bar{Y} is the Best Linear Unbiased Estimator (BLUE)
That is, \bar{Y} is the most efficient estimator of \mu_Y among all unbiased estimators that are weighted averages of Y_1, \ldots, Y_n
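A simulation sketch comparing the two unbiased estimators above (illustrative parameters): both average to mu_Y, but Ybar has much smaller variance than Y_1:

import numpy as np

rng = np.random.default_rng(5)
mu_y, sigma_y, n, reps = 2.0, 3.0, 100, 20_000

samples = rng.normal(mu_y, sigma_y, size=(reps, n))
ybar, y1 = samples.mean(axis=1), samples[:, 0]

print(ybar.mean(), y1.mean())   # both approximately mu_Y (unbiased)
print(ybar.var(), y1.var())     # sigma_Y^2 / n versus sigma_Y^2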
Section 3.2 Hypothesis tests concerning the population mean
The null hypothesis is a hypothesis about the population that we want to test
Typically not what the researcher thinks is true
A second hypothesis is called the alternative hypothesis
Typically what the researcher thinks is true
H_0: E(Y) = \mu_{Y,0} \quad \text{and} \quad H_1: E(Y) \ne \mu_{Y,0}
Hypothesis testing entails using a test statistic to decide whether to accept the null hypoth-
esis or reject it in favor of the alternative hypothesis
In our example, we use the sample mean to test hypotheses about the population mean
Since the sample is random, the test statistic is random, so we have to reject or accept the null hypothesis using a probabilistic calculation
Given a null hypothesis, the p-value is the probability of drawing a test statistic (\bar{Y}) that is as extreme or more extreme than the observed value of the test statistic (\bar{y})
A small p-value indicates that the null hypothesis is unlikely to be true
To compute the p-value we make use of the CLT and treat \bar{Y} \stackrel{a}{\sim} N(\mu_Y, \sigma_{\bar{Y}}^2)
Under the null hypothesis, \bar{Y} \stackrel{a}{\sim} N(\mu_{Y,0}, \sigma_{\bar{Y}}^2)
(assume for now that \sigma_Y^2 is known)
For \bar{y} > \mu_{Y,0}, the probability of getting a more extreme positive value than \bar{y}:
\Pr(\bar{Y} > \bar{y} \mid \mu_{Y,0}) = \Pr\left(\frac{\bar{Y} - \mu_{Y,0}}{\sigma_{\bar{Y}}} > \frac{\bar{y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right) = \Pr\left(Z > \frac{\bar{y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right) = \Phi\left(-\frac{\bar{y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right)
The p-value is thus
\text{p-value} = 2\,\Phi\left(-\frac{\bar{y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right)
For \bar{y} < \mu_{Y,0}, the probability of getting a more extreme negative value than \bar{y}:
\Pr(\bar{Y} < \bar{y} \mid \mu_{Y,0}) = \Pr\left(\frac{\bar{Y} - \mu_{Y,0}}{\sigma_{\bar{Y}}} < \frac{\bar{y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right) = \Pr\left(Z < \frac{\bar{y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right) = \Phi\left(\frac{\bar{y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right)
The p-value is thus
\text{p-value} = 2\,\Phi\left(\frac{\bar{y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right)
We can encompass both these cases by using the absolute value function:
\text{p-value} = 2\,\Phi\left(-\frac{|\bar{y} - \mu_{Y,0}|}{\sigma_{\bar{Y}}}\right)
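A sketch of the two-sided p-value formula in code (the sample mean, null value, and SD of Ybar below are hypothetical, with sigma_Ybar treated as known):

from scipy.stats import norm

ybar, mu_0, sigma_ybar = 10.6, 10.0, 0.25   # hypothetical values

p_value = 2 * norm.cdf(-abs(ybar - mu_0) / sigma_ybar)
print(p_value)   # reject H0 at the 5% level if p_value < 0.05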
Typically, the standard deviation \sigma_Y is unknown and must be estimated
The sample variance is
s_Y^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2
The standard error of \bar{Y} is
\hat{\sigma}_{\bar{Y}} = \frac{s_Y}{\sqrt{n}},
a consistent estimator of \sigma_{\bar{Y}}
The p-value formula is simply altered by replacing the standard deviation with the standard error:
\text{p-value} = 2\,\Phi\left(-\frac{|\bar{y} - \mu_{Y,0}|}{\hat{\sigma}_{\bar{Y}}}\right)
That is, when n is large, \bar{Y} \stackrel{a}{\sim} N(\mu_Y, \hat{\sigma}_{\bar{Y}}^2)
If the p-value is less than some pre-decided value (often 0.05), then the null hypothesis is rejected. Otherwise, it is accepted (or more precisely, not rejected).
There are two possible errors that can be made
A type I error occurs when a null hypothesis is rejected when it is true
A type II error occurs when a null hypothesis is not rejected when it is false
The significance level of a test (or the size of a test) is the probability of a type I error:
\alpha = \Pr(\text{Type I error}) = \Pr(\text{Reject } H_0 \mid H_0 \text{ is true})
The probability of a type II error:
\beta = \Pr(\text{Type II error}) = \Pr(\text{Do not reject } H_0 \mid H_0 \text{ is false})
The power of a test is one minus the probability of a type II error:
1 - \beta = \Pr(\text{Reject } H_0 \mid H_0 \text{ is false})
Section 3.3 Confidence intervals for the population mean
An estimate is a single, best guess of \mu_Y
A confidence interval instead gives a range of values that contains \mu_Y with a certain probability
Using the sampling distribution \bar{Y} \stackrel{a}{\sim} N(\mu_Y, \sigma_{\bar{Y}}^2), we have:
\Pr\left(-1.96 \le \frac{\bar{Y} - \mu_Y}{\sigma_{\bar{Y}}} \le 1.96\right) = 0.95
\Pr\left(\bar{Y} - 1.96\,\sigma_{\bar{Y}} \le \mu_Y \le \bar{Y} + 1.96\,\sigma_{\bar{Y}}\right) = 0.95
The interval [\bar{Y} - 1.96\,\sigma_{\bar{Y}},\; \bar{Y} + 1.96\,\sigma_{\bar{Y}}] contains \mu_Y with probability 0.95
An estimate of this interval, [\bar{y} - 1.96\,\hat{\sigma}_{\bar{Y}},\; \bar{y} + 1.96\,\hat{\sigma}_{\bar{Y}}], is called a 95% confidence interval for \mu_Y
Note that the first interval is random but the second is not
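A sketch computing a 95% confidence interval from a simulated sample, using the sample standard error (the population parameters are illustrative):

import numpy as np

rng = np.random.default_rng(6)
y = rng.normal(10.0, 4.0, size=500)   # hypothetical sample with mu_Y = 10

ybar = y.mean()
se = y.std(ddof=1) / np.sqrt(len(y))  # s_Y / sqrt(n)
ci = (ybar - 1.96 * se, ybar + 1.96 * se)
print(ci)   # contains mu_Y = 10 in about 95% of repeated samples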
Appendix 18.1 Summary of matrix algebra
A vector is a collection of n numbers or elements collected either in a column or in a row:
b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} \quad \text{and} \quad c = \begin{bmatrix} c_1 & c_2 & \cdots & c_n \end{bmatrix}
The i-th element of vector b is denoted b_i
A matrix is a collection of numbers or elements in which the elements are laid out in
columns and rows
The dimension of a matrix is n \times m, where n is the number of rows and m is the number of columns
An n \times m matrix A is
A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1m} \\ a_{21} & a_{22} & \cdots & a_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nm} \end{bmatrix},
where a_{ij} is the (i, j) element of A
A square matrix has an equal number of rows and columns
A matrix is symmetric if element (i, j) is the same as (j, i)
A diagonal matrix is a square matrix in which the off-diagonal elements are zero
The identity matrix I_n is an n \times n matrix with 1s on the diagonal and 0s everywhere else:
I_n = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & 1 \end{bmatrix}
The transpose of a matrix, denoted A', switches the rows and columns
That is, element (i, j) of A becomes element (j, i) of A'
Matrix summation: two matrices of the same dimension are added together by adding their elements
Scalar multiplication is multiplying a scalar (single number), \lambda, with a matrix: \lambda A
Vector multiplication of two n \times 1 column vectors, a and b, is computed as a'b = \sum_{i=1}^{n} a_i b_i
Two matrices A and B can be multiplied together to form the product AB if they are conformable
the number of columns of A equals the number of rows of B
The (i, j) element of the resulting matrix is computed by multiplying the i-th row of A with the j-th column of B
The inverse of the square matrix A is defined as the matrix A^{-1} for which A^{-1}A = I_n
Some useful properties of matrix addition and multiplication (checked numerically in the sketch below):
1. A + B = B + A
2. (A + B) + C = A + (B + C)
3. (A + B)' = A' + B'
4. If A is n \times m, then A I_m = A and I_n A = A
5. A(BC) = (AB)C
6. (A + B)C = AC + BC
7. (AB)' = B'A'
41
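A quick numpy sketch verifying several of these properties on random conformable matrices:

import numpy as np

rng = np.random.default_rng(7)
A, B = rng.random((3, 4)), rng.random((3, 4))
C = rng.random((4, 2))

print(np.allclose(A + B, B + A))                 # 1. addition commutes
print(np.allclose((A + B).T, A.T + B.T))         # 3. transpose of a sum
print(np.allclose(A @ np.eye(4), A))             # 4. A I_m = A
print(np.allclose((A + B) @ C, A @ C + B @ C))   # 6. distributive
print(np.allclose((A @ C).T, C.T @ A.T))         # 7. (AB)' = B'A'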
Ex. summation notation vs. matrix notation
The simple linear regression model is
Y_i = \beta_0 + \beta_1 X_i + u_i, \quad i = 1, \ldots, n,
where Y_i is the dependent variable, X_i is the independent variable, \beta_0 is the intercept, \beta_1 is the slope, and u_i is the error term
The OLS estimator minimizes the sum of squared errors:
\sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2
The first-order conditions are:
\sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0
and
\sum_{i=1}^{n} X_i (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0
Solving for \hat{\beta}_0 and \hat{\beta}_1:
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}
and
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}
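A sketch implementing these summation formulas on simulated data (the true intercept 1 and slope 2 are illustrative):

import numpy as np

rng = np.random.default_rng(8)
n = 1_000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)   # Y_i = beta_0 + beta_1 X_i + u_i

beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
beta0_hat = y.mean() - beta1_hat * x.mean()
print(beta0_hat, beta1_hat)   # close to 1 and 2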
Properties of the OLS estimator of the slope of the simple linear regression model
We can also write the linear regression model in matrix notation
We generalize the model to k independent variables plus an intercept:
Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_k X_{ki} + u_i, \quad i = 1, \ldots, n
The (k+1)-dimensional column vector X_i contains a constant and the independent variables of the i-th observation, and \beta is a (k+1)-dimensional column vector of coefficients. Then, we can rewrite the model as
Y_i = X_i' \beta + u_i, \quad i = 1, \ldots, n
The n \times (k+1) matrix X contains the stacked X_i', i = 1, \ldots, n:
X = \begin{bmatrix} X_1' \\ X_2' \\ \vdots \\ X_n' \end{bmatrix} = \begin{bmatrix} 1 & X_{11} & \cdots & X_{k1} \\ 1 & X_{12} & \cdots & X_{k2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{1n} & \cdots & X_{kn} \end{bmatrix}
The n-dimensional column vector Y contains the stacked observations of the dependent variable and the n-dimensional column vector u contains the stacked error terms:
Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} \quad \text{and} \quad u = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}
We can now write all n observations compactly as
Y = X\beta + u
The sum of squared errors is
\hat{u}'\hat{u} = (Y - X\hat{\beta})'(Y - X\hat{\beta})
The first-order conditions are
X'(Y - X\hat{\beta}) = 0,
which can be solved for \hat{\beta} as
X'Y = (X'X)\hat{\beta}
\hat{\beta} = (X'X)^{-1} X'Y
\hat{\beta} is the vector representation of the OLS estimators
If X_i' = [1 \; X_i], the two approaches result in equivalent estimators
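A sketch of the matrix formula on the same kind of simulated data, confirming it matches the summation approach when X_i' = [1 X_i]:

import numpy as np

rng = np.random.default_rng(9)
n = 1_000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])          # stack X_i' = [1 X_i] into an n x 2 matrix
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # (X'X)^{-1} X'Y, solved stably
print(beta_hat)   # [beta0_hat, beta1_hat], identical to the summation formulas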