Advanced Econometrics I - Slides
University of Mannheim
Fall 2014
Important dates
Start: 2013-10-07
End: 2013-12-05
Lectures:
Tuesday 10:15 - 11:45 in L 7, 3-5 - 001
Thursday 10:15 - 11:45 in L 7, 3-5 - 001
Exercises:
Thursday 13:45 - 15:15 in L 9, 1-2 003
Thursday 15:30 - 17:00 in L 9, 1-2 003
Teaching assistant: Maria Marchenko
Slides will be provided via Ilias, usually on Friday for the next week.
Contact
Office: L7, 3 - 5, room 142
Phone: 1940
E-Mail: isteinke@rumms.uni-mannheim.de
Office hour: by appointment
Overview:
1 Probability theory
2 Asymptotic theory
3 Conditional expectations
4 Linear regression
Y = β0 + β1 X + u,
e.g. Y consumption, X wage, u error term. Typically the data are not generated by
experiments; the error term "collects all other effects on consumption besides
wage".
→ variables somehow “random”
→ How do we formalize randomness?
Aims of this course:
(1) Provide basic probabilistic framework and statistical tools for
econometric theory.
(2) Application of these tools to the classical multiple linear regression
model.
→ Application of these results to economic problems in Advanced
Econometrics II/III and follow-up elective courses.
1 Probability measures
2 Probability measures on R
3 Random variables
4 Expectation
Ω = {ω1 , ω2 , ω3 , · · · }
(e.g. Ω = N, Ω = Z).
Definition 1.1
A probability measure P on a countable set Ω is a set function that
maps subsets of Ω to [0, 1], i.e. P : P(Ω) → [0, 1], and has the following
properties:
(i) P(Ω) = 1.
(ii) For pairwise disjoint A1, A2, … ⊆ Ω it holds that
P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai)   (σ-additivity).
Recap: Series
For a countable index set I = {i1, i2, …}, define
Σ_{i∈I} ai := Σ_{j=1}^∞ a_{i_j},
provided the series converges absolutely (so that the value does not depend on the enumeration of I).
Lemma 1.2
Let Ω = {ωi}_{i∈I} be a countable sample space (with countable I) and P a probability measure on Ω. Then for every A ⊆ Ω it holds that
P(A) = Σ_{ω∈A} P({ω}).
Proof: Exercise.
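As a quick illustration of Lemma 1.2 (a Python sketch, not part of the slides; the event "even number" is an arbitrary choice): the probability of an event on a countable Ω is the sum of its elementary probabilities.

    # Lemma 1.2 on a finite sample space: a fair die.
    omega = [1, 2, 3, 4, 5, 6]
    p = {w: 1/6 for w in omega}                # elementary probabilities P({w})
    A = {2, 4, 6}                              # event "even number"
    P_A = sum(p[w] for w in A)                 # P(A) = sum of P({w}), w in A
    print(P_A)                                 # 0.5
    assert abs(sum(p.values()) - 1) < 1e-12    # property (i): P(Omega) = 1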
Let ω ∈ Ω. An event {ω} that only contains one element is also called an
elementary event.
Definition 1.5
A set function P : A → [0, ∞) is a measure on (Ω, A) if for pairwise disjoint A1, A2, … ∈ A it holds that
P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai)   (σ-additivity).
Ingo Steinke (Uni Mannheim) Advanced Econometrics I Fall 2014 22
1.2 Probability measures on R
Definition 1.10
A class A∗ of subsets of Ω is a field if
(i) ∅ ∈ A∗ ,
(ii) if A ∈ A∗, then Aᶜ ∈ A∗,
(iii) if A1, A2 ∈ A∗, then A1 ∪ A2 ∈ A∗.
A set function P∗ : A∗ → [0, ∞) on a field A∗ satisfying
P∗(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P∗(Ai)
for pairwise disjoint A1, A2, … ∈ A∗ with ∪_{i=1}^∞ Ai ∈ A∗
is called a pre-measure.
If, in addition, P∗(Ω) = 1, it is called a probability pre-measure.
Then the CDF is defined by F(x) = Σ_{a ∈ S_f ∩ (−∞, x]} f(a).
S_P = {a ∈ R : P({a}) > 0} (the support of P)
Remark 1.18
1 S_P ⊆ R is countable and P(S_P) = 1.
2 If P is a discrete probability measure with support SP then F has
jumps at a ∈ SP with jump heights P({a}).
Example 1.19
1 Binomial distribution
P({i}) = (n choose i) π^i (1 − π)^{n−i} for i = 0, 1, …, n,
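A short numerical check of these probabilities (a sketch; n = 10 and π = 0.3 are arbitrary illustrative values):

    from math import comb

    n, pi = 10, 0.3
    pmf = [comb(n, i) * pi**i * (1 - pi)**(n - i) for i in range(n + 1)]
    print(sum(pmf))                     # sums to 1, as a pmf must
    F = [sum(pmf[:i + 1]) for i in range(n + 1)]   # CDF on the support
    print(F[3])                         # P(X <= 3)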
F′(x) = f(x),
if f is continuous at x.
A density is not unique, but it is unique almost everywhere (i.e. up to a Lebesgue null set).
Example 1.22
1 Normal distribution
f(x) = 1/√(2πσ²) · exp(−(x − µ)²/(2σ²)).
Parameters: µ ∈ R, σ > 0
2 Uniform distribution
f(x) = 1/(b − a) · 1_[a,b](x)
Parameters: −∞ < a < b < ∞
3 Exponential distribution
f(x) = λ e^{−λx} · 1_[0,∞)(x)
Parameter: λ > 0
Extension to Rk
∂^k F / (∂x1 ⋯ ∂xk) (x1, …, xk) = f(x1, …, xk).
cf.(1), p.27.
Then
P((a1, b1] × ⋯ × (ak, bk]) = ∫_{a1}^{b1} ⋯ ∫_{ak}^{bk} f(x1, …, xk) dxk ⋯ dx1
Then
B^k := σ(A0^(k))
i.e. X is A–B^k-measurable.
Notation: B ∈ B^k.
Definition 1.28
(i) Random variables X1, …, Xl on a probability space (Ω, A, P) are independent if
P(X1 ∈ B1, …, Xl ∈ Bl) = P(X1 ∈ B1) ⋯ P(Xl ∈ Bl) for all B1, …, Bl ∈ B.
SZ := {z ∈ Rk : P(Z = z) > 0}
P(X = x) = (λ^x / x!) · e^{−λ}, x = 0, 1, 2, …
Let S_X^> = {x ∈ R : f(x) > 0}. Then S_X, the closure of S_X^>, is called the support of X.
It holds
P(X ∈ SX ) = 1.
f_X(x) = 1/√(2πσ²) · exp(−(x − µ)²/(2σ²)).
f_X(x) = 1/(b − a) · 1_[a,b](x).
X is exponentially distributed with parameter λ > 0, i.e. X ∼ Exp(λ), if its density can be written as
f_X(x) = λ e^{−λx} · 1_[0,∞)(x).
f_{X1} and f_{X2} are called marginal densities, and the distributions of X1 and X2, resp., are called marginal distributions.
1.4 Expectation
In particular, if S_X = {a1, …, aN}, p_i = P(X = a_i) for i = 1, …, N and Σ_{i=1}^N p_i = 1, then
E∗X = Σ_{i=1}^N a_i p_i = Σ_{i=1}^N a_i P(X = a_i) = Σ_{i=1}^N a_i P^X({a_i}).
For a constant r.v. X ≡ a: E∗X = a · P(X = a) = a.
Special case:
E∗|X| = Σ_{x∈S_X} |x| P(X = x).
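For illustration (a sketch; the fair die is an arbitrary example, not from the slides), E∗X = Σ a_i p_i can be evaluated directly:

    # E*X for a fair die: support {1,...,6}, each with probability 1/6.
    support = [1, 2, 3, 4, 5, 6]
    p = [1/6] * 6
    EX = sum(a * q for a, q in zip(support, p))       # = 3.5
    E_absX = sum(abs(a) * q for a, q in zip(support, p))
    print(EX, E_absX)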
Remark 1.35
(a) E ∗ |X | is always well-defined (possibly = +∞).
(b) E ∗ X is finite iff E ∗ |X | < ∞. See (7).
Lemma 1.37
Let X , Y be discrete r.v., E ∗ [X ], E ∗ [Y ] finite, and a, b, c ∈ R.
(a) |E ∗ [X ]| ≤ E ∗ [|X |].
(b) E ∗ [a + bX + cY ] is finite and
E ∗ [a + bX + cY ] = a + bE ∗ [X ] + cE ∗ [Y ].
(c) If X , Y are independent, then E ∗ [X · Y ] is finite and
E ∗ [X · Y ] = E ∗ [X ] · E ∗ [Y ].
Denote
X + = max(0, X ), X − = max(0, −X ).
Then X + ≥ 0, X − ≥ 0,
X = X + − X −, and |X | = X + + X − .
Lemma 1.41 Let X be any r.v. and Yn , Zn discrete r.v. Assume that
Examples 1.49
1 If X ∼ B(n, π), then E [X ] = nπ and Var [X ] = nπ(1 − π).
2 If X ∼ Po(λ), then E [X ] = λ and Var [X ] = λ.
3 If X ∼ U(a, b), then E [X ] = (a + b)/2, Var [X ] = (b − a)2 /12.
4 If X ∼ Exp(λ), then E [X ] = 1/λ, Var [X ] = 1/λ2 .
5 If X ∼ N(µ, σ 2 ), then E [X ] = µ, Var [X ] = σ 2 .
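The moments in Examples 1.49 can be checked by simulation (a sketch using numpy; the parameters, sample size, and seed are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    n, lam = 10**6, 2.0
    x = rng.exponential(scale=1/lam, size=n)   # numpy parametrizes by scale = 1/lambda
    print(x.mean(), 1/lam)            # E[X] = 1/lambda
    print(x.var(), 1/lam**2)          # Var[X] = 1/lambda^2
    u = rng.uniform(3.0, 7.0, size=n) # U(a, b) with a = 3, b = 7
    print(u.mean(), (3+7)/2)          # (a+b)/2 = 5
    print(u.var(), (7-3)**2/12)       # (b-a)^2/12 = 4/3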
More generally,
Var[Σ_{i=1}^n c_i X_i] = Σ_{i=1}^n c_i² Var[X_i] + 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^n c_i c_j Cov[X_i, X_j]   (12)
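Formula (12) can be checked numerically (a sketch; the covariance matrix and coefficients below are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(1)
    c = np.array([1.0, -2.0, 0.5])
    S = np.array([[1.0, 0.3, 0.0],    # hypothetical covariance matrix
                  [0.3, 2.0, 0.5],
                  [0.0, 0.5, 1.5]])
    X = rng.multivariate_normal(np.zeros(3), S, size=10**6)
    lhs = (X @ c).var()               # simulated Var[sum c_i X_i]
    rhs = c @ S @ c                   # right-hand side of (12) in matrix form
    print(lhs, rhs)                   # both close to 7.175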
For a random vector X = (X1, …, Xk)′ define E[X] := (E[X1], …, E[Xk])′. Then, for a ∈ R^m and B ∈ R^{m×k},
E[a + BX] = a + B · E[X].
1 Convergence of expectations
2 Modes of convergence
3 Convergence in distribution
4 Limit Theorems
Then
E[Xn] → E[X] as n → ∞.
Then
E[X] ≤ lim inf_{n→∞} E[Xn].
Idea of the proof: Put Yn = inf k≥n Xk and apply the Monotone
convergence theorem.
E[Xn] → E[X] as n → ∞.   (16)
Theorem 2.1, Lemma 2.2, and Theorem 2.3 are still valid if (13), (14), and (15) are replaced by a.s. (almost surely) statements. I.e. if the assumptions in (13), (14), and (15) hold only on a set of probability one, they are said to hold almost surely; Theorem 2.1, Lemma 2.2, and Theorem 2.3 stay valid, and (16) holds true.
Notation: Xn →_P X, p-lim_{n→∞} Xn = X.
Notation: Xn → X P-a.s., Xn → X a.s., Xn →_{a.s.} X.
(Xn) converges to X in Lp if E‖Xn − X‖^p → 0.
Notation: Xn →_{Lp} X, Lp-lim_{n→∞} Xn = X.
Definition 2.5 (Convergence in distribution) Suppose that (Xn)n and X are random variables with values in (R^k, B^k). The sequence (Xn)n converges to X in distribution (Xn →_d X) if E[g(Xn)] → E[g(X)] for all bounded continuous g : R^k → R.
For 1 ≤ r ≤ p:
Xn →_{Lp} X ⟹ Xn →_{Lr} X ⟹ Xn →_{L1} X
For a monotone increasing g : [0, ∞) → [0, ∞) with g(ε) > 0 it holds that
P(‖X‖ ≥ ε) ≤ E[g(‖X‖)] / g(ε).
For a real-valued random variable X with finite second moment and ε > 0
it holds that
P(|X − EX| ≥ ε) ≤ Var[X] / ε²   (Chebyshev's inequality).
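A quick simulation check of Chebyshev's bound (a sketch; the Exp(1) distribution and ε = 2 are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.exponential(scale=1.0, size=10**6)    # EX = 1, Var[X] = 1
    eps = 2.0
    empirical = np.mean(np.abs(x - 1.0) >= eps)   # simulated P(|X - EX| >= eps)
    bound = 1.0 / eps**2                          # Var[X] / eps^2
    print(empirical, bound)                       # ~0.0498 <= 0.25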
Suppose that (Xn )n and X are random variables on some probability space
(Ω, A, P).
Put X = 0. Then Xn →_P X, but not Xn →_{a.s.} X.
Let f : Rk → R be a function.
C (f ) = {x ∈ Rk : f is continuous at x} is called the set of
continuity points of f .
f is bounded if there is a c ∈ R s.th. |f(x)| ≤ c for x ∈ R^k.
f is a Lipschitz function if there is an L ∈ R such that |f(x) − f(y)| ≤ L‖x − y‖ for all x, y ∈ R^k.
Characteristic functions
Let X be a random vector in Rk .
The function ϕ_X : R^k → C, defined by
ϕ_X(t) = E[e^{it′X}] = E[cos(t′X)] + i·E[sin(t′X)],
is called the characteristic function of X.
Here i is the imaginary unit, i.e. i² = −1.
Proposition 2.16
Let X and Y be random vectors in R^k, a ∈ R^m and B ∈ R^{m×k}.
1 ϕ_X is uniformly continuous.
2 ϕ_{a+BX}(t) = e^{ia′t} · ϕ_X(B′t) for t ∈ R^m.
3 If X and Y are independent, then ϕ_{X+Y}(t) = ϕ_X(t)·ϕ_Y(t).
4 If Z ∼ N(0, 1), then ϕ_Z(t) = e^{−t²/2}.
Lemma 2.17 X =_d Y iff ϕ_X = ϕ_Y.
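Lemma 2.17 and Proposition 2.16(4) can be illustrated by comparing the empirical characteristic function of a standard normal sample with e^{−t²/2} (a sketch; sample size and grid are arbitrary):

    import numpy as np

    rng = np.random.default_rng(3)
    z = rng.standard_normal(10**6)
    t = np.linspace(-3, 3, 7)
    # Empirical characteristic function: sample mean of exp(i t Z).
    phi_hat = np.array([np.mean(np.exp(1j * s * z)) for s in t])
    phi = np.exp(-t**2 / 2)           # phi_Z(t) for Z ~ N(0, 1)
    print(np.max(np.abs(phi_hat - phi)))   # small, up to Monte Carlo noise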
Theorem 2.19 (Cramér-Wold) Xn →_d X iff ∀t ∈ R^k: t′Xn →_d t′X.
Lemma 2.21 Xn →_P X iff Xn,j →_P Xj for j = 1, …, k.
Note that if P(An = an) = 1, then: an → a iff An →_P a.
Application: Let Xn, Yn be r.v. and an, bn, cn be real numbers.
Let Xn →_P X, Yn →_P Y, an → a, bn → b, cn → c. Then
an + bn·Xn + cn·Yn →_P a + bX + cY,
Xn·Yn →_P XY etc.
P
(ii) Let c ∈ Rm and B ∈ Rm×k . Let Xn → c with values in Rm ,
P d
Bn → B m × k-matrices, and Zn → Z with values in Rk . Then
d
Xn + Bn Zn → c + BZ .
(X1 + ⋯ + Xn − nµ) / (√n σ) →_d Z ∼ N(0, 1).
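A simulation sketch of the CLT (i.i.d. Exp(1) summands; n, the number of replications, and the seed are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(4)
    n, reps = 200, 10**5
    mu, sigma = 1.0, 1.0              # mean and sd of Exp(1)
    x = rng.exponential(scale=1.0, size=(reps, n))
    z = (x.sum(axis=1) - n*mu) / (np.sqrt(n) * sigma)   # standardized sums
    print(z.mean(), z.var())          # approx 0 and 1
    print(np.mean(z <= 1.96))         # approx Phi(1.96) = 0.975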
ϕ_X(t) = 1 + it·E[X] − (t²/2)·E[X²] + t²·R∗(t)
with R∗(t) → 0 as t → 0.
Then
(X1 + ⋯ + Xn − µ1 − ⋯ − µn) / √(σ1² + ⋯ + σn²) →_d Z ∼ N(0, 1).
X1 , . . . , Xn ∼ N(µ, σ 2 ) i.i.d.
are defined on the same sample space. µ and σ² are unknown and could be "determined" by observations. Θ = R × (0, ∞).
Note that the distribution P^{Zn} of Zn = (X1, …, Xn)′ changes with the choice of the parameter θ = (µ, σ²).
Let Zn be r.v. with values in R^{mn} defined on some sample space. Let Θ ≠ ∅ be a set and P_θ^{Zn}, θ ∈ Θ, probability measures on R^{mn}. Then (R^{mn}, B^{mn}, {P_θ^{Zn} : θ ∈ Θ}) is called a statistical experiment and Θ its parameter space. θ ∈ Θ is called a parameter.
n is usually some sample size.
Bias, Consistency
Let there be given a statistical experiment with parameter space Θ. A r.v. g(Zn), g : R^{mn} → S, Θ ⊂ S, can be called an estimator of θ.
Let T̂ , T̂n be estimators with values in Rk .
T̂ is unbiased for τ = τ(θ) if
E_θ[T̂] = τ ∀θ ∈ Θ.
T̂n is asymptotically unbiased for τ if
lim_{n→∞} E_θ[T̂n] = τ ∀θ ∈ Θ.
The sample mean X̄n = n^{−1} Σ_{i=1}^n Xi and the sample variance Sn² = (n − 1)^{−1} Σ_{i=1}^n (Xi − X̄n)² are unbiased and consistent estimators for µ = E[Xi] and σ² = Var[Xi], respectively.
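A simulation sketch of this consistency (normal data; µ, σ², and the sample sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(5)
    mu, sigma2 = 2.0, 4.0
    for n in (10**2, 10**4, 10**6):
        x = rng.normal(mu, np.sqrt(sigma2), size=n)
        s2 = x.var(ddof=1)            # sample variance with the (n-1) divisor
        print(n, x.mean(), s2)        # both approach mu = 2 and sigma^2 = 4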
Notation: If F is the CDF of Y, we write E[g(Y)] = ∫ g(y) F(dy). Let DF(R) denote the set of all CDFs on R.
Note that the parameter space of Example 2.30 can be written as
Θ = {F ∈ DF(R) : ∫ x² F(dx) < ∞};
X̄n →_P E[Xi] = µ = 1/λ for all λ > 0.
Consequently, with g(x) = 1/x, by Theorem 2.20,
Λ̂n = 1/X̄n = g(X̄n) →_P g(µ) = 1/µ = λ for all λ > 0,
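This continuous-mapping step is easy to simulate (a sketch; λ = 2 and the sample sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(6)
    lam = 2.0
    for n in (10**2, 10**4, 10**6):
        x = rng.exponential(scale=1/lam, size=n)   # numpy uses scale = 1/lambda
        lam_hat = 1.0 / x.mean()      # g(X̄_n) with g(x) = 1/x
        print(n, lam_hat)             # approaches lambda = 2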
Some Applications
Let X, X1, …, Xn be i.i.d. with values in R^k with E[X] = µ, Var[X] = Σ.
Then
(i) (1/n) Σ_{i=1}^n Xi Xi′ →_P E[XX′].   (LLN)
(ii) Σ̂n = (1/n) Σ_{i=1}^n (Xi − X̄n)(Xi − X̄n)′ = (1/n) Σ_{i=1}^n Xi Xi′ − X̄n X̄n′
→_P E[XX′] − E[X]·E[X]′ = E[(X − E[X])(X − E[X])′] = Σ.
(iii) √n · Σ̂n^{−1/2} (X̄n − µ) →_d N(0, I_k).
Notation: Xn = OP (1).
Note that for a r.v. X, in general, there is no C ∈ R s.th. P(‖X‖ ≤ C) = 1.
Notation: Zn = oP(1) iff Zn →_P 0.
Theorem 2.34
(i) Xn →_d X ⟹ Xn = OP(1).
(ii) Xn = X + oP(1) ⟹ Xn = OP(1)
(Xn, X scalar or vector or matrix).
(iii) For Xn = oP(1), Yn = oP(1), Un = OP(1), Wn = OP(1) it holds
(a) Xn + Yn = oP(1),
(b) Un + Wn = OP(1),
(c) Un · Wn = OP(1),
(d) Xn · Un = oP(1).
(iv) g : R^k → R^l continuous at x0. Then Xn →_P x0 ⟹ g(Xn) →_P g(x0).
If √n (Xn − c) →_d N(0, Σ) and φ is differentiable at c, then:
√n (φ(Xn) − φ(c)) →_d N(0, φ′(c) Σ φ′(c)′).
Summary
Definition 3.1
Each (measurable) function g that minimizes (∗) is called conditional
expectation of Y given X .
Notation:
E [Y |X ] = g (X ), E [Y |X = x] = g (x).
Uniqueness: any two minimizers g1, g2 of (∗) satisfy g1(X) = g2(X) a.s.
P(A | X ) = E (1A | X )
f_{Y|X}(y|x) := f_{X,Y}(x, y) / f_X(x) = P(X = x, Y = y) / P(X = x) = P(Y = y | X = x)
for x ∈ S_X. Then
E(Y | X = x) = g(x) = Σ_{y∈S_Y} y · f_{Y|X}(y|x)
and
P^{Y|X=x}(B) = Σ_{y∈S_Y∩B} f_{Y|X}(y|x).
Example 3.5 Suppose that X and Z are the numbers thrown in two independent dice rolls and define Y = X + Z. Then, for x = 1, …, 6,
E[Y | X = x] = x + 3.5
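The value x + 3.5 follows by enumerating the 36 equally likely outcomes (a sketch):

    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))   # (X, Z) pairs
    for x in range(1, 7):
        ys = [x + z for (x_, z) in outcomes if x_ == x]   # Y = X + Z given X = x
        print(x, sum(ys) / len(ys))   # conditional mean: x + 3.5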
y = (y1, …, yn)′, k = 0, …, n, is independent of θ.
Then
E[Y | X = x] = ∫_{−∞}^{∞} y · f_{Y|X}(y|x) dy
and
P^{Y|X=x}([a, b]) = ∫_a^b f_{Y|X}(y|x) dy.
Then it holds that
f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x) = f_X(x)·f_Y(y) / f_X(x) = f_Y(y) a.e.
Remarks:
(i)-(iv): the familiar rules for (unconditional) expectations carry over.
(v): When conditioning on X, f(X) can be treated like a constant and pulled out of the conditional expectation.
Redundant information can be dropped (vi).
Example 3.9 (Application of (vi)). A model equation for the wage as a function of education and experience.
P(|Y| ≥ ε | X) ≤ E[Y²|X] / ε² a.s.
(Jensen's inequality).
(ii) 0 ≤ Yn ↑ Y =⇒ E [Yn |X ] ↑ E [Y |X ] a.s.
(monotone convergence).
Summary
Yi = β1 Xi,1 + · · · + βK Xi,K + εi , i = 1, . . . , n.
Matrix notation:
Y = X β + ε; (*)
where Y = (Y1, …, Yn)′, X = (Xi,k) the n × K design matrix, β = (β1, …, βK)′, and ε = (ε1, …, εn)′.
E [Y |X ] = X β, Var [Y |X ] = σ 2 In .
Yi = β1 + β2 Xi,2 + ⋯ + βK Xi,K + εi, i = 1, …, n.
rank(A) = K ⟺ (Az = 0 ⟹ z = 0) ⟺ det(A′A) ≠ 0.
rank(A) < K ⟺ ∃z ≠ 0 : Az = 0.
Note that
ei = Yi − β′Xi
OLS estimator
Definition 4.2 In the classical linear regression model the OLS (ordinary least squares) estimator is defined by
β̂_OLS = (X′X)^{−1} X′Y.
Fitted values and residuals: Ŷ = X·β̂_OLS, ê = Y − Ŷ, ê_i = Yi − Ŷi.
X′ê = 0 (normal equations).
If the model contains an intercept (first column of X equal to 1n): 1n′ ê = 0.
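A minimal numpy sketch of Definition 4.2 and the normal equations (simulated data; the design, β, and the noise scale are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(7)
    n, K = 500, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])  # intercept + regressors
    Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # (X'X)^{-1} X'Y
    e_hat = Y - X @ beta_hat
    print(beta_hat)                   # close to (1, 2, -0.5)
    print(X.T @ e_hat)                # normal equations: numerically ~ 0
    print(e_hat.sum())                # intercept column => 1_n' e_hat ~ 0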
For symmetric matrices: A ≥ B :⟺ A − B ≥ 0 (positive semidefinite).
An estimator β̂ is called
linear iff β̂ = AY for some K × n matrix A = A(X)
(A may depend on X, but not on Y),
conditionally unbiased iff E[β̂|X] = β,
BLUE (best linear unbiased) iff β̂ is linear and unbiased and Var[β̃|X] − Var[β̂|X] ≥ 0 for every linear unbiased estimator β̃.
Estimation of σ 2
Definition 4.6 If n > K, the OLS estimate of the variance σ² > 0 is given by
σ̂²_OLS = σ̂²_{n,OLS} = ê′ê / (n − K)
and √(σ̂²_OLS) is called the standard error of the regression (SER).
Note: For a square matrix C = (c_{i,j})_{1≤i,j≤m}, tr(C) = Σ_{i=1}^m c_{i,i} is the trace of C. For a matrix Z = (Z_{i,j})_{1≤i,j≤m} of r.v. and matrices A ∈ R^{m×k} and B ∈ R^{k×m} it holds that E[tr(Z)] = tr(E[Z]) and tr(AB) = tr(BA).
Yi = Xi′β + εi, i = 1, …, n.
and
√n (β̂_{n,OLS} − β) →_d Z ∼ N(0_K, Σ)
with Σ = σ² (E[X1 X1′])^{−1}.
Consequently,
√n (β̂_{n,OLS,k} − βk) →_d Zk ∼ N(0, Σ_{k,k}).
Confidence intervals
Let there be given a statistical experiment with parameter space Θ and τ = τ(θ) a parameter of interest.
Definition 4.11 Let Ln, Un be r.v. (not depending on θ) and α ∈ (0, 1). [Ln, Un] is called an asymptotic (1 − α)-confidence interval for τ iff
lim_{n→∞} P_θ(Ln ≤ τ(θ) ≤ Un) = 1 − α ∀θ ∈ Θ.
Lemma 4.12 Let √n (T̂n − τ) →_d Z ∼ N(0, σ²) and Ŝn →_P σ. Then
[T̂n − z_{1−α/2}·Ŝn/√n , T̂n + z_{1−α/2}·Ŝn/√n]
is an asymptotic (1 − α)-confidence interval for τ.
Consequently,
Σ̂n = σ̂²_{n,OLS} · ((1/n)·X′X)^{−1} →_P σ² · (E[X1 X1′])^{−1} = Σ.
By Lemma 4.12,
[β̂_{n,OLS,k} − z_{1−α/2}·√(Σ̂_{n,k,k})/√n , β̂_{n,OLS,k} + z_{1−α/2}·√(Σ̂_{n,k,k})/√n]
is an asymptotic (1 − α)-confidence interval for βk.
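Continuing the numpy sketch from above (self-contained; α = 0.05, so z_{1−α/2} ≈ 1.96), an asymptotic confidence interval for each βk:

    import numpy as np

    rng = np.random.default_rng(7)
    n, K, z = 500, 3, 1.959964        # z_{1-alpha/2} for alpha = 0.05
    X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
    Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    e_hat = Y - X @ beta_hat
    sigma2_hat = (e_hat @ e_hat) / (n - K)              # Definition 4.6
    Sigma_hat = sigma2_hat * np.linalg.inv(X.T @ X / n) # estimate of Sigma
    se = np.sqrt(np.diag(Sigma_hat) / n)                # sqrt(Sigma_hat_kk)/sqrt(n)
    for k in range(K):
        print(k, beta_hat[k] - z * se[k], beta_hat[k] + z * se[k])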
Remark:
Tests to decide (a) are called parameter tests.
Tests to decide (b) are called goodness-of-fit tests.
Statistical tests
H0 : θ = θ0 vs. H1 : θ ≠ θ0.
Then Tn = g(Zn) is called the test statistic and c the critical value of the test ϕ.
Decision scheme:
               decision for H0      decision for H1
H0 is true     correct              type I error
H1 is true     type II error        correct
Yi = β1 Xi,1 + · · · + βK Xi,K + εi , i = 1, . . . , n,
H0 : βj = 0 or H0 : β2 = · · · = βK = 0.
H0 : Rβ = θ0 vs. H1 : Rβ ≠ θ0
Example 4.19
This general hypothesis covers several interesting special cases.
1 H0 : β k = 0
with R = (0, …, 0, 1, 0, …, 0) (the 1 in the k-th position) and θ0 = 0.
2 H0 : β 1 = β 2
with R = (1, −1, 0, . . . , 0) and θ0 = 0.
3 H0 : β 1 + β 2 + β 3 = 1
with R = (1, 1, 1, 0, . . . , 0) and θ0 = 1.
4 H0 : β 1 = β 2 = β 3 = 0
with
R = [ 1 0 0 0 … 0
      0 1 0 0 … 0
      0 0 1 0 … 0 ]  and θ0 = (0, 0, 0)′.
Ideas to proceed:
Put τ = Rβ − θ0, i.e. H0 is true iff τ = 0_r (equivalently, ‖τ‖ = 0).
Estimate τ by τ̂n = R·β̂_{n,OLS} − θ0.
Find some distance function d : R^r → [0, ∞) s.th. d(0_r) = 0 and d(x) is "large" if ‖x‖ is "large".
Decision rule: Reject H0 if Tn = d(τ̂n) > c (is "large").
Determine c s.th. the decision rule becomes an α-test.
For the last step we need to specify the distribution of d(τ̂n ), at least
approximately.
χ2 -distribution
Definition 4.20 X∗ is χ2 -distributed with k degrees of freedom,
k ∈ N, k ≥ 1, if X∗ is continuous with density
f_{χ²_k}(x) = (1 / (2^{k/2} Γ(k/2))) · x^{k/2−1} · exp(−x/2) · 1_[0,∞)(x),
where Γ denotes the so-called Gamma function, defined by
Γ(a) = ∫_0^∞ x^{a−1} e^{−x} dx, a > 0.
Let F_{χ²_k} denote the CDF of X∗ and χ²_{k,α} its α-quantile, α ∈ (0, 1), i.e.
F_{χ²_k}(χ²_{k,α}) = α.
φ(z) = (1 / √((2π)^k · det Σ)) · exp(−0.5·(z − µ)′ Σ^{−1} (z − µ)), z ∈ R^k.
Proposition 4.21
Let Z ∼ N(µ, Σ), a ∈ R^m and B ∈ R^{m×k}. Then:
1 a + BZ ∼ N(a + Bµ, BΣB′).
2 Z = (Z1, …, Zk)′ ∼ N(0_k, I_k) iff Z1, …, Zk ∼ N(0, 1) are i.i.d.
3 If µ = 0 and Σ is invertible, then Z′Σ^{−1}Z ∼ χ²(k).
Convergence to infinity
Let (Zn)n be r.v. with values in R^k. (Zn) converges in probability to ∞ if for all C > 0:
P(‖Zn‖ > C) → 1 as n → ∞.
Wald test for H0 : Rβ = θ0
Put τ̂n = R·β̂_{n,OLS} − θ0 and
Wn = n · τ̂n′ (R Σ̂n R′)^{−1} τ̂n.
Under H0, Wn →_d χ²(r) (by Proposition 4.21 and Slutsky); reject H0 if Wn > χ²_{r,1−α}.
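A sketch of this Wald test for H0 : β2 = β3 = 0 in the simulated model from the OLS example (R and θ0 as in Example 4.19; the χ² quantile is taken from scipy):

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(7)
    n, K = 500, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
    Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    e_hat = Y - X @ beta_hat
    Sigma_hat = (e_hat @ e_hat) / (n - K) * np.linalg.inv(X.T @ X / n)
    R = np.array([[0.0, 1.0, 0.0],    # H0: beta_2 = beta_3 = 0
                  [0.0, 0.0, 1.0]])
    tau_hat = R @ beta_hat            # theta_0 = 0
    W = n * tau_hat @ np.linalg.solve(R @ Sigma_hat @ R.T, tau_hat)
    print(W, chi2.ppf(0.95, df=2))    # reject H0 if W exceeds the 0.95 quantile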
Summary