Basic Statistics and Probability For Econometrics (Econ 270a)
Contents

I. Common Distributions
II. Moments of a Distribution and MGF's
   1. Moments
   2. Moment Generating Functions: M_X(t) = E(e^{tX}) and E[X^n] = M_X^(n)(0), where M_X^(n)(t) = ∂^n M_X(t)/∂t^n
   3. Moments of Common Distributions
III. Location and Scale Families
IV. Expectation, Variance of a R.V.
   A. Expectation, Single Variable
   B. Variance and Std Dev
V. Covariance and Correlation between RV's
   A. Covariance
   B. Correlation
VI. Independence and Mean Independence
VII. Inequalities
   0. Markov's
   1. Chebychev's
   2. Jensen's
   3. Holder's
   4. Cauchy-Schwarz Inequality (special case of Holder's)
   5. Minkowski's
VIII. Order Statistics: the "ordered" statistic (e.g. min/max/median of an iid sample has a distribution)
   1. PDF of the j-th Order Statistic
   2. Sample Median: robust (not sensitive to outliers); the sample mean is not robust
IX. Modes of Convergence (of a sequence of RV's)
   1. Converges to Y almost surely (a.k.a. with probability 1): P(lim_{n→∞} Y_n = Y) = 1
   2. Converges to Y in probability: ∀ε > 0, P(|Y_n − Y| > ε) → 0 as n → ∞
   3. Converges to Y in L^p: E(|Y_n − Y|^p) → 0 as n → ∞
   4. Converges to Y in distribution: F_n(x) → F(x) as n → ∞ at points x where F is continuous (F_n the cdf of Y_n, F the cdf of Y)
   5. O_p and o_p and Modes of Convergence
X. Laws of Large Numbers
   1. Chebychev's Weak Law of Large Numbers
   2. Kolmogorov's Second Strong Law of Large Numbers
   3. Ergodic Theorem
   4. Uniform Law of Large Numbers
   5. LLN for Covariance-Stationary Processes with Vanishing Autocovariances
   6. LLN for Vector Covariance-Stationary Processes with Vanishing Autocovariances
XI. Central Limit Theorems
   1. Lindeberg-Levy CLT
   2. Billingsley (Ergodic Stationary Martingale Differences) CLT
   3. General CLT (for niid)
   4. CLT for MA(∞)
   5. MV CLT for MA(∞)
XII. Trilogy of Theorems (what do we know about the limiting distribution of a sequence of random variables?)
   1. Slutsky's Theorem (general): convergence in distribution results
   2. Continuous Mapping Theorem (general): convergence in probability and distribution results
   3. Delta Method: convergence in distribution results
XIII. Properties of Univariate, Bivariate, Multivariate Normal
XIV. Change of Variables: Univariate, Bivariate, Multivariate Transformations of PDF
   1. Univariate
   2. Bivariate
   3. Tri-Variate
   4. Multivariate
   5. Useful Change of Variables Formulas
XV. Probability Theory
   1. Definitions: Probability Measure, Sigma Algebra, Borel Fields
   2. Probability Space, Random Variables, and Measurability
   3. Conditional Expectations and Law of Iterated Expectations
XVI. Matrix Algebra Topics
   a. Rank of a Matrix
   b. Projection Matrices
   c. Positive (Semi)Definite / Negative (Semi)Definite
   d. Singularity, Positive Definite vs. Non-Singular (Invertible)
   e. Trace
   f. Inverting 2x2, 3x3
   g. Determinants
   h. Differentiating wrt Vectors
   i. Transpose
   j. Matrix Multiplication Properties
   k. Random Properties
XVII. Miscellaneous
   a. Measurement Error and MSE
   b. Approximation Method: Propagation of Error / Delta Method
I. Common Distributions
For each distribution: interpretation, E(X), Var(X), MGF M(t) = E(e^{tX}), pmf/pdf, and likelihood.

- Binomial(n,p): k successes in n Bernoulli trials. E(X) = np; Var(X) = np(1 − p); MGF = (1 − p + pe^t)^n; P(X = k) = C(n,k) p^k q^{n−k}; likelihood L(π) = C(n,k) π^k (1 − π)^{n−k}.
- Bernoulli(p): probability of success in one trial. E(X) = p; Var(X) = p(1 − p); MGF = 1 − p + pe^t; P(X = x) = p^x (1 − p)^{1−x} if x = 0 or x = 1, 0 o.w.
- Geom(p): number of trials N until the 1st success. E(X) = 1/p; Var(X) = (1 − p)/p²; MGF = pe^t / (1 − (1 − p)e^t); P(X = n) = p(1 − p)^{n−1}; likelihood L(π) = π(1 − π)^{n−1}.
- Neg Bin(r,p): number of trials N until the r-th success (generalization of the geometric; sum of r independent geometric RV's). E(X) = r/p; Var(X) = r(1 − p)/p²; MGF = [pe^t / (1 − (1 − p)e^t)]^r; P(X = n) = C(n−1, r−1) p^r q^{n−r}; likelihood L(π) = C(N−1, r−1) π^r (1 − π)^{N−r}.
- Poisson(λ): limit of a binomial distribution as n → ∞ with np → λ. E(X) = λ; Var(X) = λ; MGF = e^{λ(e^t − 1)}; P(X = k) = λ^k e^{−λ}/k!; likelihood L(λ) = ∏_i λ^{x_i} e^{−λ}/x_i!.
- Cauchy(μ,σ): no moments! pdf f(x) = 1 / (πσ[1 + ((x − μ)/σ)²]).
- Chi-Squared(p): sum of p iid Z², Z ~ N(0,1). E(X) = p; Var(X) = 2p; MGF = (1 − 2t)^{−p/2}; pdf f(x) = [(1/2)^{p/2} / Γ(p/2)] x^{p/2−1} e^{−x/2}. Note: the sum of p independent χ²(df_1),...,χ²(df_p) is χ²(df_1 + ... + df_p).

Other Important Distributions
- t-Distribution: If Z ~ N(0,1) and C ~ χ²(q) are independent, then Z / √(C/q) ~ t_q.
  (So [√n(X̄ − μ)/σ] / √(S²/σ²) = √n(X̄ − μ)/S ~ t_{n−1}.)
- F-Distribution: Let C₁ ~ χ²(p) and C₂ ~ χ²(q) be independent; then (C₁/p)/(C₂/q) ~ F_{p,q}.
  (So [√n(X̄ − μ)/σ]² / (S²/σ²) = n(X̄ − μ)²/S² ~ F_{1,n−1}.)
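Several rows of this table, and the t and F constructions, can be spot-checked by simulation. A minimal sketch (added here, not part of the original notes), assuming numpy and scipy are available:

```python
# Sketch: verify the table's moments and the t / F constructions numerically.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 10, 0.3

# Binomial: E(X) = np, Var(X) = np(1-p)
assert np.isclose(stats.binom.mean(n, p), n * p)
assert np.isclose(stats.binom.var(n, p), n * p * (1 - p))

# t: Z / sqrt(C/q) ~ t_q, built from a standard normal and an independent chi-square
q = 5
z = rng.standard_normal(200_000)
c = rng.chisquare(q, 200_000)
t_draws = z / np.sqrt(c / q)
print(np.mean(t_draws > 2), 1 - stats.t.cdf(2, df=q))  # approximately equal

# F: (C1/p)/(C2/q) ~ F_{p,q}
p_df, q_df = 3, 8
f_draws = (rng.chisquare(p_df, 200_000) / p_df) / (rng.chisquare(q_df, 200_000) / q_df)
print(np.mean(f_draws > 2.5), 1 - stats.f.cdf(2.5, p_df, q_df))  # approximately equal
```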
II. Moments of a Distribution and MGF's
1. Moments:
Note: Mostly we care about the first 4 moments to summarize the distribution of a r.v.: the 1st moment gives the mean, the 2nd central moment gives the variance, and the 3rd and 4th (standardized central) moments give skewness and kurtosis.
2. Moment Generating Functions: M_X(t) = E(e^{tX}) and E[X^n] = M_X^(n)(0), where M_X^(n)(t) = ∂^n M_X(t)/∂t^n.
MGF of a Sample Average (of a random sample): M_X̄(t) = M_{(1/N)Σ X_i}(t) = ∏_{i=1}^N M_{X_i}(t/N).
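The "differentiate the MGF at 0" recipe can be checked symbolically. A minimal sketch (not from the notes), assuming sympy is available and using the Exponential(λ) MGF λ/(λ − t):

```python
# Sketch: recover E[X^n] = M^(n)(0) symbolically for X ~ Exponential(lam),
# whose MGF is lam/(lam - t) for t < lam.
import sympy as sp

t, lam = sp.symbols("t lam", positive=True)
M = lam / (lam - t)  # MGF of Exponential(lam)

for n in range(1, 5):
    moment = sp.diff(M, t, n).subs(t, 0)
    print(n, sp.simplify(moment))  # 1/lam, 2/lam**2, 6/lam**3, 24/lam**4
```

The printed values match the Exponential(λ) column of the moments table below.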
3. Moments of Common Distributions
(k-th raw moments E(X^k), k = 1,...,4; the Normal column is for N(0,σ²).)

k | N(0,σ²) | Uniform(0,θ) | Exponential(λ)
1 | 0       | θ/2          | 1/λ
2 | σ²      | θ²/3         | 2/λ²
3 | 0       | θ³/4         | 6/λ³
4 | 3σ⁴     | θ⁴/5         | 24/λ⁴
III. Location and Scale Families
X, Y in the same Location Family ⇒ there exists some m s.t. X = Y + m. (You can get from one to the other by adding/subtracting a constant.)
X, Y in the same Scale Family ⇒ there exists some "standard" r.v. Z and some s₁, s₂ s.t. X = s₁Z and Y = s₂Z. (You can get from one to the other by multiplying by a constant.)
B. Correlation
ρ_XY = Cov(X,Y) / √(Var(X)Var(Y)), with ρ_XY ∈ [−1,1], and ρ_XY = 1 or −1 iff Y = a + bX (i.e. X and Y are linear transformations of each other).
VI. Independence and Mean Independence
Independence
Def: X ⊥ Y if the joint distribution factors: f_XY(x,y) = f_X(x)f_Y(y), i.e. P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B) for all A, B.
Properties of Independence:
• X ⊥ Y ⇒ f(X) ⊥ g(Y) for any functions f, g
• X ⊥ (W, Y, Z) ⇒ X ⊥ any subset of (W, Y, Z)
• But X ⊥ W, X ⊥ Y, X ⊥ Z does not imply X ⊥ (W, Y, Z) (see 271(a) HW1 #1(c))
• X ⊥ Y ⇒ E(XY) = E(X)E(Y), i.e. Cov(X,Y) = 0 (THE REVERSE IMPLICATION IS NOT TRUE EXCEPT FOR JOINTLY NORMAL RV'S)
  Special case: for jointly normal X, Y: Cov(X,Y) = 0 ⇒ X ⊥ Y.
• X ⊥ Y ⇒ E(Y|X) = E(Y) and E(X|Y) = E(X) (mean independence)
Mean Independence
Def: X is mean independent of Y if E(X|Y) = E(X).
Implications: X ⊥ Y ⇒ X mean ind. Y and Y mean ind. X
X mean ind. Y ⇒ E[X | g(Y)] = E(X)
              ⇒ Cov(X, Y) = 0
              ⇒ Cov(X, g(Y)) = 0
X mean ind. Y does not imply Y mean ind. X
              does not imply f(X) mean ind. Y
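A standard counterexample for "Cov = 0 does not imply independence" is X ~ N(0,1) with Y = X²: uncorrelated, yet Y is a function of X. A hedged simulation sketch (not from the notes), assuming numpy:

```python
# X ~ N(0,1) and Y = X^2: Cov(X, Y) = E(X^3) = 0, yet Y is a function of X.
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1_000_000)
y = x**2

print(np.cov(x, y)[0, 1])     # ~ 0: uncorrelated
# Dependence shows up through nonlinear functions: Cov(X^2, Y) = Var(X^2) > 0
print(np.cov(x**2, y)[0, 1])  # ~ 2: clearly not independent
```

This pair also illustrates the asymmetry of mean independence above: E(X|Y) = 0 = E(X), so X is mean independent of Y, but E(Y|X) = X² ≠ E(Y), so Y is not mean independent of X.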
VII. Inequalities
0. Markov's
Let X be a nonnegative RV. Then for any t > 0: P(X ≥ t) ≤ E(X)/t.
1. Chebychev's
Let X have mean μ and finite variance. Then for any ε > 0: P(|X − μ| ≥ ε) ≤ Var(X)/ε². (Apply Markov's to (X − μ)².)
2. Jensen's
Let g be a convex function. Then E[g(X)] ≥ g(E(X)) (inequality reversed for concave g).
Useful For: relating moments of transformations to transformations of moments (e.g. E(X²) ≥ [E(X)]²).
3. Holder's
Let X, Y be RV's, and p, q > 1 s.t. 1/p + 1/q = 1. Then |E(XY)| ≤ E|XY| ≤ (E|X|^p)^{1/p} (E|Y|^q)^{1/q}.
Useful For: bounding expected values involving 2 RV's using the moments of the individual RV's.
4. Cauchy-Schwarz Inequality (special case of Holder's with p = q = 2)
|E(XY)| ≤ E|XY| ≤ √(E(X²)E(Y²)).
5. Minkowski's
Let X, Y be RV's. Then, for 1 ≤ p < ∞: [E|X + Y|^p]^{1/p} ≤ [E|X|^p]^{1/p} + [E|Y|^p]^{1/p}.
Useful For: if X and Y have finite p-th moment, then so does X + Y.
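A quick numeric sanity check of these bounds (added here, not from the notes), assuming numpy, using Exponential(1) draws where E(X) = Var(X) = 1:

```python
# Sketch: spot-check Markov, Chebychev, and Jensen on Exponential(1) draws.
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(1.0, 1_000_000)

t = 3.0
print(np.mean(x >= t), 1 / t)                     # Markov: P(X >= 3) <= 1/3
eps = 2.0
print(np.mean(np.abs(x - 1) >= eps), 1 / eps**2)  # Chebychev: <= Var/eps^2
print(np.mean(x**2), np.mean(x) ** 2)             # Jensen: E(X^2) >= (E X)^2
```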
VIII. Order Statistics: The “ordered” statistic (e.g. min/max/median of an iid sample has a distribution)
Motivation: Suppose we have an iid normally distributed sample of n observations. How do we find the distribution of the max of the n-sample?
1. PDF of the j-th Order Statistic:
Let X_(1),...,X_(n) denote the order statistics of an iid sample X_1,...,X_n with cdf F_X(x) and pdf f_X(x). Then the pdf of the j-th order statistic is:
f_{X_(j)}(x) = n! / [(j − 1)!(n − j)!] · f_X(x) [F_X(x)]^{j−1} [1 − F_X(x)]^{n−j}
2. Sample Median: robust (not sensitive to outliers). Note: the sample mean is not robust.
• Population median = F⁻¹(0.5) ⇒ at this point, 50% of the population is less than the value ⇒ it's the "middle" observation.
• The population median need not be unique, but for this course we assume it is uniquely defined.
• Asymptotic distribution of the sample median: let X_1,...,X_n be iid with density f and median θ. If f is continuous at θ with f(θ) > 0 (i.e. positive density at the median), then
√n(X̃_n − θ) →d N(0, 1/[4f²(θ)]), or approximately X̃_n ~ N(θ, 1/[4nf²(θ)]), where X̃_n is the sample median.
• We can compute the asymptotic relative efficiency (between sample mean and sample median) = ratio of asymptotic variances.
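The asymptotic variance formula can be checked by simulation. A sketch (not from the notes), assuming numpy, for N(0,1) data where θ = 0 and 1/[4f²(0)] = 2π/4 = π/2:

```python
# Sketch: for N(0,1) data, Var(sample median) ~ pi/(2n), so n * Var ~ pi/2.
import numpy as np

rng = np.random.default_rng(3)
n, reps = 400, 20_000
medians = np.median(rng.standard_normal((reps, n)), axis=1)

print(n * medians.var(), np.pi / 2)  # both approximately 1.57
```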
IX. Modes of Convergence (of a sequence of RV's)
1. Y_n converges to Y almost surely (a.k.a. with probability 1) if P(lim_{n→∞} Y_n = Y) = 1.
2. Y_n converges to Y in probability if ∀ε > 0, P(|Y_n − Y| > ε) → 0 as n → ∞, i.e. P(|Y_n − Y| < ε) → 1.
Note: We write Y_n →p Y, or equivalently Y_n − Y = o_p(1).
Note2: If n^q(Y_n − Y) →p 0, then we write Y_n − Y = o_p(n^{−q}), since Y_n − Y goes to 0 faster than n^{−q} goes to 0.
Note3: Y_n is a consistent estimator of θ if Y_n converges to θ in probability.
Note4: Y_n is a super-consistent estimator of θ if it converges fast enough that n^{1/2}(Y_n − θ) →p 0, i.e. Y_n − θ = o_p(n^{−1/2}).
Note5: Convergence in probability does not imply asymptotic unbiasedness.
3. Y_n converges to Y in L^p if E(|Y_n − Y|^p) → 0 as n → ∞.
Note: →L^p ⇒ →L^q for p ≥ q > 0.
Note2: We normally care about L² because L² convergence ⇒ L¹ convergence ⇒ convergence in probability. To show L² convergence, i.e. convergence of MSE, it is enough to show Var → 0 and Bias → 0!
4. Y_n converges to Y in distribution if F_n(x) → F(x) as n → ∞ at all points x where F is continuous, where F_n is the cdf of Y_n and F is the cdf of Y.
(Meaning: in the limit the marginal distributions are the same, i.e. pointwise convergence of the sequence of cdf's to F. But this says nothing about the inter-dependence between the variables: Y_n and Y could be completely different functions, or have a correlation of −1, yet have the same cdf. Only the cdf's converge; the random variables themselves do not necessarily converge.)
5. O_p and o_p and Modes of Convergence:
Def: X_n = O_p(a_n) iff X_n/a_n is bounded in probability.
Def: X_n = o_p(a_n) iff plim X_n/a_n = 0.
Interpretations:
X_n = o_p(1) ⇔ X_n →p 0
X_n = O_p(1) ⇔ X_n bounded in probability ⇔ ∀ε > 0, ∃M < ∞ s.t. P(|X_n| ≥ M) < ε for all n
X_n = o_p(Y_n) ⇔ X_n/Y_n = o_p(1) (X_n goes to 0 in probability faster than Y_n)
X_n = O_p(Y_n) ⇔ X_n/Y_n = O_p(1) (X_n is at most of the same stochastic order as Y_n)
Properties:
O_p(1) · o_p(1) = o_p(1)
O_p(1) + o_p(1) = O_p(1)
CLT: if E|X|² < ∞, then X̄_n = μ + O_p(n^{−1/2})
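The O_p(n^{−1/2}) rate can be seen directly: √n(X̄_n − μ) stays bounded in probability as n grows. A sketch (not from the notes), assuming numpy:

```python
# Sketch: the 95% quantile of |sqrt(n)(X̄_n - mu)| stabilizes (~1.96 here,
# since Exponential(1) has sigma = 1 and the limit is N(0,1)).
import numpy as np

rng = np.random.default_rng(4)
for n in [100, 1_000, 10_000]:
    x = rng.exponential(1.0, (2_000, n))      # mu = 1
    dev = np.sqrt(n) * (x.mean(axis=1) - 1.0)
    print(n, np.quantile(np.abs(dev), 0.95))  # roughly constant across n
```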
X. Laws of Large Numbers
1. Chebychev's Weak Law of Large Numbers: Let z_1,...,z_n be a sequence of iid RV's with E(z_i) = μ and Var(z_i) = σ². Then z̄_n ≡ (1/n) Σ_{i=1}^n z_i →p μ.
(This follows since Bias² → 0 and Var → 0, so by Chebychev's inequality we have convergence in probability.)
(Again, in o_p notation: z̄_n = μ + o_p(1).)
2. Kolmogorov's Second Strong Law of Large Numbers: Let {z_i}_{i=1}^n be iid with E(z_i) = μ. Then z̄_n ≡ (1/n) Σ_{i=1}^n z_i →a.s. μ.
(Unlike above, we now need no assumption about the existence of the second moment or variance.)
3. Ergodic Theorem: Let {z_i}_{i=1}^n be a stationary and ergodic process with E(z_i) = μ. Then z̄_n ≡ (1/n) Σ_{i=1}^n z_i →a.s. μ.
(This generalizes Kolmogorov's.)
4. Uniform Law of Large Numbers: Under regularity conditions, (1/n) Σ z_t(θ) converges uniformly in θ (the parameter) to E[z_t(θ)].
5. LLN for Covariance-Stationary Processes with Vanishing Autocovariances: Let {y_t} be covariance stationary with mean μ and autocovariances {γ_j}. Then:
(a) ȳ = (1/T) Σ_{t=1}^T y_t →m.s./L² μ if lim_{j→∞} γ_j = 0
(b) lim_{n→∞} Var(√n ȳ) = Σ_{j=−∞}^∞ γ_j < ∞ if {γ_j} is summable
(Note: we also call this limit the long-run variance of the covariance stationary process; it can be expressed from the autocovariance-generating function as g_Y(1). We can think of the sample as being generated from an infinite sequence of random variables, which is covariance stationary; so the "long-run" variance is the sum of covariances from any one element in the sequence to all the other elements.)
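As a worked example (added here, not in the original notes): for a stationary AR(1), y_t = φy_{t−1} + ε_t with |φ| < 1 and Var(ε_t) = σ², the autocovariances are γ_j = σ²φ^{|j|}/(1 − φ²), so γ_j → 0 and {γ_j} is summable, and the long-run variance is
Σ_{j=−∞}^∞ γ_j = [σ²/(1 − φ²)] · (1 + 2Σ_{j=1}^∞ φ^j) = [σ²/(1 − φ²)] · (1 + φ)/(1 − φ) = σ²/(1 − φ)².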
6. LLN for Vector Covariance-Stationary Processes with Vanishing Autocovariances:
Let {y_t} be vector covariance-stationary with mean μ and autocovariance matrices {Γ_j}. Then:
(a) ȳ = (1/T) Σ_{t=1}^T y_t →m.s./L² μ if the diagonal elements of Γ_j → 0 as j → ∞
(b) lim_{n→∞} Var(√n ȳ) = Σ_{j=−∞}^∞ Γ_j < ∞ if {Γ_j} is summable (i.e. each component of Γ_j is summable)
(Note: we also call this limit the long-run covariance matrix of the vector covariance-stationary process; it can be expressed from the multivariate AGF: G_Y(1) = Σ_{j=−∞}^∞ Γ_j = Γ_0 + Σ_{j=1}^∞ (Γ_j + Γ_j').)
(In a vector process, the diagonal elements of {Γ_j} are the autocovariances and the off-diagonal elements are the covariances between the lagged values of the elements of the vector. For example, let y_t = [x_t, z_t]'. Then
Γ_j = Cov(y_t, y_{t−j}) = E(y_t y_{t−j}') − E(y_t)E(y_{t−j}') = [Cov(x_t, x_{t−j})  Cov(x_t, z_{t−j}); Cov(z_t, x_{t−j})  Cov(z_t, z_{t−j})].)
XI. Central Limit Theorems
1. Lindeberg-Levy CLT: Let {z_i} be iid with E(z_i) = μ and Var(z_i) = σ². Then √n(z̄_n − μ) →d N(0, σ²).
2. Billingsley (Ergodic Stationary Martingale Differences) CLT: Let {g_i} be a vector martingale difference sequence that is stationary and ergodic with E(g_i g_i') = Σ, and let ḡ ≡ (1/n) Σ_{i=1}^n g_i. Then √n ḡ = (1/√n) Σ_{i=1}^n g_i →d N(0, Σ).
(Since {g_i} is stationary, the matrix of cross moments does not depend on i. Also, we implicitly assume that all the cross moments exist and are finite.)
3. General CLT (for niid):
Let {y_t} be a sequence of niid (independent, non-identically distributed) RV's s.t. E(y_t) = 0, Var(y_t) = σ_t², and let σ̄_T² = (1/T) Σ_{t=1}^T σ_t².
If E|y_t|^{2+δ} < ∞ ∀t for some δ > 0, then
√T · [(1/T) Σ_{t=1}^T y_t] →d N(0, plim (1/T) Σ_{t=1}^T σ_t²) = N(0, plim σ̄_T²).
4. CLT for MA(∞) (Billingsley generalizes Lindeberg-Levy to stationary and ergodic mds; now we generalize for serial correlation):
Let y_t = μ + Σ_{j=0}^∞ ψ_j ε_{t−j}, where {ε_t} is iid white noise and Σ_{j=0}^∞ |ψ_j| < ∞. Then
√n(ȳ − μ) →d N(0, Σ_{j=−∞}^∞ γ_j).
5. MV CLT for MA(∞):
Let y_t = μ + Σ_{j=0}^∞ Ψ_j ε_{t−j}, where {ε_t} is vector iid white noise (so {y_t} is jointly covariance stationary) and {Ψ_j} is absolutely summable. Then
√n(ȳ − μ) →d N(0, Σ_{j=−∞}^∞ Γ_j).
XII. Trilogy of Theorems (what do we know about the limiting distribution of a sequence of random variables?):
1. Slutsky's Theorem (general): convergence in distribution results
If Y_n →d Y and A_n →p a, B_n →p b for a, b non-random constants, then A_nY_n + B_n →d aY + b.
o (vector): x_n →d x, y_n →p α ⇒ x_n + y_n →d x + α
o (vec/mat): x_n →d x, A_n →p A ⇒ A_n x_n →d Ax (provided that the matrix multiplication is conformable)
2. Continuous Mapping Theorem (general): convergence in probability and distribution results
o If Y_n →d Y and g is a continuous function, then g(Y_n) →d g(Y) (and likewise with →p in place of →d).
3. Delta Method: convergence in distribution results (similar to the CMT, but for smooth nonlinear g via linearization)
√n(x_n − β) →d N(0, Σ) ⇒ √n(a(x_n) − a(β)) →d N(0, A(β)ΣA(β)'), where A(β) is the Jacobian of a(·) at β.
Why? For g non-linear, we linearize by a Taylor approximation about μ to the first order:
Y = g(X) ≈ g(μ_X) + g'(μ_X)(X − μ_X) ⇒ E(Y) ≈ g(μ_X), Var(Y) ≈ [g'(μ_X)]² Var(X). Then apply Slutsky's. (Check page 244 of Casella-Berger for the proof.)
Second-order version (for a better mean approximation, or when g'(μ_X) = 0): expand to the second order,
Y = g(X) ≈ g(μ_X) + (X − μ_X)g'(μ_X) + ½(X − μ_X)²g''(μ_X)
⇒ E(Y) ≈ g(μ_X) + ½g''(μ_X)E[(X − μ_X)²] = g(μ_X) + ½g''(μ_X)Var(X)
(the first-order term drops out in expectation since E(X − μ_X) = 0), and when g'(μ_X) = 0,
Var(Y) ≈ Var(½(X − μ_X)²g''(μ_X)) = ¼[g''(μ_X)]² Var((X − μ_X)²).
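A Monte Carlo check of the delta method (added here, not in the original notes), assuming numpy, with the illustrative choice a(x) = log(x) and Exponential data:

```python
# Sketch: for x̄_n from Exponential(beta) draws (mean beta, variance beta^2),
# the delta method gives sqrt(n)(log(x̄_n) - log(beta)) ->d N(0, beta^2 (1/beta)^2) = N(0, 1).
import numpy as np

rng = np.random.default_rng(5)
beta, n, reps = 2.0, 2_000, 5_000
x = rng.exponential(beta, (reps, n))
stat = np.sqrt(n) * (np.log(x.mean(axis=1)) - np.log(beta))

print(stat.var(), beta**2 * (1 / beta) ** 2)  # both approximately 1.0
```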
XIII. Properties of Univariate, Bivariate, Multivariate Normal
1. PDF: f(x) = (2π)^{−p/2} |Σ|^{−1/2} exp{−½(x − μ)'Σ⁻¹(x − μ)}, x ∈ ℝ^p, μ = E(x), and Σ_ij = Cov(X_i, X_j)
2. Mutual Independence: if X_1,...,X_n are jointly normal, then X_i, X_j are independent iff Cov(X_i, X_j) = 0 for all i ≠ j.
3. Linear Transformation of MVN: Let X ~ N_p(μ, Σ), and let A ∈ ℝ^{q×p} and b ∈ ℝ^q, where A has full row rank (q ≤ p). Then Y = AX + b ~ N_q(Aμ + b, AΣA').
4. Conditional Distributions
Bivariate case: if
[X; Y] ~ N₂([μ_X; μ_Y], [σ_X²  σ_XY; σ_XY  σ_Y²]),
then Y | X = x ~ N(μ_Y + ρ(σ_Y/σ_X)(x − μ_X), σ_Y²(1 − ρ²)) = N(μ_Y + (σ_XY/σ_X²)(x − μ_X), σ_Y²(1 − ρ²)).
5. Functions of Normals
X, Y jointly normal, a, b constants ⇒ aX + bY ~ N(aμ_X + bμ_Y, a²σ_X² + b²σ_Y² + 2abσ_XY)
6. Distribution of the Mahalanobis Distance: Let X ~ N_m(μ, Σ) for some vector μ (m × 1) and some covariance matrix Σ (m × m). Then (X − μ)'Σ⁻¹(X − μ) ~ χ²_m.
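The Mahalanobis result is easy to verify by simulation. A sketch (not from the notes), assuming numpy and scipy:

```python
# Sketch: (X - mu)' Sigma^{-1} (X - mu) for X ~ N_2(mu, Sigma) should be chi2_2.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
mu = np.array([1.0, -2.0])
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])

x = rng.multivariate_normal(mu, sigma, size=200_000)
d = x - mu
m2 = np.einsum("ij,jk,ik->i", d, np.linalg.inv(sigma), d)  # quadratic form per row

print(m2.mean(), m2.var())                                 # ~ 2 and ~ 4 (chi2_2 moments)
print(np.mean(m2 > 5.99), 1 - stats.chi2.cdf(5.99, df=2))  # ~ 0.05 both
```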
XIV. Change of Variables: Univariate, Bivariate, Multivariate Transformations of PDF
Things to check: 1. Is the function 1-1 over the domain? 2. Are there limits to the values of the transformed variable?
1. Univariate:
Let Y = g(X), and let A_0, A_1,...,A_k partition the support of X with P(X ∈ A_0) = 0 and g 1-1 on each A_i (i ≥ 1), with inverse g_i⁻¹ on Y_i = g(A_i). Then the PDF of Y is:
f_Y(y) = Σ_{i=1}^k f_Y^(i)(y), where f_Y^(i)(y) = f_X(g_i⁻¹(y)) |d g_i⁻¹(y)/dy| if y ∈ Y_i, and 0 if y ∉ Y_i.
(Note: this is the most general case. If g is monotone and g⁻¹ is continuously differentiable on the whole domain of X, then there is no need to partition.)
2. Bivariate:
Given (X,Y) a continuous random vector with joint pdf f_XY, the joint pdf of (U,V), where U = f(X,Y) and V = g(X,Y), can be expressed in terms of f_XY(x,y).
3. Tri-Variate:
Given (X,Y,Z) a continuous random vector with joint pdf f_XYZ, the joint pdf of (U,V,W), where U = f(X,Y,Z), V = g(X,Y,Z), W = h(X,Y,Z), can be expressed in terms of f_XYZ(x,y,z).
4. Multivariate:
Let (X_1,...,X_n) be a random vector with pdf f_X(x_1,...,x_n). Let A = {x : f_X(x) > 0} be the support of f_X.
Consider a new random vector (U_1,...,U_n) s.t. U_1 = g_1(X), ..., U_n = g_n(X).
Suppose that A_0,...,A_k form a partition of A such that:
a) P((X_1,...,X_n) ∈ A_0) = 0 (A_0 may be empty)
b) the transformation (U_1,...,U_n) = (g_1(X),...,g_n(X)) is a 1-1 transformation from A_i onto B for each i = 1,...,k (so the inverse function is well defined).
Let the i-th inverse give, for each (u_1,...,u_n) ∈ B, the unique (x_1,...,x_n) ∈ A_i s.t. (u_1,...,u_n) = (g_1(x_1,...,x_n),...,g_n(x_1,...,x_n)), i.e. x_j = g_{ji}⁻¹(u). Then:
f_U(u_1,...,u_n) = Σ_{i=1}^k f_X(g_{1i}⁻¹(u),...,g_{ni}⁻¹(u)) |J_i|, where J_i is the Jacobian determinant of the i-th inverse:
J_i = det[∂g_{ji}⁻¹(u)/∂u_m]_{j,m=1,...,n} (the Jacobian of the inverse transformation, ∂x/∂u)
(Note: no need to partition if the functions are 1-1 transformations on the whole space; then the inverses exist uniquely and are differentiable.)
5. Useful Change of Variables Formulas
1. PDF of Z = X + Y: f_Z(z) = ∫_{−∞}^∞ f_X(w) f_Y(z − w) dw
   {Z = X + Y, W = X} ⇒ {X = W, Y = Z − W} ⇒ |J| = |det[0 1; 1 −1]| = 1 ⇒ f_ZW(z,w) = f_XY(w, z − w)
2. PDF of Z = X − Y: f_Z(z) = ∫_{−∞}^∞ f_X(w) f_Y(w − z) dw
   {Z = X − Y, W = X} ⇒ {X = W, Y = W − Z} ⇒ |J| = |det[0 1; −1 1]| = 1 ⇒ f_ZW(z,w) = f_XY(w, w − z)
3. PDF of Z = XY: f_Z(z) = ∫_{−∞}^∞ f_X(w) f_Y(z/w) (1/|w|) dw
   {Z = XY, W = X} ⇒ {X = W, Y = Z/W} ⇒ |J| = |det[0 1; 1/W −Z/W²]| = 1/|W| ⇒ f_ZW(z,w) = f_XY(w, z/w) / |w|
4. PDF of Z = X/Y: f_Z(z) = ∫_{−∞}^∞ f_X(w) f_Y(w/z) (|w|/z²) dw
   {Z = X/Y, W = X} ⇒ {X = W, Y = W/Z} ⇒ |J| = |det[0 1; −W/Z² 1/Z]| = |W|/Z² ⇒ f_ZW(z,w) = f_XY(w, w/z) · |w|/z²
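The convolution formula in 1 can be checked numerically. A sketch (not from the notes), assuming numpy, using X, Y iid Uniform(0,1), where the convolution integral gives the triangular density f_Z(z) = min(z, 2 − z) on [0, 2]:

```python
# Sketch: histogram of X + Y for X, Y iid U(0,1) vs. the triangle density.
import numpy as np

rng = np.random.default_rng(7)
z = rng.uniform(size=1_000_000) + rng.uniform(size=1_000_000)

hist, edges = np.histogram(z, bins=40, range=(0, 2), density=True)
mids = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - np.minimum(mids, 2 - mids))))  # small (~ 0.01)
```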
Cauchy Distribution Example: (where partitioning is important)
Let X, Y be independent standard normals.
1. Find the PDF of X²:
u = g(x) = x², −∞ < x < ∞: not a 1-1 transformation over the domain.
Let A_0 = {0}, A_1 = (−∞, 0), A_2 = (0, ∞).
On A_1: x = g_1⁻¹(u) = −√u ⇒ |∂g_1⁻¹(u)/∂u| = 1/(2√u)
On A_2: x = g_2⁻¹(u) = √u ⇒ |∂g_2⁻¹(u)/∂u| = 1/(2√u)
Then
f_U(u) = Σ_i f_X(g_i⁻¹(u)) |∂g_i⁻¹(u)/∂u| = f_X(−√u)·1/(2√u) + f_X(√u)·1/(2√u)
       = [1/√(2π)] e^{−u/2} · 1/(2√u) + [1/√(2π)] e^{−u/2} · 1/(2√u) = [1/√(2πu)] e^{−u/2} ~ Chi-Sq(1)
3. Find the PDF of X/|Y| (partition):
U = X/|Y|, V = |Y| ⇒ (U,V) is not a 1-1 mapping from ℝ² (multiple y's map to the same u).
Partition ℝ² s.t. (u,v) is a 1-1 transformation on each A_i:
Let A_0 = {(x,y) : y = 0}, A_1 = {(x,y) : y > 0}, A_2 = {(x,y) : y < 0}.
On A_1: U = X/Y, V = Y ⇒ X = UV, Y = V ⇒ |J| = |det[v u; 0 1]| = |v|
⇒ f_UV¹(u,v) = f_XY(uv, v)|v| = (|v|/2π) exp[−½(u²v² + v²)]
On A_2: U = −X/Y, V = −Y ⇒ X = UV, Y = −V ⇒ |J| = |det[v u; 0 −1]| = |v|
⇒ f_UV²(u,v) = f_XY(uv, −v)|v| = (|v|/2π) exp[−½(u²v² + v²)]
f_UV = f_UV¹ + f_UV² = (|v|/π) exp[−½v²(u² + 1)] = (v/π) exp[−½v²(u² + 1)] since v ∈ [0, ∞)
f_U(u) = ∫_0^∞ (v/π) exp[−½v²(u² + 1)] dv = [−1/(π(u² + 1))] exp[−½v²(u² + 1)] evaluated from v = 0 to ∞
       = 1/[π(u² + 1)] ~ Cauchy(0,1)
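This derivation, too, is easy to confirm by simulation. A sketch (not from the notes), assuming numpy and scipy:

```python
# Sketch: X/|Y| for independent standard normals should match a standard
# Cauchy; compare simulated quantiles with scipy's Cauchy ppf.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
u = rng.standard_normal(1_000_000) / np.abs(rng.standard_normal(1_000_000))

for q in [0.25, 0.5, 0.75, 0.9]:
    print(q, np.quantile(u, q), stats.cauchy.ppf(q))  # close pairs
```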
XV. Probability Theory
1. Definitions: Probability Measure, Sigma Algebra (the sigma algebra is what we define our measures on), Borel Fields
Def: The set S of all possible outcomes of a particular experiment is called the sample space of the experiment.
Def: A collection of subsets of S, denoted B, is called a sigma field or sigma algebra if it satisfies the following:
1. Empty set: ∅ ∈ B
2. Complements: if A ∈ B, then S \ A = Aᶜ ∈ B
3. Countable unions: if A_1, A_2, ... ∈ B, then ∪_{i=1}^∞ A_i ∈ B
(B is then also closed under countable intersections. Pf: if A_1, A_2, ... ∈ B, then each A_iᶜ ∈ B by 2, so ∪_{i=1}^∞ A_iᶜ ∈ B by 3, and hence ∩_{i=1}^∞ A_i = (∪_{i=1}^∞ A_iᶜ)ᶜ ∈ B by 2 and De Morgan's Laws.)
Def: P is a probability measure on the pair (S, B) if P satisfies:
1. P(A) ≥ 0 for all A ∈ B
2. P(S) = 1
3. Countable additivity: if A_1, A_2, ... ∈ B are pairwise disjoint, then P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i)
Def: Let X: (Ω, F) → (ℝ, B) be F-measurable. The Borel field generated by X is the smallest σ-field that makes X measurable, given by:
σ(X) ≡ {G ⊆ Ω : G = X⁻¹(B) for some B ∈ B}
(Think of these as the only sets in the universe that the random variable gives us information about, since they are the preimages of all the possible outcomes of the r.v. So the random variable X is informative about members of σ(X), but no more than that!)
2. Probability Space, Random Variables, and Measurability
Def: A random variable is a function from the sample space into the real numbers, i.e. a measurable mapping from (Ω, F) into (ℝ, B).
(So, for a random variable X: (Ω, F) → (ℝ, B), the sample space for X is ℝ.)
Def: A random variable X: (Ω, F) → (ℝ, B) is F-measurable if the preimage {ω ∈ Ω : X(ω) ∈ B} ∈ F for all B ∈ B.
(All the events in B can be mapped back to F and be measured there.)
Note: X(ω), ω ∈ Ω (the universe), induces a probability measure P_X on (ℝ, B). P_X is defined from P (a probability measure on (Ω, F)) by:
P_X(B) ≡ P(X ∈ B) = P({ω ∈ Ω : X(ω) ∈ B}) for any B ∈ B (the probability that X takes values in B)
Def: A random variable Y = g(X): (ℝ_X, B_X) → (ℝ_Y, B_Y) induces the probability measure P_Y on the sample space ℝ_Y as follows:
for any A ∈ B_Y, P_Y(A) ≡ P(Y ∈ A) = P(X ∈ {x ∈ ℝ_X : g(x) ∈ A}) = P_X({x ∈ ℝ_X : g(x) ∈ A})
Prop: Let F and G be two σ-fields s.t. G ⊆ F (all the sets in G are also in F). If a random variable X is G-measurable, then X is F-measurable. (Pf: ∀B ∈ B, {ω ∈ Ω : X(ω) ∈ B} ∈ G ⊆ F.)
3. Conditional Expectations and Law of Iterated Expectations
Def: Let X and Y be real-valued random variables on (Ω, F, P) and let G = σ(X). Suppose E|Y| is finite. The conditional expected value of Y given X is a random variable (a function of X) that satisfies the following 3 conditions:
1. E|E(Y|X)| < ∞
2. E(Y|X) is G-measurable: i.e. ∀B ∈ B, {ω ∈ Ω : E(Y|X)(ω) ∈ B} ∈ σ(X) (E(Y|X) is as informative as X, but no more sophisticated)
3. For all g ∈ G: ∫_g E(Y|X)(ω) dP(ω) = ∫_g Y(ω) dP(ω)
(Note: Y itself always satisfies 1 and 3, but Y will only satisfy 2 if σ(Y) ⊆ σ(X), i.e. Y is no more informative than X. So it is typically not possible to use Y as E(Y|X).)
Law of Iterated Expectations: E(E(Y|X)) = E(Y). (By condition 3, since E(Y|X) is clearly measurable w.r.t. any σ-field containing σ(X), take g = Ω: E(E(Y|X)) = ∫_Ω E(Y|X)(ω) dP(ω) = ∫_Ω Y(ω) dP(ω) = E(Y).)
(So the usual law of iterated expectations is a special case of conditioning on a sub-σ-field where G = {Ω, ∅}, because E(Y|G) = E(Y) in this case. Remember, E(Y) is just taking the expectation over the trivial sigma field.)

XVI. Matrix Algebra Topics
a. Rank of a Matrix
Prop: If A is m×n and B is n×n with rank(B) = n, then rank(AB) = rank(A).
Prop: rank(A) = rank(A'A) = rank(AA').
Prop: For any matrix A and nonsingular matrices B and C, rank(BAC) = rank(A) (provided that the multiplication is conformable).
b. Projection Matrices: given P a projection matrix onto subspace V
1. P² = PP = P (idempotent)
2. P a projection ⇒ I − P is a projection as well (onto V⊥)
3. I = P_V + P_{V⊥}
4. The eigenvalues of P are 1 or 0
5. For any matrix X of full column rank, X(X'X)⁻¹X' is the projection matrix onto the column space of X
c. Positive (Semi)Definite / Negative (Semi)Definite
Def: A (square) matrix A is positive definite (p.d.) if for all non-zero vectors x, x'Ax > 0 (i.e. the matrix projected on any direction is > 0).
Def: A (square) matrix A is positive semidefinite (p.s.d.) if for all vectors x, x'Ax ≥ 0.
1. If A (n×m, n ≥ m) has full column rank, then A'A is p.d., but AA' is only p.s.d. (Pf: for c ≠ 0, c'(A'A)c = (Ac)'(Ac) = ||Ac||² > 0, since A full column rank means Ac ≠ 0 for any non-zero c.)
2. If A is p.d. and B is a nonsingular matrix, then B'AB is p.d.
3. A p.d. iff all eigenvalues of A > 0. (For an eigenpair (λ, v), v ≠ 0: v'Av = λ·v'v > 0 ⇒ λ > 0.)
4. A p.d. ⇒ tr(A) > 0 (follows from above, since the trace is the sum of the eigenvalues; the converse is not true).
5. A p.d. ⇒ det(A) > 0 ⇒ A invertible/nonsingular (follows from 3, since the determinant is the product of the eigenvalues; det(A) > 0 alone does not imply p.d.).
6. For any matrix A, A'A is symmetric positive semi-definite.
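These eigenvalue characterizations are easy to check numerically. A sketch (not from the notes), assuming numpy:

```python
# Sketch: A'A is symmetric and p.s.d. (p.d. when A has full column rank);
# check via its eigenvalues.
import numpy as np

rng = np.random.default_rng(9)
a = rng.standard_normal((6, 3))   # full column rank with probability 1
gram = a.T @ a

print(np.allclose(gram, gram.T))  # symmetric
print(np.linalg.eigvalsh(gram))   # all eigenvalues > 0  =>  p.d.
```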
e. Trace
1. tr(A + B) = tr(A) + tr(B)
2. tr(AB) = tr(BA) (if the multiplication is defined)
3. tr(A) = tr(A')
4. tr(A'A) = Σ_i a_i'a_i = Σ_i Σ_j a_ij², where a_i is the i-th column of A
f. Inverting 2x2, 3x3
2x2:
[a b; c d]⁻¹ = 1/(ad − bc) · [d −b; −c a]
3x3:
A = [a b c; d e f; g h i] ⇒
A⁻¹ = 1/[a(ei − fh) − b(di − fg) + c(dh − eg)] · [ei − fh  ch − bi  bf − ce; fg − di  ai − cg  cd − af; dh − eg  bg − ah  ae − bd]
g. Determinants
det(AB) = det(A)det(B) if A, B square
If A invertible, det(A⁻¹) = 1/det(A) (this follows from the above)
h. Differentiating wrt Vectors
(The convention is: when you differentiate wrt a k×1 vector x, the resulting matrix is k×(.).)
If A is square:
• ∂(x'Ax)/∂x = (A + A')x
If A is symmetric:
• ∂(x'Ax)/∂x = 2Ax
Also:
• ∂(x'Ax)/∂A = xx'
• ∂ln|A|/∂A = (A')⁻¹
i. Transpose: Aᵀ
1. (A + B)ᵀ = Aᵀ + Bᵀ
2. (AB)ᵀ = BᵀAᵀ
3. (Aᵀ)⁻¹ = (A⁻¹)ᵀ if A invertible [AA⁻¹ = I_n ⇒ (AA⁻¹)ᵀ = I_nᵀ ⇒ (A⁻¹)ᵀAᵀ = I_n ⇒ (Aᵀ)⁻¹ = (A⁻¹)ᵀ]
4. rank(A) = rank(Aᵀ) for any A
5. Ker(A) = Ker(AᵀA) for any n×m matrix A [Ker(A) ⊆ Ker(AᵀA) and Ker(AᵀA) ⊆ Ker(A)]
6. If Ker(A) = {0}, then AᵀA is invertible, for any n×m matrix A [Ker(AᵀA) = Ker(A) = {0}]
7. det(A) = det(Aᵀ) for square matrix A
8. Dot product: v · u = vᵀu
k. Random Properties
• For any vector c, cc' is p.s.d.
• Any symmetric, idempotent matrix is p.s.d.
• If a matrix A is symmetric and positive definite, then there exists some nonsingular C s.t. A = C'C.
XVII. Miscellaneous
a. Measurement Error and MSE
Mean Square Error (MSE) = overall measure of the size of the measurement error when an estimate X is used to measure X₀ (the true quantity)
= E[(X − X₀)²] = Var(X − X₀) + [E(X − X₀)]² = Var(X) + Bias² = σ_X² + β²
Note: For an unbiased estimator, E(X) = X₀, so MSE = E[(X − X₀)²] = E[(X − E(X))²] = Var(X).
b. Approximation Method: Propagation of Error / Delta Method
Given RV's X and Y with E(X) and Var(X) known, suppose Y = g(X) where g is a nonlinear function. Computing E(Y) and Var(Y) exactly is easy only when g is linear, so we linearize g using the Taylor expansion of g about the mean of X (we choose the mean of X so we can get E(g(X)) and Var(g(X)) easily).
1. To the first order: Y = g(X) ≈ g(μ_X) + (X − μ_X)g'(μ_X) ⇒ E(Y) ≈ g(μ_X), Var(Y) ≈ Var(X)[g'(μ_X)]², i.e. μ_Y ≈ g(μ_X), σ_Y² ≈ σ_X²[g'(μ_X)]²
⇒ This allows us to approximate the E and Var of nonlinear functions of an RV X whose E(X) and Var(X) we know.
⇒ THIS IS THE DELTA METHOD.
Miscellaneous Definitions
Law of Total Probability: P(X = x) = Σ_i P(X = x | Y = y_i) P(Y = y_i)
Binomial Expansion: (1 + x)^n = Σ_{k=0}^n C(n,k) x^k
Geometric Series: Σ_{n=0}^∞ α^n = 1/(1 − α) for |α| < 1
Indicators and Expectation: the expected number of things can be expressed as a sum of indicators.
Fundamental Theorem of Calculus: If F(x) = ∫_a^x f(t) dt, then F'(x) = f(x). ⇒ Application: if P(Y ≤ y) = F_Z(ln y), then pdf_Y(y) = F_Z'(ln y) · (1/y) = f_Z(ln y) · (1/y).
Statistic/Estimator: A statistic/estimator is some function of the data (and doesn't depend on unknown parameters, though its properties do).
Unbiased Estimator: An estimator T = t(x_1,...,x_N) is called an unbiased estimator of some unknown parameter θ if E_θ(T) = θ ∀θ. ⇒ To show T unbiased, show E(T) = θ.
Consistent Estimator: An estimator T = t(x_1,...,x_N) is called a consistent estimator of some unknown parameter θ if T →p θ.
How to Show Consistency (i.e. P(|Y_n − μ| > ε) → 0?): By Chebychev we know
P(|Y_n − μ| > ε) ≤ E[(Y_n − μ)²]/ε² = (Var(Y_n − μ) + [E(Y_n − μ)]²)/ε² = (Var(Y_n) + Bias²)/ε²
Show Var(Y_n) → 0 and Bias → 0 (sufficient but not necessary). But in applications, we can often just appeal to the law of large numbers (which, like consistency, is convergence in probability!).
EXAMPLE: θ̂_{n,c} = c · (1/n) Σ_i |x_i| is a consistent estimator of σ if θ̂_{n,c} →p σ. By the law of large numbers, θ̂_{n,c} = c · (1/n) Σ_i |x_i| →p cE(|x_i|), so consistency requires choosing c s.t. cE(|x_i|) = σ.
Note1: So if the estimator is unbiased, all we need is to show Var(Y_n) → 0. And under appropriate smoothness conditions, Var → 0 is guaranteed for MLEs, so for MLEs unbiasedness is normally enough.
Note2: For an unbiased estimator, the bound above just involves the variance.
Note3: Consistency does not imply unbiasedness, and vice versa (e.g. X̄_n + 1/n is consistent but biased).
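The X̄_n + 1/n example in Note3 can be illustrated by simulation. A sketch (not from the notes), assuming numpy and N(0,1) data with μ = 0:

```python
# Sketch: X̄_n + 1/n has bias 1/n at every n (so it is biased), yet both
# bias and variance vanish, so it is consistent.
import numpy as np

rng = np.random.default_rng(10)
mu = 0.0
for n in [10, 100, 2_500]:
    est = rng.standard_normal((2_000, n)).mean(axis=1) + 1.0 / n
    # bias -> 0 and P(|error| > eps) -> 0 as n grows
    print(n, est.mean() - mu, np.mean(np.abs(est - mu) > 0.05))
```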