
Probability and Statistics Cheat Sheet

Copyright © Matthias Vallentin, 2011
vallentin@icir.org
6th March, 2011

This cheat sheet integrates a variety of topics in probability theory and
statistics. It is based on literature [1, 6, 3] and in-class material from
courses of the statistics department at the University of California,
Berkeley, but is also influenced by other sources [4, 5]. If you find errors
or have suggestions for further topics, I would appreciate if you send me an
email. The most recent version of this document is available at
http://bit.ly/probstat. To reproduce, please contact me.

Contents

1  Distribution Overview
   1.1  Discrete Distributions
   1.2  Continuous Distributions
2  Probability Theory
3  Random Variables
   3.1  Transformations
4  Expectation
5  Variance
6  Inequalities
7  Distribution Relationships
8  Probability and Moment Generating Functions
9  Multivariate Distributions
   9.1  Standard Bivariate Normal
   9.2  Bivariate Normal
   9.3  Multivariate Normal
10 Convergence
   10.1 Law of Large Numbers (LLN)
   10.2 Central Limit Theorem (CLT)
11 Statistical Inference
   11.1 Point Estimation
   11.2 Normal-based Confidence Interval
   11.3 Empirical Distribution Function
   11.4 Statistical Functionals
12 Parametric Inference
   12.1 Method of Moments
   12.2 Maximum Likelihood
        12.2.1 Delta Method
   12.3 Multiparameter Models
        12.3.1 Multiparameter Delta Method
   12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Bayesian Inference
   14.1 Credible Intervals
   14.2 Function of Parameters
   14.3 Priors
        14.3.1 Conjugate Priors
   14.4 Bayesian Testing
15 Exponential Family
16 Sampling Methods
   16.1 The Bootstrap
        16.1.1 Bootstrap Confidence Intervals
   16.2 Rejection Sampling
   16.3 Importance Sampling
17 Decision Theory
   17.1 Risk
   17.2 Admissibility
   17.3 Bayes Rule
   17.4 Minimax Rules
18 Linear Regression
   18.1 Simple Linear Regression
   18.2 Prediction
   18.3 Multiple Regression
   18.4 Model Selection
19 Non-parametric Function Estimation
   19.1 Density Estimation
        19.1.1 Histograms
        19.1.2 Kernel Density Estimator (KDE)
   19.2 Non-parametric Regression
   19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
   20.1 Markov Chains
   20.2 Poisson Processes
21 Time Series
   21.1 Stationary Time Series
   21.2 Estimation of Correlation
   21.3 Non-Stationary Time Series
        21.3.1 Detrending
   21.4 ARIMA models
        21.4.1 Causality and Invertibility
   21.5 Spectral Analysis
22 Math
   22.1 Gamma Function
   22.2 Beta Function
   22.3 Series
   22.4 Combinatorics

1 Distribution Overview

1.1 Discrete Distributions

For each distribution the entries below give the notation, CDF $F_X(x)$, PMF $f_X(x)$, mean $\mathbb{E}[X]$, variance $\mathbb{V}[X]$, and MGF $M_X(s)$.¹

Uniform (discrete), $\mathrm{Unif}\{a,\dots,b\}$
  $F_X(x) = 0$ for $x<a$; $\frac{\lfloor x\rfloor - a + 1}{b-a+1}$ for $a \le x \le b$; $1$ for $x>b$
  $f_X(x) = \frac{I(a \le x \le b)}{b-a+1}$
  $\mathbb{E}[X] = \frac{a+b}{2}$,  $\mathbb{V}[X] = \frac{(b-a+1)^2-1}{12}$,  $M_X(s) = \frac{e^{as}-e^{(b+1)s}}{(b-a+1)(1-e^{s})}$

Bernoulli, $\mathrm{Bern}(p)$
  $f_X(x) = p^x(1-p)^{1-x}$,  $F_X(x) = (1-p)^{1-x}$ for $x \in \{0,1\}$
  $\mathbb{E}[X] = p$,  $\mathbb{V}[X] = p(1-p)$,  $M_X(s) = 1-p+pe^s$

Binomial, $\mathrm{Bin}(n,p)$
  $F_X(x) = I_{1-p}(n-x,\, x+1)$
  $f_X(x) = \binom{n}{x} p^x (1-p)^{n-x}$
  $\mathbb{E}[X] = np$,  $\mathbb{V}[X] = np(1-p)$,  $M_X(s) = (1-p+pe^s)^n$

Multinomial, $\mathrm{Mult}(n,p)$
  $f_X(x) = \frac{n!}{x_1!\cdots x_k!}\, p_1^{x_1}\cdots p_k^{x_k}$ with $\sum_{i=1}^k x_i = n$
  $\mathbb{E}[X_i] = np_i$,  $\mathbb{V}[X_i] = np_i(1-p_i)$,  $M_X(s) = \left(\sum_{i=1}^k p_i e^{s_i}\right)^n$

Hypergeometric, $\mathrm{Hyp}(N,m,n)$
  $f_X(x) = \binom{m}{x}\binom{N-m}{n-x}\big/\binom{N}{n}$
  $\mathbb{E}[X] = \frac{nm}{N}$,  $\mathbb{V}[X] = \frac{nm(N-n)(N-m)}{N^2(N-1)}$,  MGF: N/A

Negative Binomial, $\mathrm{NBin}(r,p)$
  $F_X(x) = I_p(r,\, x+1)$
  $f_X(x) = \binom{x+r-1}{r-1}\, p^r (1-p)^x$
  $\mathbb{E}[X] = r\frac{1-p}{p}$,  $\mathbb{V}[X] = r\frac{1-p}{p^2}$,  $M_X(s) = \left(\frac{p}{1-(1-p)e^s}\right)^r$

Geometric, $\mathrm{Geo}(p)$
  $F_X(x) = 1-(1-p)^x$,  $f_X(x) = p(1-p)^{x-1}$  ($x \in \mathbb{N}^+$)
  $\mathbb{E}[X] = \frac{1}{p}$,  $\mathbb{V}[X] = \frac{1-p}{p^2}$,  $M_X(s) = \frac{pe^s}{1-(1-p)e^s}$

Poisson, $\mathrm{Po}(\lambda)$
  $F_X(x) = e^{-\lambda}\sum_{i=0}^{\lfloor x\rfloor}\frac{\lambda^i}{i!}$,  $f_X(x) = \frac{\lambda^x e^{-\lambda}}{x!}$  ($x \in \{0,1,2,\dots\}$)
  $\mathbb{E}[X] = \lambda$,  $\mathbb{V}[X] = \lambda$,  $M_X(s) = e^{\lambda(e^s-1)}$

[Figure: example PMFs of the discrete uniform, binomial (n=40, p=0.3; n=30, p=0.6; n=25, p=0.9), geometric (p=0.2, 0.5, 0.8), and Poisson (λ=1, 4, 10) distributions.]

¹ We use the notation $\gamma(s,x)$ and $\Gamma(x)$ to refer to the Gamma functions (see 22.1), and use $B(x,y)$ and $I_x$ to refer to the Beta functions (see 22.2).
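The tabulated quantities are easy to cross-check numerically. This is a minimal sketch using scipy.stats, which is an assumed dependency of the example and not part of the original cheat sheet:

```python
# Evaluate PMF, CDF, mean, and variance of a few discrete distributions
# with scipy.stats, to cross-check the table above.
from scipy import stats

binom = stats.binom(n=40, p=0.3)      # Bin(n, p)
geom = stats.geom(p=0.5)              # Geo(p), support {1, 2, ...}
pois = stats.poisson(mu=4)            # Po(lambda)

print(binom.pmf(12), binom.cdf(12))   # f_X(12), F_X(12)
print(binom.mean(), binom.var())      # np, np(1-p)
print(geom.mean(), geom.var())        # 1/p, (1-p)/p^2
print(pois.mean(), pois.var())        # lambda, lambda
```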

1.2 Continuous Distributions

Uniform (continuous), $\mathrm{Unif}(a,b)$
  $F_X(x) = 0$ for $x<a$; $\frac{x-a}{b-a}$ for $a<x<b$; $1$ for $x>b$
  $f_X(x) = \frac{I(a<x<b)}{b-a}$
  $\mathbb{E}[X] = \frac{a+b}{2}$,  $\mathbb{V}[X] = \frac{(b-a)^2}{12}$,  $M_X(s) = \frac{e^{sb}-e^{sa}}{s(b-a)}$

Normal, $\mathcal{N}(\mu,\sigma^2)$
  $F_X(x) = \Phi(x) = \int_{-\infty}^{x}\phi(t)\,dt$
  $f_X(x) = \phi(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}$
  $\mathbb{E}[X] = \mu$,  $\mathbb{V}[X] = \sigma^2$,  $M_X(s) = \exp\left\{\mu s + \frac{\sigma^2 s^2}{2}\right\}$

Log-Normal, $\ln\mathcal{N}(\mu,\sigma^2)$
  $F_X(x) = \frac{1}{2} + \frac{1}{2}\,\mathrm{erf}\!\left[\frac{\ln x - \mu}{\sqrt{2\sigma^2}}\right]$
  $f_X(x) = \frac{1}{x\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(\ln x - \mu)^2}{2\sigma^2}\right\}$
  $\mathbb{E}[X] = e^{\mu+\sigma^2/2}$,  $\mathbb{V}[X] = (e^{\sigma^2}-1)\,e^{2\mu+\sigma^2}$

Multivariate Normal, $\mathrm{MVN}(\mu,\Sigma)$
  $f_X(x) = (2\pi)^{-k/2}|\Sigma|^{-1/2}\exp\left\{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right\}$
  $\mathbb{E}[X] = \mu$,  $\mathbb{V}[X] = \Sigma$,  $M_X(s) = \exp\left\{\mu^T s + \frac{1}{2}s^T\Sigma s\right\}$

Student's t, $\mathrm{Student}(\nu)$
  $f_X(x) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1+\frac{x^2}{\nu}\right)^{-(\nu+1)/2}$
  $\mathbb{E}[X] = 0$ ($\nu>1$),  $\mathbb{V}[X] = \frac{\nu}{\nu-2}$ ($\nu>2$)

Chi-square, $\chi^2_k$
  $F_X(x) = \frac{1}{\Gamma(k/2)}\,\gamma\!\left(\frac{k}{2},\frac{x}{2}\right)$
  $f_X(x) = \frac{1}{2^{k/2}\Gamma(k/2)}\,x^{k/2-1}e^{-x/2}$
  $\mathbb{E}[X] = k$,  $\mathbb{V}[X] = 2k$,  $M_X(s) = (1-2s)^{-k/2}$ ($s<1/2$)

F, $\mathrm{F}(d_1,d_2)$
  $F_X(x) = I_{d_1x/(d_1x+d_2)}\!\left(\frac{d_1}{2},\frac{d_2}{2}\right)$
  $f_X(x) = \frac{\sqrt{\frac{(d_1x)^{d_1}\,d_2^{d_2}}{(d_1x+d_2)^{d_1+d_2}}}}{x\,B\!\left(\frac{d_1}{2},\frac{d_2}{2}\right)}$
  $\mathbb{E}[X] = \frac{d_2}{d_2-2}$ ($d_2>2$),  $\mathbb{V}[X] = \frac{2d_2^2(d_1+d_2-2)}{d_1(d_2-2)^2(d_2-4)}$ ($d_2>4$)

Exponential, $\mathrm{Exp}(\beta)$
  $F_X(x) = 1-e^{-x/\beta}$,  $f_X(x) = \frac{1}{\beta}e^{-x/\beta}$
  $\mathbb{E}[X] = \beta$,  $\mathbb{V}[X] = \beta^2$,  $M_X(s) = \frac{1}{1-\beta s}$ ($s<1/\beta$)

Gamma, $\mathrm{Gamma}(\alpha,\beta)$
  $F_X(x) = \frac{\gamma(\alpha,\, x/\beta)}{\Gamma(\alpha)}$
  $f_X(x) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\,x^{\alpha-1}e^{-x/\beta}$
  $\mathbb{E}[X] = \alpha\beta$,  $\mathbb{V}[X] = \alpha\beta^2$,  $M_X(s) = \left(\frac{1}{1-\beta s}\right)^{\alpha}$ ($s<1/\beta$)

Inverse Gamma, $\mathrm{InvGamma}(\alpha,\beta)$
  $F_X(x) = \frac{\Gamma(\alpha,\, \beta/x)}{\Gamma(\alpha)}$
  $f_X(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,x^{-\alpha-1}e^{-\beta/x}$
  $\mathbb{E}[X] = \frac{\beta}{\alpha-1}$ ($\alpha>1$),  $\mathbb{V}[X] = \frac{\beta^2}{(\alpha-1)^2(\alpha-2)}$ ($\alpha>2$)
  $M_X(s) = \frac{2(-\beta s)^{\alpha/2}}{\Gamma(\alpha)}\,K_{\alpha}\!\left(\sqrt{-4\beta s}\right)$

Dirichlet, $\mathrm{Dir}(\alpha)$
  $f_X(x) = \frac{\Gamma\!\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)}\prod_{i=1}^{k}x_i^{\alpha_i-1}$
  $\mathbb{E}[X_i] = \frac{\alpha_i}{\sum_{j}\alpha_j}$,  $\mathbb{V}[X_i] = \frac{\mathbb{E}[X_i]\,(1-\mathbb{E}[X_i])}{\sum_{i}\alpha_i + 1}$

Beta, $\mathrm{Beta}(\alpha,\beta)$
  $F_X(x) = I_x(\alpha,\beta)$
  $f_X(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}$
  $\mathbb{E}[X] = \frac{\alpha}{\alpha+\beta}$,  $\mathbb{V}[X] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$
  $M_X(s) = 1 + \sum_{k=1}^{\infty}\left(\prod_{r=0}^{k-1}\frac{\alpha+r}{\alpha+\beta+r}\right)\frac{s^k}{k!}$

Weibull, $\mathrm{Weibull}(\lambda,k)$
  $F_X(x) = 1-e^{-(x/\lambda)^k}$
  $f_X(x) = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1}e^{-(x/\lambda)^k}$
  $\mathbb{E}[X] = \lambda\,\Gamma\!\left(1+\frac{1}{k}\right)$,  $\mathbb{V}[X] = \lambda^2\left[\Gamma\!\left(1+\frac{2}{k}\right)-\Gamma^2\!\left(1+\frac{1}{k}\right)\right]$
  $M_X(s) = \sum_{n=0}^{\infty}\frac{s^n\lambda^n}{n!}\,\Gamma\!\left(1+\frac{n}{k}\right)$

Pareto, $\mathrm{Pareto}(x_m,\alpha)$
  $F_X(x) = 1-\left(\frac{x_m}{x}\right)^{\alpha}$  ($x \ge x_m$)
  $f_X(x) = \frac{\alpha x_m^{\alpha}}{x^{\alpha+1}}$  ($x \ge x_m$)
  $\mathbb{E}[X] = \frac{\alpha x_m}{\alpha-1}$ ($\alpha>1$),  $\mathbb{V}[X] = \frac{x_m^2\,\alpha}{(\alpha-1)^2(\alpha-2)}$ ($\alpha>2$)
  $M_X(s) = \alpha(-x_m s)^{\alpha}\,\Gamma(-\alpha, -x_m s)$  ($s<0$)

[Figure: example PDFs of the continuous uniform, normal, log-normal, Student's t, chi-square (k=1,...,5), F, exponential, gamma, inverse gamma, beta, Weibull, and Pareto distributions for several parameter settings.]

2 Probability Theory

Definitions
  Sample space $\Omega$
  Outcome (point or element) $\omega \in \Omega$
  Event $A \subseteq \Omega$
  $\sigma$-algebra $\mathcal{A}$:
    1. $\emptyset \in \mathcal{A}$
    2. $A_1, A_2, \dots \in \mathcal{A} \implies \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$
    3. $A \in \mathcal{A} \implies \neg A \in \mathcal{A}$
  Probability distribution $\mathbb{P}$:
    1. $\mathbb{P}[A] \ge 0$ for every $A$
    2. $\mathbb{P}[\Omega] = 1$
    3. $\mathbb{P}\left[\bigsqcup_{i=1}^{\infty} A_i\right] = \sum_{i=1}^{\infty} \mathbb{P}[A_i]$ (disjoint $A_i$)
  Probability space $(\Omega, \mathcal{A}, \mathbb{P})$

Properties
  $\mathbb{P}[\emptyset] = 0$
  $B = \Omega \cap B = (A \cup \neg A) \cap B = (A \cap B) \cup (\neg A \cap B)$
  $\mathbb{P}[\neg A] = 1 - \mathbb{P}[A]$
  $\mathbb{P}[B] = \mathbb{P}[A \cap B] + \mathbb{P}[\neg A \cap B]$
  $\mathbb{P}[\Omega] = 1$,  $\mathbb{P}[\emptyset] = 0$
  DeMorgan: $\neg\left(\bigcup_n A_n\right) = \bigcap_n \neg A_n$,  $\neg\left(\bigcap_n A_n\right) = \bigcup_n \neg A_n$
  $\mathbb{P}\left[\bigcup_n A_n\right] = 1 - \mathbb{P}\left[\bigcap_n \neg A_n\right]$
  $\mathbb{P}[A \cup B] = \mathbb{P}[A] + \mathbb{P}[B] - \mathbb{P}[A \cap B] \implies \mathbb{P}[A \cup B] \le \mathbb{P}[A] + \mathbb{P}[B]$
  $\mathbb{P}[A \cup B] = \mathbb{P}[A \cap \neg B] + \mathbb{P}[\neg A \cap B] + \mathbb{P}[A \cap B]$
  $\mathbb{P}[A \cap \neg B] = \mathbb{P}[A] - \mathbb{P}[A \cap B]$

Continuity of Probabilities
  $A_1 \subset A_2 \subset \dots \implies \lim_{n\to\infty}\mathbb{P}[A_n] = \mathbb{P}[A]$ where $A = \bigcup_{i=1}^{\infty} A_i$
  $A_1 \supset A_2 \supset \dots \implies \lim_{n\to\infty}\mathbb{P}[A_n] = \mathbb{P}[A]$ where $A = \bigcap_{i=1}^{\infty} A_i$

Independence
  $A \perp B \iff \mathbb{P}[A \cap B] = \mathbb{P}[A]\,\mathbb{P}[B]$

Conditional Probability
  $\mathbb{P}[A \mid B] = \frac{\mathbb{P}[A \cap B]}{\mathbb{P}[B]}$ if $\mathbb{P}[B] > 0$

Law of Total Probability
  $\mathbb{P}[B] = \sum_{i=1}^{n}\mathbb{P}[B \mid A_i]\,\mathbb{P}[A_i]$ where $\Omega = \bigsqcup_{i=1}^{n} A_i$

Bayes' Theorem
  $\mathbb{P}[A_i \mid B] = \frac{\mathbb{P}[B \mid A_i]\,\mathbb{P}[A_i]}{\sum_{j=1}^{n}\mathbb{P}[B \mid A_j]\,\mathbb{P}[A_j]}$ where $\Omega = \bigsqcup_{i=1}^{n} A_i$

Inclusion-Exclusion Principle
  $\left|\bigcup_{i=1}^{n} A_i\right| = \sum_{r=1}^{n}(-1)^{r-1}\sum_{i_1 < \dots < i_r}\left|\bigcap_{j=1}^{r} A_{i_j}\right|$
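As a quick numeric illustration of the law of total probability and Bayes' theorem, the following sketch works through a made-up diagnostic-test example (the numbers are mine, not from the original text):

```python
# Toy example: disease prevalence 1%, test sensitivity 95%, specificity 90%.
# P[D|+] = P[+|D]P[D] / (P[+|D]P[D] + P[+|not D]P[not D])
p_d = 0.01          # prior P[D]
p_pos_d = 0.95      # P[+ | D]
p_pos_nd = 0.10     # P[+ | not D] = 1 - specificity

p_pos = p_pos_d * p_d + p_pos_nd * (1 - p_d)   # law of total probability
p_d_pos = p_pos_d * p_d / p_pos                # Bayes' theorem
print(p_d_pos)      # ~0.088: a positive test still leaves P[D] below 10%
```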

3 Random Variables

Random Variable
  $X: \Omega \to \mathbb{R}$

Probability Mass Function (PMF)
  $f_X(x) = \mathbb{P}[X = x] = \mathbb{P}[\{\omega \in \Omega : X(\omega) = x\}]$

Probability Density Function (PDF)
  $\mathbb{P}[a \le X \le b] = \int_a^b f(x)\,dx$

Cumulative Distribution Function (CDF)
  $F_X: \mathbb{R} \to [0,1]$,  $F_X(x) = \mathbb{P}[X \le x]$
  1. Nondecreasing: $x_1 < x_2 \implies F(x_1) \le F(x_2)$
  2. Normalized: $\lim_{x\to-\infty} F(x) = 0$ and $\lim_{x\to\infty} F(x) = 1$
  3. Right-continuous: $\lim_{y \downarrow x} F(y) = F(x)$

  $\mathbb{P}[a \le Y \le b \mid X = x] = \int_a^b f_{Y|X}(y \mid x)\,dy$
  $f_{Y|X}(y \mid x) = \frac{f(x,y)}{f_X(x)}$

Independence
  1. $\mathbb{P}[X \le x, Y \le y] = \mathbb{P}[X \le x]\,\mathbb{P}[Y \le y]$
  2. $f_{X,Y}(x,y) = f_X(x)\,f_Y(y)$

3.1 Transformations

Transformation function
  $Z = \varphi(X)$

Discrete
  $f_Z(z) = \mathbb{P}[\varphi(X) = z] = \mathbb{P}[\{x : \varphi(x) = z\}] = \mathbb{P}[X \in \varphi^{-1}(z)] = \sum_{x \in \varphi^{-1}(z)} f(x)$

Continuous
  $F_Z(z) = \mathbb{P}[\varphi(X) \le z] = \int_{A_z} f(x)\,dx$  with $A_z = \{x : \varphi(x) \le z\}$

Special case if $\varphi$ strictly monotone
  $f_Z(z) = f_X\!\left(\varphi^{-1}(z)\right)\left|\frac{d}{dz}\varphi^{-1}(z)\right| = f_X(x)\left|\frac{dx}{dz}\right| = f_X(x)\frac{1}{|J|}$

The Rule of the Lazy Statistician
  $\mathbb{E}[Z] = \int \varphi(x)\,dF_X(x)$
  $\mathbb{E}[I_A(x)] = \int I_A(x)\,dF_X(x) = \int_A dF_X(x) = \mathbb{P}[X \in A]$

Convolution
  $Z := X + Y$:  $f_Z(z) = \int_{-\infty}^{\infty} f_{X,Y}(x, z-x)\,dx \stackrel{X,Y\ge 0}{=} \int_0^z f_{X,Y}(x, z-x)\,dx$
  $Z := |X - Y|$:  $f_Z(z) = 2\int_0^{\infty} f_{X,Y}(x, z+x)\,dx$
  $Z := \frac{X}{Y}$:  $f_Z(z) = \int_{-\infty}^{\infty}|x|\,f_{X,Y}(x, xz)\,dx \stackrel{\perp}{=} \int_{-\infty}^{\infty}|x|\,f_X(x)\,f_Y(xz)\,dx$

4 Expectation

Expectation
  $\mathbb{E}[X] = \mu_X = \int x\,dF_X(x) = \begin{cases}\sum_x x\,f_X(x) & X \text{ discrete}\\ \int x\,f_X(x)\,dx & X \text{ continuous}\end{cases}$

Properties
  $\mathbb{P}[X = c] = 1 \implies \mathbb{E}[c] = c$
  $\mathbb{E}[cX] = c\,\mathbb{E}[X]$
  $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$
  $\mathbb{E}[XY] = \int\!\!\int_{X,Y} xy\,f_{X,Y}(x,y)\,dF_X(x)\,dF_Y(y)$
  $\mathbb{E}[\varphi(X)] \ne \varphi(\mathbb{E}[X])$ in general (cf. Jensen inequality)
  $\mathbb{P}[X \ge Y] = 1 \implies \mathbb{E}[X] \ge \mathbb{E}[Y]$;  $\mathbb{P}[X = Y] = 1 \implies \mathbb{E}[X] = \mathbb{E}[Y]$
  $\mathbb{E}[X] = \sum_{x=1}^{\infty}\mathbb{P}[X \ge x]$ (non-negative integer-valued $X$)

Sample mean
  $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$

Conditional Expectation
  $\mathbb{E}[Y \mid X = x] = \int y\,f(y \mid x)\,dy$
  $\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid Y]]$
  $\mathbb{E}[\varphi(X,Y) \mid X = x] = \int \varphi(x,y)\,f_{Y|X}(y \mid x)\,dy$
  $\mathbb{E}[\varphi(Y,Z) \mid X = x] = \int\!\!\int \varphi(y,z)\,f_{(Y,Z)|X}(y,z \mid x)\,dy\,dz$
  $\mathbb{E}[Y + Z \mid X] = \mathbb{E}[Y \mid X] + \mathbb{E}[Z \mid X]$
  $\mathbb{E}[\varphi(X)Y \mid X] = \varphi(X)\,\mathbb{E}[Y \mid X]$
  $\mathbb{E}[Y \mid X] = c \implies \mathrm{Cov}[X, Y] = 0$

5 Variance

Variance
  $\mathbb{V}[X] = \sigma_X^2 = \mathbb{E}\left[(X - \mathbb{E}[X])^2\right] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$
  $\mathbb{V}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n}\mathbb{V}[X_i] + 2\sum_{i\ne j}\mathrm{Cov}[X_i, X_j]$
  $\mathbb{V}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n}\mathbb{V}[X_i]$  iff $X_i \perp X_j$

Standard deviation
  $\mathrm{sd}[X] = \sqrt{\mathbb{V}[X]} = \sigma_X$

Covariance
  $\mathrm{Cov}[X, Y] = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]$
  $\mathrm{Cov}[X, a] = 0$
  $\mathrm{Cov}[X, X] = \mathbb{V}[X]$
  $\mathrm{Cov}[X, Y] = \mathrm{Cov}[Y, X]$
  $\mathrm{Cov}[aX, bY] = ab\,\mathrm{Cov}[X, Y]$
  $\mathrm{Cov}[X + a, Y + b] = \mathrm{Cov}[X, Y]$
  $\mathrm{Cov}\left[\sum_{i=1}^{n} X_i, \sum_{j=1}^{m} Y_j\right] = \sum_{i=1}^{n}\sum_{j=1}^{m}\mathrm{Cov}[X_i, Y_j]$

Correlation
  $\rho[X, Y] = \frac{\mathrm{Cov}[X, Y]}{\sqrt{\mathbb{V}[X]\,\mathbb{V}[Y]}}$

Independence
  $X \perp Y \implies \rho[X, Y] = 0 \iff \mathrm{Cov}[X, Y] = 0 \iff \mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$

Sample variance
  $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$

Conditional Variance
  $\mathbb{V}[Y \mid X] = \mathbb{E}\left[(Y - \mathbb{E}[Y \mid X])^2 \mid X\right] = \mathbb{E}[Y^2 \mid X] - \mathbb{E}[Y \mid X]^2$
  $\mathbb{V}[Y] = \mathbb{E}[\mathbb{V}[Y \mid X]] + \mathbb{V}[\mathbb{E}[Y \mid X]]$

6 Inequalities

Cauchy-Schwarz
  $\mathbb{E}[XY]^2 \le \mathbb{E}[X^2]\,\mathbb{E}[Y^2]$

Markov
  $\mathbb{P}[\varphi(X) \ge t] \le \frac{\mathbb{E}[\varphi(X)]}{t}$

Chebyshev
  $\mathbb{P}[|X - \mathbb{E}[X]| \ge t] \le \frac{\mathbb{V}[X]}{t^2}$

Chernoff
  $\mathbb{P}[X \ge (1+\delta)\mu] \le \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu}$  ($\delta > -1$)

Jensen
  $\mathbb{E}[\varphi(X)] \ge \varphi(\mathbb{E}[X])$ for convex $\varphi$

A small Monte Carlo check of the Markov and Chebyshev bounds follows.
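This is a minimal sketch, assuming numpy and an exponential example of my choosing (not from the text):

```python
# Compare the exact tail probability of an Exp(1) variable with the
# Markov and Chebyshev upper bounds, estimating the tail by simulation.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # E[X] = 1, V[X] = 1
t = 3.0

tail = np.mean(x >= t)                 # P[X >= t], about e^{-3} ~ 0.0498
markov = 1.0 / t                       # E[X]/t
chebyshev = 1.0 / (t - 1.0) ** 2       # V[X]/(t-1)^2 bounds P[|X-E[X]| >= t-1]

print(tail, markov, chebyshev)         # the simulated tail is below both bounds
```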

7 Distribution Relationships

Binomial
  $X_i \sim \mathrm{Bern}(p) \implies \sum_{i=1}^{n} X_i \sim \mathrm{Bin}(n, p)$
  $X \sim \mathrm{Bin}(n, p),\ Y \sim \mathrm{Bin}(m, p) \implies X + Y \sim \mathrm{Bin}(n+m, p)$
  $\lim_{n\to\infty}\mathrm{Bin}(n, p) = \mathrm{Po}(np)$  ($n$ large, $p$ small)
  $\lim_{n\to\infty}\mathrm{Bin}(n, p) = \mathcal{N}(np, np(1-p))$  ($n$ large, $p$ far from 0 and 1)

Negative Binomial
  $X \sim \mathrm{NBin}(1, p) = \mathrm{Geo}(p)$
  $X \sim \mathrm{NBin}(r, p) = \sum_{i=1}^{r}\mathrm{Geo}(p)$
  $X_i \sim \mathrm{NBin}(r_i, p) \implies \sum X_i \sim \mathrm{NBin}\left(\sum r_i, p\right)$
  $X \sim \mathrm{NBin}(r, p),\ Y \sim \mathrm{Bin}(s+r, p) \implies \mathbb{P}[X \le s] = \mathbb{P}[Y \ge r]$

Poisson
  $X_i \sim \mathrm{Po}(\lambda_i),\ X_i \perp X_j \implies \sum_{i=1}^{n} X_i \sim \mathrm{Po}\left(\sum_{i=1}^{n}\lambda_i\right)$
  $X_i \sim \mathrm{Po}(\lambda_i),\ X_i \perp X_j \implies X_i \,\Big|\, \sum_{j=1}^{n} X_j \sim \mathrm{Bin}\left(\sum_{j=1}^{n} X_j,\ \frac{\lambda_i}{\sum_{j=1}^{n}\lambda_j}\right)$

Exponential
  $X_i \sim \mathrm{Exp}(\beta),\ X_i \perp X_j \implies \sum_{i=1}^{n} X_i \sim \mathrm{Gamma}(n, \beta)$
  Memoryless property: $\mathbb{P}[X > x+y \mid X > y] = \mathbb{P}[X > x]$

Normal
  $X \sim \mathcal{N}(\mu, \sigma^2) \implies \frac{X-\mu}{\sigma} \sim \mathcal{N}(0, 1)$
  $X \sim \mathcal{N}(\mu, \sigma^2),\ Z = aX + b \implies Z \sim \mathcal{N}(a\mu + b, a^2\sigma^2)$
  $X \sim \mathcal{N}(\mu_1, \sigma_1^2),\ Y \sim \mathcal{N}(\mu_2, \sigma_2^2)$ independent $\implies X + Y \sim \mathcal{N}(\mu_1+\mu_2, \sigma_1^2+\sigma_2^2)$
  $X_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$ independent $\implies \sum_i X_i \sim \mathcal{N}\left(\sum_i\mu_i, \sum_i\sigma_i^2\right)$
  $\mathbb{P}[a < X \le b] = \Phi\left(\frac{b-\mu}{\sigma}\right) - \Phi\left(\frac{a-\mu}{\sigma}\right)$
  $\Phi(-x) = 1 - \Phi(x)$,  $\phi'(x) = -x\,\phi(x)$,  $\phi''(x) = (x^2-1)\,\phi(x)$
  Upper quantile of $\mathcal{N}(0, 1)$: $z_{\alpha} = \Phi^{-1}(1-\alpha)$

Gamma
  $X \sim \mathrm{Gamma}(\alpha, \beta) \iff X/\beta \sim \mathrm{Gamma}(\alpha, 1)$
  $\mathrm{Gamma}(\alpha, \beta) \sim \sum_{i=1}^{\alpha}\mathrm{Exp}(\beta)$ (integer $\alpha$)
  $X_i \sim \mathrm{Gamma}(\alpha_i, \beta),\ X_i \perp X_j \implies \sum_i X_i \sim \mathrm{Gamma}\left(\sum_i\alpha_i, \beta\right)$
  $\frac{\Gamma(\alpha)}{\lambda^{\alpha}} = \int_0^{\infty}x^{\alpha-1}e^{-\lambda x}\,dx$

Beta
  $\frac{1}{B(\alpha,\beta)}\,x^{\alpha-1}(1-x)^{\beta-1} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}$
  $\mathbb{E}[X^k] = \frac{B(\alpha+k, \beta)}{B(\alpha, \beta)} = \frac{\alpha+k-1}{\alpha+\beta+k-1}\,\mathbb{E}[X^{k-1}]$
  $\mathrm{Beta}(1, 1) \sim \mathrm{Unif}(0, 1)$

8 Probability and Moment Generating Functions

  $G_X(t) = \mathbb{E}[t^X]$  ($|t| < 1$)
  $M_X(t) = G_X(e^t) = \mathbb{E}[e^{Xt}] = \mathbb{E}\left[\sum_{i=0}^{\infty}\frac{(Xt)^i}{i!}\right] = \sum_{i=0}^{\infty}\frac{\mathbb{E}[X^i]}{i!}\,t^i$
  $\mathbb{P}[X = 0] = G_X(0)$
  $\mathbb{P}[X = 1] = G'_X(0)$
  $\mathbb{P}[X = i] = \frac{G_X^{(i)}(0)}{i!}$
  $\mathbb{E}[X] = G'_X(1^-)$
  $\mathbb{E}[X^k] = M_X^{(k)}(0)$
  $\mathbb{E}\left[\frac{X!}{(X-k)!}\right] = G_X^{(k)}(1^-)$
  $\mathbb{V}[X] = G''_X(1^-) + G'_X(1^-) - \left(G'_X(1^-)\right)^2$
  $G_X(t) = G_Y(t) \implies X \stackrel{d}{=} Y$

9 Multivariate Distributions

9.1 Standard Bivariate Normal

Let $X, Z \sim \mathcal{N}(0, 1)$ with $X \perp Z$ and $Y = \rho X + \sqrt{1-\rho^2}\,Z$.

Joint density
  $f(x, y) = \frac{1}{2\pi\sqrt{1-\rho^2}}\exp\left\{-\frac{x^2 + y^2 - 2\rho xy}{2(1-\rho^2)}\right\}$

Conditionals
  $(Y \mid X = x) \sim \mathcal{N}(\rho x, 1-\rho^2)$  and  $(X \mid Y = y) \sim \mathcal{N}(\rho y, 1-\rho^2)$

Independence
  $X \perp Y \iff \rho = 0$

9.2 Bivariate Normal

Let $X \sim \mathcal{N}(\mu_x, \sigma_x^2)$ and $Y \sim \mathcal{N}(\mu_y, \sigma_y^2)$.
  $f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}}\exp\left\{-\frac{z}{2(1-\rho^2)}\right\}$
  $z = \left(\frac{x-\mu_x}{\sigma_x}\right)^2 + \left(\frac{y-\mu_y}{\sigma_y}\right)^2 - 2\rho\left(\frac{x-\mu_x}{\sigma_x}\right)\left(\frac{y-\mu_y}{\sigma_y}\right)$

Conditional mean and variance
  $\mathbb{E}[X \mid Y] = \mathbb{E}[X] + \rho\frac{\sigma_X}{\sigma_Y}(Y - \mathbb{E}[Y])$
  $\mathbb{V}[X \mid Y] = \sigma_X\sqrt{1-\rho^2}$

9.3 Multivariate Normal

Covariance matrix $\Sigma$ (precision matrix $\Sigma^{-1}$)
  $\Sigma = \begin{pmatrix}\mathbb{V}[X_1] & \cdots & \mathrm{Cov}[X_1, X_k]\\ \vdots & \ddots & \vdots\\ \mathrm{Cov}[X_k, X_1] & \cdots & \mathbb{V}[X_k]\end{pmatrix}$

If $X \sim \mathcal{N}(\mu, \Sigma)$,
  $f_X(x) = (2\pi)^{-n/2}|\Sigma|^{-1/2}\exp\left\{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right\}$

Properties
  $Z \sim \mathcal{N}(0, 1)$ and $X = \mu + \Sigma^{1/2}Z \implies X \sim \mathcal{N}(\mu, \Sigma)$
  $X \sim \mathcal{N}(\mu, \Sigma) \implies \Sigma^{-1/2}(X - \mu) \sim \mathcal{N}(0, 1)$
  $X \sim \mathcal{N}(\mu, \Sigma) \implies AX \sim \mathcal{N}(A\mu, A\Sigma A^T)$
  $X \sim \mathcal{N}(\mu, \Sigma)$ and $a$ a vector of length $k$ $\implies a^TX \sim \mathcal{N}(a^T\mu, a^T\Sigma a)$

10 Convergence

Let $\{X_1, X_2, \dots\}$ be a sequence of rvs and let $X$ be another rv. Let $F_n$ denote the cdf of $X_n$ and let $F$ denote the cdf of $X$.

Types of Convergence
  1. In distribution (weakly, in law): $X_n \xrightarrow{D} X$
     $\lim_{n\to\infty} F_n(t) = F(t)$ at all $t$ where $F$ is continuous
  2. In probability: $X_n \xrightarrow{P} X$
     $(\forall\varepsilon > 0)\ \lim_{n\to\infty}\mathbb{P}[|X_n - X| > \varepsilon] = 0$
  3. Almost surely (strongly): $X_n \xrightarrow{as} X$
     $\mathbb{P}\left[\lim_{n\to\infty} X_n = X\right] = \mathbb{P}\left[\omega \in \Omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)\right] = 1$
  4. In quadratic mean ($L_2$): $X_n \xrightarrow{qm} X$
     $\lim_{n\to\infty}\mathbb{E}\left[(X_n - X)^2\right] = 0$

Relationships
  $X_n \xrightarrow{qm} X \implies X_n \xrightarrow{P} X \implies X_n \xrightarrow{D} X$
  $X_n \xrightarrow{as} X \implies X_n \xrightarrow{P} X$
  $X_n \xrightarrow{D} X$ and $(\exists c \in \mathbb{R})\ \mathbb{P}[X = c] = 1 \implies X_n \xrightarrow{P} X$
  $X_n \xrightarrow{P} X$ and $Y_n \xrightarrow{P} Y \implies X_n + Y_n \xrightarrow{P} X + Y$
  $X_n \xrightarrow{qm} X$ and $Y_n \xrightarrow{qm} Y \implies X_n + Y_n \xrightarrow{qm} X + Y$
  $X_n \xrightarrow{P} X$ and $Y_n \xrightarrow{P} Y \implies X_nY_n \xrightarrow{P} XY$
  $X_n \xrightarrow{P} X \implies \varphi(X_n) \xrightarrow{P} \varphi(X)$
  $X_n \xrightarrow{D} X \implies \varphi(X_n) \xrightarrow{D} \varphi(X)$
  $X_n \xrightarrow{qm} b \iff \lim_{n\to\infty}\mathbb{E}[X_n] = b$ and $\lim_{n\to\infty}\mathbb{V}[X_n] = 0$
  $X_1, \dots, X_n$ iid, $\mathbb{E}[X] = \mu$, $\mathbb{V}[X] < \infty \implies \bar{X}_n \xrightarrow{qm} \mu$

Slutzky's Theorem
  $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{P} c \implies X_n + Y_n \xrightarrow{D} X + c$
  $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{P} c \implies X_nY_n \xrightarrow{D} cX$
  In general: $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{D} Y$ does not imply $X_n + Y_n \xrightarrow{D} X + Y$

10.1 Law of Large Numbers (LLN)

Let $\{X_1, \dots, X_n\}$ be a sequence of iid rvs with $\mathbb{E}[X_1] = \mu$ and $\mathbb{V}[X_1] < \infty$.
  Weak (WLLN): $\bar{X}_n \xrightarrow{P} \mu$ as $n \to \infty$
  Strong (SLLN): $\bar{X}_n \xrightarrow{as} \mu$ as $n \to \infty$

10.2 Central Limit Theorem (CLT)

Let $\{X_1, \dots, X_n\}$ be a sequence of iid rvs with $\mathbb{E}[X_1] = \mu$ and $\mathbb{V}[X_1] = \sigma^2$.
  $Z_n := \frac{\bar{X}_n - \mu}{\sqrt{\mathbb{V}[\bar{X}_n]}} = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{D} Z$  where $Z \sim \mathcal{N}(0, 1)$
  $\lim_{n\to\infty}\mathbb{P}[Z_n \le z] = \Phi(z)$,  $z \in \mathbb{R}$

CLT notations
  $Z_n \approx \mathcal{N}(0, 1)$
  $\bar{X}_n \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)$
  $\bar{X}_n - \mu \approx \mathcal{N}\left(0, \frac{\sigma^2}{n}\right)$
  $\sqrt{n}(\bar{X}_n - \mu) \approx \mathcal{N}(0, \sigma^2)$
  $\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \approx \mathcal{N}(0, 1)$

Continuity Correction
  $\mathbb{P}[\bar{X}_n \le x] \approx \Phi\left(\frac{x + \frac{1}{2} - \mu}{\sigma/\sqrt{n}}\right)$
  $\mathbb{P}[\bar{X}_n \ge x] \approx 1 - \Phi\left(\frac{x - \frac{1}{2} - \mu}{\sigma/\sqrt{n}}\right)$

Delta Method
  $Y_n \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right) \implies \varphi(Y_n) \approx \mathcal{N}\left(\varphi(\mu), (\varphi'(\mu))^2\frac{\sigma^2}{n}\right)$

11 Statistical Inference

Let $X_1, \dots, X_n \stackrel{iid}{\sim} F$ if not otherwise noted.

11.1 Point Estimation

  Point estimator $\hat{\theta}_n$ of $\theta$ is a rv: $\hat{\theta}_n = g(X_1, \dots, X_n)$
  $\mathrm{bias}(\hat{\theta}_n) = \mathbb{E}[\hat{\theta}_n] - \theta$
  Consistency: $\hat{\theta}_n \xrightarrow{P} \theta$
  Sampling distribution: $F(\hat{\theta}_n)$
  Standard error: $\mathrm{se}(\hat{\theta}_n) = \sqrt{\mathbb{V}[\hat{\theta}_n]}$
  Mean squared error: $\mathrm{mse} = \mathbb{E}\left[(\hat{\theta}_n - \theta)^2\right] = \mathrm{bias}(\hat{\theta}_n)^2 + \mathbb{V}[\hat{\theta}_n]$
  $\lim_{n\to\infty}\mathrm{bias}(\hat{\theta}_n) = 0$ and $\lim_{n\to\infty}\mathrm{se}(\hat{\theta}_n) = 0 \implies \hat{\theta}_n$ is consistent
  Asymptotic normality: $\frac{\hat{\theta}_n - \theta}{\mathrm{se}} \xrightarrow{D} \mathcal{N}(0, 1)$
  Slutzky's Theorem often lets us replace $\mathrm{se}(\hat{\theta}_n)$ by some (weakly) consistent estimator $\hat{\mathrm{se}}(\hat{\theta}_n)$.

A short simulation illustrating the CLT and the standard error of the sample mean follows.
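The following minimal sketch (numpy assumed; the exponential-data example is mine) illustrates the CLT approximation $\bar{X}_n \approx \mathcal{N}(\mu, \sigma^2/n)$ and the standard error of the sample mean:

```python
# Simulate many sample means from a skewed distribution and compare the
# empirical sd of the sample mean with the CLT value sigma/sqrt(n).
import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 10_000
samples = rng.exponential(scale=2.0, size=(reps, n))   # mu = 2, sigma = 2
means = samples.mean(axis=1)

print(means.mean())                 # ~2.0  (unbiased for mu)
print(means.std(ddof=1))            # ~ sigma/sqrt(n) = 2/sqrt(50) ~ 0.283
print(np.mean(np.abs(means - 2.0) <= 1.96 * 2.0 / np.sqrt(n)))  # ~0.95
```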

11.2 Normal-based Confidence Interval

Suppose $\hat{\theta}_n \approx \mathcal{N}(\theta, \hat{\mathrm{se}}^2)$. Let $z_{\alpha/2} = \Phi^{-1}(1-(\alpha/2))$, i.e., $\mathbb{P}[Z > z_{\alpha/2}] = \alpha/2$ and $\mathbb{P}[-z_{\alpha/2} < Z < z_{\alpha/2}] = 1-\alpha$ where $Z \sim \mathcal{N}(0, 1)$. Then
  $C_n = \hat{\theta}_n \pm z_{\alpha/2}\,\hat{\mathrm{se}}$

11.3 Empirical Distribution Function

Empirical Distribution Function (ECDF)
  $\hat{F}_n(x) = \frac{\sum_{i=1}^{n}I(X_i \le x)}{n}$,  $I(X_i \le x) = \begin{cases}1 & X_i \le x\\ 0 & X_i > x\end{cases}$

Properties (for any fixed $x$)
  $\mathbb{E}[\hat{F}_n(x)] = F(x)$
  $\mathbb{V}[\hat{F}_n(x)] = \frac{F(x)(1-F(x))}{n}$
  $\mathrm{mse} = \frac{F(x)(1-F(x))}{n} \to 0$
  $\hat{F}_n(x) \xrightarrow{P} F(x)$

Dvoretzky-Kiefer-Wolfowitz (DKW) Inequality ($X_1, \dots, X_n \sim F$)
  $\mathbb{P}\left[\sup_x\left|F(x) - \hat{F}_n(x)\right| > \varepsilon\right] \le 2e^{-2n\varepsilon^2}$

Nonparametric $1-\alpha$ confidence band for $F$
  $L(x) = \max\{\hat{F}_n(x) - \varepsilon_n, 0\}$,  $U(x) = \min\{\hat{F}_n(x) + \varepsilon_n, 1\}$
  $\varepsilon_n = \sqrt{\frac{1}{2n}\log\left(\frac{2}{\alpha}\right)}$
  $\mathbb{P}[L(x) \le F(x) \le U(x)\ \forall x] \ge 1-\alpha$

11.4 Statistical Functionals

  Statistical functional: $T(F)$
  Plug-in estimator of $\theta = T(F)$: $\hat{\theta}_n = T(\hat{F}_n)$
  Linear functional: $T(F) = \int\varphi(x)\,dF_X(x)$
  Plug-in estimator for linear functional: $T(\hat{F}_n) = \int\varphi(x)\,d\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n}\varphi(X_i)$
  Often: $T(\hat{F}_n) \approx \mathcal{N}\left(T(F), \hat{\mathrm{se}}^2\right) \implies T(\hat{F}_n) \pm z_{\alpha/2}\,\hat{\mathrm{se}}$
  $p^{\text{th}}$ quantile: $F^{-1}(p) = \inf\{x : F(x) \ge p\}$
  $\hat{\mu} = \bar{X}_n$
  $\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$
  $\hat{\kappa} = \frac{\frac{1}{n}\sum_{i=1}^{n}(X_i - \hat{\mu})^3}{\hat{\sigma}^3}$
  $\hat{\rho} = \frac{\sum_{i=1}^{n}(X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X}_n)^2}\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y}_n)^2}}$

12 Parametric Inference

Let $\mathfrak{F} = \{f(x; \theta) : \theta \in \Theta\}$ be a parametric model with parameter space $\Theta \subseteq \mathbb{R}^k$ and parameter $\theta = (\theta_1, \dots, \theta_k)$.

12.1 Method of Moments

$j^{\text{th}}$ moment
  $\alpha_j(\theta) = \mathbb{E}[X^j] = \int x^j\,dF_X(x)$

$j^{\text{th}}$ sample moment
  $\hat{\alpha}_j = \frac{1}{n}\sum_{i=1}^{n}X_i^j$

Method of Moments Estimator (MoM): $\hat{\theta}_n$ solves
  $\alpha_1(\hat{\theta}_n) = \hat{\alpha}_1,\ \alpha_2(\hat{\theta}_n) = \hat{\alpha}_2,\ \dots,\ \alpha_k(\hat{\theta}_n) = \hat{\alpha}_k$

Properties of the MoM estimator
  $\hat{\theta}_n$ exists with probability tending to 1
  Consistency: $\hat{\theta}_n \xrightarrow{P} \theta$
  Asymptotic normality: $\sqrt{n}(\hat{\theta} - \theta) \xrightarrow{D} \mathcal{N}(0, \Sigma)$
    where $\Sigma = g\,\mathbb{E}[YY^T]\,g^T$, $Y = (X, X^2, \dots, X^k)^T$,
    $g = (g_1, \dots, g_k)$ and $g_j = \frac{\partial}{\partial\theta}\alpha_j^{-1}(\theta)$

A short method-of-moments example follows.
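As a concrete illustration (my example, not from the original): the $\mathrm{Gamma}(\alpha, \beta)$ model has $\mathbb{E}[X] = \alpha\beta$ and $\mathbb{V}[X] = \alpha\beta^2$, so matching the first two sample moments gives closed-form MoM estimates.

```python
# Method-of-moments estimates for Gamma(alpha, beta) with E[X] = alpha*beta
# and V[X] = alpha*beta^2: beta_hat = s2/xbar, alpha_hat = xbar/beta_hat.
import numpy as np

rng = np.random.default_rng(2)
x = rng.gamma(shape=3.0, scale=2.0, size=5_000)   # true alpha=3, beta=2

xbar = x.mean()
s2 = x.var()                    # second central sample moment
beta_hat = s2 / xbar
alpha_hat = xbar / beta_hat

print(alpha_hat, beta_hat)      # close to (3, 2)
```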

12.2 Maximum Likelihood

Likelihood: $\mathcal{L}_n: \Theta \to [0, \infty)$
  $\mathcal{L}_n(\theta) = \prod_{i=1}^{n}f(X_i; \theta)$

Log-likelihood
  $\ell_n(\theta) = \log\mathcal{L}_n(\theta) = \sum_{i=1}^{n}\log f(X_i; \theta)$

Maximum Likelihood Estimator (mle)
  $\mathcal{L}_n(\hat{\theta}_n) = \sup_{\theta}\mathcal{L}_n(\theta)$

Score Function
  $s(X; \theta) = \frac{\partial}{\partial\theta}\log f(X; \theta)$

Fisher Information
  $I(\theta) = \mathbb{V}_{\theta}[s(X; \theta)]$,  $I_n(\theta) = n\,I(\theta)$

Fisher Information (exponential family)
  $I(\theta) = -\mathbb{E}_{\theta}\left[\frac{\partial}{\partial\theta}s(X; \theta)\right]$

Observed Fisher Information
  $I_n^{obs}(\theta) = -\frac{\partial^2}{\partial\theta^2}\sum_{i=1}^{n}\log f(X_i; \theta)$

Properties of the mle
  Consistency: $\hat{\theta}_n \xrightarrow{P} \theta$
  Equivariance: $\hat{\theta}_n$ is the mle $\implies \varphi(\hat{\theta}_n)$ is the mle of $\varphi(\theta)$
  Asymptotic normality:
    1. $\mathrm{se} \approx \sqrt{1/I_n(\theta)}$ and $\frac{\hat{\theta}_n - \theta}{\mathrm{se}} \xrightarrow{D} \mathcal{N}(0, 1)$
    2. $\hat{\mathrm{se}} \approx \sqrt{1/I_n(\hat{\theta}_n)}$ and $\frac{\hat{\theta}_n - \theta}{\hat{\mathrm{se}}} \xrightarrow{D} \mathcal{N}(0, 1)$
  Asymptotic optimality (or efficiency), i.e., smallest variance for large samples. If $\tilde{\theta}_n$ is any other estimator, the asymptotic relative efficiency is
    $\mathrm{are}(\tilde{\theta}_n, \hat{\theta}_n) = \frac{\mathbb{V}[\hat{\theta}_n]}{\mathbb{V}[\tilde{\theta}_n]} \le 1$
  Approximately the Bayes estimator

12.2.1 Delta Method

If $\tau = \varphi(\theta)$, where $\varphi$ is differentiable and $\varphi'(\theta) \ne 0$:
  $\frac{\hat{\tau}_n - \tau}{\hat{\mathrm{se}}(\hat{\tau})} \xrightarrow{D} \mathcal{N}(0, 1)$
where $\hat{\tau} = \varphi(\hat{\theta})$, $\hat{\theta}$ is the mle of $\theta$, and
  $\hat{\mathrm{se}}(\hat{\tau}) = |\varphi'(\hat{\theta})|\,\hat{\mathrm{se}}(\hat{\theta}_n)$

A short example combining the mle and the delta method follows.
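A minimal sketch combining the two ideas (my Poisson example, not from the original): for $X_i \sim \mathrm{Po}(\lambda)$ the mle is $\hat{\lambda} = \bar{X}_n$ with $I_n(\lambda) = n/\lambda$, and the delta method gives a standard error for $\hat{\psi} = e^{-\hat{\lambda}}$, the mle of $\mathbb{P}[X = 0]$:

```python
# mle and delta method for Poisson data: lambda_hat = xbar,
# se(lambda_hat) = sqrt(lambda_hat / n); psi = exp(-lambda) with
# se(psi_hat) = |psi'(lambda_hat)| * se(lambda_hat) = exp(-lambda_hat) * se.
import numpy as np

rng = np.random.default_rng(3)
x = rng.poisson(lam=2.5, size=400)
n = x.size

lam_hat = x.mean()                          # mle
se_lam = np.sqrt(lam_hat / n)               # 1 / sqrt(I_n(lambda_hat))
psi_hat = np.exp(-lam_hat)                  # mle of P[X = 0] by equivariance
se_psi = np.exp(-lam_hat) * se_lam          # delta method

print(lam_hat, lam_hat - 1.96 * se_lam, lam_hat + 1.96 * se_lam)
print(psi_hat, psi_hat - 1.96 * se_psi, psi_hat + 1.96 * se_psi)
```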

12.3 Multiparameter Models

Let $\theta = (\theta_1, \dots, \theta_k)$ and $\hat{\theta} = (\hat{\theta}_1, \dots, \hat{\theta}_k)$ be the mle.
  $H_{jj} = \frac{\partial^2\ell_n}{\partial\theta_j^2}$,  $H_{jk} = \frac{\partial^2\ell_n}{\partial\theta_j\,\partial\theta_k}$

Fisher Information Matrix
  $I_n(\theta) = -\begin{pmatrix}\mathbb{E}_{\theta}[H_{11}] & \cdots & \mathbb{E}_{\theta}[H_{1k}]\\ \vdots & \ddots & \vdots\\ \mathbb{E}_{\theta}[H_{k1}] & \cdots & \mathbb{E}_{\theta}[H_{kk}]\end{pmatrix}$

Under appropriate regularity conditions
  $(\hat{\theta} - \theta) \approx \mathcal{N}(0, J_n)$
with $J_n(\theta) = I_n^{-1}(\theta)$. Further, if $\hat{\theta}_j$ is the $j^{\text{th}}$ component of $\hat{\theta}$, then
  $\frac{\hat{\theta}_j - \theta_j}{\hat{\mathrm{se}}_j} \xrightarrow{D} \mathcal{N}(0, 1)$
where $\hat{\mathrm{se}}_j^2 = J_n(j, j)$ and $\mathrm{Cov}[\hat{\theta}_j, \hat{\theta}_k] = J_n(j, k)$.

12.3.1 Multiparameter Delta Method

Let $\tau = \varphi(\theta_1, \dots, \theta_k)$ be a function and let the gradient of $\varphi$ be
  $\nabla\varphi = \left(\frac{\partial\varphi}{\partial\theta_1}, \dots, \frac{\partial\varphi}{\partial\theta_k}\right)^T$

Suppose $\nabla\varphi\big|_{\theta=\hat{\theta}} \ne 0$ and $\hat{\tau} = \varphi(\hat{\theta})$. Then
  $\frac{\hat{\tau} - \tau}{\hat{\mathrm{se}}(\hat{\tau})} \xrightarrow{D} \mathcal{N}(0, 1)$
where
  $\hat{\mathrm{se}}(\hat{\tau}) = \sqrt{\left(\hat{\nabla}\varphi\right)^T\hat{J}_n\left(\hat{\nabla}\varphi\right)}$,  $\hat{J}_n = J_n(\hat{\theta})$,  $\hat{\nabla}\varphi = \nabla\varphi\big|_{\theta=\hat{\theta}}$

12.4 Parametric Bootstrap

Sample from $f(x; \hat{\theta}_n)$ instead of from $\hat{F}_n$, where $\hat{\theta}_n$ could be the mle or method of moments estimator.

13 Hypothesis Testing

  $H_0: \theta \in \Theta_0$  versus  $H_1: \theta \in \Theta_1$

Definitions
  Null hypothesis $H_0$
  Alternative hypothesis $H_1$
  Simple hypothesis: $\theta = \theta_0$
  Composite hypothesis: $\theta > \theta_0$ or $\theta < \theta_0$
  Two-sided test: $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$
  One-sided test: $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$
  Critical value $c$
  Test statistic $T$
  Rejection region $R = \{x : T(x) > c\}$
  Power function $\beta(\theta) = \mathbb{P}_{\theta}[X \in R]$
  Power of a test: $1 - \mathbb{P}[\text{Type II error}] = \inf_{\theta\in\Theta_1}\beta(\theta)$
  Test size: $\alpha = \mathbb{P}[\text{Type I error}] = \sup_{\theta\in\Theta_0}\beta(\theta)$

                Retain $H_0$              Reject $H_0$
  $H_0$ true    correct                   Type I error ($\alpha$)
  $H_1$ true    Type II error ($\beta$)   correct (power)

p-value
  p-value $= \sup_{\theta\in\Theta_0}\mathbb{P}_{\theta}[T(X) \ge T(x)] = \inf\{\alpha : T(x) \in R_{\alpha}\}$
  p-value $= \sup_{\theta\in\Theta_0}\mathbb{P}_{\theta}[T(X^{\star}) \ge T(X)] = \inf\{\alpha : T(X) \in R_{\alpha}\}$  since $T(X^{\star}) \sim F_{\theta}$
  One-sided with continuous $T$: p-value $= 1 - F_{\theta_0}(T(X))$

  p-value        evidence
  < 0.01         very strong evidence against $H_0$
  0.01 - 0.05    strong evidence against $H_0$
  0.05 - 0.1     weak evidence against $H_0$
  > 0.1          little or no evidence against $H_0$

Wald Test
  Two-sided test: reject $H_0$ when $|W| > z_{\alpha/2}$ where $W = \frac{\hat{\theta} - \theta_0}{\hat{\mathrm{se}}}$
  $\mathbb{P}[|W| > z_{\alpha/2}] \to \alpha$
  p-value $= \mathbb{P}_{\theta_0}[|W| > |w|] \approx \mathbb{P}[|Z| > |w|] = 2\Phi(-|w|)$

Likelihood Ratio Test (LRT)
  $T(X) = \frac{\sup_{\theta\in\Theta}\mathcal{L}_n(\theta)}{\sup_{\theta\in\Theta_0}\mathcal{L}_n(\theta)} = \frac{\mathcal{L}_n(\hat{\theta}_n)}{\mathcal{L}_n(\hat{\theta}_{n,0})}$
  $\lambda(X) = 2\log T(X) \xrightarrow{D} \chi^2_{r-q}$  where  $\chi^2_k = \sum_{i=1}^{k}Z_i^2$ with $Z_1, \dots, Z_k \stackrel{iid}{\sim} \mathcal{N}(0, 1)$
  p-value $= \mathbb{P}_{\theta_0}[\lambda(X) > \lambda(x)] \approx \mathbb{P}[\chi^2_{r-q} > \lambda(x)]$

Multinomial LRT
  Let $\hat{p}_n = \left(\frac{X_1}{n}, \dots, \frac{X_k}{n}\right)$ be the mle.
  $T(X) = \frac{\mathcal{L}_n(\hat{p}_n)}{\mathcal{L}_n(p_0)} = \prod_{j=1}^{k}\left(\frac{\hat{p}_j}{p_{0j}}\right)^{X_j}$
  $\lambda(X) = 2\sum_{j=1}^{k}X_j\log\left(\frac{\hat{p}_j}{p_{0j}}\right) \xrightarrow{D} \chi^2_{k-1}$
  The approximate size-$\alpha$ LRT rejects $H_0$ when $\lambda(X) \ge \chi^2_{k-1,\alpha}$.

Pearson $\chi^2$ Test
  $T = \sum_{j=1}^{k}\frac{(X_j - \mathbb{E}[X_j])^2}{\mathbb{E}[X_j]}$  where $\mathbb{E}[X_j] = np_{0j}$ under $H_0$
  $T \xrightarrow{D} \chi^2_{k-1}$,  p-value $= \mathbb{P}[\chi^2_{k-1} > T(x)]$
  Faster convergence to $\chi^2_{k-1}$ than the LRT, hence preferable for small $n$.

Independence Testing
  $I$ rows, $J$ columns, $X$ multinomial sample of size $n = I \cdot J$
  mles unconstrained: $\hat{p}_{ij} = \frac{X_{ij}}{n}$
  mles under $H_0$: $\hat{p}_{0ij} = \hat{p}_{i\cdot}\,\hat{p}_{\cdot j} = \frac{X_{i\cdot}}{n}\frac{X_{\cdot j}}{n}$
  LRT: $\lambda = 2\sum_{i=1}^{I}\sum_{j=1}^{J}X_{ij}\log\left(\frac{nX_{ij}}{X_{i\cdot}X_{\cdot j}}\right)$
  Pearson $\chi^2$: $T = \sum_{i=1}^{I}\sum_{j=1}^{J}\frac{(X_{ij} - \mathbb{E}[X_{ij}])^2}{\mathbb{E}[X_{ij}]}$
  LRT and Pearson $\xrightarrow{D} \chi^2_{\nu}$, where $\nu = (I-1)(J-1)$

14 Bayesian Inference

Bayes' Theorem
  $f(\theta \mid x) = \frac{f(x \mid \theta)\,f(\theta)}{f(x^n)} = \frac{f(x \mid \theta)\,f(\theta)}{\int f(x \mid \theta)\,f(\theta)\,d\theta} \propto \mathcal{L}_n(\theta)\,f(\theta)$

Definitions
  $X^n = (X_1, \dots, X_n)$,  $x^n = (x_1, \dots, x_n)$
  Prior density $f(\theta)$
  Likelihood $f(x^n \mid \theta)$: joint density of the data
    In particular, $X^n$ iid $\implies f(x^n \mid \theta) = \prod_{i=1}^{n}f(x_i \mid \theta) = \mathcal{L}_n(\theta)$
  Posterior density $f(\theta \mid x^n)$
  Normalizing constant $c_n = f(x^n) = \int f(x \mid \theta)\,f(\theta)\,d\theta$
  Kernel: part of a density that depends on $\theta$
  Posterior mean $\bar{\theta}_n = \int\theta\,f(\theta \mid x^n)\,d\theta = \frac{\int\theta\,\mathcal{L}_n(\theta)\,f(\theta)\,d\theta}{\int\mathcal{L}_n(\theta)\,f(\theta)\,d\theta}$

14.1 Credible Intervals

$1-\alpha$ posterior interval
  $\mathbb{P}[\theta \in (a, b) \mid x^n] = \int_a^b f(\theta \mid x^n)\,d\theta = 1-\alpha$

$1-\alpha$ equal-tail credible interval
  $\int_{-\infty}^{a}f(\theta \mid x^n)\,d\theta = \int_{b}^{\infty}f(\theta \mid x^n)\,d\theta = \alpha/2$

$1-\alpha$ highest posterior density (HPD) region $R_n$
  1. $\mathbb{P}[\theta \in R_n] = 1-\alpha$
  2. $R_n = \{\theta : f(\theta \mid x^n) > k\}$ for some $k$
  $R_n$ is unimodal $\implies R_n$ is an interval

14.2 Function of Parameters

Let $\tau = \varphi(\theta)$ and $A = \{\theta : \varphi(\theta) \le \tau\}$.

Posterior CDF for $\tau$
  $H(\tau \mid x^n) = \mathbb{P}[\varphi(\theta) \le \tau \mid x^n] = \int_A f(\theta \mid x^n)\,d\theta$
  $h(\tau \mid x^n) = H'(\tau \mid x^n)$

Bayesian Delta Method
  $\tau \mid X^n \approx \mathcal{N}\left(\varphi(\hat{\theta}),\ \left(\hat{\mathrm{se}}\,\varphi'(\hat{\theta})\right)^2\right)$

14.3 Priors

Choice
  Subjective Bayesianism: the prior should incorporate as much detail as possible of the researcher's a priori knowledge, via prior elicitation.
  Objective Bayesianism: the prior should incorporate as little detail as possible (non-informative prior).
  Robust Bayesianism: consider various priors and determine the sensitivity of our inferences to changes in the prior.

Types
  Flat: $f(\theta) \propto$ constant
  Proper: $\int_{-\infty}^{\infty}f(\theta)\,d\theta = 1$
  Improper: $\int_{-\infty}^{\infty}f(\theta)\,d\theta = \infty$
  Jeffreys' prior (transformation-invariant): $f(\theta) \propto \sqrt{I(\theta)}$; multiparameter: $f(\theta) \propto \sqrt{\det(I(\theta))}$
  Conjugate: $f(\theta)$ and $f(\theta \mid x^n)$ belong to the same parametric family

14.3.1 Conjugate Priors

Discrete likelihood

  Likelihood             Conjugate prior   Posterior hyperparameters
  Bernoulli(p)           Beta(α, β)        α + Σᵢ xᵢ,  β + n − Σᵢ xᵢ
  Binomial(p)            Beta(α, β)        α + Σᵢ xᵢ,  β + Σᵢ Nᵢ − Σᵢ xᵢ
  Negative Binomial(p)   Beta(α, β)        α + rn,  β + Σᵢ xᵢ
  Poisson(λ)             Gamma(α, β)       α + Σᵢ xᵢ,  β + n
  Multinomial(p)         Dirichlet(α)      α + Σᵢ x⁽ⁱ⁾
  Geometric(p)           Beta(α, β)        α + n,  β + Σᵢ xᵢ

Continuous likelihood (subscript c denotes a known constant)

  Likelihood             Conjugate prior                        Posterior hyperparameters
  Uniform(0, θ)          Pareto(x_m, k)                         max{x_(n), x_m},  k + n
  Exponential(λ)         Gamma(α, β)                            α + n,  β + Σᵢ xᵢ
  Normal(μ, σ_c²)        Normal(μ₀, σ₀²)                        (μ₀/σ₀² + Σᵢ xᵢ/σ_c²)/(1/σ₀² + n/σ_c²),  (1/σ₀² + n/σ_c²)⁻¹
  Normal(μ_c, σ²)        Scaled Inverse Chi-square(ν, σ₀²)      ν + n,  (νσ₀² + Σᵢ (xᵢ − μ_c)²)/(ν + n)
  Normal(μ, σ²)          Normal-scaled Inverse Gamma(μ₀, ν, α, β)   (νμ₀ + n x̄)/(ν + n),  ν + n,  α + n/2,  β + ½ Σᵢ (xᵢ − x̄)² + nν(x̄ − μ₀)²/(2(n + ν))
  MVN(μ, Σ_c)            MVN(μ₀, Σ₀)                            (Σ₀⁻¹ + nΣ_c⁻¹)⁻¹(Σ₀⁻¹μ₀ + nΣ_c⁻¹x̄),  (Σ₀⁻¹ + nΣ_c⁻¹)⁻¹
  MVN(μ_c, Σ)            Inverse-Wishart(ν, Ψ)                  n + ν,  Ψ + Σᵢ (xᵢ − μ_c)(xᵢ − μ_c)ᵀ
  Pareto(x_mc, k)        Gamma(α, β)                            α + n,  β + Σᵢ log(xᵢ/x_mc)
  Pareto(x_m, k_c)       Pareto(x₀, k₀)                         x₀,  k₀ − k_c n  (where k₀ > k_c n)
  Gamma(α_c, β)          Gamma(α₀, β₀)                          α₀ + nα_c,  β₀ + Σᵢ xᵢ

14.4 Bayesian Testing

If $H_0: \theta \in \Theta_0$:
  Prior probability $\mathbb{P}[H_0] = \int_{\Theta_0}f(\theta)\,d\theta$
  Posterior probability $\mathbb{P}[H_0 \mid x^n] = \int_{\Theta_0}f(\theta \mid x^n)\,d\theta$

Let $H_0, \dots, H_{K-1}$ be $K$ hypotheses. Suppose $\theta \sim f(\theta \mid H_k)$:
  $\mathbb{P}[H_k \mid x^n] = \frac{f(x^n \mid H_k)\,\mathbb{P}[H_k]}{\sum_{k=1}^{K}f(x^n \mid H_k)\,\mathbb{P}[H_k]}$

Marginal likelihood
  $f(x^n \mid H_i) = \int_{\Theta}f(x^n \mid \theta, H_i)\,f(\theta \mid H_i)\,d\theta$

Posterior odds (of $H_i$ relative to $H_j$)
  $\frac{\mathbb{P}[H_i \mid x^n]}{\mathbb{P}[H_j \mid x^n]} = \underbrace{\frac{f(x^n \mid H_i)}{f(x^n \mid H_j)}}_{\text{Bayes factor } BF_{ij}}\times\underbrace{\frac{\mathbb{P}[H_i]}{\mathbb{P}[H_j]}}_{\text{prior odds}}$

Bayes factor
  $\log_{10}BF_{10}$   $BF_{10}$    evidence
  0 - 0.5             1 - 1.5      Weak
  0.5 - 1             1.5 - 10     Moderate
  1 - 2               10 - 100     Strong
  > 2                 > 100        Decisive

  $p^* = \frac{\frac{p}{1-p}\,BF_{10}}{1 + \frac{p}{1-p}\,BF_{10}}$  where $p = \mathbb{P}[H_1]$ and $p^* = \mathbb{P}[H_1 \mid x^n]$

A small conjugate-prior example follows.
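A minimal sketch of the first row of the table, the Beta-Bernoulli conjugate pair (my numbers, scipy assumed; not from the original): the posterior is available in closed form, so no sampling is needed.

```python
# Beta(a, b) prior on p with Bernoulli data: posterior is
# Beta(a + sum(x), b + n - sum(x)). Posterior mean and a 95% equal-tail
# credible interval via the Beta quantile function.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.binomial(1, 0.3, size=100)     # Bernoulli(0.3) data
a, b = 2.0, 2.0                        # prior hyperparameters

a_post = a + x.sum()
b_post = b + x.size - x.sum()
posterior = stats.beta(a_post, b_post)

print(posterior.mean())                            # posterior mean of p
print(posterior.ppf(0.025), posterior.ppf(0.975))  # equal-tail credible interval
```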

15 Exponential Family

Scalar parameter
  $f_X(x \mid \theta) = h(x)\exp\{\eta(\theta)T(x) - A(\theta)\} = h(x)\,g(\theta)\exp\{\eta(\theta)T(x)\}$

Vector parameter
  $f_X(x \mid \theta) = h(x)\exp\left\{\sum_{i=1}^{s}\eta_i(\theta)T_i(x) - A(\theta)\right\} = h(x)\exp\{\eta(\theta)\cdot T(x) - A(\theta)\} = h(x)\,g(\theta)\exp\{\eta(\theta)\cdot T(x)\}$

Natural form
  $f_X(x \mid \eta) = h(x)\exp\{\eta\cdot T(x) - A(\eta)\} = h(x)\,g(\eta)\exp\{\eta\cdot T(x)\} = h(x)\,g(\eta)\exp\{\eta^T T(x)\}$

16 Sampling Methods

16.1 The Bootstrap

Let $T_n = g(X_1, \dots, X_n)$ be a statistic.

1. Estimate $\mathbb{V}_F[T_n]$ with $\mathbb{V}_{\hat{F}_n}[T_n]$.
2. Approximate $\mathbb{V}_{\hat{F}_n}[T_n]$ using simulation:
   (a) Repeat the following $B$ times to get $T^*_{n,1}, \dots, T^*_{n,B}$, an iid sample from the sampling distribution implied by $\hat{F}_n$:
       i. Sample uniformly $X^*_1, \dots, X^*_n \sim \hat{F}_n$.
       ii. Compute $T^*_n = g(X^*_1, \dots, X^*_n)$.
   (b) Then
       $v_{boot} = \hat{\mathbb{V}}_{\hat{F}_n}[T_n] = \frac{1}{B}\sum_{b=1}^{B}\left(T^*_{n,b} - \frac{1}{B}\sum_{r=1}^{B}T^*_{n,r}\right)^2$

16.1.1 Bootstrap Confidence Intervals

Normal-based interval
  $T_n \pm z_{\alpha/2}\,\hat{\mathrm{se}}_{boot}$

Pivotal interval
  1. Location parameter $\theta = T(F)$
  2. Pivot $R_n = \hat{\theta}_n - \theta$
  3. Let $H(r) = \mathbb{P}[R_n \le r]$ be the cdf of $R_n$
  4. Let $R^*_{n,b} = \hat{\theta}^*_{n,b} - \hat{\theta}_n$. Approximate $H$ using the bootstrap:
     $\hat{H}(r) = \frac{1}{B}\sum_{b=1}^{B}I(R^*_{n,b} \le r)$
  5. Let $\theta^*_{\beta}$ denote the $\beta$ sample quantile of $(\hat{\theta}^*_{n,1}, \dots, \hat{\theta}^*_{n,B})$
  6. Let $r^*_{\beta}$ denote the $\beta$ sample quantile of $(R^*_{n,1}, \dots, R^*_{n,B})$, i.e., $r^*_{\beta} = \theta^*_{\beta} - \hat{\theta}_n$
  7. Then, an approximate $1-\alpha$ confidence interval is $C_n = (\hat{a}, \hat{b})$ with
     $\hat{a} = \hat{\theta}_n - \hat{H}^{-1}\left(1-\frac{\alpha}{2}\right) = \hat{\theta}_n - r^*_{1-\alpha/2} = 2\hat{\theta}_n - \theta^*_{1-\alpha/2}$
     $\hat{b} = \hat{\theta}_n - \hat{H}^{-1}\left(\frac{\alpha}{2}\right) = \hat{\theta}_n - r^*_{\alpha/2} = 2\hat{\theta}_n - \theta^*_{\alpha/2}$

Percentile interval
  $C_n = \left(\theta^*_{\alpha/2},\ \theta^*_{1-\alpha/2}\right)$

A minimal bootstrap sketch follows.
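A minimal sketch of the nonparametric bootstrap for the median (my example, numpy assumed), producing the normal-based, pivotal, and percentile intervals described above:

```python
# Bootstrap the sample median: resample with replacement B times, then form
# the normal-based, pivotal, and percentile 95% intervals.
import numpy as np

rng = np.random.default_rng(5)
x = rng.lognormal(mean=0.0, sigma=1.0, size=200)
theta_hat = np.median(x)

B = 2000
boot = np.array([np.median(rng.choice(x, size=x.size, replace=True))
                 for _ in range(B)])

se_boot = boot.std(ddof=1)
lo_q, hi_q = np.quantile(boot, [0.025, 0.975])

normal_ci = (theta_hat - 1.96 * se_boot, theta_hat + 1.96 * se_boot)
pivotal_ci = (2 * theta_hat - hi_q, 2 * theta_hat - lo_q)
percentile_ci = (lo_q, hi_q)
print(normal_ci, pivotal_ci, percentile_ci)
```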

16.2 Rejection Sampling

Setup
  We can easily sample from $g(\theta)$.
  We want to sample from $h(\theta)$, but it is difficult.
  We know $h(\theta)$ up to a proportionality constant: $h(\theta) = \frac{k(\theta)}{\int k(\theta)\,d\theta}$
  Envelope condition: we can find $M > 0$ such that $k(\theta) \le Mg(\theta)$ for all $\theta$

Algorithm
  1. Draw $\theta^{cand} \sim g(\theta)$
  2. Generate $u \sim \mathrm{Unif}(0, 1)$
  3. Accept $\theta^{cand}$ if $u \le \frac{k(\theta^{cand})}{Mg(\theta^{cand})}$
  4. Repeat until $B$ values of $\theta^{cand}$ have been accepted

Example
  We can easily sample from the prior $g(\theta) = f(\theta)$.
  Target is the posterior with $h(\theta) \propto k(\theta) = f(x^n \mid \theta)\,f(\theta)$.
  Envelope condition: $f(x^n \mid \theta) \le f(x^n \mid \hat{\theta}_n) = \mathcal{L}_n(\hat{\theta}_n) \equiv M$.
  Algorithm:
    1. Draw $\theta^{cand} \sim f(\theta)$
    2. Generate $u \sim \mathrm{Unif}(0, 1)$
    3. Accept $\theta^{cand}$ if $u \le \frac{\mathcal{L}_n(\theta^{cand})}{\mathcal{L}_n(\hat{\theta}_n)}$

16.3 Importance Sampling

Sample from an importance function $g$ rather than the target density $h$.

Algorithm to obtain an approximation to $\mathbb{E}[q(\theta) \mid x^n]$:
  1. Sample from the prior $\theta_1, \dots, \theta_B \stackrel{iid}{\sim} f(\theta)$
  2. For each $i = 1, \dots, B$, calculate $w_i = \frac{\mathcal{L}_n(\theta_i)}{\sum_{i=1}^{B}\mathcal{L}_n(\theta_i)}$
  3. $\mathbb{E}[q(\theta) \mid x^n] \approx \sum_{i=1}^{B}q(\theta_i)\,w_i$

A minimal rejection-sampling sketch follows.
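A minimal sketch of rejection sampling (my example, not from the original): drawing from a Beta(2, 5) target using a Unif(0, 1) proposal, with M taken as the maximum of the unnormalized density.

```python
# Rejection sampling from k(x) = x * (1-x)^4 (unnormalized Beta(2, 5))
# using g = Unif(0, 1) as the proposal and M >= max k(x).
import numpy as np

rng = np.random.default_rng(6)

def k(x):
    return x * (1.0 - x) ** 4          # unnormalized target density

grid = np.linspace(0.0, 1.0, 1001)
M = k(grid).max()                      # envelope constant, k(x) <= M * g(x)

samples = []
while len(samples) < 10_000:
    cand = rng.uniform(0.0, 1.0)
    u = rng.uniform(0.0, 1.0)
    if u <= k(cand) / M:               # accept with probability k/(M g)
        samples.append(cand)

samples = np.array(samples)
print(samples.mean())                  # ~ 2 / (2 + 5) = 0.286
```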

17 Decision Theory

Definitions
  Unknown quantity affecting our decision: $\theta \in \Theta$
  Decision rule: synonymous with an estimator $\hat{\theta}$
  Action $a \in \mathcal{A}$: possible value of the decision rule. In the estimation context, the action is just an estimate of $\theta$, $\hat{\theta}(x)$.
  Loss function $L$: consequences of taking action $a$ when the true state is $\theta$, i.e., the discrepancy between $\theta$ and $\hat{\theta}$; $L: \Theta \times \mathcal{A} \to [-k, \infty)$.

Loss functions
  Squared error loss: $L(\theta, a) = (\theta - a)^2$
  Linear loss: $L(\theta, a) = \begin{cases}K_1(\theta - a) & a - \theta < 0\\ K_2(a - \theta) & a - \theta \ge 0\end{cases}$
  Absolute error loss: $L(\theta, a) = |\theta - a|$ (linear loss with $K_1 = K_2$)
  $L_p$ loss: $L(\theta, a) = |\theta - a|^p$
  Zero-one loss: $L(\theta, a) = \begin{cases}0 & a = \theta\\ 1 & a \ne \theta\end{cases}$

17.1 Risk

Posterior risk
  $r(\hat{\theta} \mid x) = \int L(\theta, \hat{\theta}(x))\,f(\theta \mid x)\,d\theta = \mathbb{E}_{\theta|X}\left[L(\theta, \hat{\theta}(x))\right]$

(Frequentist) risk
  $R(\theta, \hat{\theta}) = \int L(\theta, \hat{\theta}(x))\,f(x \mid \theta)\,dx = \mathbb{E}_{X|\theta}\left[L(\theta, \hat{\theta}(X))\right]$

Bayes risk
  $r(f, \hat{\theta}) = \int\!\!\int L(\theta, \hat{\theta}(x))\,f(x, \theta)\,dx\,d\theta = \mathbb{E}_{\theta,X}\left[L(\theta, \hat{\theta}(X))\right]$
  $r(f, \hat{\theta}) = \mathbb{E}_{\theta}\left[\mathbb{E}_{X|\theta}\left[L(\theta, \hat{\theta}(X))\right]\right] = \mathbb{E}_{\theta}\left[R(\theta, \hat{\theta})\right]$
  $r(f, \hat{\theta}) = \mathbb{E}_{X}\left[\mathbb{E}_{\theta|X}\left[L(\theta, \hat{\theta}(X))\right]\right] = \mathbb{E}_{X}\left[r(\hat{\theta} \mid X)\right]$

17.2 Admissibility

  $\hat{\theta}'$ dominates $\hat{\theta}$ if
    $\forall\theta:\ R(\theta, \hat{\theta}') \le R(\theta, \hat{\theta})$
    $\exists\theta:\ R(\theta, \hat{\theta}') < R(\theta, \hat{\theta})$
  $\hat{\theta}$ is inadmissible if there is at least one other estimator $\hat{\theta}'$ that dominates it. Otherwise it is called admissible.

17.3 Bayes Rule

Bayes rule (or Bayes estimator)
  $r(f, \hat{\theta}) = \inf_{\tilde{\theta}}r(f, \tilde{\theta})$
  $\hat{\theta}(x) = \inf_{a}r(a \mid x)\ \forall x \implies r(f, \hat{\theta}) = \int r(\hat{\theta} \mid x)\,f(x)\,dx$

Theorems
  Squared error loss: posterior mean
  Absolute error loss: posterior median
  Zero-one loss: posterior mode

17.4 Minimax Rules

Maximum risk
  $\bar{R}(\hat{\theta}) = \sup_{\theta}R(\theta, \hat{\theta})$,  $\bar{R}(a) = \sup_{\theta}R(\theta, a)$

Minimax rule
  $\sup_{\theta}R(\theta, \hat{\theta}) = \inf_{\tilde{\theta}}\bar{R}(\tilde{\theta}) = \inf_{\tilde{\theta}}\sup_{\theta}R(\theta, \tilde{\theta})$
  $\hat{\theta}$ = Bayes rule and $\exists c:\ R(\theta, \hat{\theta}) = c \implies \hat{\theta}$ is minimax

Least favorable prior
  $\hat{\theta}^f$ = Bayes rule and $R(\theta, \hat{\theta}^f) \le r(f, \hat{\theta}^f)\ \forall\theta \implies \hat{\theta}^f$ is minimax and $f$ is least favorable

18 Linear Regression

Definitions
  Response variable $Y$
  Covariate $X$ (aka predictor variable or feature)

18.1 Simple Linear Regression

Model
  $Y_i = \beta_0 + \beta_1X_i + \epsilon_i$,  $\mathbb{E}[\epsilon_i \mid X_i] = 0$,  $\mathbb{V}[\epsilon_i \mid X_i] = \sigma^2$

Fitted line
  $\hat{r}(x) = \hat{\beta}_0 + \hat{\beta}_1x$

Predicted (fitted) values
  $\hat{Y}_i = \hat{r}(X_i)$

Residuals
  $\hat{\epsilon}_i = Y_i - \hat{Y}_i = Y_i - \left(\hat{\beta}_0 + \hat{\beta}_1X_i\right)$

Residual sums of squares (rss)
  $\mathrm{rss}(\hat{\beta}_0, \hat{\beta}_1) = \sum_{i=1}^{n}\hat{\epsilon}_i^2$

Least square estimates
  $\hat{\beta}^T = (\hat{\beta}_0, \hat{\beta}_1)^T$: minimizes rss over $\beta_0, \beta_1$
  $\hat{\beta}_0 = \bar{Y}_n - \hat{\beta}_1\bar{X}_n$
  $\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sum_{i=1}^{n}(X_i - \bar{X}_n)^2} = \frac{\sum_{i=1}^{n}X_iY_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n}X_i^2 - n\bar{X}^2}$
  $\mathbb{E}[\hat{\beta} \mid X^n] = \begin{pmatrix}\beta_0\\ \beta_1\end{pmatrix}$
  $\mathbb{V}[\hat{\beta} \mid X^n] = \frac{\sigma^2}{ns_X^2}\begin{pmatrix}\frac{1}{n}\sum_{i=1}^{n}X_i^2 & -\bar{X}_n\\ -\bar{X}_n & 1\end{pmatrix}$
  $\hat{\mathrm{se}}(\hat{\beta}_0) = \frac{\hat{\sigma}}{s_X\sqrt{n}}\sqrt{\frac{\sum_{i=1}^{n}X_i^2}{n}}$,  $\hat{\mathrm{se}}(\hat{\beta}_1) = \frac{\hat{\sigma}}{s_X\sqrt{n}}$
  where $s_X^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$ and $\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^{n}\hat{\epsilon}_i^2$ (an unbiased estimate of $\sigma^2$).

Further properties
  Consistency: $\hat{\beta}_0 \xrightarrow{P} \beta_0$ and $\hat{\beta}_1 \xrightarrow{P} \beta_1$
  Asymptotic normality: $\frac{\hat{\beta}_0 - \beta_0}{\hat{\mathrm{se}}(\hat{\beta}_0)} \xrightarrow{D} \mathcal{N}(0, 1)$  and  $\frac{\hat{\beta}_1 - \beta_1}{\hat{\mathrm{se}}(\hat{\beta}_1)} \xrightarrow{D} \mathcal{N}(0, 1)$
  Approximate $1-\alpha$ confidence intervals for $\beta_0$ and $\beta_1$:
    $\hat{\beta}_0 \pm z_{\alpha/2}\,\hat{\mathrm{se}}(\hat{\beta}_0)$  and  $\hat{\beta}_1 \pm z_{\alpha/2}\,\hat{\mathrm{se}}(\hat{\beta}_1)$
  The Wald test for $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \ne 0$: reject $H_0$ if $|W| > z_{\alpha/2}$ where $W = \hat{\beta}_1/\hat{\mathrm{se}}(\hat{\beta}_1)$.

$R^2$
  $R^2 = \frac{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2} = 1 - \frac{\sum_{i=1}^{n}\hat{\epsilon}_i^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2} = 1 - \frac{\mathrm{rss}}{\mathrm{tss}}$

Likelihood
  $\mathcal{L} = \prod_{i=1}^{n}f(X_i, Y_i) = \prod_{i=1}^{n}f_X(X_i)\times\prod_{i=1}^{n}f_{Y|X}(Y_i \mid X_i) = \mathcal{L}_1\times\mathcal{L}_2$
  $\mathcal{L}_1 = \prod_{i=1}^{n}f_X(X_i)$
  $\mathcal{L}_2 = \prod_{i=1}^{n}f_{Y|X}(Y_i \mid X_i) \propto \sigma^{-n}\exp\left\{-\frac{1}{2\sigma^2}\sum_i\left(Y_i - (\beta_0 + \beta_1X_i)\right)^2\right\}$
  Under the assumption of normality, the least squares estimator is also the mle, with
    $\hat{\sigma}^2_{mle} = \frac{1}{n}\sum_{i=1}^{n}\hat{\epsilon}_i^2$

A least-squares sketch follows.
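A minimal numpy sketch of the least-squares formulas above (synthetic data of my choosing, not from the original):

```python
# Simple linear regression by the closed-form least-squares estimates,
# with standard errors and R^2 as defined above.
import numpy as np

rng = np.random.default_rng(7)
n = 200
x = rng.uniform(0.0, 10.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.5, size=n)   # beta0=1, beta1=2

xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

resid = y - (b0 + b1 * x)
sigma2_hat = resid @ resid / (n - 2)              # unbiased estimate of sigma^2
s2_x = np.mean((x - xbar) ** 2)
se_b1 = np.sqrt(sigma2_hat) / (np.sqrt(s2_x) * np.sqrt(n))
se_b0 = se_b1 * np.sqrt(np.mean(x ** 2))

r2 = 1.0 - resid @ resid / np.sum((y - ybar) ** 2)
print(b0, b1, se_b0, se_b1, r2)
```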

18.2 Prediction

Observe $X = x_*$ of the covariate and predict the outcome $Y_*$.
  $\hat{Y}_* = \hat{\beta}_0 + \hat{\beta}_1x_*$
  $\mathbb{V}[\hat{Y}_*] = \mathbb{V}[\hat{\beta}_0] + x_*^2\,\mathbb{V}[\hat{\beta}_1] + 2x_*\,\mathrm{Cov}[\hat{\beta}_0, \hat{\beta}_1]$

Prediction interval
  $\hat{\xi}_n^2 = \hat{\sigma}^2\left(\frac{\sum_{i=1}^{n}(X_i - x_*)^2}{n\sum_i(X_i - \bar{X})^2} + 1\right)$
  $\hat{Y}_* \pm z_{\alpha/2}\,\hat{\xi}_n$

18.3 Multiple Regression

  $Y = X\beta + \epsilon$
where
  $X = \begin{pmatrix}X_{11} & \cdots & X_{1k}\\ \vdots & \ddots & \vdots\\ X_{n1} & \cdots & X_{nk}\end{pmatrix}$,  $\beta = \begin{pmatrix}\beta_1\\ \vdots\\ \beta_k\end{pmatrix}$,  $\epsilon = \begin{pmatrix}\epsilon_1\\ \vdots\\ \epsilon_n\end{pmatrix}$

Likelihood
  $\mathcal{L}(\beta, \sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\mathrm{rss}\right\}$
  $\mathrm{rss} = (y - X\beta)^T(y - X\beta) = \|Y - X\beta\|^2 = \sum_{i=1}^{N}(Y_i - x_i^T\beta)^2$

If the $(k\times k)$ matrix $X^TX$ is invertible,
  $\hat{\beta} = (X^TX)^{-1}X^TY$
  $\mathbb{V}[\hat{\beta} \mid X^n] = \sigma^2(X^TX)^{-1}$
  $\hat{\beta} \approx \mathcal{N}\left(\beta,\ \sigma^2(X^TX)^{-1}\right)$

Estimate regression function
  $\hat{r}(x) = \sum_{j=1}^{k}\hat{\beta}_jx_j$

Unbiased estimate for $\sigma^2$
  $\hat{\sigma}^2 = \frac{1}{n-k}\sum_{i=1}^{n}\hat{\epsilon}_i^2$,  $\hat{\epsilon} = X\hat{\beta} - Y$
  mle: $\hat{\sigma}^2_{mle} = \frac{n-k}{n}\,\hat{\sigma}^2$

$1-\alpha$ confidence interval
  $\hat{\beta}_j \pm z_{\alpha/2}\,\hat{\mathrm{se}}(\hat{\beta}_j)$

Hypothesis testing
  $H_0: \beta_j = 0$ vs. $H_1: \beta_j \ne 0$, $j \in J$

18.4 Model Selection

Consider predicting a new observation $Y^*$ for covariates $X^*$ and let $S \subseteq J$ denote a subset of the covariates in the model, where $|S| = k$ and $|J| = n$.

Issues
  Underfitting: too few covariates yields high bias
  Overfitting: too many covariates yields high variance

Procedure
  1. Assign a score to each model
  2. Search through all models to find the one with the highest score

Mean squared prediction error (mspe)
  $\mathrm{mspe} = \mathbb{E}\left[(\hat{Y}(S) - Y^*)^2\right]$

Prediction risk
  $R(S) = \sum_{i=1}^{n}\mathrm{mspe}_i = \sum_{i=1}^{n}\mathbb{E}\left[(\hat{Y}_i(S) - Y_i^*)^2\right]$

Training error
  $\hat{R}_{tr}(S) = \sum_{i=1}^{n}(\hat{Y}_i(S) - Y_i)^2$

$R^2$
  $R^2(S) = 1 - \frac{\mathrm{rss}(S)}{\mathrm{tss}} = 1 - \frac{\hat{R}_{tr}(S)}{\mathrm{tss}} = 1 - \frac{\sum_{i=1}^{n}(\hat{Y}_i(S) - Y_i)^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$

The training error is a downward-biased estimate of the prediction risk:
  $\mathbb{E}\left[\hat{R}_{tr}(S)\right] < R(S)$
  $\mathrm{bias}(\hat{R}_{tr}(S)) = \mathbb{E}\left[\hat{R}_{tr}(S)\right] - R(S) = -2\sum_{i=1}^{n}\mathrm{Cov}\left[\hat{Y}_i, Y_i\right]$

Adjusted $R^2$
  $\bar{R}^2(S) = 1 - \frac{n-1}{n-k}\,\frac{\mathrm{rss}}{\mathrm{tss}}$

Mallows' $C_p$ statistic
  $\hat{R}(S) = \hat{R}_{tr}(S) + 2k\hat{\sigma}^2$ = lack of fit + complexity penalty

Akaike Information Criterion (AIC)
  $\mathrm{AIC}(S) = \ell_n(\hat{\beta}_S, \hat{\sigma}_S^2) - k$

Bayesian Information Criterion (BIC)
  $\mathrm{BIC}(S) = \ell_n(\hat{\beta}_S, \hat{\sigma}_S^2) - \frac{k}{2}\log n$

Validation and training
  $\hat{R}_V(S) = \sum_{i=1}^{m}(\hat{Y}_i^*(S) - Y_i^*)^2$,  $m = |\{\text{validation data}\}|$, often $\frac{n}{4}$ or $\frac{n}{2}$

Leave-one-out cross-validation
  $\hat{R}_{CV}(S) = \sum_{i=1}^{n}(Y_i - \hat{Y}_{(i)})^2 = \sum_{i=1}^{n}\left(\frac{Y_i - \hat{Y}_i(S)}{1 - U_{ii}(S)}\right)^2$
  $U(S) = X_S(X_S^TX_S)^{-1}X_S^T$ ("hat matrix")

19 Non-parametric Function Estimation

19.1 Density Estimation

Estimate $f(x)$, where $f(x) = \mathbb{P}[X \in A] = \int_A f(x)\,dx$.

Integrated square error (ise)
  $L(f, \hat{f}_n) = \int\left(f(x) - \hat{f}_n(x)\right)^2dx = J(h) + \int f^2(x)\,dx$

Frequentist risk
  $R(f, \hat{f}_n) = \mathbb{E}\left[L(f, \hat{f}_n)\right] = \int b^2(x)\,dx + \int v(x)\,dx$
  $b(x) = \mathbb{E}\left[\hat{f}_n(x)\right] - f(x)$,  $v(x) = \mathbb{V}\left[\hat{f}_n(x)\right]$

19.1.1 Histograms

Definitions
  Number of bins $m$
  Binwidth $h = \frac{1}{m}$
  Bin $B_j$ has $\nu_j$ observations
  Define $\hat{p}_j = \nu_j/n$ and $p_j = \int_{B_j}f(u)\,du$

Histogram estimator
  $\hat{f}_n(x) = \sum_{j=1}^{m}\frac{\hat{p}_j}{h}\,I(x \in B_j)$
  $\mathbb{E}\left[\hat{f}_n(x)\right] = \frac{p_j}{h}$
  $\mathbb{V}\left[\hat{f}_n(x)\right] = \frac{p_j(1-p_j)}{nh^2}$
  $R(\hat{f}_n, f) \approx \frac{h^2}{12}\int(f'(u))^2\,du + \frac{1}{nh}$
  $h^* = \frac{1}{n^{1/3}}\left(\frac{6}{\int(f'(u))^2\,du}\right)^{1/3}$
  $R^*(\hat{f}_n, f) \approx \frac{C}{n^{2/3}}$,  $C = \left(\frac{3}{4}\right)^{2/3}\left(\int(f'(u))^2\,du\right)^{1/3}$

Cross-validation estimate of $\mathbb{E}[J(h)]$
  $\hat{J}_{CV}(h) = \int\hat{f}_n^2(x)\,dx - \frac{2}{n}\sum_{i=1}^{n}\hat{f}_{(-i)}(X_i) = \frac{2}{(n-1)h} - \frac{n+1}{(n-1)h}\sum_{j=1}^{m}\hat{p}_j^2$

19.1.2 Kernel Density Estimator (KDE)

Kernel $K$
  $K(x) \ge 0$
  $\int K(x)\,dx = 1$
  $\int xK(x)\,dx = 0$
  $\int x^2K(x)\,dx \equiv \sigma_K^2 > 0$

KDE
  $\hat{f}_n(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h}K\left(\frac{x - X_i}{h}\right)$
  $R(f, \hat{f}_n) \approx \frac{1}{4}(h\sigma_K)^4\int(f''(x))^2\,dx + \frac{1}{nh}\int K^2(x)\,dx$
  $h^* = \frac{c_1^{-2/5}c_2^{1/5}c_3^{-1/5}}{n^{1/5}}$,  $c_1 = \sigma_K^2$,  $c_2 = \int K^2(x)\,dx$,  $c_3 = \int(f''(x))^2\,dx$
  $R^*(f, \hat{f}_n) = \frac{c_4}{n^{4/5}}$,  $c_4 = \frac{5}{4}(\sigma_K^2)^{2/5}\left(\int K^2(x)\,dx\right)^{4/5}\left(\int(f'')^2\,dx\right)^{1/5}$

Epanechnikov kernel
  $K(x) = \begin{cases}\frac{3}{4\sqrt{5}}\left(1 - x^2/5\right) & |x| < \sqrt{5}\\ 0 & \text{otherwise}\end{cases}$

Cross-validation estimate of $\mathbb{E}[J(h)]$
  $\hat{J}_{CV}(h) = \int\hat{f}_n^2(x)\,dx - \frac{2}{n}\sum_{i=1}^{n}\hat{f}_{(-i)}(X_i) \approx \frac{1}{hn^2}\sum_{i=1}^{n}\sum_{j=1}^{n}K^*\left(\frac{X_i - X_j}{h}\right) + \frac{2}{nh}K(0)$
  $K^*(x) = K^{(2)}(x) - 2K(x)$,  $K^{(2)}(x) = \int K(x-y)K(y)\,dy$

A minimal KDE sketch follows.
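A minimal sketch of the Gaussian-kernel KDE above (numpy assumed; data and bandwidth rule are my choices, not from the original):

```python
# Gaussian-kernel density estimate evaluated on a grid:
# f_hat(x) = (1/n) * sum_i (1/h) * K((x - X_i)/h), K = standard normal pdf.
import numpy as np

rng = np.random.default_rng(8)
data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
n = data.size
h = 1.06 * data.std(ddof=1) * n ** (-1 / 5)   # normal-reference bandwidth

def gauss_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

grid = np.linspace(-6.0, 6.0, 200)
# broadcast: for each grid point, average the scaled kernels over the sample
f_hat = gauss_kernel((grid[:, None] - data[None, :]) / h).mean(axis=1) / h

print(np.trapz(f_hat, grid))   # ~1.0: the estimate integrates to one
```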

19.2 Non-parametric Regression

Estimate $r(x)$, where $r(x) = \mathbb{E}[Y \mid X = x]$. Consider pairs of points $(x_1, Y_1), \dots, (x_n, Y_n)$ related by
  $Y_i = r(x_i) + \epsilon_i$,  $\mathbb{E}[\epsilon_i] = 0$,  $\mathbb{V}[\epsilon_i] = \sigma^2$

k-nearest-neighbor estimator
  $\hat{r}(x) = \frac{1}{k}\sum_{i:\,x_i \in N_k(x)}Y_i$  where $N_k(x) = \{k$ values of $x_1, \dots, x_n$ closest to $x\}$

Nadaraya-Watson kernel estimator
  $\hat{r}(x) = \sum_{i=1}^{n}w_i(x)\,Y_i$,  $w_i(x) = \frac{K\left(\frac{x - x_i}{h}\right)}{\sum_{j=1}^{n}K\left(\frac{x - x_j}{h}\right)}$
  $R(\hat{r}_n, r) \approx \frac{h^4}{4}\left(\int x^2K(x)\,dx\right)^2\int\left(r''(x) + 2r'(x)\frac{f'(x)}{f(x)}\right)^2dx + \frac{\sigma^2\int K^2(x)\,dx}{nh}\int\frac{dx}{f(x)}$
  $h^* \approx \frac{c_1}{n^{1/5}}$,  $R^*(\hat{r}_n, r) \approx \frac{c_2}{n^{4/5}}$

Cross-validation estimate of $\mathbb{E}[J(h)]$
  $\hat{J}_{CV}(h) = \sum_{i=1}^{n}(Y_i - \hat{r}_{(-i)}(x_i))^2 = \sum_{i=1}^{n}\frac{(Y_i - \hat{r}(x_i))^2}{\left(1 - \frac{K(0)}{\sum_{j=1}^{n}K\left(\frac{x_i - x_j}{h}\right)}\right)^2}$

19.3 Smoothing Using Orthogonal Functions

Approximation
  $r(x) = \sum_{j=1}^{\infty}\beta_j\phi_j(x) \approx \sum_{j=1}^{J}\beta_j\phi_j(x)$

Multivariate regression form
  $Y = \Phi\beta + \eta$  where  $\eta_i = \epsilon_i$ and
  $\Phi = \begin{pmatrix}\phi_0(x_1) & \cdots & \phi_J(x_1)\\ \vdots & \ddots & \vdots\\ \phi_0(x_n) & \cdots & \phi_J(x_n)\end{pmatrix}$

Least squares estimator
  $\hat{\beta} = (\Phi^T\Phi)^{-1}\Phi^TY \approx \frac{1}{n}\Phi^TY$  (for equally spaced observations only)

Cross-validation estimate of $\mathbb{E}[J(h)]$
  $\hat{R}_{CV}(J) = \sum_{i=1}^{n}\left(Y_i - \sum_{j=1}^{J}\phi_j(x_i)\hat{\beta}_{j,(-i)}\right)^2$

A minimal Nadaraya-Watson sketch follows.
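A minimal sketch of the Nadaraya-Watson estimator with a Gaussian kernel (my synthetic data and hand-picked bandwidth, not from the original):

```python
# Nadaraya-Watson kernel regression: r_hat(x) = sum_i w_i(x) Y_i with
# Gaussian-kernel weights normalized over the sample.
import numpy as np

rng = np.random.default_rng(9)
x = np.sort(rng.uniform(0.0, 2.0 * np.pi, 300))
y = np.sin(x) + rng.normal(0.0, 0.3, size=x.size)
h = 0.3                                       # bandwidth (chosen by hand here)

def nw(x0):
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)    # unnormalized Gaussian weights
    return np.sum(w * y) / np.sum(w)

grid = np.linspace(0.5, 2.0 * np.pi - 0.5, 50)
r_hat = np.array([nw(x0) for x0 in grid])
print(np.max(np.abs(r_hat - np.sin(grid))))   # small fit error away from edges
```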

20 Stochastic Processes

Stochastic process
  $\{X_t : t \in T\}$,  $T = \begin{cases}\{0, 1, \dots\} & \text{discrete}\\ [0, \infty) & \text{continuous}\end{cases}$
  Notations: $X_t$, $X(t)$
  State space $\mathcal{X}$
  Index set $T$

20.1 Markov Chains

Markov chain
  $\mathbb{P}[X_n = x \mid X_0, \dots, X_{n-1}] = \mathbb{P}[X_n = x \mid X_{n-1}]$  for all $n \in T$, $x \in \mathcal{X}$

Transition probabilities
  $p_{ij} \equiv \mathbb{P}[X_{n+1} = j \mid X_n = i]$
  $p_{ij}(n) \equiv \mathbb{P}[X_{m+n} = j \mid X_m = i]$  ($n$-step)

Transition matrix $P$ ($n$-step: $P_n$)
  $(i, j)$ element is $p_{ij}$
  $p_{ij} \ge 0$ and $\sum_j p_{ij} = 1$

Chapman-Kolmogorov
  $p_{ij}(m + n) = \sum_k p_{ik}(m)\,p_{kj}(n)$
  $P_{m+n} = P_mP_n$,  $P_n = P\times\cdots\times P = P^n$

Marginal probability
  $\mu_n = (\mu_n(1), \dots, \mu_n(N))$  where  $\mu_n(i) = \mathbb{P}[X_n = i]$
  $\mu_0$: initial distribution
  $\mu_n = \mu_0P^n$

20.2 Poisson Processes

Poisson process
  $\{X_t : t \in [0, \infty)\}$ = number of events up to and including time $t$
  $X_0 = 0$
  Independent increments: $\forall t_0 < \cdots < t_n:\ X_{t_1} - X_{t_0} \perp \cdots \perp X_{t_n} - X_{t_{n-1}}$
  Intensity function $\lambda(t)$:
    $\mathbb{P}[X_{t+h} - X_t = 1] = \lambda(t)h + o(h)$
    $\mathbb{P}[X_{t+h} - X_t = 2] = o(h)$
  $X_{s+t} - X_s \sim \mathrm{Po}(m(s+t) - m(s))$  where  $m(t) = \int_0^t\lambda(s)\,ds$

Homogeneous Poisson process
  $\lambda(t) \equiv \lambda > 0 \implies X_t \sim \mathrm{Po}(\lambda t)$

Waiting times
  $W_t :=$ time at which $X_t$ occurs
  $W_t \sim \mathrm{Gamma}\left(t, \frac{1}{\lambda}\right)$

Interarrival times
  $S_t = W_{t+1} - W_t$
  $S_t \sim \mathrm{Exp}\left(\frac{1}{\lambda}\right)$

A short Markov-chain sketch follows.
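A small sketch (my toy two-state chain, not from the original) that checks the marginal recursion $\mu_n = \mu_0 P^n$ against the empirical distribution of simulated paths:

```python
# Two-state Markov chain: compute mu_n = mu_0 P^n and compare with the
# empirical state frequencies of simulated chains at step n.
import numpy as np

rng = np.random.default_rng(10)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])          # transition matrix, rows sum to 1
mu0 = np.array([1.0, 0.0])          # start in state 0
n = 20

mu_n = mu0 @ np.linalg.matrix_power(P, n)

def simulate(steps):
    state = 0
    for _ in range(steps):
        state = rng.choice(2, p=P[state])
    return state

finals = np.array([simulate(n) for _ in range(5_000)])
print(mu_n, np.bincount(finals, minlength=2) / finals.size)   # close agreement
```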

21 Time Series

Mean function
  $\mu_{x_t} = \mathbb{E}[x_t] = \int_{-\infty}^{\infty}x\,f_t(x)\,dx$

Autocovariance function
  $\gamma_x(s, t) = \mathbb{E}[(x_s - \mu_s)(x_t - \mu_t)] = \mathbb{E}[x_sx_t] - \mu_s\mu_t$
  $\gamma_x(t, t) = \mathbb{E}\left[(x_t - \mu_t)^2\right] = \mathbb{V}[x_t]$

Autocorrelation function (ACF)
  $\rho(s, t) = \frac{\mathrm{Cov}[x_s, x_t]}{\sqrt{\mathbb{V}[x_s]\,\mathbb{V}[x_t]}} = \frac{\gamma(s, t)}{\sqrt{\gamma(s, s)\gamma(t, t)}}$

Cross-covariance function (CCV)
  $\gamma_{xy}(s, t) = \mathbb{E}[(x_s - \mu_{x_s})(y_t - \mu_{y_t})]$

Cross-correlation function (CCF)
  $\rho_{xy}(s, t) = \frac{\gamma_{xy}(s, t)}{\sqrt{\gamma_x(s, s)\gamma_y(t, t)}}$

Backshift operator
  $B^k(x_t) = x_{t-k}$

Difference operator
  $\nabla^d = (1 - B)^d$

White noise
  $w_t \sim \mathrm{wn}(0, \sigma_w^2)$;  Gaussian: $w_t \stackrel{iid}{\sim} \mathcal{N}(0, \sigma_w^2)$
  $\mathbb{E}[w_t] = 0$,  $\mathbb{V}[w_t] = \sigma_w^2$,  $\gamma_w(s, t) = 0$ for $s \ne t$

Random walk
  Drift $\delta$:  $x_t = \delta t + \sum_{j=1}^{t}w_j$,  $\mathbb{E}[x_t] = \delta t$

Symmetric moving average
  $m_t = \sum_{j=-k}^{k}a_jx_{t-j}$  where $a_j = a_{-j} \ge 0$ and $\sum_{j=-k}^{k}a_j = 1$

Linear process
  $x_t = \mu + \sum_{j=-\infty}^{\infty}\psi_jw_{t-j}$  where  $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$
  $\gamma(h) = \sigma_w^2\sum_{j=-\infty}^{\infty}\psi_{j+h}\psi_j$

21.1 Stationary Time Series

Strictly stationary
  $\mathbb{P}[x_{t_1} \le c_1, \dots, x_{t_k} \le c_k] = \mathbb{P}[x_{t_1+h} \le c_1, \dots, x_{t_k+h} \le c_k]$  for all $k \in \mathbb{N}$, $t_k, c_k, h \in \mathbb{Z}$

Weakly stationary
  $\mathbb{E}[x_t^2] < \infty$ and $\mathbb{E}[x_t] = m$ for all $t \in \mathbb{Z}$
  $\gamma_x(s, t) = \gamma_x(s + r, t + r)$ for all $r, s, t \in \mathbb{Z}$

Autocovariance function (stationary case)
  $\gamma(h) = \mathbb{E}[(x_{t+h} - \mu)(x_t - \mu)]$,  $\gamma(0) = \mathbb{E}\left[(x_t - \mu)^2\right]$
  $\gamma(0) \ge 0$,  $\gamma(0) \ge |\gamma(h)|$,  $\gamma(h) = \gamma(-h)$ for all $h \in \mathbb{Z}$

Autocorrelation function (ACF)
  $\rho_x(h) = \frac{\gamma(t+h, t)}{\sqrt{\gamma(t+h, t+h)\gamma(t, t)}} = \frac{\mathrm{Cov}[x_{t+h}, x_t]}{\sqrt{\mathbb{V}[x_{t+h}]\,\mathbb{V}[x_t]}} = \frac{\gamma(h)}{\gamma(0)}$

Jointly stationary time series
  $\gamma_{xy}(h) = \mathbb{E}[(x_{t+h} - \mu_x)(y_t - \mu_y)]$,  $\rho_{xy}(h) = \frac{\gamma_{xy}(h)}{\sqrt{\gamma_x(0)\gamma_y(0)}}$

21.2 Estimation of Correlation

Sample mean
  $\bar{x} = \frac{1}{n}\sum_{t=1}^{n}x_t$

Sample variance
  $\mathbb{V}[\bar{x}] = \frac{1}{n}\sum_{h=-n}^{n}\left(1 - \frac{|h|}{n}\right)\gamma_x(h)$

Sample autocovariance function
  $\hat{\gamma}(h) = \frac{1}{n}\sum_{t=1}^{n-h}(x_{t+h} - \bar{x})(x_t - \bar{x})$

Sample autocorrelation function
  $\hat{\rho}(h) = \frac{\hat{\gamma}(h)}{\hat{\gamma}(0)}$

Sample cross-covariance function
  $\hat{\gamma}_{xy}(h) = \frac{1}{n}\sum_{t=1}^{n-h}(x_{t+h} - \bar{x})(y_t - \bar{y})$

Sample cross-correlation function
  $\hat{\rho}_{xy}(h) = \frac{\hat{\gamma}_{xy}(h)}{\sqrt{\hat{\gamma}_x(0)\hat{\gamma}_y(0)}}$

Properties
  $\sigma_{\hat{\rho}_x(h)} = \frac{1}{\sqrt{n}}$ if $x_t$ is white noise
  $\sigma_{\hat{\rho}_{xy}(h)} = \frac{1}{\sqrt{n}}$ if $x_t$ or $y_t$ is white noise

21.3 Non-Stationary Time Series

Classical decomposition model
  $x_t = \mu_t + s_t + w_t$
  $\mu_t$ = trend,  $s_t$ = seasonal component,  $w_t$ = random noise term

21.3.1 Detrending

Least squares
  1. Choose a trend model, e.g., $\mu_t = \beta_0 + \beta_1t + \beta_2t^2$
  2. Minimize rss to obtain the trend estimate $\hat{\mu}_t = \hat{\beta}_0 + \hat{\beta}_1t + \hat{\beta}_2t^2$
  3. Residuals ≙ noise $w_t$

Moving average
  The low-pass filter $v_t$ is a symmetric moving average $m_t$ with $a_j = \frac{1}{2k+1}$:
  $v_t = \frac{1}{2k+1}\sum_{i=-k}^{k}x_{t-i}$
  If $\frac{1}{2k+1}\sum_{i=-k}^{k}w_{t-j} \approx 0$, a linear trend function $\mu_t = \beta_0 + \beta_1t$ passes without distortion.

Differencing
  $\mu_t = \beta_0 + \beta_1t \implies \nabla x_t$ is stationary

21.4 ARIMA models

Autoregressive polynomial
  $\phi(z) = 1 - \phi_1z - \cdots - \phi_pz^p$,  $z \in \mathbb{C}$ and $\phi_p \ne 0$

Autoregressive operator
  $\phi(B) = 1 - \phi_1B - \cdots - \phi_pB^p$

Autoregressive model of order $p$, AR($p$)
  $x_t = \phi_1x_{t-1} + \cdots + \phi_px_{t-p} + w_t \iff \phi(B)x_t = w_t$

AR(1)
  $x_t = \phi^k(x_{t-k}) + \sum_{j=0}^{k-1}\phi^j(w_{t-j}) \stackrel{k\to\infty,\ |\phi|<1}{=} \sum_{j=0}^{\infty}\phi^jw_{t-j}$ (linear process)
  $\mathbb{E}[x_t] = \sum_{j=0}^{\infty}\phi^j\,\mathbb{E}[w_{t-j}] = 0$
  $\gamma(h) = \mathrm{Cov}[x_{t+h}, x_t] = \frac{\sigma_w^2\phi^h}{1-\phi^2}$
  $\rho(h) = \frac{\gamma(h)}{\gamma(0)} = \phi^h$,  $\rho(h) = \phi\,\rho(h-1)$,  $h = 1, 2, \dots$

Moving average polynomial
  $\theta(z) = 1 + \theta_1z + \cdots + \theta_qz^q$,  $z \in \mathbb{C}$ and $\theta_q \ne 0$

Moving average operator
  $\theta(B) = 1 + \theta_1B + \cdots + \theta_qB^q$

MA($q$) (moving average model of order $q$)
  $x_t = w_t + \theta_1w_{t-1} + \cdots + \theta_qw_{t-q} \iff x_t = \theta(B)w_t$
  $\mathbb{E}[x_t] = \sum_{j=0}^{q}\theta_j\,\mathbb{E}[w_{t-j}] = 0$
  $\gamma(h) = \mathrm{Cov}[x_{t+h}, x_t] = \begin{cases}\sigma_w^2\sum_{j=0}^{q-h}\theta_j\theta_{j+h} & 0 \le h \le q\\ 0 & h > q\end{cases}$

MA(1)
  $x_t = w_t + \theta w_{t-1}$
  $\gamma(h) = \begin{cases}(1+\theta^2)\sigma_w^2 & h = 0\\ \theta\sigma_w^2 & h = 1\\ 0 & h > 1\end{cases}$,  $\rho(h) = \begin{cases}\frac{\theta}{1+\theta^2} & h = 1\\ 0 & h > 1\end{cases}$

ARMA($p, q$)
  $x_t = \phi_1x_{t-1} + \cdots + \phi_px_{t-p} + w_t + \theta_1w_{t-1} + \cdots + \theta_qw_{t-q}$
  $\phi(B)x_t = \theta(B)w_t$

Partial autocorrelation function (PACF)
  $x_i^{h-1}$: regression of $x_i$ on $\{x_{h-1}, x_{h-2}, \dots, x_1\}$
  $\phi_{hh} = \mathrm{corr}(x_h - x_h^{h-1},\ x_0 - x_0^{h-1})$,  $h \ge 2$
  E.g., $\phi_{11} = \mathrm{corr}(x_1, x_0) = \rho(1)$

ARIMA($p, d, q$)
  $\nabla^dx_t = (1-B)^dx_t$ is ARMA($p, q$)
  $\phi(B)(1-B)^dx_t = \theta(B)w_t$

Exponentially weighted moving average (EWMA)
  $x_t = x_{t-1} + w_t - \lambda w_{t-1}$
  $x_t = \sum_{j=1}^{\infty}(1-\lambda)\lambda^{j-1}x_{t-j} + w_t$  when $|\lambda| < 1$
  $\tilde{x}_{n+1} = (1-\lambda)x_n + \lambda\tilde{x}_n$

Seasonal ARIMA
  Denoted by ARIMA$(p, d, q)\times(P, D, Q)_s$
  $\Phi_P(B^s)\phi(B)\nabla_s^D\nabla^dx_t = \delta + \Theta_Q(B^s)\theta(B)w_t$

21.4.1 Causality and Invertibility

ARMA($p, q$) is causal (future-independent) $\iff \exists\{\psi_j\}:\ \sum_{j=0}^{\infty}|\psi_j| < \infty$ such that
  $x_t = \sum_{j=0}^{\infty}\psi_jw_{t-j} = \psi(B)w_t$

ARMA($p, q$) is invertible $\iff \exists\{\pi_j\}:\ \sum_{j=0}^{\infty}|\pi_j| < \infty$ such that
  $\pi(B)x_t = \sum_{j=0}^{\infty}\pi_jx_{t-j} = w_t$

Properties
  ARMA($p, q$) causal $\iff$ roots of $\phi(z)$ lie outside the unit circle:
    $\psi(z) = \sum_{j=0}^{\infty}\psi_jz^j = \frac{\theta(z)}{\phi(z)}$,  $|z| \le 1$
  ARMA($p, q$) invertible $\iff$ roots of $\theta(z)$ lie outside the unit circle:
    $\pi(z) = \sum_{j=0}^{\infty}\pi_jz^j = \frac{\phi(z)}{\theta(z)}$,  $|z| \le 1$

Behavior of the ACF and PACF for causal and invertible ARMA models

          AR(p)                  MA(q)                  ARMA(p, q)
  ACF     tails off              cuts off after lag q   tails off
  PACF    cuts off after lag p   tails off              tails off

21.5 Spectral Analysis

Periodic process
  $x_t = A\cos(2\pi\omega t + \varphi) = U_1\cos(2\pi\omega t) + U_2\sin(2\pi\omega t)$
  Frequency index $\omega$ (cycles per unit time), period $1/\omega$
  Amplitude $A$, phase $\varphi$
  $U_1 = A\cos\varphi$ and $U_2 = A\sin\varphi$, often normally distributed rvs

Periodic mixture
  $x_t = \sum_{k=1}^{q}\left(U_{k1}\cos(2\pi\omega_kt) + U_{k2}\sin(2\pi\omega_kt)\right)$
  $U_{k1}, U_{k2}$, for $k = 1, \dots, q$, are independent zero-mean rvs with variances $\sigma_k^2$
  $\gamma(h) = \sum_{k=1}^{q}\sigma_k^2\cos(2\pi\omega_kh)$,  $\gamma(0) = \mathbb{E}[x_t^2] = \sum_{k=1}^{q}\sigma_k^2$

Spectral representation of a periodic process
  $\gamma(h) = \sigma^2\cos(2\pi\omega_0h) = \frac{\sigma^2}{2}e^{-2\pi i\omega_0h} + \frac{\sigma^2}{2}e^{2\pi i\omega_0h} = \int_{-1/2}^{1/2}e^{2\pi i\omega h}\,dF(\omega)$

Spectral distribution function
  $F(\omega) = \begin{cases}0 & \omega < -\omega_0\\ \sigma^2/2 & -\omega_0 \le \omega < \omega_0\\ \sigma^2 & \omega \ge \omega_0\end{cases}$
  $F(-\infty) = F(-1/2) = 0$,  $F(\infty) = F(1/2) = \gamma(0)$

Spectral density
  $f(\omega) = \sum_{h=-\infty}^{\infty}\gamma(h)e^{-2\pi i\omega h}$,  $-\frac{1}{2} \le \omega \le \frac{1}{2}$
  Needs $\sum_{h=-\infty}^{\infty}|\gamma(h)| < \infty \implies \gamma(h) = \int_{-1/2}^{1/2}e^{2\pi i\omega h}f(\omega)\,d\omega$,  $h = 0, \pm 1, \dots$
  $f(\omega) \ge 0$,  $f(\omega) = f(-\omega)$,  $f(\omega) = f(1-\omega)$
  $\gamma(0) = \mathbb{V}[x_t] = \int_{-1/2}^{1/2}f(\omega)\,d\omega$
  White noise: $f_w(\omega) = \sigma_w^2$
  ARMA($p, q$), $\phi(B)x_t = \theta(B)w_t$:
    $f_x(\omega) = \sigma_w^2\frac{|\theta(e^{-2\pi i\omega})|^2}{|\phi(e^{-2\pi i\omega})|^2}$
    where $\phi(z) = 1 - \sum_{k=1}^{p}\phi_kz^k$ and $\theta(z) = 1 + \sum_{k=1}^{q}\theta_kz^k$

Discrete Fourier Transform (DFT)
  $d(\omega_j) = n^{-1/2}\sum_{t=1}^{n}x_te^{-2\pi i\omega_jt}$

Fourier/fundamental frequencies
  $\omega_j = j/n$

Inverse DFT
  $x_t = n^{-1/2}\sum_{j=0}^{n-1}d(\omega_j)e^{2\pi i\omega_jt}$

Periodogram
  $I(j/n) = |d(j/n)|^2$

Scaled periodogram
  $P(j/n) = \frac{4}{n}I(j/n) = \left(\frac{2}{n}\sum_{t=1}^{n}x_t\cos(2\pi tj/n)\right)^2 + \left(\frac{2}{n}\sum_{t=1}^{n}x_t\sin(2\pi tj/n)\right)^2$

A short AR(1) simulation with its sample ACF follows.
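A minimal sketch (my parameters, not from the original) that simulates an AR(1) process and compares the sample ACF with the theoretical $\rho(h) = \phi^h$:

```python
# Simulate x_t = phi * x_{t-1} + w_t and estimate the ACF as
# rho_hat(h) = gamma_hat(h) / gamma_hat(0).
import numpy as np

rng = np.random.default_rng(11)
phi, n = 0.7, 5_000
w = rng.normal(0.0, 1.0, size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + w[t]

def acf(series, h):
    xbar = series.mean()
    num = np.sum((series[h:] - xbar) * (series[:n - h] - xbar)) / n
    den = np.sum((series - xbar) ** 2) / n
    return num / den

for h in range(1, 5):
    print(h, acf(x, h), phi ** h)      # sample vs. theoretical rho(h)
```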

22 Math

22.1 Gamma Function

  Ordinary: $\Gamma(s) = \int_0^{\infty}t^{s-1}e^{-t}\,dt$
  Upper incomplete: $\Gamma(s, x) = \int_x^{\infty}t^{s-1}e^{-t}\,dt$
  Lower incomplete: $\gamma(s, x) = \int_0^xt^{s-1}e^{-t}\,dt$
  $\Gamma(\alpha + 1) = \alpha\,\Gamma(\alpha)$
  $\Gamma(n) = (n-1)!$  ($n \in \mathbb{N}$)
  $\Gamma(1/2) = \sqrt{\pi}$

22.2 Beta Function

  Ordinary: $B(x, y) = B(y, x) = \int_0^1t^{x-1}(1-t)^{y-1}\,dt = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}$
  Incomplete: $B(x; a, b) = \int_0^xt^{a-1}(1-t)^{b-1}\,dt$
  Regularized incomplete:
    $I_x(a, b) = \frac{B(x; a, b)}{B(a, b)} \stackrel{a,b\in\mathbb{N}}{=} \sum_{j=a}^{a+b-1}\frac{(a+b-1)!}{j!\,(a+b-1-j)!}\,x^j(1-x)^{a+b-1-j}$
  $I_0(a, b) = 0$,  $I_1(a, b) = 1$
  $I_x(a, b) = 1 - I_{1-x}(b, a)$

22.3 Series

Finite
  $\sum_{k=1}^{n}k = \frac{n(n+1)}{2}$
  $\sum_{k=1}^{n}(2k-1) = n^2$
  $\sum_{k=1}^{n}k^2 = \frac{n(n+1)(2n+1)}{6}$
  $\sum_{k=1}^{n}k^3 = \left(\frac{n(n+1)}{2}\right)^2$
  $\sum_{k=0}^{n}c^k = \frac{c^{n+1}-1}{c-1}$  ($c \ne 1$)

Binomial
  $\sum_{k=0}^{n}\binom{n}{k} = 2^n$
  $\sum_{k=0}^{n}\binom{r+k}{k} = \binom{r+n+1}{n}$
  $\sum_{k=0}^{n}\binom{k}{m} = \binom{n+1}{m+1}$
  Binomial theorem: $\sum_{k=0}^{n}\binom{n}{k}a^{n-k}b^k = (a+b)^n$
  Vandermonde's identity: $\sum_{k=0}^{r}\binom{m}{k}\binom{n}{r-k} = \binom{m+n}{r}$

Infinite
  $\sum_{k=0}^{\infty}p^k = \frac{1}{1-p}$,  $\sum_{k=1}^{\infty}p^k = \frac{p}{1-p}$  ($|p| < 1$)
  $\sum_{k=0}^{\infty}kp^{k-1} = \frac{d}{dp}\left(\frac{1}{1-p}\right) = \frac{1}{(1-p)^2}$  ($|p| < 1$)
  $\sum_{k=0}^{\infty}\binom{r+k-1}{k}x^k = (1-x)^{-r}$  ($r \in \mathbb{N}^+$)
  $\sum_{k=0}^{\infty}\binom{\alpha}{k}p^k = (1+p)^{\alpha}$  ($|p| < 1,\ \alpha \in \mathbb{C}$)

22.4 Combinatorics

Sampling ($k$ out of $n$)

              w/o replacement                                          w/ replacement
  ordered     $n^{\underline{k}} = \prod_{i=0}^{k-1}(n-i) = \frac{n!}{(n-k)!}$        $n^k$
  unordered   $\binom{n}{k} = \frac{n^{\underline{k}}}{k!} = \frac{n!}{k!(n-k)!}$      $\binom{n-1+r}{r} = \binom{n-1+r}{n-1}$

Stirling numbers, 2nd kind
  $\left\{{n \atop k}\right\} = k\left\{{n-1 \atop k}\right\} + \left\{{n-1 \atop k-1}\right\}$  ($1 \le k \le n$)
  $\left\{{n \atop 0}\right\} = \begin{cases}1 & n = 0\\ 0 & \text{else}\end{cases}$

Partitions
  $P_{n+k,k} = \sum_{i=1}^{k}P_{n,i}$  ($n \ge 1$: $P_{n,0} = 0$, $P_{0,0} = 1$; $k > n$: $P_{n,k} = 0$)

Balls and urns ($|B| = n$, $|U| = m$, $f: B \to U$; D = distinguishable, ¬D = indistinguishable)

                 f arbitrary                                f injective                   f surjective                       f bijective
  B: D,  U: D    $m^n$                                      $m^{\underline{n}}$ if $m \ge n$, else 0   $m!\left\{{n \atop m}\right\}$       $n!$ if $m = n$, else 0
  B: ¬D, U: D    $\binom{n+m-1}{n}$                         $\binom{m}{n}$                $\binom{n-1}{m-1}$                 1 if $m = n$, else 0
  B: D,  U: ¬D   $\sum_{k=1}^{m}\left\{{n \atop k}\right\}$   1 if $m \ge n$, else 0        $\left\{{n \atop m}\right\}$         1 if $m = n$, else 0
  B: ¬D, U: ¬D   $\sum_{k=1}^{m}P_{n,k}$                    1 if $m \ge n$, else 0        $P_{n,m}$                          1 if $m = n$, else 0

References

[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole, 1972.

[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45-53, 2008.

[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications With R Examples. Springer, 2006.

[4] A. Steger. Diskrete Strukturen - Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.

[5] A. Steger. Diskrete Strukturen - Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.

[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.

[Figure: Univariate distribution relationships, courtesy of Leemis and McQueston [2].]
