
Probability and Statistics Cheat Sheet

Copyright © Matthias Vallentin, 2011
vallentin@icir.org
6th March, 2011

This cheat sheet integrates a variety of topics in probability theory and
statistics. It is based on literature [1, 6, 3] and in-class material from
courses of the statistics department at the University of California,
Berkeley, but is also influenced by other sources [4, 5]. If you find errors
or have suggestions for further topics, I would appreciate if you send me an
email. The most recent version of this document is available at
http://bit.ly/probstat. To reproduce, please contact me.

Contents

1  Distribution Overview
   1.1  Discrete Distributions
   1.2  Continuous Distributions
2  Probability Theory
3  Random Variables
   3.1  Transformations
4  Expectation
5  Variance
6  Inequalities
7  Distribution Relationships
8  Probability and Moment Generating Functions
9  Multivariate Distributions
   9.1  Standard Bivariate Normal
   9.2  Bivariate Normal
   9.3  Multivariate Normal
10 Convergence
   10.1 Law of Large Numbers (LLN)
   10.2 Central Limit Theorem (CLT)
11 Statistical Inference
   11.1 Point Estimation
   11.2 Normal-based Confidence Interval
   11.3 Empirical Distribution Function
   11.4 Statistical Functionals
12 Parametric Inference
   12.1 Method of Moments
   12.2 Maximum Likelihood
        12.2.1 Delta Method
   12.3 Multiparameter Models
        12.3.1 Multiparameter Delta Method
   12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Bayesian Inference
   14.1 Credible Intervals
   14.2 Function of Parameters
   14.3 Priors
        14.3.1 Conjugate Priors
   14.4 Bayesian Testing
15 Exponential Family
16 Sampling Methods
   16.1 The Bootstrap
        16.1.1 Bootstrap Confidence Intervals
   16.2 Rejection Sampling
   16.3 Importance Sampling
17 Decision Theory
   17.1 Risk
   17.2 Admissibility
   17.3 Bayes Rule
   17.4 Minimax Rules
18 Linear Regression
   18.1 Simple Linear Regression
   18.2 Prediction
   18.3 Multiple Regression
   18.4 Model Selection
19 Non-parametric Function Estimation
   19.1 Density Estimation
        19.1.1 Histograms
        19.1.2 Kernel Density Estimator (KDE)
   19.2 Non-parametric Regression
   19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
   20.1 Markov Chains
   20.2 Poisson Processes
21 Time Series
   21.1 Stationary Time Series
   21.2 Estimation of Correlation
   21.3 Non-Stationary Time Series
        21.3.1 Detrending
   21.4 ARIMA models
        21.4.1 Causality and Invertibility
   21.5 Spectral Analysis
22 Math
   22.1 Gamma Function
   22.2 Beta Function
   22.3 Series
   22.4 Combinatorics

1 Distribution Overview

1.1 Discrete Distributions

For each distribution the entries below give the notation, CDF $F_X(x)$, PMF $f_X(x)$, mean $\mathbb{E}[X]$, variance $\mathbb{V}[X]$, and MGF $M_X(s)$.¹

Uniform (discrete), $\mathrm{Unif}\{a,\dots,b\}$
  $F_X(x) = 0$ for $x<a$; $\frac{\lfloor x\rfloor - a + 1}{b-a+1}$ for $a \le x \le b$; $1$ for $x>b$
  $f_X(x) = \frac{I(a \le x \le b)}{b-a+1}$
  $\mathbb{E}[X] = \frac{a+b}{2}$,  $\mathbb{V}[X] = \frac{(b-a+1)^2-1}{12}$,  $M_X(s) = \frac{e^{as}-e^{(b+1)s}}{(b-a+1)(1-e^{s})}$

Bernoulli, $\mathrm{Bern}(p)$
  $f_X(x) = p^x(1-p)^{1-x}$,  $F_X(x) = (1-p)^{1-x}$ for $x \in \{0,1\}$
  $\mathbb{E}[X] = p$,  $\mathbb{V}[X] = p(1-p)$,  $M_X(s) = 1-p+pe^s$

Binomial, $\mathrm{Bin}(n,p)$
  $F_X(x) = I_{1-p}(n-x,\, x+1)$
  $f_X(x) = \binom{n}{x} p^x (1-p)^{n-x}$
  $\mathbb{E}[X] = np$,  $\mathbb{V}[X] = np(1-p)$,  $M_X(s) = (1-p+pe^s)^n$

Multinomial, $\mathrm{Mult}(n,p)$
  $f_X(x) = \frac{n!}{x_1!\cdots x_k!}\, p_1^{x_1}\cdots p_k^{x_k}$ with $\sum_{i=1}^k x_i = n$
  $\mathbb{E}[X_i] = np_i$,  $\mathbb{V}[X_i] = np_i(1-p_i)$,  $M_X(s) = \left(\sum_{i=1}^k p_i e^{s_i}\right)^n$

Hypergeometric, $\mathrm{Hyp}(N,m,n)$
  $f_X(x) = \binom{m}{x}\binom{N-m}{n-x}\big/\binom{N}{n}$
  $\mathbb{E}[X] = \frac{nm}{N}$,  $\mathbb{V}[X] = \frac{nm(N-n)(N-m)}{N^2(N-1)}$,  MGF: N/A

Negative Binomial, $\mathrm{NBin}(r,p)$
  $F_X(x) = I_p(r,\, x+1)$
  $f_X(x) = \binom{x+r-1}{r-1}\, p^r (1-p)^x$
  $\mathbb{E}[X] = r\frac{1-p}{p}$,  $\mathbb{V}[X] = r\frac{1-p}{p^2}$,  $M_X(s) = \left(\frac{p}{1-(1-p)e^s}\right)^r$

Geometric, $\mathrm{Geo}(p)$
  $F_X(x) = 1-(1-p)^x$,  $f_X(x) = p(1-p)^{x-1}$  ($x \in \mathbb{N}^+$)
  $\mathbb{E}[X] = \frac{1}{p}$,  $\mathbb{V}[X] = \frac{1-p}{p^2}$,  $M_X(s) = \frac{pe^s}{1-(1-p)e^s}$

Poisson, $\mathrm{Po}(\lambda)$
  $F_X(x) = e^{-\lambda}\sum_{i=0}^{\lfloor x\rfloor}\frac{\lambda^i}{i!}$,  $f_X(x) = \frac{\lambda^x e^{-\lambda}}{x!}$  ($x \in \{0,1,2,\dots\}$)
  $\mathbb{E}[X] = \lambda$,  $\mathbb{V}[X] = \lambda$,  $M_X(s) = e^{\lambda(e^s-1)}$

[Figure: example PMFs of the discrete uniform, binomial (n=40, p=0.3; n=30, p=0.6; n=25, p=0.9), geometric (p=0.2, 0.5, 0.8), and Poisson (λ=1, 4, 10) distributions.]

¹ We use the notation $\gamma(s,x)$ and $\Gamma(x)$ to refer to the Gamma functions (see 22.1), and use $B(x,y)$ and $I_x$ to refer to the Beta functions (see 22.2).
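The tabulated quantities are easy to cross-check numerically. This is a minimal sketch using scipy.stats, which is an assumed dependency of the example and not part of the original cheat sheet:

```python
# Evaluate PMF, CDF, mean, and variance of a few discrete distributions
# with scipy.stats, to cross-check the table above.
from scipy import stats

binom = stats.binom(n=40, p=0.3)      # Bin(n, p)
geom = stats.geom(p=0.5)              # Geo(p), support {1, 2, ...}
pois = stats.poisson(mu=4)            # Po(lambda)

print(binom.pmf(12), binom.cdf(12))   # f_X(12), F_X(12)
print(binom.mean(), binom.var())      # np, np(1-p)
print(geom.mean(), geom.var())        # 1/p, (1-p)/p^2
print(pois.mean(), pois.var())        # lambda, lambda
```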

1.2 Continuous Distributions

Uniform (continuous), $\mathrm{Unif}(a,b)$
  $F_X(x) = 0$ for $x<a$; $\frac{x-a}{b-a}$ for $a<x<b$; $1$ for $x>b$
  $f_X(x) = \frac{I(a<x<b)}{b-a}$
  $\mathbb{E}[X] = \frac{a+b}{2}$,  $\mathbb{V}[X] = \frac{(b-a)^2}{12}$,  $M_X(s) = \frac{e^{sb}-e^{sa}}{s(b-a)}$

Normal, $\mathcal{N}(\mu,\sigma^2)$
  $F_X(x) = \Phi(x) = \int_{-\infty}^{x}\phi(t)\,dt$
  $f_X(x) = \phi(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}$
  $\mathbb{E}[X] = \mu$,  $\mathbb{V}[X] = \sigma^2$,  $M_X(s) = \exp\left\{\mu s + \frac{\sigma^2 s^2}{2}\right\}$

Log-Normal, $\ln\mathcal{N}(\mu,\sigma^2)$
  $F_X(x) = \frac{1}{2} + \frac{1}{2}\,\mathrm{erf}\!\left[\frac{\ln x - \mu}{\sqrt{2\sigma^2}}\right]$
  $f_X(x) = \frac{1}{x\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(\ln x - \mu)^2}{2\sigma^2}\right\}$
  $\mathbb{E}[X] = e^{\mu+\sigma^2/2}$,  $\mathbb{V}[X] = (e^{\sigma^2}-1)\,e^{2\mu+\sigma^2}$

Multivariate Normal, $\mathrm{MVN}(\mu,\Sigma)$
  $f_X(x) = (2\pi)^{-k/2}|\Sigma|^{-1/2}\exp\left\{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right\}$
  $\mathbb{E}[X] = \mu$,  $\mathbb{V}[X] = \Sigma$,  $M_X(s) = \exp\left\{\mu^T s + \frac{1}{2}s^T\Sigma s\right\}$

Student's t, $\mathrm{Student}(\nu)$
  $f_X(x) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1+\frac{x^2}{\nu}\right)^{-(\nu+1)/2}$
  $\mathbb{E}[X] = 0$ ($\nu>1$),  $\mathbb{V}[X] = \frac{\nu}{\nu-2}$ ($\nu>2$)

Chi-square, $\chi^2_k$
  $F_X(x) = \frac{1}{\Gamma(k/2)}\,\gamma\!\left(\frac{k}{2},\frac{x}{2}\right)$
  $f_X(x) = \frac{1}{2^{k/2}\Gamma(k/2)}\,x^{k/2-1}e^{-x/2}$
  $\mathbb{E}[X] = k$,  $\mathbb{V}[X] = 2k$,  $M_X(s) = (1-2s)^{-k/2}$ ($s<1/2$)

F, $\mathrm{F}(d_1,d_2)$
  $F_X(x) = I_{d_1x/(d_1x+d_2)}\!\left(\frac{d_1}{2},\frac{d_2}{2}\right)$
  $f_X(x) = \frac{\sqrt{\frac{(d_1x)^{d_1}\,d_2^{d_2}}{(d_1x+d_2)^{d_1+d_2}}}}{x\,B\!\left(\frac{d_1}{2},\frac{d_2}{2}\right)}$
  $\mathbb{E}[X] = \frac{d_2}{d_2-2}$ ($d_2>2$),  $\mathbb{V}[X] = \frac{2d_2^2(d_1+d_2-2)}{d_1(d_2-2)^2(d_2-4)}$ ($d_2>4$)

Exponential, $\mathrm{Exp}(\beta)$
  $F_X(x) = 1-e^{-x/\beta}$,  $f_X(x) = \frac{1}{\beta}e^{-x/\beta}$
  $\mathbb{E}[X] = \beta$,  $\mathbb{V}[X] = \beta^2$,  $M_X(s) = \frac{1}{1-\beta s}$ ($s<1/\beta$)

Gamma, $\mathrm{Gamma}(\alpha,\beta)$
  $F_X(x) = \frac{\gamma(\alpha,\, x/\beta)}{\Gamma(\alpha)}$
  $f_X(x) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\,x^{\alpha-1}e^{-x/\beta}$
  $\mathbb{E}[X] = \alpha\beta$,  $\mathbb{V}[X] = \alpha\beta^2$,  $M_X(s) = \left(\frac{1}{1-\beta s}\right)^{\alpha}$ ($s<1/\beta$)

Inverse Gamma, $\mathrm{InvGamma}(\alpha,\beta)$
  $F_X(x) = \frac{\Gamma(\alpha,\, \beta/x)}{\Gamma(\alpha)}$
  $f_X(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,x^{-\alpha-1}e^{-\beta/x}$
  $\mathbb{E}[X] = \frac{\beta}{\alpha-1}$ ($\alpha>1$),  $\mathbb{V}[X] = \frac{\beta^2}{(\alpha-1)^2(\alpha-2)}$ ($\alpha>2$)
  $M_X(s) = \frac{2(-\beta s)^{\alpha/2}}{\Gamma(\alpha)}\,K_{\alpha}\!\left(\sqrt{-4\beta s}\right)$

Dirichlet, $\mathrm{Dir}(\alpha)$
  $f_X(x) = \frac{\Gamma\!\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)}\prod_{i=1}^{k}x_i^{\alpha_i-1}$
  $\mathbb{E}[X_i] = \frac{\alpha_i}{\sum_{j}\alpha_j}$,  $\mathbb{V}[X_i] = \frac{\mathbb{E}[X_i]\,(1-\mathbb{E}[X_i])}{\sum_{i}\alpha_i + 1}$

Beta, $\mathrm{Beta}(\alpha,\beta)$
  $F_X(x) = I_x(\alpha,\beta)$
  $f_X(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}$
  $\mathbb{E}[X] = \frac{\alpha}{\alpha+\beta}$,  $\mathbb{V}[X] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$
  $M_X(s) = 1 + \sum_{k=1}^{\infty}\left(\prod_{r=0}^{k-1}\frac{\alpha+r}{\alpha+\beta+r}\right)\frac{s^k}{k!}$

Weibull, $\mathrm{Weibull}(\lambda,k)$
  $F_X(x) = 1-e^{-(x/\lambda)^k}$
  $f_X(x) = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1}e^{-(x/\lambda)^k}$
  $\mathbb{E}[X] = \lambda\,\Gamma\!\left(1+\frac{1}{k}\right)$,  $\mathbb{V}[X] = \lambda^2\left[\Gamma\!\left(1+\frac{2}{k}\right)-\Gamma^2\!\left(1+\frac{1}{k}\right)\right]$
  $M_X(s) = \sum_{n=0}^{\infty}\frac{s^n\lambda^n}{n!}\,\Gamma\!\left(1+\frac{n}{k}\right)$

Pareto, $\mathrm{Pareto}(x_m,\alpha)$
  $F_X(x) = 1-\left(\frac{x_m}{x}\right)^{\alpha}$  ($x \ge x_m$)
  $f_X(x) = \frac{\alpha x_m^{\alpha}}{x^{\alpha+1}}$  ($x \ge x_m$)
  $\mathbb{E}[X] = \frac{\alpha x_m}{\alpha-1}$ ($\alpha>1$),  $\mathbb{V}[X] = \frac{x_m^2\,\alpha}{(\alpha-1)^2(\alpha-2)}$ ($\alpha>2$)
  $M_X(s) = \alpha(-x_m s)^{\alpha}\,\Gamma(-\alpha, -x_m s)$  ($s<0$)

[Figure: example PDFs of the continuous uniform, normal, log-normal, Student's t, chi-square (k=1,...,5), F, exponential, gamma, inverse gamma, beta, Weibull, and Pareto distributions for several parameter settings.]

2 Probability Theory

Definitions
  Sample space $\Omega$
  Outcome (point or element) $\omega \in \Omega$
  Event $A \subseteq \Omega$
  $\sigma$-algebra $\mathcal{A}$:
    1. $\emptyset \in \mathcal{A}$
    2. $A_1, A_2, \dots \in \mathcal{A} \implies \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$
    3. $A \in \mathcal{A} \implies \neg A \in \mathcal{A}$
  Probability distribution $\mathbb{P}$:
    1. $\mathbb{P}[A] \ge 0$ for every $A$
    2. $\mathbb{P}[\Omega] = 1$
    3. $\mathbb{P}\left[\bigsqcup_{i=1}^{\infty} A_i\right] = \sum_{i=1}^{\infty} \mathbb{P}[A_i]$ (disjoint $A_i$)
  Probability space $(\Omega, \mathcal{A}, \mathbb{P})$

Properties
  $\mathbb{P}[\emptyset] = 0$
  $B = \Omega \cap B = (A \cup \neg A) \cap B = (A \cap B) \cup (\neg A \cap B)$
  $\mathbb{P}[\neg A] = 1 - \mathbb{P}[A]$
  $\mathbb{P}[B] = \mathbb{P}[A \cap B] + \mathbb{P}[\neg A \cap B]$
  $\mathbb{P}[\Omega] = 1$,  $\mathbb{P}[\emptyset] = 0$
  DeMorgan: $\neg\left(\bigcup_n A_n\right) = \bigcap_n \neg A_n$,  $\neg\left(\bigcap_n A_n\right) = \bigcup_n \neg A_n$
  $\mathbb{P}\left[\bigcup_n A_n\right] = 1 - \mathbb{P}\left[\bigcap_n \neg A_n\right]$
  $\mathbb{P}[A \cup B] = \mathbb{P}[A] + \mathbb{P}[B] - \mathbb{P}[A \cap B] \implies \mathbb{P}[A \cup B] \le \mathbb{P}[A] + \mathbb{P}[B]$
  $\mathbb{P}[A \cup B] = \mathbb{P}[A \cap \neg B] + \mathbb{P}[\neg A \cap B] + \mathbb{P}[A \cap B]$
  $\mathbb{P}[A \cap \neg B] = \mathbb{P}[A] - \mathbb{P}[A \cap B]$

Continuity of Probabilities
  $A_1 \subset A_2 \subset \dots \implies \lim_{n\to\infty}\mathbb{P}[A_n] = \mathbb{P}[A]$ where $A = \bigcup_{i=1}^{\infty} A_i$
  $A_1 \supset A_2 \supset \dots \implies \lim_{n\to\infty}\mathbb{P}[A_n] = \mathbb{P}[A]$ where $A = \bigcap_{i=1}^{\infty} A_i$

Independence
  $A \perp B \iff \mathbb{P}[A \cap B] = \mathbb{P}[A]\,\mathbb{P}[B]$

Conditional Probability
  $\mathbb{P}[A \mid B] = \frac{\mathbb{P}[A \cap B]}{\mathbb{P}[B]}$ if $\mathbb{P}[B] > 0$

Law of Total Probability
  $\mathbb{P}[B] = \sum_{i=1}^{n}\mathbb{P}[B \mid A_i]\,\mathbb{P}[A_i]$ where $\Omega = \bigsqcup_{i=1}^{n} A_i$

Bayes' Theorem
  $\mathbb{P}[A_i \mid B] = \frac{\mathbb{P}[B \mid A_i]\,\mathbb{P}[A_i]}{\sum_{j=1}^{n}\mathbb{P}[B \mid A_j]\,\mathbb{P}[A_j]}$ where $\Omega = \bigsqcup_{i=1}^{n} A_i$

Inclusion-Exclusion Principle
  $\left|\bigcup_{i=1}^{n} A_i\right| = \sum_{r=1}^{n}(-1)^{r-1}\sum_{i_1 < \dots < i_r}\left|\bigcap_{j=1}^{r} A_{i_j}\right|$
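As a quick numeric illustration of the law of total probability and Bayes' theorem, the following sketch works through a made-up diagnostic-test example (the numbers are mine, not from the original text):

```python
# Toy example: disease prevalence 1%, test sensitivity 95%, specificity 90%.
# P[D|+] = P[+|D]P[D] / (P[+|D]P[D] + P[+|not D]P[not D])
p_d = 0.01          # prior P[D]
p_pos_d = 0.95      # P[+ | D]
p_pos_nd = 0.10     # P[+ | not D] = 1 - specificity

p_pos = p_pos_d * p_d + p_pos_nd * (1 - p_d)   # law of total probability
p_d_pos = p_pos_d * p_d / p_pos                # Bayes' theorem
print(p_d_pos)      # ~0.088: a positive test still leaves P[D] below 10%
```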

3 Random Variables

Random Variable
  $X: \Omega \to \mathbb{R}$

Probability Mass Function (PMF)
  $f_X(x) = \mathbb{P}[X = x] = \mathbb{P}[\{\omega \in \Omega : X(\omega) = x\}]$

Probability Density Function (PDF)
  $\mathbb{P}[a \le X \le b] = \int_a^b f(x)\,dx$

Cumulative Distribution Function (CDF)
  $F_X: \mathbb{R} \to [0,1]$,  $F_X(x) = \mathbb{P}[X \le x]$
  1. Nondecreasing: $x_1 < x_2 \implies F(x_1) \le F(x_2)$
  2. Normalized: $\lim_{x\to-\infty} F(x) = 0$ and $\lim_{x\to\infty} F(x) = 1$
  3. Right-continuous: $\lim_{y \downarrow x} F(y) = F(x)$

  $\mathbb{P}[a \le Y \le b \mid X = x] = \int_a^b f_{Y|X}(y \mid x)\,dy$
  $f_{Y|X}(y \mid x) = \frac{f(x,y)}{f_X(x)}$

Independence
  1. $\mathbb{P}[X \le x, Y \le y] = \mathbb{P}[X \le x]\,\mathbb{P}[Y \le y]$
  2. $f_{X,Y}(x,y) = f_X(x)\,f_Y(y)$

3.1 Transformations

Transformation function
  $Z = \varphi(X)$

Discrete
  $f_Z(z) = \mathbb{P}[\varphi(X) = z] = \mathbb{P}[\{x : \varphi(x) = z\}] = \mathbb{P}[X \in \varphi^{-1}(z)] = \sum_{x \in \varphi^{-1}(z)} f(x)$

Continuous
  $F_Z(z) = \mathbb{P}[\varphi(X) \le z] = \int_{A_z} f(x)\,dx$  with $A_z = \{x : \varphi(x) \le z\}$

Special case if $\varphi$ strictly monotone
  $f_Z(z) = f_X\!\left(\varphi^{-1}(z)\right)\left|\frac{d}{dz}\varphi^{-1}(z)\right| = f_X(x)\left|\frac{dx}{dz}\right| = f_X(x)\frac{1}{|J|}$

The Rule of the Lazy Statistician
  $\mathbb{E}[Z] = \int \varphi(x)\,dF_X(x)$
  $\mathbb{E}[I_A(x)] = \int I_A(x)\,dF_X(x) = \int_A dF_X(x) = \mathbb{P}[X \in A]$

Convolution
  $Z := X + Y$:  $f_Z(z) = \int_{-\infty}^{\infty} f_{X,Y}(x, z-x)\,dx \stackrel{X,Y\ge 0}{=} \int_0^z f_{X,Y}(x, z-x)\,dx$
  $Z := |X - Y|$:  $f_Z(z) = 2\int_0^{\infty} f_{X,Y}(x, z+x)\,dx$
  $Z := \frac{X}{Y}$:  $f_Z(z) = \int_{-\infty}^{\infty}|x|\,f_{X,Y}(x, xz)\,dx \stackrel{\perp}{=} \int_{-\infty}^{\infty}|x|\,f_X(x)\,f_Y(xz)\,dx$

4 Expectation

Expectation
  $\mathbb{E}[X] = \mu_X = \int x\,dF_X(x) = \begin{cases}\sum_x x\,f_X(x) & X \text{ discrete}\\ \int x\,f_X(x)\,dx & X \text{ continuous}\end{cases}$

Properties
  $\mathbb{P}[X = c] = 1 \implies \mathbb{E}[c] = c$
  $\mathbb{E}[cX] = c\,\mathbb{E}[X]$
  $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$
  $\mathbb{E}[XY] = \int\!\!\int_{X,Y} xy\,f_{X,Y}(x,y)\,dF_X(x)\,dF_Y(y)$
  $\mathbb{E}[\varphi(X)] \ne \varphi(\mathbb{E}[X])$ in general (cf. Jensen inequality)
  $\mathbb{P}[X \ge Y] = 1 \implies \mathbb{E}[X] \ge \mathbb{E}[Y]$;  $\mathbb{P}[X = Y] = 1 \implies \mathbb{E}[X] = \mathbb{E}[Y]$
  $\mathbb{E}[X] = \sum_{x=1}^{\infty}\mathbb{P}[X \ge x]$ (non-negative integer-valued $X$)

Sample mean
  $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$

Conditional Expectation
  $\mathbb{E}[Y \mid X = x] = \int y\,f(y \mid x)\,dy$
  $\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid Y]]$
  $\mathbb{E}[\varphi(X,Y) \mid X = x] = \int \varphi(x,y)\,f_{Y|X}(y \mid x)\,dy$
  $\mathbb{E}[\varphi(Y,Z) \mid X = x] = \int\!\!\int \varphi(y,z)\,f_{(Y,Z)|X}(y,z \mid x)\,dy\,dz$
  $\mathbb{E}[Y + Z \mid X] = \mathbb{E}[Y \mid X] + \mathbb{E}[Z \mid X]$
  $\mathbb{E}[\varphi(X)Y \mid X] = \varphi(X)\,\mathbb{E}[Y \mid X]$
  $\mathbb{E}[Y \mid X] = c \implies \mathrm{Cov}[X, Y] = 0$

5 Variance

Variance
  $\mathbb{V}[X] = \sigma_X^2 = \mathbb{E}\left[(X - \mathbb{E}[X])^2\right] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$
  $\mathbb{V}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n}\mathbb{V}[X_i] + 2\sum_{i\ne j}\mathrm{Cov}[X_i, X_j]$
  $\mathbb{V}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n}\mathbb{V}[X_i]$  iff $X_i \perp X_j$

Standard deviation
  $\mathrm{sd}[X] = \sqrt{\mathbb{V}[X]} = \sigma_X$

Covariance
  $\mathrm{Cov}[X, Y] = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]$
  $\mathrm{Cov}[X, a] = 0$
  $\mathrm{Cov}[X, X] = \mathbb{V}[X]$
  $\mathrm{Cov}[X, Y] = \mathrm{Cov}[Y, X]$
  $\mathrm{Cov}[aX, bY] = ab\,\mathrm{Cov}[X, Y]$
  $\mathrm{Cov}[X + a, Y + b] = \mathrm{Cov}[X, Y]$
  $\mathrm{Cov}\left[\sum_{i=1}^{n} X_i, \sum_{j=1}^{m} Y_j\right] = \sum_{i=1}^{n}\sum_{j=1}^{m}\mathrm{Cov}[X_i, Y_j]$

Correlation
  $\rho[X, Y] = \frac{\mathrm{Cov}[X, Y]}{\sqrt{\mathbb{V}[X]\,\mathbb{V}[Y]}}$

Independence
  $X \perp Y \implies \rho[X, Y] = 0 \iff \mathrm{Cov}[X, Y] = 0 \iff \mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$

Sample variance
  $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$

Conditional Variance
  $\mathbb{V}[Y \mid X] = \mathbb{E}\left[(Y - \mathbb{E}[Y \mid X])^2 \mid X\right] = \mathbb{E}[Y^2 \mid X] - \mathbb{E}[Y \mid X]^2$
  $\mathbb{V}[Y] = \mathbb{E}[\mathbb{V}[Y \mid X]] + \mathbb{V}[\mathbb{E}[Y \mid X]]$

6 Inequalities

Cauchy-Schwarz
  $\mathbb{E}[XY]^2 \le \mathbb{E}[X^2]\,\mathbb{E}[Y^2]$

Markov
  $\mathbb{P}[\varphi(X) \ge t] \le \frac{\mathbb{E}[\varphi(X)]}{t}$

Chebyshev
  $\mathbb{P}[|X - \mathbb{E}[X]| \ge t] \le \frac{\mathbb{V}[X]}{t^2}$

Chernoff
  $\mathbb{P}[X \ge (1+\delta)\mu] \le \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu}$  ($\delta > -1$)

Jensen
  $\mathbb{E}[\varphi(X)] \ge \varphi(\mathbb{E}[X])$ for convex $\varphi$

A small Monte Carlo check of the Markov and Chebyshev bounds follows.
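This is a minimal sketch, assuming numpy and an exponential example of my choosing (not from the text):

```python
# Compare the exact tail probability of an Exp(1) variable with the
# Markov and Chebyshev upper bounds, estimating the tail by simulation.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # E[X] = 1, V[X] = 1
t = 3.0

tail = np.mean(x >= t)                 # P[X >= t], about e^{-3} ~ 0.0498
markov = 1.0 / t                       # E[X]/t
chebyshev = 1.0 / (t - 1.0) ** 2       # V[X]/(t-1)^2 bounds P[|X-E[X]| >= t-1]

print(tail, markov, chebyshev)         # the simulated tail is below both bounds
```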

7 Distribution Relationships

Binomial
  $X_i \sim \mathrm{Bern}(p) \implies \sum_{i=1}^{n} X_i \sim \mathrm{Bin}(n, p)$
  $X \sim \mathrm{Bin}(n, p),\ Y \sim \mathrm{Bin}(m, p) \implies X + Y \sim \mathrm{Bin}(n+m, p)$
  $\lim_{n\to\infty}\mathrm{Bin}(n, p) = \mathrm{Po}(np)$  ($n$ large, $p$ small)
  $\lim_{n\to\infty}\mathrm{Bin}(n, p) = \mathcal{N}(np, np(1-p))$  ($n$ large, $p$ far from 0 and 1)

Negative Binomial
  $X \sim \mathrm{NBin}(1, p) = \mathrm{Geo}(p)$
  $X \sim \mathrm{NBin}(r, p) = \sum_{i=1}^{r}\mathrm{Geo}(p)$
  $X_i \sim \mathrm{NBin}(r_i, p) \implies \sum X_i \sim \mathrm{NBin}\left(\sum r_i, p\right)$
  $X \sim \mathrm{NBin}(r, p),\ Y \sim \mathrm{Bin}(s+r, p) \implies \mathbb{P}[X \le s] = \mathbb{P}[Y \ge r]$

Poisson
  $X_i \sim \mathrm{Po}(\lambda_i),\ X_i \perp X_j \implies \sum_{i=1}^{n} X_i \sim \mathrm{Po}\left(\sum_{i=1}^{n}\lambda_i\right)$
  $X_i \sim \mathrm{Po}(\lambda_i),\ X_i \perp X_j \implies X_i \,\Big|\, \sum_{j=1}^{n} X_j \sim \mathrm{Bin}\left(\sum_{j=1}^{n} X_j,\ \frac{\lambda_i}{\sum_{j=1}^{n}\lambda_j}\right)$

Exponential
  $X_i \sim \mathrm{Exp}(\beta),\ X_i \perp X_j \implies \sum_{i=1}^{n} X_i \sim \mathrm{Gamma}(n, \beta)$
  Memoryless property: $\mathbb{P}[X > x+y \mid X > y] = \mathbb{P}[X > x]$

Normal
  $X \sim \mathcal{N}(\mu, \sigma^2) \implies \frac{X-\mu}{\sigma} \sim \mathcal{N}(0, 1)$
  $X \sim \mathcal{N}(\mu, \sigma^2),\ Z = aX + b \implies Z \sim \mathcal{N}(a\mu + b, a^2\sigma^2)$
  $X \sim \mathcal{N}(\mu_1, \sigma_1^2),\ Y \sim \mathcal{N}(\mu_2, \sigma_2^2)$ independent $\implies X + Y \sim \mathcal{N}(\mu_1+\mu_2, \sigma_1^2+\sigma_2^2)$
  $X_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$ independent $\implies \sum_i X_i \sim \mathcal{N}\left(\sum_i\mu_i, \sum_i\sigma_i^2\right)$
  $\mathbb{P}[a < X \le b] = \Phi\left(\frac{b-\mu}{\sigma}\right) - \Phi\left(\frac{a-\mu}{\sigma}\right)$
  $\Phi(-x) = 1 - \Phi(x)$,  $\phi'(x) = -x\,\phi(x)$,  $\phi''(x) = (x^2-1)\,\phi(x)$
  Upper quantile of $\mathcal{N}(0, 1)$: $z_{\alpha} = \Phi^{-1}(1-\alpha)$

Gamma
  $X \sim \mathrm{Gamma}(\alpha, \beta) \iff X/\beta \sim \mathrm{Gamma}(\alpha, 1)$
  $\mathrm{Gamma}(\alpha, \beta) \sim \sum_{i=1}^{\alpha}\mathrm{Exp}(\beta)$ (integer $\alpha$)
  $X_i \sim \mathrm{Gamma}(\alpha_i, \beta),\ X_i \perp X_j \implies \sum_i X_i \sim \mathrm{Gamma}\left(\sum_i\alpha_i, \beta\right)$
  $\frac{\Gamma(\alpha)}{\lambda^{\alpha}} = \int_0^{\infty}x^{\alpha-1}e^{-\lambda x}\,dx$

Beta
  $\frac{1}{B(\alpha,\beta)}\,x^{\alpha-1}(1-x)^{\beta-1} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}$
  $\mathbb{E}[X^k] = \frac{B(\alpha+k, \beta)}{B(\alpha, \beta)} = \frac{\alpha+k-1}{\alpha+\beta+k-1}\,\mathbb{E}[X^{k-1}]$
  $\mathrm{Beta}(1, 1) \sim \mathrm{Unif}(0, 1)$

8 Probability and Moment Generating Functions

  $G_X(t) = \mathbb{E}[t^X]$  ($|t| < 1$)
  $M_X(t) = G_X(e^t) = \mathbb{E}[e^{Xt}] = \mathbb{E}\left[\sum_{i=0}^{\infty}\frac{(Xt)^i}{i!}\right] = \sum_{i=0}^{\infty}\frac{\mathbb{E}[X^i]}{i!}\,t^i$
  $\mathbb{P}[X = 0] = G_X(0)$
  $\mathbb{P}[X = 1] = G'_X(0)$
  $\mathbb{P}[X = i] = \frac{G_X^{(i)}(0)}{i!}$
  $\mathbb{E}[X] = G'_X(1^-)$
  $\mathbb{E}[X^k] = M_X^{(k)}(0)$
  $\mathbb{E}\left[\frac{X!}{(X-k)!}\right] = G_X^{(k)}(1^-)$
  $\mathbb{V}[X] = G''_X(1^-) + G'_X(1^-) - \left(G'_X(1^-)\right)^2$
  $G_X(t) = G_Y(t) \implies X \stackrel{d}{=} Y$

9 Multivariate Distributions

9.1 Standard Bivariate Normal

Let $X, Z \sim \mathcal{N}(0, 1)$ with $X \perp Z$ and $Y = \rho X + \sqrt{1-\rho^2}\,Z$.

Joint density
  $f(x, y) = \frac{1}{2\pi\sqrt{1-\rho^2}}\exp\left\{-\frac{x^2 + y^2 - 2\rho xy}{2(1-\rho^2)}\right\}$

Conditionals
  $(Y \mid X = x) \sim \mathcal{N}(\rho x, 1-\rho^2)$  and  $(X \mid Y = y) \sim \mathcal{N}(\rho y, 1-\rho^2)$

Independence
  $X \perp Y \iff \rho = 0$

9.2 Bivariate Normal

Let $X \sim \mathcal{N}(\mu_x, \sigma_x^2)$ and $Y \sim \mathcal{N}(\mu_y, \sigma_y^2)$.
  $f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}}\exp\left\{-\frac{z}{2(1-\rho^2)}\right\}$
  $z = \left(\frac{x-\mu_x}{\sigma_x}\right)^2 + \left(\frac{y-\mu_y}{\sigma_y}\right)^2 - 2\rho\left(\frac{x-\mu_x}{\sigma_x}\right)\left(\frac{y-\mu_y}{\sigma_y}\right)$

Conditional mean and variance
  $\mathbb{E}[X \mid Y] = \mathbb{E}[X] + \rho\frac{\sigma_X}{\sigma_Y}(Y - \mathbb{E}[Y])$
  $\mathbb{V}[X \mid Y] = \sigma_X\sqrt{1-\rho^2}$

9.3 Multivariate Normal

Covariance matrix $\Sigma$ (precision matrix $\Sigma^{-1}$)
  $\Sigma = \begin{pmatrix}\mathbb{V}[X_1] & \cdots & \mathrm{Cov}[X_1, X_k]\\ \vdots & \ddots & \vdots\\ \mathrm{Cov}[X_k, X_1] & \cdots & \mathbb{V}[X_k]\end{pmatrix}$

If $X \sim \mathcal{N}(\mu, \Sigma)$,
  $f_X(x) = (2\pi)^{-n/2}|\Sigma|^{-1/2}\exp\left\{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right\}$

Properties
  $Z \sim \mathcal{N}(0, 1)$ and $X = \mu + \Sigma^{1/2}Z \implies X \sim \mathcal{N}(\mu, \Sigma)$
  $X \sim \mathcal{N}(\mu, \Sigma) \implies \Sigma^{-1/2}(X - \mu) \sim \mathcal{N}(0, 1)$
  $X \sim \mathcal{N}(\mu, \Sigma) \implies AX \sim \mathcal{N}(A\mu, A\Sigma A^T)$
  $X \sim \mathcal{N}(\mu, \Sigma)$ and $a$ a vector of length $k$ $\implies a^TX \sim \mathcal{N}(a^T\mu, a^T\Sigma a)$

10 Convergence

Let $\{X_1, X_2, \dots\}$ be a sequence of rvs and let $X$ be another rv. Let $F_n$ denote the cdf of $X_n$ and let $F$ denote the cdf of $X$.

Types of Convergence
  1. In distribution (weakly, in law): $X_n \xrightarrow{D} X$
     $\lim_{n\to\infty} F_n(t) = F(t)$ at all $t$ where $F$ is continuous
  2. In probability: $X_n \xrightarrow{P} X$
     $(\forall\varepsilon > 0)\ \lim_{n\to\infty}\mathbb{P}[|X_n - X| > \varepsilon] = 0$
  3. Almost surely (strongly): $X_n \xrightarrow{as} X$
     $\mathbb{P}\left[\lim_{n\to\infty} X_n = X\right] = \mathbb{P}\left[\omega \in \Omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)\right] = 1$
  4. In quadratic mean ($L_2$): $X_n \xrightarrow{qm} X$
     $\lim_{n\to\infty}\mathbb{E}\left[(X_n - X)^2\right] = 0$

Relationships
  $X_n \xrightarrow{qm} X \implies X_n \xrightarrow{P} X \implies X_n \xrightarrow{D} X$
  $X_n \xrightarrow{as} X \implies X_n \xrightarrow{P} X$
  $X_n \xrightarrow{D} X$ and $(\exists c \in \mathbb{R})\ \mathbb{P}[X = c] = 1 \implies X_n \xrightarrow{P} X$
  $X_n \xrightarrow{P} X$ and $Y_n \xrightarrow{P} Y \implies X_n + Y_n \xrightarrow{P} X + Y$
  $X_n \xrightarrow{qm} X$ and $Y_n \xrightarrow{qm} Y \implies X_n + Y_n \xrightarrow{qm} X + Y$
  $X_n \xrightarrow{P} X$ and $Y_n \xrightarrow{P} Y \implies X_nY_n \xrightarrow{P} XY$
  $X_n \xrightarrow{P} X \implies \varphi(X_n) \xrightarrow{P} \varphi(X)$
  $X_n \xrightarrow{D} X \implies \varphi(X_n) \xrightarrow{D} \varphi(X)$
  $X_n \xrightarrow{qm} b \iff \lim_{n\to\infty}\mathbb{E}[X_n] = b$ and $\lim_{n\to\infty}\mathbb{V}[X_n] = 0$
  $X_1, \dots, X_n$ iid, $\mathbb{E}[X] = \mu$, $\mathbb{V}[X] < \infty \implies \bar{X}_n \xrightarrow{qm} \mu$

Slutzky's Theorem
  $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{P} c \implies X_n + Y_n \xrightarrow{D} X + c$
  $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{P} c \implies X_nY_n \xrightarrow{D} cX$
  In general: $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{D} Y$ does not imply $X_n + Y_n \xrightarrow{D} X + Y$

10.1 Law of Large Numbers (LLN)

Let $\{X_1, \dots, X_n\}$ be a sequence of iid rvs with $\mathbb{E}[X_1] = \mu$ and $\mathbb{V}[X_1] < \infty$.
  Weak (WLLN): $\bar{X}_n \xrightarrow{P} \mu$ as $n \to \infty$
  Strong (SLLN): $\bar{X}_n \xrightarrow{as} \mu$ as $n \to \infty$

10.2 Central Limit Theorem (CLT)

Let $\{X_1, \dots, X_n\}$ be a sequence of iid rvs with $\mathbb{E}[X_1] = \mu$ and $\mathbb{V}[X_1] = \sigma^2$.
  $Z_n := \frac{\bar{X}_n - \mu}{\sqrt{\mathbb{V}[\bar{X}_n]}} = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{D} Z$  where $Z \sim \mathcal{N}(0, 1)$
  $\lim_{n\to\infty}\mathbb{P}[Z_n \le z] = \Phi(z)$,  $z \in \mathbb{R}$

CLT notations
  $Z_n \approx \mathcal{N}(0, 1)$
  $\bar{X}_n \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)$
  $\bar{X}_n - \mu \approx \mathcal{N}\left(0, \frac{\sigma^2}{n}\right)$
  $\sqrt{n}(\bar{X}_n - \mu) \approx \mathcal{N}(0, \sigma^2)$
  $\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \approx \mathcal{N}(0, 1)$

Continuity Correction
  $\mathbb{P}[\bar{X}_n \le x] \approx \Phi\left(\frac{x + \frac{1}{2} - \mu}{\sigma/\sqrt{n}}\right)$
  $\mathbb{P}[\bar{X}_n \ge x] \approx 1 - \Phi\left(\frac{x - \frac{1}{2} - \mu}{\sigma/\sqrt{n}}\right)$

Delta Method
  $Y_n \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right) \implies \varphi(Y_n) \approx \mathcal{N}\left(\varphi(\mu), (\varphi'(\mu))^2\frac{\sigma^2}{n}\right)$

11 Statistical Inference

Let $X_1, \dots, X_n \stackrel{iid}{\sim} F$ if not otherwise noted.

11.1 Point Estimation

  Point estimator $\hat{\theta}_n$ of $\theta$ is a rv: $\hat{\theta}_n = g(X_1, \dots, X_n)$
  $\mathrm{bias}(\hat{\theta}_n) = \mathbb{E}[\hat{\theta}_n] - \theta$
  Consistency: $\hat{\theta}_n \xrightarrow{P} \theta$
  Sampling distribution: $F(\hat{\theta}_n)$
  Standard error: $\mathrm{se}(\hat{\theta}_n) = \sqrt{\mathbb{V}[\hat{\theta}_n]}$
  Mean squared error: $\mathrm{mse} = \mathbb{E}\left[(\hat{\theta}_n - \theta)^2\right] = \mathrm{bias}(\hat{\theta}_n)^2 + \mathbb{V}[\hat{\theta}_n]$
  $\lim_{n\to\infty}\mathrm{bias}(\hat{\theta}_n) = 0$ and $\lim_{n\to\infty}\mathrm{se}(\hat{\theta}_n) = 0 \implies \hat{\theta}_n$ is consistent
  Asymptotic normality: $\frac{\hat{\theta}_n - \theta}{\mathrm{se}} \xrightarrow{D} \mathcal{N}(0, 1)$
  Slutzky's Theorem often lets us replace $\mathrm{se}(\hat{\theta}_n)$ by some (weakly) consistent estimator $\hat{\mathrm{se}}(\hat{\theta}_n)$.

A short simulation illustrating the CLT and the standard error of the sample mean follows.
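The following minimal sketch (numpy assumed; the exponential-data example is mine) illustrates the CLT approximation $\bar{X}_n \approx \mathcal{N}(\mu, \sigma^2/n)$ and the standard error of the sample mean:

```python
# Simulate many sample means from a skewed distribution and compare the
# empirical sd of the sample mean with the CLT value sigma/sqrt(n).
import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 10_000
samples = rng.exponential(scale=2.0, size=(reps, n))   # mu = 2, sigma = 2
means = samples.mean(axis=1)

print(means.mean())                 # ~2.0  (unbiased for mu)
print(means.std(ddof=1))            # ~ sigma/sqrt(n) = 2/sqrt(50) ~ 0.283
print(np.mean(np.abs(means - 2.0) <= 1.96 * 2.0 / np.sqrt(n)))  # ~0.95
```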

11.2 Normal-based Confidence Interval

Suppose $\hat{\theta}_n \approx \mathcal{N}(\theta, \hat{\mathrm{se}}^2)$. Let $z_{\alpha/2} = \Phi^{-1}(1-(\alpha/2))$, i.e., $\mathbb{P}[Z > z_{\alpha/2}] = \alpha/2$ and $\mathbb{P}[-z_{\alpha/2} < Z < z_{\alpha/2}] = 1-\alpha$ where $Z \sim \mathcal{N}(0, 1)$. Then
  $C_n = \hat{\theta}_n \pm z_{\alpha/2}\,\hat{\mathrm{se}}$

11.3 Empirical Distribution Function

Empirical Distribution Function (ECDF)
  $\hat{F}_n(x) = \frac{\sum_{i=1}^{n}I(X_i \le x)}{n}$,  $I(X_i \le x) = \begin{cases}1 & X_i \le x\\ 0 & X_i > x\end{cases}$

Properties (for any fixed $x$)
  $\mathbb{E}[\hat{F}_n(x)] = F(x)$
  $\mathbb{V}[\hat{F}_n(x)] = \frac{F(x)(1-F(x))}{n}$
  $\mathrm{mse} = \frac{F(x)(1-F(x))}{n} \to 0$
  $\hat{F}_n(x) \xrightarrow{P} F(x)$

Dvoretzky-Kiefer-Wolfowitz (DKW) Inequality ($X_1, \dots, X_n \sim F$)
  $\mathbb{P}\left[\sup_x\left|F(x) - \hat{F}_n(x)\right| > \varepsilon\right] \le 2e^{-2n\varepsilon^2}$

Nonparametric $1-\alpha$ confidence band for $F$
  $L(x) = \max\{\hat{F}_n(x) - \varepsilon_n, 0\}$,  $U(x) = \min\{\hat{F}_n(x) + \varepsilon_n, 1\}$
  $\varepsilon_n = \sqrt{\frac{1}{2n}\log\left(\frac{2}{\alpha}\right)}$
  $\mathbb{P}[L(x) \le F(x) \le U(x)\ \forall x] \ge 1-\alpha$

11.4 Statistical Functionals

  Statistical functional: $T(F)$
  Plug-in estimator of $\theta = T(F)$: $\hat{\theta}_n = T(\hat{F}_n)$
  Linear functional: $T(F) = \int\varphi(x)\,dF_X(x)$
  Plug-in estimator for linear functional: $T(\hat{F}_n) = \int\varphi(x)\,d\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n}\varphi(X_i)$
  Often: $T(\hat{F}_n) \approx \mathcal{N}\left(T(F), \hat{\mathrm{se}}^2\right) \implies T(\hat{F}_n) \pm z_{\alpha/2}\,\hat{\mathrm{se}}$
  $p^{\text{th}}$ quantile: $F^{-1}(p) = \inf\{x : F(x) \ge p\}$
  $\hat{\mu} = \bar{X}_n$
  $\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$
  $\hat{\kappa} = \frac{\frac{1}{n}\sum_{i=1}^{n}(X_i - \hat{\mu})^3}{\hat{\sigma}^3}$
  $\hat{\rho} = \frac{\sum_{i=1}^{n}(X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X}_n)^2}\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y}_n)^2}}$

12 Parametric Inference

Let $\mathfrak{F} = \{f(x; \theta) : \theta \in \Theta\}$ be a parametric model with parameter space $\Theta \subseteq \mathbb{R}^k$ and parameter $\theta = (\theta_1, \dots, \theta_k)$.

12.1 Method of Moments

$j^{\text{th}}$ moment
  $\alpha_j(\theta) = \mathbb{E}[X^j] = \int x^j\,dF_X(x)$

$j^{\text{th}}$ sample moment
  $\hat{\alpha}_j = \frac{1}{n}\sum_{i=1}^{n}X_i^j$

Method of Moments Estimator (MoM): $\hat{\theta}_n$ solves
  $\alpha_1(\hat{\theta}_n) = \hat{\alpha}_1,\ \alpha_2(\hat{\theta}_n) = \hat{\alpha}_2,\ \dots,\ \alpha_k(\hat{\theta}_n) = \hat{\alpha}_k$

Properties of the MoM estimator
  $\hat{\theta}_n$ exists with probability tending to 1
  Consistency: $\hat{\theta}_n \xrightarrow{P} \theta$
  Asymptotic normality: $\sqrt{n}(\hat{\theta} - \theta) \xrightarrow{D} \mathcal{N}(0, \Sigma)$
    where $\Sigma = g\,\mathbb{E}[YY^T]\,g^T$, $Y = (X, X^2, \dots, X^k)^T$,
    $g = (g_1, \dots, g_k)$ and $g_j = \frac{\partial}{\partial\theta}\alpha_j^{-1}(\theta)$

A short method-of-moments example follows.
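As a concrete illustration (my example, not from the original): the $\mathrm{Gamma}(\alpha, \beta)$ model has $\mathbb{E}[X] = \alpha\beta$ and $\mathbb{V}[X] = \alpha\beta^2$, so matching the first two sample moments gives closed-form MoM estimates.

```python
# Method-of-moments estimates for Gamma(alpha, beta) with E[X] = alpha*beta
# and V[X] = alpha*beta^2: beta_hat = s2/xbar, alpha_hat = xbar/beta_hat.
import numpy as np

rng = np.random.default_rng(2)
x = rng.gamma(shape=3.0, scale=2.0, size=5_000)   # true alpha=3, beta=2

xbar = x.mean()
s2 = x.var()                    # second central sample moment
beta_hat = s2 / xbar
alpha_hat = xbar / beta_hat

print(alpha_hat, beta_hat)      # close to (3, 2)
```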

12.2 Maximum Likelihood

Likelihood: $\mathcal{L}_n: \Theta \to [0, \infty)$
  $\mathcal{L}_n(\theta) = \prod_{i=1}^{n}f(X_i; \theta)$

Log-likelihood
  $\ell_n(\theta) = \log\mathcal{L}_n(\theta) = \sum_{i=1}^{n}\log f(X_i; \theta)$

Maximum Likelihood Estimator (mle)
  $\mathcal{L}_n(\hat{\theta}_n) = \sup_{\theta}\mathcal{L}_n(\theta)$

Score Function
  $s(X; \theta) = \frac{\partial}{\partial\theta}\log f(X; \theta)$

Fisher Information
  $I(\theta) = \mathbb{V}_{\theta}[s(X; \theta)]$,  $I_n(\theta) = n\,I(\theta)$

Fisher Information (exponential family)
  $I(\theta) = -\mathbb{E}_{\theta}\left[\frac{\partial}{\partial\theta}s(X; \theta)\right]$

Observed Fisher Information
  $I_n^{obs}(\theta) = -\frac{\partial^2}{\partial\theta^2}\sum_{i=1}^{n}\log f(X_i; \theta)$

Properties of the mle
  Consistency: $\hat{\theta}_n \xrightarrow{P} \theta$
  Equivariance: $\hat{\theta}_n$ is the mle $\implies \varphi(\hat{\theta}_n)$ is the mle of $\varphi(\theta)$
  Asymptotic normality:
    1. $\mathrm{se} \approx \sqrt{1/I_n(\theta)}$ and $\frac{\hat{\theta}_n - \theta}{\mathrm{se}} \xrightarrow{D} \mathcal{N}(0, 1)$
    2. $\hat{\mathrm{se}} \approx \sqrt{1/I_n(\hat{\theta}_n)}$ and $\frac{\hat{\theta}_n - \theta}{\hat{\mathrm{se}}} \xrightarrow{D} \mathcal{N}(0, 1)$
  Asymptotic optimality (or efficiency), i.e., smallest variance for large samples. If $\tilde{\theta}_n$ is any other estimator, the asymptotic relative efficiency is
    $\mathrm{are}(\tilde{\theta}_n, \hat{\theta}_n) = \frac{\mathbb{V}[\hat{\theta}_n]}{\mathbb{V}[\tilde{\theta}_n]} \le 1$
  Approximately the Bayes estimator

12.2.1 Delta Method

If $\tau = \varphi(\theta)$, where $\varphi$ is differentiable and $\varphi'(\theta) \ne 0$:
  $\frac{\hat{\tau}_n - \tau}{\hat{\mathrm{se}}(\hat{\tau})} \xrightarrow{D} \mathcal{N}(0, 1)$
where $\hat{\tau} = \varphi(\hat{\theta})$, $\hat{\theta}$ is the mle of $\theta$, and
  $\hat{\mathrm{se}}(\hat{\tau}) = |\varphi'(\hat{\theta})|\,\hat{\mathrm{se}}(\hat{\theta}_n)$

A short example combining the mle and the delta method follows.
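A minimal sketch combining the two ideas (my Poisson example, not from the original): for $X_i \sim \mathrm{Po}(\lambda)$ the mle is $\hat{\lambda} = \bar{X}_n$ with $I_n(\lambda) = n/\lambda$, and the delta method gives a standard error for $\hat{\psi} = e^{-\hat{\lambda}}$, the mle of $\mathbb{P}[X = 0]$:

```python
# mle and delta method for Poisson data: lambda_hat = xbar,
# se(lambda_hat) = sqrt(lambda_hat / n); psi = exp(-lambda) with
# se(psi_hat) = |psi'(lambda_hat)| * se(lambda_hat) = exp(-lambda_hat) * se.
import numpy as np

rng = np.random.default_rng(3)
x = rng.poisson(lam=2.5, size=400)
n = x.size

lam_hat = x.mean()                          # mle
se_lam = np.sqrt(lam_hat / n)               # 1 / sqrt(I_n(lambda_hat))
psi_hat = np.exp(-lam_hat)                  # mle of P[X = 0] by equivariance
se_psi = np.exp(-lam_hat) * se_lam          # delta method

print(lam_hat, lam_hat - 1.96 * se_lam, lam_hat + 1.96 * se_lam)
print(psi_hat, psi_hat - 1.96 * se_psi, psi_hat + 1.96 * se_psi)
```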

12.3 Multiparameter Models

Let $\theta = (\theta_1, \dots, \theta_k)$ and $\hat{\theta} = (\hat{\theta}_1, \dots, \hat{\theta}_k)$ be the mle.
  $H_{jj} = \frac{\partial^2\ell_n}{\partial\theta_j^2}$,  $H_{jk} = \frac{\partial^2\ell_n}{\partial\theta_j\,\partial\theta_k}$

Fisher Information Matrix
  $I_n(\theta) = -\begin{pmatrix}\mathbb{E}_{\theta}[H_{11}] & \cdots & \mathbb{E}_{\theta}[H_{1k}]\\ \vdots & \ddots & \vdots\\ \mathbb{E}_{\theta}[H_{k1}] & \cdots & \mathbb{E}_{\theta}[H_{kk}]\end{pmatrix}$

Under appropriate regularity conditions
  $(\hat{\theta} - \theta) \approx \mathcal{N}(0, J_n)$
with $J_n(\theta) = I_n^{-1}(\theta)$. Further, if $\hat{\theta}_j$ is the $j^{\text{th}}$ component of $\hat{\theta}$, then
  $\frac{\hat{\theta}_j - \theta_j}{\hat{\mathrm{se}}_j} \xrightarrow{D} \mathcal{N}(0, 1)$
where $\hat{\mathrm{se}}_j^2 = J_n(j, j)$ and $\mathrm{Cov}[\hat{\theta}_j, \hat{\theta}_k] = J_n(j, k)$.

12.3.1 Multiparameter Delta Method

Let $\tau = \varphi(\theta_1, \dots, \theta_k)$ be a function and let the gradient of $\varphi$ be
  $\nabla\varphi = \left(\frac{\partial\varphi}{\partial\theta_1}, \dots, \frac{\partial\varphi}{\partial\theta_k}\right)^T$

Suppose $\nabla\varphi\big|_{\theta=\hat{\theta}} \ne 0$ and $\hat{\tau} = \varphi(\hat{\theta})$. Then
  $\frac{\hat{\tau} - \tau}{\hat{\mathrm{se}}(\hat{\tau})} \xrightarrow{D} \mathcal{N}(0, 1)$
where
  $\hat{\mathrm{se}}(\hat{\tau}) = \sqrt{\left(\hat{\nabla}\varphi\right)^T\hat{J}_n\left(\hat{\nabla}\varphi\right)}$,  $\hat{J}_n = J_n(\hat{\theta})$,  $\hat{\nabla}\varphi = \nabla\varphi\big|_{\theta=\hat{\theta}}$

12.4 Parametric Bootstrap

Sample from $f(x; \hat{\theta}_n)$ instead of from $\hat{F}_n$, where $\hat{\theta}_n$ could be the mle or method of moments estimator.

13 Hypothesis Testing

  $H_0: \theta \in \Theta_0$  versus  $H_1: \theta \in \Theta_1$

Definitions
  Null hypothesis $H_0$
  Alternative hypothesis $H_1$
  Simple hypothesis: $\theta = \theta_0$
  Composite hypothesis: $\theta > \theta_0$ or $\theta < \theta_0$
  Two-sided test: $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$
  One-sided test: $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$
  Critical value $c$
  Test statistic $T$
  Rejection region $R = \{x : T(x) > c\}$
  Power function $\beta(\theta) = \mathbb{P}_{\theta}[X \in R]$
  Power of a test: $1 - \mathbb{P}[\text{Type II error}] = \inf_{\theta\in\Theta_1}\beta(\theta)$
  Test size: $\alpha = \mathbb{P}[\text{Type I error}] = \sup_{\theta\in\Theta_0}\beta(\theta)$

                Retain $H_0$              Reject $H_0$
  $H_0$ true    correct                   Type I error ($\alpha$)
  $H_1$ true    Type II error ($\beta$)   correct (power)

p-value
  p-value $= \sup_{\theta\in\Theta_0}\mathbb{P}_{\theta}[T(X) \ge T(x)] = \inf\{\alpha : T(x) \in R_{\alpha}\}$
  p-value $= \sup_{\theta\in\Theta_0}\mathbb{P}_{\theta}[T(X^{\star}) \ge T(X)] = \inf\{\alpha : T(X) \in R_{\alpha}\}$  since $T(X^{\star}) \sim F_{\theta}$
  One-sided with continuous $T$: p-value $= 1 - F_{\theta_0}(T(X))$

  p-value        evidence
  < 0.01         very strong evidence against $H_0$
  0.01 - 0.05    strong evidence against $H_0$
  0.05 - 0.1     weak evidence against $H_0$
  > 0.1          little or no evidence against $H_0$

Wald Test
  Two-sided test: reject $H_0$ when $|W| > z_{\alpha/2}$ where $W = \frac{\hat{\theta} - \theta_0}{\hat{\mathrm{se}}}$
  $\mathbb{P}[|W| > z_{\alpha/2}] \to \alpha$
  p-value $= \mathbb{P}_{\theta_0}[|W| > |w|] \approx \mathbb{P}[|Z| > |w|] = 2\Phi(-|w|)$

Likelihood Ratio Test (LRT)
  $T(X) = \frac{\sup_{\theta\in\Theta}\mathcal{L}_n(\theta)}{\sup_{\theta\in\Theta_0}\mathcal{L}_n(\theta)} = \frac{\mathcal{L}_n(\hat{\theta}_n)}{\mathcal{L}_n(\hat{\theta}_{n,0})}$
  $\lambda(X) = 2\log T(X) \xrightarrow{D} \chi^2_{r-q}$  where  $\chi^2_k = \sum_{i=1}^{k}Z_i^2$ with $Z_1, \dots, Z_k \stackrel{iid}{\sim} \mathcal{N}(0, 1)$
  p-value $= \mathbb{P}_{\theta_0}[\lambda(X) > \lambda(x)] \approx \mathbb{P}[\chi^2_{r-q} > \lambda(x)]$

Multinomial LRT
  Let $\hat{p}_n = \left(\frac{X_1}{n}, \dots, \frac{X_k}{n}\right)$ be the mle.
  $T(X) = \frac{\mathcal{L}_n(\hat{p}_n)}{\mathcal{L}_n(p_0)} = \prod_{j=1}^{k}\left(\frac{\hat{p}_j}{p_{0j}}\right)^{X_j}$
  $\lambda(X) = 2\sum_{j=1}^{k}X_j\log\left(\frac{\hat{p}_j}{p_{0j}}\right) \xrightarrow{D} \chi^2_{k-1}$
  The approximate size-$\alpha$ LRT rejects $H_0$ when $\lambda(X) \ge \chi^2_{k-1,\alpha}$.

Pearson $\chi^2$ Test
  $T = \sum_{j=1}^{k}\frac{(X_j - \mathbb{E}[X_j])^2}{\mathbb{E}[X_j]}$  where $\mathbb{E}[X_j] = np_{0j}$ under $H_0$
  $T \xrightarrow{D} \chi^2_{k-1}$,  p-value $= \mathbb{P}[\chi^2_{k-1} > T(x)]$
  Faster convergence to $\chi^2_{k-1}$ than the LRT, hence preferable for small $n$.

Independence Testing
  $I$ rows, $J$ columns, $X$ multinomial sample of size $n = I \cdot J$
  mles unconstrained: $\hat{p}_{ij} = \frac{X_{ij}}{n}$
  mles under $H_0$: $\hat{p}_{0ij} = \hat{p}_{i\cdot}\,\hat{p}_{\cdot j} = \frac{X_{i\cdot}}{n}\frac{X_{\cdot j}}{n}$
  LRT: $\lambda = 2\sum_{i=1}^{I}\sum_{j=1}^{J}X_{ij}\log\left(\frac{nX_{ij}}{X_{i\cdot}X_{\cdot j}}\right)$
  Pearson $\chi^2$: $T = \sum_{i=1}^{I}\sum_{j=1}^{J}\frac{(X_{ij} - \mathbb{E}[X_{ij}])^2}{\mathbb{E}[X_{ij}]}$
  LRT and Pearson $\xrightarrow{D} \chi^2_{\nu}$, where $\nu = (I-1)(J-1)$

14 Bayesian Inference

Bayes' Theorem
  $f(\theta \mid x) = \frac{f(x \mid \theta)\,f(\theta)}{f(x^n)} = \frac{f(x \mid \theta)\,f(\theta)}{\int f(x \mid \theta)\,f(\theta)\,d\theta} \propto \mathcal{L}_n(\theta)\,f(\theta)$

Definitions
  $X^n = (X_1, \dots, X_n)$,  $x^n = (x_1, \dots, x_n)$
  Prior density $f(\theta)$
  Likelihood $f(x^n \mid \theta)$: joint density of the data
    In particular, $X^n$ iid $\implies f(x^n \mid \theta) = \prod_{i=1}^{n}f(x_i \mid \theta) = \mathcal{L}_n(\theta)$
  Posterior density $f(\theta \mid x^n)$
  Normalizing constant $c_n = f(x^n) = \int f(x \mid \theta)\,f(\theta)\,d\theta$
  Kernel: part of a density that depends on $\theta$
  Posterior mean $\bar{\theta}_n = \int\theta\,f(\theta \mid x^n)\,d\theta = \frac{\int\theta\,\mathcal{L}_n(\theta)\,f(\theta)\,d\theta}{\int\mathcal{L}_n(\theta)\,f(\theta)\,d\theta}$

14.1 Credible Intervals

$1-\alpha$ posterior interval
  $\mathbb{P}[\theta \in (a, b) \mid x^n] = \int_a^b f(\theta \mid x^n)\,d\theta = 1-\alpha$

$1-\alpha$ equal-tail credible interval
  $\int_{-\infty}^{a}f(\theta \mid x^n)\,d\theta = \int_{b}^{\infty}f(\theta \mid x^n)\,d\theta = \alpha/2$

$1-\alpha$ highest posterior density (HPD) region $R_n$
  1. $\mathbb{P}[\theta \in R_n] = 1-\alpha$
  2. $R_n = \{\theta : f(\theta \mid x^n) > k\}$ for some $k$
  $R_n$ is unimodal $\implies R_n$ is an interval

14.2 Function of Parameters

Let $\tau = \varphi(\theta)$ and $A = \{\theta : \varphi(\theta) \le \tau\}$.

Posterior CDF for $\tau$
  $H(\tau \mid x^n) = \mathbb{P}[\varphi(\theta) \le \tau \mid x^n] = \int_A f(\theta \mid x^n)\,d\theta$
  $h(\tau \mid x^n) = H'(\tau \mid x^n)$

Bayesian Delta Method
  $\tau \mid X^n \approx \mathcal{N}\left(\varphi(\hat{\theta}),\ \left(\hat{\mathrm{se}}\,\varphi'(\hat{\theta})\right)^2\right)$

14.3 Priors

Choice
  Subjective Bayesianism: the prior should incorporate as much detail as possible of the researcher's a priori knowledge, via prior elicitation.
  Objective Bayesianism: the prior should incorporate as little detail as possible (non-informative prior).
  Robust Bayesianism: consider various priors and determine the sensitivity of our inferences to changes in the prior.

Types
  Flat: $f(\theta) \propto$ constant
  Proper: $\int_{-\infty}^{\infty}f(\theta)\,d\theta = 1$
  Improper: $\int_{-\infty}^{\infty}f(\theta)\,d\theta = \infty$
  Jeffreys' prior (transformation-invariant): $f(\theta) \propto \sqrt{I(\theta)}$; multiparameter: $f(\theta) \propto \sqrt{\det(I(\theta))}$
  Conjugate: $f(\theta)$ and $f(\theta \mid x^n)$ belong to the same parametric family

14.3.1 Conjugate Priors

Discrete likelihood

  Likelihood             Conjugate prior   Posterior hyperparameters
  Bernoulli(p)           Beta(α, β)        α + Σᵢ xᵢ,  β + n − Σᵢ xᵢ
  Binomial(p)            Beta(α, β)        α + Σᵢ xᵢ,  β + Σᵢ Nᵢ − Σᵢ xᵢ
  Negative Binomial(p)   Beta(α, β)        α + rn,  β + Σᵢ xᵢ
  Poisson(λ)             Gamma(α, β)       α + Σᵢ xᵢ,  β + n
  Multinomial(p)         Dirichlet(α)      α + Σᵢ x⁽ⁱ⁾
  Geometric(p)           Beta(α, β)        α + n,  β + Σᵢ xᵢ

Continuous likelihood (subscript c denotes a known constant)

  Likelihood             Conjugate prior                        Posterior hyperparameters
  Uniform(0, θ)          Pareto(x_m, k)                         max{x_(n), x_m},  k + n
  Exponential(λ)         Gamma(α, β)                            α + n,  β + Σᵢ xᵢ
  Normal(μ, σ_c²)        Normal(μ₀, σ₀²)                        (μ₀/σ₀² + Σᵢ xᵢ/σ_c²)/(1/σ₀² + n/σ_c²),  (1/σ₀² + n/σ_c²)⁻¹
  Normal(μ_c, σ²)        Scaled Inverse Chi-square(ν, σ₀²)      ν + n,  (νσ₀² + Σᵢ (xᵢ − μ_c)²)/(ν + n)
  Normal(μ, σ²)          Normal-scaled Inverse Gamma(μ₀, ν, α, β)   (νμ₀ + n x̄)/(ν + n),  ν + n,  α + n/2,  β + ½ Σᵢ (xᵢ − x̄)² + nν(x̄ − μ₀)²/(2(n + ν))
  MVN(μ, Σ_c)            MVN(μ₀, Σ₀)                            (Σ₀⁻¹ + nΣ_c⁻¹)⁻¹(Σ₀⁻¹μ₀ + nΣ_c⁻¹x̄),  (Σ₀⁻¹ + nΣ_c⁻¹)⁻¹
  MVN(μ_c, Σ)            Inverse-Wishart(ν, Ψ)                  n + ν,  Ψ + Σᵢ (xᵢ − μ_c)(xᵢ − μ_c)ᵀ
  Pareto(x_mc, k)        Gamma(α, β)                            α + n,  β + Σᵢ log(xᵢ/x_mc)
  Pareto(x_m, k_c)       Pareto(x₀, k₀)                         x₀,  k₀ − k_c n  (where k₀ > k_c n)
  Gamma(α_c, β)          Gamma(α₀, β₀)                          α₀ + nα_c,  β₀ + Σᵢ xᵢ

14.4 Bayesian Testing

If $H_0: \theta \in \Theta_0$:
  Prior probability $\mathbb{P}[H_0] = \int_{\Theta_0}f(\theta)\,d\theta$
  Posterior probability $\mathbb{P}[H_0 \mid x^n] = \int_{\Theta_0}f(\theta \mid x^n)\,d\theta$

Let $H_0, \dots, H_{K-1}$ be $K$ hypotheses. Suppose $\theta \sim f(\theta \mid H_k)$:
  $\mathbb{P}[H_k \mid x^n] = \frac{f(x^n \mid H_k)\,\mathbb{P}[H_k]}{\sum_{k=1}^{K}f(x^n \mid H_k)\,\mathbb{P}[H_k]}$

Marginal likelihood
  $f(x^n \mid H_i) = \int_{\Theta}f(x^n \mid \theta, H_i)\,f(\theta \mid H_i)\,d\theta$

Posterior odds (of $H_i$ relative to $H_j$)
  $\frac{\mathbb{P}[H_i \mid x^n]}{\mathbb{P}[H_j \mid x^n]} = \underbrace{\frac{f(x^n \mid H_i)}{f(x^n \mid H_j)}}_{\text{Bayes factor } BF_{ij}}\times\underbrace{\frac{\mathbb{P}[H_i]}{\mathbb{P}[H_j]}}_{\text{prior odds}}$

Bayes factor
  $\log_{10}BF_{10}$   $BF_{10}$    evidence
  0 - 0.5             1 - 1.5      Weak
  0.5 - 1             1.5 - 10     Moderate
  1 - 2               10 - 100     Strong
  > 2                 > 100        Decisive

  $p^* = \frac{\frac{p}{1-p}\,BF_{10}}{1 + \frac{p}{1-p}\,BF_{10}}$  where $p = \mathbb{P}[H_1]$ and $p^* = \mathbb{P}[H_1 \mid x^n]$

A small conjugate-prior example follows.
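A minimal sketch of the first row of the table, the Beta-Bernoulli conjugate pair (my numbers, scipy assumed; not from the original): the posterior is available in closed form, so no sampling is needed.

```python
# Beta(a, b) prior on p with Bernoulli data: posterior is
# Beta(a + sum(x), b + n - sum(x)). Posterior mean and a 95% equal-tail
# credible interval via the Beta quantile function.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.binomial(1, 0.3, size=100)     # Bernoulli(0.3) data
a, b = 2.0, 2.0                        # prior hyperparameters

a_post = a + x.sum()
b_post = b + x.size - x.sum()
posterior = stats.beta(a_post, b_post)

print(posterior.mean())                            # posterior mean of p
print(posterior.ppf(0.025), posterior.ppf(0.975))  # equal-tail credible interval
```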

15 Exponential Family

Scalar parameter
  $f_X(x \mid \theta) = h(x)\exp\{\eta(\theta)T(x) - A(\theta)\} = h(x)\,g(\theta)\exp\{\eta(\theta)T(x)\}$

Vector parameter
  $f_X(x \mid \theta) = h(x)\exp\left\{\sum_{i=1}^{s}\eta_i(\theta)T_i(x) - A(\theta)\right\} = h(x)\exp\{\eta(\theta)\cdot T(x) - A(\theta)\} = h(x)\,g(\theta)\exp\{\eta(\theta)\cdot T(x)\}$

Natural form
  $f_X(x \mid \eta) = h(x)\exp\{\eta\cdot T(x) - A(\eta)\} = h(x)\,g(\eta)\exp\{\eta\cdot T(x)\} = h(x)\,g(\eta)\exp\{\eta^T T(x)\}$

16 Sampling Methods

16.1 The Bootstrap

Let $T_n = g(X_1, \dots, X_n)$ be a statistic.

1. Estimate $\mathbb{V}_F[T_n]$ with $\mathbb{V}_{\hat{F}_n}[T_n]$.
2. Approximate $\mathbb{V}_{\hat{F}_n}[T_n]$ using simulation:
   (a) Repeat the following $B$ times to get $T^*_{n,1}, \dots, T^*_{n,B}$, an iid sample from the sampling distribution implied by $\hat{F}_n$:
       i. Sample uniformly $X^*_1, \dots, X^*_n \sim \hat{F}_n$.
       ii. Compute $T^*_n = g(X^*_1, \dots, X^*_n)$.
   (b) Then
       $v_{boot} = \hat{\mathbb{V}}_{\hat{F}_n}[T_n] = \frac{1}{B}\sum_{b=1}^{B}\left(T^*_{n,b} - \frac{1}{B}\sum_{r=1}^{B}T^*_{n,r}\right)^2$

16.1.1 Bootstrap Confidence Intervals

Normal-based interval
  $T_n \pm z_{\alpha/2}\,\hat{\mathrm{se}}_{boot}$

Pivotal interval
  1. Location parameter $\theta = T(F)$
  2. Pivot $R_n = \hat{\theta}_n - \theta$
  3. Let $H(r) = \mathbb{P}[R_n \le r]$ be the cdf of $R_n$
  4. Let $R^*_{n,b} = \hat{\theta}^*_{n,b} - \hat{\theta}_n$. Approximate $H$ using the bootstrap:
     $\hat{H}(r) = \frac{1}{B}\sum_{b=1}^{B}I(R^*_{n,b} \le r)$
  5. Let $\theta^*_{\beta}$ denote the $\beta$ sample quantile of $(\hat{\theta}^*_{n,1}, \dots, \hat{\theta}^*_{n,B})$
  6. Let $r^*_{\beta}$ denote the $\beta$ sample quantile of $(R^*_{n,1}, \dots, R^*_{n,B})$, i.e., $r^*_{\beta} = \theta^*_{\beta} - \hat{\theta}_n$
  7. Then, an approximate $1-\alpha$ confidence interval is $C_n = (\hat{a}, \hat{b})$ with
     $\hat{a} = \hat{\theta}_n - \hat{H}^{-1}\left(1-\frac{\alpha}{2}\right) = \hat{\theta}_n - r^*_{1-\alpha/2} = 2\hat{\theta}_n - \theta^*_{1-\alpha/2}$
     $\hat{b} = \hat{\theta}_n - \hat{H}^{-1}\left(\frac{\alpha}{2}\right) = \hat{\theta}_n - r^*_{\alpha/2} = 2\hat{\theta}_n - \theta^*_{\alpha/2}$

Percentile interval
  $C_n = \left(\theta^*_{\alpha/2},\ \theta^*_{1-\alpha/2}\right)$

A minimal bootstrap sketch follows.
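A minimal sketch of the nonparametric bootstrap for the median (my example, numpy assumed), producing the normal-based, pivotal, and percentile intervals described above:

```python
# Bootstrap the sample median: resample with replacement B times, then form
# the normal-based, pivotal, and percentile 95% intervals.
import numpy as np

rng = np.random.default_rng(5)
x = rng.lognormal(mean=0.0, sigma=1.0, size=200)
theta_hat = np.median(x)

B = 2000
boot = np.array([np.median(rng.choice(x, size=x.size, replace=True))
                 for _ in range(B)])

se_boot = boot.std(ddof=1)
lo_q, hi_q = np.quantile(boot, [0.025, 0.975])

normal_ci = (theta_hat - 1.96 * se_boot, theta_hat + 1.96 * se_boot)
pivotal_ci = (2 * theta_hat - hi_q, 2 * theta_hat - lo_q)
percentile_ci = (lo_q, hi_q)
print(normal_ci, pivotal_ci, percentile_ci)
```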

16.2 Rejection Sampling

Setup
  We can easily sample from $g(\theta)$.
  We want to sample from $h(\theta)$, but it is difficult.
  We know $h(\theta)$ up to a proportionality constant: $h(\theta) = \frac{k(\theta)}{\int k(\theta)\,d\theta}$
  Envelope condition: we can find $M > 0$ such that $k(\theta) \le Mg(\theta)$ for all $\theta$

Algorithm
  1. Draw $\theta^{cand} \sim g(\theta)$
  2. Generate $u \sim \mathrm{Unif}(0, 1)$
  3. Accept $\theta^{cand}$ if $u \le \frac{k(\theta^{cand})}{Mg(\theta^{cand})}$
  4. Repeat until $B$ values of $\theta^{cand}$ have been accepted

Example
  We can easily sample from the prior $g(\theta) = f(\theta)$.
  Target is the posterior with $h(\theta) \propto k(\theta) = f(x^n \mid \theta)\,f(\theta)$.
  Envelope condition: $f(x^n \mid \theta) \le f(x^n \mid \hat{\theta}_n) = \mathcal{L}_n(\hat{\theta}_n) \equiv M$.
  Algorithm:
    1. Draw $\theta^{cand} \sim f(\theta)$
    2. Generate $u \sim \mathrm{Unif}(0, 1)$
    3. Accept $\theta^{cand}$ if $u \le \frac{\mathcal{L}_n(\theta^{cand})}{\mathcal{L}_n(\hat{\theta}_n)}$

16.3 Importance Sampling

Sample from an importance function $g$ rather than the target density $h$.

Algorithm to obtain an approximation to $\mathbb{E}[q(\theta) \mid x^n]$:
  1. Sample from the prior $\theta_1, \dots, \theta_B \stackrel{iid}{\sim} f(\theta)$
  2. For each $i = 1, \dots, B$, calculate $w_i = \frac{\mathcal{L}_n(\theta_i)}{\sum_{i=1}^{B}\mathcal{L}_n(\theta_i)}$
  3. $\mathbb{E}[q(\theta) \mid x^n] \approx \sum_{i=1}^{B}q(\theta_i)\,w_i$

A minimal rejection-sampling sketch follows.
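A minimal sketch of rejection sampling (my example, not from the original): drawing from a Beta(2, 5) target using a Unif(0, 1) proposal, with M taken as the maximum of the unnormalized density.

```python
# Rejection sampling from k(x) = x * (1-x)^4 (unnormalized Beta(2, 5))
# using g = Unif(0, 1) as the proposal and M >= max k(x).
import numpy as np

rng = np.random.default_rng(6)

def k(x):
    return x * (1.0 - x) ** 4          # unnormalized target density

grid = np.linspace(0.0, 1.0, 1001)
M = k(grid).max()                      # envelope constant, k(x) <= M * g(x)

samples = []
while len(samples) < 10_000:
    cand = rng.uniform(0.0, 1.0)
    u = rng.uniform(0.0, 1.0)
    if u <= k(cand) / M:               # accept with probability k/(M g)
        samples.append(cand)

samples = np.array(samples)
print(samples.mean())                  # ~ 2 / (2 + 5) = 0.286
```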

17 Decision Theory

Definitions
  Unknown quantity affecting our decision: $\theta \in \Theta$
  Decision rule: synonymous with an estimator $\hat{\theta}$
  Action $a \in \mathcal{A}$: possible value of the decision rule. In the estimation context, the action is just an estimate of $\theta$, $\hat{\theta}(x)$.
  Loss function $L$: consequences of taking action $a$ when the true state is $\theta$, i.e., the discrepancy between $\theta$ and $\hat{\theta}$; $L: \Theta \times \mathcal{A} \to [-k, \infty)$.

Loss functions
  Squared error loss: $L(\theta, a) = (\theta - a)^2$
  Linear loss: $L(\theta, a) = \begin{cases}K_1(\theta - a) & a - \theta < 0\\ K_2(a - \theta) & a - \theta \ge 0\end{cases}$
  Absolute error loss: $L(\theta, a) = |\theta - a|$ (linear loss with $K_1 = K_2$)
  $L_p$ loss: $L(\theta, a) = |\theta - a|^p$
  Zero-one loss: $L(\theta, a) = \begin{cases}0 & a = \theta\\ 1 & a \ne \theta\end{cases}$

17.1 Risk

Posterior risk
  $r(\hat{\theta} \mid x) = \int L(\theta, \hat{\theta}(x))\,f(\theta \mid x)\,d\theta = \mathbb{E}_{\theta|X}\left[L(\theta, \hat{\theta}(x))\right]$

(Frequentist) risk
  $R(\theta, \hat{\theta}) = \int L(\theta, \hat{\theta}(x))\,f(x \mid \theta)\,dx = \mathbb{E}_{X|\theta}\left[L(\theta, \hat{\theta}(X))\right]$

Bayes risk
  $r(f, \hat{\theta}) = \int\!\!\int L(\theta, \hat{\theta}(x))\,f(x, \theta)\,dx\,d\theta = \mathbb{E}_{\theta,X}\left[L(\theta, \hat{\theta}(X))\right]$
  $r(f, \hat{\theta}) = \mathbb{E}_{\theta}\left[\mathbb{E}_{X|\theta}\left[L(\theta, \hat{\theta}(X))\right]\right] = \mathbb{E}_{\theta}\left[R(\theta, \hat{\theta})\right]$
  $r(f, \hat{\theta}) = \mathbb{E}_{X}\left[\mathbb{E}_{\theta|X}\left[L(\theta, \hat{\theta}(X))\right]\right] = \mathbb{E}_{X}\left[r(\hat{\theta} \mid X)\right]$

17.2 Admissibility

  $\hat{\theta}'$ dominates $\hat{\theta}$ if
    $\forall\theta:\ R(\theta, \hat{\theta}') \le R(\theta, \hat{\theta})$
    $\exists\theta:\ R(\theta, \hat{\theta}') < R(\theta, \hat{\theta})$
  $\hat{\theta}$ is inadmissible if there is at least one other estimator $\hat{\theta}'$ that dominates it. Otherwise it is called admissible.

17.3 Bayes Rule

Bayes rule (or Bayes estimator)
  $r(f, \hat{\theta}) = \inf_{\tilde{\theta}}r(f, \tilde{\theta})$
  $\hat{\theta}(x) = \inf_{a}r(a \mid x)\ \forall x \implies r(f, \hat{\theta}) = \int r(\hat{\theta} \mid x)\,f(x)\,dx$

Theorems
  Squared error loss: posterior mean
  Absolute error loss: posterior median
  Zero-one loss: posterior mode

17.4 Minimax Rules

Maximum risk
  $\bar{R}(\hat{\theta}) = \sup_{\theta}R(\theta, \hat{\theta})$,  $\bar{R}(a) = \sup_{\theta}R(\theta, a)$

Minimax rule
  $\sup_{\theta}R(\theta, \hat{\theta}) = \inf_{\tilde{\theta}}\bar{R}(\tilde{\theta}) = \inf_{\tilde{\theta}}\sup_{\theta}R(\theta, \tilde{\theta})$
  $\hat{\theta}$ = Bayes rule and $\exists c:\ R(\theta, \hat{\theta}) = c \implies \hat{\theta}$ is minimax

Least favorable prior
  $\hat{\theta}^f$ = Bayes rule and $R(\theta, \hat{\theta}^f) \le r(f, \hat{\theta}^f)\ \forall\theta \implies \hat{\theta}^f$ is minimax and $f$ is least favorable

18 Linear Regression

Definitions
  Response variable $Y$
  Covariate $X$ (aka predictor variable or feature)

18.1 Simple Linear Regression

Model
  $Y_i = \beta_0 + \beta_1X_i + \epsilon_i$,  $\mathbb{E}[\epsilon_i \mid X_i] = 0$,  $\mathbb{V}[\epsilon_i \mid X_i] = \sigma^2$

Fitted line
  $\hat{r}(x) = \hat{\beta}_0 + \hat{\beta}_1x$

Predicted (fitted) values
  $\hat{Y}_i = \hat{r}(X_i)$

Residuals
  $\hat{\epsilon}_i = Y_i - \hat{Y}_i = Y_i - \left(\hat{\beta}_0 + \hat{\beta}_1X_i\right)$

Residual sums of squares (rss)
  $\mathrm{rss}(\hat{\beta}_0, \hat{\beta}_1) = \sum_{i=1}^{n}\hat{\epsilon}_i^2$

Least square estimates
  $\hat{\beta}^T = (\hat{\beta}_0, \hat{\beta}_1)^T$: minimizes rss over $\beta_0, \beta_1$
  $\hat{\beta}_0 = \bar{Y}_n - \hat{\beta}_1\bar{X}_n$
  $\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sum_{i=1}^{n}(X_i - \bar{X}_n)^2} = \frac{\sum_{i=1}^{n}X_iY_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n}X_i^2 - n\bar{X}^2}$
  $\mathbb{E}[\hat{\beta} \mid X^n] = \begin{pmatrix}\beta_0\\ \beta_1\end{pmatrix}$
  $\mathbb{V}[\hat{\beta} \mid X^n] = \frac{\sigma^2}{ns_X^2}\begin{pmatrix}\frac{1}{n}\sum_{i=1}^{n}X_i^2 & -\bar{X}_n\\ -\bar{X}_n & 1\end{pmatrix}$
  $\hat{\mathrm{se}}(\hat{\beta}_0) = \frac{\hat{\sigma}}{s_X\sqrt{n}}\sqrt{\frac{\sum_{i=1}^{n}X_i^2}{n}}$,  $\hat{\mathrm{se}}(\hat{\beta}_1) = \frac{\hat{\sigma}}{s_X\sqrt{n}}$
  where $s_X^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$ and $\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^{n}\hat{\epsilon}_i^2$ (an unbiased estimate of $\sigma^2$).

Further properties
  Consistency: $\hat{\beta}_0 \xrightarrow{P} \beta_0$ and $\hat{\beta}_1 \xrightarrow{P} \beta_1$
  Asymptotic normality: $\frac{\hat{\beta}_0 - \beta_0}{\hat{\mathrm{se}}(\hat{\beta}_0)} \xrightarrow{D} \mathcal{N}(0, 1)$  and  $\frac{\hat{\beta}_1 - \beta_1}{\hat{\mathrm{se}}(\hat{\beta}_1)} \xrightarrow{D} \mathcal{N}(0, 1)$
  Approximate $1-\alpha$ confidence intervals for $\beta_0$ and $\beta_1$:
    $\hat{\beta}_0 \pm z_{\alpha/2}\,\hat{\mathrm{se}}(\hat{\beta}_0)$  and  $\hat{\beta}_1 \pm z_{\alpha/2}\,\hat{\mathrm{se}}(\hat{\beta}_1)$
  The Wald test for $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \ne 0$: reject $H_0$ if $|W| > z_{\alpha/2}$ where $W = \hat{\beta}_1/\hat{\mathrm{se}}(\hat{\beta}_1)$.

$R^2$
  $R^2 = \frac{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2} = 1 - \frac{\sum_{i=1}^{n}\hat{\epsilon}_i^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2} = 1 - \frac{\mathrm{rss}}{\mathrm{tss}}$

Likelihood
  $\mathcal{L} = \prod_{i=1}^{n}f(X_i, Y_i) = \prod_{i=1}^{n}f_X(X_i)\times\prod_{i=1}^{n}f_{Y|X}(Y_i \mid X_i) = \mathcal{L}_1\times\mathcal{L}_2$
  $\mathcal{L}_1 = \prod_{i=1}^{n}f_X(X_i)$
  $\mathcal{L}_2 = \prod_{i=1}^{n}f_{Y|X}(Y_i \mid X_i) \propto \sigma^{-n}\exp\left\{-\frac{1}{2\sigma^2}\sum_i\left(Y_i - (\beta_0 + \beta_1X_i)\right)^2\right\}$
  Under the assumption of normality, the least squares estimator is also the mle, with
    $\hat{\sigma}^2_{mle} = \frac{1}{n}\sum_{i=1}^{n}\hat{\epsilon}_i^2$

A least-squares sketch follows.
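A minimal numpy sketch of the least-squares formulas above (synthetic data of my choosing, not from the original):

```python
# Simple linear regression by the closed-form least-squares estimates,
# with standard errors and R^2 as defined above.
import numpy as np

rng = np.random.default_rng(7)
n = 200
x = rng.uniform(0.0, 10.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.5, size=n)   # beta0=1, beta1=2

xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

resid = y - (b0 + b1 * x)
sigma2_hat = resid @ resid / (n - 2)              # unbiased estimate of sigma^2
s2_x = np.mean((x - xbar) ** 2)
se_b1 = np.sqrt(sigma2_hat) / (np.sqrt(s2_x) * np.sqrt(n))
se_b0 = se_b1 * np.sqrt(np.mean(x ** 2))

r2 = 1.0 - resid @ resid / np.sum((y - ybar) ** 2)
print(b0, b1, se_b0, se_b1, r2)
```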

18.2 Prediction

Observe $X = x_*$ of the covariate and predict the outcome $Y_*$.
  $\hat{Y}_* = \hat{\beta}_0 + \hat{\beta}_1x_*$
  $\mathbb{V}[\hat{Y}_*] = \mathbb{V}[\hat{\beta}_0] + x_*^2\,\mathbb{V}[\hat{\beta}_1] + 2x_*\,\mathrm{Cov}[\hat{\beta}_0, \hat{\beta}_1]$

Prediction interval
  $\hat{\xi}_n^2 = \hat{\sigma}^2\left(\frac{\sum_{i=1}^{n}(X_i - x_*)^2}{n\sum_i(X_i - \bar{X})^2} + 1\right)$
  $\hat{Y}_* \pm z_{\alpha/2}\,\hat{\xi}_n$

18.3 Multiple Regression

  $Y = X\beta + \epsilon$
where
  $X = \begin{pmatrix}X_{11} & \cdots & X_{1k}\\ \vdots & \ddots & \vdots\\ X_{n1} & \cdots & X_{nk}\end{pmatrix}$,  $\beta = \begin{pmatrix}\beta_1\\ \vdots\\ \beta_k\end{pmatrix}$,  $\epsilon = \begin{pmatrix}\epsilon_1\\ \vdots\\ \epsilon_n\end{pmatrix}$

Likelihood
  $\mathcal{L}(\beta, \sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\mathrm{rss}\right\}$
  $\mathrm{rss} = (y - X\beta)^T(y - X\beta) = \|Y - X\beta\|^2 = \sum_{i=1}^{N}(Y_i - x_i^T\beta)^2$

If the $(k\times k)$ matrix $X^TX$ is invertible,
  $\hat{\beta} = (X^TX)^{-1}X^TY$
  $\mathbb{V}[\hat{\beta} \mid X^n] = \sigma^2(X^TX)^{-1}$
  $\hat{\beta} \approx \mathcal{N}\left(\beta,\ \sigma^2(X^TX)^{-1}\right)$

Estimate regression function
  $\hat{r}(x) = \sum_{j=1}^{k}\hat{\beta}_jx_j$

Unbiased estimate for $\sigma^2$
  $\hat{\sigma}^2 = \frac{1}{n-k}\sum_{i=1}^{n}\hat{\epsilon}_i^2$,  $\hat{\epsilon} = X\hat{\beta} - Y$
  mle: $\hat{\sigma}^2_{mle} = \frac{n-k}{n}\,\hat{\sigma}^2$

$1-\alpha$ confidence interval
  $\hat{\beta}_j \pm z_{\alpha/2}\,\hat{\mathrm{se}}(\hat{\beta}_j)$

Hypothesis testing
  $H_0: \beta_j = 0$ vs. $H_1: \beta_j \ne 0$, $j \in J$

18.4 Model Selection

Consider predicting a new observation $Y^*$ for covariates $X^*$ and let $S \subseteq J$ denote a subset of the covariates in the model, where $|S| = k$ and $|J| = n$.

Issues
  Underfitting: too few covariates yields high bias
  Overfitting: too many covariates yields high variance

Procedure
  1. Assign a score to each model
  2. Search through all models to find the one with the highest score

Mean squared prediction error (mspe)
  $\mathrm{mspe} = \mathbb{E}\left[(\hat{Y}(S) - Y^*)^2\right]$

Prediction risk
  $R(S) = \sum_{i=1}^{n}\mathrm{mspe}_i = \sum_{i=1}^{n}\mathbb{E}\left[(\hat{Y}_i(S) - Y_i^*)^2\right]$

Training error
  $\hat{R}_{tr}(S) = \sum_{i=1}^{n}(\hat{Y}_i(S) - Y_i)^2$

$R^2$
  $R^2(S) = 1 - \frac{\mathrm{rss}(S)}{\mathrm{tss}} = 1 - \frac{\hat{R}_{tr}(S)}{\mathrm{tss}} = 1 - \frac{\sum_{i=1}^{n}(\hat{Y}_i(S) - Y_i)^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$

The training error is a downward-biased estimate of the prediction risk:
  $\mathbb{E}\left[\hat{R}_{tr}(S)\right] < R(S)$
  $\mathrm{bias}(\hat{R}_{tr}(S)) = \mathbb{E}\left[\hat{R}_{tr}(S)\right] - R(S) = -2\sum_{i=1}^{n}\mathrm{Cov}\left[\hat{Y}_i, Y_i\right]$

Adjusted $R^2$
  $\bar{R}^2(S) = 1 - \frac{n-1}{n-k}\,\frac{\mathrm{rss}}{\mathrm{tss}}$

Mallows' $C_p$ statistic
  $\hat{R}(S) = \hat{R}_{tr}(S) + 2k\hat{\sigma}^2$ = lack of fit + complexity penalty

Akaike Information Criterion (AIC)
  $\mathrm{AIC}(S) = \ell_n(\hat{\beta}_S, \hat{\sigma}_S^2) - k$

Bayesian Information Criterion (BIC)
  $\mathrm{BIC}(S) = \ell_n(\hat{\beta}_S, \hat{\sigma}_S^2) - \frac{k}{2}\log n$

Validation and training
  $\hat{R}_V(S) = \sum_{i=1}^{m}(\hat{Y}_i^*(S) - Y_i^*)^2$,  $m = |\{\text{validation data}\}|$, often $\frac{n}{4}$ or $\frac{n}{2}$

Leave-one-out cross-validation
  $\hat{R}_{CV}(S) = \sum_{i=1}^{n}(Y_i - \hat{Y}_{(i)})^2 = \sum_{i=1}^{n}\left(\frac{Y_i - \hat{Y}_i(S)}{1 - U_{ii}(S)}\right)^2$
  $U(S) = X_S(X_S^TX_S)^{-1}X_S^T$ ("hat matrix")

19 Non-parametric Function Estimation

19.1 Density Estimation

Estimate $f(x)$, where $f(x) = \mathbb{P}[X \in A] = \int_A f(x)\,dx$.

Integrated square error (ise)
  $L(f, \hat{f}_n) = \int\left(f(x) - \hat{f}_n(x)\right)^2dx = J(h) + \int f^2(x)\,dx$

Frequentist risk
  $R(f, \hat{f}_n) = \mathbb{E}\left[L(f, \hat{f}_n)\right] = \int b^2(x)\,dx + \int v(x)\,dx$
  $b(x) = \mathbb{E}\left[\hat{f}_n(x)\right] - f(x)$,  $v(x) = \mathbb{V}\left[\hat{f}_n(x)\right]$

19.1.1 Histograms

Definitions
  Number of bins $m$
  Binwidth $h = \frac{1}{m}$
  Bin $B_j$ has $\nu_j$ observations
  Define $\hat{p}_j = \nu_j/n$ and $p_j = \int_{B_j}f(u)\,du$

Histogram estimator
  $\hat{f}_n(x) = \sum_{j=1}^{m}\frac{\hat{p}_j}{h}\,I(x \in B_j)$
  $\mathbb{E}\left[\hat{f}_n(x)\right] = \frac{p_j}{h}$
  $\mathbb{V}\left[\hat{f}_n(x)\right] = \frac{p_j(1-p_j)}{nh^2}$
  $R(\hat{f}_n, f) \approx \frac{h^2}{12}\int(f'(u))^2\,du + \frac{1}{nh}$
  $h^* = \frac{1}{n^{1/3}}\left(\frac{6}{\int(f'(u))^2\,du}\right)^{1/3}$
  $R^*(\hat{f}_n, f) \approx \frac{C}{n^{2/3}}$,  $C = \left(\frac{3}{4}\right)^{2/3}\left(\int(f'(u))^2\,du\right)^{1/3}$

Cross-validation estimate of $\mathbb{E}[J(h)]$
  $\hat{J}_{CV}(h) = \int\hat{f}_n^2(x)\,dx - \frac{2}{n}\sum_{i=1}^{n}\hat{f}_{(-i)}(X_i) = \frac{2}{(n-1)h} - \frac{n+1}{(n-1)h}\sum_{j=1}^{m}\hat{p}_j^2$

19.1.2 Kernel Density Estimator (KDE)

Kernel $K$
  $K(x) \ge 0$
  $\int K(x)\,dx = 1$
  $\int xK(x)\,dx = 0$
  $\int x^2K(x)\,dx \equiv \sigma_K^2 > 0$

KDE
  $\hat{f}_n(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h}K\left(\frac{x - X_i}{h}\right)$
  $R(f, \hat{f}_n) \approx \frac{1}{4}(h\sigma_K)^4\int(f''(x))^2\,dx + \frac{1}{nh}\int K^2(x)\,dx$
  $h^* = \frac{c_1^{-2/5}c_2^{1/5}c_3^{-1/5}}{n^{1/5}}$,  $c_1 = \sigma_K^2$,  $c_2 = \int K^2(x)\,dx$,  $c_3 = \int(f''(x))^2\,dx$
  $R^*(f, \hat{f}_n) = \frac{c_4}{n^{4/5}}$,  $c_4 = \frac{5}{4}(\sigma_K^2)^{2/5}\left(\int K^2(x)\,dx\right)^{4/5}\left(\int(f'')^2\,dx\right)^{1/5}$

Epanechnikov kernel
  $K(x) = \begin{cases}\frac{3}{4\sqrt{5}}\left(1 - x^2/5\right) & |x| < \sqrt{5}\\ 0 & \text{otherwise}\end{cases}$

Cross-validation estimate of $\mathbb{E}[J(h)]$
  $\hat{J}_{CV}(h) = \int\hat{f}_n^2(x)\,dx - \frac{2}{n}\sum_{i=1}^{n}\hat{f}_{(-i)}(X_i) \approx \frac{1}{hn^2}\sum_{i=1}^{n}\sum_{j=1}^{n}K^*\left(\frac{X_i - X_j}{h}\right) + \frac{2}{nh}K(0)$
  $K^*(x) = K^{(2)}(x) - 2K(x)$,  $K^{(2)}(x) = \int K(x-y)K(y)\,dy$

A minimal KDE sketch follows.
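A minimal sketch of the Gaussian-kernel KDE above (numpy assumed; data and bandwidth rule are my choices, not from the original):

```python
# Gaussian-kernel density estimate evaluated on a grid:
# f_hat(x) = (1/n) * sum_i (1/h) * K((x - X_i)/h), K = standard normal pdf.
import numpy as np

rng = np.random.default_rng(8)
data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
n = data.size
h = 1.06 * data.std(ddof=1) * n ** (-1 / 5)   # normal-reference bandwidth

def gauss_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

grid = np.linspace(-6.0, 6.0, 200)
# broadcast: for each grid point, average the scaled kernels over the sample
f_hat = gauss_kernel((grid[:, None] - data[None, :]) / h).mean(axis=1) / h

print(np.trapz(f_hat, grid))   # ~1.0: the estimate integrates to one
```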

19.2 Non-parametric Regression

Estimate $r(x)$, where $r(x) = \mathbb{E}[Y \mid X = x]$. Consider pairs of points $(x_1, Y_1), \dots, (x_n, Y_n)$ related by
  $Y_i = r(x_i) + \epsilon_i$,  $\mathbb{E}[\epsilon_i] = 0$,  $\mathbb{V}[\epsilon_i] = \sigma^2$

k-nearest-neighbor estimator
  $\hat{r}(x) = \frac{1}{k}\sum_{i:\,x_i \in N_k(x)}Y_i$  where $N_k(x) = \{k$ values of $x_1, \dots, x_n$ closest to $x\}$

Nadaraya-Watson kernel estimator
  $\hat{r}(x) = \sum_{i=1}^{n}w_i(x)\,Y_i$,  $w_i(x) = \frac{K\left(\frac{x - x_i}{h}\right)}{\sum_{j=1}^{n}K\left(\frac{x - x_j}{h}\right)}$
  $R(\hat{r}_n, r) \approx \frac{h^4}{4}\left(\int x^2K(x)\,dx\right)^2\int\left(r''(x) + 2r'(x)\frac{f'(x)}{f(x)}\right)^2dx + \frac{\sigma^2\int K^2(x)\,dx}{nh}\int\frac{dx}{f(x)}$
  $h^* \approx \frac{c_1}{n^{1/5}}$,  $R^*(\hat{r}_n, r) \approx \frac{c_2}{n^{4/5}}$

Cross-validation estimate of $\mathbb{E}[J(h)]$
  $\hat{J}_{CV}(h) = \sum_{i=1}^{n}(Y_i - \hat{r}_{(-i)}(x_i))^2 = \sum_{i=1}^{n}\frac{(Y_i - \hat{r}(x_i))^2}{\left(1 - \frac{K(0)}{\sum_{j=1}^{n}K\left(\frac{x_i - x_j}{h}\right)}\right)^2}$

19.3 Smoothing Using Orthogonal Functions

Approximation
  $r(x) = \sum_{j=1}^{\infty}\beta_j\phi_j(x) \approx \sum_{j=1}^{J}\beta_j\phi_j(x)$

Multivariate regression form
  $Y = \Phi\beta + \eta$  where  $\eta_i = \epsilon_i$ and
  $\Phi = \begin{pmatrix}\phi_0(x_1) & \cdots & \phi_J(x_1)\\ \vdots & \ddots & \vdots\\ \phi_0(x_n) & \cdots & \phi_J(x_n)\end{pmatrix}$

Least squares estimator
  $\hat{\beta} = (\Phi^T\Phi)^{-1}\Phi^TY \approx \frac{1}{n}\Phi^TY$  (for equally spaced observations only)

Cross-validation estimate of $\mathbb{E}[J(h)]$
  $\hat{R}_{CV}(J) = \sum_{i=1}^{n}\left(Y_i - \sum_{j=1}^{J}\phi_j(x_i)\hat{\beta}_{j,(-i)}\right)^2$

A minimal Nadaraya-Watson sketch follows.
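A minimal sketch of the Nadaraya-Watson estimator with a Gaussian kernel (my synthetic data and hand-picked bandwidth, not from the original):

```python
# Nadaraya-Watson kernel regression: r_hat(x) = sum_i w_i(x) Y_i with
# Gaussian-kernel weights normalized over the sample.
import numpy as np

rng = np.random.default_rng(9)
x = np.sort(rng.uniform(0.0, 2.0 * np.pi, 300))
y = np.sin(x) + rng.normal(0.0, 0.3, size=x.size)
h = 0.3                                       # bandwidth (chosen by hand here)

def nw(x0):
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)    # unnormalized Gaussian weights
    return np.sum(w * y) / np.sum(w)

grid = np.linspace(0.5, 2.0 * np.pi - 0.5, 50)
r_hat = np.array([nw(x0) for x0 in grid])
print(np.max(np.abs(r_hat - np.sin(grid))))   # small fit error away from edges
```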

20 Stochastic Processes

Stochastic process
  $\{X_t : t \in T\}$,  $T = \begin{cases}\{0, 1, \dots\} & \text{discrete}\\ [0, \infty) & \text{continuous}\end{cases}$
  Notations: $X_t$, $X(t)$
  State space $\mathcal{X}$
  Index set $T$

20.1 Markov Chains

Markov chain
  $\mathbb{P}[X_n = x \mid X_0, \dots, X_{n-1}] = \mathbb{P}[X_n = x \mid X_{n-1}]$  for all $n \in T$, $x \in \mathcal{X}$

Transition probabilities
  $p_{ij} \equiv \mathbb{P}[X_{n+1} = j \mid X_n = i]$
  $p_{ij}(n) \equiv \mathbb{P}[X_{m+n} = j \mid X_m = i]$  ($n$-step)

Transition matrix $P$ ($n$-step: $P_n$)
  $(i, j)$ element is $p_{ij}$
  $p_{ij} \ge 0$ and $\sum_j p_{ij} = 1$

Chapman-Kolmogorov
  $p_{ij}(m + n) = \sum_k p_{ik}(m)\,p_{kj}(n)$
  $P_{m+n} = P_mP_n$,  $P_n = P\times\cdots\times P = P^n$

Marginal probability
  $\mu_n = (\mu_n(1), \dots, \mu_n(N))$  where  $\mu_n(i) = \mathbb{P}[X_n = i]$
  $\mu_0$: initial distribution
  $\mu_n = \mu_0P^n$

20.2 Poisson Processes

Poisson process
  $\{X_t : t \in [0, \infty)\}$ = number of events up to and including time $t$
  $X_0 = 0$
  Independent increments: $\forall t_0 < \cdots < t_n:\ X_{t_1} - X_{t_0} \perp \cdots \perp X_{t_n} - X_{t_{n-1}}$
  Intensity function $\lambda(t)$:
    $\mathbb{P}[X_{t+h} - X_t = 1] = \lambda(t)h + o(h)$
    $\mathbb{P}[X_{t+h} - X_t = 2] = o(h)$
  $X_{s+t} - X_s \sim \mathrm{Po}(m(s+t) - m(s))$  where  $m(t) = \int_0^t\lambda(s)\,ds$

Homogeneous Poisson process
  $\lambda(t) \equiv \lambda > 0 \implies X_t \sim \mathrm{Po}(\lambda t)$

Waiting times
  $W_t :=$ time at which $X_t$ occurs
  $W_t \sim \mathrm{Gamma}\left(t, \frac{1}{\lambda}\right)$

Interarrival times
  $S_t = W_{t+1} - W_t$
  $S_t \sim \mathrm{Exp}\left(\frac{1}{\lambda}\right)$

A short Markov-chain sketch follows.
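A small sketch (my toy two-state chain, not from the original) that checks the marginal recursion $\mu_n = \mu_0 P^n$ against the empirical distribution of simulated paths:

```python
# Two-state Markov chain: compute mu_n = mu_0 P^n and compare with the
# empirical state frequencies of simulated chains at step n.
import numpy as np

rng = np.random.default_rng(10)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])          # transition matrix, rows sum to 1
mu0 = np.array([1.0, 0.0])          # start in state 0
n = 20

mu_n = mu0 @ np.linalg.matrix_power(P, n)

def simulate(steps):
    state = 0
    for _ in range(steps):
        state = rng.choice(2, p=P[state])
    return state

finals = np.array([simulate(n) for _ in range(5_000)])
print(mu_n, np.bincount(finals, minlength=2) / finals.size)   # close agreement
```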

21 Time Series

Mean function
  $\mu_{x_t} = \mathbb{E}[x_t] = \int_{-\infty}^{\infty}x\,f_t(x)\,dx$

Autocovariance function
  $\gamma_x(s, t) = \mathbb{E}[(x_s - \mu_s)(x_t - \mu_t)] = \mathbb{E}[x_sx_t] - \mu_s\mu_t$
  $\gamma_x(t, t) = \mathbb{E}\left[(x_t - \mu_t)^2\right] = \mathbb{V}[x_t]$

Autocorrelation function (ACF)
  $\rho(s, t) = \frac{\mathrm{Cov}[x_s, x_t]}{\sqrt{\mathbb{V}[x_s]\,\mathbb{V}[x_t]}} = \frac{\gamma(s, t)}{\sqrt{\gamma(s, s)\gamma(t, t)}}$

Cross-covariance function (CCV)
  $\gamma_{xy}(s, t) = \mathbb{E}[(x_s - \mu_{x_s})(y_t - \mu_{y_t})]$

Cross-correlation function (CCF)
  $\rho_{xy}(s, t) = \frac{\gamma_{xy}(s, t)}{\sqrt{\gamma_x(s, s)\gamma_y(t, t)}}$

Backshift operator
  $B^k(x_t) = x_{t-k}$

Difference operator
  $\nabla^d = (1 - B)^d$

White noise
  $w_t \sim \mathrm{wn}(0, \sigma_w^2)$;  Gaussian: $w_t \stackrel{iid}{\sim} \mathcal{N}(0, \sigma_w^2)$
  $\mathbb{E}[w_t] = 0$,  $\mathbb{V}[w_t] = \sigma_w^2$,  $\gamma_w(s, t) = 0$ for $s \ne t$

Random walk
  Drift $\delta$:  $x_t = \delta t + \sum_{j=1}^{t}w_j$,  $\mathbb{E}[x_t] = \delta t$

Symmetric moving average
  $m_t = \sum_{j=-k}^{k}a_jx_{t-j}$  where $a_j = a_{-j} \ge 0$ and $\sum_{j=-k}^{k}a_j = 1$

Linear process
  $x_t = \mu + \sum_{j=-\infty}^{\infty}\psi_jw_{t-j}$  where  $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$
  $\gamma(h) = \sigma_w^2\sum_{j=-\infty}^{\infty}\psi_{j+h}\psi_j$

21.1 Stationary Time Series

Strictly stationary
  $\mathbb{P}[x_{t_1} \le c_1, \dots, x_{t_k} \le c_k] = \mathbb{P}[x_{t_1+h} \le c_1, \dots, x_{t_k+h} \le c_k]$  for all $k \in \mathbb{N}$, $t_k, c_k, h \in \mathbb{Z}$

Weakly stationary
  $\mathbb{E}[x_t^2] < \infty$ and $\mathbb{E}[x_t] = m$ for all $t \in \mathbb{Z}$
  $\gamma_x(s, t) = \gamma_x(s + r, t + r)$ for all $r, s, t \in \mathbb{Z}$

Autocovariance function (stationary case)
  $\gamma(h) = \mathbb{E}[(x_{t+h} - \mu)(x_t - \mu)]$,  $\gamma(0) = \mathbb{E}\left[(x_t - \mu)^2\right]$
  $\gamma(0) \ge 0$,  $\gamma(0) \ge |\gamma(h)|$,  $\gamma(h) = \gamma(-h)$ for all $h \in \mathbb{Z}$

Autocorrelation function (ACF)
  $\rho_x(h) = \frac{\gamma(t+h, t)}{\sqrt{\gamma(t+h, t+h)\gamma(t, t)}} = \frac{\mathrm{Cov}[x_{t+h}, x_t]}{\sqrt{\mathbb{V}[x_{t+h}]\,\mathbb{V}[x_t]}} = \frac{\gamma(h)}{\gamma(0)}$

Jointly stationary time series
  $\gamma_{xy}(h) = \mathbb{E}[(x_{t+h} - \mu_x)(y_t - \mu_y)]$,  $\rho_{xy}(h) = \frac{\gamma_{xy}(h)}{\sqrt{\gamma_x(0)\gamma_y(0)}}$

21.2 Estimation of Correlation

Sample mean
  $\bar{x} = \frac{1}{n}\sum_{t=1}^{n}x_t$

Sample variance
  $\mathbb{V}[\bar{x}] = \frac{1}{n}\sum_{h=-n}^{n}\left(1 - \frac{|h|}{n}\right)\gamma_x(h)$

Sample autocovariance function
  $\hat{\gamma}(h) = \frac{1}{n}\sum_{t=1}^{n-h}(x_{t+h} - \bar{x})(x_t - \bar{x})$

Sample autocorrelation function
  $\hat{\rho}(h) = \frac{\hat{\gamma}(h)}{\hat{\gamma}(0)}$

Sample cross-covariance function
  $\hat{\gamma}_{xy}(h) = \frac{1}{n}\sum_{t=1}^{n-h}(x_{t+h} - \bar{x})(y_t - \bar{y})$

Sample cross-correlation function
  $\hat{\rho}_{xy}(h) = \frac{\hat{\gamma}_{xy}(h)}{\sqrt{\hat{\gamma}_x(0)\hat{\gamma}_y(0)}}$

Properties
  $\sigma_{\hat{\rho}_x(h)} = \frac{1}{\sqrt{n}}$ if $x_t$ is white noise
  $\sigma_{\hat{\rho}_{xy}(h)} = \frac{1}{\sqrt{n}}$ if $x_t$ or $y_t$ is white noise

21.3 Non-Stationary Time Series

Classical decomposition model
  $x_t = \mu_t + s_t + w_t$
  $\mu_t$ = trend,  $s_t$ = seasonal component,  $w_t$ = random noise term

21.3.1 Detrending

Least squares
  1. Choose a trend model, e.g., $\mu_t = \beta_0 + \beta_1t + \beta_2t^2$
  2. Minimize rss to obtain the trend estimate $\hat{\mu}_t = \hat{\beta}_0 + \hat{\beta}_1t + \hat{\beta}_2t^2$
  3. Residuals ≙ noise $w_t$

Moving average
  The low-pass filter $v_t$ is a symmetric moving average $m_t$ with $a_j = \frac{1}{2k+1}$:
  $v_t = \frac{1}{2k+1}\sum_{i=-k}^{k}x_{t-i}$
  If $\frac{1}{2k+1}\sum_{i=-k}^{k}w_{t-j} \approx 0$, a linear trend function $\mu_t = \beta_0 + \beta_1t$ passes without distortion.

Differencing
  $\mu_t = \beta_0 + \beta_1t \implies \nabla x_t$ is stationary

21.4 ARIMA models

Autoregressive polynomial
  $\phi(z) = 1 - \phi_1z - \cdots - \phi_pz^p$,  $z \in \mathbb{C}$ and $\phi_p \ne 0$

Autoregressive operator
  $\phi(B) = 1 - \phi_1B - \cdots - \phi_pB^p$

Autoregressive model of order $p$, AR($p$)
  $x_t = \phi_1x_{t-1} + \cdots + \phi_px_{t-p} + w_t \iff \phi(B)x_t = w_t$

AR(1)
  $x_t = \phi^k(x_{t-k}) + \sum_{j=0}^{k-1}\phi^j(w_{t-j}) \stackrel{k\to\infty,\ |\phi|<1}{=} \sum_{j=0}^{\infty}\phi^jw_{t-j}$ (linear process)
  $\mathbb{E}[x_t] = \sum_{j=0}^{\infty}\phi^j\,\mathbb{E}[w_{t-j}] = 0$
  $\gamma(h) = \mathrm{Cov}[x_{t+h}, x_t] = \frac{\sigma_w^2\phi^h}{1-\phi^2}$
  $\rho(h) = \frac{\gamma(h)}{\gamma(0)} = \phi^h$,  $\rho(h) = \phi\,\rho(h-1)$,  $h = 1, 2, \dots$

Moving average polynomial
  $\theta(z) = 1 + \theta_1z + \cdots + \theta_qz^q$,  $z \in \mathbb{C}$ and $\theta_q \ne 0$

Moving average operator
  $\theta(B) = 1 + \theta_1B + \cdots + \theta_qB^q$

MA($q$) (moving average model of order $q$)
  $x_t = w_t + \theta_1w_{t-1} + \cdots + \theta_qw_{t-q} \iff x_t = \theta(B)w_t$
  $\mathbb{E}[x_t] = \sum_{j=0}^{q}\theta_j\,\mathbb{E}[w_{t-j}] = 0$
  $\gamma(h) = \mathrm{Cov}[x_{t+h}, x_t] = \begin{cases}\sigma_w^2\sum_{j=0}^{q-h}\theta_j\theta_{j+h} & 0 \le h \le q\\ 0 & h > q\end{cases}$

MA(1)
  $x_t = w_t + \theta w_{t-1}$
  $\gamma(h) = \begin{cases}(1+\theta^2)\sigma_w^2 & h = 0\\ \theta\sigma_w^2 & h = 1\\ 0 & h > 1\end{cases}$,  $\rho(h) = \begin{cases}\frac{\theta}{1+\theta^2} & h = 1\\ 0 & h > 1\end{cases}$

ARMA($p, q$)
  $x_t = \phi_1x_{t-1} + \cdots + \phi_px_{t-p} + w_t + \theta_1w_{t-1} + \cdots + \theta_qw_{t-q}$
  $\phi(B)x_t = \theta(B)w_t$

Partial autocorrelation function (PACF)
  $x_i^{h-1}$: regression of $x_i$ on $\{x_{h-1}, x_{h-2}, \dots, x_1\}$
  $\phi_{hh} = \mathrm{corr}(x_h - x_h^{h-1},\ x_0 - x_0^{h-1})$,  $h \ge 2$
  E.g., $\phi_{11} = \mathrm{corr}(x_1, x_0) = \rho(1)$

ARIMA($p, d, q$)
  $\nabla^dx_t = (1-B)^dx_t$ is ARMA($p, q$)
  $\phi(B)(1-B)^dx_t = \theta(B)w_t$

Exponentially weighted moving average (EWMA)
  $x_t = x_{t-1} + w_t - \lambda w_{t-1}$
  $x_t = \sum_{j=1}^{\infty}(1-\lambda)\lambda^{j-1}x_{t-j} + w_t$  when $|\lambda| < 1$
  $\tilde{x}_{n+1} = (1-\lambda)x_n + \lambda\tilde{x}_n$

Seasonal ARIMA
  Denoted by ARIMA$(p, d, q)\times(P, D, Q)_s$
  $\Phi_P(B^s)\phi(B)\nabla_s^D\nabla^dx_t = \delta + \Theta_Q(B^s)\theta(B)w_t$

21.4.1 Causality and Invertibility

ARMA($p, q$) is causal (future-independent) $\iff \exists\{\psi_j\}:\ \sum_{j=0}^{\infty}|\psi_j| < \infty$ such that
  $x_t = \sum_{j=0}^{\infty}\psi_jw_{t-j} = \psi(B)w_t$

ARMA($p, q$) is invertible $\iff \exists\{\pi_j\}:\ \sum_{j=0}^{\infty}|\pi_j| < \infty$ such that
  $\pi(B)x_t = \sum_{j=0}^{\infty}\pi_jx_{t-j} = w_t$

Properties
  ARMA($p, q$) causal $\iff$ roots of $\phi(z)$ lie outside the unit circle:
    $\psi(z) = \sum_{j=0}^{\infty}\psi_jz^j = \frac{\theta(z)}{\phi(z)}$,  $|z| \le 1$
  ARMA($p, q$) invertible $\iff$ roots of $\theta(z)$ lie outside the unit circle:
    $\pi(z) = \sum_{j=0}^{\infty}\pi_jz^j = \frac{\phi(z)}{\theta(z)}$,  $|z| \le 1$

Behavior of the ACF and PACF for causal and invertible ARMA models

          AR(p)                  MA(q)                  ARMA(p, q)
  ACF     tails off              cuts off after lag q   tails off
  PACF    cuts off after lag p   tails off              tails off

21.5 Spectral Analysis

Periodic process
  $x_t = A\cos(2\pi\omega t + \varphi) = U_1\cos(2\pi\omega t) + U_2\sin(2\pi\omega t)$
  Frequency index $\omega$ (cycles per unit time), period $1/\omega$
  Amplitude $A$, phase $\varphi$
  $U_1 = A\cos\varphi$ and $U_2 = A\sin\varphi$, often normally distributed rvs

Periodic mixture
  $x_t = \sum_{k=1}^{q}\left(U_{k1}\cos(2\pi\omega_kt) + U_{k2}\sin(2\pi\omega_kt)\right)$
  $U_{k1}, U_{k2}$, for $k = 1, \dots, q$, are independent zero-mean rvs with variances $\sigma_k^2$
  $\gamma(h) = \sum_{k=1}^{q}\sigma_k^2\cos(2\pi\omega_kh)$,  $\gamma(0) = \mathbb{E}[x_t^2] = \sum_{k=1}^{q}\sigma_k^2$

Spectral representation of a periodic process
  $\gamma(h) = \sigma^2\cos(2\pi\omega_0h) = \frac{\sigma^2}{2}e^{-2\pi i\omega_0h} + \frac{\sigma^2}{2}e^{2\pi i\omega_0h} = \int_{-1/2}^{1/2}e^{2\pi i\omega h}\,dF(\omega)$

Spectral distribution function
  $F(\omega) = \begin{cases}0 & \omega < -\omega_0\\ \sigma^2/2 & -\omega_0 \le \omega < \omega_0\\ \sigma^2 & \omega \ge \omega_0\end{cases}$
  $F(-\infty) = F(-1/2) = 0$,  $F(\infty) = F(1/2) = \gamma(0)$

Spectral density
  $f(\omega) = \sum_{h=-\infty}^{\infty}\gamma(h)e^{-2\pi i\omega h}$,  $-\frac{1}{2} \le \omega \le \frac{1}{2}$
  Needs $\sum_{h=-\infty}^{\infty}|\gamma(h)| < \infty \implies \gamma(h) = \int_{-1/2}^{1/2}e^{2\pi i\omega h}f(\omega)\,d\omega$,  $h = 0, \pm 1, \dots$
  $f(\omega) \ge 0$,  $f(\omega) = f(-\omega)$,  $f(\omega) = f(1-\omega)$
  $\gamma(0) = \mathbb{V}[x_t] = \int_{-1/2}^{1/2}f(\omega)\,d\omega$
  White noise: $f_w(\omega) = \sigma_w^2$
  ARMA($p, q$), $\phi(B)x_t = \theta(B)w_t$:
    $f_x(\omega) = \sigma_w^2\frac{|\theta(e^{-2\pi i\omega})|^2}{|\phi(e^{-2\pi i\omega})|^2}$
    where $\phi(z) = 1 - \sum_{k=1}^{p}\phi_kz^k$ and $\theta(z) = 1 + \sum_{k=1}^{q}\theta_kz^k$

Discrete Fourier Transform (DFT)
  $d(\omega_j) = n^{-1/2}\sum_{t=1}^{n}x_te^{-2\pi i\omega_jt}$

Fourier/fundamental frequencies
  $\omega_j = j/n$

Inverse DFT
  $x_t = n^{-1/2}\sum_{j=0}^{n-1}d(\omega_j)e^{2\pi i\omega_jt}$

Periodogram
  $I(j/n) = |d(j/n)|^2$

Scaled periodogram
  $P(j/n) = \frac{4}{n}I(j/n) = \left(\frac{2}{n}\sum_{t=1}^{n}x_t\cos(2\pi tj/n)\right)^2 + \left(\frac{2}{n}\sum_{t=1}^{n}x_t\sin(2\pi tj/n)\right)^2$

A short AR(1) simulation with its sample ACF follows.
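A minimal sketch (my parameters, not from the original) that simulates an AR(1) process and compares the sample ACF with the theoretical $\rho(h) = \phi^h$:

```python
# Simulate x_t = phi * x_{t-1} + w_t and estimate the ACF as
# rho_hat(h) = gamma_hat(h) / gamma_hat(0).
import numpy as np

rng = np.random.default_rng(11)
phi, n = 0.7, 5_000
w = rng.normal(0.0, 1.0, size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + w[t]

def acf(series, h):
    xbar = series.mean()
    num = np.sum((series[h:] - xbar) * (series[:n - h] - xbar)) / n
    den = np.sum((series - xbar) ** 2) / n
    return num / den

for h in range(1, 5):
    print(h, acf(x, h), phi ** h)      # sample vs. theoretical rho(h)
```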

22 Math

22.1 Gamma Function

  Ordinary: $\Gamma(s) = \int_0^{\infty}t^{s-1}e^{-t}\,dt$
  Upper incomplete: $\Gamma(s, x) = \int_x^{\infty}t^{s-1}e^{-t}\,dt$
  Lower incomplete: $\gamma(s, x) = \int_0^xt^{s-1}e^{-t}\,dt$
  $\Gamma(\alpha + 1) = \alpha\,\Gamma(\alpha)$
  $\Gamma(n) = (n-1)!$  ($n \in \mathbb{N}$)
  $\Gamma(1/2) = \sqrt{\pi}$

22.2 Beta Function

  Ordinary: $B(x, y) = B(y, x) = \int_0^1t^{x-1}(1-t)^{y-1}\,dt = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}$
  Incomplete: $B(x; a, b) = \int_0^xt^{a-1}(1-t)^{b-1}\,dt$
  Regularized incomplete:
    $I_x(a, b) = \frac{B(x; a, b)}{B(a, b)} \stackrel{a,b\in\mathbb{N}}{=} \sum_{j=a}^{a+b-1}\frac{(a+b-1)!}{j!\,(a+b-1-j)!}\,x^j(1-x)^{a+b-1-j}$
  $I_0(a, b) = 0$,  $I_1(a, b) = 1$
  $I_x(a, b) = 1 - I_{1-x}(b, a)$

22.3 Series

Finite
  $\sum_{k=1}^{n}k = \frac{n(n+1)}{2}$
  $\sum_{k=1}^{n}(2k-1) = n^2$
  $\sum_{k=1}^{n}k^2 = \frac{n(n+1)(2n+1)}{6}$
  $\sum_{k=1}^{n}k^3 = \left(\frac{n(n+1)}{2}\right)^2$
  $\sum_{k=0}^{n}c^k = \frac{c^{n+1}-1}{c-1}$  ($c \ne 1$)

Binomial
  $\sum_{k=0}^{n}\binom{n}{k} = 2^n$
  $\sum_{k=0}^{n}\binom{r+k}{k} = \binom{r+n+1}{n}$
  $\sum_{k=0}^{n}\binom{k}{m} = \binom{n+1}{m+1}$
  Binomial theorem: $\sum_{k=0}^{n}\binom{n}{k}a^{n-k}b^k = (a+b)^n$
  Vandermonde's identity: $\sum_{k=0}^{r}\binom{m}{k}\binom{n}{r-k} = \binom{m+n}{r}$

Infinite
  $\sum_{k=0}^{\infty}p^k = \frac{1}{1-p}$,  $\sum_{k=1}^{\infty}p^k = \frac{p}{1-p}$  ($|p| < 1$)
  $\sum_{k=0}^{\infty}kp^{k-1} = \frac{d}{dp}\left(\frac{1}{1-p}\right) = \frac{1}{(1-p)^2}$  ($|p| < 1$)
  $\sum_{k=0}^{\infty}\binom{r+k-1}{k}x^k = (1-x)^{-r}$  ($r \in \mathbb{N}^+$)
  $\sum_{k=0}^{\infty}\binom{\alpha}{k}p^k = (1+p)^{\alpha}$  ($|p| < 1,\ \alpha \in \mathbb{C}$)

22.4 Combinatorics

Sampling ($k$ out of $n$)

              w/o replacement                                          w/ replacement
  ordered     $n^{\underline{k}} = \prod_{i=0}^{k-1}(n-i) = \frac{n!}{(n-k)!}$        $n^k$
  unordered   $\binom{n}{k} = \frac{n^{\underline{k}}}{k!} = \frac{n!}{k!(n-k)!}$      $\binom{n-1+r}{r} = \binom{n-1+r}{n-1}$

Stirling numbers, 2nd kind
  $\left\{{n \atop k}\right\} = k\left\{{n-1 \atop k}\right\} + \left\{{n-1 \atop k-1}\right\}$  ($1 \le k \le n$)
  $\left\{{n \atop 0}\right\} = \begin{cases}1 & n = 0\\ 0 & \text{else}\end{cases}$

Partitions
  $P_{n+k,k} = \sum_{i=1}^{k}P_{n,i}$  ($n \ge 1$: $P_{n,0} = 0$, $P_{0,0} = 1$; $k > n$: $P_{n,k} = 0$)

Balls and urns ($|B| = n$, $|U| = m$, $f: B \to U$; D = distinguishable, ¬D = indistinguishable)

                 f arbitrary                                f injective                   f surjective                       f bijective
  B: D,  U: D    $m^n$                                      $m^{\underline{n}}$ if $m \ge n$, else 0   $m!\left\{{n \atop m}\right\}$       $n!$ if $m = n$, else 0
  B: ¬D, U: D    $\binom{n+m-1}{n}$                         $\binom{m}{n}$                $\binom{n-1}{m-1}$                 1 if $m = n$, else 0
  B: D,  U: ¬D   $\sum_{k=1}^{m}\left\{{n \atop k}\right\}$   1 if $m \ge n$, else 0        $\left\{{n \atop m}\right\}$         1 if $m = n$, else 0
  B: ¬D, U: ¬D   $\sum_{k=1}^{m}P_{n,k}$                    1 if $m \ge n$, else 0        $P_{n,m}$                          1 if $m = n$, else 0

References

[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole, 1972.

[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45-53, 2008.

[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications With R Examples. Springer, 2006.

[4] A. Steger. Diskrete Strukturen - Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.

[5] A. Steger. Diskrete Strukturen - Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.

[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.

[Figure: Univariate distribution relationships, courtesy of Leemis and McQueston [2].]
