
Bayesian Deep Neural Networks

Summary

Sungjoon Choi
February 9, 2018
Seoul National University
Table of contents

1. Measure theory

2. Probability

3. Random variable

4. Random process

5. Functional analysis

6. Gaussian process

7. Summary of Variational Inference

8. Stein Variational Gradient Descent


1
Measure theory
Measure theory

• set function: a function assigning a number to a set (examples: cardinality, length, area).
• σ-field B: a collection of subsets of U such that (axioms)
1. ∅ ∈ B (the empty set is included.)
2. B ∈ B ⇒ B^c ∈ B (closed under complement.)
3. B_i ∈ B ⇒ ∪_{i=1}^∞ B_i ∈ B (closed under countable union.)

2
Measure theory

• properties of a σ-field B
1. U ∈ B (the entire set is included.)
2. B_i ∈ B ⇒ ∩_{i=1}^∞ B_i ∈ B (closed under countable intersection.)
3. 2^U is a σ-field.
4. B is either finite or uncountable, never denumerable.
5. If B and C are σ-fields, then B ∩ C is a σ-field, but B ∪ C need not be.
• B = {∅, {a}, {b, c}, {a, b, c}}
• C = {∅, {a, b}, {c}, {a, b, c}}
• B ∩ C = {∅, {a, b, c}}
(this is a σ-field)
• B ∪ C = {∅, {a}, {c}, {a, b}, {b, c}, {a, b, c}}
(this is not a σ-field, as {a, c} = {a} ∪ {c} is not included; see the check below.)

• σ(C), the smallest σ-field containing C, is called the σ-field generated by C.
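The example above can be sanity-checked directly; the sketch below is my own (not from the slides) and relies on the fact that, on a finite universe, closure under complement and pairwise union already verifies the axioms.

```python
# A minimal sketch (not from the slides): checking the example above on the
# finite universe U = {a, b, c}.  On a finite universe, closure under complement
# and pairwise union is enough to verify the sigma-field axioms.
U = frozenset("abc")

def is_sigma_field(collection):
    sets = {frozenset(s) for s in collection}
    if frozenset() not in sets:                             # axiom 1: empty set
        return False
    if any(U - s not in sets for s in sets):                # axiom 2: complements
        return False
    return all(s | t in sets for s in sets for t in sets)   # axiom 3 (finite case)

B = [set(), {"a"}, {"b", "c"}, {"a", "b", "c"}]
C = [set(), {"a", "b"}, {"c"}, {"a", "b", "c"}]
B_cap_C = [s for s in B if s in C]                          # the collection B ∩ C
B_cup_C = B + [s for s in C if s not in B]                  # the collection B ∪ C

print(is_sigma_field(B), is_sigma_field(C))   # True True
print(is_sigma_field(B_cap_C))                # True
print(is_sigma_field(B_cup_C))                # False: {a} ∪ {c} = {a, c} is missing
```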

3
Probability
Probability

• The random experiment should be well defined.


• The outcomes are all possible results of the random experiment, each of which cannot be further divided.
• The sample point ω: a point representing an outcome.
• The sample space Ω: the set of all sample points.

4
Probability

• Definition (probability)
• P defined on a measurable space (Ω, A) is a set function P ∶ A → [0, 1] such that (probability axioms)
1. P(∅) = 0
2. P(A) ≥ 0 for all A ∈ A
3. For pairwise disjoint sets A_1, A_2, ⋯ ∈ A, P(∪_{i=1}^∞ A_i) = ∑_{i=1}^∞ P(A_i) (countable additivity)
4. P(Ω) = 1

5
Random variable
Random variable

• random variable:
A random variable is a real-valued function defined on Ω that is
measurable w.r.t. the probability space (Ω, A, P) and the Borel
measurable space (R, B), i.e.,

X ∶ Ω → R such that ∀B ∈ B, X^{-1}(B) ∈ A.

• What is random here?


• What is the result of carrying out the random experiment?

6
Random variable

• discrete random variable: There is a discrete set {x_i ∶ i = 1, 2, ⋯} such that ∑_i P(X = x_i) = 1.
• probability mass function: p_X(x) ≜ P(X = x), which satisfies
1. 0 ≤ p_X(x) ≤ 1
2. ∑_x p_X(x) = 1
3. P(X ∈ B) = ∑_{x∈B} p_X(x)

7
Random variable

• continuous random variable:
There is an integrable function f_X(x) such that P(X ∈ B) = ∫_B f_X(x) dx.
• probability density function:
f_X(x) ≜ lim_{Δx→0} P(x < X ≤ x + Δx)/Δx, which satisfies
1. f_X(x) > 1 is possible.
2. ∫_{−∞}^{∞} f_X(x) dx = 1
3. P(X ∈ B) = ∫_{x∈B} f_X(x) dx
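As a quick numerical illustration (my own sketch, not from the slides; the standard normal density and the interval B = [-1, 1] are assumptions), the properties above can be checked by simple numerical integration:

```python
import numpy as np

# A minimal sketch (not from the slides): the standard normal density integrates
# to 1, and P(X in B) is the integral of the density over B.
def f(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)   # standard normal pdf

x, dx = np.linspace(-10.0, 10.0, 200001, retstep=True)
print(np.sum(f(x)) * dx)                                 # ~1.0   (property 2)

xb, dxb = np.linspace(-1.0, 1.0, 20001, retstep=True)    # B = [-1, 1]
print(np.sum(f(xb)) * dxb)                               # ~0.683 (P(X in B), property 3)
```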

8
Random process
Random process

• random process X_t(ω), t ∈ I:
1. random sequence, random function, or random signal:
X_t ∶ Ω → the set of all sequences or functions
2. indexed family of (possibly infinitely many) random variables:
X_t ∶ I → the set of all random variables defined on Ω
3. X_t ∶ Ω × I → R
4. If t is fixed, then a random process becomes a random variable.

9
Random process

• A random process X_t is completely characterized if the following is known:
• P((X_{t_1}, ⋯, X_{t_k}) ∈ B) for any B, k, and t_1, ⋯, t_k
• Note that, given a random process, only 'finite-dimensional' probabilities or probability functions can be specified.

10
Functional analysis
Functional analysis

• Definition (Hilbert space)


• An inner product space that contains the limits of all of its Cauchy sequences.
⇒ Complete space
⇒ Always possible to fill all the holes.
⇒ R is complete, Q is not complete.

11
Functional analysis

• Definition (Kernel)
• Let X be a non-empty set. A function k ∶ X × X → R is a kernel if
there exists a Hilbert space H and a map φ ∶ X → H such that
∀x, x ′ ∈ X ,
k(x, x ′ ) ≜ ⟨φ(x), φ(x ′ )⟩H .

• Note that there is almost no condition on X .
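As a concrete illustration (my own example, not from the slides), the homogeneous polynomial kernel on R² admits an explicit feature map φ into H = R³, so it satisfies the definition above:

```python
import numpy as np

# A minimal sketch (not from the slides): k(x, x') = (x^T x')^2 on R^2 equals the
# inner product of the explicit feature map phi(x) = (x1^2, sqrt(2) x1 x2, x2^2),
# i.e., it is a kernel in the sense of the definition above with H = R^3.
def k(x, xp):
    return float(np.dot(x, xp)) ** 2

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(k(x, xp))                          # 1.0
print(float(np.dot(phi(x), phi(xp))))    # 1.0, matches k(x, x')
```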

12
Functional analysis

• Theorem (Mercer)
• Let (X, µ) be a finite measure space and k ∈ L_∞(X², µ²) be a kernel such that the associated integral operator T_k ∶ L_2(X, µ) → L_2(X, µ) is positive definite.
• Let φ_i ∈ L_2(X, µ) be the normalized eigenfunctions of T_k associated with the eigenvalues λ_i > 0. Then:
1. The eigenvalues {λ_i}_{i=1}^∞ are absolutely summable.
2. k(x, x′) = ∑_{i=1}^∞ λ_i φ_i(x) φ_i(x′) holds µ²-almost everywhere, where the series converges absolutely and uniformly µ²-almost everywhere.

• Absolute summability is more important than it seems.
• Note: Mercer's theorem can be interpreted as an infinite-dimensional SVD.

13
Functional analysis

• Definition (reproducing kernel Hilbert space)


• Let H be a Hilbert space of R-valued functions on X . A function
k ∶ X × X → R is a reproducing kernel on H, and H is a reproducing
kernel Hilbert space if
1. ∀x ∈ X
k(⋅, x) ∈ H
2. ∀x ∈ X , ∀f ∈ H
⟨f (⋅), k(⋅, x)⟩H = f (x) (reproducing property)
3. ∀x, x ′ ∈ X
k(x, x ′ ) = ⟨k(⋅, x), k(⋅, x ′ )⟩H

• What does this indicate?

14
Functional analysis

• Suppose we have a RKHS H, f (⋅) ∈ H, and k(⋅, x) ∈ H.


• Then the reproducing property indicates that evaluation of f (⋅) at x,
i.e., f (x) is the inner-product of k(⋅, x) and f (⋅) itself, i.e.,

f (x) = ⟨f , k(⋅, x)⟩H .

• Recall Mercer's theorem: k(x, x′) = ∑_{i=1}^∞ λ_i φ_i(x) φ_i(x′). Then,

f(x) = ⟨f, ∑_{i=1}^∞ λ_i φ_i(⋅) φ_i(x)⟩_H
     = ∑_{i=1}^∞ λ_i ⟨f, φ_i(⋅)⟩_H φ_i(x)
     = ∑_{i=1}^∞ λ̄_i φ_i(x),

where λ̄_i = λ_i ⟨f, φ_i(⋅)⟩_H.

15
Gaussian process
Gaussian process

• Gaussian process: A random process X(t) is a Gaussian process if for all k ∈ N and all t_1, . . . , t_k, the random vector formed by X(t_1), . . . , X(t_k) is jointly Gaussian.
• The joint density is completely specified by
• Mean: m(t) = E(X(t)), where m(⋅) is known as the mean function.
• Covariance: k(t, s) = cov(X(t), X(s)), where k(⋅, ⋅) is known as the covariance function.
• Notation: X(t) ∼ GP(m(t), k(t, s))

16
Function space view

Let f(x) be a (zero-mean) Gaussian process. Then f(X) = [f(x_1), ⋯, f(x_n)]^T ∈ R^n and f(x_*) ∈ R are jointly Gaussian, i.e.,

⎡ f(x_1) ⎤        ⎛    ⎡ k(x_1, x_1)  ⋯  k(x_1, x_n)  k(x_1, x_*) ⎤ ⎞
⎢   ⋮    ⎥        ⎜    ⎢      ⋮       ⋱       ⋮            ⋮      ⎥ ⎟
⎢ f(x_n) ⎥  ∼  N  ⎜ 0, ⎢ k(x_n, x_1)  ⋯  k(x_n, x_n)  k(x_n, x_*) ⎥ ⎟ .
⎣ f(x_*) ⎦        ⎝    ⎣ k(x_*, x_1)  ⋯  k(x_*, x_n)  k(x_*, x_*) ⎦ ⎠
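To make the function-space view concrete, here is a minimal sketch (my own, not the authors' code; the squared-exponential kernel and the input grid are assumptions) that draws sample paths from a zero-mean GP prior using exactly the joint Gaussian above:

```python
import numpy as np

# A minimal sketch (not from the slides): sample paths f ~ N(0, K) from a
# zero-mean GP prior with a squared-exponential kernel on a grid of inputs.
def se_kernel(xa, xb, ell=1.0, sf=1.0):
    d = xa[:, None] - xb[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

rng = np.random.default_rng(0)
xs = np.linspace(-5.0, 5.0, 100)
K = se_kernel(xs, xs) + 1e-8 * np.eye(len(xs))    # jitter for numerical stability
L = np.linalg.cholesky(K)
samples = L @ rng.standard_normal((len(xs), 3))   # three independent draws from N(0, K)
print(samples.shape)                              # (100, 3)
```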

17
Function space view

• We rewrite the joint Gaussian distribution as

⎡ f   ⎤        ⎛    ⎡ K(X, X)    K(X, x_*)  ⎤ ⎞
⎢     ⎥  ∼  N  ⎜ 0, ⎢                        ⎥ ⎟ .
⎣ f_* ⎦        ⎝    ⎣ K(x_*, X)  k(x_*, x_*) ⎦ ⎠

• Recall that for a jointly Gaussian random vector [x; y], the conditional distribution p(x∣y) is also Gaussian, with mean E(x∣y) and covariance matrix Σ_{x∣y}, where

E(x∣y) = E(x) + Σ_xy Σ_yy^{-1} (y − E(y))

Σ_{x∣y} = Σ_xx − Σ_xy Σ_yy^{-1} Σ_yx.
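The conditioning identities translate directly into code; the sketch below is mine (not from the slides), and the block sizes and the toy joint covariance are assumptions for illustration only:

```python
import numpy as np

# A minimal sketch (not from the slides) of the Gaussian conditioning identities:
# given a joint N(mu, Sigma) over (x, y) and an observed y, return the mean and
# covariance of x | y.  `dx` is the dimension of the x-block (an assumption here).
def gaussian_condition(mu, Sigma, y_obs, dx):
    mx, my = mu[:dx], mu[dx:]
    Sxx, Sxy = Sigma[:dx, :dx], Sigma[:dx, dx:]
    Syx, Syy = Sigma[dx:, :dx], Sigma[dx:, dx:]
    mean = mx + Sxy @ np.linalg.solve(Syy, y_obs - my)   # E(x|y)
    cov = Sxx - Sxy @ np.linalg.solve(Syy, Syx)          # Sigma_{x|y}
    return mean, cov

mu = np.zeros(3)
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
m, S = gaussian_condition(mu, Sigma, y_obs=np.array([0.4, -1.0]), dx=1)
print(m, S)   # conditional mean and covariance of x given y
```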

18
Function space view

By conditioning, we get

f_* ∣ x_*, X, f ∼ N(µ_*, σ_*²)

where
µ_* = K(x_*, X) K(X, X)^{-1} f
and
σ_*² = k(x_*, x_*) − K(x_*, X) K(X, X)^{-1} K(X, x_*).

19
Function space view

• In the previous case, measurement noise is not included.
• Let y(x) = f(x) + ε, where ε ∼ N(0, σ_n²). Then the covariance between two outputs becomes:

cov(y(x_1), y(x_2)) = k(x_1, x_2) + σ_n² δ_{12},

where δ_{12} = 1 if x_1 = x_2 and 0 otherwise.
• Consequently, the joint distribution of y and f_* becomes:

⎡ y   ⎤        ⎛    ⎡ K(X, X) + σ_n² I   K(X, x_*)  ⎤ ⎞
⎢     ⎥  ∼  N  ⎜ 0, ⎢                                ⎥ ⎟ .
⎣ f_* ⎦        ⎝    ⎣ K(x_*, X)          k(x_*, x_*) ⎦ ⎠

20
Function space view

By conditioning, we get

f_* ∣ x_*, X, y ∼ N(µ_*, σ_*²)

where
µ_* = K(x_*, X) (K(X, X) + σ_n² I)^{-1} y
and
σ_*² = k(x_*, x_*) − K(x_*, X) (K(X, X) + σ_n² I)^{-1} K(X, x_*).
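Putting the noisy-case equations together, a minimal GP regression sketch (my own, not the authors' code; the squared-exponential kernel and the toy sine data are assumptions) looks as follows:

```python
import numpy as np

# A minimal sketch (not from the slides) of the noisy GP regression equations:
# posterior mean and variance at test inputs Xs given training data (X, y).
def se_kernel(xa, xb, ell=1.0, sf=1.0):
    d = xa[:, None] - xb[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(X, y, Xs, sn2=0.01):
    K = se_kernel(X, X) + sn2 * np.eye(len(X))          # K(X, X) + sigma_n^2 I
    Ks = se_kernel(X, Xs)                               # K(X, x_*)
    Kss = se_kernel(Xs, Xs)                             # k(x_*, x_*)
    mu = Ks.T @ np.linalg.solve(K, y)                   # posterior mean
    var = np.diag(Kss - Ks.T @ np.linalg.solve(K, Ks))  # posterior variance
    return mu, var

rng = np.random.default_rng(1)
X = np.linspace(-3.0, 3.0, 20)
y = np.sin(X) + 0.1 * rng.standard_normal(20)
mu, var = gp_posterior(X, y, np.linspace(-4.0, 4.0, 50))
print(mu.shape, var.shape)   # (50,) (50,)
```

The linear solves with K(X, X) + σ_n² I are what make this O(n³) in the number of training points, as noted on the next slide.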

21
Comments on Gaussian process regression

• Pros: principled, probabilistic, predictive uncertainty


• Cons: computationally intensive (O(n³), where n is the number of data points)

22
Summary of Variational Inference
Summary of Variational Inference

• Log marginal likelihood = ELBO (variational free energy) + D_KL(q(w∣θ) ∣∣ p(w∣D))
• Note that p(w∣D) can rarely be computed analytically.

23
Summary of Variational Inference

• Derivation:

ln p(D) = ∫ q(w∣θ) ln p(D) dw
        = ∫ q(w∣θ) ln [ p(D) p(w∣D) / p(w∣D) ] dw
        = ∫ q(w∣θ) ln [ p(w, D) / p(w∣D) ] dw
        = ∫ q(w∣θ) ln [ p(D∣w) p(w) / p(w∣D) ] dw
        = ∫ q(w∣θ) ln [ q(w∣θ) p(D∣w) p(w) / ( q(w∣θ) p(w∣D) ) ] dw
        = ∫ q(w∣θ) ln [ q(w∣θ) / p(w∣D) ] dw + ∫ q(w∣θ) ln [ p(D∣w) p(w) / q(w∣θ) ] dw
        = D_KL(q(w∣θ) ∣∣ p(w∣D)) + F[q],

where F[q] is the variational free energy (ELBO).
24
Summary of Variational Inference

• The variational free energy (ELBO) can be further expressed as:
• Derivation:

F[q] = ∫ q(w∣θ) ln [ p(D∣w) p(w) / q(w∣θ) ] dw
     = ∫ q(w∣θ) ln p(D∣w) dw + ∫ q(w∣θ) ln [ p(w) / q(w∣θ) ] dw
     = E_{q(w∣θ)}[ln p(D∣w)] − D_KL(q(w∣θ) ∣∣ p(w))
     = expected log-likelihood under q − prior-fitting term
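As a small illustration (my own sketch, not from the slides; the 1-D Gaussian prior, likelihood, and variational family are all assumptions), F[q] can be estimated by Monte Carlo as the expected log-likelihood under q minus a closed-form KL to the prior:

```python
import numpy as np

# A minimal sketch (not from the slides): Monte Carlo estimate of
# F[q] = E_q[ln p(D|w)] - KL(q(w) || p(w)) for a 1-D Gaussian toy model.
rng = np.random.default_rng(0)
D = rng.normal(loc=1.0, scale=1.0, size=50)    # observed toy data (assumed)

mu_q, s_q = 0.8, 0.3                           # variational q(w) = N(mu_q, s_q^2)
mu_p, s_p = 0.0, 1.0                           # prior       p(w) = N(mu_p, s_p^2)

def log_lik(w):                                # ln p(D|w) with unit observation noise
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (D[None, :] - w[:, None]) ** 2, axis=1)

w = rng.normal(mu_q, s_q, size=10_000)         # w ~ q(w|theta)
expected_ll = np.mean(log_lik(w))              # E_q[ln p(D|w)]
kl = np.log(s_p / s_q) + (s_q**2 + (mu_q - mu_p) ** 2) / (2 * s_p**2) - 0.5  # KL(q||p)
print(expected_ll - kl)                        # estimate of F[q] (the ELBO)
```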

• We try to maximize F[q] to
1. maximize (a lower bound on) the marginal likelihood p(D),
2. reduce the gap between p(w∣D) and q(w∣θ), and
3. keep the variational distribution q(w∣θ) close to the prior p(w).

25
Stein Variational Gradient Descent
Stein Variational Gradient Descent

• Stein Variational Gradient Descent
• To implement the iterative procedure, an (unnormalized) posterior p(x) is given.
• Then, we draw a set of particles {x_i^0}_{i=1}^n from the initial distribution q_0.
• Each particle x_i is updated as follows:

x_i^{l+1} ← x_i^l + ε_l φ̂*(x_i^l)

where

φ̂*(x) = (1/n) ∑_{j=1}^n [ k(x_j^l, x) ∇_{x_j^l} log p(x_j^l) + ∇_{x_j^l} k(x_j^l, x) ].

26
Stein Variational Gradient Descent

• The update rule has nice interpretations:

φ̂*(x) = (1/n) ∑_{j=1}^n [ k(x_j^l, x) ∇_{x_j^l} log p(x_j^l) + ∇_{x_j^l} k(x_j^l, x) ].

1. k(x_j^l, x): similarity between the current particle to update, x, and the j-th particle x_j^l.
2. ∇_{x_j^l} log p(x_j^l): particle update direction for the current particle x, computed from the j-th particle x_j^l.
• Note that since only the score function ∇_x log p(x) is used, an unnormalized p(x) can be used!
• This is also used in policy gradient methods such as REINFORCE.
3. ∇_{x_j^l} k(x_j^l, x): this term is the gradient of k(⋅, ⋅) with respect to the j-th particle. Since the value of a kernel function usually increases as the distance between its two inputs decreases, applying this gradient to the update of x pushes x away from x_j^l, i.e., it acts as a repulsive force that keeps the particles from collapsing onto a single mode.
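A minimal SVGD sketch follows (my own, not the authors' code; the 1-D Gaussian target, the RBF kernel, and the fixed bandwidth are assumptions — the original method uses a median heuristic for the bandwidth):

```python
import numpy as np

# A minimal sketch (not from the slides): SVGD particles approximating an
# unnormalized 1-D Gaussian target N(2, 1) using the update rule above.
def grad_log_p(x):                 # score of the target; the normalizer is not needed
    return -(x - 2.0)

def rbf(x, h=0.5):
    d = x[:, None] - x[None, :]    # d[j, i] = x_j - x_i
    K = np.exp(-0.5 * (d / h) ** 2)
    dK = -d / h**2 * K             # gradient of k(x_j, x_i) w.r.t. x_j
    return K, dK

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100)   # particles drawn from the initial q_0
eps = 0.05                           # step size epsilon_l (kept constant here)
for _ in range(500):
    K, dK = rbf(x)
    phi = (K @ grad_log_p(x) + dK.sum(axis=0)) / len(x)   # the update direction above
    x = x + eps * phi

print(x.mean(), x.std())             # roughly 2.0 and 1.0
```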

27
Questions?

27
Must read

Gaussian Processes for Machine Learning [1]


References

C. E. Rasmussen and C. K. I. Williams.
Gaussian Processes for Machine Learning.
MIT Press, Cambridge, MA, 2006.
