MLSS2018 Madrid ProbThinking

Planting the Seeds
of Probabilistic
ThinkingAlgorithms
Foundations | Tricks |
Shakir Mohamed
Research Scientist, DeepMind
@shakir_za
Planting the Seeds of Probabilistic
Thinking: Foundations, Tricks and
Probabilistic Algorithms
machine learning approaches task of describing of data, to
complex systems or our world using the language and tools of probability. Almost all
of machine learning can be viewed in probabilistic terms, making probabilistic thinking
fundamental. It is, of course, not the only view. But it is through this view that we can
connect what we do in machine learning to every other computational science,
whether that be in stochastic optimisation, control theory, operations research,
econometrics, information theory, statistical physics or bio-statistics. For this reason
alone, mastery of probabilistic thinking is essential.
The aim of this tutorial is to develop flexible and broad tools that will support
your probabilistic thinking. Part 1, Foundations looks at the philosophy of machine
learning, builds an understanding of the model-inference-algorithm paradigm, and
the explores fundamental areas of machine learning - we’ll look at deep
learning, kernels and reinforcement learning. Part 2 Tricks, will look at 6 individual
probabilistic problems and a tricks to solve them, using these tricks to develop
flexibility in our thinking. Part 3: Algorithms will look at how the foundations and
tricks combine to develop machine learning algorithms, with a specific focus on the
area of deep generative models.
Shakir Mohamed
Principles to
Products
Applications
Assistive
Technology
Advancing
Science
Climate and
Energy
Healthcare Fairness and
Safety
Autonomous
systems
Planning Explanation Rapid Learning World Objects and

Reasoning Simulation Relations
Uncertainty Information Gain Causality Prediction

Information
Probability Bayesian Hypothesis Estimation Asymptotics

Principles Theory Analysis Testing Theory
Shakir Mohamed
Planting the Seeds
of Probabilistic
Thinking
Part I:
Foundations
Learning Objectives
1. Language to think about the
Philosophy of Machine
Learning
2. Understand the
Model-Inference-Algorithm paradigm
3. Use probabilistic thinking applied to

problems in supervised, unsupervised,
and reinforcement learning.
Probability
Some Definitions for
probability
Logical Probability
Statistical Degree of confirmation of
Probability a hypothesis based on
Frequency ratio of items logical analysis
Probability as
Propensity Subjective
Probability used Probability
for predictions Probability a
s a degree of belief
Probability is sufficient for the task of

reasoning under uncertainty
Probability
Probability as a Degree of
Belief
Probability is a measure of the belie

f in a proposition given evidence.
A description of a state of knowledge.
No such thing as Inherently Different observers

the probability subjective in that with different
of an event, since the it depends on the informatio
value depends on the believer’s n will have different
evidence used. information beliefs.
Probabilistic
Quantities
Probability
Condition
s
Bayes Rule
Parameterisatio
n Expectation
Gradient
Statistical
Operations
Estimation Hypothesis
and Inference Testing
Learning
Summarisation Comparison
Data Experimental
Modelling Enumeration Design
Probabilistic Models
Model: Description of the A probabilistic model writes
world, of data, of potential out these models using the
scenarios, of processes. language of probability
Peak Bad
prob(traffic Jam)
Accident
hour Weathe
r prob(sirens | Accident)
prob(peak hour | Traffic Jam)

Traffi
Sirens
c
Jam Most models in machine
learning are
probabilistic.
Probabilistic models let you learn

probability distributions of data. You can choose what to learn: Just
the mean. Or the entire
distribution.
Centrality of Inference
Inference
Artificial General Intelligence

Summarisation Comparison will be the refined instantiation
of these statistical operations.
Data
Enumeration
The core questions of

AGI will be those of
probabilistic
inference
Linear Regression
Generalised Linear E[y]
Regression
g()
⌘ = w> x + b
p(y|x) = p(y|g(⌘); ✓) ⌘=
• The basic function can be any linear Bx
function, e.g., affine, convolution. Target Regression Link Inv link Activation
Real Linear Identity Identity
µ
•
Binary Logistic Logit log Sigmoid Sigmoid
g(.) is an inverse link function that we’ll
1- µ
1
1+ exp( - ⌘)
Binary Probit Inv Gauss Gauss CDF Probit
refer to as an activation function. Binary Gumbel

CDF <D- 1 (µ)
Compl.
<D(⌘)
Gumbel
- e- x
CDF
e
log-log
lo g(-
Binary Logistic lo g(µ)) Hyperbolic Tanh
Tangent
Optimise the negative log-likelihood Categorical Multinomial

tanh(⌘)
Multin. Softmax
P ⌘i
Logit
L = - log p(y|g(⌘); ✓) Counts

Counts
Non-neg.
Poisson
Poisson
Gamma
lo g(µ)
p
(µ)
Reciprocal 1
µ exp(
j
⌘j
Sparse Tobit max

⌫) max(0; ReLU
2
Ordered Ordinal ⌫
Cum. ⌫)
Logit
1
cr(cpk - ⌘)
⌫
Deep Networks
Recursive Generalised Linear
Regression E[y]
g()
⌘l = Bx l
• Recursively compose the basic linear functions. …
• Gives a deep neural network.
E[y] = hL o . . . o hl o h0 (x) g()
A general, flexible framework for building ⌘1 = Bx 1

non-linear, parametric models
Likelihood
Probabilistic Likelihood
Model function
Likelihood of parameters
Efficient Estimators
Tests with Good
• Statistically efficient (Cramer-Rao lower Power
Prescribed Likelihoods
bound)
• Asymptotically unbiased, consistent • Likelihood ratio tests
• Maximum entropy (principle of indifference) • Can construct small
confidence regions
Pool Information
Widely-applicable
• Combine different data sources
• Handle data that is incompletely • Knowledge outside the data can
observed, distorted, samples with be used, like constraints on domain
bias or prior probabilities.
• Can offset or correct these issues.
Misspecification: Inefficient estimates; or confidence intervals/tests can fail
completely.
Estimation Theory
E[y]
Probabilistic
Model
g()
Likelihood
function ⌘l = Bx l
Maximum Likelihood …
Optimisation
Objective g()
• Straightforward and natural way to learn

parameters
• Can be biased in finite sample size, e.g., Gaussian ⌘1 = Bx 1
variances with N and N-1.
• Easy to observe overfitting of parameters.
Estimation Theory
Probabilistic
Model
Likelihood
function
Maximum a Posteriori • Generalises the MLE (uniform prior)

(MAP) • Shrinkage: shrink parameters back
to initial beliefs.
Optimisation
Objective
• Not every regulariser corresponds a
valid probability distribution.
Regularisation
✦ Regularisation is essential to overcome the limitations

of maximum likelihood estimation.
✦ Other names: Regularisation, penalised regression,
shrinkage.
A wide range of available regularisation techniques:

• Large data sets
• Input noise/jittering and data
augmentation/expansion.
• L2 /L1 regularisation (Weight decay, Gaussian prior)
• Binary or Gaussian Dropout
MAP Estimation
Type of Solution
What is maximum is not
necessarily typical
Uncertainty Parameterisation sensitive

Can be reported using confidence Location of max will change
intervals or bootstrap estimates. depending on parameterisation
Invariant MAP
Popular
Example
Change of Bernoulli
variables
Uniform
Mode of the prior
Parameterisation Parameterisation
1 2
Transform
New prior
MAP Est.
Clear sensitivity: Sensitive to units, affects interpretability,

affects gradients, learning stability, design of models.
Invariant MAP
Use a modified probabilistic model that removes
sensitivity
Invariant MAP
• Use the Fisher information

• Connection to the natural gradients and trust-region
optimisation.
• Uninformative priors.
Proposed solutions have not fully dealt with the underlying issues.
Bayesian Analysis
Issues arise as a consequence
• Reasoning only about the most likely solution, and
of:
• Not maintaining knowledge of the underlying variability (and averaging
over this).
Motivates learning more than the mean.

This is the core of a Bayesian
philosophy.
Pragmatic Bayesian Approach for

Probabilistic Reasoning in Deep
Networks. (and all of machine learning)
Bayesian reasoning over some, but not

all parts of our models (yet).
Bayesian Analysis
Interested in reasoning about two important
quantities
Evidenc
Posterio
• In Bayesian analysis, things that are not.

observed must be integrated over - averaged
out.
• This makes computation difficult.
• Integration is the central operation.
Learning and
Inference Machine learning makes a distinction bet ween
Statistics inference and learning:
, no distinction between
learning and inference - • Inference: reason about (and compute)
only inference (or unknown probability distributions.
estimation). • (Parameter) Learning is finding point
estimates of quantities in the model.
Bayesian
statistics, all
quantities are Software Decision making
probability distributions, so engineering, inference and AI, refer to learning
there is only the problem is the forward evaluation of in general as the means of
of inference. a understanding and acting
trained model (to get based on past
predictions). experience
(data).
Two Streams of ML
Deep Learning Bayesian
Reasoning
- Mainly conjugate and linear models

+ Rich non-linear models for
classification and sequence - Potentially intractable inference,
prediction. computationally expensive or long
+ Scalable learning using stochastic
simulation time.
approximation and conceptually
simple. + Unified framework for model building,
+ Easily composable with other
inference, prediction and decision making
gradient- based methods
+ Explicit accounting for uncertaint
- Only point
y and variability of outcomes
estimates
- Hard to score models, do selectio
+ Robust to overfitting; tools for model
n and complexity penalisation.
selection and composition.
Natural to consider the marriage of these approaches: Bayesian Deep

Learning
Bayesian Regression
Probabilistic models over
functions
Prio
r
Observation
• Ways ofmodel Posterior

learning distributions over
functions and maintaining uncertainty
over functions.
• Difficult in parametric models (like deep y
networks) because of high-dimensional
parameter space.
• Many ways to learn the posterior
distribution. Focus of Part III
Density Estimation
Learn probability distributions over the data
itself
μ Σ • Can learn distributions of some things
and point estimates of others.
• Deep Generative Models and
z Unsupervised learning - more in Part
III
W
y
n = 1, …, N
Factor Analysis /
PCA
z ⇠ N (z|µ, ⌃)
y ⇠ N (y|W z, y
0"2 I )
Decision-making
Probabilistic models of environments and
actions
Prior over
actions
Interaction only
Reward/Utility
External Environment
Observation/
Action Sensation Setup is commo
Environment n in experimental
design, causal learning,
reinforcement learning.
Decision-maker
Probabilistic Dualities
Basis Function Regression
Move from primal variables to dual variables
Kernel trick and

methods
Probability distributions over functions
Gaussian
processes
Deep and Hierarchical
Hierarchical Model: models where the (prior) probability distributions c
an be decomposed into a sequence of conditional distributions
Foundations
How will you approach your ML research and
practice?
In general: For the ML Core:
Human-centred, Probabilistic and pragmatic in
interdisciplinary approach approach
Sociologica
l
Psychological Architecture-Loss
Componential
Physiological
Sun’s Phenomenological
Levels
Model-Inference-Algorithm
Architecture-Loss
O: So# max
P²: Plus
T² :Time B²:
s Weight
W ²: Weight S¹:
Sigmoid
P¹:
Plus
T¹:Time B¹:Weigh
s t
W ¹: X :Input
Weight
1. Computational 2. Error
Graphs propagation
3.
Algorithms
1. 2. Learning
Models Principles
Shakir Mohamed
Models
Directed and Undirected Fully-
observed
Parametric, Non-parametric Latent

And semi-parametric Variable
μ, Σ
z1 z2 … zD
Model y1 y2 … yD
s n = 1, …, N
Shakir Mohamed
Learning Principles
Statistical
Inference
Direct Indirect
Laplace Maximum Two Sample Method of

approximatio Likelihood Comparison Moments
n
Ma Variational Approx Bayesian Transpo! ation
ximum a Inference Computation methods
posteriori
Integr. Nested Max Mean
Cavity Methods
Laplace Discrepency
Approx
Expectation Markov chain
Maximisatio Monte Carlo
n
Noise Sequential
Contrastiv Monte
Learning
e Carlo
Principle
Shakir Mohamed
s
Algorithms
A given model and learning principle can be implemented in many ways.
Convolutional neural network Implicit Generative Model

+ penalised maximum + Two-sample testing
likelihood
● Optimisation methods ● Unsupervised-as-supervised learning
(SGD, Adagrad) ● Approximate Bayesian Computation
● Regularisation (L (ABC)
1, L2, batchnorm, ● Generative adversarial network
dropout) (GAN)
Latent variable model Restricted Boltzmann Machine
+ variational inference + maximum likelihood
● VEM algorithm ● Contrastive
● Expectation ● Divergence Persistent
● propagation ● CD
● Approximate message passing ● Parallel
Variational auto-encoders Tempering
(VAE) Natural gradients
Shakir Mohamed
Final Words
Subjective Probability Deep Learning, Estimation theory,
Probability as a degree of belief hierarchical models, dualities
Probabilistic Likelihood
Probabilistic descriptions Model function
of systems and data
E[y]
Peak Bad
Accident
hour Weather
g()
Traffi
c
Sirens ⌘l =
Jam
Bx l
…
g()
⌘ 1 = Bx1
Shakir Mohamed
Planting the Seeds
of Probabilistic
ThinkingAlgorithms
Shakir Mohamed
@shakir_za
Last Time …
3.
Algorithms
Convolutional neural network
+ penalised maximum likelihood
● Optimisation methods
(SGD, Adagrad)
● Regularisation (L
1, L2, batchnorm,
dropout)
Latent variable model
+ variational inference
1. ● VEM algorithm 2. Learning

● Expectation
Models ● propagation Principles
● Approximate message passing
Variational auto-encoders
Shakir Mohamed (VAE)
Planting the Seeds
of Probabilistic
Thinking
Part II:
Tricks
Learning Objectives
1. Develop tools to
manipulate distributions by
studyin g 6 probability
questions.
2. Build connections between

concepts in machine learnin
g and those in other computational
sciences.
Shakir Mohamed
Inferential Questions
Probabilistic dexterity is needed to solve the fundamental
problems of machine learning and artificial intelligence.
Evidence Moment Parameter

Estimatio Computatio Estimation
nZ n Z
p(x) = p(x, E[f (z)|x] = f (z)p(z|
z)dz x)dz
Predictio Plannin
n g
Hypothesis Experimental
Testing Design
B = log p(x|H 1 ) - log p(x|
HMohamed
Shakir 2)
Identity Trick
Transform an expectation w.r.t. distribution
p, into an expectation w.r.t. distribution q.
Do this by introducing a probabilistic

one
Shakir Mohamed
Identity Trick
Z
Integral problem p(x) = p(x|
Z z)p(z)dz
q(z)
Probabilistic p(x) = p(x|z)p(z)
one q(z)
Z dz
p(z)
p(z) q(z) f Re-group/re-weight p(x) = p(x|z) q(z)d
(z)
q(z)
z
Conditions
• q(z)>0, when p(x|
z)p(z) ≠ 0.
• q(z) is known/easy
toMohamed
Shakir handle.
Importance Sampling
1 X
Monte Carlo p(x) = w ( s) p(x|z( s) )
Estimator S s
( s) p(z) z( s) ⇠
w = q(z)
q(z)
p(z) q(z) f
(z)
Identity Trick Elsewhere
• Manipulate stochastic
gradients
z • Derive probability bounds
• RL for policy corrections
Shakir Mohamed
Hutchinson’s Trick
Compute the trace of a matrix:
• KL bet ween two Gaussians.
• Gradient of a log-determinant.
Trace problem
Zero mean unit var
Identity
Trick Linear
operations Trace
property
Shakir Mohamed
Probability Flow
Tricks
Distribution and sample
Transformatio
n Unconscious
Statistician
Makes entropy
Change of
computatio
Variables n Compute
and backpropagation
easy.
Begin with a diagonal
Gaussian and improve by
change of variables.
Triangular Jacobians allow

for computational Linear time computation of the determinant and its gradient.
efficiency.
Shakir Mohamed
Normalising Flows
Shakir Mohamed
Stochastic
Optimisation
Common gradient problem
Z • Don’t know this expectation
in general.
r ¢ Eq¢ ( z ) [f ✓ (z)] = r q¢ f✓ dz • Gradient is of the

parameters
(z) (z) of the distribution w.r.t.

which
the expectation is taken.
1. Pathwise estimator: Differentiate the function f(z)

2. Score-function estimator: Differentiate the density q(z|
x)
Typical problem areas

• Sensitivity analysis
• Generative models and inference
• Reinforcement learning and control
• Operations research and inventory control
• Monte Carlo simulation
• Finance and asset pricing
Shakir Mohamed
Reparameterisation
Trickss aDistributions can be expressed a
transformations of other
distributions.
z⇠ z⇠
p(z) qcp(z) ✏ ⇠
z = g(✏,
μ ¢) p(✏)
r
✓
x = µ+
R Rz
Samplers, one-liners and change-of-
variables
d✏
p(z) = p(✏) =) |p(z)dz| = |
dz
p(✏)d✏|
Shakir Mohamed
Pathwise Estimator
(Non-rigorous) Z
Derivation
r ¢ Eq( z ) [f (z)] = r ¢ q¢ (z)f Known transformation
(z)dz
z ⇠ p(z)
Change of variables
μ
r
= r ¢ Ep( ✏) [f (g(¢, ✏)] = Ep( ✏) [r ¢ f Inv fn Thm
✓
(g(¢, ✏)]
x= µ+
Rz
= Ep( ✏) [r f ✓ (g(✏,
R
r ¢ Eq¢ ( z) [f ✓ (z)] ¢
¢ ))]
Other names When to use
• Unconscious statistician • Function f is differentiable
• Stochastic backpropagation • Density q is known with a suitable transform o
• Perturbation analysis f a simpler base distribution: inverse CDF, location-
• Reparameterisation trick scale transform, or other co-ordinate transform.
• Affine-independent inference • Easy to sample from base distribution.
Shakir Mohamed
Log-derivative Trick
Score function is the derivative of a log-likelihood
function.
q¢ (z)
r
r ¢ log q¢ (z) = q¢
¢
(z)
Several useful
properties
Expected Eq( z ) [r ¢ log q¢ (z)] = ↞Show
score 0 this
Fisher
Information
Shakir Mohamed
Score-function
Estimator
Z
r E [f (z)] = r q
¢ f
q¢ ( z ) ✓ ¢ ✓ Leibnitz integral
(z) (z)dz rule
Z
q¢
= r ¢ q¢ (z)f Identity
(z)
q¢
Z (z)dz
(z)
= q¢ (z)r ¢ log q¢ (z)f Log-deriv
(z)dz
= Eq¢ ( z ) [f (z)r ¢ log q¢ Gradient
(z)]
Contro
r ¢ E q¢ ( z) [f ✓ (z)] = E q¢ ( z) [(f (z) - c)r ¢ log q¢ l
Variate
(z)]
Other names When to use
• Likelihood ratio method • Function is not differentiable, not
• REINFORCE and policy gradients analytical.
• Automated & Black-box • Distribution q is easy to sample from.
inference • Density q is known and differentiable.
Shakir Mohamed
Bounding Tricks
An important result from convex analysis lets us
move expectations through a function:
For concave functions f(.) f(x)

f (E[x]) 2: E[f (x)]
Logarithms are strictly concave allowing us to use Jensen’s

inequality.
log Z p(x)g(x)dx 2 p(x) log Z
g(x)dx
Other Bounding Tricks
Bounding Trick Elsewhere
• Fenchel duality
Optimisation; Variational Inference; • Holder’s inequality
Rao- Blackwell Theorem; • Monge-Kantorovich
Inequality
Shakir Mohamed
Evidence Bounds Z
Integral problem p(x) = p(x|
z)p(z)dz
Z
q(z)
Proposa p(x) = p(x|z)p(z)
l q(z)
Z dz
p(z)
Importance Weight p(x) = p(x|z)
q(z)
Z q(z)dz ✓
p(z)
Jensen’s log p(x) q(z) p(x|z) ◆
Z Z
inequality 2: log q( z)
log p(x)g(x)dx 2 p(x) log
Z Z
g(x)dx dz q(z)
= q(z) log p(x|z) - q(z) log
p(z)
Lower Eq( z ) [log p(x|z)] -

bound
K L [q(z)kp(z)]
Shakir Mohamed
Density Ratio Trick
The ratio of two densities can be computed using a classifier of
using samples drawn from the two distributions.
p⇤(x p(y = 1|x)

=
) p(y = - 1|
q(x) x)
Density Ratio Trick Elsewhere

• Generative Adversarial Net works (GANs)
• Noise contrastive estimation, Classifier-
ABC
• Two-sample testing
• Covariate-shift, calibration
Shakir Mohamed
Density Ratio
Estimation
Assign { y1 , . . . , yN } = { +1, . . . , +1, - 1, . . . , -
labels 1}⇤
Equivalence p (x) = p(x|y = q(x) = p(x|y = -
1) 1)
p⇤(x p(y|
Density )q(x) Bayes’ p(x|y) =
Ratio x)p(x)
Rule
p(y)
p⇤(x p(x|y = 1)
Conditional =
)q(x) p(x|y = -
1 ) /
p(y = + 1|x)
Bayes’ = p(y = + 1|x)p(x) p(y = - 1|

p(y =
x)p(x) p(y = -
Subst.
+ 1) 1) p(y = - 1|x)
p⇤(x = p(y = 1|
Class )q(x) x)
p(y = - 1|
probability x)
Computing a density ratio is equivalent to class probability
estimation.
Shakir Mohamed
Final Words
Strengthen your probabilistic
dexterity.
Identity
Hutchinson’
s Flows
q¢ (z)
Log-derivative r
r
log q¢ (z) = q¢
¢
¢
(z)
Reparameterisatio z = g(✏, ✏ ⇠
n ¢) p(✏)
p⇤(x = p(y = 1|x)
Density )q(x) p(y = - 1|
Ratio x)
Shakir Mohamed
Planting the Seeds
of Probabilistic
ThinkingAlgorithms
Shakir Mohamed
@shakir_za
Last Time …
Subjective
Probability
Probability a
s a degree of belief
E[y]
Model-Inference-Algorithm g()
⌘l =
Bx l
…
g()
⌘ 1 = Bx1
Shakir Mohamed
Manipulating
Integrals
Evidence Moment Parameter
Estimation Computation Estimation
Z Z
p(x) = p(x, E[f (z)|x] = f (z)p(z|
z)dz x)dz
Prediction Planning
Hypothesis Testing Experimental Design
B = log p(x|H 1 ) - log p(x|

H 2)
Shakir Mohamed
Your Tricks
Strengthen your probabilistic
dexterity.
Identity
Hutchinson’
s Flows
q¢ (z)
Log-derivative r
r
log q¢ (z) = q¢
¢
¢
(z)
Reparameterisatio z = g(✏, ✏ ⇠
n ¢) p(✏)
p⇤(x = p(y = 1|x)
Density )q(x) p(y = - 1|
Ratio x)
Shakir Mohamed
Planting the Seeds
of Probabilistic
Thinking
Part III:
Algorithms
Learning Objectives
1. Have knowledge of different
types of probabilistic models
for unsupervised learning.
2. Understand generative algorithms

(pixelCNN, VAEs, GANs) within the
model-inference-algorithm
framework.
3. Build awareness of the breadth of

applications of generative models.
Beyond
Classification
Move beyond
associating inputs
Understand and simulate
how the world evolves
to outputs
Recognise objects in the Detect surprising

world and their factors events in the world
of variation
Establish concepts as Anticipat e

useful for reasonin and generate rich
g and decision making plans for the future
Generative Models
A model that allows us to
learn a simulator of data
Models that allow for

(conditional) density
estimation
Approaches for
unsupervised learning of
data
Characteristics are:
- Probabilistic models of data that
allow for uncertainty to be
- captured. High-dimensional data.
-Data distribution is targeted.
Applications
Super-resolution, Planning,
Compression, Product AI
Exploration
Text-to-speech s Intrinsic
motivation
Science Model-based
RL
Proteomics,
Drug
Discovery,
Astronomy,
High-energy
physics
Assistive
Technologies
Fully-observed conditional generative

model
Compression-Communication
Original
JPEG
JPEG-2000
VAE1
Compression rate:
VAE2 0.2bits/dimension
Generative Design
Video from work of Memo
Aktem
Advancing Science
Advancing Healthcare
Types of Generative
Models
Fully-observed Undirected
models Models
Sum-Product
z Networks
Latent
variable f(z)
models
x
Types of Generative
Models
Design Dimensions
Data: binary, real-valued, nominal, strings,
images.
Dependency: independent, sequential,
temporal, spatial.
Representation: continuous or discrete
Dimension: parametric or non-parametric
Computational
complexity Modelling
capacity
Bias, uncertainty,
calibration Interpretability
Fully-observed
Models xi
Fully-observed Model Parameters are
global variables.
models
xk f(x)
Stochastic activations
Model observed data directly & unobserved
xj random variables are
without introducing any new local variables.
unobserved local variables.
Markov Models
x 1 ⇠ Cat(x 1 |⇡) xt xt+1 xt+2 xt+3 …

x 2 ⇠ Cat(x 2 |⇡(x 1 ))
x i ⇠ Cat(x i |⇡(x < n ))

Hartebeest Pixel
Y CNN
p(x) = p(x i |f (x < i ;
✓)) White Whale
All conditionali probabilities
described by deep
networks.
Properties
+ Can directly encode how observed points are

+ related.
+ Any data type can be used
For directed graphical
+ Parameter learningmodels:
simple: Log-likelihood is directly computable,
no approximation needed.
+ Easy to scale-up to large models, many optimisation tools
- available.
- ForOrder sensitive.
undirected models,
- Parameter learning difficult: Need to compute normalising
constants.
- Generation can be slow: iterate through elements sequentially,
or using a Markov chain.
Directed
NADE, EoNADE
Normal Means
Fully-visible sigmoid
Continuous
belief networks
Markov Models
Pixel CNN/RNN
N-AR(p)
RNN Language mod.
RNADE
Context tree switching
Discrete Continuous
xt xt+3 …
xt+1
Boltzmann Machines
Discrete Markov
x t+2
Gaussian MRFs
Random Fields Ising, Log-linear models
Hopfield
and Potts Models
Undirected
Latent Variable
Models
z z Latent variable
z
f(z) models
f(z) Introduce an unobserved f(z)
x local random variables that
represents hidden causes.
x
x
Prescribed models Implicit models
Use observer likelihood Likelihood-free or
s and assume observation simulation-based
noise. models.
Diggle and Gratton (1984); Mohamed and Lakshminarayanan (2016)
Deep Latent Gaussian Model Prescribed Models
z3
z3 ⇠ N (0, I )
z2 |z3 ⇠ N (µ(z3 ),
z2 ⌃(z3 ))
z1 |z2 ⇠ N (µ(z2 ),
z3 ⌃(z2 ))
x|z1 ⇠ N (µ(z1 ),
Convolution ⌃(z1 ))
al DRAW
x
Properties
+ Easy sampling.
+ Easy way to include hierarchy and depth.
+ Easy to encode structure believed to generate the data
+ Avoids order dependency assumptions: marginalisation of latent
variables induces dependencies.
+ Latents provide compression and representation the data.
+ Scoring, model comparison and selection possible using the
marginalised likelihood.
- Inversion process to determine latents corresponding to a input
is difficult in general
- Difficult to compute marginalised likelihood requiring
approximations.
- Not easy to specify rich approximations for latent posterior
distribution.
Cascaded Indian Sigmoid Belief Net
Buffet process Deep auto-regressive
Hierarchical Dirichlet networks (DARN)
process
Deep Nonparametric Discrete Deep Parametric Discrete
Non-parametric
Indian buffet process Deep Gaussian
Dirichlet process Deep processes
mixture Discrete Recurrent Gaussian
Process
GP State space
Direct Nonparametric Discrete
model
Deep Nonparametic
Nonlinear factor
Continuous
Hidden Markov Model analysis
Discrete LVM Nonlinear Gaussian
Sparse LVMs Continuous belief network
Direct/
Deep Latent Gaussian
Linear
(VAE, DRAW)
Linear Parametric Discrete Parametric Deep Parametric Continuous
PCA, factor analysis Gaussian process

Independent LVM
components analysis
Gaussian LDS
Latent Gauss Field
Linear Parametric Continuous Direct Nonparametric Continuous
Implicit Models
Change of variables for invertible
functions
z⇠
- 1 z Implicit models
p(z)
p(x) = p(z) det@z Transform an unobserved
μ @f f(z) noise source usin
g a parameterised
x function.
r
✓
x = µ + Rz
R
Generator
Networks
z ⇠ N (0, I )
x = f (z; ✓)
The transformation function is parameterised by a linear or
deep net work (fully-connected, convolutional or recurrent).
Properties
+ Easy sampling, and natural to specify.
+ Easy to compute expectations without knowing final distribution.
+ Can exploit with large-scale classifiers and convolutional
- net works. Difficult to satisfy constraints: Difficult to maintain
invertibility, and challenging optimisation.
- Lack of noise model (likelihood):
- Difficult to extend to generic data types
- Difficult to account for noise in observed
- data.
Hard to compute marginalised likelihood for model scoring,

comparison and selection.
Bedrooms
Convolutional
generative
adversarial network
Stochastic One-liners and
z
Differential Equations inverse sampling
Hamiltonian and Distrib. warping
Langevin SDE Normalising flows
Diffusion Models GAN generator nets
Non- and volume Non- and volume
preserving flows preserving transforms
Diffusions
Continuous Functions
(z) Discrete time
ftime
Shakir
z z ~ q(z | x)
z
Model Inference
p(x |z) Network
q(z |x) Generator
x r eal x gen
x ~ p(x | z)
Data x
D¢
Prescribed latent Implicit latent variable

variable model models and
s and variational estimation- by-
inference comparison
Variational Autoencoders Generative Adversarial

(VAEs) Networks (GANs)
Shakir Mohamed
Model Evidence
Model evidence (or marginal likelihood, partition function):
Integrating out any global and local variables enables z
model scoring, comparison, selection, moment estimation,
f(z)
normalisation, posterior computation and prediction.
We take steps to improve the model x

evidence for given data samples.
Learning principle: Model Evidence

Z
p(z) q(z) f
(z)
p(x) = p(x,
z
z)dz
Integral is intractable in gener Basic idea:
al and requires Transform the integral int o
approximation. an expectation over a simple,
known distribution.
Shakir Mohamed
Variational Inference
Z
p(x) = p(x,
z)dz
F (x, q) = Eq( z ) [log p(x|z)] -

K L [q(z)kp(z)]
This bound is exactly of the form we are looking for.
• Variational free energy: We obtain a functional and are free to
choose the distribution q(z) that best matches the true posterior.
• Evidence lower bound (ELBO): principled bound on the
marginal likelihood, or model evidence.
• Certain choices of q(z) makes this quantity easier to
compute. Examples to come.
Identity Bounding
Shakir Mohamed
Variational Methods
Variational Principle
General family of methods for approximating
complicated densities by a simpler class of
densities.
K L [q(z|y)kp(z| Approximation class

y)]
True posterior
q¢
(z)
Determi
nistic approximation procedures
with
bounds
Shakir on probabilities
Mohamed of interest.
Variational Bound
Interpreting the bound:
F (x, q) = Eq( z ) [log p(x|z)] - K L [q(z)kp(z)]

Approx. Posterior Reconstruction Penalty
• Approximate posterior distribution q(z): Best match to true

posterior p(z|y), one of the unknown inferential quantities of interest to
us.
• Reconstruction cost: The expected log-likelihood measure how
well samples from q(z) are able to explain the data y.
• Penalty: Ensures the the explanation of the data q(z) doesn’t deviate
too far from your beliefs p(z). A mechanism for realising Okham’s
razor.
Shakir Mohamed
Variational Bound
F (x, q) = Eq( z ) [log p(x|z)] - K L [q(z)kp(z)]
Some comments on q:
• Integration is now optimisation: optimise for q(z) directly.
• I write q(z) to simplify the notation, but it depends on the data, q(z|x).
• Easy convergence assessment since we wait until the free energy
(loss) reaches convergence.
• Variational parameters: parameters of q(z)
• E.g., if a Gaussian, variational parameters are mean and variance.
• Optimisation allows us to tighten the bound and get as close as
possible to the true marginal likelihood.
Shakir Mohamed
Real
Posteriors
Require flexible approximations for the types of posteriors we are likely to
see.
Shakir Mohamed
Mean-Fields
Mean-field methods assume that the distribution is
factorised.
True Fully-factorised
Posterior
z2 z2
z1 z3 z1 z3
Most Expressive Least Expressive
q⇤(z|x) / p(x| Y
qM F (z|x) = q(zk
z)p(z) k
)
Restricted class of approximations: every dimension (or
subset of dimensions) of the posterior is independent.
Shakir Mohamed
Structured Mean-
field
Structured mean-field: introduce dependencies into our
factorisation.
True Structured Approx. Fully-factorised

Posterior
z2 z2 z2
z1 z3 z1 z3
z1 z3
q⇤(z|x) / p(x| Y Y
q(z) = qk (zk | qM F (z|x) =
z)p(z) k k
{zj } j 6= k ) q(zk )
Shakir Mohamed
Latent Gaussian
Models Examples: GP regression or DLGM.
Probabilistic z ⇠ N (z|0, 1) y ⇠ p(y|f ✓
Model
(z)) Y
Mean-field approx
q(z) = N (zi |µi , i
i
o2 )
Variational F (y, q) = Eq( z ) [log p(y|z)] - K L [ q(z)kp(z)]
bound
X
F (y, q) = Eq( z ) [log p(y|z)] - i
K L [ q(zi )kp(zi )]
X i
F (y, q) = Eq( z ) [log p(y|z)] - K L [ N (zi |µi , a-2 )kN (zi |
i
0, 1)]
1 X (- 2 + µ2 - 1 - ln -2
F (y, q) = Eq( z ) [log p(y|f ✓ (z))] - i i
i
i )
2
Shakir Mohamed
Families of
Approximations
True Families of Posterior Approximations Fully-factorised
Posterior
Normalising Structured mean-field Covariance models
flows
+
z2
zK
z1 z2 z3 …
z2
…
z4
Mixtures
z1
z1 z3 Auxiliary
z variables
p(z)
y z1 z3
z0
x ⍵ z1 z2 z3
x p(x |z) r (! |x,
z)
q⇤(z|x) / p(x| Y
qM F (z|x) =
z)p(z) k
q(zk )
Shakir Mohamed
Variational
Optimisation
F (x, q) = Eq( z ) [log p(x|z)] -
K L [q(z)kp(z)]
2 2
(a) (b)
• Variational EM τ
1
τ
• Stochastic Variational
Inference Doubly Stochastic
0 0
• −1 0 µ 1 −1 0 µ 1
2 2
(c) (d)
Variational Inference τ
1
τ
• Amortised Inference
0 0
−1 0 µ 1 −1 0 µ 1
Shakir Mohamed
Variational EM
F (x, q) = Eq( z ) [log p(x|z)] -
K L [q(z)kp(z)]
Alternating optimisation for log
p(x)
the variational parameter K L [q|| …
p⇤] …
s and then model parameters F (x,
(VEM). q)
Repeat:
Initialisation t=1 Convergence
E-step ¢/ r ¢ F (x, Var. params

M- q) Model
step ✓/ r ✓ F (x, params
E M
q)
Shakir Mohamed
Amortised Inference
Repeat:
E-step (compute q)
Instead of solving for every
For i = 1, … N
observation, amortise using a model.
¢ n / r ¢ Eq¢ ( z ) [log p✓ (x n |zn )] - r
z ~ q(z | x)
¢ K L [q(zn )kp(z)]
M-step
1 X Eq¢ ( z ) [r log p✓ (x n |
✓/ ✓
N n zn )] Inference
• Inference network: q is an encoder, an inverse model, Network
recognition model. q(z |x)
• Parameters of q are now a set of global parameters
used for inference of all data points - test and train.
• Amortise (spread) the cost of inference over all data.
• Joint optimisation of variational and model parameters.
Inference networks provide an efficient mechanism for

Data x
posterior inference with memory
Shakir Mohamed
Stochastic Gradients
Z
r ¢ Eq¢ ( z) [f ✓ (z)] = r q¢ f✓ dz
(z) (z)
Doubly stochastic
estimators
Pathwise Estimator
When easy to use transformation Score-function
is estimator When function f non-
available and differentiable differentiable and q(z) is easy to
= Ep( ✏) [r f ✓ (g(✏,
function f.
¢ sample from.
¢z))]
⇠
z⇠
p(z) = Eq( z ) [f ✓ (z)r ¢ log q¢
(z))]
μ
qcp(z) ✏⇠
z = g(✏, p(✏) r ✓
¢)
Reparameterisation
R
x = µ + Rz Identity Log-derivative
Shakir Mohamed
Variational
Autoencoder
F (x, q) = E [log p(x|z)] -
q( z )
z z ~ q(z | x)
K L [q(z)kp(z)] Model
p(x |z)
Approx. Posterior Reconstruction Penalty Inference
Stochastic encoder-decoder system Network
q(z |x)
to implement variational inference.
• Model (Decoder): likelihood p(x|z).

• Inference (Encoder): variational distribution q(z|x)
x ~ p(x | z)
• Transforms an auto-encoder into a generative Data x
model
Specific combination of variational inference in latent
variable models using inference networks
Variational Auto-encoder
But don’t forget what your model is, and what inference you
use.
Shakir Mohamed
Learning by
Comparison
For some models, we only have access to an
unnormalised probability or partial knowledge of the Basic idea:
Transform into
distribution.
learning a model of
z q(x) p*(x) the density ratio.
f(z) We compare the

estimated distribution q(x) to
x the true distribution
p*(x)
Ratios p(x (1)

)
Learning principle: Two-sample
tests ⇤
p(x ( 2 ) )
p (x
)q(x) = 1 p⇤(x) =
q(x)
Interest is not in estimating the marginal probabilities, only in how they are
related.
Shakir Mohamed
Estimation by
Comparison
Density Estimation
Two steps by Comparison
1. Use a hypothesis
H 0 : p⇤ = q✓ vs.
test or comparison to
p⇤ 6= q✓
obtain some model to
L (✓, ¢ )
tells how data from our
model differs from
observed data.
Density Density Ratio

2. Adjust model to ⇤
p
better match the data Difference r¢ = q
distribution using the r ¢ = p⇤ -
q✓
✓
comparison model Mixtures with
identical moments Bf
from step 1. [r ⇤kr ]
Max Mean Moment Bregman Class Probability

Discrepenc Matching Divergenc f-Divergence
Estimation
y e
f (u) = u log u - (u + 1) log(u +
1)
Shakir Mohamed
Adversarial Learning
p⇤(x) p(y = 1| p(y = + 1|x) = D ✓ p(y = - 1|x) = 1 - D ✓ Scoring Function
x)
q(x) = p(y = - 1|x) (x) (x)
F (x, ✓, ¢ ) = Ep⇤ ( x ) [log D ✓ (x)] + Eq¢ ( x ) [log(1 - Bernoulli Loss
D ✓ (x)]
Generative Adversarial
Networks
Alternating min max F (x, ✓,
optimisation ¢)
✓
Comparison loss ✓ / r ✓ Ep ⇤ ( x ) [l og D ✓ ( x )] + r ✓ Eq¢ ( x ) [l og( 1 - D ✓
Generative ¢ / - r ¢ Eq( z ) [log(1 - D ✓ (f ¢
(loss
x)]
(z))]
Instances of testing and z z ⇠ p(z)
inference: x ge n = f ¢ (z)
f(z)
• Unsupervised-as-supervised learning
• Classifier ABC x gen x obs
• Noise-contrastive estimation
• Adversarial learning and GANs
Density-ratio Reparameterisation
Shakir Mohamed
Method of Moments
f ¢ (x)
…
Moment estimator Tangent of posterior
hl (h l - 1 ; odds.
¢l )
r f ¢ l (x)
Moment Vector
¢
¢l
…
h 1 (x ; ¢ 1 )
r f ¢ 1 (x)
!
<
¢
¢1
LG LM • Consistent estimators: the number of

(✓) (¢) moments is greater than the number of
z model parameters.
Moment Network • Features should not be not co-linear.
Model g✓ f¢ (x) • More stable than adversarial training.
(z) • Does not require frequent updating of
the classifier.
x gen x r eal
Shakir Mohamed
Convergent
Approaches
Marginal KL Information
• Replace density ratios by classifiers, replace posteriors with implicit models,

• Views from optimal transport and connections to integral probability
metrics.
Shakir Mohamed
Environments and
Rewards
Environment as a generative
Action Prior
process:
p(a)
An unknown likelihood;
Not known analytically;
Only able to observe
its outcomes.
Prior over actions
Interaction only
Environmen
t or Model
Long-term reward p(R(s,a))
All the key inferential questions can now be asked in this simple
framework.
Shakir Mohamed
Planning-as-Inference
Simplest question
What is the posterior distribution over actions?
Maximising the probability of the return log
p(R).
Variational inference in the hierarchical model

Action Prior
p(a)
Recover policy search methods:

Uniform prior over distributions Environment
Continuous policy parameters or Model
Can evaluate environment, but not p(R(s,a))
differentiate.
Shakir Mohamed
Policy Search
Free Energy
Policy gradient using score-function gradient
Action Prior
Appearance of the entropy penalty is natur
p(a)
al and alternative priors easy to consider.
Can easily incorporate prior knowledge of the action
space.
Use any of the tools of probabilistic inference Environmen
available. Easily handle stochastic and deterministic t or Model
policies. p(R(s,a))
Identity Log-derivative
Shakir Mohamed
Hierarchical Planning
Prior
p(z|x) Action
Inference
q(a1..T|z)
log p(z|x)
Action Prior
p(a1..T|z)
z -Inference
q(z |x)
log p(a1..T|z)
Environment
or Model
p(R |a, x)
Data x
log p(R|a, x)
Variational
F (✓) = Eq( a,z | x ) [R(a|x)]MDP
⇡
- ↵K L [q✓ (z|x)kp(z|x)] + ↵H[⇡ ✓
(a|z)]
Shakir Mohamed
With a more realistic expansion as
i on graphical Exte model
Action Prior
rn a
DerivelEn Bellman’s
viron equation as p(a)
ment
a
different writing of message
Obse
passing. S e ns
rvat i
on/
at i o n
Application
Inte rnal of the EM Environmen
En viron t
algorithm
policy search for mbecomes
en Ag e n
t
Optio
n
KB possible.
c
Easily consider
State
t
other variational
Environment
or Model
Repr
.
methods, like
Criti
EP.
Both model-free and model-based p(R(s,a))
S t at e
Plmethods
an n e emerge.Embedd
r i ng
Shakir Mohamed
Final Words
K L [q(z|y)kp(z|y)] Approximation class
z
True posterior
f(z)
q¢ (z)
x
z z ~ q(z | x)
z
Model Inference
p(x |z) Network
q(z |x)
Generator
x r eal x gen
x ~ p(x | z)
Data x
D¢
Shakir Mohamed
Build Communities
Shakir Mohamed
shakirm.com/
feedback
Planting the Seeds

of Probabilistic
ThinkingAlgorithms
Shakir Mohamed
@shakir_za
Not exhaustive list, and many references to be
updated.
Probabilistic Thinking
Cheeseman, P.C., 1985, August. In Defense of Probability. In IJCAI (Vol. 2, pp. 1002-1009).
Murphy, K.P., 2012. Machine Learning: A Probabilistic Perspective.
Efron, B., 1982. Maximum likelihood and decision theory. The annals of Statistics, pp.340-356.
Stochastic Optimisation
P L’Ecuyer, Note: On the interchange of derivative and expectation for likelihood ratio derivative estimators, Management
Science, 1995
Peter W Glynn, Likelihood ratio gradient estimation for stochastic systems, Communications of the ACM,
1990 Michael C Fu, Gradient estimation, Handbooks in operations research and management science, 2006
Ronald J Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine
learning, 1992
Paul Glasserman, Monte Carlo methods in financial engineering, , 2003
Luc Devroye, Random variate generation in one line of code, Proceedings of the 28th conference on Winter simulation,
1996
L. Devroye, Non-uniform random variate generation, , 1986
Omiros Papaspiliopoulos, Gareth O Roberts, Martin Skold, A general framework for the parametrization of hierarchical
models, Statistical Science, 2007
Michael C Fu, Gradient estimation, Handbooks in operations research and management science, 2006
Ranganath, Rajesh, Sean Gerrish, and David M. Blei. "Black Box Variational Inference." In AISTATS, pp. 814-822. 2014.
Mnih, Andriy, and Karol Gregor. "Neural variational inference and learning in belief networks." arXiv preprint
arXiv:1402.0030 (2014).
Lázaro-Gredilla, Miguel. "Doubly stochastic variational Bayes for non-conjugate inference." (2014).
Wingate, David, and Theophane Weber. "Automated variational inference in probabilistic programming." arXiv preprint
arXiv: 1301.1299 (2013).
Paisley, John, David Blei, and Michael Jordan. "Variational Bayesian inference with stochastic search." arXiv preprint
arXiv: 1206.6430 (2012).
Shakir Mohamed
• Applications of Deep Generative Models
• Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. "Stochastic backpropagation and approximate inference in
deep generative models.” ICML 2014
• Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes.” ICLR 2014
• Gregor, Karol, et al. "Towards Conceptual Compression." arXiv preprint arXiv:1604.08772 (2016).
• Eslami, S. M., Heess, N., Weber, T., Tassa, Y., Kavukcuoglu, K., & Hinton, G. E. (2016). Attend, Infer, Repeat: Fast
Scene Understanding with Generative Models. arXiv preprint arXiv:1603.08575.
• Oh, Junhyuk, Xiaoxiao Guo, Honglak Lee, Richard L. Lewis, and Satinder Singh. "Action-conditional video prediction using
deep networks in atari games." In Advances in Neural Information Processing Systems, pp. 2863-2871. 2015.
• Rezende, Danilo Jimenez, Shakir Mohamed, Ivo Danihelka, Karol Gregor, and Daan Wierstra. "One-Shot Generalization in
Deep Generative Models." arXiv preprint arXiv:1603.05106 (2016).
• Rezende, Danilo Jimenez, S. M. Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess.
"Unsupervised Learning of 3D Structure from Images." arXiv preprint arXiv:1607.00662 (2016).
• Kingma, Diederik P., Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. "Semi-supervised learning with deep
generative models." In Advances in Neural Information Processing Systems, pp. 3581-3589. 2014.
• Maaløe, Lars, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. "Auxiliary Deep Generative Models." arXiv
preprint arXiv:1602.05473 (2016).
• Odena, Augustus. "Semi-Supervised Learning with Generative Adversarial Networks." arXiv preprint arXiv:1606.01583 (2016).
• Springenberg, Jost Tobias. "Unsupervised and Semi-supervised Learning with Categorical Generative Adversarial Networks."
arXiv preprint arXiv:1511.06390 (2015).
• Blundell, Charles, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z. Leibo, Jack Rae, Daan Wierstra, and
Demis Hassabis. "Model-Free Episodic Control." arXiv preprint arXiv:1606.04460 (2016).
• Higgins, Irina, Loic Matthey, Xavier Glorot, Arka Pal, Benigno Uria, Charles Blundell, Shakir Mohamed, and Alexander
Lerchner. "Early Visual Concept Learning with Unsupervised Deep Learning." arXiv preprint arXiv:1606.05579 (2016).
• Bellemare, Marc G., Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. "Unifying Count-
Based Exploration and Intrinsic Motivation." arXiv preprint arXiv:1606.01868 (2016).
• Alexander (Sasha) Vezhnevets, Mnih, Volodymyr, John Agapiou, Simon Osindero, Alex Graves, Oriol Vinyals, and Koray Kavukcuoglu.
"Strategic Attentive Writer for Learning Macro-Actions." arXiv preprint arXiv:1606.04695 (2016).
• Gregor, Karol, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. "DRAW: A recurrent neural network for image
generation." arXiv preprint arXiv:1502.04623 (2015).
Shakir Mohamed
• Fully-observed M odels
• Oord, Aaron van den, Nal Kalchbrenner, and Koray Kavukcuoglu. "Pixel recurrent neural networks." arXiv preprint arXiv:1601.06759
• (2016). Larochelle, Hugo, and Iain Murray. "The Neural Autoregressive Distribution Estimator." In AISTATS, vol. 1, p. 2. 2011.
• Uria, Benigno, Iain Murray, and Hugo Larochelle. "A Deep and Tractable Density Estimator." In ICML, pp. 467-475. 2014.
• Veness, Joel, Kee Siong Ng, Marcus Hutter, and Michael Bowling. "Context tree switching." In 2012 Data Compression Conference, pp. 327-
336. IEEE, 2012.
• Rue, Havard, and Leonhard Held. Gaussian Markov random fields: theory and applications. CRC Press, 2005.
• Wainwright, Martin J., and Michael I. Jordan. "Graphical models, exponential families, and variational inference." Foundations and
Trends® in Machine Learning 1, no. 1-2 (2008): 1-305.
• Implicit Probabilistic M odels

• Tabak, E. G., and Cristina V. Turner. "A family of nonparametric density estimation algorithms." Communications on Pure and Applied
Mathematics 66, no. 2 (2013): 145-164.
• Rezende, Danilo Jimenez, and Shakir Mohamed. "Variational inference with normalizing flows." arXiv preprint arXiv:1505.05770 (2015).
• Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio.
"Generative adversarial nets." In Advances in Neural Information Processing Systems, pp. 2672-2680. 2014.
• Verrelst, Herman, Johan Suykens, Joos Vandewalle, and Bart De Moor. "Bayesian learning and the Fokker-Planck machine." In Proceedings
of the International Workshop on Advanced Black-box Techniques for Nonlinear Modeling, Leuven, Belgium, pp. 55-61. 1998.
• Devroye, Luc. "Random variate generation in one line of code." In Proceedings of the 28th conference on Winter simulation, pp. 265-272.
IEEE Computer Society, 1996.
• Ravuri, S., Mohamed, S., Rosca, M. and Vinyals, O., 2018. Learning Implicit Generative Models with the Method of Learned Moments. arXiv
preprint arXiv:1806.11006.
Shakir Mohamed
Latent variable models
Dayan, Peter, Geoffrey E. Hinton, Radford M. Neal, and Richard S. Zemel. "The helmholtz machine." Neural computation 7,
no. 5 (1995): 889-904.
Hyvärinen, A., Karhunen, J., & Oja, E. (2004). Independent component analysis (Vol. 46). John Wiley & Sons.
Gregor, Karol, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. "Deep autoregressive networks." arXiv
preprint arXiv:1310.8499 (2013).
Ghahramani, Zoubin, and Thomas L. Griffi ths. "Infinite latent feature models and the Indian buffet process." In Advances in
neural information processing systems, pp. 475-482. 2005.
Teh, Yee Whye, Michael I. Jordan, Matthew J. Beal, and David M. Blei. "Hierarchical dirichlet processes." Journal of the
american statistical association (2012).
Adams, Ryan Prescott, Hanna M. Wallach, and Zoubin Ghahramani. "Learning the Structure of Deep Sparse Graphical Models."
In AISTATS, pp. 1-8. 2010.
Lawrence, Neil D. "Gaussian process latent variable models for visualisation of high dimensional data." Advances in
neural information processing systems 16.3 (2004): 329-336.
Damianou, Andreas C., and Neil D. Lawrence. "Deep Gaussian Processes." In AISTATS, pp. 207-215. 2013.
Mattos, César Lincoln C., Zhenwen Dai, Andreas Damianou, Jeremy Forth, Guilherme A. Barreto, and Neil D. Lawrence.
"Recurrent Gaussian Processes." arXiv preprint arXiv:1511.06644 (2015).
Salakhutdinov, Ruslan, Andriy Mnih, and Geoffrey Hinton. "Restricted Boltzmann machines for collaborative
filtering." In Proceedings of the 24th international conference on Machine learning, pp. 791-798. ACM, 2007.
Saul, Lawrence K., Tommi Jaakkola, and Michael I. Jordan. "Mean field theory for sigmoid belief networks." Journal of
artificial intelligence research 4, no. 1 (1996): 61-76.
Frey, Brendan J., and Geoffrey E. Hinton. "Variational learning in nonlinear Gaussian belief networks." Neural Computation 11,
no. 1 (1999): 193-213.
Shakir Mohamed
Inference and Learning
Jordan, Michael I., Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. "An introduction to variational methods
for graphical models." Machine learning 37, no. 2 (1999): 183-233.
Hoffman, Matthew D., David M. Blei, Chong Wang, and John William Paisley. "Stochastic variational inference." Journal of
Machine Learning Research 14, no. 1 (2013): 1303-1347.
Honkela, Antti, and Harri Valpola. "Variational learning and bits-back coding: an information-theoretic view to Bayesian
learning." IEEE Transactions on Neural Networks 15, no. 4 (2004): 800-810.
Burda, Yuri, Roger Grosse, and Ruslan Salakhutdinov. "Importance weighted autoencoders." arXiv preprint
arXiv:1509.00519 (2015).
Li, Yingzhen, and Richard E. Turner. "Variational Inference with R\ 'enyi Divergence." arXiv preprint arXiv:1602.02311
(2016). Borgwardt, Karsten M., and Zoubin Ghahramani. "Bayesian two-sample tests." arXiv preprint arXiv:0906.4032
(2009). Gutmann, Michael, and Aapo Hyvärinen. "Noise-contrastive estimation: A new estimation principle for
unnormalized statistical models." AISTATS. Vol. 1. No. 2. 2010.
Tsuboi, Yuta, Hisashi Kashima, Shohei Hido, Steffen Bickel, and Masashi Sugiyama. "Direct Density Ratio Estimation for
Large- scale Covariate Shift Adaptation." Information and Media Technologies 4, no. 2 (2009): 529-546.
Sugiyama, Masashi, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge
University Press, 2012.
Amortised Inference
Gershman, Samuel J., and Noah D. Goodman. "Amortized inference in probabilistic reasoning." In Proceedings of the 36th
Annual Conference of the Cognitive Science Society. 2014.
Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. "Stochastic backpropagation and approximate inference in
deep generative models." arXiv preprint arXiv:1401.4082 (2014).
Heess, Nicolas, Daniel Tarlow, and John Winn. "Learning to pass expectation propagation messages." In Advances in
Neural Information Processing Systems, pp. 3219-3227. 2013.
Jitkrittum, Wittawat, Arthur Gretton, Nicolas Heess, S. M. Eslami, Balaji Lakshminarayanan, Dino Sejdinovic, and Zoltán
Szabó. "Kernel-based just-in-time learning for passing expectation propagation messages." arXiv preprint arXiv:1503.02551
(2015). Korattikara, Anoop, Vivek Rathod, Kevin Murphy, and Max Welling. "Bayesian dark knowledge." arXiv preprint
arXiv:1506.04416 (2015).
Shakir Mohamed

MLSS2018 Madrid ProbThinking

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

MLSS2018 Madrid ProbThinking

Uploaded by

Copyright:

Available Formats

Planting the Seeds

Planning Explanation Rapid Learning World Objects and

Uncertainty Information Gain Causality Prediction

Probability Bayesian Hypothesis Estimation Asymptotics

3. Use probabilistic thinking applied to

Probability is sufficient for the task of

Probability is a measure of the belie

No such thing as Inherently Different observers

prob(peak hour | Traffic Jam)

Probabilistic models let you learn

Artificial General Intelligence

The core questions of

refer to as an activation function. Binary Gumbel

Optimise the negative log-likelihood Categorical Multinomial

L = - log p(y|g(⌘); ✓) Counts

Sparse Tobit max

A general, flexible framework for building ⌘1 = Bx 1

• Straightforward and natural way to learn

Maximum a Posteriori • Generalises the MLE (uniform prior)

✦ Regularisation is essential to overcome the limitations

A wide range of available regularisation techniques:

Uncertainty Parameterisation sensitive

Mode of the prior

Clear sensitivity: Sensitive to units, affects interpretability,

• Use the Fisher information

Motivates learning more than the mean.

Pragmatic Bayesian Approach for

Bayesian reasoning over some, but not

• In Bayesian analysis, things that are not.

- Mainly conjugate and linear models

Natural to consider the marriage of these approaches: Bayesian Deep

• Ways ofmodel Posterior

Move from primal variables to dual variables

Kernel trick and

Probability distributions over functions

Parametric, Non-parametric Latent

Laplace Maximum Two Sample Method of

A given model and learning principle can be implemented in many ways.

Convolutional neural network Implicit Generative Model

1. ● VEM algorithm 2. Learning

2. Build connections between

Evidence Moment Parameter

Do this by introducing a probabilistic

Zero mean unit var

Triangular Jacobians allow

r ¢ Eq¢ ( z ) [f ✓ (z)] = r q¢ f✓ dz • Gradient is of the

(z) (z) of the distribution w.r.t.

1. Pathwise estimator: Differentiate the function f(z)

Typical problem areas

For concave functions f(.) f(x)

Logarithms are strictly concave allowing us to use Jensen’s

Lower Eq( z ) [log p(x|z)] -

p⇤(x p(y = 1|x)

Density Ratio Trick Elsewhere

Bayes’ = p(y = + 1|x)p(x) p(y = - 1|

Hypothesis Testing Experimental Design

B = log p(x|H 1 ) - log p(x|

2. Understand generative algorithms

3. Build awareness of the breadth of

Recognise objects in the Detect surprising

Establish concepts as Anticipat e

Models that allow for

Fully-observed conditional generative

x 1 ⇠ Cat(x 1 |⇡) xt xt+1 xt+2 xt+3 …