Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 118

Planting the Seeds

of Probabilistic
ThinkingAlgorithms
Foundations | Tricks |

Shakir Mohamed
Research Scientist, DeepMind

@shakir_za
Planting the Seeds of Probabilistic
Thinking: Foundations, Tricks and
Probabilistic Algorithms
machine learning approaches task of describing of data, to
complex systems or our world using the language and tools of probability. Almost all
of machine learning can be viewed in probabilistic terms, making probabilistic thinking
fundamental. It is, of course, not the only view. But it is through this view that we can
connect what we do in machine learning to every other computational science,
whether that be in stochastic optimisation, control theory, operations research,
econometrics, information theory, statistical physics or bio-statistics. For this reason
alone, mastery of probabilistic thinking is essential.

The aim of this tutorial is to develop flexible and broad tools that will support
your probabilistic thinking. Part 1, Foundations looks at the philosophy of machine
learning, builds an understanding of the model-inference-algorithm paradigm, and
the explores fundamental areas of machine learning - we’ll look at deep
learning, kernels and reinforcement learning. Part 2 Tricks, will look at 6 individual
probabilistic problems and a tricks to solve them, using these tricks to develop
flexibility in our thinking. Part 3: Algorithms will look at how the foundations and
tricks combine to develop machine learning algorithms, with a specific focus on the
area of deep generative models.
Shakir Mohamed
Principles to
Products
Applications
Assistive
Technology
Advancing
Science
Climate and
Energy
Healthcare Fairness and
Safety
Autonomous
systems

Planning Explanation Rapid Learning World Objects and


Reasoning Simulation Relations

Uncertainty Information Gain Causality Prediction


Information

Probability Bayesian Hypothesis Estimation Asymptotics


Principles Theory Analysis Testing Theory

Shakir Mohamed
Planting the Seeds
of Probabilistic
Thinking
Part I:
Foundations
Learning Objectives
1. Language to think about the
Philosophy of Machine
Learning

2. Understand the
Model-Inference-Algorithm paradigm

3. Use probabilistic thinking applied to


problems in supervised, unsupervised,
and reinforcement learning.
Probability
Some Definitions for
probability
Logical Probability
Statistical Degree of confirmation of
Probability a hypothesis based on
Frequency ratio of items logical analysis

Probability as
Propensity Subjective
Probability used Probability
for predictions Probability a
s a degree of belief

Probability is sufficient for the task of


reasoning under uncertainty
Probability
Probability as a Degree of
Belief

Probability is a measure of the belie


f in a proposition given evidence.
A description of a state of knowledge.

No such thing as Inherently Different observers


the probability subjective in that with different
of an event, since the it depends on the informatio
value depends on the believer’s n will have different
evidence used. information beliefs.
Probabilistic
Quantities
Probability

Condition
s

Bayes Rule

Parameterisatio

n Expectation

Gradient
Statistical
Operations
Estimation Hypothesis
and Inference Testing
Learning

Summarisation Comparison

Data Experimental
Modelling Enumeration Design
Probabilistic Models
Model: Description of the A probabilistic model writes
world, of data, of potential out these models using the
scenarios, of processes. language of probability

Peak Bad
prob(traffic Jam)
Accident
hour Weathe
r prob(sirens | Accident)

prob(peak hour | Traffic Jam)


Traffi
Sirens
c
Jam Most models in machine
learning are
probabilistic.

Probabilistic models let you learn


probability distributions of data. You can choose what to learn: Just
the mean. Or the entire
distribution.
Centrality of Inference
Inference

Artificial General Intelligence


Summarisation Comparison will be the refined instantiation
of these statistical operations.
Data
Enumeration

The core questions of


AGI will be those of
probabilistic
inference
Linear Regression
Generalised Linear E[y]
Regression
g()
⌘ = w> x + b
p(y|x) = p(y|g(⌘); ✓) ⌘=
• The basic function can be any linear Bx
function, e.g., affine, convolution. Target Regression Link Inv link Activation
Real Linear Identity Identity
µ


Binary Logistic Logit log Sigmoid Sigmoid
g(.) is an inverse link function that we’ll
1- µ
1
1+ exp( - ⌘)
Binary Probit Inv Gauss Gauss CDF Probit

refer to as an activation function. Binary Gumbel


CDF <D- 1 (µ)
Compl.
<D(⌘)
Gumbel
- e- x
CDF
e
log-log
lo g(-
Binary Logistic lo g(µ)) Hyperbolic Tanh
Tangent

Optimise the negative log-likelihood Categorical Multinomial


tanh(⌘)
Multin. Softmax
P ⌘i
Logit

L = - log p(y|g(⌘); ✓) Counts


Counts
Non-neg.
Poisson
Poisson
Gamma
lo g(µ)
p
(µ)
Reciprocal 1
µ exp(
j

⌘j

Sparse Tobit max


⌫) max(0; ReLU
2
Ordered Ordinal ⌫
Cum. ⌫)
Logit
1
cr(cpk - ⌘)

Deep Networks
Recursive Generalised Linear
Regression E[y]

g()

⌘l = Bx l
• Recursively compose the basic linear functions. …
• Gives a deep neural network.
E[y] = hL o . . . o hl o h0 (x) g()

A general, flexible framework for building ⌘1 = Bx 1


non-linear, parametric models
Likelihood
Probabilistic Likelihood
Model function

Likelihood of parameters

Efficient Estimators
Tests with Good
• Statistically efficient (Cramer-Rao lower Power
Prescribed Likelihoods

bound)
• Asymptotically unbiased, consistent • Likelihood ratio tests
• Maximum entropy (principle of indifference) • Can construct small
confidence regions

Pool Information
Widely-applicable
• Combine different data sources
• Handle data that is incompletely • Knowledge outside the data can
observed, distorted, samples with be used, like constraints on domain
bias or prior probabilities.
• Can offset or correct these issues.
Misspecification: Inefficient estimates; or confidence intervals/tests can fail
completely.
Estimation Theory
E[y]
Probabilistic
Model
g()

Likelihood
function ⌘l = Bx l

Maximum Likelihood …
Optimisation
Objective g()

• Straightforward and natural way to learn


parameters
• Can be biased in finite sample size, e.g., Gaussian ⌘1 = Bx 1
variances with N and N-1.
• Easy to observe overfitting of parameters.
Estimation Theory
Probabilistic
Model

Likelihood
function

Maximum a Posteriori • Generalises the MLE (uniform prior)


(MAP) • Shrinkage: shrink parameters back
to initial beliefs.
Optimisation
Objective
• Not every regulariser corresponds a
valid probability distribution.
Regularisation

✦ Regularisation is essential to overcome the limitations


of maximum likelihood estimation.
✦ Other names: Regularisation, penalised regression,
shrinkage.

A wide range of available regularisation techniques:


• Large data sets
• Input noise/jittering and data
augmentation/expansion.
• L2 /L1 regularisation (Weight decay, Gaussian prior)
• Binary or Gaussian Dropout
MAP Estimation
Type of Solution
What is maximum is not
necessarily typical

Uncertainty Parameterisation sensitive


Can be reported using confidence Location of max will change
intervals or bootstrap estimates. depending on parameterisation
Invariant MAP
Popular
Example
Change of Bernoulli
variables
Uniform

Mode of the prior

Parameterisation Parameterisation
1 2
Transform

New prior

MAP Est.

Clear sensitivity: Sensitive to units, affects interpretability,


affects gradients, learning stability, design of models.
Invariant MAP
Use a modified probabilistic model that removes
sensitivity

Invariant MAP

• Use the Fisher information


• Connection to the natural gradients and trust-region
optimisation.
• Uninformative priors.

Proposed solutions have not fully dealt with the underlying issues.
Bayesian Analysis
Issues arise as a consequence
• Reasoning only about the most likely solution, and
of:
• Not maintaining knowledge of the underlying variability (and averaging
over this).

Motivates learning more than the mean.


This is the core of a Bayesian
philosophy.

Pragmatic Bayesian Approach for


Probabilistic Reasoning in Deep
Networks. (and all of machine learning)

Bayesian reasoning over some, but not


all parts of our models (yet).
Bayesian Analysis
Interested in reasoning about two important
quantities

Evidenc

Posterio

• In Bayesian analysis, things that are not.


observed must be integrated over - averaged
out.
• This makes computation difficult.
• Integration is the central operation.
Learning and
Inference Machine learning makes a distinction bet ween
Statistics inference and learning:
, no distinction between
learning and inference - • Inference: reason about (and compute)
only inference (or unknown probability distributions.
estimation). • (Parameter) Learning is finding point
estimates of quantities in the model.

Bayesian
statistics, all
quantities are Software Decision making
probability distributions, so engineering, inference and AI, refer to learning
there is only the problem is the forward evaluation of in general as the means of
of inference. a understanding and acting
trained model (to get based on past
predictions). experience
(data).
Two Streams of ML
Deep Learning Bayesian
Reasoning

- Mainly conjugate and linear models


+ Rich non-linear models for
classification and sequence - Potentially intractable inference,
prediction. computationally expensive or long
+ Scalable learning using stochastic
simulation time.
approximation and conceptually
simple. + Unified framework for model building,
+ Easily composable with other
inference, prediction and decision making
gradient- based methods
+ Explicit accounting for uncertaint
- Only point
y and variability of outcomes
estimates
- Hard to score models, do selectio
+ Robust to overfitting; tools for model
n and complexity penalisation.
selection and composition.

Natural to consider the marriage of these approaches: Bayesian Deep


Learning
Bayesian Regression
Probabilistic models over
functions

Prio
r

Observation

• Ways ofmodel Posterior


learning distributions over
functions and maintaining uncertainty
over functions.
• Difficult in parametric models (like deep y
networks) because of high-dimensional
parameter space.
• Many ways to learn the posterior
distribution. Focus of Part III
Density Estimation
Learn probability distributions over the data
itself
μ Σ • Can learn distributions of some things
and point estimates of others.
• Deep Generative Models and
z Unsupervised learning - more in Part
III
W

y
n = 1, …, N

Factor Analysis /
PCA

z ⇠ N (z|µ, ⌃)
y ⇠ N (y|W z, y
0"2 I )
Decision-making
Probabilistic models of environments and
actions

Prior over

actions

Interaction only

Reward/Utility
External Environment
Observation/
Action Sensation Setup is commo
Environment n in experimental
design, causal learning,
reinforcement learning.
Decision-maker
Probabilistic Dualities
Basis Function Regression

Move from primal variables to dual variables

Kernel trick and


methods

Probability distributions over functions

Gaussian
processes
Deep and Hierarchical
Hierarchical Model: models where the (prior) probability distributions c
an be decomposed into a sequence of conditional distributions
Foundations
How will you approach your ML research and
practice?
In general: For the ML Core:
Human-centred, Probabilistic and pragmatic in
interdisciplinary approach approach

Sociologica
l
Psychological Architecture-Loss

Componential

Physiological
Sun’s Phenomenological
Levels
Model-Inference-Algorithm
Architecture-Loss
O: So# max

P²: Plus

T² :Time B²:
s Weight

W ²: Weight S¹:
Sigmoid
P¹:
Plus
T¹:Time B¹:Weigh
s t

W ¹: X :Input
Weight

1. Computational 2. Error
Graphs propagation
Model-Inference-Algorithm

3.
Algorithms

1. 2. Learning
Models Principles

Shakir Mohamed
Models
Directed and Undirected Fully-
observed

Parametric, Non-parametric Latent


And semi-parametric Variable
μ, Σ

z1 z2 … zD

Model y1 y2 … yD
s n = 1, …, N

Shakir Mohamed
Learning Principles
Statistical
Inference

Direct Indirect

Laplace Maximum Two Sample Method of


approximatio Likelihood Comparison Moments
n
Ma Variational Approx Bayesian Transpo! ation
ximum a Inference Computation methods
posteriori
Integr. Nested Max Mean
Cavity Methods
Laplace Discrepency
Approx
Expectation Markov chain
Maximisatio Monte Carlo
n
Noise Sequential
Contrastiv Monte
Learning
e Carlo
Principle
Shakir Mohamed
s
Algorithms

A given model and learning principle can be implemented in many ways.

Convolutional neural network Implicit Generative Model


+ penalised maximum + Two-sample testing
likelihood
● Optimisation methods ● Unsupervised-as-supervised learning
(SGD, Adagrad) ● Approximate Bayesian Computation
● Regularisation (L (ABC)
1, L2, batchnorm, ● Generative adversarial network
dropout) (GAN)
Latent variable model Restricted Boltzmann Machine
+ variational inference + maximum likelihood
● VEM algorithm ● Contrastive
● Expectation ● Divergence Persistent
● propagation ● CD
● Approximate message passing ● Parallel
Variational auto-encoders Tempering
(VAE) Natural gradients
Shakir Mohamed
Final Words
Subjective Probability Deep Learning, Estimation theory,
Probability as a degree of belief hierarchical models, dualities

Probabilistic Likelihood
Probabilistic descriptions Model function
of systems and data
E[y]
Peak Bad
Accident
hour Weather

g()

Traffi
c
Sirens ⌘l =
Jam
Bx l

Model-Inference-Algorithm
g()

⌘ 1 = Bx1

Shakir Mohamed
Planting the Seeds
of Probabilistic
ThinkingAlgorithms
Foundations | Tricks |

Shakir Mohamed
Research Scientist, DeepMind

@shakir_za
Last Time …

3.
Algorithms
Convolutional neural network
+ penalised maximum likelihood
● Optimisation methods
(SGD, Adagrad)
● Regularisation (L
1, L2, batchnorm,
dropout)
Latent variable model
+ variational inference

1. ● VEM algorithm 2. Learning


● Expectation
Models ● propagation Principles
● Approximate message passing
Variational auto-encoders
Shakir Mohamed (VAE)
Planting the Seeds
of Probabilistic
Thinking
Part II:
Tricks
Learning Objectives

1. Develop tools to
manipulate distributions by
studyin g 6 probability
questions.

2. Build connections between


concepts in machine learnin
g and those in other computational
sciences.

Shakir Mohamed
Inferential Questions
Probabilistic dexterity is needed to solve the fundamental
problems of machine learning and artificial intelligence.

Evidence Moment Parameter


Estimatio Computatio Estimation
nZ n Z
p(x) = p(x, E[f (z)|x] = f (z)p(z|
z)dz x)dz
Predictio Plannin
n g

Hypothesis Experimental
Testing Design
B = log p(x|H 1 ) - log p(x|
HMohamed
Shakir 2)
Identity Trick
Transform an expectation w.r.t. distribution
p, into an expectation w.r.t. distribution q.

Do this by introducing a probabilistic


one
Shakir Mohamed
Identity Trick
Z
Integral problem p(x) = p(x|
Z z)p(z)dz
q(z)
Probabilistic p(x) = p(x|z)p(z)
one q(z)
Z dz
p(z)
p(z) q(z) f Re-group/re-weight p(x) = p(x|z) q(z)d
(z)
q(z)
z

Conditions
• q(z)>0, when p(x|
z)p(z) ≠ 0.
• q(z) is known/easy
toMohamed
Shakir handle.
Importance Sampling

1 X
Monte Carlo p(x) = w ( s) p(x|z( s) )
Estimator S s

( s) p(z) z( s) ⇠
w = q(z)
q(z)

p(z) q(z) f
(z)
Identity Trick Elsewhere
• Manipulate stochastic
gradients
z • Derive probability bounds
• RL for policy corrections
Shakir Mohamed
Hutchinson’s Trick
Compute the trace of a matrix:
• KL bet ween two Gaussians.
• Gradient of a log-determinant.
Trace problem

Zero mean unit var

Identity

Trick Linear

operations Trace

property
Shakir Mohamed
Probability Flow
Tricks
Distribution and sample

Transformatio

n Unconscious

Statistician

Makes entropy
Change of
computatio
Variables n Compute
and backpropagation
easy.
Begin with a diagonal
Gaussian and improve by
change of variables.

Triangular Jacobians allow


for computational Linear time computation of the determinant and its gradient.
efficiency.
Shakir Mohamed
Normalising Flows

Shakir Mohamed
Stochastic
Optimisation
Common gradient problem
Z • Don’t know this expectation
in general.

r ¢ Eq¢ ( z ) [f ✓ (z)] = r q¢ f✓ dz • Gradient is of the


parameters

(z) (z) of the distribution w.r.t.


which
the expectation is taken.

1. Pathwise estimator: Differentiate the function f(z)


2. Score-function estimator: Differentiate the density q(z|
x)

Typical problem areas


• Sensitivity analysis
• Generative models and inference
• Reinforcement learning and control
• Operations research and inventory control
• Monte Carlo simulation
• Finance and asset pricing
Shakir Mohamed
Reparameterisation
Trickss aDistributions can be expressed a
transformations of other
distributions.
z⇠ z⇠
p(z) qcp(z) ✏ ⇠
z = g(✏,
μ ¢) p(✏)

r

x = µ+
R Rz
Samplers, one-liners and change-of-
variables
d✏
p(z) = p(✏) =) |p(z)dz| = |
dz
p(✏)d✏|
Shakir Mohamed
Pathwise Estimator
(Non-rigorous) Z
Derivation
r ¢ Eq( z ) [f (z)] = r ¢ q¢ (z)f Known transformation
(z)dz
z ⇠ p(z)
Change of variables
μ

r
= r ¢ Ep( ✏) [f (g(¢, ✏)] = Ep( ✏) [r ¢ f Inv fn Thm

(g(¢, ✏)]
x= µ+
Rz
= Ep( ✏) [r f ✓ (g(✏,
R
r ¢ Eq¢ ( z) [f ✓ (z)] ¢

¢ ))]
Other names When to use
• Unconscious statistician • Function f is differentiable
• Stochastic backpropagation • Density q is known with a suitable transform o
• Perturbation analysis f a simpler base distribution: inverse CDF, location-
• Reparameterisation trick scale transform, or other co-ordinate transform.
• Affine-independent inference • Easy to sample from base distribution.

Shakir Mohamed
Log-derivative Trick
Score function is the derivative of a log-likelihood
function.

q¢ (z)
r
r ¢ log q¢ (z) = q¢
¢
(z)
Several useful
properties
Expected Eq( z ) [r ¢ log q¢ (z)] = ↞Show
score 0 this

Fisher
Information
Shakir Mohamed
Score-function
Estimator
Z
r E [f (z)] = r q
¢ f
q¢ ( z ) ✓ ¢ ✓ Leibnitz integral
(z) (z)dz rule
Z

= r ¢ q¢ (z)f Identity
(z)

Z (z)dz
(z)
= q¢ (z)r ¢ log q¢ (z)f Log-deriv
(z)dz
= Eq¢ ( z ) [f (z)r ¢ log q¢ Gradient
(z)]
Contro
r ¢ E q¢ ( z) [f ✓ (z)] = E q¢ ( z) [(f (z) - c)r ¢ log q¢ l
Variate
(z)]
Other names When to use
• Likelihood ratio method • Function is not differentiable, not
• REINFORCE and policy gradients analytical.
• Automated & Black-box • Distribution q is easy to sample from.
inference • Density q is known and differentiable.
Shakir Mohamed
Bounding Tricks
An important result from convex analysis lets us
move expectations through a function:

For concave functions f(.) f(x)


f (E[x]) 2: E[f (x)]

Logarithms are strictly concave allowing us to use Jensen’s


inequality.
log Z p(x)g(x)dx 2 p(x) log Z
g(x)dx
Other Bounding Tricks
Bounding Trick Elsewhere
• Fenchel duality
Optimisation; Variational Inference; • Holder’s inequality
Rao- Blackwell Theorem; • Monge-Kantorovich
Inequality
Shakir Mohamed
Evidence Bounds Z
Integral problem p(x) = p(x|
z)p(z)dz
Z
q(z)
Proposa p(x) = p(x|z)p(z)
l q(z)
Z dz
p(z)
Importance Weight p(x) = p(x|z)
q(z)
Z q(z)dz ✓
p(z)
Jensen’s log p(x) q(z) p(x|z) ◆
Z Z
inequality 2: log q( z)
log p(x)g(x)dx 2 p(x) log
Z Z
g(x)dx dz q(z)
= q(z) log p(x|z) - q(z) log
p(z)

Lower Eq( z ) [log p(x|z)] -


bound
K L [q(z)kp(z)]
Shakir Mohamed
Density Ratio Trick
The ratio of two densities can be computed using a classifier of
using samples drawn from the two distributions.

p⇤(x p(y = 1|x)


=
) p(y = - 1|
q(x) x)

Density Ratio Trick Elsewhere


• Generative Adversarial Net works (GANs)
• Noise contrastive estimation, Classifier-
ABC
• Two-sample testing
• Covariate-shift, calibration

Shakir Mohamed
Density Ratio
Estimation
Assign { y1 , . . . , yN } = { +1, . . . , +1, - 1, . . . , -
labels 1}⇤
Equivalence p (x) = p(x|y = q(x) = p(x|y = -
1) 1)

p⇤(x p(y|
Density )q(x) Bayes’ p(x|y) =
Ratio x)p(x)
Rule
p(y)
p⇤(x p(x|y = 1)
Conditional =
)q(x) p(x|y = -
1 ) /
p(y = + 1|x)

Bayes’ = p(y = + 1|x)p(x) p(y = - 1|


p(y =
x)p(x) p(y = -
Subst.
+ 1) 1) p(y = - 1|x)
p⇤(x = p(y = 1|
Class )q(x) x)
p(y = - 1|
probability x)
Computing a density ratio is equivalent to class probability
estimation.
Shakir Mohamed
Final Words
Strengthen your probabilistic
dexterity.

Identity

Hutchinson’

s Flows
q¢ (z)
Log-derivative r
r
log q¢ (z) = q¢
¢
¢
(z)
Reparameterisatio z = g(✏, ✏ ⇠
n ¢) p(✏)
p⇤(x = p(y = 1|x)
Density )q(x) p(y = - 1|
Ratio x)
Shakir Mohamed
Planting the Seeds
of Probabilistic
ThinkingAlgorithms
Foundations | Tricks |

Shakir Mohamed
Research Scientist, DeepMind

@shakir_za
Last Time …
Subjective
Probability
Probability a
s a degree of belief

E[y]
Model-Inference-Algorithm g()

⌘l =
Bx l


g()

⌘ 1 = Bx1

Shakir Mohamed
Manipulating
Integrals
Evidence Moment Parameter
Estimation Computation Estimation
Z Z
p(x) = p(x, E[f (z)|x] = f (z)p(z|
z)dz x)dz

Prediction Planning

Hypothesis Testing Experimental Design

B = log p(x|H 1 ) - log p(x|


H 2)
Shakir Mohamed
Your Tricks
Strengthen your probabilistic
dexterity.

Identity

Hutchinson’

s Flows
q¢ (z)
Log-derivative r
r
log q¢ (z) = q¢
¢
¢
(z)
Reparameterisatio z = g(✏, ✏ ⇠
n ¢) p(✏)
p⇤(x = p(y = 1|x)
Density )q(x) p(y = - 1|
Ratio x)
Shakir Mohamed
Planting the Seeds
of Probabilistic
Thinking
Part III:
Algorithms
Learning Objectives
1. Have knowledge of different
types of probabilistic models
for unsupervised learning.

2. Understand generative algorithms


(pixelCNN, VAEs, GANs) within the
model-inference-algorithm
framework.

3. Build awareness of the breadth of


applications of generative models.
Beyond
Classification
Move beyond
associating inputs
Understand and simulate
how the world evolves
to outputs

Recognise objects in the Detect surprising


world and their factors events in the world
of variation

Establish concepts as Anticipat e


useful for reasonin and generate rich
g and decision making plans for the future
Generative Models
A model that allows us to
learn a simulator of data

Models that allow for


(conditional) density
estimation

Approaches for
unsupervised learning of
data
Characteristics are:
- Probabilistic models of data that
allow for uncertainty to be
- captured. High-dimensional data.
-Data distribution is targeted.
Applications

Super-resolution, Planning,
Compression, Product AI
Exploration
Text-to-speech s Intrinsic
motivation
Science Model-based
RL

Proteomics,
Drug
Discovery,
Astronomy,
High-energy
physics
Assistive
Technologies

Fully-observed conditional generative


model
Compression-Communication

Original

JPEG

JPEG-2000

VAE1

Compression rate:
VAE2 0.2bits/dimension
Generative Design
Video from work of Memo
Aktem
Advancing Science
Advancing Healthcare
Types of Generative
Models
Fully-observed Undirected
models Models

Sum-Product
z Networks
Latent
variable f(z)
models
x
Types of Generative
Models
Design Dimensions
Data: binary, real-valued, nominal, strings,
images.
Dependency: independent, sequential,
temporal, spatial.
Representation: continuous or discrete
Dimension: parametric or non-parametric
Computational
complexity Modelling
capacity
Bias, uncertainty,
calibration Interpretability
Fully-observed
Models xi
Fully-observed Model Parameters are
global variables.
models
xk f(x)
Stochastic activations
Model observed data directly & unobserved
xj random variables are
without introducing any new local variables.
unobserved local variables.
Markov Models

x 1 ⇠ Cat(x 1 |⇡) xt xt+1 xt+2 xt+3 …


x 2 ⇠ Cat(x 2 |⇡(x 1 ))

x i ⇠ Cat(x i |⇡(x < n ))


Hartebeest Pixel
Y CNN
p(x) = p(x i |f (x < i ;
✓)) White Whale
All conditionali probabilities
described by deep
networks.
Properties

+ Can directly encode how observed points are


+ related.
+ Any data type can be used
For directed graphical
+ Parameter learningmodels:
simple: Log-likelihood is directly computable,
no approximation needed.
+ Easy to scale-up to large models, many optimisation tools
- available.
- ForOrder sensitive.
undirected models,
- Parameter learning difficult: Need to compute normalising
constants.
- Generation can be slow: iterate through elements sequentially,
or using a Markov chain.
Directed

NADE, EoNADE
Normal Means
Fully-visible sigmoid
Continuous
belief networks
Markov Models
Pixel CNN/RNN
N-AR(p)
RNN Language mod.
RNADE
Context tree switching
Discrete Continuous

xt xt+3 …
xt+1
Boltzmann Machines
Discrete Markov
x t+2
Gaussian MRFs
Random Fields Ising, Log-linear models
Hopfield
and Potts Models

Undirected
Latent Variable
Models
z z Latent variable
z
f(z) models
f(z) Introduce an unobserved f(z)
x local random variables that
represents hidden causes.
x
x
Prescribed models Implicit models
Use observer likelihood Likelihood-free or
s and assume observation simulation-based
noise. models.
Diggle and Gratton (1984); Mohamed and Lakshminarayanan (2016)
Deep Latent Gaussian Model Prescribed Models
z3
z3 ⇠ N (0, I )
z2 |z3 ⇠ N (µ(z3 ),
z2 ⌃(z3 ))
z1 |z2 ⇠ N (µ(z2 ),
z3 ⌃(z2 ))
x|z1 ⇠ N (µ(z1 ),
Convolution ⌃(z1 ))
al DRAW
x
Properties

+ Easy sampling.
+ Easy way to include hierarchy and depth.
+ Easy to encode structure believed to generate the data
+ Avoids order dependency assumptions: marginalisation of latent
variables induces dependencies.
+ Latents provide compression and representation the data.
+ Scoring, model comparison and selection possible using the
marginalised likelihood.
- Inversion process to determine latents corresponding to a input
is difficult in general
- Difficult to compute marginalised likelihood requiring
approximations.
- Not easy to specify rich approximations for latent posterior
distribution.
Cascaded Indian Sigmoid Belief Net
Buffet process Deep auto-regressive
Hierarchical Dirichlet networks (DARN)
process

Deep Nonparametric Discrete Deep Parametric Discrete

Non-parametric
Indian buffet process Deep Gaussian
Dirichlet process Deep processes
mixture Discrete Recurrent Gaussian
Process
GP State space
Direct Nonparametric Discrete
model
Deep Nonparametic
Nonlinear factor
Continuous
Hidden Markov Model analysis
Discrete LVM Nonlinear Gaussian
Sparse LVMs Continuous belief network
Direct/
Deep Latent Gaussian
Linear
(VAE, DRAW)
Linear Parametric Discrete Parametric Deep Parametric Continuous

PCA, factor analysis Gaussian process


Independent LVM
components analysis
Gaussian LDS
Latent Gauss Field
Linear Parametric Continuous Direct Nonparametric Continuous
Implicit Models
Change of variables for invertible
functions
z⇠
- 1 z Implicit models
p(z)
p(x) = p(z) det@z Transform an unobserved
μ @f f(z) noise source usin
g a parameterised
x function.
r

x = µ + Rz
R
Generator
Networks

z ⇠ N (0, I )
x = f (z; ✓)
The transformation function is parameterised by a linear or
deep net work (fully-connected, convolutional or recurrent).
Properties
+ Easy sampling, and natural to specify.
+ Easy to compute expectations without knowing final distribution.
+ Can exploit with large-scale classifiers and convolutional
- net works. Difficult to satisfy constraints: Difficult to maintain
invertibility, and challenging optimisation.
- Lack of noise model (likelihood):
- Difficult to extend to generic data types
- Difficult to account for noise in observed
- data.

Hard to compute marginalised likelihood for model scoring,


comparison and selection.
Bedrooms

Convolutional
generative
adversarial network
Stochastic One-liners and

z
Differential Equations inverse sampling
Hamiltonian and Distrib. warping
Langevin SDE Normalising flows
Diffusion Models GAN generator nets
Non- and volume Non- and volume
preserving flows preserving transforms

Diffusions
Continuous Functions
(z) Discrete time

ftime
Shakir
Model-Inference-Algorithm
z z ~ q(z | x)
z
Model Inference
p(x |z) Network
q(z |x) Generator

x r eal x gen

x ~ p(x | z)
Data x

Prescribed latent Implicit latent variable


variable model models and
s and variational estimation- by-
inference comparison

Variational Autoencoders Generative Adversarial


(VAEs) Networks (GANs)
Shakir Mohamed
Model Evidence
Model evidence (or marginal likelihood, partition function):
Integrating out any global and local variables enables z
model scoring, comparison, selection, moment estimation,
f(z)
normalisation, posterior computation and prediction.

We take steps to improve the model x


evidence for given data samples.

Learning principle: Model Evidence


Z
p(z) q(z) f
(z)

p(x) = p(x,
z
z)dz
Integral is intractable in gener Basic idea:
al and requires Transform the integral int o
approximation. an expectation over a simple,
known distribution.
Shakir Mohamed
Variational Inference
Z
p(x) = p(x,
z)dz

F (x, q) = Eq( z ) [log p(x|z)] -


K L [q(z)kp(z)]
This bound is exactly of the form we are looking for.
• Variational free energy: We obtain a functional and are free to
choose the distribution q(z) that best matches the true posterior.
• Evidence lower bound (ELBO): principled bound on the
marginal likelihood, or model evidence.
• Certain choices of q(z) makes this quantity easier to
compute. Examples to come.

Identity Bounding

Shakir Mohamed
Variational Methods
Variational Principle
General family of methods for approximating
complicated densities by a simpler class of
densities.

K L [q(z|y)kp(z| Approximation class


y)]

True posterior


(z)
Determi
nistic approximation procedures
with
bounds
Shakir on probabilities
Mohamed of interest.
Variational Bound
Interpreting the bound:

F (x, q) = Eq( z ) [log p(x|z)] - K L [q(z)kp(z)]


Approx. Posterior Reconstruction Penalty

• Approximate posterior distribution q(z): Best match to true


posterior p(z|y), one of the unknown inferential quantities of interest to
us.
• Reconstruction cost: The expected log-likelihood measure how
well samples from q(z) are able to explain the data y.

• Penalty: Ensures the the explanation of the data q(z) doesn’t deviate
too far from your beliefs p(z). A mechanism for realising Okham’s
razor.
Shakir Mohamed
Variational Bound
F (x, q) = Eq( z ) [log p(x|z)] - K L [q(z)kp(z)]
Approx. Posterior Reconstruction Penalty

Some comments on q:
• Integration is now optimisation: optimise for q(z) directly.
• I write q(z) to simplify the notation, but it depends on the data, q(z|x).
• Easy convergence assessment since we wait until the free energy
(loss) reaches convergence.
• Variational parameters: parameters of q(z)
• E.g., if a Gaussian, variational parameters are mean and variance.
• Optimisation allows us to tighten the bound and get as close as
possible to the true marginal likelihood.

Shakir Mohamed
Real
Posteriors
Require flexible approximations for the types of posteriors we are likely to
see.

Shakir Mohamed
Mean-Fields
Mean-field methods assume that the distribution is
factorised.

True Fully-factorised
Posterior

z2 z2

z1 z3 z1 z3

Most Expressive Least Expressive

q⇤(z|x) / p(x| Y
qM F (z|x) = q(zk
z)p(z) k
)
Restricted class of approximations: every dimension (or
subset of dimensions) of the posterior is independent.
Shakir Mohamed
Structured Mean-
field
Structured mean-field: introduce dependencies into our
factorisation.

True Structured Approx. Fully-factorised


Posterior
z2 z2 z2

z1 z3 z1 z3
z1 z3

Most Expressive Least Expressive

q⇤(z|x) / p(x| Y Y
q(z) = qk (zk | qM F (z|x) =
z)p(z) k k
{zj } j 6= k ) q(zk )
Shakir Mohamed
Latent Gaussian
Models Examples: GP regression or DLGM.
Probabilistic z ⇠ N (z|0, 1) y ⇠ p(y|f ✓
Model
(z)) Y
Mean-field approx
q(z) = N (zi |µi , i
i
o2 )
Variational F (y, q) = Eq( z ) [log p(y|z)] - K L [ q(z)kp(z)]
bound
X
F (y, q) = Eq( z ) [log p(y|z)] - i
K L [ q(zi )kp(zi )]
X i
F (y, q) = Eq( z ) [log p(y|z)] - K L [ N (zi |µi , a-2 )kN (zi |
i
0, 1)]
1 X (- 2 + µ2 - 1 - ln -2
F (y, q) = Eq( z ) [log p(y|f ✓ (z))] - i i
i
i )
2
Shakir Mohamed
Families of
Approximations
True Families of Posterior Approximations Fully-factorised
Posterior
Normalising Structured mean-field Covariance models
flows
+

z2
zK
z1 z2 z3 …
z2

z4
Mixtures
z1
z1 z3 Auxiliary
z variables
p(z)
y z1 z3

z0

x ⍵ z1 z2 z3
x p(x |z) r (! |x,
z)

Most Expressive Least Expressive

q⇤(z|x) / p(x| Y
qM F (z|x) =
z)p(z) k
q(zk )
Shakir Mohamed
Variational
Optimisation
F (x, q) = Eq( z ) [log p(x|z)] -
K L [q(z)kp(z)]
Approx. Posterior Reconstruction Penalty
2 2
(a) (b)

• Variational EM τ

1
τ

• Stochastic Variational
Inference Doubly Stochastic
0 0

• −1 0 µ 1 −1 0 µ 1
2 2
(c) (d)

Variational Inference τ

1
τ

• Amortised Inference
0 0
−1 0 µ 1 −1 0 µ 1

Shakir Mohamed
Variational EM
F (x, q) = Eq( z ) [log p(x|z)] -
K L [q(z)kp(z)]
Alternating optimisation for log
p(x)
the variational parameter K L [q|| …
p⇤] …
s and then model parameters F (x,
(VEM). q)

Repeat:
Initialisation t=1 Convergence

E-step ¢/ r ¢ F (x, Var. params


M- q) Model
step ✓/ r ✓ F (x, params
E M
q)

Shakir Mohamed
Amortised Inference
Repeat:
E-step (compute q)
Instead of solving for every
For i = 1, … N
observation, amortise using a model.
¢ n / r ¢ Eq¢ ( z ) [log p✓ (x n |zn )] - r
z ~ q(z | x)
¢ K L [q(zn )kp(z)]
M-step

1 X Eq¢ ( z ) [r log p✓ (x n |
✓/ ✓
N n zn )] Inference
• Inference network: q is an encoder, an inverse model, Network
recognition model. q(z |x)
• Parameters of q are now a set of global parameters
used for inference of all data points - test and train.
• Amortise (spread) the cost of inference over all data.
• Joint optimisation of variational and model parameters.

Inference networks provide an efficient mechanism for


Data x
posterior inference with memory

Shakir Mohamed
Stochastic Gradients
Z
r ¢ Eq¢ ( z) [f ✓ (z)] = r q¢ f✓ dz
(z) (z)

Doubly stochastic
estimators
Pathwise Estimator
When easy to use transformation Score-function
is estimator When function f non-
available and differentiable differentiable and q(z) is easy to
= Ep( ✏) [r f ✓ (g(✏,
function f.
¢ sample from.

¢z))]

z⇠
p(z) = Eq( z ) [f ✓ (z)r ¢ log q¢
(z))]
μ

qcp(z) ✏⇠
z = g(✏, p(✏) r ✓

¢)
Reparameterisation
R
x = µ + Rz Identity Log-derivative

Shakir Mohamed
Variational
Autoencoder
F (x, q) = E [log p(x|z)] -
q( z )
z z ~ q(z | x)

K L [q(z)kp(z)] Model
p(x |z)
Approx. Posterior Reconstruction Penalty Inference
Stochastic encoder-decoder system Network
q(z |x)
to implement variational inference.

• Model (Decoder): likelihood p(x|z).


• Inference (Encoder): variational distribution q(z|x)
x ~ p(x | z)
• Transforms an auto-encoder into a generative Data x

model
Specific combination of variational inference in latent
variable models using inference networks
Variational Auto-encoder
But don’t forget what your model is, and what inference you
use.
Shakir Mohamed
Learning by
Comparison
For some models, we only have access to an
unnormalised probability or partial knowledge of the Basic idea:
Transform into
distribution.
learning a model of
z q(x) p*(x) the density ratio.

f(z) We compare the


estimated distribution q(x) to
x the true distribution
p*(x)

Ratios p(x (1)


)
Learning principle: Two-sample
tests ⇤
p(x ( 2 ) )

p (x
)q(x) = 1 p⇤(x) =
q(x)
Interest is not in estimating the marginal probabilities, only in how they are
related.
Shakir Mohamed
Estimation by
Comparison
Density Estimation
Two steps by Comparison
1. Use a hypothesis
H 0 : p⇤ = q✓ vs.
test or comparison to
p⇤ 6= q✓
obtain some model to
L (✓, ¢ )
tells how data from our
model differs from
observed data.

Density Density Ratio


2. Adjust model to ⇤
p
better match the data Difference r¢ = q
distribution using the r ¢ = p⇤ -
q✓

comparison model Mixtures with
identical moments Bf

from step 1. [r ⇤kr ]

Max Mean Moment Bregman Class Probability


Discrepenc Matching Divergenc f-Divergence
Estimation
y e
f (u) = u log u - (u + 1) log(u +
1)
Shakir Mohamed
Adversarial Learning
p⇤(x) p(y = 1| p(y = + 1|x) = D ✓ p(y = - 1|x) = 1 - D ✓ Scoring Function
x)
q(x) = p(y = - 1|x) (x) (x)
F (x, ✓, ¢ ) = Ep⇤ ( x ) [log D ✓ (x)] + Eq¢ ( x ) [log(1 - Bernoulli Loss
D ✓ (x)]
Generative Adversarial
Networks
Alternating min max F (x, ✓,
optimisation ¢)

Comparison loss ✓ / r ✓ Ep ⇤ ( x ) [l og D ✓ ( x )] + r ✓ Eq¢ ( x ) [l og( 1 - D ✓
Generative ¢ / - r ¢ Eq( z ) [log(1 - D ✓ (f ¢
(loss
x)]
(z))]
Instances of testing and z z ⇠ p(z)
inference: x ge n = f ¢ (z)
f(z)
• Unsupervised-as-supervised learning
• Classifier ABC x gen x obs
• Noise-contrastive estimation
• Adversarial learning and GANs
Density-ratio Reparameterisation
Shakir Mohamed
Method of Moments
f ¢ (x)

Moment estimator Tangent of posterior
hl (h l - 1 ; odds.
¢l )
r f ¢ l (x)
Moment Vector

¢
¢l

h 1 (x ; ¢ 1 )
r f ¢ 1 (x)
!
<

¢
¢1

LG LM • Consistent estimators: the number of


(✓) (¢) moments is greater than the number of
z model parameters.
Moment Network • Features should not be not co-linear.
Model g✓ f¢ (x) • More stable than adversarial training.
(z) • Does not require frequent updating of
the classifier.
x gen x r eal

Shakir Mohamed
Convergent
Approaches

Marginal KL Information

• Replace density ratios by classifiers, replace posteriors with implicit models,


• Views from optimal transport and connections to integral probability
metrics.
Shakir Mohamed
Environments and
Rewards
Environment as a generative
Action Prior
process:
p(a)
An unknown likelihood;
Not known analytically;
Only able to observe
its outcomes.
Prior over actions

Interaction only
Environmen
t or Model
Long-term reward p(R(s,a))

All the key inferential questions can now be asked in this simple
framework.
Shakir Mohamed
Planning-as-Inference
Simplest question
What is the posterior distribution over actions?
Maximising the probability of the return log
p(R).

Variational inference in the hierarchical model


Action Prior
p(a)

Recover policy search methods:


Uniform prior over distributions Environment
Continuous policy parameters or Model
Can evaluate environment, but not p(R(s,a))
differentiate.
Shakir Mohamed
Policy Search
Free Energy

Policy gradient using score-function gradient

Action Prior
Appearance of the entropy penalty is natur
p(a)
al and alternative priors easy to consider.
Can easily incorporate prior knowledge of the action
space.
Use any of the tools of probabilistic inference Environmen
available. Easily handle stochastic and deterministic t or Model
policies. p(R(s,a))
Identity Log-derivative

Shakir Mohamed
Hierarchical Planning
Prior
p(z|x) Action
Inference
q(a1..T|z)
log p(z|x)
Action Prior
p(a1..T|z)

z -Inference
q(z |x)
log p(a1..T|z)
Environment
or Model
p(R |a, x)
Data x

log p(R|a, x)

Variational
F (✓) = Eq( a,z | x ) [R(a|x)]MDP

- ↵K L [q✓ (z|x)kp(z|x)] + ↵H[⇡ ✓
(a|z)]
Shakir Mohamed
With a more realistic expansion as
i on graphical Exte model
Action Prior
rn a
DerivelEn Bellman’s
viron equation as p(a)
ment
a
different writing of message
Obse
passing. S e ns
rvat i
on/
at i o n
Application
Inte rnal of the EM Environmen
En viron t
algorithm
policy search for mbecomes
en Ag e n
t

Optio
n
KB possible.
c
Easily consider
State
t
other variational
Environment
or Model
Repr
.
methods, like
Criti

EP.
Both model-free and model-based p(R(s,a))
S t at e
Plmethods
an n e emerge.Embedd
r i ng

Shakir Mohamed
Final Words
K L [q(z|y)kp(z|y)] Approximation class
z
True posterior

f(z)
q¢ (z)

x
z z ~ q(z | x)
z
Model Inference
p(x |z) Network
q(z |x)
Generator

x r eal x gen
x ~ p(x | z)
Data x

Shakir Mohamed
Build Communities

Shakir Mohamed
shakirm.com/
feedback

Planting the Seeds


of Probabilistic
ThinkingAlgorithms
Foundations | Tricks |

Shakir Mohamed
Research Scientist, DeepMind

@shakir_za
Not exhaustive list, and many references to be
updated.
Probabilistic Thinking
Cheeseman, P.C., 1985, August. In Defense of Probability. In IJCAI (Vol. 2, pp. 1002-1009).
Murphy, K.P., 2012. Machine Learning: A Probabilistic Perspective.
Efron, B., 1982. Maximum likelihood and decision theory. The annals of Statistics, pp.340-356.

Stochastic Optimisation
P L’Ecuyer, Note: On the interchange of derivative and expectation for likelihood ratio derivative estimators, Management
Science, 1995
Peter W Glynn, Likelihood ratio gradient estimation for stochastic systems, Communications of the ACM,
1990 Michael C Fu, Gradient estimation, Handbooks in operations research and management science, 2006
Ronald J Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine
learning, 1992
Paul Glasserman, Monte Carlo methods in financial engineering, , 2003
Luc Devroye, Random variate generation in one line of code, Proceedings of the 28th conference on Winter simulation,
1996
L. Devroye, Non-uniform random variate generation, , 1986
Omiros Papaspiliopoulos, Gareth O Roberts, Martin Skold, A general framework for the parametrization of hierarchical
models, Statistical Science, 2007
Michael C Fu, Gradient estimation, Handbooks in operations research and management science, 2006
Ranganath, Rajesh, Sean Gerrish, and David M. Blei. "Black Box Variational Inference." In AISTATS, pp. 814-822. 2014.
Mnih, Andriy, and Karol Gregor. "Neural variational inference and learning in belief networks." arXiv preprint
arXiv:1402.0030 (2014).
Lázaro-Gredilla, Miguel. "Doubly stochastic variational Bayes for non-conjugate inference." (2014).
Wingate, David, and Theophane Weber. "Automated variational inference in probabilistic programming." arXiv preprint
arXiv: 1301.1299 (2013).
Paisley, John, David Blei, and Michael Jordan. "Variational Bayesian inference with stochastic search." arXiv preprint
arXiv: 1206.6430 (2012).

Shakir Mohamed
• Applications of Deep Generative Models
• Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. "Stochastic backpropagation and approximate inference in
deep generative models.” ICML 2014
• Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes.” ICLR 2014
• Gregor, Karol, et al. "Towards Conceptual Compression." arXiv preprint arXiv:1604.08772 (2016).
• Eslami, S. M., Heess, N., Weber, T., Tassa, Y., Kavukcuoglu, K., & Hinton, G. E. (2016). Attend, Infer, Repeat: Fast
Scene Understanding with Generative Models. arXiv preprint arXiv:1603.08575.
• Oh, Junhyuk, Xiaoxiao Guo, Honglak Lee, Richard L. Lewis, and Satinder Singh. "Action-conditional video prediction using
deep networks in atari games." In Advances in Neural Information Processing Systems, pp. 2863-2871. 2015.
• Rezende, Danilo Jimenez, Shakir Mohamed, Ivo Danihelka, Karol Gregor, and Daan Wierstra. "One-Shot Generalization in
Deep Generative Models." arXiv preprint arXiv:1603.05106 (2016).
• Rezende, Danilo Jimenez, S. M. Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess.
"Unsupervised Learning of 3D Structure from Images." arXiv preprint arXiv:1607.00662 (2016).
• Kingma, Diederik P., Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. "Semi-supervised learning with deep
generative models." In Advances in Neural Information Processing Systems, pp. 3581-3589. 2014.
• Maaløe, Lars, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. "Auxiliary Deep Generative Models." arXiv
preprint arXiv:1602.05473 (2016).
• Odena, Augustus. "Semi-Supervised Learning with Generative Adversarial Networks." arXiv preprint arXiv:1606.01583 (2016).
• Springenberg, Jost Tobias. "Unsupervised and Semi-supervised Learning with Categorical Generative Adversarial Networks."
arXiv preprint arXiv:1511.06390 (2015).
• Blundell, Charles, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z. Leibo, Jack Rae, Daan Wierstra, and
Demis Hassabis. "Model-Free Episodic Control." arXiv preprint arXiv:1606.04460 (2016).
• Higgins, Irina, Loic Matthey, Xavier Glorot, Arka Pal, Benigno Uria, Charles Blundell, Shakir Mohamed, and Alexander
Lerchner. "Early Visual Concept Learning with Unsupervised Deep Learning." arXiv preprint arXiv:1606.05579 (2016).
• Bellemare, Marc G., Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. "Unifying Count-
Based Exploration and Intrinsic Motivation." arXiv preprint arXiv:1606.01868 (2016).
• Alexander (Sasha) Vezhnevets, Mnih, Volodymyr, John Agapiou, Simon Osindero, Alex Graves, Oriol Vinyals, and Koray Kavukcuoglu.
"Strategic Attentive Writer for Learning Macro-Actions." arXiv preprint arXiv:1606.04695 (2016).
• Gregor, Karol, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. "DRAW: A recurrent neural network for image
generation." arXiv preprint arXiv:1502.04623 (2015).

Shakir Mohamed
• Fully-observed M odels
• Oord, Aaron van den, Nal Kalchbrenner, and Koray Kavukcuoglu. "Pixel recurrent neural networks." arXiv preprint arXiv:1601.06759
• (2016). Larochelle, Hugo, and Iain Murray. "The Neural Autoregressive Distribution Estimator." In AISTATS, vol. 1, p. 2. 2011.
• Uria, Benigno, Iain Murray, and Hugo Larochelle. "A Deep and Tractable Density Estimator." In ICML, pp. 467-475. 2014.
• Veness, Joel, Kee Siong Ng, Marcus Hutter, and Michael Bowling. "Context tree switching." In 2012 Data Compression Conference, pp. 327-
336. IEEE, 2012.
• Rue, Havard, and Leonhard Held. Gaussian Markov random fields: theory and applications. CRC Press, 2005.
• Wainwright, Martin J., and Michael I. Jordan. "Graphical models, exponential families, and variational inference." Foundations and
Trends® in Machine Learning 1, no. 1-2 (2008): 1-305.

• Implicit Probabilistic M odels


• Tabak, E. G., and Cristina V. Turner. "A family of nonparametric density estimation algorithms." Communications on Pure and Applied
Mathematics 66, no. 2 (2013): 145-164.
• Rezende, Danilo Jimenez, and Shakir Mohamed. "Variational inference with normalizing flows." arXiv preprint arXiv:1505.05770 (2015).
• Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio.
"Generative adversarial nets." In Advances in Neural Information Processing Systems, pp. 2672-2680. 2014.
• Verrelst, Herman, Johan Suykens, Joos Vandewalle, and Bart De Moor. "Bayesian learning and the Fokker-Planck machine." In Proceedings
of the International Workshop on Advanced Black-box Techniques for Nonlinear Modeling, Leuven, Belgium, pp. 55-61. 1998.
• Devroye, Luc. "Random variate generation in one line of code." In Proceedings of the 28th conference on Winter simulation, pp. 265-272.
IEEE Computer Society, 1996.
• Ravuri, S., Mohamed, S., Rosca, M. and Vinyals, O., 2018. Learning Implicit Generative Models with the Method of Learned Moments. arXiv
preprint arXiv:1806.11006.

Shakir Mohamed
Latent variable models
Dayan, Peter, Geoffrey E. Hinton, Radford M. Neal, and Richard S. Zemel. "The helmholtz machine." Neural computation 7,
no. 5 (1995): 889-904.
Hyvärinen, A., Karhunen, J., & Oja, E. (2004). Independent component analysis (Vol. 46). John Wiley & Sons.
Gregor, Karol, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. "Deep autoregressive networks." arXiv
preprint arXiv:1310.8499 (2013).
Ghahramani, Zoubin, and Thomas L. Griffi ths. "Infinite latent feature models and the Indian buffet process." In Advances in
neural information processing systems, pp. 475-482. 2005.
Teh, Yee Whye, Michael I. Jordan, Matthew J. Beal, and David M. Blei. "Hierarchical dirichlet processes." Journal of the
american statistical association (2012).
Adams, Ryan Prescott, Hanna M. Wallach, and Zoubin Ghahramani. "Learning the Structure of Deep Sparse Graphical Models."
In AISTATS, pp. 1-8. 2010.
Lawrence, Neil D. "Gaussian process latent variable models for visualisation of high dimensional data." Advances in
neural information processing systems 16.3 (2004): 329-336.
Damianou, Andreas C., and Neil D. Lawrence. "Deep Gaussian Processes." In AISTATS, pp. 207-215. 2013.
Mattos, César Lincoln C., Zhenwen Dai, Andreas Damianou, Jeremy Forth, Guilherme A. Barreto, and Neil D. Lawrence.
"Recurrent Gaussian Processes." arXiv preprint arXiv:1511.06644 (2015).
Salakhutdinov, Ruslan, Andriy Mnih, and Geoffrey Hinton. "Restricted Boltzmann machines for collaborative
filtering." In Proceedings of the 24th international conference on Machine learning, pp. 791-798. ACM, 2007.
Saul, Lawrence K., Tommi Jaakkola, and Michael I. Jordan. "Mean field theory for sigmoid belief networks." Journal of
artificial intelligence research 4, no. 1 (1996): 61-76.
Frey, Brendan J., and Geoffrey E. Hinton. "Variational learning in nonlinear Gaussian belief networks." Neural Computation 11,
no. 1 (1999): 193-213.

Shakir Mohamed
Inference and Learning
Jordan, Michael I., Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. "An introduction to variational methods
for graphical models." Machine learning 37, no. 2 (1999): 183-233.
Hoffman, Matthew D., David M. Blei, Chong Wang, and John William Paisley. "Stochastic variational inference." Journal of
Machine Learning Research 14, no. 1 (2013): 1303-1347.
Honkela, Antti, and Harri Valpola. "Variational learning and bits-back coding: an information-theoretic view to Bayesian
learning." IEEE Transactions on Neural Networks 15, no. 4 (2004): 800-810.
Burda, Yuri, Roger Grosse, and Ruslan Salakhutdinov. "Importance weighted autoencoders." arXiv preprint
arXiv:1509.00519 (2015).
Li, Yingzhen, and Richard E. Turner. "Variational Inference with R\ 'enyi Divergence." arXiv preprint arXiv:1602.02311
(2016). Borgwardt, Karsten M., and Zoubin Ghahramani. "Bayesian two-sample tests." arXiv preprint arXiv:0906.4032
(2009). Gutmann, Michael, and Aapo Hyvärinen. "Noise-contrastive estimation: A new estimation principle for
unnormalized statistical models." AISTATS. Vol. 1. No. 2. 2010.
Tsuboi, Yuta, Hisashi Kashima, Shohei Hido, Steffen Bickel, and Masashi Sugiyama. "Direct Density Ratio Estimation for
Large- scale Covariate Shift Adaptation." Information and Media Technologies 4, no. 2 (2009): 529-546.
Sugiyama, Masashi, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge
University Press, 2012.

Amortised Inference
Gershman, Samuel J., and Noah D. Goodman. "Amortized inference in probabilistic reasoning." In Proceedings of the 36th
Annual Conference of the Cognitive Science Society. 2014.
Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. "Stochastic backpropagation and approximate inference in
deep generative models." arXiv preprint arXiv:1401.4082 (2014).
Heess, Nicolas, Daniel Tarlow, and John Winn. "Learning to pass expectation propagation messages." In Advances in
Neural Information Processing Systems, pp. 3219-3227. 2013.
Jitkrittum, Wittawat, Arthur Gretton, Nicolas Heess, S. M. Eslami, Balaji Lakshminarayanan, Dino Sejdinovic, and Zoltán
Szabó. "Kernel-based just-in-time learning for passing expectation propagation messages." arXiv preprint arXiv:1503.02551
(2015). Korattikara, Anoop, Vivek Rathod, Kevin Murphy, and Max Welling. "Bayesian dark knowledge." arXiv preprint
arXiv:1506.04416 (2015).

Shakir Mohamed

You might also like