Professional Documents
Culture Documents
Ghahramani Probabilistic Method
Ghahramani Probabilistic Method
Zoubin Ghahramani1,2
1
University of Cambridge
2
Uber AI Labs
zoubin@eng.cam.ac.uk
http://mlg.eng.cam.ac.uk/zoubin/
https://www.uber.com/info/ailabs/
Microsoft AI Summer School
Cambridge 2017
An exciting time for Machine Learning!
Zoubin Ghahramani 2 / 51
A PPLICATIONS OF M ACHINE L EARNING
Zoubin Ghahramani 3 / 51
Computer Games
Zoubin Ghahramani 4 / 51
ML: MANY METHODS WITH MANY LINKS TO
OTHER FIELDS
Kernel Methods,
SVMs, Gaussian
Reinforcement Learning Processes Symbolic Systems,
Control Theory Logic-based (ILP),
Decision Making Relational Learning
Probabilistic Models
Bayesian Methods
Neural Networks
Deep Learning Graphical Models
Causal Inference
Decision Trees
Unsupervised Learning: Random Forests
Feature Discovery
Clustering Linear Statistical Models
Dim. Reduction Logistic Regression
Zoubin Ghahramani 5 / 51
Neural networks and deep learning
Zoubin Ghahramani 6 / 51
N EURAL NETWORKS
Neural networks are tunable nonlinear func-
y tions with many parameters.
outputs
Zoubin Ghahramani 9 / 51
Beyond deep learning
Zoubin Ghahramani 10 / 51
T OOLBOX VS MODELLING VIEWS OF M ACHINE
L EARNING
Zoubin Ghahramani 11 / 51
M ACHINE L EARNING AS
P ROBABILISTIC M ODELLING
Zoubin Ghahramani 12 / 51
BAYES RULE
P(hypothesis)P(data|hypothesis)
P(hypothesis|data) = P
h P(h)P(data|h)
Zoubin Ghahramani 13 / 51
O NE SLIDE ON BAYESIAN MACHINE LEARNING
Everything follows from two simple rules:
P
Sum rule: P(x) = y P(x, y)
Product rule: P(x, y) = P(x)P(y|x)
Learning:
Prediction: Z
P(x|D, m) = P(x|θ, D, m)P(θ|D, m)dθ
Model Comparison:
P(D|m)P(m)
P(m|D) =
P(D)
Zoubin Ghahramani 14 / 51
W HY SHOULD WE CARE ?
Zoubin Ghahramani 15 / 51
W HAT DO I MEAN BY BEING BAYESIAN ?
Zoubin Ghahramani 16 / 51
BAYESIAN D EEP L EARNING
Bayesian deep learning can be implemented in many ways:
I Laplace approximations (MacKay, 1992)
I variational approximations (Hinton and van Camp, 1993; Graves, 2011)
I MCMC (Neal, 1993)
I Stochastic gradient Langevin dynamics (SGLD; Welling and Teh, 2011)
I Probabilistic back-propagation (Hernandez-Lobato et al, 2015, 2016)
I Dropout as Bayesian averaging (Gal and Ghahramani, 2015, 2016)
Zoubin Ghahramani 18 / 51
W HEN IS THE PROBABILISTIC APPROACH
ESSENTIAL ?
I Decision making
I Data compression
Zoubin Ghahramani 19 / 51
W HY DOES U BER CARE ?
Many aspects of learning and intelligence depend crucially on
the careful probabilistic representation of uncertainty:
I Forecasting, Decision making, Learning from limited, noisy, and
missing data, Learning complex personalised models, Data
compression, Automating modelling, discovery, and experiment
design
Holdout negative lo
the patterns in the data that may not be captured by the model.
• Composite kernels can express many types of structure • . . . so one kernel is described by a noun phrase, and the others modify it
This analysis was automatically generated
An automatic report Gaussian
Modelling structure through
for the process
datasetkernels : 10-sulphuric• Text descriptions
SE ⇥ Lin + Per
Automatically
4.1areModerately
+ SE
describing
complemented
model properties
statistically
by plots significant
of the posteriordiscrepancies 0.3 MFSp M=50
Lin ⇥ Lin SE ⇥ Per Lin + Per SE + Per becomes (after simplification)
An automatic report which for the dataset : 03-mauna 4.1.1 Component 2 : An approximately periodic function with a period of 1.0 years MFSp M=200
local variation repeating structure
• The kernel specifies linear functions
structures are likely under the GP
- which determines the generalisation properties of the model.
• Composite kernels can express many types TheofAutomatic
structureStatistician
prior
Kernels How to automatically
(SE can ⇥ Lin)be distributed
The
+ (SE
describeinto
following
⇥ Per)
arbitrarily
+a(SE).
sum
discrepancies
• The kernel is distributed into a sum of products
been detected.
complex kernels:
of products
between the prior and posterior distributions for this component have (w/ Hensman & Matthews, AISTATS 2015)
EPFitc M=4
Example Probabilistic0.2
Squared Periodic Linear ⇥ Lin
• Sums of kernels are sums ofSE Per +
+ product SE separately
Exponential (SE)
+ Per Sums of kernels becomes
correspond
(Per)
functions so each
(Lin)
is described
• The
• Each kernel in a product qq the
modifies plot hasinan
model unexpectedly
a consistent way. . . large positive deviation from equality (x = y). This
Program for a Hidden Markov Model (HMM) EPFitc M=50
Lin ⇥ Lin SE ⇥ Per LinThe Per
+ Automatic SE (afterto sums of functions
periodicAn automatic
periodic report
Abstractfor the dataset : 02-solar • . . . so one kernel is described
Statistician simplification)
quadratic locally discrepancy has anand
estimated modify it of 0.049.
the others p-value
Automatic
express SE1 ⇥ SE2
many typesStatistician
of structure
= Kernels can plot or +
light negative
be distributed tails
into a sum if it+occurs as the left.
of products 0.1
This report was produced by the Automatic Bayesian Covariance Discovery Sums of kernels correspond to +sums
Per +of
SE functions
(ABCD)model
The raw data and
200
f (x ) +f full
algorithm.
(x ) f (xposterior withlocally
quadratic
,x ) are shownIfinwithin
extrapolationsperiodic fperiodic
1(x) 1.⇠ is gp(0, (x)This ⇠
anda sinusoid.
k1resembles
) entire f2signal
This component is approximately periodic with a period of 10.8 years. Across periods the shape of
gp(0, kSE ) then
this function varies smoothly with a typical lengthscale of 36.9 years. The shape of this function
figure
each period very smooth and component 21643
applies until
⇥and f (x)+f2(x)
Lin1
⇠ gp(0, k1 +k2).
SE ⇥ Per
Sums of kernels correspond to sums of functions
SE
Time (seconds)
Categorical([0.15, 0.15, 0.7])] # Trans distr for each state.
1 1 2 2 1 2
functions periodic with trend with noise
from 1716 onwards.
variation amplitude Therefore, a sum of kernels can be described as a sum of+independent
periodicityfunctions.
150
smooth
• stochasticFigure
optimisation of variational parameters
methods onand inducing
dataset. point locations
This component explains 71.5% of the residual variance; this increases the total variance explained trend + short-term deviation
100
1 Executive summary Raw data from 72.8% to 92.3%. The addition of this component reduces the cross validated
= MAE by 16.82% + + data = [Nil, 0.9, 0.8, 0.7, 0, ‐0.025, ‐5, ‐2, ‐0.1, 0, 0.13]
40
increasing 10
1960
growing
1965 1970 1975 1980 1985 1990 1995 0.4
f1(x) ⇠ roughly
gp(0, ) andsmooth
k1200 ⇠ gp(0,
f2(x)trend an adjective
then +
f1(x)+f 2(x) ⇠ gp(0, k1 +k2).
f1(x1) +f2(x2) f (x1, x2)
1361.5
variation
0
−10
amplitude
0
−0.2
Therefore, a sum of kernels can be described as a sum of independent functions.
1361
100 @model hmm begin # Define a model hmm.
Figure 1: Raw data (left) and model posterior with extrapolation (right)
Raw data −0.4
1360.5
1362
that the test statistic was larger in magnitude under the posterior compared to the prior unexpectedly
0
adjective 0.4
@assume(states[1] ~ initial)
error (%)
−200 −150 −100 −50 0 50 100 150 200 250
additive components explain
Figure 1: Raw 90.5%
data of theand
(left) variation
model in the datawith
posterior as shown by the coefficient
extrapolation (right) of de-
# min min locin a product
max max loc corresponds
max min
1360.5
termination (R2 ) values in table 1. The first 8 additive components explain 99.8% of the variationPer functions
1 0.502
Each kernel
0.582 0.341 repeatroughly
0.413 0.341 0.679
to an adjective for i = 2:length(data)
0.38
We can
in the
No data. build
structure models
After the first
1360
bythea cross
6 components
1650 greedy
1700 validatedsearch
mean absolute error (MAE) does not
1750 1800
We can build models by a greedy search
1850 19002 Lin
0.802
1950 2000 2050 0.199Figure
standard 17: ACF
Kernel
0.558 How
0.630
deviation (top left),
it0.049
varies modifies periodogram
the prior (top right) and quantile-quantile (bottom left) uncertainty
0.785
linearly @assume(states[i] ~ trans[states[i‐1]])
decrease by moresearch than 0.1%. Thishassuggests thatfive subsequent terms are modellingin the data.very Theshort term
The structure
trends, uncorrelated
component explainsnoise
algorithm
or areofartefacts
98.6%
identified
of the model
the variation
data (left)inand
additive components
themodel or search
data asposterior
shown procedure.
by Short summaries
theextrapolation
coefficient of(right)
3 0.251
first additive
of the
determination4 0.527
0.475plots.
0.503 0.504
Kernel How
0.799
SE
The blue it modifies
0.447
functions
0.481
andtheshading
line0.534
change
0.430
prior
0.769 are the pointwise mean and 90% confidence interval of the plots
0.616smoothly 0.36 We have presented two novel variational bounds for
0.477under the prior distribution for component 2. The green line and green dashed lines are the corre- @observe(data[i] ~ Normal(statesmean[states[i]], 0.4))
Figure 1: Raw with SE functions change smoothly
additive
(R2 ) components
values in table are 1.asThe
follows:
first 2 additive components explain 99.9% of the variation in the5data. 0.493 0.503 0.487 0.518 0.381
Per functions
Per functions repeat
repeat
SE After the first 3 componentsNo
more• Lin
A very
than smooth
0.1%. This function.
suggests
thestructure
cross validated meanNoabsolute
Per
that subsequent termseight
structureerror (MAE) does
are modelling Tableshort
very
Figurenot
2: Model
term
decrease
9: Pointwise
checking
byof residuals after addingsponding
posterior
statistics for each component.
Lin
component
Lin 4
quantities under
Cumulative
the posterior.
standard deviation varies linearly
standard deviation variesforlinearly
probabilities minimum of end 0.34 performing sparse GP classification. The first, like
The structure search algorithm has identified Example
additive components thetrends,
data. uncor-
indescription The first 4
autocorrelation offunction (ACF) and its location. Cumulative probabilities for maximum of peri-
related
A noise
components
• termination
or
constant.
• additive are
are as(R
This
components artefacts
function
follows:
2
) values
ofapplies
explain the92.3%
in table
model
from
SEThe
1.
ofor1964
the
withfirst
search untilprocedure.
variation 1990.
6 additive
in the data
ofLin
Short
1.0components
as summaries
shown by the
odogram Perand
explain
the additive
coefficient
its location.
99.7% of the variation
of de-
p-values for maximum 4.2 Model checking
and minimum plots for
deviations components
of QQ-plot without statistically significant discrepancies
from straight
@predict states 0.32 the existing GFITC method,
Scalable makes
Variational anProcess
Gaussian approximation
Classification
Lin + Per
An approximately
in
A•SE
Lin ⇥ SE• decrease
the data.function.
smooth After the
. .more
A verybysmooth . than
periodic
This
monotonically
function
firstfunction
5Lin
Lin
0.1%.
components
applies from
⇥increasing
This
a period
the cross
suggestsfunction.
Per 1969validated
Per
years.
until 1977. mean line.absolute error (MAE) doesExample
that subsequent terms are modelling SE very
|{z} Example
not
⇥ termdescription
short Per
description
Probabilistic
-log p(y)
Lin ⇥ SE + SE Lin ⇥•(SE + Per)noise.
• Uncorrelated approximately function linearly 8amplitude KLSp M=4
Uncorrelated
• A constant. This This function applies from 1643 until 1716.4.1 Moderately statistically significant discrepancies
• A rapidly varying
• Uncorrelated noise. smooth
This
• A smooth function. This function
function.
function applies from 1990
. . . applies until
onwards.
. . . 1643 and . .from 1716
. 4.1.1 onwards.2 : An approximately periodic function with a period of 1.0 years
Component
Codebeen available at github.com/jamesrobertlloyd/gpss-research
0.64 ence,0.4we necessarily introduce additional parameters KLSp M=50
Lin ⇥ SEModel SE
+Modelchecking • .statistics
checking . .statistics
An Lin
approximately
areare ⇥summarised
(SE
summarised + in
periodic Per)
function
table2 2with
intable inin.section
.section
a. period 4.4.ofThese
10.8 years. This
statistics
statistics function
have
have applieshas
Per
revealed
not revealed until chosen to act as the noun while SE and Lin modify the description KLSp M=200
The following discrepancies between the prior and posterior distributions for this component have
. . . statistically
. . .significant
any inconsistencies
• A
and. .from
between
rapidly
. the
1643 discrepancies
varying
1716modelonwards.
between
smooth
andthe data and
observed
function.
model in component
data. Code 1.
been available
detected. at github.com/jamesrobertlloyd/gpss-research anglicanHMM :: Dist [n]
0.62 (variational means and variances, or the parameters MFSp M=4
InThis 2function applies ofuntil 1643components
and from 1716 on-
Programming
TheThe
restrest
of the document
of the document is structured
is structured as follows.
as follows. In sectionsection the forms
2 the forms of the additive
the additive components anglicanHMM = fmap (take (length values) . fst) $ score (length values ‐ 1)
are are
described andandwards.
their • The assumptions
qq assumptions
plot has an unexpectedly large positive deviation from equality (x = y). This of EP0.3factors), which naturally scale linearly with N . MFSp M=50
described . posterior
.their
. posterior distributions
distributions
... are aredisplayed.
. . . In section
displayed. In section 3 the modelling
3 the modelling
at github.com/jamesrobertlloyd/gpss-research
• Uncorrelated
of each component
model. Sectiontion
noise with standard
are discussed
applies until
4 discusses model 1643
referencedeviation
and from
checking
to how this
1716 onwards.
statistics,
increasing
affects the linearly discrepancy
away
extrapolations
with plots showing the form of any detected
frommade 1837. Code
has theavailable
an
byThis estimated
func- p-value of 0.049.
0.6 2
(hmm init trans gen) where MFSp M=200
discrepancies • between
Uncorrelated the model
noise and withobserved
1
standarddata. deviation increasing linearly The positive
away from deviation
1952.inThis thefunc-
qq-plot can indicate heavy positive tails if it occurs at the right of the
plot or light negative tails if it occurs as the left.
states = [0,1,2] 10 103 104 105 Our 0.2
proposed KLSP bound outperforms the state-of- EPFitc M=4
Automatic
tion applies until 1643 and from 1716 onwards. EPFitc M=50
init = uniform states
• Uncorrelated noise. This function applies
1 from 1643 until 1716. Time (seconds)
trans 0 = fromList $ zip states [0.1,0.5,0.4] the art GFITC method on benchmark datasets, and is EPFitc M=200
Model checking statistics are summarised in table 2 in section 4. These statistics have revealed
statistically significant discrepancies between the data and model in component 8. Figure 3: Performance of the airline delay dataset.
trans 1 = fromList $ zip states [0.2,0.2,0.6] capable
0.1 of being optimized in a stochastic fashion as
1
The red horizontal line depicts performance of a linear
trans 2 = fromList $ zip states [0.15,0.15,0.7] we have100 shown, making
101 GP classification
102 applicable
103to 104
Large-scale
for several other GP models: if the likelihood factorizes
6 Discussion
Results for GP classification on 5.9 million
addNoise = flip Normal 1
in0.38Ndata
0.4
points.
error (%)
score 0 d = d , then our method is applicable through Gauss-
Figure 17: ACF (top left), periodogram (top right) and quantile-quantile (bottom left) uncertainty 0.36 We have presented two novel variational bounds for
plots. The blue line and shading are the pointwise mean and 90% confidence interval of the plots
under the prior distribution for component 2. The green line and green dashed lines are the corre-
score n d = score (n‐1) $ condition d (prob . (`pdf` (values !! n))
our randomly selected hold-out set of 100000 points, Hermite
0.34
quadrature of the log likelihood. We also note
performing sparse GP classification. The first, like
. addNoise . (!! n) . snd)
sponding quantities under the posterior.
that
0.32 2it is possible to relax the restriction
the existingof q(u)method,
GFITC to makes an approximation
Inference
this achieved an error rate of 37%, the negative-log 10 103 104 105 to the covariance matrix before introducing the non-
4.2 Model checking plots for components without statistically significant discrepancies
probability of the held-out data was 0.642. a Gaussian form, and mixture model approximations
conjugate likelihood. These approaches are somewhat
0.66
4.2.1 Component 1 : A very smooth monotonically increasing function
follow straightforwardly, allowing scalable extensions
unsatisfactory since in performing approximate infer-
-log p(y)
No discrepancies between the prior and posterior of this component have been detected 0.64 ence, we necessarily introduce additional parameters
We built a Gaussian process kernel using the sum of of0.62Nguyen and Bonilla [2014]. (variational means and variances, or the parameters
8
a Matern- 32 and a linear kernel. For each, we intro- 0.6 2
of EP factors), which naturally scale linearly with N .
duced parameters which allowed for the scaling of the 10 103 104 105 Our proposed KLSP bound outperforms the state-of-
Acknowledgements JH was supported by a
Time (seconds) the art GFITC method on benchmark datasets, and is
inputs (sometimes called Automatic Relevance Deter- MRC
Figure fellowship,
3: Performance AM delay
of the airline anddataset.
ZG by EPSRC
capable grant in a stochastic fashion as
of being optimized
mination, ARD). Using a similar optimization scheme TheEP/I036575/1,
red horizontal line depicts performance
and of a linear Focussed
a Google we have shown,Research
making GP classification applicable to
classifier, whilst the blue line shows performance of the big data for the first time.
to the MNIST data above, our method was able to ex- award. optimized KLsparse method.
stochastically
In future work, we note that this work opens the door
ceed the performance of a linear model in a few min- for several other GP models: if the likelihood factorizes
utes, as shown in Figure 3. The kernel parameters at in N , then our method is applicable through Gauss-
Automating
Hermite quadrature of the log likelihood. We also note
convergence suggested that the problem is highly non- our randomly selected hold-out set of 100000 points,
that it is possible to relax the restriction of q(u) to
this achieved an error rate of 37%, the negative-log
linear: the relative variance of the linear kernel was probability of the held-out data was 0.642. a Gaussian form, and mixture model approximations
follow straightforwardly, allowing scalable extensions
negligible. The optimized lengthscales for the Matern We built a Gaussian process kernel using the sum of of Nguyen and Bonilla [2014].
Machine Learning
part of the covariance suggested that the most useful a Matern- 32 and a linear kernel. For each, we intro-
duced parameters which allowed for the scaling of the Acknowledgements JH was supported by a
features were the time of day and time of year. inputs (sometimes called Automatic Relevance Deter- MRC fellowship, AM and ZG by EPSRC grant
mination, ARD). Using a similar optimization scheme EP/I036575/1, and a Google Focussed Research
to the MNIST data above, our method was able to ex- award.
358
ceed the performance of a linear model in a few min-
358
x
aussian process defines a distribution over functions p(f ) which can be used for
sian regression:
p(f )p(D|f )
p(f |D) =
p(D) Bayesian Deep Bayesian
f = (f (x1), f (x2), . . . , f (xn)) be an n-dimensional vector of function values
uated at n points xi 2 X . Note, f is a random variable.
Learning Optimisation
nition: p(f ) is a Gaussian process if for any finite subset {x1, . . . , xn} ⇢ X ,
marginal distribution over that subset p(f ) is multivariate Gaussian.
Zoubin Ghahramani 22 / 51
Automating Inference:
Probabilistic Programming
Zoubin Ghahramani 23 / 51
P ROBABILISTIC P ROGRAMMING
Problem: Probabilistic model development and the derivation
of inference algorithms is time-consuming and error-prone.
Zoubin Ghahramani 24 / 51
P ROBABILISTIC P ROGRAMMING
Problem: Probabilistic model development and the derivation
of inference algorithms is time-consuming and error-prone.
Solution:
I Develop Probabilistic Programming Languages for
expressing probabilistic models as computer programs that
generate data (i.e. simulators).
I Derive Universal Inference Engines for these languages
that do inference over program traces given observed data
(Bayes rule on computer programs).
Zoubin Ghahramani 24 / 51
P ROBABILISTIC P ROGRAMMING
Problem: Probabilistic model development and the derivation
of inference algorithms is time-consuming and error-prone.
Solution:
I Develop Probabilistic Programming Languages for
expressing probabilistic models as computer programs that
generate data (i.e. simulators).
I Derive Universal Inference Engines for these languages
that do inference over program traces given observed data
(Bayes rule on computer programs).
Example languages: BUGS, Infer.NET, BLOG, STAN, Church,
Venture, Anglican, Probabilistic C, Stochastic Python*, Haskell,
Turing, WebPPL ...
Example inference algorithms: Metropolis-Hastings, variational
inference, particle filtering, particle cascade, slice sampling*, particle
MCMC, nested particle inference*, austerity MCMC*
Zoubin Ghahramani 24 / 51
P ROBABILISTIC
The Modelling Interface -PHMM
ROGRAMMING
Example
K = 5; N = 201; initial = fill(1.0 / K, K)
means = (collect(1.0:K)*2-K)*2
Show video
Zoubin Ghahramani 26 / 51
Automating Optimisation:
Bayesian optimisation
Zoubin Ghahramani 27 / 51
BAYESIAN O PTIMISATION
t=3 t=4
new
observ.
Posterior
Posterior
Acquisition function
Acquisition function
next
point
Zoubin Ghahramani 28 / 51
BAYESIAN O PTIMISATION
t=3 t=4
new
observ.
Posterior
Posterior
Acquisition function
Acquisition function
next
point
Zoubin Ghahramani 28 / 51
BAYESIAN O PTIMISATION : IN A NUTSHELL
Exploration in black-box optimization
Black-box optimization
in a nutshell:
1 initial sample
2 initialize our model
3 get the acquisition
function ↵(x)
4 optimize it!
xnext = arg max ↵(x)
5 sample new data;
update model
6 repeat!
7 make
recommendation
[Močkus et al., 1978, Jones et al., 1998, Jones, 2001]
3/36
Zoubin Ghahramani 29 / 51
BAYESIAN O PTIMISATION : WHY IS IT IMPORTANT ?
t=3 t=4
new
observ.
Posterior
Posterior
Acquisition function
Acquisition function
next
point
Zoubin Ghahramani 30 / 51
BAYESIAN O PTIMISATION : WHY IS IT IMPORTANT ?
t=3 t=4
new
observ.
Posterior
Posterior
Acquisition function
Acquisition function
next
point
Zoubin Ghahramani 30 / 51
BAYESIAN O PTIMISATION : WHY IS IT IMPORTANT ?
t=3 t=4
new
observ.
Posterior
Posterior
Acquisition function
Acquisition function
next
point
Zoubin Ghahramani 30 / 51
BAYESIAN O PTIMISATION : WHY IS IT IMPORTANT ?
t=3 t=4
new
observ.
Posterior
Posterior
Acquisition function
Acquisition function
next
point
Zoubin Ghahramani 30 / 51
BAYESIAN O PTIMISATION : WHY IS IT IMPORTANT ?
t=3 t=4
new
observ.
Posterior
Posterior
Acquisition function
Acquisition function
next
point
Zoubin Ghahramani 30 / 51
BAYESIAN O PTIMISATION : WHY IS IT IMPORTANT ?
t=3 t=4
new
observ.
Posterior
Posterior
Acquisition function
Acquisition function
next
point
new
observ.
Posterior
Posterior
Acquisition function
Acquisition function
next
point
Zoubin Ghahramani 31 / 51
T HE AUTOMATIC S TATISTICIAN
Language of models Translation
Evaluation Checking
Zoubin Ghahramani 32 / 51
T HE AUTOMATIC S TATISTICIAN
Language of models Translation
Evaluation Checking
Evaluation Checking
Bayesian Bayesian
Linear Logistic
Regression Regression
Kernel Kernel
Regression Classification
GP GP
Regression Classification
Classification
Bayesian
Kernel
Zoubin Ghahramani 35 / 51
I NGREDIENTS OF
W AN AUTOMATIC STATISTICIAN
? HAT WOULD AN AUTOMATIC STATISTICIAN DO
Evaluation Checking
40 Start
20
0 SE RQ L IN P ER
−20
2000 2005 2010
SE + RQ ... P ER + RQ ... P ER × RQ
30 Start
20
10 SE RQ L IN P ER
0
2000 2005 2010
SE + RQ ... P ER + RQ ... P ER × RQ
40 Start
30
20 SE RQ L IN P ER
10
2000 2005 2010
SE + RQ ... P ER + RQ ... P ER × RQ
Start
40
20
SE RQ L IN P ER
0
2000 2005 2010
SE + RQ ... P ER + RQ ... P ER × RQ
600 600
500
500
400
400
300
300
200
200 100
100 0
1950 1952 1954 1956 1958 1960 1962 1950 1952 1954 1956 1958 1960 1962
Zoubin Ghahramani 38 / 51
E XAMPLE R EPORTS
See http://www.automaticstatistician.com
Zoubin Ghahramani 39 / 51
G OOD PREDICTIVE PERFORMANCE AS WELL
Zoubin Ghahramani 40 / 51
Automatic Statistician for Classification
Zoubin Ghahramani 25 / 50
A Sample Report
of variable ‘plasma glucose’ and a 2-way interaction between variables ‘pedigree function’ and
‘age’ (SE1 + SE3 ◊ SE4 ). This model can achieve a cross-validated error rate of 22.38% and a
negative log marginal likelihood of 353.28.
AutoGPC Data Analysis Report on Pima Dataset We have also analysed the relevance of each input variable. They are listed below in Table
1 in descending order of inferred relevance (i.e. in ascending order of cross-validated error of
the best one-dimensional GP classifier). 1.0
The Automatic Statistician
Table 1: Input variables
24th May 2016 Statistics Classifier Performance
⇡ = (f )
Dimension Variable Min Max Mean Kernel NLML Error 0.5
1 plasma glucose 44.00 199.00 121.88 smooth 381.22 25.00%
This report is automatically generated by the AutoGPC. AutoGPC is an automatic Gaussian
2 body mass index 18.20 67.10 32.47 smooth 347.18 33.71%
process (GP) classifier that aims to find the structure underlying a dataset without human input.
3 pedigree function 0.08 2.42 0.47 smooth 366.12 33.71%
For more information, please visit http://github.com/charles92/autogpc.
– Baseline – – – constant 373.79 34.39%
The training dataset spans 4 input dimensions, which are variables ‘plasma glucose’, ‘body
4 age 21.00 81.00 33.35 smooth 344.69 34.54% 0.0
mass index’, ‘pedigree function’ and ‘age’. There is one binary output variable y, where y = 0
(negative) represents ‘not diabetic’, and y = 1 (positive) represents ‘diabetic’.
In the rest of the report, we will first describe the contribution of each individual input
The training dataset contains 724 data points. Among them, 249 (34.39%) have positive 50 100 150 200
variable (Section 2). This is followed by a detailed analysis of the additive components that plasma glucose
class labels, and the other 475 (65.61%) have negative labels. All input dimensions as well as
jointly make up the best model (Section 3).
the class label assignments are plotted in Figure 1.
Figure 2: Trained classifier on variable ‘plasma glucose’.
2 Individual Variable Analysis
214.5 61.21 2.5541 87.0 First, we try to classify the training samples using only one of the 4 input dimensions. By
considering the best classification performance that is achievable in each dimension, we are
able to infer which dimensions are the most relevant to the class label assignment.
Note that in Table 1 the baseline performance is also included, which is achieved with
a constant GP kernel. The effect of the baseline classifier is to indiscriminately classify all
samples as either positive or negative, whichever is more frequent in the training data. The
classifier performance in each input dimension will be compared against this baseline to tell if
the input variable is discriminative in terms of class determination. 1.0
⇡ = (f )
Variable ‘plasma glucose’ has mean value 121.88 and standard deviation 30.73. Its observed 0.5
minimum and maximum are 44.00 and 199.00 respectively. A GP classifier trained on this
28.5 14.29 -0.1471 15.0 variable alone can achieve a cross-validated error of 25.00%.
plasma glucose body mass index pedigree function age Compared with the null model (baseline), this variable contains some evidence of class label
assignment. There is a positive correlation between the value of this variable and the likelihood
of the sample being classified as positive. The GP posterior trained on this variable is plotted in 0.0
Figure 1: The input dataset. Positive samples (‘diabetic’) are coloured red, and negative ones Figure 2.
(‘not diabetic’) blue. 0.0 0.5 1.0 1.5 2.0 2.5
2.2 Variable ‘pedigree function’ pedigree function
Variable ‘pedigree function’ has mean value 0.47 and standard deviation 0.33. Its observed
1 Executive Summary minimum and maximum are 0.08 and 2.42 respectively. A GP classifier trained on this variable Figure 3: Trained classifier on variable ‘pedigree function’.
alone can achieve a cross-validated error of 33.71%.
The best kernel that we have found relates the class label assignment to input variables ‘plasma This variable provides little evidence of class label assignment given a baseline error rate of
glucose’, ‘pedigree function’ and ‘age’. In specific, the model involves an additive combination 34.39%. The GP posterior trained on this variable is plotted in Figure 3.
1 2 3
1.0
⇡ = (f )
⇡ = (f )
0.5
0.5
⇡ = (f )
0.5
0.0
0.0
10 20 30 40 50 60 70 80 90
0.0 age 50 100 150 200
plasma glucose
Variable ‘age’ has mean value 33.35 and standard deviation 11.76. Its observed minimum and
90
maximum are 21.00 and 81.00 respectively. A GP classifier trained on this variable alone can Table 2: Classification performance of the full model, its additive components (if any), all input 80
achieve a cross-validated error of 34.54%. variables, and the baseline.
70
The classification performance (in terms of error rate) based on this variable alone is even 0.250
worse than that of the naïve baseline classifier. The GP posterior trained on this variable is Dimensions Kernel expression NLML Error 60
age
50
plotted in Figure 5.
1 SE1 381.22 25.00% 40
0.250
3, 4 SE3 ◊ SE4 422.79 30.80% 30 0.500
4 5 6
2
Obtained from [Smith et al., 1988]
Qiurui He (CUED) AutoGPC 1 June 2016 12 / 19
Zoubin Ghahramani 26 / 50
A Sample Report
ofIvariable
Full model‘plasmadescription
glucose’ and a 2-way interaction between variables ‘pedigree function’ and
‘age’ (SE + SEmodel
. . . ,1 the 3 ◊ SEinvolves
4 ). This model can achieve
an additive a cross-validated
combination error rate
of variable of 22.38%
‘plasma glu- and a
negative log marginal likelihood of 353.28.
cose’ and a 2-way interaction between variables ‘pedigree function’ and ‘age’
We have also analysed the relevance of each input variable. They are listed below in Table
(SE1 + SE3 ⇥ SE4 ). This model can achieve a cross-validated error rate of
1 in descending order of inferred relevance (i.e. in ascending order of cross-validated error of
22.38% and a negative log marginal likelihood of 353.28.
the best one-dimensional GP classifier).
Qiurui He (CUED)
Table AutoGPC
2: Input variables 1 June 2016 13 / 19
Zoubin Ghahramani
Statistics Classifier Performance 27 / 50
A Sample Report
Zoubin Ghahramani 28 / 50
A Sample Report
Table 3: Classification performance of the full model, its additive components (if any), all input
Zoubin Ghahramani 29 / 50
T HE AUTO ML COMPETITION
I New algorithms for building machine learning systems that learn
under strict time, CPU, memory and disk space constraints,
making decisions about where to allocate computational
resources so as to maximise statistical performance.
I Second and First place in the first two rounds of the AutoML
classification challenge to “design machine learning methods
capable of performing all model selection and parameter tuning
without any human intervention.”
Zoubin Ghahramani 41 / 51
C ONCLUSIONS
I Probabilistic programming
I Bayesian optimisation
Zoubin Ghahramani 42 / 51
C OLLABORATORS
Zoubin Ghahramani 43 / 51
PAPERS
General:
Ghahramani, Z. (2013) Bayesian nonparametrics and the probabilistic approach to modelling.
Philosophical Trans. Royal Society A 371: 20110553.
Ghahramani, Z. (2015) Probabilistic machine learning and artificial intelligence Nature
521:452–459. http://www.nature.com/nature/journal/v521/n7553/full/nature14541.html
Automatic Statistician:
Website: http://www.automaticstatistician.com
Duvenaud, D., Lloyd, J. R., Grosse, R., Tenenbaum, J. B. and Ghahramani, Z. (2013) Structure
Discovery in Nonparametric Regression through Compositional Kernel Search. ICML 2013.
Lloyd, J. R., Duvenaud, D., Grosse, R., Tenenbaum, J. B. and Ghahramani, Z. (2014) Automatic
Construction and Natural-language Description of Nonparametric Regression Models AAAI
2014. http://arxiv.org/pdf/1402.4304v2.pdf
Lloyd, J. R., and Ghahramani, Z. (2015) Statistical Model Criticism using Kernel Two Sample
Tests. http://mlg.eng.cam.ac.uk/Lloyd/papers/kernel-model-checking.pdf. NIPS 2015.
Bayesian Optimisation:
Hernández-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. (2014) Predictive entropy
search for efficient global optimization of black-box functions. NIPS 2014
Hernández-Lobato, J.M., Gelbart, M.A., Adams, R.P., Hoffman, M.W., Ghahramani, Z. (2016)
A General Framework for Constrained Bayesian Optimization using Information-based Search.
Journal of Machine Learning Research. 17(160):1–53.
Zoubin Ghahramani 44 / 51
PAPERS II
Probabilistic Programming:
Turing: https://github.com/yebai/Turing.jl
Chen, Y., Mansinghka, V., Ghahramani, Z. (2014) Sublinear-Time Approximate MCMC
Transitions for Probabilistic Programs. arXiv:1411.1690
Ge, Hong, Adam Scibior, and Zoubin Ghahramani (2016) Turing: rejuvenating probabilistic
programming in Julia. (In preparation).
Bayesian neural networks:
José Miguel Hernández-Lobato and Ryan Adams. Probabilistic backpropagation for scalable
learning of Bayesian neural networks. ICML, 2015.
Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model
uncertainty in deep learning. ICML, 2016.
Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent
neural networks. NIPS, 2016.
José Miguel Hernández-Lobato, Yingzhen Li, Daniel Hernández-Lobato, Thang Bui, and
Richard E Turner. Black-box alpha divergence minimization. ICML, 2016.
Zoubin Ghahramani 45 / 51
BACKGROUND : G AUSSIAN PROCESSES
Consider the problem of nonlinear regression: You want to
learn a function f with error bars from data D = {X, y}
y
outputs
Bayesian neural network
weights
Data: D = {(x(n) , y(n) )}Nn=1 = (X, y)
Parameters θ are weights of neural net
hidden
units
prior p(θ|α)
weights
posterior p(θ|α, D) ∝ p(y|X,R θ)p(θ|α)
inputs prediction p(y0 |D, x0 , α) = p(y0 |x0 , θ)p(θ|D, α) dθ
x
Zoubin Ghahramani 47 / 51
T HE ATOMS OF OUR LANGUAGE OF MODELS
0 0 0 0
Zoubin Ghahramani 48 / 51
T HE COMPOSITION RULES OF OUR LANGUAGE
I Two main operations: addition, multiplication
0 0
quadratic locally
L IN × L IN SE × P ER
functions periodic
Zoubin Ghahramani 49 / 51
M ODEL C HECKING AND C RITICISM
Zoubin Ghahramani 50 / 51
BAYESIAN O PTIMISATION
dictive Entropy Search with Unknown Constraints