
Based on: J. A. Bilmes, A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models, TR-97-021, U.C. Berkeley, April 1998; G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions, John Wiley & Sons, Inc., 1997; D. Koller, course CS-228 handouts, Stanford University, 2001; and N. Friedman and D. Koller's NIPS'99 tutorial.
Graphical Models - Learning: Parameter Estimation

Wolfram Burgard, Luc De Raedt, Kristian Kersting, Bernhard Nebel
Albert-Ludwigs University Freiburg, Germany
Advanced AI, WS 06/07

[Figure: the ALARM Bayesian network for patient monitoring, with nodes such as MINVOLSET, PULMEMBOLUS, INTUBATION, KINKEDTUBE, VENTMACH, DISCONNECT, PAP, SHUNT, VENTLUNG, PRESS, VENTTUBE, MINVOL, FIO2, VENTALV, ANAPHYLAXIS, PVSAT, ARTCO2, TPR, SAO2, INSUFFANESTH, EXPCO2, HYPOVOLEMIA, LVFAILURE, CATECHOL, HR, ERRCAUTER, LVEDVOLUME, STROKEVOLUME, HISTORY, ERRLOWOUTPUT, CVP, PCWP, CO, HRBP, HREKG, HRSAT, BP]

What is Learning?

Agents are said to learn if they improve their performance over time based on experience.

"The problem of understanding intelligence is said to be the greatest problem in science today and the problem for this century, as deciphering the genetic code was for the second half of the last one; the problem of learning represents a gateway to understanding intelligence in man and machines." -- Tommaso Poggio and Steven Smale, 2003

Why bother with learning?


Bottleneck of knowledge acquisition: expensive, difficult; normally, no expert is around.
Data is cheap! Huge amounts of data are available, e.g. clinical tests, web mining (log files), ...

Outline

Introduction
Reminder: Probability theory
Basics of Bayesian Networks
Modeling Bayesian networks
Inference (VE, Junction tree)
[Excursus: Markov Networks]
Learning Bayesian networks
Relational Models

Why Learning Bayesian Networks?

Conditional independencies & the graphical language capture the structure of many real-world distributions.
The graph structure provides much insight into the domain and allows knowledge discovery.
The learned model can be used for many tasks and supports all the features of probabilistic learning:
- Model selection criteria
- Dealing with missing data & hidden variables

Learning With Bayesian Networks

Data + prior information are fed into a learning algorithm, which outputs a Bayesian network, e.g. over E, B, A with the CPT P(A | E,B):

E    B    | P(A=true | E,B)   P(A=false | E,B)
e    b    |   .9                .1
e    ¬b   |   .7                .3
¬e   b    |   .8                .2
¬e   ¬b   |   .99               .01

What does the data look like?

A complete data set: data cases X1, ..., XM over the attributes/variables A1, ..., A6.

      A1     A2     A3     A4     A5     A6
X1    true   true   false  true   false  false
X2    false  true   true   true   false  false
...   ...    ...    ...    ...    ...    ...
XM    true   false  false  false  true   true

What does the data look like?

An incomplete data set: the same attributes A1, ..., A6, but some entries are missing, i.e. some cells of the table contain "?" (missing values).

What does the data look like?

A variable can also be hidden (latent): its value is never observed in any data case.

Hidden variable - Examples

Real-world data: the states of some random variables are missing, e.g. in medical diagnosis not all patients are subject to all tests.
Hidden variables are also introduced deliberately, e.g. for parameter reduction or clustering.

1. Parameter reduction:

[Figure: a network over binary variables X1, X2, X3 and Y1, Y2, Y3; introducing a hidden variable H between the X's and the Y's greatly reduces the number of parameters.]

2. Clustering:

Hidden variables also appear in clustering, e.g. in the Autoclass model:
- A hidden variable assigns the class label (cluster assignment).
- The observed attributes X1, X2, ..., Xn are independent given the class.

[Figure: EM-style clustering of 2-D points; the cluster means are randomly assigned at first and re-estimated over iterations 1, 2, 5, and 25. Slides due to Eamonn Keogh]

What is a natural grouping among these objects?

Clustering is subjective: the same objects can be grouped as Simpson's Family vs. School Employees, or as Females vs. Males. [Slides due to Eamonn Keogh]

Learning With Bayesian Networks

Fully observed data:
- Fixed structure: easiest problem, counting.
- Fixed variables (structure to be learned): selection of arcs; new domain with no domain expert; data mining.
Partially observed data:
- Fixed structure: numerical, nonlinear optimization; multiple calls to BN inference; difficult for large networks.
- Fixed variables: encompasses the difficult subproblems; only Structural EM is known.
- Hidden variables: scientific discovery.

Parameter Estimation

Let D = {x_1, ..., x_M} be a set of instantiations of the RVs X_1, ..., X_n; each x_m is called a data case.
iid assumption: all data cases are independently sampled from identical distributions.
Find: the parameters of the CPDs which match the data best.
What does "best matching" mean?

Maximum Likelihood - Parameter Estimation

What does "best matching" mean? Find the parameters which have most likely produced the data.

1. MAP parameters: the parameters that are most probable given the data.
2. The data is equally likely for all parameters (the normalizer P(D) does not depend on the parameters).
3. All parameters are a priori equally likely.
=> Find: ML parameters.

Likelihood of the parameters given the data; log-likelihood of the parameters given the data (see the formulation below).
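A minimal formal statement of these quantities (the formulas are not legible in the extracted slides; these are the standard definitions, with θ the CPT parameters and D = {x_1, ..., x_M} the data):

\[
\theta_{\mathrm{MAP}} = \arg\max_{\theta} P(\theta \mid D)
  = \arg\max_{\theta} \frac{P(D \mid \theta)\, P(\theta)}{P(D)}
\]

With a uniform prior P(θ) (assumption 3) and P(D) independent of θ (assumption 2), this reduces to

\[
\theta_{\mathrm{ML}} = \arg\max_{\theta} P(D \mid \theta) = \arg\max_{\theta} L(\theta : D),
\qquad
LL(\theta : D) = \log P(D \mid \theta).
\]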

Maximum Likelihood

Maximum likelihood is one of the most commonly used estimators in statistics and is intuitively appealing.
- Consistent: the estimate converges to the best possible value as the number of examples grows.
- Asymptotically efficient: the estimate is as close to the true value as possible given a particular training set.

Learning With Bayesian Networks (roadmap)

Fixed structure, fully observed data: the easiest problem, counting.

Known Structure, Complete Data

Network structure is specified; the learner needs to estimate the parameters. The data does not contain missing values.

[Figure: data cases over (E, B, A), e.g. <Y,N,N>, <Y,N,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>, together with the fixed structure and the unknown CPT P(A | E,B) = "?", are fed to the learning algorithm, which outputs the estimated CPT (.9, .7, .8, .99 and complements .1, .3, .2, .01).]
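A small illustrative sketch (not from the slides): with complete data and a known structure, the ML estimate of P(A | E,B) is obtained by counting. It uses only the handful of (E,B,A) cases shown in the figure above; the elided rows are omitted.

from collections import Counter

# (E, B, A) cases from the figure above (toy subset)
cases = [("Y", "N", "N"), ("Y", "N", "Y"), ("N", "N", "Y"), ("N", "Y", "Y"), ("N", "Y", "Y")]

joint = Counter((e, b, a) for e, b, a in cases)    # N(E=e, B=b, A=a)
parent = Counter((e, b) for e, b, _ in cases)      # N(E=e, B=b)

def p_a_given_eb(a, e, b):
    """ML estimate N(e, b, a) / N(e, b); undefined if the parent state never occurs."""
    return joint[(e, b, a)] / parent[(e, b)]

print(p_a_given_eb("Y", "Y", "N"))  # P(A=Y | E=Y, B=N) = 1/2 = 0.5 on the toy cases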

ML Parameter Estimation

For the complete data set D = {x_1, ..., x_M} over the variables A_1, ..., A_n (as in the table above), the likelihood factorizes:

\[
L(\theta : D) = P(D \mid \theta)
  = \prod_{m=1}^{M} P(x_m \mid \theta) \quad\text{(iid)}
  = \prod_{m=1}^{M} \prod_{j=1}^{n} P\big(x_m[A_j] \mid x_m[\mathrm{Pa}(A_j)], \theta_j\big) \quad\text{(BN semantics)}
  = \prod_{j=1}^{n} \prod_{m=1}^{M} P\big(x_m[A_j] \mid x_m[\mathrm{Pa}(A_j)], \theta_j\big)
\]

Each factor can be maximized individually! Only the local parameters θ_j of the family of A_j are involved.

Decomposability of Likelihood

If the data set is complete (no question marks), we can maximize each local likelihood function independently, and then combine the solutions to get an MLE solution.
The global problem decomposes into independent, local sub-problems; this allows efficient solutions to the MLE problem.

Likelihood for Multinomials

Random variable V with values 1, ..., K and parameters θ_1, ..., θ_K:

\[
L(\theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}
\]

where N_k is the count of state k in the data, subject to the constraint \(\sum_{k=1}^{K} \theta_k = 1\).
This constraint implies that the choice of θ_i influences the choice of θ_j (i ≠ j).
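A small numeric check (illustrative, not from the slides; the counts are made up): the counting estimate θ_k = N_k / ΣN maximizes the multinomial log-likelihood.

import numpy as np

N = np.array([3.0, 1.0, 1.0])                  # counts N_k for a 3-state variable

def log_likelihood(theta):
    return float(np.sum(N * np.log(theta)))

theta_mle = N / N.sum()                        # [0.6, 0.2, 0.2]
theta_other = np.array([0.4, 0.3, 0.3])
print(log_likelihood(theta_mle), log_likelihood(theta_other))  # the MLE scores higher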

Likelihood for Binomials (2 states only)

\[
L(\theta : D) = \theta_1^{N_1}\,\theta_2^{N_2}, \qquad \theta_1 + \theta_2 = 1
\]

Compute the partial derivative of the log-likelihood and set it to zero:

\[
\frac{\partial}{\partial \theta_1}\big(N_1 \log \theta_1 + N_2 \log(1-\theta_1)\big)
  = \frac{N_1}{\theta_1} - \frac{N_2}{1-\theta_1} = 0
\]

=> The MLE is \(\hat{\theta}_1 = \frac{N_1}{N_1 + N_2}\) and \(\hat{\theta}_2 = \frac{N_2}{N_1 + N_2}\).

In general, for multinomials (more than 2 states), the MLE is \(\hat{\theta}_k = \frac{N_k}{\sum_{l} N_l}\).
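A short derivation of the general multinomial MLE (not spelled out on the slides; the standard Lagrange-multiplier argument):

\[
\max_{\theta} \sum_k N_k \log\theta_k + \lambda\Big(1 - \sum_k \theta_k\Big)
\;\Rightarrow\; \frac{N_k}{\theta_k} = \lambda
\;\Rightarrow\; \theta_k = \frac{N_k}{\lambda},
\qquad
\sum_k \theta_k = 1 \Rightarrow \lambda = \sum_l N_l
\;\Rightarrow\; \hat{\theta}_k = \frac{N_k}{\sum_l N_l}.
\]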

Likelihood for Conditional Multinomials

For a variable V with K states, there is a separate multinomial \(\theta_{V \mid pa}\) for each joint state pa of the parents of V:

\[
L(\theta_V : D) = \prod_{pa} \prod_{k=1}^{K} \theta_{k \mid pa}^{\,N(k,\,pa)}
\]

MLE:
\[
\hat{\theta}_{k \mid pa} = \frac{N(k, pa)}{\sum_{k'} N(k', pa)}
\]

Learning With Bayesian Networks (roadmap)

Fixed structure, partially observed data: numerical, nonlinear optimization; multiple calls to BN inference; difficult for large networks.

Known Structure, Incomplete Data

Network structure is specified; the data contains missing values.

[Figure: data cases over (E, B, A) with missing values, e.g. <Y,?,N>, <Y,N,?>, <N,N,Y>, <N,Y,Y>, ..., <?,Y,Y>, plus the fixed structure with the unknown CPT P(A | E,B), are fed to the learning algorithm, which outputs the estimated CPT (.9, .7, .8, .99 and complements .1, .3, .2, .01).]

EM Idea

In the case of complete data, ML parameter estimation is easy: simply counting (1 iteration).
Incomplete data? We need to consider assignments to the missing values:
1. Complete the data (imputation: most probable value? average value? ...)
2. Count
3. Iterate

EM Idea: complete the data

Network over A and B (A -> B). Incomplete data (two cases have a missing value for B):

A      B
true   true
true   ?
false  true
true   false
false  ?

Complete the data: each "?" is completed with both values, weighted by the current estimate of P(B | A) (initially 0.5 / 0.5 for both completions). This gives the expected counts:

A      B      N
true   true   1.5
true   false  1.5
false  true   1.5
false  false  0.5

Maximize: re-estimate the parameters from the expected counts, e.g. P(B=true | A=false) = 1.5 / 2 = 0.75.

Iterate: complete the data with the new parameters and recount. The expected counts become

A      B      N
true   true   1.5
true   false  1.5
false  true   1.75
false  false  0.25

and, after one more iteration,

A      B      N
true   true   1.5
true   false  1.5
false  true   1.875
false  false  0.125
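A minimal sketch (not from the slides) of the "complete the data" idea on the toy A -> B example above: cases with B missing are split into two weighted completions using the current estimate of P(B | A), and the parameters are then re-estimated from the expected counts.

def em_iteration(cases, p_b_given_a):
    """cases: list of (a, b) with b possibly None; p_b_given_a: dict a -> P(B=True | A=a)."""
    counts = {(a, b): 0.0 for a in (True, False) for b in (True, False)}
    for a, b in cases:
        if b is None:                      # missing value: fractional (expected) counts
            counts[(a, True)] += p_b_given_a[a]
            counts[(a, False)] += 1.0 - p_b_given_a[a]
        else:                              # observed value: ordinary count
            counts[(a, b)] += 1.0
    # M-step: re-estimate P(B=True | A=a) from the expected counts
    new_p = {a: counts[(a, True)] / (counts[(a, True)] + counts[(a, False)])
             for a in (True, False)}
    return counts, new_p

cases = [(True, True), (True, None), (False, True), (True, False), (False, None)]
theta = {True: 0.5, False: 0.5}            # initial parameters
for _ in range(3):
    counts, theta = em_iteration(cases, theta)
    print(counts, theta)
# the first iteration reproduces the expected counts 1.5 / 1.5 / 1.5 / 0.5 shown above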

Complete-data likelihood

For an incomplete data set (cases over A1, ..., A6 with some "?" entries), the incomplete-data likelihood P(D | θ) sums over all possible completions of the missing values and no longer decomposes.

Assume a complete data set exists whose observed part is D: the corresponding complete-data likelihood factorizes as in the fully observed case and is easy to maximize (counting).

EM Algorithm - Abstract

Expectation Step: complete the data, i.e. compute the expected counts given the current parameters.
Maximization Step: re-estimate the parameters from the expected counts.

EM Algorithm - Principle

Expectation Maximization (EM): construct a new function Q(θ, θ^k) based on the current point θ^k (a function which behaves well).
Property: the maximum θ^{k+1} of the new function has a better score under the incomplete-data likelihood P(Y | θ) than the current point.

[Figure: the incomplete-data likelihood P(Y | θ) as a curve over θ; the auxiliary function Q(θ, θ^k) is built at the current point θ^k, and its maximum defines the next iterate θ^{k+1}.]
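A standard way to write the auxiliary function (the formula is not legible in the extracted slides), with Y the observed data, Z the missing values, and θ^k the current parameters:

\[
Q(\theta, \theta^k) = E_{Z \mid Y, \theta^k}\big[\log P(Y, Z \mid \theta)\big]
\]

E-step: compute Q(θ, θ^k). M-step: θ^{k+1} = argmax_θ Q(θ, θ^k). One can show that log P(Y | θ^{k+1}) ≥ log P(Y | θ^k).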

EM for Multinomials

Random variable V with values 1, ..., K. As in the complete-data case, but with expected counts:

MLE:
\[
\hat{\theta}_k = \frac{EN_k}{\sum_{l=1}^{K} EN_l},
\qquad
EN_k = \sum_{m=1}^{M} P(V = k \mid x_m, \theta^{\mathrm{old}})
\]

where EN_k are the expected counts of state k in the data.

EM for Conditional Multinomials

As before, there is a separate multinomial for each joint state pa of the parents of V; the MLE uses the expected counts:

\[
\hat{\theta}_{k \mid pa} = \frac{EN(k, pa)}{\sum_{k'} EN(k', pa)}
\]

Learning Parameters: incomplete data

Non-decomposable likelihood (missing values, hidden nodes).

Data (cases over S, X, D, C, B; "?" marks a missing value) and initial parameters (the current model):
<?, 0, 1, 0, 1>
<1, 1, ?, 0, 1>
<0, 0, 0, ?, ?>
<?, ?, 0, ?, 1>

EM algorithm: iterate until convergence.

Expectation: using the current model, complete each data case by inference, e.g. P(S | X=0, D=1, C=0, B=1) for the first case; this yields expected counts for every variable.
Maximization: update the parameters (ML or MAP) from the expected counts.

1. Initialize the parameters.
2. Compute pseudo counts for each variable (inference, e.g. with the junction tree algorithm).
3. Set the parameters to the (completed-data) ML estimates.
4. If not converged, iterate to 2.
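A schematic sketch of the four steps above for a discrete Bayesian network with fixed structure (not the course's code; the data structures and the inference routine infer_family_marginal are assumptions for illustration, standing in for e.g. one junction tree pass per case).

from collections import defaultdict

def em(data, families, num_states, infer_family_marginal, theta0, iterations=50):
    """families: iterable of variable indices; num_states: dict j -> number of states.

    infer_family_marginal(j, case, theta) is a hypothetical inference routine returning
    a dict {(k, pa): P(X_j = k, Pa_j = pa | observed part of case, theta)}.
    """
    theta = theta0                                            # 1. initialize parameters
    for _ in range(iterations):
        counts = {j: defaultdict(float) for j in families}    # 2. expected (pseudo) counts
        for case in data:
            for j in families:
                for (k, pa), p in infer_family_marginal(j, case, theta).items():
                    counts[j][(k, pa)] += p
        theta = {                                             # 3. completed-data ML estimates
            j: {(k, pa): c / sum(counts[j].get((k2, pa), 0.0) for k2 in range(num_states[j]))
                for (k, pa), c in counts[j].items()}
            for j in families
        }
        # 4. a convergence check (small change in log-likelihood or parameters) would go here
    return theta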

Monotonicity

(Dempster, Laird, Rubin 1977): the incomplete-data likelihood function is not decreased after an EM iteration.
Main result for (discrete) Bayesian networks: for any initial, non-uniform value the EM algorithm converges to a (local or global) maximum.

LL on training set (Alarm) / Parameter values (Alarm)

[Figures: log-likelihood on the training set and parameter values over EM iterations on the Alarm network; experiment by Bauer, Koller and Singer [UAI97].]

EM in Practice

Initial parameters:
- Random parameter setting
- Best guess from another source

Stopping criteria:
- Small change in the likelihood of the data
- Small change in the parameter values

Avoiding bad local maxima:
- Multiple restarts
- Early pruning of unpromising ones

Speed up:
- Various methods to speed convergence

Gradient Ascent

Requires the same BN inference computations as EM.
Pros:
- Flexible
- Closely related to methods used in neural network training
Cons:
- Need to project the gradient onto the space of legal parameters
- To get reasonable convergence, it must be combined with smart optimization techniques
- Needs the junction tree to do multiple inference
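For completeness, a standard form of the gradient used here (not shown on the extracted slides), for a CPT entry θ_{k|pa} = P(X_j = k | Pa_j = pa):

\[
\frac{\partial\, LL(\theta : D)}{\partial\, \theta_{k \mid pa}}
  = \sum_{m=1}^{M} \frac{P(X_j = k, \mathrm{Pa}_j = pa \mid x_m, \theta)}{\theta_{k \mid pa}}
\]

so each gradient step needs the same family marginals that the EM E-step computes.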

Parameter Estimation: Summary

Parameter estimation is a basic task for learning with Bayesian networks.
Fully observed data: counting.
Partially observed data: due to the missing values it becomes a non-linear optimization problem (EM, gradient ascent, ...), based on pseudo counts; EM for multinomial random variables.
