
Based on: J. A. Bilmes, A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models, TR-97-021, U.C. Berkeley, April 1998; G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions, John Wiley & Sons, Inc., 1997; D. Koller, course CS-228 handouts, Stanford University, 2001; and N. Friedman and D. Koller's NIPS'99 tutorial.
Graphical Models - Learning: Parameter Estimation

Wolfram Burgard, Luc De Raedt, Kristian Kersting, Bernhard Nebel
Albert-Ludwigs University Freiburg, Germany
Advanced AI, WS 06/07

[Figure: the ALARM Bayesian network for patient monitoring, with nodes such as MINVOLSET, PULMEMBOLUS, INTUBATION, KINKEDTUBE, VENTMACH, DISCONNECT, PAP, SHUNT, VENTLUNG, PRESS, VENTTUBE, MINVOL, FIO2, VENTALV, ANAPHYLAXIS, PVSAT, ARTCO2, TPR, SAO2, INSUFFANESTH, EXPCO2, HYPOVOLEMIA, LVFAILURE, CATECHOL, HR, ERRCAUTER, LVEDVOLUME, STROKEVOLUME, HISTORY, ERRLOWOUTPUT, CVP, PCWP, CO, HRBP, HREKG, HRSAT, BP]

What is Learning?

Agents are said to learn if they improve their performance over time based on experience.

"The problem of understanding intelligence is said to be the greatest problem in science today and the problem for this century, as deciphering the genetic code was for the second half of the last one; the problem of learning represents a gateway to understanding intelligence in man and machines." -- Tommaso Poggio and Steven Smale, 2003

Why bother with learning?


Bottleneck of knowledge acquisition: expensive, difficult; normally, no expert is around.
Data is cheap! Huge amounts of data are available, e.g. clinical tests, web mining (log files), ...

Outline

Introduction
Reminder: Probability theory
Basics of Bayesian Networks
Modeling Bayesian networks
Inference (VE, Junction tree)
[Excursus: Markov Networks]
Learning Bayesian networks
Relational Models

Why Learning Bayesian Networks?

Conditional independencies & the graphical language capture the structure of many real-world distributions.
The graph structure provides much insight into the domain and allows knowledge discovery.
The learned model can be used for many tasks and supports all the features of probabilistic learning:
- Model selection criteria
- Dealing with missing data & hidden variables

Learning With Bayesian Networks

Data + prior information are fed into a learning algorithm, which outputs a Bayesian network, e.g. over E, B, A with the CPT P(A | E,B):

E    B    | P(A=true | E,B)   P(A=false | E,B)
e    b    |   .9                .1
e    ¬b   |   .7                .3
¬e   b    |   .8                .2
¬e   ¬b   |   .99               .01

What does the data look like?

A complete data set: data cases X1, ..., XM over the attributes/variables A1, ..., A6.

      A1     A2     A3     A4     A5     A6
X1    true   true   false  true   false  false
X2    false  true   true   true   false  false
...   ...    ...    ...    ...    ...    ...
XM    true   false  false  false  true   true

What does the data look like?

An incomplete data set: the same attributes A1, ..., A6, but some entries are missing, i.e. some cells of the table contain "?" (missing values).

What does the data look like?

A variable can also be hidden (latent): its value is never observed in any data case.

Hidden variable - Examples

Real-world data: the states of some random variables are missing, e.g. in medical diagnosis not all patients are subject to all tests.
Hidden variables are also introduced deliberately, e.g. for parameter reduction or clustering.

1. Parameter reduction:

[Figure: a network over binary variables X1, X2, X3 and Y1, Y2, Y3; introducing a hidden variable H between the X's and the Y's greatly reduces the number of parameters.]

2. Clustering:

Hidden variables also appear in clustering, e.g. in the Autoclass model:
- A hidden variable assigns the class label (cluster assignment).
- The observed attributes X1, X2, ..., Xn are independent given the class.

[Figure: EM-style clustering of 2-D points; the cluster means are randomly assigned at first and re-estimated over iterations 1, 2, 5, and 25. Slides due to Eamonn Keogh]

What is a natural grouping among these objects?

Clustering is subjective: the same objects can be grouped as Simpson's Family vs. School Employees, or as Females vs. Males. [Slides due to Eamonn Keogh]

Learning With Bayesian Networks

Fully observed data:
- Fixed structure: easiest problem, counting.
- Fixed variables (structure to be learned): selection of arcs; new domain with no domain expert; data mining.
Partially observed data:
- Fixed structure: numerical, nonlinear optimization; multiple calls to BN inference; difficult for large networks.
- Fixed variables: encompasses the difficult subproblems; only Structural EM is known.
- Hidden variables: scientific discovery.

Parameter Estimation

Let D = {x_1, ..., x_M} be a set of instantiations of the RVs X_1, ..., X_n; each x_m is called a data case.
iid assumption: all data cases are independently sampled from identical distributions.
Find: the parameters of the CPDs which match the data best.
What does "best matching" mean?

Maximum Likelihood - Parameter Estimation

What does "best matching" mean? Find the parameters which have most likely produced the data.

1. MAP parameters: the parameters that are most probable given the data.
2. The data is equally likely for all parameters (the normalizer P(D) does not depend on the parameters).
3. All parameters are a priori equally likely.
=> Find: ML parameters.

Likelihood of the parameters given the data; log-likelihood of the parameters given the data (see the formulation below).
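A minimal formal statement of these quantities (the formulas are not legible in the extracted slides; these are the standard definitions, with θ the CPT parameters and D = {x_1, ..., x_M} the data):

\[
\theta_{\mathrm{MAP}} = \arg\max_{\theta} P(\theta \mid D)
  = \arg\max_{\theta} \frac{P(D \mid \theta)\, P(\theta)}{P(D)}
\]

With a uniform prior P(θ) (assumption 3) and P(D) independent of θ (assumption 2), this reduces to

\[
\theta_{\mathrm{ML}} = \arg\max_{\theta} P(D \mid \theta) = \arg\max_{\theta} L(\theta : D),
\qquad
LL(\theta : D) = \log P(D \mid \theta).
\]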

Maximum Likelihood

Maximum likelihood is one of the most commonly used estimators in statistics and is intuitively appealing.
- Consistent: the estimate converges to the best possible value as the number of examples grows.
- Asymptotically efficient: the estimate is as close to the true value as possible given a particular training set.

Learning With Bayesian Networks (roadmap)

Fixed structure, fully observed data: the easiest problem, counting.

Known Structure, Complete Data

Network structure is specified; the learner needs to estimate the parameters. The data does not contain missing values.

[Figure: data cases over (E, B, A), e.g. <Y,N,N>, <Y,N,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>, together with the fixed structure and the unknown CPT P(A | E,B) = "?", are fed to the learning algorithm, which outputs the estimated CPT (.9, .7, .8, .99 and complements .1, .3, .2, .01).]
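A small illustrative sketch (not from the slides): with complete data and a known structure, the ML estimate of P(A | E,B) is obtained by counting. It uses only the handful of (E,B,A) cases shown in the figure above; the elided rows are omitted.

from collections import Counter

# (E, B, A) cases from the figure above (toy subset)
cases = [("Y", "N", "N"), ("Y", "N", "Y"), ("N", "N", "Y"), ("N", "Y", "Y"), ("N", "Y", "Y")]

joint = Counter((e, b, a) for e, b, a in cases)    # N(E=e, B=b, A=a)
parent = Counter((e, b) for e, b, _ in cases)      # N(E=e, B=b)

def p_a_given_eb(a, e, b):
    """ML estimate N(e, b, a) / N(e, b); undefined if the parent state never occurs."""
    return joint[(e, b, a)] / parent[(e, b)]

print(p_a_given_eb("Y", "Y", "N"))  # P(A=Y | E=Y, B=N) = 1/2 = 0.5 on the toy cases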

ML Parameter Estimation

For the complete data set D = {x_1, ..., x_M} over the variables A_1, ..., A_n (as in the table above), the likelihood factorizes:

\[
L(\theta : D) = P(D \mid \theta)
  = \prod_{m=1}^{M} P(x_m \mid \theta) \quad\text{(iid)}
  = \prod_{m=1}^{M} \prod_{j=1}^{n} P\big(x_m[A_j] \mid x_m[\mathrm{Pa}(A_j)], \theta_j\big) \quad\text{(BN semantics)}
  = \prod_{j=1}^{n} \prod_{m=1}^{M} P\big(x_m[A_j] \mid x_m[\mathrm{Pa}(A_j)], \theta_j\big)
\]

Each factor can be maximized individually! Only the local parameters θ_j of the family of A_j are involved.

Decomposability of Likelihood

If the data set is complete (no question marks), we can maximize each local likelihood function independently, and then combine the solutions to get an MLE solution.
The global problem decomposes into independent, local sub-problems; this allows efficient solutions to the MLE problem.

Likelihood for Multinomials

Random variable V with values 1, ..., K and parameters θ_1, ..., θ_K:

\[
L(\theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}
\]

where N_k is the count of state k in the data, subject to the constraint \(\sum_{k=1}^{K} \theta_k = 1\).
This constraint implies that the choice of θ_i influences the choice of θ_j (i ≠ j).
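A small numeric check (illustrative, not from the slides; the counts are made up): the counting estimate θ_k = N_k / ΣN maximizes the multinomial log-likelihood.

import numpy as np

N = np.array([3.0, 1.0, 1.0])                  # counts N_k for a 3-state variable

def log_likelihood(theta):
    return float(np.sum(N * np.log(theta)))

theta_mle = N / N.sum()                        # [0.6, 0.2, 0.2]
theta_other = np.array([0.4, 0.3, 0.3])
print(log_likelihood(theta_mle), log_likelihood(theta_other))  # the MLE scores higher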

Likelihood for Binomials (2 states only)

\[
L(\theta : D) = \theta_1^{N_1}\,\theta_2^{N_2}, \qquad \theta_1 + \theta_2 = 1
\]

Compute the partial derivative of the log-likelihood and set it to zero:

\[
\frac{\partial}{\partial \theta_1}\big(N_1 \log \theta_1 + N_2 \log(1-\theta_1)\big)
  = \frac{N_1}{\theta_1} - \frac{N_2}{1-\theta_1} = 0
\]

=> The MLE is \(\hat{\theta}_1 = \frac{N_1}{N_1 + N_2}\) and \(\hat{\theta}_2 = \frac{N_2}{N_1 + N_2}\).

In general, for multinomials (more than 2 states), the MLE is \(\hat{\theta}_k = \frac{N_k}{\sum_{l} N_l}\).
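A short derivation of the general multinomial MLE (not spelled out on the slides; the standard Lagrange-multiplier argument):

\[
\max_{\theta} \sum_k N_k \log\theta_k + \lambda\Big(1 - \sum_k \theta_k\Big)
\;\Rightarrow\; \frac{N_k}{\theta_k} = \lambda
\;\Rightarrow\; \theta_k = \frac{N_k}{\lambda},
\qquad
\sum_k \theta_k = 1 \Rightarrow \lambda = \sum_l N_l
\;\Rightarrow\; \hat{\theta}_k = \frac{N_k}{\sum_l N_l}.
\]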

Likelihood for Conditional Multinomials

For a variable V with K states, there is a separate multinomial \(\theta_{V \mid pa}\) for each joint state pa of the parents of V:

\[
L(\theta_V : D) = \prod_{pa} \prod_{k=1}^{K} \theta_{k \mid pa}^{\,N(k,\,pa)}
\]

MLE:
\[
\hat{\theta}_{k \mid pa} = \frac{N(k, pa)}{\sum_{k'} N(k', pa)}
\]

Learning With Bayesian Networks (roadmap)

Fixed structure, partially observed data: numerical, nonlinear optimization; multiple calls to BN inference; difficult for large networks.

Known Structure, Incomplete Data

Network structure is specified; the data contains missing values.

[Figure: data cases over (E, B, A) with missing values, e.g. <Y,?,N>, <Y,N,?>, <N,N,Y>, <N,Y,Y>, ..., <?,Y,Y>, plus the fixed structure with the unknown CPT P(A | E,B), are fed to the learning algorithm, which outputs the estimated CPT (.9, .7, .8, .99 and complements .1, .3, .2, .01).]

EM Idea

In the case of complete data, ML parameter estimation is easy: simply counting (1 iteration).
Incomplete data? We need to consider assignments to the missing values:
1. Complete the data (imputation: most probable value? average value? ...)
2. Count
3. Iterate

EM Idea: complete the data

Network over A and B (A -> B). Incomplete data (two cases have a missing value for B):

A      B
true   true
true   ?
false  true
true   false
false  ?

Complete the data: each "?" is completed with both values, weighted by the current estimate of P(B | A) (initially 0.5 / 0.5 for both completions). This gives the expected counts:

A      B      N
true   true   1.5
true   false  1.5
false  true   1.5
false  false  0.5

Maximize: re-estimate the parameters from the expected counts, e.g. P(B=true | A=false) = 1.5 / 2 = 0.75.

Iterate: complete the data with the new parameters and recount. The expected counts become

A      B      N
true   true   1.5
true   false  1.5
false  true   1.75
false  false  0.25

and, after one more iteration,

A      B      N
true   true   1.5
true   false  1.5
false  true   1.875
false  false  0.125
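A minimal sketch (not from the slides) of the "complete the data" idea on the toy A -> B example above: cases with B missing are split into two weighted completions using the current estimate of P(B | A), and the parameters are then re-estimated from the expected counts.

def em_iteration(cases, p_b_given_a):
    """cases: list of (a, b) with b possibly None; p_b_given_a: dict a -> P(B=True | A=a)."""
    counts = {(a, b): 0.0 for a in (True, False) for b in (True, False)}
    for a, b in cases:
        if b is None:                      # missing value: fractional (expected) counts
            counts[(a, True)] += p_b_given_a[a]
            counts[(a, False)] += 1.0 - p_b_given_a[a]
        else:                              # observed value: ordinary count
            counts[(a, b)] += 1.0
    # M-step: re-estimate P(B=True | A=a) from the expected counts
    new_p = {a: counts[(a, True)] / (counts[(a, True)] + counts[(a, False)])
             for a in (True, False)}
    return counts, new_p

cases = [(True, True), (True, None), (False, True), (True, False), (False, None)]
theta = {True: 0.5, False: 0.5}            # initial parameters
for _ in range(3):
    counts, theta = em_iteration(cases, theta)
    print(counts, theta)
# the first iteration reproduces the expected counts 1.5 / 1.5 / 1.5 / 0.5 shown above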

Complete-data likelihood

For an incomplete data set (cases over A1, ..., A6 with some "?" entries), the incomplete-data likelihood P(D | θ) sums over all possible completions of the missing values and no longer decomposes.

Assume a complete data set exists whose observed part is D: the corresponding complete-data likelihood factorizes as in the fully observed case and is easy to maximize (counting).

EM Algorithm - Abstract

Expectation Step: complete the data, i.e. compute the expected counts given the current parameters.
Maximization Step: re-estimate the parameters from the expected counts.

EM Algorithm - Principle

Expectation Maximization (EM): construct a new function Q(θ, θ^k) based on the current point θ^k (a function which behaves well).
Property: the maximum θ^{k+1} of the new function has a better score under the incomplete-data likelihood P(Y | θ) than the current point.

[Figure: the incomplete-data likelihood P(Y | θ) as a curve over θ; the auxiliary function Q(θ, θ^k) is built at the current point θ^k, and its maximum defines the next iterate θ^{k+1}.]
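A standard way to write the auxiliary function (the formula is not legible in the extracted slides), with Y the observed data, Z the missing values, and θ^k the current parameters:

\[
Q(\theta, \theta^k) = E_{Z \mid Y, \theta^k}\big[\log P(Y, Z \mid \theta)\big]
\]

E-step: compute Q(θ, θ^k). M-step: θ^{k+1} = argmax_θ Q(θ, θ^k). One can show that log P(Y | θ^{k+1}) ≥ log P(Y | θ^k).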

EM for Multinomials

Random variable V with values 1, ..., K. As in the complete-data case, but with expected counts:

MLE:
\[
\hat{\theta}_k = \frac{EN_k}{\sum_{l=1}^{K} EN_l},
\qquad
EN_k = \sum_{m=1}^{M} P(V = k \mid x_m, \theta^{\mathrm{old}})
\]

where EN_k are the expected counts of state k in the data.

EM for Conditional Multinomials

As before, there is a separate multinomial for each joint state pa of the parents of V; the MLE uses the expected counts:

\[
\hat{\theta}_{k \mid pa} = \frac{EN(k, pa)}{\sum_{k'} EN(k', pa)}
\]

Learning Parameters: incomplete data

Non-decomposable likelihood (missing values, hidden nodes).

Data (cases over S, X, D, C, B; "?" marks a missing value) and initial parameters (the current model):
<?, 0, 1, 0, 1>
<1, 1, ?, 0, 1>
<0, 0, 0, ?, ?>
<?, ?, 0, ?, 1>

EM algorithm: iterate until convergence.

Expectation: using the current model, complete each data case by inference, e.g. P(S | X=0, D=1, C=0, B=1) for the first case; this yields expected counts for every variable.
Maximization: update the parameters (ML or MAP) from the expected counts.

1. Initialize the parameters.
2. Compute pseudo counts for each variable (inference, e.g. with the junction tree algorithm).
3. Set the parameters to the (completed-data) ML estimates.
4. If not converged, iterate to 2.
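A schematic sketch of the four steps above for a discrete Bayesian network with fixed structure (not the course's code; the data structures and the inference routine infer_family_marginal are assumptions for illustration, standing in for e.g. one junction tree pass per case).

from collections import defaultdict

def em(data, families, num_states, infer_family_marginal, theta0, iterations=50):
    """families: iterable of variable indices; num_states: dict j -> number of states.

    infer_family_marginal(j, case, theta) is a hypothetical inference routine returning
    a dict {(k, pa): P(X_j = k, Pa_j = pa | observed part of case, theta)}.
    """
    theta = theta0                                            # 1. initialize parameters
    for _ in range(iterations):
        counts = {j: defaultdict(float) for j in families}    # 2. expected (pseudo) counts
        for case in data:
            for j in families:
                for (k, pa), p in infer_family_marginal(j, case, theta).items():
                    counts[j][(k, pa)] += p
        theta = {                                             # 3. completed-data ML estimates
            j: {(k, pa): c / sum(counts[j].get((k2, pa), 0.0) for k2 in range(num_states[j]))
                for (k, pa), c in counts[j].items()}
            for j in families
        }
        # 4. a convergence check (small change in log-likelihood or parameters) would go here
    return theta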

Monotonicity

(Dempster, Laird, Rubin 1977): the incomplete-data likelihood function is not decreased after an EM iteration.
Main result for (discrete) Bayesian networks: for any initial, non-uniform value the EM algorithm converges to a (local or global) maximum.

LL on training set (Alarm) / Parameter values (Alarm)

[Figures: log-likelihood on the training set and parameter values over EM iterations on the Alarm network; experiment by Bauer, Koller and Singer [UAI97].]

EM in Practice

Initial parameters:
- Random parameter setting
- Best guess from another source

Stopping criteria:
- Small change in the likelihood of the data
- Small change in the parameter values

Avoiding bad local maxima:
- Multiple restarts
- Early pruning of unpromising ones

Speed up:
- Various methods to speed convergence

Gradient Ascent

Requires the same BN inference computations as EM.
Pros:
- Flexible
- Closely related to methods used in neural network training
Cons:
- Need to project the gradient onto the space of legal parameters
- To get reasonable convergence, it must be combined with smart optimization techniques
- Needs the junction tree to do multiple inference
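For completeness, a standard form of the gradient used here (not shown on the extracted slides), for a CPT entry θ_{k|pa} = P(X_j = k | Pa_j = pa):

\[
\frac{\partial\, LL(\theta : D)}{\partial\, \theta_{k \mid pa}}
  = \sum_{m=1}^{M} \frac{P(X_j = k, \mathrm{Pa}_j = pa \mid x_m, \theta)}{\theta_{k \mid pa}}
\]

so each gradient step needs the same family marginals that the EM E-step computes.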

Parameter Estimation: Summary

Parameter estimation is a basic task for learning with Bayesian networks.
Fully observed data: counting.
Partially observed data: due to the missing values it becomes a non-linear optimization problem (EM, gradient ascent, ...), based on pseudo counts; EM for multinomial random variables.
