Bayesian Data Mining

University of Belgrade
School of Electrical Engineering
Department of Computer Engineering and Information Theory

Marko Stupar 11/3370
Problem

(Figure: a huge data table - rows Value 1, Value 2, ..., Value 100000 - with a Target column whose values need to be filled in)

We need to classify, estimate, predict in real time.
Problem importance
Find relation between:
All Diseases,
All Medications,
All Symptoms,
Existing solutions

CART, C4.5
  Too many iterations
  Continuous arguments need binning
Rule induction
  Continuous arguments need binning
Neural networks
  High computational time
K-nearest neighbor
  Output depends only on distance-based close values
Naive Bayes

Often does surprisingly well
May not be the best possible classifier
Robust, fast; it can usually be relied on to produce reasonable estimations
Training Set

(Table: known instances with columns Attribute 1, Attribute 2, ..., Attribute n and a Target column, each row filled with values)

New instance: Attribute 1 = a1, Attribute 2 = a2, ..., Attribute n = an; Target missing.
How to calculate?

$$T = \arg\max_t P(T=t \mid A_1 \ldots A_n) = \arg\max_t \frac{P(A_1 \ldots A_n \mid T=t)\, P(T=t)}{P(A_1 \ldots A_n)}$$

By the chain rule:

$$P(A_1 \ldots A_n \mid T) = P(A_1 \ldots A_{n-1} \mid A_n, T)\, P(A_n \mid T)$$
$$= P(A_1 \ldots A_{n-2} \mid A_{n-1}, A_n, T)\, P(A_{n-1} \mid A_n, T)\, P(A_n \mid T) = \prod_i P(A_i \mid A_{i+1} \ldots A_n, T)$$

Naive assumption:

$$P(A_i \mid A_{i+1} \ldots A_n, T) = P(A_i \mid T) \quad\Rightarrow\quad P(A_1 \ldots A_n \mid T) = \prod_i P(A_i \mid T)$$

$$T = \arg\max_t P(T=t) \cdot \prod_i P(A_i \mid T=t)$$
Naive Bayes
Discrete Target - Example

 #   Age     Income  Student  Credit     Target: Buys Computer
 1   Youth   High    No       Fair       No
 2   Youth   High    No       Excellent  No
 3   Middle  High    No       Fair       Yes
 4   Senior  Medium  No       Fair       Yes
 5   Senior  Low     Yes      Fair       Yes
 6   Senior  Low     Yes      Excellent  No
 7   Middle  Low     Yes      Excellent  Yes
 8   Youth   Medium  No       Fair       No
 9   Youth   Low     Yes      Fair       Yes
10   Senior  Medium  Yes      Fair       Yes
11   Youth   Medium  Yes      Excellent  Yes
12   Middle  Medium  No       Excellent  Yes
13   Middle  High    Yes      Fair       Yes
14   Senior  Medium  No       Excellent  No

Find the most probable Target value given the Attributes.
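A minimal Python sketch (my own illustration, not part of the slides) that implements the argmax formula from the derivation above on this table, classifying a hypothetical new instance (Youth, Medium, Yes, Fair):

```python
from collections import Counter, defaultdict

# Training set from the slide: (Age, Income, Student, Credit) -> Buys Computer
rows = [
    ("Youth","High","No","Fair","No"),        ("Youth","High","No","Excellent","No"),
    ("Middle","High","No","Fair","Yes"),      ("Senior","Medium","No","Fair","Yes"),
    ("Senior","Low","Yes","Fair","Yes"),      ("Senior","Low","Yes","Excellent","No"),
    ("Middle","Low","Yes","Excellent","Yes"), ("Youth","Medium","No","Fair","No"),
    ("Youth","Low","Yes","Fair","Yes"),       ("Senior","Medium","Yes","Fair","Yes"),
    ("Youth","Medium","Yes","Excellent","Yes"), ("Middle","Medium","No","Excellent","Yes"),
    ("Middle","High","Yes","Fair","Yes"),     ("Senior","Medium","No","Excellent","No"),
]

target_counts = Counter(r[-1] for r in rows)   # counts for P(T = t)
attr_counts = defaultdict(Counter)             # counts for P(A_i = a | T = t)
for *attrs, t in rows:
    for i, a in enumerate(attrs):
        attr_counts[(i, t)][a] += 1

def classify(attrs):
    """Return argmax_t P(T=t) * prod_i P(A_i | T=t), estimated by relative frequency."""
    scores = {}
    for t, n_t in target_counts.items():
        score = n_t / len(rows)
        for i, a in enumerate(attrs):
            score *= attr_counts[(i, t)][a] / n_t
        scores[t] = score
    return max(scores, key=scores.get), scores

print(classify(("Youth", "Medium", "Yes", "Fair")))
# 'Yes': 9/14 * 2/9 * 4/9 * 6/9 * 6/9 ~ 0.0282
# 'No':  5/14 * 3/5 * 2/5 * 1/5 * 2/5 ~ 0.0069  -> classify as Yes
```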
Naive Bayes
Discrete Target - Spam filter

Attributes = Text Document = array of words w1, w2, w3
Target = Spam = [Yes | No] ?

$$\prod_i p(w_i \mid Spam)$$ - probability that all words of the document occur in Spam documents of the training set
Naive Bayes
Discrete Target - Spam filter

$$p(Spam \mid Attributes[w_1, w_2, \ldots]) \propto p(Spam) \cdot \prod_i p(w_i \mid Spam)$$

$$BF = \frac{p(Spam \mid Attributes[w_1, w_2, \ldots])}{p(\neg Spam \mid Attributes[w_1, w_2, \ldots])}$$ - Bayes factor: classify as Spam when BF > 1
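A minimal sketch of such a filter in Python (my own illustration, not from the slides): it estimates p(w|Spam) and p(w|not Spam) by relative frequency on an invented toy corpus and classifies by the Bayes factor. I add Laplace (add-one) smoothing, which the slide does not mention, so unseen words don't zero out the product:

```python
from collections import Counter

# Toy training corpus: (document words, is_spam). Invented for illustration.
docs = [
    (["cheap", "pills", "buy"], True),
    (["buy", "now", "cheap"], True),
    (["meeting", "tomorrow", "notes"], False),
    (["project", "notes", "buy"], False),
]

spam_docs = [set(words) for words, s in docs if s]
ham_docs = [set(words) for words, s in docs if not s]
p_spam = len(spam_docs) / len(docs)

def word_prob(w, class_docs):
    # p(w | class): fraction of class documents containing w, with add-one smoothing
    return (sum(w in d for d in class_docs) + 1) / (len(class_docs) + 2)

def bayes_factor(words):
    """BF = p(Spam) prod p(w|Spam) / (p(!Spam) prod p(w|!Spam)); spam if BF > 1."""
    num, den = p_spam, 1 - p_spam
    for w in words:
        num *= word_prob(w, spam_docs)
        den *= word_prob(w, ham_docs)
    return num / den

print(bayes_factor(["cheap", "buy"]))      # 4.5  > 1 -> classify as Spam
print(bayes_factor(["meeting", "notes"]))  # 0.17 < 1 -> classify as not Spam
```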
Naive Bayes
Continuous Attributes - Example

Training set:

sex     height (feet)   weight (lbs)   foot size (inches)
male    6               180            12
male    5.92            190            11
male    5.58            170            12
male    5.92            165            10
female  5               100            6
female  5.5             150            8
female  5.42            130            7
female  5.75            150            9

Validation set:

sex   height (feet)   weight (lbs)   foot size (inches)
?     6               130            8

Treat each attribute as Gaussian within a class, estimating the per-class mean and variance from the training set:

$$\mu = \frac{1}{n}\sum_{i=1}^{n} X_i \qquad \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2$$

Compare $p(male \mid h=6, w=130, f=8)$ with $p(female \mid h=6, w=130, f=8)$:

                      Target = male            Target = female
                      mean       variance      mean       variance
height (feet)         5.855      0.026275      5.4175     0.07291875
weight (lbs)          176.25     92.1875       132.5      418.75
foot size (inches)    11.25      0.6875        7.5        1.25
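A small Python sketch (mine, not from the slides) that reproduces this computation: per-class Gaussian densities using the population mean/variance formulas above, compared for the validation instance (6, 130, 8):

```python
import math

# Training set from the slide: (height ft, weight lbs, foot size in) per class
data = {
    "male":   [(6.00, 180, 12), (5.92, 190, 11), (5.58, 170, 12), (5.92, 165, 10)],
    "female": [(5.00, 100, 6), (5.50, 150, 8), (5.42, 130, 7), (5.75, 150, 9)],
}

def stats(values):
    # Population mean and variance (1/n), matching the slide's formulas
    n = len(values)
    mu = sum(values) / n
    var = sum((x - mu) ** 2 for x in values) / n
    return mu, var

def gaussian(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def score(cls, instance):
    """p(cls) * prod_i N(x_i; mu_i, var_i) - proportional to the posterior."""
    prior = len(data[cls]) / sum(len(v) for v in data.values())  # 0.5 here
    s = prior
    for i, x in enumerate(instance):
        mu, var = stats([row[i] for row in data[cls]])
        s *= gaussian(x, mu, var)
    return s

for cls in data:
    print(cls, score(cls, (6, 130, 8)))
# female scores far higher: the low weight and small foot size outweigh the tall height
```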
Naive Bayes
Continuous Target

Predict the target as its posterior expectation:

$$T = \sum_i P(T = t_i \mid A_1 \ldots A_n) \cdot t_i$$
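For instance (an illustrative calculation of mine, not from the slides): if the posterior over target values 10, 20, 30 is 0.2, 0.5, 0.3, the prediction is

$$T = 0.2 \cdot 10 + 0.5 \cdot 20 + 0.3 \cdot 30 = 21$$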
Bayesian Network

Is the naive assumption really needed for data mining?

$$P(A_i \mid A_{i+1}, \ldots, A_n, T) = P(A_i \mid T) \;?$$
Bayesian Network

A Bayesian network is a directed acyclic graph.

(Figure: a DAG over nodes A1, ..., A7 and Target; each node carries a conditional probability given its parents, e.g. P(A2 | Target), P(A3 | Target, A4, A6), P(A6 | A4, A5))
Bayesian Network
What to do

Construct Network - from the Training Set
Read Network - obtain P(A1 ... An, T)

(Figure: flow from Training Set to Construct Network, then Read Network yielding P(A1 ... An, T); Naive Bayes shown as a special case)
Bayesian Network
Read Network - Joint Probability Distribution

$$P(A_1 \ldots A_n) = ?$$

Chain rule of probability:

$$P(A_1 \ldots A_n) = \prod_i P(A_i \mid A_{i+1} \ldots A_n)$$

Assumption - each node depends only on its parents:

$$P(A_1 \ldots A_n) = \prod_i P(A_i \mid ParentsOf(A_i))$$

Conditional queries:

$$P(A_1 \ldots A_n \mid B_1 \ldots B_m) = \frac{P(A_1 \ldots A_n, B_1 \ldots B_m)}{P(B_1 \ldots B_m)}$$

(Figure: node A7 with parents A2 and A5 - A7 depends only on A2 and A5)
Bayesian Network
Read Network - Example

(Figure: DAG with Medication (M) and Trauma (T) as parents of Blood Clot (B); B is the parent of Heart Attack (H), Stroke (S) and Nothing (N))

P(M) = 0.2    P(!M) = 0.8
P(T) = 0.05   P(!T) = 0.95

 M  T    P(B)   P(!B)
 +  +    0.95   0.05
 +  -    0.3    0.7
 -  +    0.6    0.4
 -  -    0.9    0.1

 B    P(H)   P(!H)    P(S)   P(!S)    P(N)   P(!N)
 +    0.4    0.6      0.35   0.65     0.25   0.75
 -    0.15   0.85     0.1    0.9      0.7    0.3

$$P(N, B, M, T) = P(N \mid B)\, P(B \mid M, T)\, P(M)\, P(T)$$

How to get P(N|B), P(B|M,T)?
- Expert knowledge
- From data (relative frequency estimates)
- Or a combination of both
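To show how the network is "read", here is a small Python sketch (my own, using the slide's CPTs as reconstructed above) that evaluates the factorization P(N,B,M,T) = P(N|B) P(B|M,T) P(M) P(T) and marginalizes it to answer a query:

```python
from itertools import product

# CPTs from the slide (True = the event occurs)
P_M = {True: 0.2, False: 0.8}
P_T = {True: 0.05, False: 0.95}
P_B = {(True, True): 0.95, (True, False): 0.3,   # P(B=true | M, T)
       (False, True): 0.6, (False, False): 0.9}
P_N = {True: 0.25, False: 0.7}                   # P(N=true | B)

def p_b(b, m, t):
    return P_B[(m, t)] if b else 1 - P_B[(m, t)]

def p_n(n, b):
    return P_N[b] if n else 1 - P_N[b]

def joint(n, b, m, t):
    """P(N,B,M,T) = P(N|B) * P(B|M,T) * P(M) * P(T)"""
    return p_n(n, b) * p_b(b, m, t) * P_M[m] * P_T[t]

# Marginal P(N=true): sum the joint over all values of the other variables
p_n_true = sum(joint(True, b, m, t) for b, m, t in product([True, False], repeat=3))

# Conditional query: P(M=true | N=true) = P(M=true, N=true) / P(N=true)
p_m_and_n = sum(joint(True, b, True, t) for b, t in product([True, False], repeat=2))
print(p_n_true, p_m_and_n / p_n_true)
```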
Bayesian Network
Construct Network

Manually
From a database, automatically:
Heuristic algorithms
1. use a heuristic search method to construct a model
2. evaluate the model using a scoring method:
  Bayesian scoring method
  entropy based method
  minimum description length method
Bayesian Network
Construct Network

Heuristic algorithms
  Advantage: lower time complexity in the worst case
  Disadvantage: may not find the best solution, due to the heuristic nature of the search
Algorithms that analyze dependency among nodes
  Advantage: usually asymptotically correct
  Disadvantage: CI tests with large condition-sets may be unreliable unless the volume of data is enormous
Bayesian Network
Construct Network - Example

1. Choose an ordering of variables X1, ..., Xn
2. For i = 1 to n:
   add Xi to the network
   select as parents the minimal subset of X1, ..., Xi-1 such that P(Xi | Parents(Xi)) = P(Xi | X1, ..., Xi-1)

(Figure: burglary network built with the ordering MaryCalls, JohnCalls, Alarm, Burglary, Earthquake)

P(J | M) = P(J)?  No
P(A | J, M) = P(A | J)?  P(A | J, M) = P(A)?  No
P(B | A, J, M) = P(B | A)?  Yes
P(B | A, J, M) = P(B)?  No
P(E | B, A, J, M) = P(E | A)?  No
P(E | B, A, J, M) = P(E | A, B)?  Yes
Bayesian Network
d-separation

Two nodes are d-separated by a set of nodes Z if every undirected path between them is blocked.
Example of path: N1 <- N2 -> N3 -> N4 -> N5 <- N6 <- N7
N5 is a head-to-head node.
A path is not blocked if every head-to-head node on it is in the Z-set or has a descendant in the Z-set, and no other node on the path is in the Z-set.
Bayesian Network
d-separation - Example

(Figure: DAG over nodes A, B, C, D, E, F; from the answers, the two undirected paths from C to F are C - B - E - F and C - B - A - D - E - F, with E head-to-head)

1. Does D d-separate C and F?
There are two undirected paths from C to F:
(i) C - B - E - F. This path is blocked by the node E, since E is not one of the given nodes and has both arrows on the path going into it.
(ii) C - B - A - D - E - F. This path is also blocked, by E (and by D as well).
So, D does d-separate C and F.

2. Do D and E d-separate C and F?
The path C - B - E - F is no longer blocked, since the head-to-head node E is now one of the given nodes. So, D and E do not d-separate C and F.

3. Write down all pairs of nodes which are independent of each other.
Nodes which are independent are those that are d-separated by the empty set. Every path into F must pass through the head-to-head node E, so F is d-separated from the rest of the graph. We find that F is independent of A, of B, of C and of D. All other pairs of nodes are dependent on each other.

4. Which pairs of nodes are independent of each other given B?
We need to find which nodes are d-separated by B. C is d-separated from all the other nodes (except B) given B; A and D are also d-separated from F given B. The independent pairs given B are hence: AC, AF, CD, CE, CF, DF.

5. Do we have P(A, F | E) = P(A | E) P(F | E)? (Are A and F independent given E?)
No: A and F are NOT independent given E. Every path between them must contain at least one node with both path arrows going into it, which is E in the current context; once E is given, it no longer blocks the path.
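As an illustration (my own sketch, not from the slides), here is a Python check of whether a single undirected path is blocked by a given set Z, using the head-to-head rule from the previous slide. The edge directions are my assumption, chosen to be consistent with the answers above since the original figure is lost, and descendants of colliders are ignored for brevity:

```python
# Directed edges (u, v) meaning u -> v; assumed, consistent with the slide's answers
edges = {("B", "C"), ("B", "A"), ("A", "D"), ("B", "E"), ("D", "E"), ("F", "E")}

def is_collider(prev, node, nxt):
    # head-to-head: both path arrows point into `node`
    return (prev, node) in edges and (nxt, node) in edges

def path_blocked(path, given):
    """Blocked if some interior node is a collider outside `given`
    (descendants not considered here), or a non-collider inside `given`."""
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        collider = is_collider(prev, node, nxt)
        if (collider and node not in given) or (not collider and node in given):
            return True
    return False

print(path_blocked(["C", "B", "E", "F"], given={"D"}))       # True: collider E not in Z
print(path_blocked(["C", "B", "E", "F"], given={"D", "E"}))  # False: collider E is given
print(path_blocked(["C", "B", "A", "D", "E", "F"], {"D"}))   # True: blocked by D (and E)
```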
Bayesian Network
Construct Network - Mutual information

$$I(X, Y) = \sum_{x,y} P(x, y) \log \frac{P(x, y)}{P(x)\, P(y)}$$

$$I(X, Y \mid Z) = \sum_{x,y,z} P(x, y, z) \log \frac{P(x, y \mid z)}{P(x \mid z)\, P(y \mid z)}$$

If I(X, Y | Z) is smaller than a threshold ε for some condition set Z, we say that X and Y are conditionally independent given Z.
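A quick sketch (my own, not from the slides) of estimating I(X, Y) from samples by relative frequencies, which is how the construction algorithms below test dependence; the toy arrays are invented:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X,Y) = sum_{x,y} P(x,y) * log( P(x,y) / (P(x) P(y)) ), estimated from samples."""
    n = len(xs)
    p_xy = Counter(zip(xs, ys))
    p_x, p_y = Counter(xs), Counter(ys)
    return sum((nxy / n) * math.log((nxy / n) / ((p_x[x] / n) * (p_y[y] / n)))
               for (x, y), nxy in p_xy.items())

xs = [0, 0, 1, 1, 0, 0, 1, 1]
ys = [0, 1, 0, 1, 0, 1, 0, 1]
print(mutual_information(xs, xs))  # log 2 ~ 0.693: fully dependent
print(mutual_information(xs, ys))  # 0.0: independent in this sample
```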
Backup Slides
(Not needed)
Naive Bayes

Very fast
Very robust
The Target node is the parent of all other nodes
A low number of probabilities to be estimated
Knowing the value of the Target makes each node independent
Models:
  Pruned Naive Bayes (Naive Bayes Build)
  Simplified decision tree (Single Feature Build)
  Boosted (Multi Feature Build)
$$I(X, Y \mid Z) = \sum_{x,y,z} P(x, y, z) \log \frac{P(x, y \mid z)}{P(x \mid z)\, P(y \mid z)}$$

(b) Build a complete undirected graph in which the vertices are the attributes A1, A2, ...
Create Network from database
Augmented Markov Blanket
Phase I (Drafting):
1. Initiate a graph G(V, E) where V = {all the nodes of a data set}, E = { }. Initiate two empty ordered sets S, R.
2. For each pair of nodes (vi, vj) where vi, vj ∈ V, compute the mutual information I(vi, vj) using equation (1). For the pairs of nodes that have mutual information greater than a certain small value ε, sort them by their mutual information from large to small and put them into the ordered set S.
3. Get the first two pairs of nodes in S and remove them from S. Add the corresponding arcs to E. (The direction of the arcs in this algorithm is determined by the previously available node ordering.)
4. Get the first pair of nodes remaining in S and remove it from S. If there is no open path between the two nodes (i.e. these two nodes are d-separated given the empty set), add the corresponding arc to E; otherwise, add the pair of nodes to the end of the ordered set R.

Phase II (Thickening):
7. Find a block set that blocks each open path between these two nodes by a set of a minimum number of nodes. (The procedure find_block_set(current graph, node1, node2) is given at the end of this subsection.) Conduct a CI test. If these two nodes are still dependent on each other given the block set, connect them by an arc.

Phase III (Thinning):
9. For each arc in E, if there are open paths between the two nodes besides this arc, remove this arc from E temporarily and call find_block_set(current graph, node1, node2). Conduct a CI test on the condition of the block set. If the two nodes are dependent, add this arc back to E; otherwise remove the arc permanently.
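A rough Python sketch of the drafting idea (my own simplification, not the paper's full algorithm): score all pairs by mutual information, then greedily add arcs, deferring pairs that already have an open path between them. The "open path" test here is plain undirected connectivity, which is a simplification, and the toy data is invented:

```python
import itertools, math
from collections import Counter

def mi(xs, ys):
    # empirical mutual information I(X,Y)
    n = len(xs)
    p_xy, p_x, p_y = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * math.log(c * n / (p_x[x] * p_y[y]))
               for (x, y), c in p_xy.items())

def draft(columns, data, eps=0.01):
    """Phase-I-style drafting: data[c] is the list of samples for attribute c."""
    # score all pairs by mutual information and sort from large to small
    pairs = sorted(((mi(data[a], data[b]), a, b)
                    for a, b in itertools.combinations(columns, 2)), reverse=True)
    pairs = [(m, a, b) for m, a, b in pairs if m > eps]

    edges, deferred = set(), []            # E and R from the slide
    reachable = {c: {c} for c in columns}  # connected components
    for m, a, b in pairs:
        if b in reachable[a]:              # an open path already exists: defer to R
            deferred.append((a, b))
            continue
        edges.add((a, b))                  # no open path: add the arc to E
        merged = reachable[a] | reachable[b]
        for c in merged:
            reachable[c] = merged
    return edges, deferred

cols = ["A", "B", "C"]
data = {"A": [0, 0, 1, 1], "B": [0, 0, 1, 1], "C": [0, 1, 0, 1]}
print(draft(cols, data))  # -> ({('A', 'B')}, [])
```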
Bayesian Network
Applications

1. Gene regulatory networks
2. Protein structure
3. Diagnosis of illness
4. Document classification
5. Image processing
6. Data fusion
7. Decision support systems
8. Gathering data for deep space exploration
9. Artificial Intelligence
10. Prediction of weather
11. On a more familiar basis, Bayesian networks are used by the friendly Microsoft Office assistant to decide when to offer help
Bayesian Network
Advantages, Limits

The advantages of Bayesian Networks:
- Visually represent all the relationships between the variables
- Easy to recognize the dependence and independence between nodes
- Can handle incomplete data - scenarios where it is not practical to measure all variables (costs, not enough sensors, etc.)
- Help to model noisy systems
- Can be used for any system model - from all known parameters to no known parameters

The limitations of Bayesian Networks:
- All branches must be calculated in order to calculate the probability of any one branch
- The quality of the results of the network depends on the quality of the prior beliefs or model
- Calculation can be NP-hard
- Calculations and probabilities using Bayes' rule and marginalization can become complex and are often characterized by subtle wording; care must be taken to calculate them properly
36/40
Bayesian Network
Software
Bayesia Lab
Weka - Machine Learning Software in Java
AgenaRisk , Analytica, Banjo, Bassist, Bayda,
37/40
Problem Trend
History

The term "Bayesian networks" was coined by Judea Pearl in 1985.
In the late 1980s the seminal texts Probabilistic Reasoning in Intelligent Systems and Probabilistic Reasoning in Expert Systems summarized the properties of Bayesian networks.

Fields of Expansion
  Naive Bayes
  Bayesian Networks - finding new ways to construct the network
Literature

- Rutgers University
- Dept. of Computing Science, University of Alberta, Alberta T6G 2H1, Email: jcheng@cs.ualberta.ca; David Bell, Weiru Liu, Faculty of Informatics, University of Ulster, UK BT37 0QB, Email: {w.liu, da.bell}@ulst.ac.uk
- http://www.bayesia.com/en/products/bayesialab/tutorial.php
- ISyE8843A, Brani Vidakovic, Handout 17: Bayesian Networks
- Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, Dan Steinberg. Received: 9 July 2007 / Revised: 28 September 2007 / Accepted: 8 October 2007 / Published online: 4 December 2007. Springer-Verlag London Limited 2007
- Causality, Computational Systems Biology Lab, Arizona State University, Michael Verdicchio, with some slides and slide content from Judea Pearl, Chitta Baral, Xin Zhang
Questions?