
Bayesian Models and Logistic Regression

Probability theory as basis for the construction of classifiers:

• Multivariate probabilistic models
• Independence assumptions
• Naive Bayesian classifier
• Forest-augmented networks (FANs)
• Approximation: logistic regression

Notation

• Random (= statistical = stochastic) variable: upper-case letter, e.g. V or X, or upper-case string, e.g. RAIN
• Binary variables: take one of two values, X = true (abbreviated x) and X = false (abbreviated ¬x)
• Conjunctions: (X = x) ∧ (Y = y) is written X = x, Y = y
• Templates: X, Y means X = x, Y = y, for any values x, y, i.e. the choice of the values x and y does not really matter
• ∑_X P(X) = P(x) + P(¬x), where X is binary

Joint probability distribution

Joint (= multivariate) distribution:

    P(X1, X2, ..., Xn)

Example of a joint probability distribution P(X1, X2, X3):

    P(x1, x2, x3) = 0.1
    P(¬x1, x2, x3) = 0.05
    P(x1, ¬x2, x3) = 0.10
    P(x1, x2, ¬x3) = 0.0
    P(¬x1, ¬x2, x3) = 0.3
    P(x1, ¬x2, ¬x3) = 0.2
    P(¬x1, x2, ¬x3) = 0.1
    P(¬x1, ¬x2, ¬x3) = 0.15

Note that: ∑_{X1,X2,X3} P(X1, X2, X3) = 1

Marginalisation:

    P(x3) = ∑_{X1,X2} P(X1, X2, x3) = 0.1 + 0.05 + 0.10 + 0.3 = 0.55

Chain rule

Definition of conditional probability distribution:

    P(X1 | X2, ..., Xn) = P(X1, X2, ..., Xn) / P(X2, ..., Xn)

so that

    P(X1, X2, ..., Xn) = P(X1 | X2, ..., Xn) P(X2, ..., Xn)

Furthermore,

    P(X2, ..., Xn) = P(X2 | X3, ..., Xn) P(X3, ..., Xn)
    ...
    P(Xn−1, Xn) = P(Xn−1 | Xn) P(Xn)
    P(Xn) = P(Xn)

The chain rule thus yields the factorisation:

    P(⋀_{i=1}^n Xi) = ∏_{i=1}^n P(Xi | ⋀_{k=i+1}^n Xk)
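The marginalisation step is simple enough to check mechanically. Below is a minimal sketch (the dict-based representation and variable names are mine, not from the slides) that stores the example joint distribution and sums out X1 and X2:

```python
# Joint distribution P(X1, X2, X3) from the example, keyed by
# (x1, x2, x3) truth values; True stands for x, False for ¬x.
joint = {
    (True,  True,  True):  0.10,
    (False, True,  True):  0.05,
    (True,  False, True):  0.10,
    (True,  True,  False): 0.00,
    (False, False, True):  0.30,
    (True,  False, False): 0.20,
    (False, True,  False): 0.10,
    (False, False, False): 0.15,
}

# A joint distribution must sum to 1.
assert abs(sum(joint.values()) - 1.0) < 1e-9

# Marginalisation: P(x3) = sum over X1, X2 of P(X1, X2, x3).
p_x3 = sum(p for (x1, x2, x3), p in joint.items() if x3)
print(p_x3)  # 0.55
```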
Chain rule – digraph

[Figure: two digraphs over X1, X2, X3, one for each factorisation below.]

Factorisation (1):

    P(X1, X2, X3) = P(X1 | X2, X3) P(X2 | X3) P(X3)

Other factorisation (2):

    P(X1, X2, X3) = P(X2 | X1, X3) P(X1 | X3) P(X3)

⇒ different factorisations possible

Does the chain rule help?

    P(X1, X2, X3) = P(X1 | X2, X3) P(X2 | X3) P(X3)

i.e. we need:

    P(x1 | x2, x3)
    P(x1 | ¬x2, x3)
    P(x1 | x2, ¬x3)
    P(x1 | ¬x2, ¬x3)
    P(x2 | x3)
    P(x2 | ¬x3)
    P(x3)

Note P(¬x1 | x2, x3) = 1 − P(x1 | x2, x3), etc.

⇒ 7 probabilities required (just as for P(X1, X2, X3))

Use stochastic independence

    P(X1, X2, X3) = P(X2 | X1, X3) P(X3 | X1) P(X1)

Assume that X2 and X3 are conditionally independent given X1:

    P(X2 | X1, X3) = P(X2 | X1)

and

    P(X3 | X1, X2) = P(X3 | X1)

Notation: X2 ⫫ X3 | X1, X3 ⫫ X2 | X1

[Figure: digraphs over X1, X2, X3, before and after dropping the arc between X2 and X3.]

Only 5 = 2 + 2 + 1 probabilities (instead of 7)

Definition Bayesian Network (BN)

A Bayesian network B is a pair B = (G, P), where:

• G = (V(G), A(G)) is an acyclic directed graph, with
  – V(G) = {X1, X2, ..., Xn}, a set of vertices (nodes)
  – A(G) ⊆ V(G) × V(G), a set of arcs
• P : ℘(V(G)) → [0, 1] is a joint probability distribution, such that

    P(V(G)) = ∏_{i=1}^n P(Xi | π_G(Xi))

  where π_G(Xi) denotes the set of immediate ancestors (parents) of vertex Xi in G
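As a sanity check on the parameter count, here is a sketch (the concrete numbers are invented for illustration, not taken from the slides) that stores only the 5 probabilities of the independence-based factorisation and still reconstructs every joint entry:

```python
# Factorisation P(X1, X2, X3) = P(X2 | X1) P(X3 | X1) P(X1),
# assuming X2 ⫫ X3 | X1. Only 5 numbers are needed; the
# values below are made up for illustration.
p_x1 = 0.3                                 # P(x1); P(¬x1) = 1 - 0.3
p_x2_given_x1 = {True: 0.9, False: 0.4}    # P(x2 | X1)
p_x3_given_x1 = {True: 0.2, False: 0.7}    # P(x3 | X1)

def joint(x1: bool, x2: bool, x3: bool) -> float:
    """P(X1=x1, X2=x2, X3=x3) via the factorised form."""
    p1 = p_x1 if x1 else 1 - p_x1
    p2 = p_x2_given_x1[x1] if x2 else 1 - p_x2_given_x1[x1]
    p3 = p_x3_given_x1[x1] if x3 else 1 - p_x3_given_x1[x1]
    return p1 * p2 * p3

# All 8 joint entries still sum to 1, from only 5 parameters.
total = sum(joint(a, b, c) for a in (True, False)
            for b in (True, False) for c in (True, False))
assert abs(total - 1.0) < 1e-9
```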


Example Bayesian network

Variables:

• Flu (FL): yes/no
• Pneumonia (PN): yes/no
• Myalgia (MY): yes/no
• Fever (FE): yes/no
• TEMP: ≤ 37.5 / > 37.5

Joint probability distribution: P(FL, PN, MY, FE, TEMP)

Assessed probabilities:

    P(FL = y) = 0.1
    P(PN = y) = 0.05
    P(MY = y | FL = y) = 0.96
    P(MY = y | FL = n) = 0.20
    P(FE = y | FL = y, PN = y) = 0.95
    P(FE = y | FL = n, PN = y) = 0.80
    P(FE = y | FL = y, PN = n) = 0.88
    P(FE = y | FL = n, PN = n) = 0.001
    P(TEMP ≤ 37.5 | FE = y) = 0.1
    P(TEMP ≤ 37.5 | FE = n) = 0.99

[Figure: digraph with arcs FL → MY, FL → FE, PN → FE, FE → TEMP.]

Reasoning: evidence propagation

• Nothing known: [figure: marginal distributions of all five variables]
• Temperature > 37.5 °C: [figure: posterior distributions given TEMP > 37.5]
• Likely symptoms of the flu? [figure: distributions given FL = yes]
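To make the evidence-propagation step concrete, here is a minimal sketch (my own code, not from the slides) that answers the query P(FL = y | TEMP > 37.5) by brute-force enumeration over the joint distribution defined by the network above:

```python
from itertools import product

# Conditional probability tables of the example network.
# Each variable is boolean; temp_high=True means TEMP > 37.5.
P_FL = 0.1
P_PN = 0.05
P_MY_given_FL = {True: 0.96, False: 0.20}
P_FE_given_FL_PN = {(True, True): 0.95, (False, True): 0.80,
                    (True, False): 0.88, (False, False): 0.001}
P_TEMP_high_given_FE = {True: 0.9, False: 0.01}  # 1 - P(TEMP <= 37.5 | FE)

def joint(fl, pn, my, fe, temp_high):
    """P(FL, PN, MY, FE, TEMP) via the BN factorisation."""
    p = (P_FL if fl else 1 - P_FL) * (P_PN if pn else 1 - P_PN)
    p *= P_MY_given_FL[fl] if my else 1 - P_MY_given_FL[fl]
    p *= P_FE_given_FL_PN[fl, pn] if fe else 1 - P_FE_given_FL_PN[fl, pn]
    p *= (P_TEMP_high_given_FE[fe] if temp_high
          else 1 - P_TEMP_high_given_FE[fe])
    return p

# P(FL = y | TEMP > 37.5): sum out the unobserved variables.
num = sum(joint(True, pn, my, fe, True)
          for pn, my, fe in product([True, False], repeat=3))
den = sum(joint(fl, pn, my, fe, True)
          for fl, pn, my, fe in product([True, False], repeat=4))
print(num / den)  # posterior probability of flu given a high temperature
```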

Bayesian network structure learning

Bayesian network B = (G, P), with:

• digraph G = (V(G), A(G)), and
• probability distribution

    P(V) = ∏_{X ∈ V(G)} P(X | π(X))

Spectrum:

    naive Bayesian network → tree-augmented Bayesian network (TAN) → general Bayesian networks
    (restricted structure learning → unrestricted structure learning)

Special form Bayesian networks

Problems:

• for many BNs too many probabilities have to be assessed
• complex BNs do not necessarily yield better classifiers
• complex BNs may yield better estimates of the probability distribution

Solution: use simple probabilistic models for classification:

• naive (independent) form BN
• Tree-Augmented Bayesian Network (TAN)
• Forest-Augmented Bayesian Network (FAN)
Naive (independent) form BN

[Figure: class variable C with arcs to evidence variables E1, E2, ..., Em.]

• C is a class variable
• The evidence variables Ei in the evidence E ⊆ {E1, ..., Em} are conditionally independent given the class variable C

This yields, using Bayes' rule:

    P(C | E) = P(E | C) P(C) / P(E)

with, as Ei ⫫ Ej | C for i ≠ j:

    P(E | C) = ∏_{E ∈ E} P(E | C)        (by conditional independence)
    P(E) = ∑_C P(E | C) P(C)             (marginalisation & conditioning)

Classifier: c_max = argmax_C P(C | E)

Example of naive Bayes

[Figure: Disease (pneu/flu) with arcs to Myalgia (y/n), Fever (y/n), and TEMP (≤ 37.5 / > 37.5).]

    P(myalgia | flu) = 0.96        P(myalgia | pneu) = 0.28
    P(fever | flu) = 0.88          P(fever | pneu) = 0.82
    P(TEMP ≤ 37.5 | flu) = 0.20    P(TEMP ≤ 37.5 | pneu) = 0.26
    P(flu) = 0.67                  P(pneu) = 0.33

Evidence: E = {TEMP > 37.5}; computation of the probability of flu using Bayes' rule:

    P(flu | TEMP > 37.5) = P(TEMP > 37.5 | flu) P(flu) / P(TEMP > 37.5)

    P(TEMP > 37.5) = P(TEMP > 37.5 | flu) P(flu) + P(TEMP > 37.5 | pneu) P(pneu)
                   = 0.8 · 0.67 + 0.74 · 0.33 ≈ 0.78

    ⇒ P(flu | TEMP > 37.5) = 0.8 · 0.67 / 0.78 ≈ 0.687
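The worked example can be reproduced directly. A minimal sketch (names of my own choosing) of the naive Bayes posterior for the single piece of evidence TEMP > 37.5:

```python
# Naive Bayes posterior for the flu/pneumonia example.
# Priors and P(TEMP > 37.5 | class) = 1 - P(TEMP <= 37.5 | class).
prior = {"flu": 0.67, "pneu": 0.33}
p_temp_high = {"flu": 1 - 0.20, "pneu": 1 - 0.26}

# P(TEMP > 37.5) by marginalising over the class variable.
p_evidence = sum(p_temp_high[c] * prior[c] for c in prior)

# Bayes' rule: P(flu | TEMP > 37.5).
posterior_flu = p_temp_high["flu"] * prior["flu"] / p_evidence
print(round(p_evidence, 2), round(posterior_flu, 3))  # 0.78 0.687
```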

Computing probabilities from data

Compute the weighted average of:

• the estimate P̂_D(V | π(V)) of the conditional probability distribution for variable V, based on the dataset D
• a Dirichlet prior Θ, which reflects prior knowledge

These are combined as follows:

    P_D(V | π(V)) = n/(n + n0) · P̂_D(V | π(V)) + n0/(n + n0) · Θ

where

• n is the size of the dataset D
• n0 is the estimated size of the (virtual) 'dataset' on which the prior knowledge is based (the equivalent sample size)

More complex Bayesian networks

• We want to add arcs to a naive Bayesian network to improve its performance
• Result: possibly a TAN

[Figure: naive network C → E1, ..., Em; which arc E → E′ should be added?]

Compute the mutual information between variables E, E′ conditioned on the class variable C:

    I_P(E, E′ | C) = ∑_{E,E′,C} P(E, E′, C) · log [ P(E, E′ | C) / (P(E | C) P(E′ | C)) ]
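For concreteness, a sketch of this conditional mutual information for binary variables, with the joint P(E, E′, C) supplied as a dict (the representation is mine, not from the slides):

```python
from math import log

def conditional_mutual_information(p_eec):
    """I_P(E, E' | C) for binary E, E', C.

    p_eec maps (e, e_prime, c) booleans to P(E, E', C);
    the eight entries must sum to 1.
    """
    vals = (True, False)
    # Marginals needed for the conditional probabilities.
    p_c = {c: sum(p_eec[e, ep, c] for e in vals for ep in vals) for c in vals}
    p_ec = {(e, c): sum(p_eec[e, ep, c] for ep in vals)
            for e in vals for c in vals}
    p_epc = {(ep, c): sum(p_eec[e, ep, c] for e in vals)
             for ep in vals for c in vals}

    mi = 0.0
    for e in vals:
        for ep in vals:
            for c in vals:
                p = p_eec[e, ep, c]
                if p > 0:
                    # P(E, E' | C) / (P(E | C) P(E' | C)), with each
                    # conditional written as joint / P(C).
                    ratio = (p / p_c[c]) / (
                        (p_ec[e, c] / p_c[c]) * (p_epc[ep, c] / p_c[c]))
                    mi += p * log(ratio)
    return mi
```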
FAN algorithm

Choose k ≥ 0. Given evidence variables Ei, a class variable C, and a dataset D (a code sketch of steps 1–2 follows below):

1. Compute the mutual information I_P(Ei, Ej | C) for all pairs (Ei, Ej), i ≠ j, and use −I_P(Ei, Ej | C) as edge costs in a complete undirected graph
2. Construct a minimum-cost spanning forest containing exactly k edges
3. Change each tree in the forest into a directed tree
4. Add an arc from the class vertex C to every evidence vertex Ei in the forest
5. Learn the conditional probability distributions from D using Dirichlet distributions

Performance evaluation

• Success rate σ based on:

    c_max = argmax_c {P(c | x_i)}

  for x_i ∈ D, i = 1, ..., n = |D|

• Total entropy (penalty):

    E = − ∑_{i=1}^n ln P(c | x_i)

  – if P(c | x_i) = 1, then ln P(c | x_i) = 0
  – if P(c | x_i) ↓ 0, then ln P(c | x_i) → −∞
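Steps 1–2 amount to Kruskal's algorithm stopped after k edges: with −I_P as the edge cost, the minimum-cost spanning forest maximises total conditional mutual information. A sketch (function names are mine; `cond_mi(i, j)` stands in for the I_P computation above):

```python
def fan_forest(num_evidence_vars, k, cond_mi):
    """Select the k edges of a minimum-cost spanning forest, where the
    cost of edge (i, j) is -cond_mi(i, j); equivalently, greedily pick
    the k highest-MI edges that do not create a cycle (Kruskal).
    """
    # All candidate edges in the complete undirected graph.
    edges = [(i, j) for i in range(num_evidence_vars)
             for j in range(i + 1, num_evidence_vars)]
    edges.sort(key=lambda e: -cond_mi(*e))  # highest MI first

    parent = list(range(num_evidence_vars))  # union-find forest

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path compression
            v = parent[v]
        return v

    chosen = []
    for i, j in edges:
        if len(chosen) == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:              # skip edges that would close a cycle
            parent[ri] = rj
            chosen.append((i, j))
    return chosen
```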

Example COMIK BN

[Figure: Bayesian network over clinical variables such as CLOTTING-FACTORS, FEVER, GALL-BLADDER, BILIRUBIN, ASCITES, FIRM-OR-TENDER, LEUKAEMIA-LYMPHOMA, ASAT, INTERMITTENT-JAUNDICE, UPPER-ABDOMINAL-PAIN, LIVER-SURFACE, GI-CANCER, ALKALINE-PHOSPHATASE, LDH, AGE, CONGESTIVE-HEART-FAILURE, BILIARY-COLICS-GALLSTONES, ALCOHOL, HISTORY-GE-2-WEEKS, SPIDERS, WEIGHT-LOSS, JAUNDICE-DUE-TO-CIRRHOSIS, and a diagnosis with values ACUTE-NON-OBSTRUCTIVE, CHRONIC-NON-OBSTRUCTIVE, BENIGN-OBSTRUCTIVE, MALIGNANT-OBSTRUCTIVE, OTHER.]

Based on COMIK dataset:

• Dataset with 1002 patient cases with liver and biliary tract disease, collected by the Danish COMIK group
• Has been used as a vehicle for learning various probabilistic models

Results for COMIK Dataset

[Figure: correct conclusion (%) (axis 50–80) against the number of added arcs (0–20).]

[Figure: total entropy (axis 900–1200) against the number of added arcs (0–20).]
Naive Bayes: odds-likelihood form

For class variable C and evidence E:

    P(C | E) = ∏_{E ∈ E} P(E | C) · P(C) / P(E)

if E ⫫ E′ | C for all E, E′ ∈ E. For C = true:

    P(c | E) = ∏_{E ∈ E} P(E | c) · P(c) / P(E)

For C = false:

    P(¬c | E) = ∏_{E ∈ E} P(E | ¬c) · P(¬c) / P(E)

Hence:

    P(c | E) / P(¬c | E) = [∏_{E ∈ E} P(E | c) / P(E | ¬c)] · P(c) / P(¬c)
                         = ∏_{i=1}^m λ_i · O(c)
                         = O(c | E)

Here O(c | E) is the conditional odds, and λ_i = P(E_i | c) / P(E_i | ¬c) is a likelihood ratio.

Comparison Bayesian networks: NHL

[Figure: two Bayesian networks for non-Hodgkin lymphoma (NHL) treatment, compared over variables such as CLINICAL-STAGE, GENERAL-HEALTH-STATUS, CT&RT-SCHEDULE, AGE, BULKY-DISEASE, HISTOLOGICAL-CLASSIFICATION, CLINICAL-PRESENTATION, SURGERY, THERAPY-ADJUSTMENT, BM-DEPRESSION, HEMORRHAGE, PERFORATION, EARLY-RESULT, IMMEDIATE-SURVIVAL, POST-CT&RT-SURVIVAL, POST-SURGICAL-SURVIVAL, and 5-YEAR-RESULT.]
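As a small numerical illustration (my own, reusing the temperature and fever probabilities of the earlier flu/pneumonia example, where pneu plays the role of ¬flu), the odds-likelihood form turns classification into a product of ratios:

```python
# Odds-likelihood form: O(c | E) = prod(lambda_i) * O(c).
# Likelihood ratios from the flu/pneumonia example:
#   lambda_TEMP  = P(TEMP > 37.5 | flu) / P(TEMP > 37.5 | pneu)
#   lambda_FEVER = P(fever | flu) / P(fever | pneu)
prior_odds = 0.67 / 0.33                  # O(flu)
lambdas = [0.80 / 0.74, 0.88 / 0.82]

posterior_odds = prior_odds
for lam in lambdas:
    posterior_odds *= lam                 # multiply in each likelihood ratio

# Back to a probability: P(c | E) = O(c | E) / (1 + O(c | E)).
print(posterior_odds / (1 + posterior_odds))
```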

Odds and probabilities

Odds:

    O(c | E) = P(c | E) / P(¬c | E) = P(c | E) / (1 − P(c | E))

Back to probabilities:

    P(c | E) = O(c | E) / (1 + O(c | E))

[Figure: plot of odds O (0–10) against probability P (0–1).]

Note that:

• O(c | E) = 1 if P(c | E) = 0.5
• O(c | E) → ∞ if P(c | E) ↑ 1

Odds, likelihoods and logarithms

Logarithmic odds-likelihood form:

    ln O(c | E) = ln [∏_{i=1}^m λ_i · O(c)]
                = ∑_{i=1}^m ln λ_i + ln O(c)
                = ∑_{i=0}^m ω_i

with ω_0 = ln O(c) and ω_i = ln λ_i, i = 1, ..., m


Log-odds and weights

Log-odds:

    ln O(c | E) = ln [∏_{i=1}^m λ_i · O(c)]
                = ∑_{i=1}^m ln λ_i + ln O(c)
                = ∑_{i=0}^m ω_i

Back to probabilities:

    P(c | E) = O(c | E) / (1 + O(c | E))
             = exp(∑_{i=0}^m ω_i) / (1 + exp(∑_{i=0}^m ω_i))

Adjust ω_i with weights β_i based on existing interactions between variables:

    ln O(c | E) = ∑_{i=0}^m β_i ω_i = β^T ω

Logistic regression

[Figure: two classes of points in the (x, y) plane, separated by a decision hyperplane.]

Hyperplane: {ω | β^T ω = 0}, where

• c = β_0 ω_0 is the intercept (recall that ω_0 = ln O(c), which is independent of any evidence E)
• ω_i, i = 1, ..., m, correspond to the probabilities we want to find
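The mapping from the linear score β^T ω back to a probability is the logistic (sigmoid) function; points with score 0 lie on the decision hyperplane. A minimal sketch (my own naming):

```python
from math import exp

def logistic_posterior(beta, omega):
    """P(c | E) = exp(beta^T omega) / (1 + exp(beta^T omega)).

    beta and omega are equal-length sequences; omega[0] plays the
    role of the log prior odds term (omega_0 = ln O(c)).
    """
    score = sum(b * w for b, w in zip(beta, omega))  # beta^T omega
    return exp(score) / (1 + exp(score))

# A score of 0 lies on the hyperplane: P(c | E) = 0.5.
print(logistic_posterior([1.0, 2.0], [0.0, 0.0]))  # 0.5
```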

Maximum likelihood estimate

Database D, |D| = N, with independent instances x_i ∈ D; the likelihood function l is then:

    l(β) = ∏_{i=1}^N P_β(C_i | x′_i)

where C_i is the class value for instance i, and x′_i is x_i without C_i.

Log-likelihood function L:

    L(β) = ln l(β)
         = ∑_{i=1}^N ln P_β(C_i | x′_i)
         = ∑_{i=1}^N [ y_i ln P_β(c_i | x′_i) + (1 − y_i) ln(1 − P_β(c_i | x′_i)) ]

where c_i is (always) the value C_i = true; y_i = 1 if c_i is the class value for x_i, and y_i = 0 otherwise.

Rewriting:

    L(β) = ∑_{i=1}^N [ y_i ln P_β(c_i | x′_i) + (1 − y_i) ln(1 − P_β(c_i | x′_i)) ]
         = ∑_{i=1}^N [ y_i β^T ω − ln(1 + e^{β^T ω}) ]

Maximisation of L(β):

    ∂L(β)/∂β = ∑_{i=1}^N [ y_i ω − (e^{β^T ω} / (1 + e^{β^T ω})) ω ]
             = ∑_{i=1}^N [ y_i − P_β(c_i | x′_i) ] ω
             = 0

This can be solved using equation-solving methods, such as the Newton-Raphson method:

    β_{r+1} = β_r − (∂²L(β)/∂β ∂β^T)^{−1} · ∂L(β)/∂β

for r = 0, 1, ..., with β_0 = 0; the result is an approximation for β.
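A compact sketch of this Newton-Raphson iteration for logistic regression (my own numpy implementation of the update rule above, not code from the slides):

```python
import numpy as np

def fit_logistic_newton(X, y, iters=25):
    """Fit beta by Newton-Raphson on the log-likelihood L(beta).

    X: (N, m+1) matrix of instances, first column all ones for the
    intercept; y: (N,) vector of 0/1 class labels.
    """
    beta = np.zeros(X.shape[1])           # beta_0 = 0, as on the slides
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))   # P_beta(c | x'_i) for all i
        grad = X.T @ (y - p)              # dL/dbeta
        # Hessian d^2L/(dbeta dbeta^T) = -X^T W X with W = diag(p(1-p)).
        hessian = -(X * (p * (1 - p))[:, None]).T @ X
        beta = beta - np.linalg.solve(hessian, grad)
    return beta

# Tiny usage example on synthetic data.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
y = (X @ np.array([-0.5, 2.0, -1.0]) + rng.normal(size=100) > 0).astype(float)
print(fit_logistic_newton(X, y))
```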
WEKA output
Scheme: weka.classifiers.Logistic
Relation: weather.symbolic
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
Test mode: evaluate on training data

=== Classifier model (full training set) ===

Logistic Regression (2 classes)

Coefficients...
Variable Coeff.
1 34.9227
2 -48.1161
3 7.8472
4 17.3933
5 -33.0445
6 22.2601
7 -82.415
8 -54.6671
Intercept 66.1064
