

Artificial Intelligence Techniques

Unit 12:
Bayesian Network
• Independence
• Conditional Independence
• Bayesian Network
• Inference in Bayesian Network

• Chapter 14 in Russell & Norvig
• CS188 Lecture Note: Bayes Nets [link]

▪ An event A is independent of an event B if knowing whether B occurred does not change the probability of A.

▪ For example:
• When you toss a coin, the probability of getting heads on
any toss is not affected by previous tosses.
• The probability that today is A’s birthday is independent
of the date of B’s birthday.

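▪ To denote the independence of two variables A and B, we
use the following notation:

A ⊥ B

▪ If two variables x and y are independent: P(x, y) = P(x) P(y)
▪ This follows from the product rule, P(x, y) = P(x | y) P(y), because P(x | y) = P(x) when x and y are independent.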

▪ For example: if passing an exam (E) and the weather (W) are
independent, with the following marginal distributions:

Passing an exam
E       P(E)
pass    0.85
¬pass   0.15

Weather
W       P(W)
hazy    0.6
sunny   0.1
rainy   0.3

Then the probability of passing the exam on a sunny day is

P(pass, sunny) = P(pass) P(sunny)
= 0.85 × 0.1 = 0.085
Note: this is only correct if the weather and passing an exam are independent.

▪ For two independent variables, the joint distribution can be
derived directly from their marginal distributions.
▪ For example, the joint distribution of the two independent
variables Passing an exam and Weather is given by:

        hazy                sunny                rainy
pass    0.85 × 0.6 = 0.51   0.85 × 0.1 = 0.085   0.85 × 0.3 = 0.255
¬pass   0.15 × 0.6 = 0.09   0.15 × 0.1 = 0.015   0.15 × 0.3 = 0.045
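The table above can be reproduced with a few lines of Python. This is a minimal sketch: the dictionary layout, the label `not_pass` (standing for ¬pass), and the helper name `joint_from_marginals` are illustrative choices, not part of the slides.

```python
from itertools import product

# Marginals from the example above; "not_pass" stands for the complement of "pass".
p_exam = {"pass": 0.85, "not_pass": 0.15}
p_weather = {"hazy": 0.6, "sunny": 0.1, "rainy": 0.3}

def joint_from_marginals(p_x, p_y):
    """Build P(x, y) = P(x) P(y); valid only if X and Y are independent."""
    return {(x, y): p_x[x] * p_y[y] for x, y in product(p_x, p_y)}

joint = joint_from_marginals(p_exam, p_weather)
for (e, w), p in joint.items():
    print(f"P({e}, {w}) = {p:.3f}")
```

The six entries sum to 1, as any joint distribution must.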

▪ Independent variables also have the following properties:

P(x | y) = P(x)
P(y | x) = P(y)

▪ Consider the previous example:

W and E are independent:
P(pass | hazy) = P(pass, hazy)/P(hazy)
= 0.51 / 0.6
= 0.85
= P(pass)

P(hazy| pass) = P(pass, hazy)/P(pass)

= 0.51 / 0.85
= 0.6
= P(hazy)
Exercise: given the following joint distribution P1(T, W),
determine if T and W are independent.

T      W      P1
hot    sun    0.4
hot    rain   0.2
cold   sun    0.1
cold   rain   0.3

Answer:
First, create the marginal distributions:

W      P          T      P
sun    0.5        hot    0.6
rain   0.5        cold   0.4

Next, use the independence assumption to compute the joint
distribution P2(T, W) = P(T) P(W):

T      W      P2
hot    sun    0.3
hot    rain   0.3
cold   sun    0.2
cold   rain   0.2

T and W are not independent because the original joint distribution
P1(T, W) differs from the joint distribution P2(T, W) generated from the
independence assumption.
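The same check can be automated. The sketch below assumes the joint is stored as a dictionary keyed by (T, W) pairs; the helper names `marginals` and `is_independent` and the tolerance are illustrative assumptions.

```python
from itertools import product

# Joint distribution P1(T, W) from the exercise above.
p1 = {("hot", "sun"): 0.4, ("hot", "rain"): 0.2,
      ("cold", "sun"): 0.1, ("cold", "rain"): 0.3}

def marginals(joint):
    """Sum the joint over the other variable to get P(T) and P(W)."""
    p_t, p_w = {}, {}
    for (t, w), p in joint.items():
        p_t[t] = p_t.get(t, 0.0) + p
        p_w[w] = p_w.get(w, 0.0) + p
    return p_t, p_w

def is_independent(joint, tol=1e-9):
    """True if every joint entry equals the product of its marginals."""
    p_t, p_w = marginals(joint)
    return all(abs(joint[(t, w)] - p_t[t] * p_w[w]) <= tol
               for t, w in product(p_t, p_w))

# P2 is the joint implied by the independence assumption.
p_t, p_w = marginals(p1)
p2 = {(t, w): p_t[t] * p_w[w] for t, w in product(p_t, p_w)}

print(is_independent(p1))  # False
print(is_independent(p2))  # True
```

A single mismatching entry is enough to rule out independence, so the comparison is done entry by entry rather than by eye.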
Simplifying Joint Distributions

▪ The independence property can simplify a joint probability distribution

▪ For example, if Weather is independent of Cavity, Toothache
and Catch, we may break the original table into two tables:

P(Toothache, Catch, Cavity, Weather)

= P(Toothache, Catch, Cavity) P(Weather)

Simplifying Joint Distributions

▪ This results in fewer entries:

P(Toothache, Catch, Cavity, Weather)   32 entries
= P(Toothache, Catch, Cavity)          8 entries
  × P(Weather)                         4 entries

Assume Weather: {rainy, sunny, foggy, snowy},
Cavity: {cavity, ¬cavity},
Toothache: {toothache, ¬toothache},
Catch: {catch, ¬catch}.

The number of entries reduces from 32 (i.e., 4 × 2 × 2 × 2) to 12
(i.e., 4 + 2 × 2 × 2).

• For n independent unbiased coins:

P(Coin1, …, Coinn): the full joint distribution has a total of 2^n entries.

With the independence assumption, we need to keep n tables,
P(Coin1), …, P(Coinn), each of size 2. Total entries = 2n.

Coin    P(Coin)
head    0.5
tail    0.5

Since the probabilities of all tosses are the same (i.e., all n tables are the
same), we need to store only 1 table.
Total entries = 2. This is a huge savings in space.
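The entry counts above are simple arithmetic and can be checked in Python. In this sketch the function names `full_joint_size` and `factored_size` are hypothetical helpers, and n = 10 is an arbitrary choice for the coin case.

```python
from math import prod

def full_joint_size(domain_sizes):
    """Entries in one table over all variables: product of the domain sizes."""
    return prod(domain_sizes)

def factored_size(groups):
    """Entries when the joint splits into independent groups: one table per group."""
    return sum(prod(g) for g in groups)

# Weather (4 values) is independent of Toothache, Catch, Cavity (2 values each).
print(full_joint_size([4, 2, 2, 2]))     # 32
print(factored_size([[4], [2, 2, 2]]))   # 12

# n independent coins: 2**n entries versus n tables of size 2.
n = 10
print(full_joint_size([2] * n))          # 1024
print(factored_size([[2]] * n))          # 20
```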
Conditional Independence

▪ Unconditional independence is rare and cannot be used to infer

other variables
▪ Conditional independence:
• Two dependent variables may become independent given
some other variables → needs at least 3 variables to work.
• Example (Flu causes both Fever and Running Nose):
• Running Nose is dependent on Fever.
Once we know someone has a fever, it becomes more likely
that they also have a running nose.
• Running Nose is independent of Fever given the disease.
Once we know that a person has flu, knowing one symptom
no longer affects the likelihood of the other symptom.
Conditional Independence

▪ If X is conditionally independent of Y given Z, written X ⊥ Y | Z,
the following rules apply:

P(X, Y|Z) = P(X|Z) P(Y|Z)

P(X|Y, Z) = P(X|Z)

P(Y|X, Z) = P(Y|Z)
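These rules can be checked numerically on the Flu / Fever / Running Nose example. The sketch below uses hypothetical probabilities chosen purely for illustration (none of the numbers come from the slides) and builds the joint so that the two symptoms are conditionally independent given Flu.

```python
from itertools import product

# Hypothetical numbers for illustration only (not from the slides).
p_flu = {True: 0.1, False: 0.9}
p_fever_given_flu = {True: 0.8, False: 0.05}   # P(fever | Flu)
p_nose_given_flu = {True: 0.7, False: 0.1}     # P(running nose | Flu)

def bern(p_true, value):
    """P(V = value) for a binary variable with P(V = True) = p_true."""
    return p_true if value else 1.0 - p_true

# Joint built so that Fever and Nose are conditionally independent given Flu.
joint = {(fe, no, fl): p_flu[fl]
         * bern(p_fever_given_flu[fl], fe)
         * bern(p_nose_given_flu[fl], no)
         for fe, no, fl in product([True, False], repeat=3)}

def prob(fever=None, nose=None, flu=None):
    """Probability of a partial assignment (None means the variable is summed out)."""
    return sum(p for (fe, no, fl), p in joint.items()
               if fever in (None, fe) and nose in (None, no) and flu in (None, fl))

# P(fever | nose, flu) equals P(fever | flu): the extra symptom adds nothing.
lhs = prob(fever=True, nose=True, flu=True) / prob(nose=True, flu=True)
rhs = prob(fever=True, flu=True) / prob(flu=True)
print(abs(lhs - rhs) < 1e-12)

# Without knowing Flu, the symptoms are dependent: P(fever | nose) != P(fever).
print(prob(fever=True, nose=True) / prob(nose=True), prob(fever=True))
```

With these numbers, observing a running nose raises the probability of fever when Flu is unknown, but has no effect once Flu is known.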


What are the conditional independence relationships for

the following domains:

▪ Traffic jam (T)

▪ Somebody carries an umbrella (U)
▪ It is raining (R)


▪ There is fire (F)

▪ Smoke is present (S)
▪ Alarm triggers (based on smoke only) (A)

Example 1
In a region where it rains on average 120 out of 365 days, 40% of the
employees are expected to be late for work when it rains. Rain also causes
a traffic jam 80% of the time.

Assume that an employee coming late and a traffic jam are conditionally
independent given rain. What is the probability that the employee
arrives on time, there is a traffic jam, and it is a rainy day?

Information given:
• P(rain) = 120/365 = 0.329
• P(late | rain) = 0.40
• P(jam | rain) = 0.80
• Late ⊥ Jam | Rain

Query:
• P(¬late, jam, rain)

P(¬late, jam, rain)
= P(¬late, jam | rain) P(rain) (product rule)
= P(¬late | rain) P(jam | rain) P(rain) (conditional independence assumption)
= (1 - P(late | rain)) P(jam | rain) P(rain)
= 0.6 × 0.8 × 0.329
= 0.15792
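The calculation in Example 1 can be replayed in Python. This sketch keeps P(rain) as the exact fraction 120/365 instead of the rounded 0.329, so the result differs from the slide's value in the fourth decimal place; the variable names are illustrative.

```python
p_rain = 120 / 365           # about 0.329
p_late_given_rain = 0.40
p_jam_given_rain = 0.80

# Arriving on time is the complement of arriving late.
p_on_time_given_rain = 1 - p_late_given_rain

# P(not late, jam, rain) = P(not late | rain) P(jam | rain) P(rain),
# which relies on the assumption Late ⊥ Jam | Rain.
p_query = p_on_time_given_rain * p_jam_given_rain * p_rain
print(round(p_query, 4))     # 0.1578
```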
Example 2

In a region where it rains on average 120 out of 365 days, the
probability that it rains and an employee is late is 0.30.
Assume that an employee coming late and a traffic jam are conditionally
independent given rain. What is the probability of an employee
coming late given that it is raining and there is no traffic jam?

Information given:
• P(rain) = 120/365 = 0.329
• P(late, rain) = 0.3
• Late ⊥ Jam | Rain

Query:
• P(late | ¬jam, rain)

P(late | ¬jam, rain)
= P(late | rain) (conditional independence assumption)
= P(late, rain) / P(rain) (product rule)
= 0.3 / 0.329
= 0.912
Chain Rule

▪ A joint probability distribution can be expressed using
conditional distributions by applying the product rule,
P(x, y) = P(x | y) P(y), repeatedly. For example:

P(Umbrella, Traffic, Rain)
= P(Umbrella | Traffic, Rain) P(Traffic, Rain)
= P(Umbrella | Traffic, Rain) P(Traffic | Rain) P(Rain)

Any variable ordering works. For example:

P(Umbrella, Traffic, Rain)
= P(Rain | Traffic, Umbrella) P(Traffic | Umbrella) P(Umbrella)

▪ The chain rule:

P(X1, X2, … , Xn) = P(Xn | X1, …, Xn-1) P(Xn-1 | X1,…, Xn-2) …P(X2 | X1)P(X1)
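The chain rule holds for any joint distribution, with no independence assumptions. The sketch below checks this on a small hypothetical joint P(U, T, R) over three binary variables; the eight probabilities are made up for illustration and only need to sum to 1.

```python
from itertools import product

# Hypothetical joint P(U, T, R) over three binary variables (illustrative numbers).
values = [0.02, 0.08, 0.10, 0.20, 0.05, 0.15, 0.10, 0.30]
joint = dict(zip(product([True, False], repeat=3), values))

def prob(u=None, t=None, r=None):
    """Probability of a partial assignment (None means the variable is summed out)."""
    return sum(p for (ju, jt, jr), p in joint.items()
               if u in (None, ju) and t in (None, jt) and r in (None, jr))

def chain(u, t, r):
    """P(u | t, r) P(t | r) P(r), with each conditional computed from the joint."""
    p_r = prob(r=r)
    p_t_given_r = prob(t=t, r=r) / p_r
    p_u_given_tr = prob(u=u, t=t, r=r) / prob(t=t, r=r)
    return p_u_given_tr * p_t_given_r * p_r

# The chain-rule product reproduces every entry of the joint.
ok = all(abs(chain(u, t, r) - joint[(u, t, r)]) < 1e-12 for (u, t, r) in joint)
print(ok)  # True
```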

Simplifying Joint Distribution with Conditional Independence
▪ Let's consider P(A, B, C, D)
Apply the chain rule:
P(A, B, C, D) = P(D|A, B, C) P(A, B, C)
= P(D|A, B, C) P(C|A, B) P(A, B)
= P(D|A, B, C) P(C|A, B) P(B|A) P(A)

▪ If we make conditional independence assumptions, we get:

Assume D is independent of A given B and C: P(D|A,B,C) = P(D|B,C)
P(A, B, C, D) = P(D|B, C) P(C|A, B) P(B|A) P(A)

Assume C is independent of B given A: P(C|A,B) = P(C|A)
P(A, B, C, D) = P(D|B, C) P(C|A) P(B|A) P(A)
Bayesian Network
(Bayes’ Net)

Why Bayesian Network
▪ Joint distributions
• Represented using one single big table
• Typically very large: the table grows exponentially with the
number of variables
• Hard to learn or estimate anything empirically
▪ Example (figure: auto insurance network)
• 27 variables in total. If all variables are
binary, we need to store 2^27 (~134 million)
entries in a full joint distribution.
• Each variable is directly related to only a
few other variables.

Bayesian Network
▪ Bayesian Network
• Makes assumptions on the conditional independence of
certain variables
• Represented using multiple local conditional probability
tables (CPTs)
• Models how variables interact locally
• Local interactions chain together to give global, indirect
interactions

A Bayesian network is a probabilistic graphical model
that represents a set of variables and their conditional
dependencies via a directed acyclic graph
The Bayesian Network

▪ Given
P(A, B, C, D) = P(D|B, C) P(C|A) P(B|A) P(A)

▪ P(D|B, C) P(C|A) P(B|A) P(A) is essentially a Bayesian Network,
which can be visualized as follows:

[Figure: directed graph with edges A → B, A → C, B → D, C → D]

The topology of the network:
▪ Each node represents a variable
▪ The edge from A to B represents
that B is conditioned on A, P(B|A)
▪ A variable can be conditioned on
more than one variable. For
example, D is conditioned on B and
C, P(D|B, C)
The Bayesian Network
▪ Original: use one big table (i.e., P(A, B, C, D)) to model the joint distribution
▪ Bayesian Network: when we make certain conditional independence assumptions, we
can represent P(A, B, C, D) using the Bayesian Network (BN)
▪ The BN comprises multiple tables:
• Prior distribution for root nodes, e.g., P(A)
• Conditional Probability Table (CPT) for non-root nodes, e.g., P(B|A), P(C|A)
and P(D|B, C)

P(A)
a      ¬a
0.4    0.6

P(B|A)
       b      ¬b
a      0.67   0.33
¬a     0.3    0.7

P(C|A)
       c      ¬c
a      0.7    0.3
¬a     0.52   0.48

P(D|B, C)
         d      ¬d
b, c     0.32   0.68
b, ¬c    0.28   0.72
¬b, c    0.63   0.37
¬b, ¬c   0.45   0.55

Each row sums to 1, e.g., P(¬d|b,c) = 1 - P(d|b,c)
Network Size

▪ A Bayesian Network gives a huge space savings compared to the full joint
distribution P(X1, X2, … , Xn).
▪ Considering only non-root variables in the BN, assume n binary variables:
• Size of the full joint distribution:
2^n - 1 ≈ 2^n
• Using the Bayesian Net, if each node has up to k parents, the n
nodes together have size:
n × 2^k

Note: for each node (i.e., CPT), there are up to 2^k parent conditions

▪ If n >> k, the number of entries in the BN (n × 2^k) will be much smaller
than in the full joint distribution (2^n)
▪ It is easier to construct local CPTs P(Xi | Parents(Xi)) than the full joint
distribution P(X1, X2, … , Xn)
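The size comparison is easy to tabulate. In the sketch below, n = 27 matches the auto insurance example mentioned earlier, while k = 3 is an assumed maximum number of parents chosen only for illustration; the function names are hypothetical.

```python
def full_joint_entries(n):
    """Entries in the full joint distribution over n binary variables."""
    return 2 ** n

def bayes_net_entries(n, k):
    """Upper bound on CPT conditions: n nodes, each with at most k binary parents."""
    return n * 2 ** k

n = 27   # number of variables in the auto insurance example
k = 3    # assumed maximum number of parents (hypothetical)
print(full_joint_entries(n))     # 134217728
print(bayes_net_entries(n, k))   # 216
```

The full joint grows exponentially in n, while the network grows linearly in n for a fixed k.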
Bayesian Network and Joint Distributions

▪ A Bayesian Network implicitly encodes the joint distribution

▪ We can retrieve the full joint probability from a Bayes’ Net by
multiplying the relevant conditionals (from CPTs) together (chain rule):

P(x1, x2, … , xn) = ∏i P(xi | Parents(Xi))

For example:

P(cavity, catch, toothache)
= P(cavity) P(toothache | cavity) P(catch | cavity)

P(¬cavity, catch, toothache)
= P(¬cavity) P(toothache | ¬cavity) P(catch | ¬cavity)

Example: Traffic
Given the following Bayesian Network, get the full joint distribution
P(R, T)

[Figure: network with a single edge R → T]

P(R)
R      P(R)
r      1/4
¬r     3/4

P(T|R)
R      T      P(T|R)
r      t      3/4
r      ¬t     1/4
¬r     t      1/2
¬r     ¬t     1/2

P(r, t) = P(r) P(t|r) = (1/4)(3/4) = 3/16
P(r, ¬t) = P(r) P(¬t|r) = (1/4)(1/4) = 1/16
P(¬r, t) = P(¬r) P(t|¬r) = (3/4)(1/2) = 3/8
P(¬r, ¬t) = P(¬r) P(¬t|¬r) = (3/4)(1/2) = 3/8

P(T, R)
       t      ¬t
r      3/16   1/16
¬r     3/8    3/8
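The joint table for the traffic example can be generated directly from the two CPTs. This sketch uses exact fractions and represents r / ¬r and t / ¬t as True / False; that encoding is an illustrative choice.

```python
from fractions import Fraction as F

# CPTs from the traffic example; keys are truth values of R (rain) and T (traffic).
p_r = {True: F(1, 4), False: F(3, 4)}
p_t_given_r = {True: {True: F(3, 4), False: F(1, 4)},
               False: {True: F(1, 2), False: F(1, 2)}}

# P(r, t) = P(r) P(t | r) for every combination of values.
joint = {(r, t): p_r[r] * p_t_given_r[r][t]
         for r in (True, False) for t in (True, False)}

for (r, t), p in joint.items():
    print(f"P(R={r}, T={t}) = {p}")
```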
Example: Alarm Network

Build a Bayesian Network for the following scenario:

You went for a holiday and asked your

two neighbors, John and Mary, to call if
they heard the alarm ringing. The
alarm can be triggered by minor
earthquakes or a burglar.

▪ B: Burglary
▪ E: Earthquake
▪ A: Alarm goes off
▪ M: Mary calls
▪ J: John calls

[Figure: network with edges B → A, E → A, A → J, A → M]
Example: Alarm Network (cont.)

Given the following Bayesian Network, what is the probability of
the event (j, m, a, ¬b, ¬e) happening?

[Figure: alarm network with CPTs; Burglary → Alarm ← Earthquake, Alarm → JohnCalls, Alarm → MaryCalls]

P(j, m, a, ¬b, ¬e)
= P(¬e) P(¬b) P(a | ¬b, ¬e) P(j | a) P(m | a)
= 0.998 × 0.999 × 0.001 × 0.9 × 0.7
= 0.00063
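The product above is a single joint entry read off the CPTs. A short Python check, using only the five factors quoted on the slide (the variable names are illustrative):

```python
# Factors taken from the product on the slide above.
p_not_e = 0.998                 # P(¬e)
p_not_b = 0.999                 # P(¬b)
p_a_given_not_b_not_e = 0.001   # P(a | ¬b, ¬e)
p_j_given_a = 0.9               # P(j | a)
p_m_given_a = 0.7               # P(m | a)

# One entry of the full joint: the product of one row from each CPT.
p = p_not_e * p_not_b * p_a_given_not_b_not_e * p_j_given_a * p_m_given_a
print(f"{p:.5f}")   # 0.00063
```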
Inference in
Bayesian Network

Bayes’ Net Inference by Enumeration

▪ We want to find out the probability of a certain event
▪ For example: what is the probability of burglary if both
John and Mary call?
P(b | j, m) = ?
• Evidence variables:
JohnCalls, MaryCalls
• Query variable:
Burglary
• Hidden variables:
Alarm, Earthquake

Example: Alarm Network
What is the probability of burglary if
both John and Mary call?
P(b | j, m) = P(b, j, m) / P(j, m)

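Example: Alarm Network (cont.)
What is the probability of burglary if
both John and Mary call?

Sum the joint over the hidden variables Alarm (A) and Earthquake (E):

P(b, j, m) = Σ_E Σ_A P(b) P(E) P(A | b, E) P(j | A) P(m | A)
P(¬b, j, m) = Σ_E Σ_A P(¬b) P(E) P(A | ¬b, E) P(j | A) P(m | A)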

Example: Alarm Network (cont.)
What is the probability of burglary if
both John and Mary call?

P(j, m) = P(b, j, m) + P(¬b, j, m)
= 0.00059224259 + 0.001491857649
= 0.002084100239

P(b | j, m) = P(b, j, m) / P(j, m)
= 0.00059224259 / 0.002084100239
= 0.2842
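Inference by enumeration for this query can be sketched in Python. The CPT figure did not survive in this text, so the values below are assumed from the burglary example in Russell & Norvig, Chapter 14; they reproduce the intermediate numbers quoted above. The helper names `bern`, `joint`, and `p_b_j_m` are illustrative.

```python
from itertools import product

# CPT values assumed from the burglary example in Russell & Norvig, Chapter 14
# (the slide's CPT figure is not available in this text).
P_B = 0.001                                          # P(b)
P_E = 0.002                                          # P(e)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(a | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(j | A)
P_M = {True: 0.70, False: 0.01}                      # P(m | A)

def bern(p_true, value):
    """P(V = value) for a binary variable with P(V = True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """One entry of the full joint, as a product of CPT entries."""
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))

def p_b_j_m(b):
    """P(B=b, j, m): sum out the hidden variables Earthquake and Alarm."""
    return sum(joint(b, e, a, True, True)
               for e, a in product([True, False], repeat=2))

# Normalize over the two values of the query variable.
posterior = p_b_j_m(True) / (p_b_j_m(True) + p_b_j_m(False))
print(round(posterior, 4))   # 0.2842
```

The cost of this approach grows exponentially with the number of hidden variables, which is why the summary lists variable elimination and sampling as alternatives.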

Bayes’ Net Summary
▪ Conditional Independences
▪ Bayes’ Net Representation
• A directed, acyclic graph, one node per random variable
• A conditional probability table (CPT) for each node
• Implicitly encodes the joint distribution: P(x1, … , xn) = ∏i P(xi | Parents(Xi))

▪ Probabilistic Inference in Bayes’ Net

• Enumeration (exact, exponential complexity, very slow)
• Variable elimination (exact, often better)
• Sampling (approximate, much faster)

▪ Learning Bayes’ Nets from Data

[Figure: overview map of AI algorithms; labels include Machine Learning, Deep Learning, Probabilistic Inference, Bayesian Networks, Decision Process, Logic]

The End
