
VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY

UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING

DISCRETE STRUCTURES FOR COMPUTING (CO1007)

Assignment (Semester 203, Duration: 04 weeks)

“Bayesian Networks”
(Version 0.1)

Advisors: Nguyễn Tiến Thịnh


Nguyễn An Khương

Students: Nguyễn Văn A – 22102134 (Class CC0x - Group 0y)


Trần Văn B – 88471475 (Class CC0x - Group 0y)
Lê Thị C – 36811334 (Class CC0x - Group 0y, Team leader)
Phạm Ngọc D – 97501334 (Class CC0x - Group 0y)

Ho Chi Minh City, 06/2021



Contents

1 Introduction
2 Bayesian networks
  2.1 Graph representation
  2.2 Joint probability distribution
  2.3 Variable elimination
  2.4 Inference
3 The pgmpy package, model evaluation, and real-world applications
  3.1 The pgmpy package
  3.2 Model evaluation
  3.3 Real-world applications
4 Data
5 Instructions and Requirements
  5.1 Instructions
  5.2 Requests
  5.3 Submission
6 Exercises
7 Evaluation and cheating judgement
  7.1 Evaluation
  7.2 Cheating
References


1 Introduction
In this assignment, you will be equipped with basic knowledge of Bayesian networks, a branch of Machine Learning. Bayesian networks were developed to capture the complex relationships between the properties of a subject in order to solve problems about that subject. Many Machine Learning models simply omit the dependencies between these properties, an assumption that does not hold in general situations. For example, in medicine, Dyspnoea (the Latin name for the shortness-of-breath symptom) might be caused by either lung cancer or air pollution, and there is likely a relationship between air pollution and lung cancer. However, some well-known machine learning models, such as Linear Regression, usually assume that no such relationship is present in the model.

2 Bayesian networks
2.1 Graph representation
Let's consider the well-known Bayes formula

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid A^c)\,P(A^c)}, \tag{1}$$

where the event $A^c$ is the complement of the event $A$, $P(A)$ is the chance of event $A$ occurring, and $P(A \mid B)$ is the chance of event $A$ occurring given that event $B$ is true. The complement of an event is just the difference between the sample space and the event. Such a formula is particularly significant, as the following simple example shows.
Example 2.1. Suppose a test for whether someone has been infected by tuberculosis bacteria is 95% sensitive, i.e., the test correctly identifies 95% of infected people as infected. Moreover, the test correctly identifies 90% of non-infected people as not infected. For simplicity, assume also a 5% prevalence, meaning that everyone has a 5% chance of being infected by tuberculosis bacteria. What is the chance that a person has been infected by tuberculosis bacteria if his test result is positive, i.e., the test identifies him as infected?
You might then think about applying the Bayes formula. If so, you are right. Let's call $A$ the event of being infected by tuberculosis bacteria and $B$ the event that the test result is positive. Therefore, what we need is

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid A^c)\,P(A^c)} = \frac{0.95 \times 0.05}{0.95 \times 0.05 + (1 - 0.9) \times (1 - 0.05)} = \frac{1}{3}. \tag{2}$$

The computation above tells us that everyone who has a positive test result has a chance of approximately 33.33% of being infected by tuberculosis bacteria.
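To double-check the arithmetic, here is a minimal Python sketch that reproduces the computation in (2); the variable names are ours, not part of the assignment.

# A quick numerical check of Example 2.1 (illustrative names).
sensitivity = 0.95   # P(B = positive | A = infected)
specificity = 0.90   # P(B = non-positive | A = non-infected)
prevalence = 0.05    # P(A = infected)

# Denominator of the Bayes formula, by the law of total probability.
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
print(sensitivity * prevalence / p_positive)  # 0.3333... = 1/3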
The key point here is the relationship between $P(A \mid B)$ and $P(B \mid A)$. Once you know the tuberculosis prevalence, you can easily measure the chance that a person has been infected by tuberculosis. In particular, $P(A \mid B) \neq P(A)$. In this case, the two events $A$ and $B$ are not independent. In fact, $A$ is the cause of $B$, i.e., whether a person has been infected by tuberculosis bacteria changes the test result. Such a causality allows us to present it as a directed graph, as below.


Figure 1: Causality between event A and event B.


2.2 Joint probability distribution
Let’s think further. Assume now we generalize the situation by considering A as a “variable”
representing both cases where a person has or has not been infected by tuberculosis bacteria.
The values that A can have are denoted by “infected” and “non-infected”. Thus the fact that a
person has been infected by tuberculosis bacteria is represented by the expression “A = infected”.
Similarly, suppose B a variable representing both cases of the test result, i.e., “positive” or “non-
positive” (you may also use the term “negative” instead of “non-positive”). Then “B = positive”
represents the fact that the test identifies of being infected for someone (he may have been
infected or have not been infected). Therefore, the question in Example 2.1 is to compute the
probability P (A = infected|B = positive).
With this use of notation, we can assign the following “probability distribution” at node A.
Since A has only two values “infected” and “non-infected”, probability distribution at node A
can be seen as an array of P (A = infected) = 0.05 and P (A = non-infected) = 1 − 0.05 = 0.95.
A (infected)       0.05
A (non-infected)   0.95

Table 1: Probability distribution at node A.

The probability distribution at node B is more involved, since it depends tightly on A (you might want to take a look at Example 2.1 one more time to see how the probability of the test result changes). Since A has two values, "infected" and "non-infected", and B also has two values, "positive" and "non-positive", we have four combinations in total, and the probability distribution at node B is a 2 × 2 array of P(B = positive | A = infected), P(B = non-positive | A = infected), P(B = positive | A = non-infected), and P(B = non-positive | A = non-infected). We thus arrive at the following table of the probability distribution at node B.

B | A              A (infected)    A (non-infected)
B (positive)       0.95            0.1
B (non-positive)   0.05            0.9

Table 2: Probability distribution at node B.

We then say that the directed graph in Figure 1, together with the probability distributions in Table 1 and Table 2, forms a Bayesian network. The probability distribution is called the "joint distribution" of A and B and is denoted by P, or more precisely P(A, B). Note that P(A, B) is not the same as P(B|A). Indeed, they are completely different, even though each of them has four values corresponding to the four combinations of the values of A and B. If we consider one of the four values of P(A, B), we can see that it is exactly P(A = infected ∧ B = positive), where ∧ is the conjunction operator, rather than the conditional probability P(A = infected | B = positive) or P(B = positive | A = infected). Hence, we can apply the product rule of probability to calculate

$$P(A = \text{infected} \wedge B = \text{positive}) = P(A = \text{infected})\,P(B = \text{positive} \mid A = \text{infected}) = 0.05 \times 0.95 = 0.0475. \tag{3}$$

Similarly, we obtain the following table. It is just the "product" of the probability distribution at node A in Table 1 and the probability distribution at node B in Table 2, i.e., P(A, B) = P(A)P(B|A).

A, B               A (infected)    A (non-infected)
B (positive)       0.0475          0.095
B (non-positive)   0.0025          0.855

Table 3: Probability distribution P(A, B) = P(A)P(B|A).
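As a quick sanity check, Table 3 is just each column of Table 2 scaled by the corresponding entry of Table 1. A minimal numpy sketch (the array layout mirrors the tables above and is our own choice):

import numpy as np

p_A = np.array([0.05, 0.95])             # [infected, non-infected], Table 1
p_B_given_A = np.array([[0.95, 0.10],    # positive     | infected, non-infected
                        [0.05, 0.90]])   # non-positive | infected, non-infected

p_AB = p_B_given_A * p_A                 # broadcasting scales each column by P(A)
print(p_AB)                              # [[0.0475 0.095 ] [0.0025 0.855 ]], Table 3
print(p_AB.sum())                        # 1.0, so P(A, B) is a proper joint distribution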

Definition 1. A Bayesian network is a directed graph G = (V, E), where V is the set of all vertices and E is the set of all edges of the graph, equipped with a joint distribution P(V). Each vertex v ∈ V is a random variable, and the probability distribution at v is the conditional probability P(v | u_1, ..., u_n), where u_1, ..., u_n are the parent nodes of v. Moreover, the graph is acyclic, meaning that no cycle exists in the graph.
In general, any three nodes of a Bayesian network form one of the following basic shapes. For simplicity, let G be a Bayesian network of only three nodes, named A, B, and C.
1. Cascade: this is the case where graph G is A −→ B −→ C or A ←− B ←− C. In this case, when B is observed, the connection between A and C is lost. In other words, they are independent given B, which we denote $A \perp C \mid B$. Otherwise, C is dependent on A, which we denote $A \not\perp C$. For example, let A be whether a person smokes, B whether the person's lungs are in a good or a bad condition, and C whether the person has Dyspnoea. You can see that the causal relationship between A, B, and C is A −→ B −→ C. If you don't know anything about the current condition of a person's lungs, there will likely be some relationship between smoking and Dyspnoea, because smoking may be a cause of Dyspnoea. However, once you know that the person's lungs are in a very bad condition, that must be the main (and direct) cause of the Dyspnoea the person has had, and whether this person smokes is no longer really important and can be omitted.

Figure 2: Cascade shape.

Due to the independence of A and C given B, the probability distribution P(A, B, C) can be calculated as follows.

$$P(A, B, C) = P(A)\,P(B \mid A)\,P(C \mid B). \tag{4}$$

2. Common parent: we say that a graph G is in the "common parent" shape if it has the form A ←− B −→ C. As in the case of the cascade shape, if B is observed then $A \perp C \mid B$; otherwise, $A \not\perp C$. The common parent shape can be found in the relationship between an imaging diagnostic result (A), lung cancer (B), and Dyspnoea (C). In fact, the cancer directly causes both a positive imaging diagnostic result and the shortness-of-breath symptom. We can also see that when a person has Dyspnoea, the chance that he will receive a positive diagnostic result must be slightly higher than usual. This explains why, if the cancer variable is not observed, the Dyspnoea variable and the diagnostic result variable may be dependent, but this is not the case when the cancer variable is observed.


Figure 3: Common parent shape.


The probability distribution in this case has the following formula.

$$P(A, B, C) = P(B)\,P(A \mid B)\,P(C \mid B). \tag{5}$$

3. Explaining away: this shape has a completely different property from the previous ones. If graph G is of the form A −→ B ←− C, then when B is observed, $A \not\perp C \mid B$; otherwise, $A \perp C$. Let's consider A whether the air is polluted, B whether a person has lung cancer, and C whether a person smokes. As we can see, pollution (A) and smoking (C) are not related to each other, but both are causes of lung cancer (B). Suppose we know that a person has had lung cancer, and assume that he smokes quite a lot. This somehow decreases the chance that air pollution caused his cancer, since smoking is now most likely the main reason for the cancer. As a consequence, A and C are indeed dependent on each other given that B is observed.

Figure 4: Explaining away shape.

Therefore, we can calculate the probability distribution equipped with the graph as the product of the probability distributions at the three nodes.

$$P(A, B, C) = P(A)\,P(C)\,P(B \mid A, C). \tag{6}$$

Based on the basic structures above, we can construct more complex Bayesian networks. For example, we can construct a Bayesian network describing the relationships between air pollution, smoking (or smoker), cancer, imaging diagnostic result (X-ray), and Dyspnoea, as in Figure 5. See [SL13; SD21] for more examples.

Figure 5: A Bayesian network for cancer.
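As a preview of Section 3.1, the graph in Figure 5 could be declared with the pgmpy package roughly as follows; this is only a sketch, and the node names are our reading of Figure 5.

from pgmpy.models import BayesianModel

# Pollution and Smoker point to Cancer (an "explaining away" shape);
# Cancer is a common parent of Xray and Dyspnoea.
cancer_model = BayesianModel([('Pollution', 'Cancer'),
                              ('Smoker', 'Cancer'),
                              ('Cancer', 'Xray'),
                              ('Cancer', 'Dyspnoea')])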


2.3 Variable elimination


Let's ask an interesting question. Assume we are given a Bayesian network as in Example 2.1 and the probability distribution P(A, B) as in Table 3, but we do not know anything about the probability distribution at either node A or node B. How can we recover the tuberculosis prevalence if we don't know it beforehand?
Definition 2. Given the joint probability distribution P of n variables X_1, ..., X_n, the marginal probability of X_i is defined as follows.

$$P(X_i) = \sum_{X_1} \cdots \sum_{X_{i-1}} \sum_{X_{i+1}} \cdots \sum_{X_n} P(X_1, \ldots, X_n) \qquad \forall i \in \{1, \ldots, n\}. \tag{7}$$

In the tuberculosis Example 2.1, the marginal probability of the variable B is the probability that the test result is positive or non-positive without observing whether the person has been infected by tuberculosis bacteria. By using Table 3, we have

$$P(B = \text{positive}) = P(A = \text{infected}, B = \text{positive}) + P(A = \text{non-infected}, B = \text{positive}) = 0.0475 + 0.095 = 0.1425. \tag{8}$$

We can also calculate the marginal probability of B for B = non-positive similarly; the result is 0.8575. However, it would be a "nightmare" if we were considering a larger Bayesian network with hundreds of nodes. Indeed, assume that we have a Bayesian network of n nodes, each of which is a binary variable, i.e., each of them takes a value that can be 0 or 1. Then, with just a simple calculation, we can see that the number of summands in a marginal sum grows to $2^{n-1}$. The running time of a "naive" algorithm computing such a sum is thus unacceptable if n is very large.
Fortunately, we can leverage the factorization of P as in (4), (5), and (6). For simplicity, let us inspect a network in cascade form X_1 −→ X_2 −→ X_3. We then have

$$P(X_3) = \sum_{X_1}\sum_{X_2} P(X_1, X_2, X_3) = \sum_{X_2} P(X_3 \mid X_2)\left[\sum_{X_1} P(X_1)\,P(X_2 \mid X_1)\right]. \tag{9}$$

Since the inner sum is taken over X_1, it becomes a function of X_2 only. Moreover, we can set

$$\tau(X_2) := \sum_{X_1} P(X_1)\,P(X_2 \mid X_1). \tag{10}$$

Hence, it is easy to see that $P(X_3) = \sum_{X_2} P(X_3 \mid X_2)\,\tau(X_2)$. For the more general case where the graph is X_1 −→ · · · −→ X_n for arbitrary n, the marginal probability is $P(X_n) = \tau_{n-1}(X_n)$, where $\tau_0(X_1) := P(X_1)$ and

$$\tau_i(X_{i+1}) := \sum_{X_i} P(X_{i+1} \mid X_i)\,\tau_{i-1}(X_i) \qquad \forall i \in \{1, \ldots, n-1\}. \tag{11}$$

By a series of "tedious" calculation steps, one can check that the total number of additions and multiplications needed to calculate P(X_n) using (11) is at most a polynomial function of the sum of the cardinalities of the n − 1 variables, and thus this is faster than the usual way of calculating the marginal probability via (7). Since one variable is removed after each calculation step, this method is called Variable Elimination. Let's apply formula (11) to calculate P(B = positive). From (11), Table 1, and Table 2, we have

$$P(B = \text{positive}) = P(A = \text{infected})\,P(B = \text{positive} \mid A = \text{infected}) + P(A = \text{non-infected})\,P(B = \text{positive} \mid A = \text{non-infected}) = 0.05 \times 0.95 + 0.95 \times 0.1 = 0.1425. \tag{12}$$
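Formula (11) translates almost line by line into code. The sketch below is our own helper (not part of any package): it eliminates one variable per step along a chain, and reproduces (12) on the tuberculosis network.

# Marginal P(X_n) for a chain X_1 -> ... -> X_n, following eq. (11).
# prior maps x -> P(X_1 = x); each cpd maps (x_next, x_prev) to
# P(X_{i+1} = x_next | X_i = x_prev).
def chain_marginal(prior, cpds):
    tau = dict(prior)  # tau_0(X_1) := P(X_1)
    for cpd in cpds:
        new_tau = {}
        for (x_next, x_prev), p in cpd.items():
            new_tau[x_next] = new_tau.get(x_next, 0.0) + p * tau[x_prev]
        tau = new_tau  # one variable eliminated per step
    return tau

prior_A = {'infected': 0.05, 'non-infected': 0.95}  # Table 1
cpd_B = {('positive', 'infected'): 0.95, ('non-positive', 'infected'): 0.05,
         ('positive', 'non-infected'): 0.10, ('non-positive', 'non-infected'): 0.90}  # Table 2
print(chain_marginal(prior_A, [cpd_B]))  # {'positive': 0.1425, 'non-positive': 0.8575}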


To help you understand the formula well, let's consider a more complex example of a Bayesian network with three nodes, in cascade form X_1 −→ X_2 −→ X_3.
Example 2.2. Suppose a box contains 3 red balls and 7 blue balls. There are 2 students, named X_1 and X_2. Starting with X_1, each of them randomly takes one ball out of the box without returning it to the box. There is another student, named X_3. If the ball that X_2 took is red, X_3 will move one step to the left; otherwise, he will move one step to the right.
One model for this example is X_1 −→ X_2 −→ X_3, since the color of the ball that X_1 had taken affects X_2's chance of taking a blue or a red ball. Moreover, without observing the color of the ball that X_2 took, we can see some effect of the color of the ball that X_1 had taken on X_3, because it changes the probability of X_2 taking a blue or a red ball and thus affects X_3. However, once X_2 is observed, we know exactly whether X_3 will move to the left or to the right. Let's find the probability distribution at each node. The calculation is straightforward, so we omit it here.
X_1 (blue)    7/10
X_1 (red)     3/10

Table 4: Probability distribution at node X_1.

X_2 | X_1     X_1 (blue)    X_1 (red)
X_2 (blue)    2/3           7/9
X_2 (red)     1/3           2/9

Table 5: Probability distribution at node X_2.

X_3 | X_2     X_2 (blue)    X_2 (red)
X_3 (right)   1             0
X_3 (left)    0             1

Table 6: Probability distribution at node X_3.

Let's calculate P(X_3 = right). We first calculate τ(X_2) by using (10).

$$\tau(X_2 = \text{blue}) = P(X_1 = \text{red})\,P(X_2 = \text{blue} \mid X_1 = \text{red}) + P(X_1 = \text{blue})\,P(X_2 = \text{blue} \mid X_1 = \text{blue}) = \frac{3}{10}\cdot\frac{7}{9} + \frac{7}{10}\cdot\frac{2}{3} = \frac{7}{10}. \tag{13}$$

$$\tau(X_2 = \text{red}) = P(X_1 = \text{red})\,P(X_2 = \text{red} \mid X_1 = \text{red}) + P(X_1 = \text{blue})\,P(X_2 = \text{red} \mid X_1 = \text{blue}) = \frac{3}{10}\cdot\frac{2}{9} + \frac{7}{10}\cdot\frac{1}{3} = \frac{3}{10}. \tag{14}$$

Now we substitute (13) and (14) into (9). Thus, we have

$$P(X_3 = \text{right}) = \tau(X_2 = \text{blue})\,P(X_3 = \text{right} \mid X_2 = \text{blue}) + \tau(X_2 = \text{red})\,P(X_3 = \text{right} \mid X_2 = \text{red}) = \frac{7}{10}\cdot 1 + \frac{3}{10}\cdot 0 = \frac{7}{10}. \tag{15}$$
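The same chain_marginal sketch from earlier in this subsection reproduces (15), using dictionary encodings of Tables 4, 5, and 6 (the layout is our own choice):

prior_X1 = {'blue': 7/10, 'red': 3/10}                  # Table 4
cpd_X2 = {('blue', 'blue'): 2/3, ('red', 'blue'): 1/3,
          ('blue', 'red'): 7/9, ('red', 'red'): 2/9}    # Table 5
cpd_X3 = {('right', 'blue'): 1.0, ('left', 'blue'): 0.0,
          ('right', 'red'): 0.0, ('left', 'red'): 1.0}  # Table 6
print(chain_marginal(prior_X1, [cpd_X2, cpd_X3]))       # {'right': 0.7, 'left': 0.3}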


Though it is more complicated than the computation in (8), the Variable Elimination technique runs faster in general cases. For more general cases and algorithms, please take a look at [SD21] (but they are usually computationally taxing, aren't they?).

2.4 Inference
In this section, we focus on an even more interesting question that is very important in applications. Consider Example 2.1: we want to predict the chance that a person has been infected by tuberculosis bacteria when the test result is observed, rather than the marginal probability of the test result. We already know very well that this can be done by applying the Bayes formula, with an answer of approximately 33.33%.
Let's generalize to X_1 −→ X_2 −→ X_3. We will compute P(X_1 | X_3), i.e., the probability of X_1 given that X_3 is observed. What would happen if we attempted to use the Bayes formula?

$$P(X_1 \mid X_3) = \frac{P(X_3 \mid X_1)\,P(X_1)}{P(X_1)\,P(X_3 \mid X_1) + P(X_1^c)\,P(X_3 \mid X_1^c)}. \tag{16}$$

However, by the definition of the network, we only have the probability distributions of X_1, X_2 | X_1, and X_3 | X_2, which are assigned to the nodes X_1, X_2, and X_3 respectively, rather than the probability distribution of X_3 | X_1. To overcome this problem, we just need to use the product rule of probability.

$$P(X_1 \mid X_3) = \frac{P(X_1, X_3)}{P(X_3)}. \tag{17}$$

Moreover, the numerator and the denominator of (17) are nothing but the marginal probabilities of (X_1, X_3) and X_3. Therefore, we can leverage (7) (with modifications if necessary) and (4) to calculate

$$P(X_1, X_3) = \sum_{X_2} P(X_1, X_2, X_3) = P(X_1)\left[\sum_{X_2} P(X_2 \mid X_1)\,P(X_3 \mid X_2)\right]. \tag{18}$$

On the other hand, we can also apply (9) to calculate P(X_3). Substituting (18) and (9) into (17), we arrive at the following formula for P(X_1 | X_3).

$$P(X_1 \mid X_3) = \frac{P(X_1)\sum_{X_2} P(X_2 \mid X_1)\,P(X_3 \mid X_2)}{\sum_{X_2} P(X_3 \mid X_2)\sum_{X_1} P(X_1)\,P(X_2 \mid X_1)}. \tag{19}$$

If the network has only two nodes, then this formula is exactly the Bayes formula. At this point, you might guess that in general cases, such inference can be solved by simply applying the Variable Elimination technique.
Consider Example 2.2. Assume the student X_3 moves one step to the right; what is the probability that X_1 had taken a blue ball? Now we apply formula (19) to calculate P(X_1 | X_3). We first calculate the numerator of (19). We have

$$P(X_1 = \text{blue}, X_3 = \text{right}) = P(X_1 = \text{blue})\,P(X_2 = \text{blue} \mid X_1 = \text{blue})\,P(X_3 = \text{right} \mid X_2 = \text{blue}) + P(X_1 = \text{blue})\,P(X_2 = \text{red} \mid X_1 = \text{blue})\,P(X_3 = \text{right} \mid X_2 = \text{red}) = \frac{7}{10}\cdot\frac{2}{3}\cdot 1 + \frac{7}{10}\cdot\frac{1}{3}\cdot 0 = \frac{7}{15}. \tag{20}$$
The denominator of (19) was already obtained in (15). Note that, when we calculated P(X_3), both X_1 = blue and X_1 = red were taken into account because, in the formula for P(X_3) in (9), we took the sum over X_1. Finally, we have P(X_1 = blue | X_3 = right) = 2/3, i.e., there is approximately a 66.67% chance that student X_1 had taken a blue ball. The marginal probability and the Variable Elimination technique have really done a good job, as you can see here.
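Numerically, formula (19) for this example amounts to the following sketch, which reuses the dictionaries and the chain_marginal helper introduced earlier:

# Numerator of (19): P(X_1 = blue, X_3 = right), as in eq. (20).
num = prior_X1['blue'] * (cpd_X2[('blue', 'blue')] * cpd_X3[('right', 'blue')]
                          + cpd_X2[('red', 'blue')] * cpd_X3[('right', 'red')])
# Denominator of (19): P(X_3 = right), computed by Variable Elimination as in (15).
den = chain_marginal(prior_X1, [cpd_X2, cpd_X3])['right']
print(num / den)  # 0.6666... = 2/3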

3 The pgmpy package, model evaluation, and real-world applications
3.1 The pgmpy package
There is a very good Python package that can help you deal with Bayesian networks. We suggest taking a look at the documentation of the package at https://pgmpy.org/ to see how to install the package and how to use the built-in functions. In this section, we will show you a few simple steps to build a Bayesian model using the pgmpy package. For simplicity, we will consider Example 2.2. The example model has the form X_1 −→ X_2 −→ X_3.
To build the model for Example 2.2, we can use the built-in "BayesianModel" class, whose input is the set of edges of the graph representing the model.

from pgmpy.models import BayesianModel

model = BayesianModel([('X_1', 'X_2'), ('X_2', 'X_3')])

Now we need to assign a probability distribution to each node. This can be done by using the "add_cpds" method of the model and the built-in "TabularCPD" class. We also recall the values in Tables 4, 5, and 6.

from pgmpy.factors.discrete import TabularCPD

cpd_X1 = TabularCPD(variable='X_1', variable_card=2, values=[[7/10], [3/10]],
                    state_names={'X_1': ['blue', 'red']})

cpd_X2 = TabularCPD(variable='X_2', variable_card=2, values=[[2/3, 7/9], [1/3, 2/9]],
                    evidence=['X_1'], evidence_card=[2],
                    state_names={'X_2': ['blue', 'red'], 'X_1': ['blue', 'red']})

cpd_X3 = TabularCPD(variable='X_3', variable_card=2, values=[[1, 0], [0, 1]],
                    evidence=['X_2'], evidence_card=[2],
                    state_names={'X_3': ['right', 'left'], 'X_2': ['blue', 'red']})

model.add_cpds(cpd_X1, cpd_X2, cpd_X3)

To check whether the input probability distribution at each node is proper (i.e., the probabilities at each node must sum to 1), we can use the command "model.check_model()". If it returns "True", the input probability distribution at each node of the model is proper; otherwise, we have to change it. Now let's get the probability distribution at X_1, X_2, and X_3. To do that, you can use the command "model.get_cpds", for example. The result will be as follows; it is exactly the same as Table 5.


print(model.check_model())

print(model.get_cpds('X_2'))

Figure 6: The display of the probability distribution at X2 .

Let's try to perform inference to get P(X_1 = blue | X_3 = right) using the pgmpy package. Fortunately, this package also provides the "VariableElimination" class, which implements the Variable Elimination algorithm.

from pgmpy.inference import VariableElimination

infer = VariableElimination(model)

print(infer.query(variables=['X_1'], evidence={'X_3': 'right'}))

Figure 7: The display of the inference computing P (X1 |X3 = right).

The probability P(X_1 = blue | X_3 = right) is exactly 2/3. Note that not only P(X_1 = blue | X_3 = right) is computed, but also P(X_1 = red | X_3 = right).

3.2 Model evaluation


At this point, the model can be seen as a "predictive" model: it shows us how to guess the color of the ball that X_1 had taken whenever we know whether X_3 moved to the left or to the right. Let's also compute P(X_1 = blue | X_3 = left) and P(X_1 = red | X_3 = left).
From Figure 7 and Figure 8 (below), it seems that the model always predicts that the ball X_1 took is blue, no matter whether X_3 moves to the left or to the right. However, if X_3 moves to the right, the chance that X_1 took a blue ball is lower than when X_3 moves to the left.

Figure 8: The display of the inference computing P(X_1 | X_3 = left).

Indeed, the probability that X_3 moves to the left or to the right is exactly the probability that X_2 took a red or a blue ball, respectively. Hence, when we know that X_3 moved to the left, we can look at the row X_2 (red) of Table 5. There we see that the chance that X_2 takes a red ball when X_1 took a blue ball is slightly higher than the chance that X_2 takes a red ball when X_1 took a red ball: 1/3 versus 2/9, respectively. On the other hand, if we know that X_3 moved to the right, the row X_2 (blue) of Table 5 tells us the contrary: the chance that X_2 took a blue ball when X_1 had taken a blue ball is slightly less than the chance when X_1 had taken a red ball: 2/3 versus 7/9. This means that the fact that X_1 takes a blue ball increases the chance that X_2 will take a red ball. Therefore, if X_3 moves to the left, the chance that X_2 took a red ball is higher than the chance that X_2 took a blue ball, and this is more likely to happen if X_1 had taken a blue ball. In the case where X_3 moves to the right, we can similarly deduce that X_1 having taken a red ball becomes relatively more plausible, which means that in this case we are less sure that X_1 had taken a blue ball than in the previous case. You might also want to set up a threshold equal to, say, 0.7 (or any value greater than 0.5) to prevent the model from predicting that X_1 had taken a blue ball when X_3 moved to the right.
This interesting point leads us to a method to evaluate our model when we have a "test" data set. In fact, we can hide X_1 from the model and just let it predict the value of X_1 based on the value of X_3, which it can observe. Finally, we define a "metric", which is just a measure of the difference between the true values of X_1 and the predicted values. In this case, we can simply count how many times the model predicted correctly.
For example, assume that we have the following test data set, which is stored in "pandas DataFrame" format. If you don't know what the "pandas DataFrame" format is, please take a look at the package documentation at https://pandas.pydata.org/docs/.

Figure 9: A test data set of X1 and X3 .

The pgmpy package provides a built-in method for the model to make predictions, as follows.

import pandas as pd  # don't forget to import the pandas library

test = pd.DataFrame(data=['left', 'right', 'right', 'left', 'right'],
                    columns=['X_3'])

print(model.predict(test))


The result also includes the predicted values of X_2, since the model wasn't able to observe X_2 in the test data set. You can deduce that only 3/5 of the values of X_1 were predicted correctly (see Figure 10).

Figure 10: Predicted values of X_1 and X_2.

You may also want to use a threshold, i.e., if P(X_1 = blue | X_3) < threshold, the predicted value is set to "red"; otherwise, it is "blue". Unfortunately, if you want to change the threshold, you will have to write a prediction function yourself, as sketched below.
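Here is a minimal sketch of such a function; it assumes the model and the VariableElimination class from earlier in this section, while the function name and the way we index into the query result are our own choices.

from pgmpy.inference import VariableElimination

def predict_X1(model, x3_value, threshold=0.5):
    # Query P(X_1 | X_3 = x3_value) and apply a decision threshold on "blue".
    result = VariableElimination(model).query(variables=['X_1'],
                                              evidence={'X_3': x3_value})
    p_blue = result.values[result.state_names['X_1'].index('blue')]
    return 'blue' if p_blue >= threshold else 'red'

print(predict_X1(model, 'right', threshold=0.7))  # 'red', since 2/3 < 0.7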

3.3 Real-world applications


Assume now we have a data set about air pollution, whether a person smokes, whether a person has lung cancer, whether a person has Dyspnoea, and the imaging diagnostic result (X-ray) of the person. In this case, a Bayesian network as in Figure 5 can be considered. However, the probability distribution at each node of the model cannot be obtained as easily as the ones in Example 2.1 and Example 2.2. In more general cases, what we can do is estimate the distributions rather than explicitly calculate the exact ones. To this end, we define the best distribution approximating the "true" distribution of a model (assuming it exists) as the one that is "closest" to the true one. Here, "closest" can be defined based on the Kullback–Leibler divergence between two distributions (sometimes simply called the KL divergence).

$$D_{KL}(p \,\|\, p^*) := \sum_x p^*(x)\log p^*(x) - \sum_x p^*(x)\log p(x). \tag{21}$$

The sum is taken over all the data x that come from the real distribution p* (at this point, you might want to take a look at an advanced book on probability). The idea is to find an approximation p that is closest to p*, i.e., such that D_{KL}(p‖p*) is smallest. This means we need to solve an optimization problem in which we minimize D_{KL}(p‖p*). Nonetheless, this is more or less impossible, since the sum must be taken over all the data x that come from p*, which is not known. Note also that we just need to maximize the second term in D_{KL}(p‖p*) (due to the minus sign) rather than the whole of D_{KL}(p‖p*), since p* is fixed in the formula. Though it is difficult, we can approximate the second term in the formula of D_{KL}(p‖p*) as follows.

$$\sum_x p^*(x)\log p(x) \approx \frac{1}{|D|}\sum_{x\in D}\log p(x), \tag{22}$$

where D is a finite sample taken from the real distribution p*. This sample is nothing but our (training) data set. The remaining task is to solve

$$\max_p \frac{1}{|D|}\sum_{x\in D}\log p(x). \tag{23}$$


The optimization problem (23) is called "Maximum Likelihood Estimation" and can be solved exactly by applying some optimization techniques; these involve calculus, namely setting derivatives to zero and solving the resulting equations. We do not want to focus on these techniques in detail here. We let you explore them by referring to optimization books such as [BBV04].
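For discrete Bayesian networks, the solution of (23) actually has a simple closed form: each estimated conditional probability is a normalized count over the training data. A minimal pandas sketch with toy, hypothetical data:

import pandas as pd

# Toy data: the MLE of P(Cancer | Smoker) is a normalized cross-count.
toy = pd.DataFrame({'Smoker': ['yes', 'yes', 'no', 'no', 'yes'],
                    'Cancer': ['yes', 'no', 'no', 'no', 'yes']})
counts = pd.crosstab(toy['Cancer'], toy['Smoker'])
print(counts / counts.sum(axis=0))  # each column sums to 1: an estimated CPT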
Let's demonstrate how to use pgmpy to estimate the distributions from a data set, given a Bayesian model. Assume now we have a data set of 10,000 data points, each of which is a vector of values of the air pollution, smoker, cancer, X-ray, and Dyspnoea variables.

Figure 11: The first five data points in the data set.

In this example, we cannot assign the probability distributions to the model (see Figure 5) by hand using the built-in "add_cpds" method. Instead, we can estimate them.

from pgmpy.estimators import MaximumLikelihoodEstimator

# data here is the data set stored in pandas DataFrame format
model.fit(data, estimator=MaximumLikelihoodEstimator)

# print the probability distribution at node Cancer after estimating
print(model.get_cpds('Cancer'))

Here is the final result (Figure 12). It tells us that if the air is highly polluted and a person smokes frequently, there is a chance of approximately 94.6% that he has lung cancer.

Figure 12: The probability distribution at node "Cancer".

4 Data
Data for the assignment can be found at https://www.kaggle.com/mohammedessam97/uefa-euro-championship. It is about the UEFA Euro Championship from 1960 to 2016. A very good exploration of the data, which may be very useful, is available at https://www.kaggle.com/mohammedessam97/euro-championship-eda. The exploration is just an overview of the data and consists of the basic steps making the data readable and analyzable.

5 Instructions and Requirements


Students have to follow the instructions and comply with the requirements below. Lecturers will not resolve issues arising because students did not follow the instructions or comply with the requirements.




5.1 Instructions
Students must work closely with the other members of their own group.
Check your own group at https://docs.google.com/spreadsheets/d/1Gtw7wi418TGL2li7G2hwEAvCjks6WIzLHRSgtcEdrBc/edit#gid=0 and https://docs.google.com/spreadsheets/d/1BFgWq5an6GM50xKkpbkJ9IsTeDPhCckA8fB47nRfuDM/edit#gid=1607305454.
All of the aspects related to this assignment will be quizzed (about 10-12 of the roughly 25 multiple-choice questions) in the final exam of the subject. Therefore, team members must work together so that all of you understand every aspect of the assignment. The team leader should organize the group's work so that this requirement is met.
During the work, if you have any questions about the assignment, please post them on the forum at http://e-learning.hcmut.edu.vn/mod/forum/view.php?id=592105.
Regarding the background knowledge related to the topic, students are expected to read the books [SL13; SD21] or the online lecture notes at https://ermongroup.github.io/cs228-notes/. You can also refer to other materials, such as https://towardsdatascience.com/introduction-to-probabilistic-graphical-models-7d2c0b4bef19, if needed. However, you have to put all of them in the references section of your report.

5.2 Requests
• Deadline for submission: July 25, 2021. Students have to answer each question in a clear and coherent way.
• Write a report using LaTeX, in accordance with the layout of the template file (you can find it at http://e-learning.hcmut.edu.vn/mod/folder/view.php?id=583762).
• When submitting its report, each group also needs to submit a log file (diary) that clearly states: the weekly work progress for all 04 weeks, the tasks, the content of the opinions exchanged among the members, ...
• Prepare your code as a Jupyter notebook (learn how to use Google Colab for this task).

5.3 Submission
• Students must submit their own group's report via the BK-eLearning system (to be opened in the coming weeks): compress all necessary files (.pdf file, .tex file, .ipynb file, ...) into a single file named "Assignment-CO1007-CSE203-{All member ID numbers separated by hyphens}.zip" and submit it in the Assignment section on the BK-eLearning site.
• Note that for each group, only the leader submits the report of the group.

6 Exercises
Important note: the following exercises are about the UEFA Euro Championship data set mentioned in Section 4. Those who do not want to work with this data set can search for a different data set (on https://www.kaggle.com/, maybe?) on any topic, such as medical cancer-treatment data sets, to which you can apply Bayesian models. The exercises will be the same except for Exercises 4-6, which will change slightly. In fact, if you choose that option, you just need to replace the questions about the UEFA Euro Championship data in Exercises 4-6 with similar questions about your data set, without changing the structure of the exercises. We will let you decide what questions should be asked about the data set.
Prepare a preliminary knowledge section in your report named "Preliminary knowledge", where you have to study the following questions (each question prepared as a subsection of this section):

Exercise 1 (10 pts). Name this subsection "Probability":

a) Discrete probability (5 pts): definitions, formulas, and basic properties of experiments, sample spaces, events, the probability of an event, the complement of an event, the intersection and union of two events, the conditional probability of an event given that another event is observed, the multiplication rule of probability, the law of total probability, and the Bayes formula. Also give 5 examples for each of the concepts.
b) Discrete random variables (5 pts): the definition, the definition of the mass function, the expected value and variance formulas of a discrete random variable, the definition of Bernoulli trials, the definition and properties of the binomial distribution, the definition and properties of the geometric distribution, and the definition and formula of the joint distribution of two discrete random variables. Also give 5 examples for each of the concepts.
Exercise 2 (15 pts). Name this subsection "Graphs":
a) Undirected graphs (5 pts): definitions of simple graphs, multigraphs, and pseudographs. Give 5 examples for each type of graph. Definitions of the degree and the neighbourhood of a node (or vertex). Give 5 examples for each of the concepts. Definitions and examples of some special graphs (complete graphs, cycles, wheels, n-cubes, bipartite graphs). State the handshaking theorem and give a proof of it.
b) Directed graphs (5 pts): the definition and 5 examples of directed graphs. The definition and 5 examples of graph connectivity. Definitions of the in-degree, out-degree, and the neighbourhood of a node (or vertex). Give 5 examples for each of the concepts. State the basic relationship between the in-degree, out-degree, and the number of edges in a directed graph, and give a proof of it.

c) Paths and circuits (5 pts): definitions of paths, cycles, and circuits. The definition of path, cycle, and circuit length. Give 5 examples of graphs, each containing at least one cycle, with the cycles having different lengths.
Exercise 3 (15 pts). Name this subsection "Bayesian networks". Explain the following carefully in your report:

a) Read carefully and present Bayesian models in more detail, with at least 5 examples, as we did in Section 2. (5 pts)
b) In Subsection 2.3 and Subsection 2.4, we have not yet discussed how to perform the Variable Elimination technique and inference for the 2 other basic Bayesian networks, X_1 −→ X_2 ←− X_3 and X_1 ←− X_2 −→ X_3. For the former case, derive how we can apply the Variable Elimination technique to compute the marginal probability P(X_2) and the conditional probabilities P(X_1 | X_2) and P(X_3 | X_2). For the latter case, do the same but for the marginal probabilities P(X_1) and P(X_3) and the conditional probabilities P(X_2 | X_1) and P(X_2 | X_3). Give a few examples for each case to demonstrate what you have discussed. (5 pts)
c) Extend to the case where the Bayesian model has the form X_1 −→ X_2 −→ X_3 ←− X_4 −→ X_5. Give an example of it. (5 pts)
Prepare an application section named “Applications” in your report, where you have to do
the following:
Exercise 4 (15 pts). Name this subsection "Data preprocessing". Use the raw data mentioned in Section 4 and do the following:

a) Prepare a pandas data frame consisting of only the data that you think are relevant to your model. (2.5 pts)
b) Find and fill in or remove missing values in the data. Missing values are just the NA values existing in your data set. You can use the methods "isna", "fillna", or "dropna" provided for pandas data frames to detect, fill in, or remove the missing values in the data, respectively. You can also use the method "groupby" to group the data by country, for example, and compute the average number of goals of that country, and then use it as a feature to predict which country will be the next champion, rather than directly using the numbers of goals of all of the countries attending the championships; see the sketch after this exercise. (5 pts)
c) Discretize your data set. This is because the Bayesian models discussed in this assignment are discrete models, meaning that they can be applied to discrete data only. Your data may vary continuously in an interval of real numbers, and even the total number of goals of each country may vary from 0 to more than 100, so it is unacceptable to put such a variable into a discrete Bayesian model, unless you split the total number of goals into distinct intervals, for example, [0, 20), [20, 40), ..., [180, 200). Then 4 and 18 will belong to the same interval [0, 20). Hence the values are restricted to 10 classes only, which is good for your discrete models. (5 pts)
d) Save your prepared data as a .csv file for later use. The data should also be split into a training set (about 80% of the whole data set) and a test set containing the remainder. (2.5 pts)
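A minimal pandas sketch of the steps in b) and c); every file and column name here is hypothetical, so adapt them to your own data set:

import pandas as pd

df = pd.read_csv('euro_matches.csv')              # hypothetical prepared data frame
print(df.isna().sum())                            # b) count missing values per column
df['HomeGoals'] = df['HomeGoals'].fillna(0)       # b) fill missing goals, or ...
df = df.dropna(subset=['HomeTeam', 'AwayTeam'])   # ... drop rows missing key fields
avg_goals = df.groupby('HomeTeam')['HomeGoals'].mean()  # b) average goals per team

# c) discretize into the 10 intervals [0, 20), [20, 40), ..., [180, 200)
df['GoalClass'] = pd.cut(df['HomeGoals'], bins=range(0, 201, 20), right=False)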


Exercise 5 (20 pts). Name this subsection "Problem modeling". In this subsection, accomplish the following:

a) Perform some descriptive statistics (for example, you can compute the average number of goals of a country over the last 10 years) to see, for example, which country has won the most compared to the other countries in the last 10 years. Descriptive statistics can be seen as an overview of the data. (5 pts)
b) Use different kinds of plots (bar plot, violin plot, box plot, or whatever you want) to "visualize" your prepared (training) data set, in order to inspect the underlying relationships between the data features, which are the columns that you have selected from the raw data set. In particular, try to figure out the causalities between the features. For example, you can attempt to answer a list of questions such as: if the host country is Italy, does it affect the final result? Does France's chance of winning increase when the host country is in Europe? Does the winning probability of Italy change if the away team is Denmark? ... to find the causality. Explain everything in detail in your report (including plots). You may want to use the "matplotlib" Python library for this exercise. (10 pts)
c) Propose at least one suitable Bayesian model for the data set to predict which country will win the Euro Championship this year. For each model, give a picture of its graph representation as well as a detailed, reasonable explanation. Read the article [OEM13] whenever you need to (the .pdf file is at http://www.ijcte.org/papers/802-N30016.pdf). (5 pts)
Exercise 6 (25 pts). Name this subsection "Training and predicting". Answer the following questions and write the answers carefully in your report.
a) Build your model(s) by using the pgmpy module. Use the training data set to estimate the probability distribution at each node of the model. Provide in the report, as a table, the estimated probability distribution at a certain node of the model. Use the estimated model to infer the conditional probability of each country winning the Euro championship, given that some observations are known beforehand. Provide the inference result as a table in the report. (5 pts)

b) Use the test data set, from which you must hide the columns that you ask the model(s) to predict. Just keep the information that the model needs to know to make predictions. (5 pts)
c) Study what the "precision score" and "recall score" are, and use them to evaluate your model(s)' performance; see the sketch after this exercise. (5 pts)
d) Try to adjust the structure(s) of your model(s) slightly to see if this improves the scores. (5 pts)
e) Without using the predict method provided by the pgmpy package, write a function to predict the number of goals of the home team and the away team in a match, as well as the probability that the home team wins the match. The function, named "winner", takes as inputs your model, the home team name, the away team name, a threshold that forces the function to predict that the home team will not win if its winning probability is less than the threshold (0.5 by default), and other information if needed. (5 pts)
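A minimal sketch for c), assuming scikit-learn is available; the labels are from the toy ball example of Section 3, not from the Euro data.

from sklearn.metrics import precision_score, recall_score

y_true = ['blue', 'red', 'blue', 'blue', 'red']   # hidden test labels
y_pred = ['blue', 'blue', 'blue', 'red', 'red']   # model predictions
print(precision_score(y_true, y_pred, pos_label='blue'))  # 2/3
print(recall_score(y_true, y_pred, pos_label='blue'))     # 2/3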


7 Evaluation and cheating judgement


7.1 Evaluation
Each assignment will be evaluated as follows.

Content                                                              Score (%)
- Analyze and answer coherently and systematically,
  focusing on the goals of the questions and requests                30%
- The programs are neatly written and executable                     30%
- Correct, clear, and intuitive graphs & diagrams                    20%
- Background section is well written, correct, and appropriate       15%
- Report is well written and correct                                  5%

7.2 Cheating
The assignment must be done by each group separately. Students in a group will be considered to be cheating if:

• There is an unusual similarity among the reports (especially in the background section). In this case, ALL submissions that are similar are considered cheating. Therefore, the students of a group must defend the work of their own group.
• They do not understand the work written by themselves. You can consult any source, but make sure that you understand the meaning of everything you have written.

If a report is found to involve cheating, the students will be judged according to the university's regulations.

References
[BBV04] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[OEM13] Farzin Owramipur, Parinaz Eskandarian, and Faezeh Sadat Mozneb. "Football result prediction with Bayesian network in Spanish League-Barcelona team". In: International Journal of Computer Theory and Engineering 5.5 (2013), p. 812.
[SL13] M. Scutari and S. Lèbre. Bayesian Networks in R: with Applications in Systems Biology. Springer, 2013.
[SD21] Marco Scutari and Jean-Baptiste Denis. Bayesian Networks: with Examples in R. CRC Press, 2021.

