
ECS784U/P DATA ANALYTICS

(WEEK 6, 2024)
REVISION &
INTRODUCTION TO PROBABILISTIC
GRAPHICAL MODELS

DR ANTHONY CONSTANTINOU 1
SCHOOL OF ELECTRONIC ENGINEERING AND COMPUTER SCIENCE
TIMETABLE

2
LECTURE OVERVIEW
Week 6

▪ Revision: multiple choice and problem-solving questions.
▪ Coursework 1 past submission examples and Q&A.
▪ Introduction to Probabilistic Graphical Models (PGMs).
▪ Directed Acyclic Graphs.
▪ Bayes' theorem.
▪ Bayesian networks.

3
CATEGORICAL VS CONTINUOUS
Question:
Which attributes in the list should be represented
as continuous quantities?

▪ A: Exterior colour of car.
▪ B: Olympic Bronze, Silver, or Gold medal.
▪ C: A coin landing heads or tails.
▪ D: Profit margin.
▪ E: Brightness measured by a light meter.

4
HIDDEN SLIDE

5
CLASSIFICATION VS REGRESSION
Question:
There are two main types of supervised learning
algorithms: classification and regression. Which
type is appropriate to predict the outcomes of the
following problems?

▪ A: Whether a customer will default on their loan.
▪ B: Tomorrow's temperature.
▪ C: Expected profit on a particular transaction.

6
HIDDEN SLIDE

7
REGRESSION / ERROR MEASURE
Question:
Consider an application of regression to predict
house prices. We want to know how much money
we expect to gain or lose for particular
investments, based on the recommendation
provided by the regression model. Which
evaluation metric is the most natural to use?

▪ A: R-squared.
▪ B: Adjusted R-Squared.
▪ C: Sum Squared Error (SSE).
▪ D: Mean Absolute Error (MAE).
8
HIDDEN SLIDE

9
HAMMING DISTANCE
Question:
In K-NN classification and K-Means clustering, we need to compute
distances between data points. For data points x1 = [0, 0, 1, 2, 1] and
x2 = [0, 1, 1, 0, 1], what is the Hamming distance between them?

▪ Answer: In information theory, the Hamming distance between two strings or
vectors of equal length is the number of positions at which the corresponding
symbols are different. Here x1 and x2 differ in two positions (the second and
fourth elements), so the Hamming distance is 2.
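
A minimal Python sketch of this count (the function name hamming_distance is illustrative, not from the slides):

def hamming_distance(a, b):
    # Hamming distance: count the positions where two equal-length sequences disagree.
    assert len(a) == len(b), "inputs must have equal length"
    return sum(ai != bi for ai, bi in zip(a, b))

x1 = [0, 0, 1, 2, 1]
x2 = [0, 1, 1, 0, 1]
print(hamming_distance(x1, x2))  # -> 2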

10
HIDDEN SLIDE

11
OVERFITTING
Question:
Which is the most likely symptom of overfitting?

▪ A: Model takes a long time to train.
▪ B: Excellent performance on both train and test data.
▪ C: Poor performance on both train and test data.
▪ D: Excellent performance on test data and poor performance on train data.
▪ E: Excellent performance on train data and poor performance on test data.
12
HIDDEN SLIDE

13
OVERFITTING
Question:
Select all factors that could contribute to
overfitting:

▪ A: Test data has limited sample size.
▪ B: Increasing the number of model parameters.
▪ C: Input data has many variables.
▪ D: Training data has many samples.
▪ E: Weak or no regularisation.
▪ F: Performing PCA on the input data.
14
HIDDEN SLIDE

15
BIG-𝑶 COMPLEXITY
Question:
You have 𝑵 training samples, 𝑴 testing samples,
and 𝑫 dimensions of data. What is the test time
computational complexity for K-NN and Logistic
Regression respectively?

▪ A: K-NN: O(DM), LR: O(NDM).
▪ B: K-NN: O(DM), LR: O(D).
▪ C: K-NN: O(NDM), LR: O(NDM).
▪ D: K-NN: O(NDM), LR: O(DM).
▪ E: K-NN: O(NDM), LR: O(D²N).
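
As a hedged illustration of why the two models scale so differently at test time (the shapes and variable names below are assumptions, not from the slides), K-NN must compare each of the M test points against all N training points across D dimensions, whereas logistic regression only needs one dot product of length D per test point:

import numpy as np

N, M, D = 1000, 50, 20                      # illustrative sizes
X_train = np.random.rand(N, D)
y_train = np.random.randint(0, 2, N)
X_test = np.random.rand(M, D)
w, b = np.random.rand(D), 0.0               # a fitted logistic regression would supply these

# K-NN (k = 1 for brevity): M x N distance computations, each costing O(D) -> O(NDM) overall.
dists = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)   # shape (M, N)
knn_pred = y_train[dists.argmin(axis=1)]

# Logistic regression: one length-D dot product per test point -> O(DM) overall.
lr_pred = (1 / (1 + np.exp(-(X_test @ w + b))) > 0.5).astype(int)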
16
HIDDEN SLIDE

17
DIMENSIONALITY REDUCTION
Question:
When using PCA to perform dimensionality reduction,
it is often suggested to preserve enough dimensions
so that they can explain 99% of the variance of the data
set. What does this mean?

▪ A: The original data can be reconstructed with 99% accuracy.
▪ B: 1% of the dimensions are discarded.
▪ C: The data is compressed by 1%.
▪ D: The subsequent classification task will lose 1% accuracy.
▪ E: The variance will decrease by 1%.
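
As a hedged sketch of the "99% of the variance" criterion (the random data and variable names are purely illustrative), scikit-learn's PCA exposes explained_variance_ratio_, so we can keep just enough components to reach the chosen threshold:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 50)                       # illustrative data: 200 samples, 50 features

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cumulative, 0.99) + 1)    # smallest k whose components explain >= 99%
print(f"Keep {k} of {X.shape[1]} components")

X_reduced = PCA(n_components=k).fit_transform(X)  # the reduced representation

scikit-learn also accepts a fraction directly, e.g. PCA(n_components=0.99), which selects the number of components for you.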
18
HIDDEN SLIDE

19
ALGORITHMS AND OBJECTIVES
Question:
Match each algorithm with the corresponding objective
function.
Algorithms:
A. K-Means.
B. Mixture of Gaussians.
C. PCA.
D. Statistical Regression.

Objective functions:
A. Sum squared error between original and reconstructed data.
B. Sum squared error between each data point and its assigned cluster.
C. Classification accuracy, information gain.
D. Sum squared error between prediction and target.
E. Likelihood of the data under Gaussian distributions.
F. Computational cost.
G. Memory utilisation.

20
HIDDEN SLIDE

21
PROBLEM 1
Question:
You are working for a bank that uses people’s social
network profile data to make lending decisions. You
have a historical data set of 10,000 users, capturing
both those who repaid and defaulted on their loan. For
each user you have extracted 1,000 features from their
social media profiles. Your manager suggests using
this data set to classify whether a new person will repay or default.
What data problems may arise, and how would you address them?

22
HIDDEN SLIDE

23
PROBLEM 2
Question:
You are working for a biotech company. Your task is to
improve disease prediction related to 10,000
sequenced genes of each client. You are provided with
a data set of 10,000 individuals that captures their
genes and disease status. A colleague suggests assessing every possible
combination of features for maximum accuracy. What do you do?

24
HIDDEN SLIDE

25
COURSEWORK SUBMISSIONS: EXAMPLE 1
DIAGNOSING CARDIOVASCULAR DISEASE BASED ON
RANDOM FORESTS AND LOGISTIC REGRESSION

26
COURSEWORK SUBMISSIONS: EXAMPLE 2
PREDICTING HIT SONGS USING SPOTIFY DATA

27
COURSEWORK SUBMISSIONS: EXAMPLE 3
PREDICTION OF NITROGEN DIOXIDE LEVELS BASED ON
MULTIVARIATE LINEAR REGRESSION AND LINEAR SUPPORT
VECTOR MACHINES

28
10 MINUTES BREAK 29
PROBABILISTIC GRAPHICAL MODELS
▪ A Probabilistic Graphical Model (PGM) is a model that
relies on probability theory to express dependencies
via a graph structure.
▪ Random variables, also known as stochastic or
uncertain variables, are variables whose possible
values represent uncertain outcomes.
▪ We will focus on Bayesian Networks (BNs), which
represent a graphical model that can be used for
causal representation.
▪ Note that, in the literature, when a BN model is
assumed to capture causal relationships, it is often
referred to as a ‘Causal’ Bayesian Network (CBN).
30
BAYESIAN NETWORKS
▪ Nodes represent random variables.
▪ Arcs represent dependency:
  B has parent A and child D.
  D has parents B and C, and ancestor A.
  A has children B and C, and descendant D.
▪ Networks are directed and do not allow cycles
  (i.e. they are Directed Acyclic Graphs).

[Graph: A → B, A → C, B → D, C → D]

Note: A and D are conditionally independent once we know the values of B and C.

Bayesian Networks assume the Markov condition, which states that any node in
the network is conditionally independent of its non-descendants, given its
parents. We will cover conditional independence in the weeks that follow.
31
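
A small pure-Python sketch of the example graph (the arc list and helper names are illustrative) showing how parent, child and ancestor relations are read off the arcs:

# Arcs of the example DAG: A -> B, A -> C, B -> D, C -> D.
arcs = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]

def parents(node):
    return {u for u, v in arcs if v == node}

def children(node):
    return {v for u, v in arcs if u == node}

def ancestors(node):
    # Walk the arcs backwards until no new nodes are found.
    found, frontier = set(), parents(node)
    while frontier:
        found |= frontier
        frontier = {p for n in frontier for p in parents(n)} - found
    return found

print(parents("D"))    # {'B', 'C'}
print(children("A"))   # {'B', 'C'}
print(ancestors("D"))  # {'A', 'B', 'C'} (printed in arbitrary set order)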
DIRECTED RELATIONSHIPS
▪ If we want to reduce the risk of Lung cancer, the
causal assumptions enable us to determine that
we should:

[Figure: Smoking is a cause of both Yellow teeth and Lung cancer.
We intervene on Smoking in order to manipulate Lung cancer.]

32
CHARACTERISTICS OF
BAYESIAN NETWORKS
BNs (and causal BNs) are based on four fundamental areas:
▪ Bayesian/Inverse probability theory.
▪ Statistics: in addition to categorical variables, BNs can contain
variables represented by statistical distributions such as
Gaussian, Beta, Binomial, Poisson, Gamma distributions etc.
▪ Machine Learning: different algorithms to learn the graphical
structure and the parameters of the conditional distributions.
▪ Causality: the science of cause-and-effect; i.e., an effect is partly
dependent on a cause.
▪ Assumes effects can have multiple causes, and causes must
lie in the past of effects.
▪ An effect can in turn be a cause of many other effects which all
lie in its future.
33
THOMAS BAYES
▪ English statistician and philosopher.
▪ Born in 1701 and died in 1761.
▪ Studied Logic and Theology at the
University of Edinburgh.
▪ After Bayes’ death, his works were
passed to his friend Richard Price,
who was a philosopher and a
mathematician.
▪ Bayes’ notes were later edited and
published by Richard Price.
▪ After its posthumous publication in 1763, Bayes' theorem was
used to solve many problems of inverse probability. 34
INVERSE PROBABILITY
▪ A method of assigning a
probability distribution to an
unobserved variable.

▪ Specifically, if we want to solve P(H | E), where H is some
hypothesis (e.g. red wine is healthy) and E is some evidence
(e.g. data), then the inverse probability is P(E | H).

▪ Today, inverse probability is known as Bayesian probability
(or conditional probability).
35
PROBABILITY DISTRIBUTION

▪ Coin (C). States: c1 (H), c2 (T).
▪ Die (D). States: d1 (1), d2 (2), d3 (3), d4 (4), d5 (5), d6 (6).

P(C):
C  P
H  0.5
T  0.5

P(D):
D  P
1  0.1667
2  0.1667
3  0.1667
4  0.1667
5  0.1667
6  0.1667

(The probabilities in each distribution sum up to 1.)
36
JOINT PROBABILITY DISTRIBUTION
P(C, D):
C  D  P
H  1  0.0833
H  2  0.0833
H  3  0.0833
H  4  0.0833
H  5  0.0833
H  6  0.0833
T  1  0.0833
T  2  0.0833
T  3  0.0833
T  4  0.0833
T  5  0.0833
T  6  0.0833

▪ The joint probability distribution P(C, D) is the probability distribution
of the combined events between variables C and D.
▪ This example is based on two combined categorical distributions.
37
MARGINALISATION
P(C, D): the joint table from the previous slide (each of the 12
combinations has probability 0.0833).

▪ Marginalisation obtains the distribution of a subset of the variables
in a larger joint probability distribution.
▪ P(C, D) is marginalised into P(C) by summing, for each state ci, the
probabilities of all rows in which C = ci.
▪ For example, summing the six rows with C = H gives the marginal
probability P(C = H) = 0.5.

Marginal probability P(C):
C  P
H  0.5
T  0.5
38
MARGINALISATION
P(C, D): the joint table from slide 37 (each combination has probability 0.0833).

▪ Marginalisation obtains the distribution of a subset of the variables
in a larger joint probability distribution.
▪ P(C, D) is marginalised into P(D) by summing, for each state di, the
probabilities of all rows in which D = di.
▪ For example, summing the two rows with D = 6 gives the marginal
probability P(D = 6) = 0.1667.

Marginal probability P(D):
D  P
1  0.1667
2  0.1667
3  0.1667
4  0.1667
5  0.1667
6  0.1667
39
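
A hedged Python sketch of the same computation (names are illustrative): build the joint P(C, D) as a dictionary and sum out the variable we want to remove.

from itertools import product

coin_states, die_states = ["H", "T"], [1, 2, 3, 4, 5, 6]
# Joint distribution of a fair coin and a fair die: P(C, D) = 1/2 * 1/6.
joint = {(c, d): (1 / 2) * (1 / 6) for c, d in product(coin_states, die_states)}

# Marginalise by summing the joint over the other variable.
p_coin = {c: sum(p for (ci, _), p in joint.items() if ci == c) for c in coin_states}
p_die = {d: sum(p for (_, di), p in joint.items() if di == d) for d in die_states}

print(p_coin)     # {'H': 0.5, 'T': 0.5}
print(p_die[6])   # 0.1666...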
CONDITIONING
P(C, D): the joint table from slide 37.

▪ A conditional distribution is the probability distribution of a variable
conditional on a state of some other variable.
▪ Conditioning on C = c1 (H): P(C, D) is normalised into the posterior
distribution P(D | c1).
▪ The output P(D | c1) is equivalent to P(D). From this we can conclude
that C and D are likely to be independent.

Conditional probabilities P(D | c1):
C  D  P
H  1  0.1667
H  2  0.1667
H  3  0.1667
H  4  0.1667
H  5  0.1667
H  6  0.1667
40
CONDITIONING
P(C, D): the joint table from slide 37.

▪ A conditional distribution is the probability distribution of a variable
conditional on a state of some other variable.
▪ Conditioning on D = d6 (6): P(C, D) is normalised into the posterior
distribution P(C | d6). The output P(C | d6) is equivalent to P(C), since
the coin and the die are independent.
▪ Question: why is the conditional probability P(C = H | d6) equal to the
marginal probability P(C = H)?

Conditional probabilities P(C | d6):
C  D  P
H  6  0.5
T  6  0.5
41
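
Conditioning can be sketched in the same illustrative style: filter the joint on the observed state and renormalise.

from itertools import product

coin_states, die_states = ["H", "T"], [1, 2, 3, 4, 5, 6]
joint = {(c, d): (1 / 2) * (1 / 6) for c, d in product(coin_states, die_states)}

def condition_on_die(joint, observed_d):
    # P(C | D = observed_d): keep the matching rows, then renormalise so they sum to 1.
    filtered = {c: p for (c, d), p in joint.items() if d == observed_d}
    total = sum(filtered.values())
    return {c: p / total for c, p in filtered.items()}

print(condition_on_die(joint, 6))   # {'H': 0.5, 'T': 0.5} -- identical to the prior P(C)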
HIDDEN SLIDE

42
BAYES' THEOREM:
ILLUSTRATION OF INVERSE THINKING
▪ H = "Have headache"; C = "Have COVID-19".
▪ Assume that, at any point in time, around 1 in 10 people have a headache
and 1 in 40 people have COVID-19, and that COVID-19 causes a headache
50% of the time:
  P(H) = 1/10,  P(C) = 1/40,  P(H | C) = 1/2.
▪ Suppose a friend of yours has a headache and argues that, because 50% of
COVID-19 infections are associated with headaches, the chance of being
infected with COVID-19 must be 50-50. What's wrong with this reasoning?
▪ We want P(C | H), not P(H | C). By Bayes' theorem:
  P(H) P(C | H) = P(C) P(H | C)
  P(C | H) = P(C) P(H | C) / P(H) = (1/40) × (1/2) × 10 = 1/8
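
The same calculation as a short Python check (the numbers come straight from the slide):

p_h = 1 / 10         # P(H): probability of a headache
p_c = 1 / 40         # P(C): probability of COVID-19
p_h_given_c = 1 / 2  # P(H | C): COVID-19 causes a headache 50% of the time

# Bayes' theorem: P(C | H) = P(H | C) * P(C) / P(H)
p_c_given_h = p_h_given_c * p_c / p_h
print(p_c_given_h)   # 0.125, i.e. 1/8 -- far from the claimed 50-50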

43
HIDDEN SLIDE

44
BAYES' THEOREM

P(H | E) = P(E | H) P(H) / P(E)

45-48
BAYES' THEOREM
Bayes' theorem gives us a method to calculate the probability of a hypothesis H
(e.g. red wine is healthy) given evidence E (e.g. data) as follows:

P(H | E) = P(E | H) P(H) / P(E)

▪ The posterior probability of H, P(H | E): the probability that the hypothesis H
is true given evidence E.
▪ The likelihood of E, P(E | H): the probability of observing evidence E given
that the hypothesis H is true.
▪ The prior probability of H, P(H): the probability that the hypothesis H is true
before observing any evidence E.
▪ The prior probability of E, P(E): the probability of observing evidence E
without knowing whether hypothesis H is true.
49
HIDDEN SLIDE

50
CONDITIONAL PROBABILITY TABLE
▪ A Conditional Probability Table (CPT) captures the parameters of conditional
distributions.
▪ Suppose that we introduce the child node WIN (W) for both COIN TOSS (C)
and DIE ROLL (D), and we set 𝑊 = 𝑇 (i.e., true) when 𝐶 = 𝐻 and 𝐷 = 6.
[Figure: a Bayesian network with arcs COIN TOSS → WIN and DIE ROLL → WIN.
COIN TOSS and DIE ROLL have (unconditional) probability tables;
WIN has a CPT conditioned on COIN TOSS and DIE ROLL.]
51
CONDITIONAL PROBABILITY TABLE
▪ The values in a CPT can be based on data, knowledge, or a
combination of the two. The parameters shown below are
based on knowledge. How do we know this?

[Figure: the same network — COIN TOSS → WIN ← DIE ROLL — with its
probability tables and the CPT of WIN given COIN TOSS and DIE ROLL.]
52
PRIOR DISTRIBUTIONS
A prior distribution of a variable represents the expected outcomes of that
variable before any ‘observations’ are entered into the model.

[Figure: the network with its prior distributions P(C), P(D) and P(W) displayed.]

▪ Question: what will happen to P(W) if we observe C = H?
▪ Answer: raising P(C = H) from 50% to 100% doubles P(W = T) from
8.33% to 16.67%, so P(W = F) becomes 83.33%.
53
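
A hedged pure-Python sketch of this small network (no BN library is assumed; all names are illustrative): the joint P(C, D, W) is built from the three tables, and P(W = T) is computed before and after observing C = H.

from itertools import product

p_coin = {"H": 0.5, "T": 0.5}                 # prior of COIN TOSS
p_die = {d: 1 / 6 for d in range(1, 7)}       # prior of DIE ROLL

def p_win_given(c, d):
    # CPT of WIN: W = True only when the coin shows H and the die shows 6.
    return {True: 1.0, False: 0.0} if (c == "H" and d == 6) else {True: 0.0, False: 1.0}

# Joint distribution P(C, D, W) = P(C) * P(D) * P(W | C, D).
joint = {(c, d, w): p_coin[c] * p_die[d] * p_win_given(c, d)[w]
         for c, d, w in product(p_coin, p_die, [True, False])}

def prob_win(observed_c=None):
    # Marginal P(W = True), optionally conditioned on an observed coin state.
    rows = {k: p for k, p in joint.items() if observed_c is None or k[0] == observed_c}
    total = sum(rows.values())
    return sum(p for (c, d, w), p in rows.items() if w) / total

print(prob_win())      # 0.0833... (prior)
print(prob_win("H"))   # 0.1666... (after observing C = H)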
HIDDEN SLIDE

54
HIDDEN SLIDE

55
HIDDEN SLIDE

56
HIDDEN SLIDE

57
HIDDEN SLIDE

58
EXPLAINING AWAY
The notion of explaining away is a pattern of reasoning where the revised
beliefs of one cause influence the beliefs of some other independent cause,
when both causes share an effect (AKA ‘common-effect’). This way of
reasoning is very powerful and unique to Bayesian probability.

What will happen to P(C) and P(D) if we observe W = F?
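
Continuing the same illustrative sketch, conditioning the joint on W = F and renormalising shows how the beliefs about both causes shift at once:

from itertools import product

p_coin = {"H": 0.5, "T": 0.5}
p_die = {d: 1 / 6 for d in range(1, 7)}

def p_win_true(c, d):
    # P(W = True | C, D): win only when the coin shows H and the die shows 6.
    return 1.0 if (c == "H" and d == 6) else 0.0

joint = {(c, d, w): p_coin[c] * p_die[d] * (p_win_true(c, d) if w else 1 - p_win_true(c, d))
         for c, d, w in product(p_coin, p_die, [True, False])}

# Condition on the observed effect W = False and renormalise.
rows = {k: p for k, p in joint.items() if not k[2]}
total = sum(rows.values())

p_c_h = sum(p for (c, d, w), p in rows.items() if c == "H") / total
p_d_6 = sum(p for (c, d, w), p in rows.items() if d == 6) / total
print(p_c_h)   # 0.4545... -- below the prior P(C = H) = 0.5
print(p_d_6)   # 0.0909... -- below the prior P(D = 6) = 0.1667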

59
HIDDEN SLIDE

60
HIDDEN SLIDE

61
HIDDEN SLIDE

62
HIDDEN SLIDE

63
READING
Introductory book chapters in Bayesian AI:

▪ Book title: Bayesian Artificial Intelligence, 2nd Edition, by Kevin Korb
and Ann Nicholson.
▪ A PDF of the chapter is available for you to download on QM+.
▪ You should skip subsections 2.4.2, 2.4.3, 2.7 and 2.8.

64
READING
Introductory book chapters on Bayesian
probability and Bayesian networks:
▪ Book Title: Risk Assessment and Decision
Analysis with Bayesian Networks, by
Norman Fenton and Martin Neil.
▪ A PDF of chapters (4, 5, and 6) is available
for you to download on QM+.
▪ Chapter 4 covers the basics of probability.
You should skip subsection 4.5.
▪ Chapter 5 covers Bayes’ theorem and
conditional probability. You should skip
subsection 5.5.
▪ Chapter 6 covers how we move from
Bayes’ theorem to Bayesian networks. You
should skip subsections 6.7, 6.8 and 6.9. 65
