
ECE 368 Course Review

Probabilistic Reasoning
2023
1. What is Probabilistic Reasoning
• Why? Decision-Making with Uncertain Information
• Events not directly observable, Measurement errors, …
• Analytics Pipelines: Observe, Analyze, Decide, Act
• Probabilistic Inference: Classification, Regression, and Learning

[Pipeline diagram: Events → Observe (Data) → Analyze (Evidence) → Decide → Action, with Learning and Knowledge Accumulation feeding back]


Data, Learning and Model Development

• Data: Scalar, Vector, Temporal, Spatial, Composite


• Training: Use data to develop models
• Supervised: labeled dataset (input and output data)
• Unsupervised: unlabeled dataset (input data only)
• Periodic Retraining
• Reinforcement (state, action, reward)

[Training: Dataset → Training → Model]
[Inference: Data → Model → Inference]


Bayesian Statistics and Frequentist Statistics

P(Hi | D) = P(D | Hi) P(Hi) / P(D) ∝ P(D | Hi) P(Hi)
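For instance, a two-hypothesis posterior can be computed in a few lines (a minimal Python sketch with hypothetical numbers, not from the slides):

# Posterior from prior and likelihood: P(Hi|D) ∝ P(D|Hi) P(Hi).
prior = [0.7, 0.3]            # P(H0), P(H1)
likelihood = [0.2, 0.9]       # P(D|H0), P(D|H1)

joint = [l * p for l, p in zip(likelihood, prior)]
evidence = sum(joint)                      # P(D), the normalizer
posterior = [j / evidence for j in joint]  # P(Hi|D)
print(posterior)  # [0.341..., 0.658...]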

Frequentist:
• A priori probability not used
• Hypotheses not usually the result of an experiment
• Objective assessment of evidence
• Confidence intervals & p-values
• A posteriori probability not used
• Less computationally intensive

Bayesian:
• A priori probability over hypotheses
• Must know or construct a "subjective" prior
• Can explore different priors
• Computationally intensive
• Learns as data accumulate
• A posteriori probability enables decisions
Learning Objectives
1. Joint distributions, marginals, conditionals and Bayes rule.
2. Vector-based probabilistic models, e.g., jointly Gaussian vectors,
binomials, multinomials, conjugate priors
3. Hypothesis Testing: Naïve Bayes, Gaussian Discriminants, Likelihood
Ratio test, Bayesian Testing, Type I/II errors, cost function
4. Estimation: Likelihood function; Linear regression, Bayesian, LMS
5. Graphical models, message-passing inference.
6. Hidden Markov Models (HMMs), the forward-backward algorithm
and Viterbi algorithm
Maximum Likelihood & Bayesian Parameter Estimation

• Sample Mean and Convergence Properties


• Estimating parameters of a Random Variable
• ML Estimation and Frequentist Interpretation
• MAP Estimation and Bayesian Interpretation
• Bayesian Least Mean Square (LMS) Estimation
Parameter Estimation
• Assume IID sequence of RVs
• Estimate a parameter of X

• Example: Bernoulli RV
Properties of Estimators
• Estimation error and Bias
Maximum Likelihood Estimation
• Likelihood
Log Likelihood Function
• Log likelihood
MLE Bernoulli RV
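A minimal sketch of the result (the Bernoulli MLE is the sample mean), using simulated data:

import numpy as np

# ML estimate of the Bernoulli parameter: maximize
# log L(p) = k log p + (n-k) log(1-p)  =>  p_hat = k/n (the sample mean).
rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=1000)  # simulated IID Bernoulli(0.3) data
p_hat = x.mean()
print(p_hat)  # close to 0.3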
Laplace: Will the Sun Rise Tomorrow?
Frequentist Bayesian
Estimation Using Conditional Expectation
Bayes Inference
• Prior distribution on θ
• Conditional distribution given θ
Maximum A Posteriori Probability Rule
MAP Estimate for Binomial with Beta Prior
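A minimal sketch, assuming a Beta(a, b) prior and k successes in n trials; the posterior is Beta(a + k, b + n − k), whose mode gives the MAP estimate (with a = b = 1 the posterior mean recovers Laplace's rule of succession (k + 1)/(n + 2)):

# Beta(a, b) prior on theta, k successes in n Bernoulli trials.
# Posterior is Beta(a + k, b + n - k); its mode is the MAP estimate.
a, b = 2.0, 2.0      # hypothetical prior hyperparameters
n, k = 10, 7
theta_map = (k + a - 1) / (n + a + b - 2)    # posterior mode (MAP)
theta_lms = (k + a) / (n + a + b)            # posterior mean (LMS estimate)
print(theta_map, theta_lms)  # 0.666..., 0.642...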
MAP Rule for Prediction
Maximum Likelihood & Bayesian Parameter Estimation
More on ML, MAP, LMS Estimators
• Comparison of ML, MAP, LMS & Conditional Expectation
• Poisson RV with Gamma Prior
• Gaussian RV with Gaussian Prior
• Multinomial RV with Dirichlet Prior
Frequentist and Bayesian Inference

• Frequentist
• Bayesian
LMS & Conditional Expectation
Important Conjugate Priors
Sample Variance
MAP for Gaussian RV with Gaussian Prior
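A minimal sketch with hypothetical values, assuming the observation variance is known; because the posterior is Gaussian, the MAP estimate equals the posterior mean:

import numpy as np

# N(mu0, s0^2) prior on the mean of a Gaussian with known variance s^2.
# Posterior is Gaussian, so MAP = posterior mean (a precision-weighted blend).
mu0, s0_sq = 0.0, 1.0        # hypothetical prior
s_sq = 4.0                   # known observation variance
x = np.array([2.1, 1.8, 2.5, 2.0])
n, xbar = len(x), x.mean()

mu_map = (s_sq * mu0 + n * s0_sq * xbar) / (s_sq + n * s0_sq)
var_post = (s_sq * s0_sq) / (s_sq + n * s0_sq)
print(mu_map, var_post)  # shrinks the sample mean toward the prior mean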
ML Estimator for Multinomial RV
MAP for Multinomial RV with Dirichlet Prior
Estimation of Gaussian Vectors
• Gaussian Vector Estimation Problems
• Conditional Gaussian Distributions
• Marginal Gaussian Distributions
• Gaussian Systems
• ML Estimation
• MAP Estimation

• Bishop: Section 2.3


• Murphy: Section 4.3, 4.4
Conditional pdf of 2D Gaussian
f_{X|Y}(x | y) = f_{X,Y}(x, y) / f_Y(y)

             = exp{ −[x − μ1 − ρ_{X,Y}(σ1/σ2)(y − μ2)]² / [2(1 − ρ²_{X,Y}) σ1²] } / √(2π σ1² (1 − ρ²_{X,Y}))

• X given Y=y is a Gaussian RV with mean & variance

E[X | Y = y] = μ1 + ρ_{X,Y} (σ1/σ2)(y − μ2)          VAR[X | Y = y] = (1 − ρ²_{X,Y}) σ1²

• The maximum of the pdf of X given Y = y occurs at E[X | Y = y]

• The least mean square (LMS) estimate is also E[X | Y = y]
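A quick numeric check of these formulas (hypothetical parameter values):

# Conditional mean and variance of X given Y = y for a bivariate Gaussian,
# matching the formulas above.
m1, m2 = 1.0, -1.0
s1, s2 = 2.0, 1.0
rho = 0.8

y = 0.5
e_x_given_y = m1 + rho * (s1 / s2) * (y - m2)   # E[X | Y=y]
var_x_given_y = (1 - rho**2) * s1**2            # VAR[X | Y=y], free of y
print(e_x_given_y, var_x_given_y)               # 3.4, 1.44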
Conditional Gaussian Distributions

X and Y are now VECTOR random variables


MAP Estimators
Gaussian Systems
Hypothesis Testing

• Binary Hypothesis Testing


• Likelihood Ratio Test & Neyman Pearson Lemma
• Significance Testing
• Bayesian Hypothesis Testing
• MAP Rule
• Minimum Cost Decisions
• ROC Curves
• Naïve Bayes Classifier
• ML Estimator
• Laplace Smoothing
Binary Hypothesis Testing
Likelihood Ratio Test

The ML decision rule compares the likelihood ratio L(x) to 1; other decision rules result from comparing L(x) to other thresholds. The corresponding log-likelihood ratio test compares log L(x) to the log of the threshold.
We obtain a class of decision rules by varying the threshold.
The ML rule corresponds to the threshold where
the two pdfs are equal. As the threshold
gamma approaches infinity, alpha
approaches zero and beta approaches one;
as gamma approaches zero, alpha
approaches one and beta approaches zero.
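A minimal sketch of this threshold sweep for two hypothetical unit-variance Gaussians; since L(x) is monotone in x here, "L(x) > gamma" reduces to "x > t", and sweeping t traces out the (alpha, beta) trade-off, i.e., points on the ROC curve (t = 0.5 is where the two pdfs are equal, the ML rule):

import numpy as np
from scipy.stats import norm

# Likelihood ratio test for H0: X ~ N(0,1) vs H1: X ~ N(1,1).
for t in [-1.0, 0.0, 0.5, 1.0, 2.0]:
    alpha = 1 - norm.cdf(t, loc=0, scale=1)  # P[decide H1 | H0] (Type I)
    beta = norm.cdf(t, loc=1, scale=1)       # P[decide H0 | H1] (Type II)
    print(t, round(alpha, 3), round(beta, 3))
# As t -> infinity, alpha -> 0 and beta -> 1; as t -> -infinity, the reverse.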
Neyman Pearson Lemma
Explanation of the derivation:
• First assume there is a rule that achieves Type I error alpha.
• Next consider minimizing the Type II error among all rules that attain Type I error alpha.
• Minimizing the Type II error subject to the constraint on the Type I error leads to the Lagrangian expression.
• The expression is minimized by assigning to the acceptance region all values of x for which the integrand on the
previous page is negative; this implies a likelihood ratio test with threshold lambda.
• Finally, pick lambda so that the Type I error constraint is met.
Bayesian Hypothesis Testing
Bayesian binary hypothesis tests can be designed to minimize the average cost of the decision rule
• Example 1: the cost could be the probability of error, i.e., the prior-weighted sum of the Type I and Type II errors
• General case: reward correct decisions (C00, C11) and penalize errors (C01, C10)
• Both cases are solved by likelihood ratio tests (see the next two charts)
Minimum Cost Decisions

See Section 8.6.2 in Leon-Garcia for proof


Bayesian K-ary Hypothesis tests extend the binary case to K hypotheses
• ML, MAP and Minimum cost decision rules can be derived
Naïve Bayes Assumption
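A minimal sketch of the assumption in action: a Naive Bayes classifier with binary features and Laplace (+1) smoothing, on hypothetical toy data (the names and data are illustrative only):

import numpy as np

# Naive Bayes: P(x|c) = prod_j P(x_j|c), with Laplace smoothing (+1)
# so unseen feature values don't zero out a class.
X = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1]])  # toy data
y = np.array([0, 0, 1, 1])

classes = np.unique(y)
priors = np.array([(y == c).mean() for c in classes])
# Smoothed per-class probability that feature j equals 1:
theta = np.array([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2)
                  for c in classes])

def predict(x):
    # log posterior up to a constant: log prior + sum of feature log-likelihoods
    log_post = np.log(priors) + (x * np.log(theta)
                                 + (1 - x) * np.log(1 - theta)).sum(axis=1)
    return classes[np.argmax(log_post)]

print(predict(np.array([1, 0, 0])))  # -> 0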
Gaussian Discriminant Analysis

• Consider Gaussian classes c in C


• with priors
• Mean & Covariance Matrices
• Given data X decide which class c is present
• Bayesian Hypothesis Test (Classification)
Case 1: Equal Covariance Matrices
Case 2: Unequal Covariance Matrices
Quadratic Discriminant
Training Gaussian Parameters
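A minimal sketch of training and using these parameters (hypothetical helper names and data; the quadratic-discriminant form, which reduces to a linear discriminant when the covariances are equal):

import numpy as np

# Training GDA parameters: class priors, means, and covariances from
# labeled data; classify by the largest Gaussian log-posterior.
def fit_gda(X, y):
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),         # prior
                     Xc.mean(axis=0),          # mean vector
                     np.cov(Xc, rowvar=False)) # covariance matrix
    return params

def classify(x, params):
    best, best_score = None, -np.inf
    for c, (prior, mu, S) in params.items():
        d = x - mu
        # log prior + log Gaussian density (dropping the shared constant)
        score = (np.log(prior) - 0.5 * np.log(np.linalg.det(S))
                 - 0.5 * d @ np.linalg.solve(S, d))
        if score > best_score:
            best, best_score = c, score
    return best

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, -0.1],
              [1.0, 1.2], [1.1, 0.9], [0.9, 1.1]])
y = np.array([0, 0, 0, 1, 1, 1])
print(classify(np.array([0.15, 0.05]), fit_gda(X, y)))  # -> 0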
Linear Regression
• Gauss and Least Squares Method
• Regression to the Mean
• Linear Regression: Orthogonal Projection
• Linear Regression: Curve-Fitting
• Bayesian Regression
• Regularization and Ridge Regression
Linear Regression
Linear Regression: Curve Fitting
Bayesian Regression
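A minimal curve-fitting sketch with hypothetical data; the ridge solution shown is also the MAP estimate under a zero-mean Gaussian prior on the weights, which is one way to view Bayesian regression:

import numpy as np

# Polynomial least squares with ridge regularization:
# w = argmin ||Phi w - t||^2 + lam ||w||^2  =>  w = (Phi^T Phi + lam I)^-1 Phi^T t
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)  # noisy targets

degree, lam = 5, 1e-3
Phi = np.vander(x, degree + 1)            # polynomial design matrix
A = Phi.T @ Phi + lam * np.eye(degree + 1)
w = np.linalg.solve(A, Phi.T @ t)         # ridge solution
print(w)  # lam = 0 recovers ordinary least squares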
Discrete-Time Markov Chains

ECE 368

Reference: Leon-Garcia, Probability, Statistics, and Random Processes, Chapter 11
Markov Chain

• A discrete-valued random sequence Xn is a


Markov chain if the future of the process given
the present is independent of the past, that is,
• If Xn is discrete-valued,

P[X_{n+1} = x_{n+1} | X_n = x_n, X_{n−1} = x_{n−1}, …, X_1 = x_1]
= P[X_{n+1} = x_{n+1} | X_n = x_n]

(future given present and past equals future given the present only)
Discrete-Time Markov Chains
• Let Xn be a discrete-time integer-valued
Markov chain that starts at n = 0 with pmf
p_j(0) ≜ P[X_0 = j],   j = 0, 1, 2, …

• The joint pmf for the first n + 1 values of the


process is
P[X_n = i_n, …, X_0 = i_0]
= P[X_n = i_n | X_{n−1} = i_{n−1}] ⋯ P[X_1 = i_1 | X_0 = i_0] P[X_0 = i_0]

• Joint pmf for a sequence is product of


– probability for the initial state
– probabilities for subsequent one-step state transitions.
Homogeneous Transition Probabilities
• Assume that the one-step state transition
probabilities are fixed and do not change with
time, that is,
P[X_{n+1} = j | X_n = i] = p_ij   for all n
• Xn is said to have homogeneous transition
probabilities.

• The joint pmf for Xn,…, X0 is then given by

P[X_n = i_n, …, X_0 = i_0] = p_{i_{n−1}, i_n} ⋯ p_{i_0, i_1} p_{i_0}(0)
Transition Probability Matrix
• Xn is completely specified by the initial pmf pi(0)
and the matrix of one-step transition probabi-
lities P, or transition probability matrix:
     ⎡ p00  p01  p02  ⋯ ⎤
     ⎢ p10  p11  p12  ⋯ ⎥
P =  ⎢  ⋮    ⋮    ⋮      ⎥
     ⎢ pi0  pi1       ⋯ ⎥
     ⎣  ⋮    ⋮          ⎦
• Note that each row of P must add to 1, since

1 = Σ_j P[X_{n+1} = j | X_n = i] = Σ_j p_ij
n-Step Transition Probabilities

• The matrix of two-step transition probabilities P(2) is:

P(2) = P(1) P(1) = P²

• Using the preceding arguments, P(n) is found by multiplying


P(n – 1) by P:

P(n) = P(n − 1) P, or P(n) = Pⁿ

• The n-step transition probability matrix is the nth power of the


one-step transition probability matrix.
The State Probabilities

• Let p(n) = {pj(n)} denote the row vector of state probabilities


at time n. The probability pj(n) is related to p(n – 1) by

p_j(n) = Σ_i P[X_n = j | X_{n−1} = i] P[X_{n−1} = i] = Σ_i p_ij p_i(n − 1)
The State Probabilities II
• p(n) is obtained by multiplying the row vector p(n – 1) by
the matrix P:

p(n) = p(n - 1)P

• Similarly, pj(n) is related to p(0) by

p_j(n) = Σ_i P[X_n = j | X_0 = i] P[X_0 = i] = Σ_i p_ij(n) p_i(0)

p(n) = p(0) Pⁿ
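A quick numeric illustration of p(n) = p(0) Pⁿ for a hypothetical two-state chain:

import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
p0 = np.array([1.0, 0.0])   # start in state 0

p = p0.copy()
for _ in range(50):
    p = p @ P                # p(n) = p(n-1) P
print(p)                     # approaches the steady-state pmf [0.8, 0.2]
print(np.linalg.matrix_power(P, 50)[0])  # rows of P^n converge to that pmf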
Steady-State Probabilities

• Many Markov chains settle into stationary


behavior after the process has been running for
a long time; the initial state becomes irrelevant.
• As n → ∞, Pⁿ approaches a matrix in which all
the rows are equal to the same pmf, i.e.,

p_ij(n) → π_j   for all i

p_j(n) = Σ_i π_j p_i(0) = π_j Σ_i p_i(0) = π_j
Steady-State Probabilities II

• Consequently, probability of state j approaches a constant


independent of time and of the initial state probabilities:

p_j(n) → π_j   for every initial pmf

• We say the system reaches “equilibrium” or “steady state.”


• The steady state pmf summarizes our knowledge about the process after it
has been running for a long time.
Steady-State Probabilities III
• If a Markov chain has a steady state, then the
steady state pmf π ≜ {πj} is found by noting that as
n → ∞, pj(n) → πj and pi(n – 1) → πi, so

π_j = Σ_i p_ij π_i

π = π P

Σ_i π_i = 1

• We refer to π as the stationary state pmf of the Markov chain.
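A minimal sketch of solving π = πP with Σ_i π_i = 1 for the same hypothetical chain:

import numpy as np

# Rewrite pi = pi P as (P^T - I) pi = 0 and replace one (redundant)
# equation with the normalization constraint sum(pi) = 1.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
n = P.shape[0]
A = P.T - np.eye(n)
A[-1, :] = 1.0               # last row enforces sum(pi) = 1
b = np.zeros(n); b[-1] = 1.0
pi = np.linalg.solve(A, b)
print(pi)                    # [0.8, 0.2] for this chain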


Recurrence & Classes

• The behavior of a Markov chain is determined by


its transition matrix.
• The states in a discrete-time Markov chain can
be divided into one or more separate classes,
where each class is of a different type.
• The long-term behavior of a Markov chain is
related to the types of its state classes.
Classes of States

• State j is accessible from state i if there is a sequence


of transitions from i to j that has non-zero probability.
• States i and j communicate if they are accessible to
each other; we write i ↔ j.
– A state always communicates with itself.
• If i ↔ j and j ↔ k, then i ↔ k.
• Two states belong to the same class if they
communicate with each other.
• States in the same class share the same fate.
Classes of States II
• Two different classes must be completely disjoint, that is, they
cannot have any state in common.

• The states of a Markov chain consist of one or more disjoint


classes
• A Markov chain that consists of a single class is said to be
irreducible.
Recurrence Properties

• Start a Markov chain at state i.


• State i is said to be recurrent if the process returns to the state
with probability one, i.e.,
fi = P[ever returning to state i] = 1

• State i is transient if
fi < 1.

• If a Markov chain starts in a transient state, then with
probability one the state reoccurs only a finite number of times.
Irreducible Markov Chains

• If a Markov chain is irreducible, then either all of its states
are transient or all of them are recurrent.
• If the # of states in the chain is finite, it is impossible for
all the states to be transient.
• Thus the states of a finite-state, irreducible Markov
chain are all recurrent.
Periodic & Aperiodic Classes

• State i has period d if it can only reoccur at times that


are multiples of d.
• It can be shown that all the states in a class have the
same period.
• An irreducible Markov chain is said to be aperiodic if
the states in its single class have period one.
Stationary Probabilities & Limiting Probabilities

• The stationary state pmf is defined by the


global balance equations:

π_j = Σ_i p_ij π_i        Σ_i π_i = 1

• The stationary state probabilities correspond to the long-term


proportion of time spent in the states.
Theorem 1
For an irreducible, aperiodic, and positive recurrent Markov
chain:

lim_{n→∞} p_ij(n) = π_j,   for all j,

where πj is the unique nonnegative solution of the global


balance equations.

• For these Markov chains, the state probabilities approach


steady state values that are independent of the initial
condition.
• These Markov chains are called ergodic.
Theorem 2
For an irreducible, periodic, and positive recurrent Markov chain
with period d,

lim_{n→∞} p_jj(nd) = d π_j,   for all j,

where πj is the unique nonnegative solution of the global


balance equations.

• For these periodic Markov chains, state j can be occupied only at
1/d of the time instants, so the occupancy probability dπ_j at those
recurrence times corresponds to a long-term proportion π_j of all
time instants spent in the state.
Types of Classes
State j

• Transient: π_j = 0
• Recurrent
  – Null recurrent: π_j = 0
  – Positive recurrent: π_j > 0
    · Aperiodic: lim_{n→∞} p_jj(n) = π_j
    · Periodic: lim_{n→∞} p_jj(nd) = d π_j
Bayesian Networks
Conditional Independence
Random Fields
Inference on Markov Chains:
Brute Force:
General Case:
Inference of Maximum Likelihood Sequence
Summary: Inference on graphical models
Hidden Markov Model HMM
Viterbi Algorithm
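A minimal log-domain Viterbi sketch for a hypothetical two-state, two-symbol HMM (the function and parameter names are illustrative; A is the transition matrix, B the emission matrix):

import numpy as np

# Viterbi: most likely hidden state sequence for an HMM, in log domain.
# A[i, j] = P(z_t = j | z_{t-1} = i), B[j, k] = P(x_t = k | z_t = j).
def viterbi(pi, A, B, obs):
    T, N = len(obs), len(pi)
    delta = np.log(pi) + np.log(B[:, obs[0]])  # best log-prob per end state
    psi = np.zeros((T, N), dtype=int)          # backpointers
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)    # scores[i, j]: from i to j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(pi, A, B, [0, 0, 1, 1]))  # -> [0, 0, 1, 1]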
Expectation Maximization
ECE 368
Estimating Gaussian Mixture Model
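A minimal EM sketch for a one-dimensional, two-component Gaussian mixture on simulated data (hypothetical initialization and names): the E-step computes responsibilities, the M-step re-estimates weights, means, and variances.

import numpy as np

def em_step(x, w, mu, var):
    # E-step: responsibility of component k for each point (N x K)
    dens = (np.exp(-0.5 * (x[:, None] - mu)**2 / var)
            / np.sqrt(2 * np.pi * var))
    r = w * dens
    r /= r.sum(axis=1, keepdims=True)
    # M-step: weighted ML updates
    Nk = r.sum(axis=0)
    w = Nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    var = (r * (x[:, None] - mu)**2).sum(axis=0) / Nk
    return w, mu, var

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)])
w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    w, mu, var = em_step(x, w, mu, var)
print(w, mu, var)  # weights near [0.6, 0.4], means near -2 and 3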
Estimating HMM
