Machine Learning
Predictive Model: Probability and Stats
Mahdi Roozbahani
Georgia Tech
These slides are based on slides from Le Song, Sam Roweis, and Chao Zhang.
Outline
• Probability Distributions
• Joint and Conditional Probability Distributions
• Bayes’ Rule
• Mean and Variance
• Properties of Gaussian Distribution
• Maximum Likelihood Estimation
Probability
Three Key Ingredients in Probability Theory
$p(x) \ge 0$ (probabilities are nonnegative)
Continuous variable:
• Continuous probability distribution
• Probability density function: $\int p(x)\,dx = 1$
• $p(x)$ is a density or likelihood value
• Example: temperature (a real number)
• Example distribution: Gaussian

Discrete variable:
• Discrete probability distribution
• Probability mass function: $\sum_{x \in A} p(x) = 1$
• $p(x)$ is a probability value
• Example: coin flip (an integer)
• Example distribution: Bernoulli
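A minimal Python sketch of the two normalization constraints, using scipy (the specific Gaussian and coin-flip parameters are only illustrative):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Continuous case: a pdf integrates to 1; its values are densities, not
# probabilities, and can exceed 1 (a narrow Gaussian peaks near 1.6 here).
dist = stats.norm(loc=0.0, scale=0.25)
area, _ = quad(dist.pdf, -np.inf, np.inf)
print(area)                          # ~1.0
print(dist.pdf(0.0))                 # ~1.596, a density value, not a probability

# Discrete case: a pmf sums to 1 over its support; each value is a probability.
p_heads = 0.3
bernoulli_pmf = {0: 1 - p_heads, 1: p_heads}
print(sum(bernoulli_pmf.values()))   # exactly 1.0
```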
Continuous Probability Functions
Discrete Probability Functions
Binomial distribution: $p(k \mid n, \theta) = \binom{n}{k}\,\theta^{k}(1-\theta)^{n-k}$, where $k$ is the number of successes in $n$ trials. The binomial coefficient $\binom{n}{k}$ is the total number of ways of selecting $k$ distinct combinations of the $n$ trials, irrespective of order.
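A quick sketch of this pmf using Python's math.comb; the 10-flip fair-coin numbers are just an illustration:

```python
from math import comb

def binomial_pmf(k: int, n: int, theta: float) -> float:
    """P(k successes in n independent trials, each succeeding w.p. theta)."""
    # comb(n, k) counts the ways to choose which k of the n trials succeed,
    # irrespective of order.
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# Probability of exactly 2 heads in 10 flips of a fair coin:
print(binomial_pmf(2, 10, 0.5))      # comb(10, 2) / 2**10 ~ 0.0439
# The pmf sums to 1 over k = 0..n:
print(sum(binomial_pmf(k, 10, 0.5) for k in range(11)))  # 1.0
```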
Outline
• Probability Distributions
• Joint and Conditional Probability Distributions
• Bayes’ Rule
• Mean and Variance
• Properties of Gaussian Distribution
• Maximum Likelihood Estimation
Example
X and Y are random variables: X = roll of a die, Y = flip of a coin.
N = total number of trials; $n_{ij}$ = number of occurrences of $(x_i, y_j)$.

             x = 1   x = 2   x = 3   x = 4   x = 5   x = 6  |  c_j
Y = tail       3       4       2       5       1       5    |  20
Y = head       2       2       4       2       4       1    |  15
c_i            5       6       6       7       5       6    |  N = 35
Probability: $p(X = x_i) = \frac{c_i}{N}$

Joint probability: $p(X = x_i, Y = y_j) = \frac{n_{ij}}{N}$

Conditional probability: $p(Y = y_j \mid X = x_i) = \frac{n_{ij}}{c_i}$
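All three quantities can be read straight off the count table; a short numpy sketch reusing the counts above:

```python
import numpy as np

# Counts n_ij from the table; rows are Y (tail, head), columns are X = 1..6.
counts = np.array([
    [3, 4, 2, 5, 1, 5],   # Y = tail
    [2, 2, 4, 2, 4, 1],   # Y = head
])
N = counts.sum()                 # 35

joint = counts / N               # p(X = x_i, Y = y_j) = n_ij / N
c_i = counts.sum(axis=0)         # column totals [5, 6, 6, 7, 5, 6]
p_x = c_i / N                    # p(X = x_i) = c_i / N
cond = counts / c_i              # p(Y = y_j | X = x_i) = n_ij / c_i

print(p_x)                       # marginal of X
print(cond[:, 0])                # p(Y | X = 1) = [0.6, 0.4]
```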
Sum rule:
$p(X = x_i) = \sum_{j=1}^{L} p(X = x_i, Y = y_j) \;\Rightarrow\; p(X) = \sum_{Y} p(X, Y)$

Product rule:
$p(X = x_i, Y = y_j) = \frac{n_{ij}}{N} = \frac{n_{ij}}{c_i} \cdot \frac{c_i}{N} = p(Y = y_j \mid X = x_i)\, p(X = x_i)$
$\Rightarrow\; p(X, Y) = p(Y \mid X)\, p(X)$
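Both rules can be checked numerically on the same table; a small numpy sketch:

```python
import numpy as np

# Reusing the joint p(X, Y) from the dice-and-coin table above.
counts = np.array([[3, 4, 2, 5, 1, 5],
                   [2, 2, 4, 2, 4, 1]])
joint = counts / counts.sum()     # p(X, Y); rows index Y

# Sum rule: the marginal p(X) is the joint summed over Y.
p_x = joint.sum(axis=0)

# Product rule: p(X, Y) = p(Y | X) p(X), rebuilt column by column.
p_y_given_x = joint / p_x         # broadcasting divides each column by p(x_i)
assert np.allclose(p_y_given_x * p_x, joint)
print(p_x)                        # [5, 6, 6, 7, 5, 6] / 35
```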
Conditional Independence
X and Y are conditionally independent given Z if $p(X, Y \mid Z) = p(X \mid Z)\, p(Y \mid Z)$. Note that conditional independence given Z does not require X and Y to be independent on their own, as the numerical illustration below shows.
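A hypothetical three-variable joint (all numbers invented for illustration), built so the factorization holds given Z while X and Y remain marginally dependent:

```python
import numpy as np

# Hypothetical joint p(Z, X, Y), constructed so that X and Y are
# conditionally independent given Z.
p_z = np.array([0.4, 0.6])
p_x_given_z = np.array([[0.9, 0.1],     # p(X | Z = 0)
                        [0.2, 0.8]])    # p(X | Z = 1)
p_y_given_z = np.array([[0.7, 0.3],     # p(Y | Z = 0)
                        [0.1, 0.9]])    # p(Y | Z = 1)
joint = p_z[:, None, None] * p_x_given_z[:, :, None] * p_y_given_z[:, None, :]

# Given Z, the joint over (X, Y) factorizes: p(X, Y | Z) = p(X | Z) p(Y | Z).
for z in range(2):
    p_xy_given_z = joint[z] / joint[z].sum()
    assert np.allclose(p_xy_given_z, np.outer(p_x_given_z[z], p_y_given_z[z]))

# Marginally, X and Y are still dependent: p(X, Y) != p(X) p(Y).
p_xy = joint.sum(axis=0)
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
print(np.allclose(p_xy, np.outer(p_x, p_y)))    # False
```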
Outline
• Probability Distributions
• Joint and Conditional Probability Distributions
• Bayes’ Rule
• Mean and Variance
• Properties of Gaussian Distribution
• Maximum Likelihood Estimation
Bayes’ Rule
From the product rule, $p(X, Y) = p(Y \mid X)\, p(X) = p(X \mid Y)\, p(Y)$, so
$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}$
where the evidence $p(X) = \sum_{Y} p(X \mid Y)\, p(Y)$ follows from the sum rule.
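A small worked example, with hypothetical diagnostic-test numbers chosen only to exercise the formula:

```python
# Hypothetical numbers, invented purely to illustrate Bayes' rule.
p_disease = 0.01                 # prior p(Y)
p_pos_given_disease = 0.95       # likelihood p(X | Y)
p_pos_given_healthy = 0.05       # likelihood p(X | not Y)

# Evidence p(X) via the sum rule, then the posterior via Bayes' rule.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)       # ~0.161: one positive test is weak evidence
```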
Outline
• Probability Distributions
• Joint and Conditional Probability Distributions
• Bayes’ Rule
• Mean and Variance
• Properties of Gaussian Distribution
• Maximum Likelihood Estimation
Mean and Variance
$\mathbb{E}[X] = \sum_x x\, p(x)$ (or $\int x\, p(x)\, dx$ for continuous X), and $\mathrm{Var}[X] = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$.

For Joint Distributions
$\mathrm{Cov}[X, Y] = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]$

Uncorrelated vs. Independent Random Variables
Independence implies zero covariance, but zero covariance does not imply independence (see the sketch below).
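A quick simulation of the classic counterexample, X uniform on {-1, 0, 1} and Y = X², sketched in numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
# Y is a deterministic function of X (fully dependent), yet Cov(X, Y) = 0,
# since E[X^3] = 0 and E[X] = 0.
x = rng.choice([-1, 0, 1], size=1_000_000).astype(float)
y = x ** 2

print(np.cov(x, y)[0, 1])        # ~0: uncorrelated
# Dependence is still there: p(Y = 1 | X = 1) = 1, while p(Y = 1) = 2/3.
```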
Outline
• Probability Distributions
• Joint and Conditional Probability Distributions
• Bayes’ Rule
• Mean and Variance
• Properties of Gaussian Distribution
• Maximum Likelihood Estimation
Gaussian Distribution
$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
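A direct transcription of the density into Python, checked against scipy.stats.norm (note that scipy parameterizes by σ, not σ²):

```python
import numpy as np
from scipy import stats

def gaussian_pdf(x, mu, sigma2):
    """f(x | mu, sigma^2) = exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

xs = np.linspace(-4, 4, 9)
mine = gaussian_pdf(xs, mu=0.0, sigma2=1.0)
ref = stats.norm(loc=0.0, scale=1.0).pdf(xs)   # scale is sigma, not sigma^2
print(np.allclose(mine, ref))                   # True
```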
Properties of Gaussian Distribution
The Gaussian is fully specified by its mean $\mu$ and variance $\sigma^2$. An affine transform $aX + b$ of a Gaussian is Gaussian, and the sum of independent Gaussians is Gaussian, with the means and variances adding.
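A Monte Carlo sketch of these two closure properties (all parameters arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=1.0, scale=2.0, size=1_000_000)   # X ~ N(1, 4)
y = rng.normal(loc=-3.0, scale=1.0, size=1_000_000)  # Y ~ N(-3, 1), independent

# Sum of independent Gaussians: means add, variances add -> N(-2, 5).
s = x + y
print(s.mean(), s.var())         # ~ -2.0, ~ 5.0

# Affine transform: aX + b ~ N(a*mu + b, a^2 * sigma^2) -> N(5, 36).
t = 3 * x + 2
print(t.mean(), t.var())         # ~ 5.0, ~ 36.0
```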
Central Limit Theorem
[Figure: probability mass function of a biased die, x = 1, ..., 6]
Say we draw samples of size $n = 4$ from this pmf and record each sample mean:
$S_1 = \{1, 1, 1, 6\} \Rightarrow E[S_1] = 2.25$
$\vdots$
$S_m = \{1, 4, 6, 6\} \Rightarrow E[S_m] = 4.25$
As the sample size grows, the distribution of these sample means approaches a Gaussian, regardless of the shape of the original distribution.
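A simulation sketch of the idea; the biased-die probabilities below are assumed for illustration (the original chart's exact values aren't recoverable):

```python
import numpy as np

rng = np.random.default_rng(7)
faces = np.array([1, 2, 3, 4, 5, 6])
pmf = np.array([0.35, 0.10, 0.10, 0.10, 0.05, 0.30])   # assumed biased-die pmf

# Draw m samples of size n = 4 and take the mean of each, as in S_1, ..., S_m.
n, m = 4, 100_000
samples = rng.choice(faces, size=(m, n), p=pmf)
means = samples.mean(axis=1)

# The histogram of `means` is already roughly bell-shaped around the true
# mean, and it tightens and becomes more Gaussian as n grows.
true_mean = (faces * pmf).sum()
print(true_mean, means.mean(), means.std())
```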
Maximum Likelihood Estimation
Main assumption: the data are independent and identically distributed (i.i.d.) random variables.
For a Bernoulli random variable (i.e., a coin flip):
$L(\theta \mid X) = L(\theta \mid X = x_1, X = x_2, \ldots, X = x_n)$
By the i.i.d. assumption:
$L(\theta \mid X) = \prod_{i=1}^{n} P(x_i \mid \theta)$
For a Bernoulli, $P(x_i \mid \theta) = \theta^{x_i}(1 - \theta)^{1 - x_i}$, so
$L(\theta \mid X) = \prod_{i=1}^{n} \theta^{x_i}(1 - \theta)^{1 - x_i} = \theta^{x_1}(1-\theta)^{1-x_1} \times \theta^{x_2}(1-\theta)^{1-x_2} \times \cdots \times \theta^{x_n}(1-\theta)^{1-x_n} = \theta^{\sum x_i}(1 - \theta)^{\sum (1 - x_i)}$
We don't like multiplication, so let's convert it into a summation by taking the log:
$\log L(\theta \mid X) = \left(\sum_{i=1}^{n} x_i\right) \log \theta + \left(\sum_{i=1}^{n} (1 - x_i)\right) \log(1 - \theta)$
How do we optimize $\theta$? Set the derivative to zero:
$\frac{d \log L}{d\theta} = \frac{\sum_i x_i}{\theta} - \frac{\sum_i (1 - x_i)}{1 - \theta} = 0 \;\Rightarrow\; \hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} x_i$
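A sketch verifying the closed-form estimate against a numerical maximizer of the log-likelihood (simulated data with an assumed true parameter of 0.3):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=1_000)       # i.i.d. Bernoulli(0.3) coin flips

# Closed-form MLE: the sample mean.
theta_hat = x.mean()

# Numerical check: maximize the log-likelihood (minimize its negative).
def neg_log_likelihood(theta):
    return -(x.sum() * np.log(theta) + (len(x) - x.sum()) * np.log(1 - theta))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6),
                      method="bounded")
print(theta_hat, res.x)                     # the two estimates agree
```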