
Machine Learning CS 4641-7641

Probability and Statistics

Mahdi Roozbahani
Georgia Tech

These slides are based on slides from Le Song, Sam Roweis, and Chao Zhang.
Outline
• Probability Distributions
• Joint and Conditional Probability Distributions
• Bayes’ Rule
• Mean and Variance
• Properties of Gaussian Distribution
• Maximum Likelihood Estimation

Probability

Three Key Ingredients in Probability Theory

A sample space is the collection of all possible outcomes.

A random variable X represents an outcome in the sample space.

The probability that a random variable takes the value x is p(x) = p(X = x), with p(x) ≥ 0.

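As a minimal sketch (plain Python; the fair die is an assumed example), the three ingredients for one die throw: the sample space, a probability for each outcome, and the basic requirements that p(x) ≥ 0 and the probabilities sum to 1:

```python
# Sample space: all possible outcomes of one die throw
sample_space = [1, 2, 3, 4, 5, 6]

# Probability of each outcome for a fair die: p(X = x) = 1/6
p = {x: 1 / 6 for x in sample_space}

# Basic requirements: p(x) >= 0 for every x, and the probabilities sum to 1
assert all(px >= 0 for px in p.values())
assert abs(sum(p.values()) - 1.0) < 1e-12
```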
Continuous variable
• Continuous probability distribution
• Probability density function: ∫ p(x) dx = 1
• p(x) is a density (likelihood) value
• Example: temperature (a real number), e.g. a Gaussian distribution

Discrete variable
• Discrete probability distribution
• Probability mass function: Σ_{x∈A} p(x) = 1
• p(x) is a probability value
• Example: coin flip (an integer), e.g. a Bernoulli distribution
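A minimal sketch (assuming NumPy and SciPy; the parameter values are made-up examples) contrasting the two cases: a Gaussian density integrates to 1 over the real line, while a Bernoulli mass function sums to 1 over its discrete outcomes:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm, bernoulli

# Continuous: Gaussian density over temperature-like real values
mu, sigma = 20.0, 5.0                                      # assumed example parameters
area, _ = quad(lambda x: norm.pdf(x, mu, sigma), -np.inf, np.inf)
print(area)                                                # ~1.0: densities integrate to 1
print(norm.pdf(20.0, mu, sigma))                           # a density value, not a probability

# Discrete: Bernoulli mass function over a coin flip, x in {0, 1}
theta = 0.6                                                # assumed probability of heads
print(bernoulli.pmf(0, theta) + bernoulli.pmf(1, theta))   # exactly 1.0
```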
Continuous Probability Functions

Discrete Probability Functions

In a Bernoulli distribution, just a single trial is conducted.

Over n trials, k is the number of successes and n − k is the number of failures.

(n choose k) = n! / (k! (n − k)!) is the total number of ways of selecting k distinct combinations out of n trials, irrespective of order.
Outline
• Probability Distributions
• Joint and Conditional Probability Distributions
• Bayes’ Rule
• Mean and Variance
• Properties of Gaussian Distribution
• Maximum Likelihood Estimation

Example
X and Y are random variables
N = total number of trials
n_ij = number of occurrences of (X = x_i, Y = y_j)

X = throw a die, Y = flip a coin

              X = 1   X = 2   X = 3   X = 4   X = 5   X = 6  |  c_j
Y = tail        3       4       2       5       1       5    |   20
Y = head        2       2       4       2       4       1    |   15
c_i             5       6       6       7       5       6    |  N = 35

Probability:               p(X = x_i) = c_i / N
Joint probability:         p(X = x_i, Y = y_j) = n_ij / N
Conditional probability:   p(Y = y_j | X = x_i) = n_ij / c_i

Sum rule:
p(X = x_i) = Σ_{j=1}^{L} p(X = x_i, Y = y_j)    ⇒    p(X) = Σ_Y p(X, Y)

Product rule:
p(X = x_i, Y = y_j) = n_ij / N = (n_ij / c_i)(c_i / N) = p(Y = y_j | X = x_i) p(X = x_i)
⇒    p(X, Y) = p(Y | X) p(X)
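A minimal sketch (assuming NumPy) that rebuilds the count table above and checks the marginal, conditional, sum-rule, and product-rule relations numerically:

```python
import numpy as np

# n_ij counts from the table: rows are Y (tail, head), columns are X = 1..6
counts = np.array([[3, 4, 2, 5, 1, 5],     # Y = tail
                   [2, 2, 4, 2, 4, 1]])    # Y = head
N = counts.sum()                           # 35

p_joint = counts / N                       # p(X = x_i, Y = y_j) = n_ij / N
p_x = counts.sum(axis=0) / N               # p(X = x_i) = c_i / N  (sum rule)
p_y_given_x = counts / counts.sum(axis=0)  # p(Y = y_j | X = x_i) = n_ij / c_i

# Product rule: p(X, Y) = p(Y | X) p(X)
assert np.allclose(p_joint, p_y_given_x * p_x)
print(p_x)                                 # [5/35, 6/35, 6/35, 7/35, 5/35, 6/35]
```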
Conditional Independence

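As a hedged illustration (assuming NumPy; the distributions below are made-up toy numbers), conditional independence of X and Y given Z means p(X, Y | Z) = p(X | Z) p(Y | Z) for every value of Z:

```python
import numpy as np

# Toy joint p(z, x, y) built so that X and Y are conditionally independent given Z:
# p(z, x, y) = p(z) p(x | z) p(y | z)
p_z = np.array([0.4, 0.6])
p_x_given_z = np.array([[0.2, 0.8],      # p(x | z = 0)
                        [0.7, 0.3]])     # p(x | z = 1)
p_y_given_z = np.array([[0.5, 0.5],
                        [0.1, 0.9]])
p_zxy = p_z[:, None, None] * p_x_given_z[:, :, None] * p_y_given_z[:, None, :]

# Check: for each z, p(X, Y | Z = z) factorizes as p(X | Z = z) p(Y | Z = z)
p_xy_given_z = p_zxy / p_zxy.sum(axis=(1, 2), keepdims=True)
assert np.allclose(p_xy_given_z, p_x_given_z[:, :, None] * p_y_given_z[:, None, :])
```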
Outline
• Probability Distributions
• Joint and Conditional Probability Distributions
• Bayes’ Rule
• Mean and Variance
• Properties of Gaussian Distribution
• Maximum Likelihood Estimation

Bayes’ Rule

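As a minimal numerical sketch (assuming NumPy), Bayes' rule p(Y | X) = p(X | Y) p(Y) / p(X) can be checked on the coin-and-die counts from the earlier example:

```python
import numpy as np

counts = np.array([[3, 4, 2, 5, 1, 5],     # Y = tail
                   [2, 2, 4, 2, 4, 1]])    # Y = head
N = counts.sum()

p_y = counts.sum(axis=1) / N                               # p(Y): [20/35, 15/35]
p_x = counts.sum(axis=0) / N                               # p(X)
p_x_given_y = counts / counts.sum(axis=1, keepdims=True)   # p(X | Y)

# Bayes' rule: p(Y | X) = p(X | Y) p(Y) / p(X)
p_y_given_x = p_x_given_y * p_y[:, None] / p_x

# Same answer obtained directly from the counts: n_ij / c_i
assert np.allclose(p_y_given_x, counts / counts.sum(axis=0))
```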
Outline
• Probability Distributions
• Joint and Conditional Probability Distributions
• Bayes’ Rule
• Mean and Variance
• Properties of Gaussian Distribution
• Maximum Likelihood Estimation

Mean and Variance

For Joint Distributions

Uncorrelated vs Independent RVs
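The distinction can be illustrated with a small simulation (assuming NumPy; the construction below is a standard textbook example, not taken from the slides): if X is standard normal and Y = X², the two variables are clearly dependent, yet their covariance is approximately zero, so uncorrelated does not imply independent:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = x ** 2                        # fully determined by x, hence dependent

print(np.cov(x, y)[0, 1])         # ~0: X and Y are uncorrelated
# But knowing x pins down y exactly, so they are not independent:
# p(Y | X = x) is a point mass at x**2, not the marginal of Y.
```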
Outline
• Probability Distributions
• Joint and Conditional Probability Distributions
• Bayes’ Rule
• Mean and Variance
• Properties of Gaussian Distribution
• Maximum Likelihood Estimation

Gaussian Distribution
f(x | μ, σ²) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²))

Probability versus likelihood
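A minimal sketch (assuming NumPy; the observed value is made up) of the same formula read two ways: as a density over x with μ and σ² fixed, and as a likelihood over μ with the observation fixed:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """f(x | mu, sigma^2) = exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)"""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Density view: fix parameters, vary x (integrates to ~1 over x)
xs = np.linspace(-5, 5, 1001)
density = gaussian_pdf(xs, mu=0.0, sigma2=1.0)
print((density * (xs[1] - xs[0])).sum())   # ~1.0

# Likelihood view: fix an observation, vary mu
# (values are likelihoods of the parameter, not probabilities of mu)
x_obs = 1.3                                # assumed observed value
mus = np.linspace(-5, 5, 1001)
likelihood = gaussian_pdf(x_obs, mus, sigma2=1.0)
print(mus[np.argmax(likelihood)])          # maximized at mu = x_obs
```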


Multivariate Gaussian Distribution

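A minimal sketch (assuming SciPy; the mean and covariance below are assumed example values) evaluating the standard d-dimensional Gaussian density f(x | μ, Σ) = (2π)^(−d/2) |Σ|^(−1/2) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ)):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])                  # assumed mean vector
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])             # assumed covariance matrix (symmetric, PSD)

x = np.array([0.3, 0.8])
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))

# Same value computed from the formula directly
d = len(mu)
diff = x - mu
quad_form = diff @ np.linalg.inv(Sigma) @ diff
norm_const = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
print(norm_const * np.exp(-0.5 * quad_form))
```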
Properties of Gaussian Distribution

Central Limit Theorem
[Figure: probability mass function of a biased die, outcomes X = 1 to 6]

Let's say I am going to draw samples from this pmf, each of size n = 4:

S_1 = {1, 1, 1, 6}  ⇒  E[S_1] = 2.25
S_2 = {1, 1, 3, 6}  ⇒  E[S_2] = 2.75
…
S_m = {1, 4, 6, 6}  ⇒  E[S_m] = 4.25

According to the CLT, the distribution of these sample means will follow a bell curve (normal distribution).

[Figure: histogram of the sample means, approximately normal]
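A small simulation sketch (assuming NumPy; the biased-die probabilities below are assumed example values, not the exact pmf from the figure) showing that the means of many size-n samples pile up around the true mean with an approximately normal shape:

```python
import numpy as np

rng = np.random.default_rng(0)

faces = np.array([1, 2, 3, 4, 5, 6])
probs = np.array([0.35, 0.25, 0.15, 0.10, 0.10, 0.05])   # assumed biased pmf (sums to 1)
n, m = 4, 100_000                                         # sample size and number of samples

samples = rng.choice(faces, size=(m, n), p=probs)         # m samples S_1 ... S_m, each of size n
sample_means = samples.mean(axis=1)                       # E[S_1], E[S_2], ..., E[S_m]

true_mean = (faces * probs).sum()
print(true_mean, sample_means.mean(), sample_means.std())
# A histogram of sample_means is approximately bell-shaped around true_mean,
# and it narrows as n grows (CLT).
```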
Outline
• Probability Distributions
• Joint and Conditional Probability Distributions
• Bayes’ Rule
• Mean and Variance
• Properties of Gaussian Distribution
• Maximum Likelihood Estimation

Maximum Likelihood Estimation

Main assumption:
Independent and identically distributed (i.i.d.) random variables

Maximum Likelihood Estimation
For a Bernoulli distribution (i.e., a coin flip):

Objective function: P(x_i | θ) = θ^(x_i) (1 − θ)^(1 − x_i),   x_i ∈ {0, 1} or {head, tail}

L(θ | X) = L(θ | X = x_1, X = x_2, X = x_3, …, X = x_n)

With the i.i.d. assumption:

L(θ | X) = ∏_{i=1}^{n} P(x_i | θ) = ∏_{i=1}^{n} θ^(x_i) (1 − θ)^(1 − x_i)

L(θ | X) = θ^(x_1) (1 − θ)^(1 − x_1) × θ^(x_2) (1 − θ)^(1 − x_2) × … × θ^(x_n) (1 − θ)^(1 − x_n)
         = θ^(Σ x_i) (1 − θ)^(Σ (1 − x_i))

We don't like multiplication; let's convert it into a summation.

What's the trick? Take the log:

L(θ | X) = θ^(Σ x_i) (1 − θ)^(Σ (1 − x_i))

log L(θ | X) = ℓ(θ | X) = log(θ) Σ_{i=1}^{n} x_i + log(1 − θ) Σ_{i=1}^{n} (1 − x_i)

How to optimize θ? Set the derivative to zero:

∂ℓ(θ | X) / ∂θ = 0   ⇒   (Σ_{i=1}^{n} x_i) / θ − (Σ_{i=1}^{n} (1 − x_i)) / (1 − θ) = 0

θ = (1/n) Σ_{i=1}^{n} x_i
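A minimal sketch (assuming NumPy; the true θ = 0.7 is made up for the simulation) checking the result numerically: for simulated coin flips, the θ that maximizes the log-likelihood on a grid matches the sample mean (1/n) Σ x_i:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=1000)        # simulated flips with assumed true theta = 0.7

def log_likelihood(theta, x):
    # l(theta | X) = log(theta) * sum(x_i) + log(1 - theta) * sum(1 - x_i)
    return np.log(theta) * x.sum() + np.log(1 - theta) * (len(x) - x.sum())

thetas = np.linspace(0.001, 0.999, 999)
theta_hat = thetas[np.argmax([log_likelihood(t, x) for t in thetas])]

print(theta_hat, x.mean())                 # grid maximizer ~ sample mean ~ 0.7
```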
