
Machine Learning CS 4641-7641

Probability and Statistics

Mahdi Roozbahani
Georgia Tech

These slides are based on slides from Le Song, Sam Roweis, and Chao Zhang.
Outline
• Probability Distributions
• Joint and Conditional Probability Distributions
• Bayes’ Rule
• Mean and Variance
• Properties of Gaussian Distribution
• Maximum Likelihood Estimation

Probability

Three Key Ingredients in Probability Theory

A sample space is the collection of all possible outcomes.

A random variable X represents an outcome in the sample space.

The probability that a random variable takes the value x is p(x) = p(X = x), with p(x) ≥ 0.

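As a minimal sketch (plain Python; the fair die is an assumed example), the three ingredients for one die throw: the sample space, a probability for each outcome, and the basic requirements that p(x) ≥ 0 and the probabilities sum to 1:

```python
# Sample space: all possible outcomes of one die throw
sample_space = [1, 2, 3, 4, 5, 6]

# Probability of each outcome for a fair die: p(X = x) = 1/6
p = {x: 1 / 6 for x in sample_space}

# Basic requirements: p(x) >= 0 for every x, and the probabilities sum to 1
assert all(px >= 0 for px in p.values())
assert abs(sum(p.values()) - 1.0) < 1e-12
```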
Continuous variable
• Continuous probability distribution
• Probability density function: ∫ p(x) dx = 1
• p(x) is a density (likelihood) value
• Example: temperature (a real number), e.g. a Gaussian distribution

Discrete variable
• Discrete probability distribution
• Probability mass function: Σ_{x∈A} p(x) = 1
• p(x) is a probability value
• Example: coin flip (an integer), e.g. a Bernoulli distribution
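A minimal sketch (assuming NumPy and SciPy; the parameter values are made-up examples) contrasting the two cases: a Gaussian density integrates to 1 over the real line, while a Bernoulli mass function sums to 1 over its discrete outcomes:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm, bernoulli

# Continuous: Gaussian density over temperature-like real values
mu, sigma = 20.0, 5.0                                      # assumed example parameters
area, _ = quad(lambda x: norm.pdf(x, mu, sigma), -np.inf, np.inf)
print(area)                                                # ~1.0: densities integrate to 1
print(norm.pdf(20.0, mu, sigma))                           # a density value, not a probability

# Discrete: Bernoulli mass function over a coin flip, x in {0, 1}
theta = 0.6                                                # assumed probability of heads
print(bernoulli.pmf(0, theta) + bernoulli.pmf(1, theta))   # exactly 1.0
```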
Continuous Probability Functions

Discrete Probability Functions

In a Bernoulli distribution, just a single trial is conducted.

Over n trials, k is the number of successes and n − k is the number of failures.

(n choose k) = n! / (k! (n − k)!) is the total number of ways of selecting k distinct combinations out of n trials, irrespective of order.
Outline
• Probability Distributions
• Joint and Conditional Probability Distributions
• Bayes’ Rule
• Mean and Variance
• Properties of Gaussian Distribution
• Maximum Likelihood Estimation

Example
X and Y are random variables
N = total number of trials
n_ij = number of occurrences of (X = x_i, Y = y_j)

X = throw a die, Y = flip a coin

              X = 1   X = 2   X = 3   X = 4   X = 5   X = 6  |  c_j
Y = tail        3       4       2       5       1       5    |   20
Y = head        2       2       4       2       4       1    |   15
c_i             5       6       6       7       5       6    |  N = 35

Probability:               p(X = x_i) = c_i / N
Joint probability:         p(X = x_i, Y = y_j) = n_ij / N
Conditional probability:   p(Y = y_j | X = x_i) = n_ij / c_i

Sum rule:
p(X = x_i) = Σ_{j=1}^{L} p(X = x_i, Y = y_j)    ⇒    p(X) = Σ_Y p(X, Y)

Product rule:
p(X = x_i, Y = y_j) = n_ij / N = (n_ij / c_i)(c_i / N) = p(Y = y_j | X = x_i) p(X = x_i)
⇒    p(X, Y) = p(Y | X) p(X)
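A minimal sketch (assuming NumPy) that rebuilds the count table above and checks the marginal, conditional, sum-rule, and product-rule relations numerically:

```python
import numpy as np

# n_ij counts from the table: rows are Y (tail, head), columns are X = 1..6
counts = np.array([[3, 4, 2, 5, 1, 5],     # Y = tail
                   [2, 2, 4, 2, 4, 1]])    # Y = head
N = counts.sum()                           # 35

p_joint = counts / N                       # p(X = x_i, Y = y_j) = n_ij / N
p_x = counts.sum(axis=0) / N               # p(X = x_i) = c_i / N  (sum rule)
p_y_given_x = counts / counts.sum(axis=0)  # p(Y = y_j | X = x_i) = n_ij / c_i

# Product rule: p(X, Y) = p(Y | X) p(X)
assert np.allclose(p_joint, p_y_given_x * p_x)
print(p_x)                                 # [5/35, 6/35, 6/35, 7/35, 5/35, 6/35]
```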
Conditional Independence

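As a hedged illustration (assuming NumPy; the distributions below are made-up toy numbers), conditional independence of X and Y given Z means p(X, Y | Z) = p(X | Z) p(Y | Z) for every value of Z:

```python
import numpy as np

# Toy joint p(z, x, y) built so that X and Y are conditionally independent given Z:
# p(z, x, y) = p(z) p(x | z) p(y | z)
p_z = np.array([0.4, 0.6])
p_x_given_z = np.array([[0.2, 0.8],      # p(x | z = 0)
                        [0.7, 0.3]])     # p(x | z = 1)
p_y_given_z = np.array([[0.5, 0.5],
                        [0.1, 0.9]])
p_zxy = p_z[:, None, None] * p_x_given_z[:, :, None] * p_y_given_z[:, None, :]

# Check: for each z, p(X, Y | Z = z) factorizes as p(X | Z = z) p(Y | Z = z)
p_xy_given_z = p_zxy / p_zxy.sum(axis=(1, 2), keepdims=True)
assert np.allclose(p_xy_given_z, p_x_given_z[:, :, None] * p_y_given_z[:, None, :])
```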
Outline
• Probability Distributions
• Joint and Conditional Probability Distributions
• Bayes’ Rule
• Mean and Variance
• Properties of Gaussian Distribution
• Maximum Likelihood Estimation

Bayes’ Rule

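As a minimal numerical sketch (assuming NumPy), Bayes' rule p(Y | X) = p(X | Y) p(Y) / p(X) can be checked on the coin-and-die counts from the earlier example:

```python
import numpy as np

counts = np.array([[3, 4, 2, 5, 1, 5],     # Y = tail
                   [2, 2, 4, 2, 4, 1]])    # Y = head
N = counts.sum()

p_y = counts.sum(axis=1) / N                               # p(Y): [20/35, 15/35]
p_x = counts.sum(axis=0) / N                               # p(X)
p_x_given_y = counts / counts.sum(axis=1, keepdims=True)   # p(X | Y)

# Bayes' rule: p(Y | X) = p(X | Y) p(Y) / p(X)
p_y_given_x = p_x_given_y * p_y[:, None] / p_x

# Same answer obtained directly from the counts: n_ij / c_i
assert np.allclose(p_y_given_x, counts / counts.sum(axis=0))
```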
Outline
• Probability Distributions
• Joint and Conditional Probability Distributions
• Bayes’ Rule
• Mean and Variance
• Properties of Gaussian Distribution
• Maximum Likelihood Estimation

Mean and Variance

For Joint Distributions

Uncorrelated vs Independent RVs
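The distinction can be illustrated with a small simulation (assuming NumPy; the construction below is a standard textbook example, not taken from the slides): if X is standard normal and Y = X², the two variables are clearly dependent, yet their covariance is approximately zero, so uncorrelated does not imply independent:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = x ** 2                        # fully determined by x, hence dependent

print(np.cov(x, y)[0, 1])         # ~0: X and Y are uncorrelated
# But knowing x pins down y exactly, so they are not independent:
# p(Y | X = x) is a point mass at x**2, not the marginal of Y.
```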
Outline
• Probability Distributions
• Joint and Conditional Probability Distributions
• Bayes’ Rule
• Mean and Variance
• Properties of Gaussian Distribution
• Maximum Likelihood Estimation

Gaussian Distribution
f(x | μ, σ²) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²))

Probability versus likelihood
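A minimal sketch (assuming NumPy; the observed value is made up) of the same formula read two ways: as a density over x with μ and σ² fixed, and as a likelihood over μ with the observation fixed:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """f(x | mu, sigma^2) = exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)"""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Density view: fix parameters, vary x (integrates to ~1 over x)
xs = np.linspace(-5, 5, 1001)
density = gaussian_pdf(xs, mu=0.0, sigma2=1.0)
print((density * (xs[1] - xs[0])).sum())   # ~1.0

# Likelihood view: fix an observation, vary mu
# (values are likelihoods of the parameter, not probabilities of mu)
x_obs = 1.3                                # assumed observed value
mus = np.linspace(-5, 5, 1001)
likelihood = gaussian_pdf(x_obs, mus, sigma2=1.0)
print(mus[np.argmax(likelihood)])          # maximized at mu = x_obs
```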


Multivariate Gaussian Distribution

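A minimal sketch (assuming SciPy; the mean and covariance below are assumed example values) evaluating the standard d-dimensional Gaussian density f(x | μ, Σ) = (2π)^(−d/2) |Σ|^(−1/2) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ)):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])                  # assumed mean vector
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])             # assumed covariance matrix (symmetric, PSD)

x = np.array([0.3, 0.8])
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))

# Same value computed from the formula directly
d = len(mu)
diff = x - mu
quad_form = diff @ np.linalg.inv(Sigma) @ diff
norm_const = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
print(norm_const * np.exp(-0.5 * quad_form))
```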
Properties of Gaussian Distribution

Central Limit Theorem
[Figure: probability mass function of a biased die, outcomes X = 1 to 6]

Let's say I am going to draw samples from this pmf, each of size n = 4:

S_1 = {1, 1, 1, 6}  ⇒  E[S_1] = 2.25
S_2 = {1, 1, 3, 6}  ⇒  E[S_2] = 2.75
…
S_m = {1, 4, 6, 6}  ⇒  E[S_m] = 4.25

According to the CLT, the distribution of these sample means will follow a bell curve (normal distribution).

[Figure: histogram of the sample means, approximately normal]
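A small simulation sketch (assuming NumPy; the biased-die probabilities below are assumed example values, not the exact pmf from the figure) showing that the means of many size-n samples pile up around the true mean with an approximately normal shape:

```python
import numpy as np

rng = np.random.default_rng(0)

faces = np.array([1, 2, 3, 4, 5, 6])
probs = np.array([0.35, 0.25, 0.15, 0.10, 0.10, 0.05])   # assumed biased pmf (sums to 1)
n, m = 4, 100_000                                         # sample size and number of samples

samples = rng.choice(faces, size=(m, n), p=probs)         # m samples S_1 ... S_m, each of size n
sample_means = samples.mean(axis=1)                       # E[S_1], E[S_2], ..., E[S_m]

true_mean = (faces * probs).sum()
print(true_mean, sample_means.mean(), sample_means.std())
# A histogram of sample_means is approximately bell-shaped around true_mean,
# and it narrows as n grows (CLT).
```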
Outline
• Probability Distributions
• Joint and Conditional Probability Distributions
• Bayes’ Rule
• Mean and Variance
• Properties of Gaussian Distribution
• Maximum Likelihood Estimation

Maximum Likelihood Estimation

Main assumption:
Independent and identically distributed (i.i.d.) random variables

Maximum Likelihood Estimation
For a Bernoulli distribution (i.e., a coin flip):

Objective function: P(x_i | θ) = θ^(x_i) (1 − θ)^(1 − x_i),   x_i ∈ {0, 1} or {head, tail}

L(θ | X) = L(θ | X = x_1, X = x_2, X = x_3, …, X = x_n)

With the i.i.d. assumption:

L(θ | X) = ∏_{i=1}^{n} P(x_i | θ) = ∏_{i=1}^{n} θ^(x_i) (1 − θ)^(1 − x_i)

L(θ | X) = θ^(x_1) (1 − θ)^(1 − x_1) × θ^(x_2) (1 − θ)^(1 − x_2) × … × θ^(x_n) (1 − θ)^(1 − x_n)
         = θ^(Σ x_i) (1 − θ)^(Σ (1 − x_i))

We don't like multiplication; let's convert it into a summation.

What's the trick? Take the log:

L(θ | X) = θ^(Σ x_i) (1 − θ)^(Σ (1 − x_i))

log L(θ | X) = ℓ(θ | X) = log(θ) Σ_{i=1}^{n} x_i + log(1 − θ) Σ_{i=1}^{n} (1 − x_i)

How to optimize θ? Set the derivative to zero:

∂ℓ(θ | X) / ∂θ = 0   ⇒   (Σ_{i=1}^{n} x_i) / θ − (Σ_{i=1}^{n} (1 − x_i)) / (1 − θ) = 0

θ = (1/n) Σ_{i=1}^{n} x_i
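A minimal sketch (assuming NumPy; the true θ = 0.7 is made up for the simulation) checking the result numerically: for simulated coin flips, the θ that maximizes the log-likelihood on a grid matches the sample mean (1/n) Σ x_i:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=1000)        # simulated flips with assumed true theta = 0.7

def log_likelihood(theta, x):
    # l(theta | X) = log(theta) * sum(x_i) + log(1 - theta) * sum(1 - x_i)
    return np.log(theta) * x.sum() + np.log(1 - theta) * (len(x) - x.sum())

thetas = np.linspace(0.001, 0.999, 999)
theta_hat = thetas[np.argmax([log_likelihood(t, x) for t in thetas])]

print(theta_hat, x.mean())                 # grid maximizer ~ sample mean ~ 0.7
```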
