
CS 725: Foundations of Machine Learning

Lecture 2. Overview of probability theory for ML

June 2019

1
Probability is Quantification of Uncertainty

• We are trying to build systems that understand and (possibly) interact with
the real world
• We often can not prove something is true, but we can still ask how likely
different outcomes are or ask for the most likely explanation

Probability theory is nothing but common sense reduced to calculation. —


Pierre Laplace, 1812

2
Random Variable and Sample Space

• A random variable X represents the outcome or the state of the world and
takes values from a sample space or domain
• Sample space: the space of all possible outcomes
1. Can be continuous (Example: how much it will rain tomorrow)
2. Or discrete (Example: the results of tossing a pair of coins; S = {HH, HT, TH, TT})
• Pr(x) is the probability mass (density) function
• Assigns a number to each point in sample space
• Non-negative, sums (integrates) to 1
• Intuitively: how often does x occur, how much do we believe in x.

3
Events

• Event (E): An event is any subset of the sample space, i.e., E ⊆ S.

Often we describe events using “conditions.” E.g., the event that the two
coin tosses are the same, Esame = {HH, TT }.
Events can be combined using logical operations on these conditions: E.g.,
“the two coin tosses are the same or the second one is T ”:
E = {HH, TT } ∪ {HT , TT } = {HH, TT , HT }.
• Probability of an Event: The total weight assigned to all the elements in the
event by a given probability distribution:

Pr(E) = Σ_{x ∈ E} p(x)
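The two-coin example above can be checked numerically. A minimal sketch, assuming fair coins so that every outcome has probability 1/4:

```python
from fractions import Fraction

# Sample space of two fair coin tosses; each outcome has probability 1/4.
p = {"HH": Fraction(1, 4), "HT": Fraction(1, 4),
     "TH": Fraction(1, 4), "TT": Fraction(1, 4)}

def prob(event):
    """Pr(E) = sum of p(x) over all outcomes x in the event E."""
    return sum(p[x] for x in event)

E_same = {"HH", "TT"}                 # both tosses are the same
E_or   = {"HH", "TT"} | {"HT", "TT"}  # "same, or the second is T"

print(prob(E_same))  # 1/2
print(prob(E_or))    # 3/4
```

Using exact `Fraction` arithmetic avoids floating-point rounding in these small examples.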

4
A review of probability theory

• Note:
• Pr(S) = 1 and Pr(∅) = 0
• Pr(E^c) = 1 − Pr(E), where E^c = S \ E is the complement of E
• Pr(E1 ∪ E2) = Pr(E1) + Pr(E2) − Pr(E1 ∩ E2)
• If E1, E2, . . . , En are pairwise disjoint events, then

Pr(⋃_{i=1}^{n} Ei) = Σ_{i=1}^{n} Pr(Ei)

5
Distribution Functions for discrete data

Suppose X is a discrete random variable that takes m possible values.

6
Continuous Distributions

Let X be a continuous random variable, e.g. the amount of rain tomorrow,
taking values in [0, 100]. There are infinitely many values in this space, so we
cannot define Pr(X = x) as a distribution function the way we did in the
discrete case.

A probability density function (pdf) of a continuous random variable is
f : S → R+ such that for all D ⊆ S,

Pr(X ∈ D) = ∫_D f(x) dx

7
Cumulative Distribution Function

Suppose X is a continuous random variable which takes values from the sample
space R, and has a pdf f . Its cdf is defined as F : R → [0, 1]:
F(a) = Pr(X ≤ a) = ∫_{−∞}^{a} f(x) dx

Note: the pdf of a continuous random variable can be obtained by
differentiating its cdf:

f(a) = dF(x)/dx |_{x=a}
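This pdf–cdf relationship can be checked numerically. The sketch below uses an exponential distribution as an illustrative choice (not from the slides): a central-difference derivative of the cdf recovers the pdf.

```python
import math

lam = 2.0  # rate parameter of an exponential distribution (illustrative choice)

def pdf(x):
    # f(x) = lam * exp(-lam * x) for x >= 0
    return lam * math.exp(-lam * x)

def cdf(a):
    # F(a) = 1 - exp(-lam * a) for a >= 0
    return 1.0 - math.exp(-lam * a)

# The pdf is the derivative of the cdf: f(a) = dF/dx at x = a.
a, h = 0.7, 1e-6
numeric_pdf = (cdf(a + h) - cdf(a - h)) / (2 * h)  # central difference
print(abs(numeric_pdf - pdf(a)) < 1e-5)  # True
```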
8
Multiple Random Variables

• Two random variables, X , Y , e.g.


1. X = the age of a person and Y = his/her height
2. Y = rainfall today and X = rainfall tomorrow
3. X = the word in frame 1 of an audio clip and Y = the word in frame 2
• Joint distribution gives probability of two outcomes happening together
For discrete: Pr(X = x, Y = y ).
E.g., S1 = S2 = {H, T } and
Pr((X , Y ) = (H, H)) = 1/3, Pr((X , Y ) = (H, T )) = 1/3,
Pr((X , Y ) = (T , T )) = 1/6, Pr((X , Y ) = (T , H)) = 1/6.

9
Multiple Random Variables (cont)

• For continuous: if f(x, y) is a joint pdf, then

F(a, b) = Pr(X ≤ a, Y ≤ b) = ∫_{−∞}^{b} ∫_{−∞}^{a} f(x, y) dx dy

f(a, b) = ∂²F(x, y)/∂x∂y |_{(a,b)}
• Marginal Distribution: For x ∈ S1, Pr(X = x) = Σ_{y ∈ S2} Pr((X, Y) = (x, y)).
The marginal distribution of Y is defined similarly.
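The marginals of the two-coin joint distribution from the previous slide can be computed by summing out the other variable:

```python
from fractions import Fraction

# Joint distribution from the earlier slide: Pr((X, Y) = (x, y)).
joint = {("H", "H"): Fraction(1, 3), ("H", "T"): Fraction(1, 3),
         ("T", "H"): Fraction(1, 6), ("T", "T"): Fraction(1, 6)}

def marginal_X(x):
    # Pr(X = x) = sum over y of Pr((X, Y) = (x, y))
    return sum(p for (xv, yv), p in joint.items() if xv == x)

def marginal_Y(y):
    # Pr(Y = y) = sum over x of Pr((X, Y) = (x, y))
    return sum(p for (xv, yv), p in joint.items() if yv == y)

print(marginal_X("H"))  # 2/3
print(marginal_Y("H"))  # 1/2
```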

10
Conditional Probability

• Bayes’ Rule, named after Thomas Bayes (1701-1761):

Pr(Y|X) = Pr(X|Y) Pr(Y) / Pr(X),

provided Pr(Y), Pr(X) > 0.

In pmf/pdf notation:

p(x|y) = p(y|x) p(x) / p(y) = p(y|x) p(x) / Σ_{x'} p(y|x') p(x')

• This gives a way of reversing conditional probabilities

11
Using Bayes’ Theorem

A lab test has a probability 0.95 of detecting a disease when applied to a person
suffering from said disease, and a probability 0.10 of giving a false positive when
applied to a non-sufferer. If 0.5% of the population are sufferers, what is the
probability of a person suffering from the disease if the test is positive?
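A sketch of the computation by Bayes' rule (the answer comes out to roughly 4.6%, i.e., at this prevalence most positives are false positives):

```python
p_pos_given_d = 0.95   # Pr(+ | disease): sensitivity of the test
p_pos_given_nd = 0.10  # Pr(+ | no disease): false-positive rate
p_d = 0.005            # Pr(disease): prevalence of 0.5%

# Law of total probability: Pr(+) = Pr(+|D)Pr(D) + Pr(+|not D)Pr(not D)
p_pos = p_pos_given_d * p_d + p_pos_given_nd * (1 - p_d)

# Bayes' rule: Pr(D | +) = Pr(+ | D) Pr(D) / Pr(+)
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 4))  # 0.0456
```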

12
Independence of Random Variables

• Notation: Pr((X , Y ) = (x, y )) abbreviated as p(x, y ), Pr(X = x|Y = y ) as


p(x|y ) etc.
• X, Y are said to be independent (X ⊥⊥ Y) iff for all x, y, the events X = x
and Y = y are independent, i.e., p(x, y) = p(x)p(y).
• X , Y are said to be conditionally independent given Z iff for all x, y , z
(with p(z) > 0), p(x, y |z) = p(x|z)p(y |z).

13
Expectation

If X is a random variable taking (say) real values (i.e., S ⊆ R), we can define an
“expected value” for X as:

E(X) = Σ_{x ∈ S} x Pr(X = x)

For continuous:

E[X] = ∫_{−∞}^{∞} x f(x) dx

Question: Suppose g : S → R and Y is the random variable jointly distributed
with X, defined as Y = g(X). Then what is E(Y)?

E(Y) = Σ_{x ∈ S} g(x) Pr(X = x)
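A small check with a fair six-sided die (an illustrative choice, not from the slides), computing E(X) and E(g(X)) for g(x) = x²:

```python
from fractions import Fraction

# Fair six-sided die: Pr(X = x) = 1/6 for x in 1..6.
p = {x: Fraction(1, 6) for x in range(1, 7)}

E_X = sum(x * px for x, px in p.items())     # E(X) = sum x Pr(X = x)
E_g = sum(x**2 * px for x, px in p.items())  # E(g(X)) = sum g(x) Pr(X = x)

print(E_X)  # 7/2
print(E_g)  # 91/6
```

Note that E(g(X)) needs no separate distribution for Y = g(X); it is computed directly from the distribution of X.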
14
Variance

The variance of a random variable X with E [X ] = µ is defined as:

Var [X ] = E [(X − µ)2 ]

• Var[X] = E[X²] − (E[X])²
• Var[X + β] = Var[X] and Var[αX] = α² Var[X]
• If X1, · · · , Xn are pairwise independent, then Var[Σ_i Xi] = Σ_i Var[Xi]
(Proof: HW)
• If X1, · · · , Xn are pairwise independent, each with variance σ², then
Var[(1/n) Σ_i Xi] = σ²/n
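The σ²/n behaviour of the sample mean can be checked by simulation. A sketch, assuming Gaussian draws with σ² = 4 and n = 10 (illustrative choices):

```python
import random
import statistics

random.seed(0)
sigma2 = 4.0   # variance of each X_i
n = 10         # number of variables averaged
trials = 20000

# Draw many independent sample means and estimate their variance.
means = [statistics.fmean(random.gauss(0.0, sigma2 ** 0.5) for _ in range(n))
         for _ in range(trials)]
est = statistics.variance(means)

# Should be close to sigma^2 / n = 0.4
print(abs(est - sigma2 / n) < 0.05)  # True
```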

15
Covariance

For random variables X and Y, covariance is defined as:

Cov [X , Y ] = E [(X − E (X ))(Y − E (Y ))] = E [XY ] − E [X ]E [Y ]

If X and Y are independent then their covariance is 0, since in that case

E [XY ] = E [X ]E [Y ]
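Using the two-coin joint distribution from an earlier slide (encoding H as 1 and T as 0), the covariance works out to 0; in fact that particular joint distribution factorizes into its marginals:

```python
from fractions import Fraction

# Joint distribution from the earlier slide, with H -> 1 and T -> 0.
joint = {(1, 1): Fraction(1, 3), (1, 0): Fraction(1, 3),
         (0, 1): Fraction(1, 6), (0, 0): Fraction(1, 6)}

E_X  = sum(x * p for (x, y), p in joint.items())
E_Y  = sum(y * p for (x, y), p in joint.items())
E_XY = sum(x * y * p for (x, y), p in joint.items())

# Cov[X, Y] = E[XY] - E[X]E[Y]
cov = E_XY - E_X * E_Y
print(cov)  # 0
```

Keep in mind the converse of the slide's statement fails in general: zero covariance does not imply independence.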

16
Important Discrete Random Variables

Bernoulli Random Variable: It is a discrete random variable taking values in {0, 1}.

Say Pr[X = 0] = 1 − q, where q ∈ [0, 1]. Then Pr[X = 1] = q.

• E [X ] = (1 − q) × 0 + q × 1 = q
• Var [X ] = q − q 2 = q(1 − q)

Note: q represents the probability of success in a random experiment.

For example, a coin-toss experiment has Pr[Head] = Pr[X = 1] = q.
17
Binomial Random Variable: It is a discrete random variable whose distribution is
that of the number of 1's in a series of n experiments, each with a {0, 1}-valued
outcome, where the probability that a particular experiment yields 1 is q.

A binomial random variable is the sum of n repeated Bernoulli trials, e.g. the
number of heads when a coin is tossed n times.

1. Pr[X = k] = C(n, k) q^k (1 − q)^{n−k}
2. E[X] = Σ_i E[Yi], where each Yi is a Bernoulli random variable ⇒ E[X] = nq
3. Var[X] = Σ_i Var[Yi] (since the Yi's are independent) ⇒ Var[X] = nq(1 − q)
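These formulas can be verified directly from the pmf; n = 10 and q = 0.3 below are illustrative choices:

```python
from math import comb

n, q = 10, 0.3

def pmf(k):
    # Pr[X = k] = C(n, k) q^k (1 - q)^(n - k)
    return comb(n, k) * q**k * (1 - q)**(n - k)

total = sum(pmf(k) for k in range(n + 1))
mean  = sum(k * pmf(k) for k in range(n + 1))
var   = sum(k**2 * pmf(k) for k in range(n + 1)) - mean**2

print(abs(total - 1) < 1e-12)              # pmf sums to 1
print(abs(mean - n * q) < 1e-12)           # E[X] = nq = 3
print(abs(var - n * q * (1 - q)) < 1e-9)   # Var[X] = nq(1-q) = 2.1
```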

18
Normal (Gaussian) Distribution

• It is a popular continuous distribution:

N(x|µ, σ²) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²))

• µ is the mean and σ² is the variance

• Exercise: Verify the mean and variance, e.g. check that

E[X] = ∫_{−∞}^{∞} x (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)) dx = µ
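The mean can also be verified numerically with a midpoint-rule integral (µ and σ below are illustrative choices):

```python
import math

mu, sigma = 1.5, 2.0  # illustrative parameters

def gauss_pdf(x):
    # N(x | mu, sigma^2)
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# Midpoint-rule approximation of E[X] = integral of x f(x) dx,
# truncated to mu +/- 10 sigma where the tails are negligible.
lo, hi, steps = mu - 10 * sigma, mu + 10 * sigma, 200000
dx = (hi - lo) / steps
E_X = sum((lo + (i + 0.5) * dx) * gauss_pdf(lo + (i + 0.5) * dx)
          for i in range(steps)) * dx

print(abs(E_X - mu) < 1e-6)  # True: the integral recovers the mean
```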

19
1-D Gaussian distribution
[Figure: pdf curves φ_{µ,σ²}(x) of 1-D Gaussians for (µ = 0, σ² = 0.2),
(µ = 0, σ² = 1.0), (µ = 0, σ² = 5.0), and (µ = −2, σ² = 0.5),
plotted over x ∈ [−5, 5]]
20
Normal (Gaussian) Distribution


• Multivariate Gaussian (d-dim):

f(x|µ, Σ) = (2π)^{−d/2} |Σ|^{−1/2} exp{−(1/2)(x − µ)^T Σ^{−1} (x − µ)}

• x is now a vector, µ is the mean vector, and Σ is the covariance matrix
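For d = 2 the density can be evaluated with the 2×2 matrix inverse written out by hand. A sketch with illustrative parameters (not from the slides):

```python
import math

mu = (0.0, 0.0)
Sigma = ((2.0, 0.6),
         (0.6, 1.0))  # symmetric positive-definite covariance matrix

def mvn_pdf(x):
    d = 2
    det = Sigma[0][0] * Sigma[1][1] - Sigma[0][1] * Sigma[1][0]
    # Closed-form inverse of a 2x2 matrix.
    inv = (( Sigma[1][1] / det, -Sigma[0][1] / det),
           (-Sigma[1][0] / det,  Sigma[0][0] / det))
    dx = (x[0] - mu[0], x[1] - mu[1])
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    quad = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return (2 * math.pi) ** (-d / 2) * det ** -0.5 * math.exp(-0.5 * quad)

# At the mean the exponent vanishes, so the density equals the
# normalizing constant (2 pi)^{-1} |Sigma|^{-1/2}.
det = Sigma[0][0] * Sigma[1][1] - Sigma[0][1] * Sigma[1][0]
print(abs(mvn_pdf(mu) - 1 / (2 * math.pi * math.sqrt(det))) < 1e-12)  # True
```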

21
2-D Gaussian distribution

[Figure: surface plot of a 2-D Gaussian density over x, y ∈ [−3, 3]]
22
Properties of Normal Distribution

• All marginals of a Gaussian are again Gaussian
• Any conditional of a Gaussian is Gaussian
• The product of two Gaussian densities is, up to normalization, again Gaussian
• The sum of two independent Gaussian random variables is again Gaussian

Note: Many of the standard distributions belong to the exponential family of
distributions:

• Bernoulli, binomial/multinomial, Poisson, Normal (Gaussian), Beta/Dirichlet, . . .
• These share many important properties, e.g. they have conjugate priors.
23
