
CS 725: Foundations of Machine Learning

Lecture 2. Overview of probability theory for ML

June 2019

1
Probability is Quantification of Uncertainty

• We are trying to build systems that understand and (possibly) interact with
the real world
• We often can not prove something is true, but we can still ask how likely
different outcomes are or ask for the most likely explanation

Probability theory is nothing but common sense reduced to calculation. —


Pierre Laplace, 1812

2
Random Variable and Sample Space

• A random variable X represents the outcome or the state of the world and
takes values from a sample space or domain
• Sample space: the space of all possible outcomes
1. Can be continuous (Example: how much it will rain tomorrow)
2. Or discrete (Example: the results of tossing a pair of coins; S = {HH, HT, TH, TT})
• Pr(x) is the probability mass (density) function
• Assigns a number to each point in sample space
• Non-negative, sums (integrates) to 1
• Intuitively: how often does x occur, how much do we believe in x.

3
Events

• Event (E): An event is any subset of the sample space, i.e., E ⊆ S.

Often we describe events using “conditions.” E.g., the event that the two
coin tosses are the same, Esame = {HH, TT }.
Events can be combined using logical operations on these conditions: E.g.,
“the two coin tosses are the same or the second one is T ”:
E = {HH, TT } ∪ {HT , TT } = {HH, TT , HT }.
• Probability of an Event: The total weight assigned to all the elements in the
event by a given probability distribution:

Pr(E) = Σ_{x ∈ E} p(x)
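The two-coin example above can be checked numerically. A minimal sketch, assuming fair coins so that every outcome has probability 1/4:

```python
from fractions import Fraction

# Sample space of two fair coin tosses; each outcome has probability 1/4.
p = {"HH": Fraction(1, 4), "HT": Fraction(1, 4),
     "TH": Fraction(1, 4), "TT": Fraction(1, 4)}

def prob(event):
    """Pr(E) = sum of p(x) over all outcomes x in the event E."""
    return sum(p[x] for x in event)

E_same = {"HH", "TT"}                 # both tosses are the same
E_or   = {"HH", "TT"} | {"HT", "TT"}  # "same, or the second is T"

print(prob(E_same))  # 1/2
print(prob(E_or))    # 3/4
```

Using exact `Fraction` arithmetic avoids floating-point rounding in these small examples.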

4
A review of probability theory

• Note:
• Pr(S) = 1 and Pr(∅) = 0
• Pr(E^c) = 1 − Pr(E), where E^c = S \ E is the complement of E
• Pr(E1 ∪ E2) = Pr(E1) + Pr(E2) − Pr(E1 ∩ E2)
• If E1, E2, . . . , En are pairwise disjoint events, then

Pr(⋃_{i=1}^{n} Ei) = Σ_{i=1}^{n} Pr(Ei)

5
Distribution Functions for discrete data

Suppose X is a discrete random variable that takes m possible values.

6
Continuous Distributions

Let X be a continuous random variable, e.g. the amount of rain tomorrow,
taking values in [0, 100]. There are infinitely many values in this space, so we
cannot define Pr(X = x) as a distribution function the way we did in the
discrete case.

A probability density function (pdf) of a continuous random variable is
f : S → R+ such that for all D ⊆ S,

Pr(X ∈ D) = ∫_D f(x) dx

7
Cumulative Distribution Function

Suppose X is a continuous random variable which takes values from the sample
space R, and has a pdf f . Its cdf is defined as F : R → [0, 1]:
F(a) = Pr(X ≤ a) = ∫_{−∞}^{a} f(x) dx

Note: the pdf of a continuous random variable can be obtained by
differentiating its cdf:

f(a) = dF(x)/dx |_{x=a}
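This pdf–cdf relationship can be checked numerically. The sketch below uses an exponential distribution as an illustrative choice (not from the slides): a central-difference derivative of the cdf recovers the pdf.

```python
import math

lam = 2.0  # rate parameter of an exponential distribution (illustrative choice)

def pdf(x):
    # f(x) = lam * exp(-lam * x) for x >= 0
    return lam * math.exp(-lam * x)

def cdf(a):
    # F(a) = 1 - exp(-lam * a) for a >= 0
    return 1.0 - math.exp(-lam * a)

# The pdf is the derivative of the cdf: f(a) = dF/dx at x = a.
a, h = 0.7, 1e-6
numeric_pdf = (cdf(a + h) - cdf(a - h)) / (2 * h)  # central difference
print(abs(numeric_pdf - pdf(a)) < 1e-5)  # True
```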
8
Multiple Random Variables

• Two random variables, X , Y , e.g.


1. X = the age of a person and Y = his/her height
2. Y = rainfall today and X = rainfall tomorrow
3. X = the word in frame 1 of an audio clip and Y = the word in frame 2
• Joint distribution gives probability of two outcomes happening together
For discrete: Pr(X = x, Y = y ).
E.g., S1 = S2 = {H, T } and
Pr((X , Y ) = (H, H)) = 1/3, Pr((X , Y ) = (H, T )) = 1/3,
Pr((X , Y ) = (T , T )) = 1/6, Pr((X , Y ) = (T , H)) = 1/6.

9
Multiple Random Variables (cont)

• For continuous: if f(x, y) is a joint pdf, then

F(a, b) = Pr(X ≤ a, Y ≤ b) = ∫_{−∞}^{b} ∫_{−∞}^{a} f(x, y) dx dy

f(a, b) = ∂²F(x, y)/∂x∂y |_{(a,b)}
• Marginal Distribution: For x ∈ S1, Pr(X = x) = Σ_{y ∈ S2} Pr((X, Y) = (x, y)).
The marginal distribution of Y is defined similarly.
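The marginals of the two-coin joint distribution from the previous slide can be computed by summing out the other variable:

```python
from fractions import Fraction

# Joint distribution from the earlier slide: Pr((X, Y) = (x, y)).
joint = {("H", "H"): Fraction(1, 3), ("H", "T"): Fraction(1, 3),
         ("T", "H"): Fraction(1, 6), ("T", "T"): Fraction(1, 6)}

def marginal_X(x):
    # Pr(X = x) = sum over y of Pr((X, Y) = (x, y))
    return sum(p for (xv, yv), p in joint.items() if xv == x)

def marginal_Y(y):
    # Pr(Y = y) = sum over x of Pr((X, Y) = (x, y))
    return sum(p for (xv, yv), p in joint.items() if yv == y)

print(marginal_X("H"))  # 2/3
print(marginal_Y("H"))  # 1/2
```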

10
Conditional Probability

• Bayes’ Rule, named after Thomas Bayes (1701-1761):

Pr(Y|X) = Pr(X|Y) Pr(Y) / Pr(X),

provided Pr(Y), Pr(X) > 0.

In pmf/pdf notation:

p(x|y) = p(y|x) p(x) / p(y) = p(y|x) p(x) / Σ_{x'} p(y|x') p(x')

• This gives a way of reversing conditional probabilities

11
Using Bayes’ Theorem

A lab test has a probability 0.95 of detecting a disease when applied to a person
suffering from said disease, and a probability 0.10 of giving a false positive when
applied to a non-sufferer. If 0.5% of the population are sufferers, what is the
probability of a person suffering from the disease if the test is positive?
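A sketch of the computation by Bayes' rule (the answer comes out to roughly 4.6%, i.e., at this prevalence most positives are false positives):

```python
p_pos_given_d = 0.95   # Pr(+ | disease): sensitivity of the test
p_pos_given_nd = 0.10  # Pr(+ | no disease): false-positive rate
p_d = 0.005            # Pr(disease): prevalence of 0.5%

# Law of total probability: Pr(+) = Pr(+|D)Pr(D) + Pr(+|not D)Pr(not D)
p_pos = p_pos_given_d * p_d + p_pos_given_nd * (1 - p_d)

# Bayes' rule: Pr(D | +) = Pr(+ | D) Pr(D) / Pr(+)
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 4))  # 0.0456
```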

12
Independence of Random Variables

• Notation: Pr((X , Y ) = (x, y )) abbreviated as p(x, y ), Pr(X = x|Y = y ) as


p(x|y ) etc.
• X, Y are said to be independent (X ⊥⊥ Y) iff for all x, y, the events X = x
and Y = y are independent, i.e., p(x, y) = p(x)p(y).
• X , Y are said to be conditionally independent given Z iff for all x, y , z
(with p(z) > 0), p(x, y |z) = p(x|z)p(y |z).

13
Expectation

If X is a random variable taking (say) real values (i.e., S ⊆ R), we can define an
“expected value” for X as:

E(X) = Σ_{x ∈ S} x Pr(X = x)

For continuous:

E[X] = ∫_{−∞}^{∞} x f(x) dx

Question: Suppose g : S → R and Y is the random variable jointly distributed
with X, defined as Y = g(X). Then what is E(Y)?

E(Y) = Σ_{x ∈ S} g(x) Pr(X = x)
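A small check with a fair six-sided die (an illustrative choice, not from the slides), computing E(X) and E(g(X)) for g(x) = x²:

```python
from fractions import Fraction

# Fair six-sided die: Pr(X = x) = 1/6 for x in 1..6.
p = {x: Fraction(1, 6) for x in range(1, 7)}

E_X = sum(x * px for x, px in p.items())     # E(X) = sum x Pr(X = x)
E_g = sum(x**2 * px for x, px in p.items())  # E(g(X)) = sum g(x) Pr(X = x)

print(E_X)  # 7/2
print(E_g)  # 91/6
```

Note that E(g(X)) needs no separate distribution for Y = g(X); it is computed directly from the distribution of X.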
14
Variance

The variance of a random variable X with E [X ] = µ is defined as:

Var [X ] = E [(X − µ)2 ]

• Var[X] = E[X²] − (E[X])²
• Var[X + β] = Var[X] and Var[αX] = α² Var[X]
• If X1, · · · , Xn are pairwise independent, then Var[Σ_i Xi] = Σ_i Var[Xi]
(Proof: HW)
• If X1, · · · , Xn are pairwise independent, each with variance σ², then
Var[(1/n) Σ_i Xi] = σ²/n
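The σ²/n behaviour of the sample mean can be checked by simulation. A sketch, assuming Gaussian draws with σ² = 4 and n = 10 (illustrative choices):

```python
import random
import statistics

random.seed(0)
sigma2 = 4.0   # variance of each X_i
n = 10         # number of variables averaged
trials = 20000

# Draw many independent sample means and estimate their variance.
means = [statistics.fmean(random.gauss(0.0, sigma2 ** 0.5) for _ in range(n))
         for _ in range(trials)]
est = statistics.variance(means)

# Should be close to sigma^2 / n = 0.4
print(abs(est - sigma2 / n) < 0.05)  # True
```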

15
Covariance

For random variables X and Y, covariance is defined as:

Cov [X , Y ] = E [(X − E (X ))(Y − E (Y ))] = E [XY ] − E [X ]E [Y ]

If X and Y are independent then their covariance is 0, since in that case

E [XY ] = E [X ]E [Y ]
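Using the two-coin joint distribution from an earlier slide (encoding H as 1 and T as 0), the covariance works out to 0; in fact that particular joint distribution factorizes into its marginals:

```python
from fractions import Fraction

# Joint distribution from the earlier slide, with H -> 1 and T -> 0.
joint = {(1, 1): Fraction(1, 3), (1, 0): Fraction(1, 3),
         (0, 1): Fraction(1, 6), (0, 0): Fraction(1, 6)}

E_X  = sum(x * p for (x, y), p in joint.items())
E_Y  = sum(y * p for (x, y), p in joint.items())
E_XY = sum(x * y * p for (x, y), p in joint.items())

# Cov[X, Y] = E[XY] - E[X]E[Y]
cov = E_XY - E_X * E_Y
print(cov)  # 0
```

Keep in mind the converse of the slide's statement fails in general: zero covariance does not imply independence.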

16
Important Discrete Random Variables

Bernoulli Random Variable: It is a discrete random variable taking values in {0, 1}.

Say Pr[X = 0] = 1 − q, where q ∈ [0, 1]. Then Pr[X = 1] = q.

• E [X ] = (1 − q) × 0 + q × 1 = q
• Var [X ] = q − q 2 = q(1 − q)

Note: q represents the probability of success in a random experiment.

For example, a coin-toss experiment has Pr[Head] = Pr[X = 1] = q.
17
Binomial Random Variable: It is a discrete random variable whose distribution is
that of the number of 1's in a series of n experiments, each with a {0, 1}-valued
outcome, where the probability that a particular experiment yields 1 is q.

A binomial random variable is the sum of n repeated Bernoulli trials, e.g. the
number of heads when a coin is tossed n times.

1. Pr[X = k] = C(n, k) q^k (1 − q)^{n−k}
2. E[X] = Σ_i E[Yi], where each Yi is a Bernoulli random variable ⇒ E[X] = nq
3. Var[X] = Σ_i Var[Yi] (since the Yi's are independent) ⇒ Var[X] = nq(1 − q)
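These formulas can be verified directly from the pmf; n = 10 and q = 0.3 below are illustrative choices:

```python
from math import comb

n, q = 10, 0.3

def pmf(k):
    # Pr[X = k] = C(n, k) q^k (1 - q)^(n - k)
    return comb(n, k) * q**k * (1 - q)**(n - k)

total = sum(pmf(k) for k in range(n + 1))
mean  = sum(k * pmf(k) for k in range(n + 1))
var   = sum(k**2 * pmf(k) for k in range(n + 1)) - mean**2

print(abs(total - 1) < 1e-12)              # pmf sums to 1
print(abs(mean - n * q) < 1e-12)           # E[X] = nq = 3
print(abs(var - n * q * (1 - q)) < 1e-9)   # Var[X] = nq(1-q) = 2.1
```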

18
Normal (Gaussian) Distribution

• It is a popular continuous distribution:

N(x|µ, σ²) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²))

• µ is the mean and σ² is the variance

• Exercise: Verify the mean and variance, e.g. check that

E[X] = ∫_{−∞}^{∞} x (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)) dx = µ
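The mean can also be verified numerically with a midpoint-rule integral (µ and σ below are illustrative choices):

```python
import math

mu, sigma = 1.5, 2.0  # illustrative parameters

def gauss_pdf(x):
    # N(x | mu, sigma^2)
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# Midpoint-rule approximation of E[X] = integral of x f(x) dx,
# truncated to mu +/- 10 sigma where the tails are negligible.
lo, hi, steps = mu - 10 * sigma, mu + 10 * sigma, 200000
dx = (hi - lo) / steps
E_X = sum((lo + (i + 0.5) * dx) * gauss_pdf(lo + (i + 0.5) * dx)
          for i in range(steps)) * dx

print(abs(E_X - mu) < 1e-6)  # True: the integral recovers the mean
```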

19
1-D Gaussian distribution
[Figure: pdf curves φ_{µ,σ²}(x) of 1-D Gaussians for (µ = 0, σ² = 0.2),
(µ = 0, σ² = 1.0), (µ = 0, σ² = 5.0), and (µ = −2, σ² = 0.5),
plotted over x ∈ [−5, 5]]
20
Normal (Gaussian) Distribution


• Multivariate Gaussian (d-dim):

f(x|µ, Σ) = (2π)^{−d/2} |Σ|^{−1/2} exp{−(1/2)(x − µ)^T Σ^{−1} (x − µ)}

• x is now a vector, µ is the mean vector, and Σ is the covariance matrix
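For d = 2 the density can be evaluated with the 2×2 matrix inverse written out by hand. A sketch with illustrative parameters (not from the slides):

```python
import math

mu = (0.0, 0.0)
Sigma = ((2.0, 0.6),
         (0.6, 1.0))  # symmetric positive-definite covariance matrix

def mvn_pdf(x):
    d = 2
    det = Sigma[0][0] * Sigma[1][1] - Sigma[0][1] * Sigma[1][0]
    # Closed-form inverse of a 2x2 matrix.
    inv = (( Sigma[1][1] / det, -Sigma[0][1] / det),
           (-Sigma[1][0] / det,  Sigma[0][0] / det))
    dx = (x[0] - mu[0], x[1] - mu[1])
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    quad = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return (2 * math.pi) ** (-d / 2) * det ** -0.5 * math.exp(-0.5 * quad)

# At the mean the exponent vanishes, so the density equals the
# normalizing constant (2 pi)^{-1} |Sigma|^{-1/2}.
det = Sigma[0][0] * Sigma[1][1] - Sigma[0][1] * Sigma[1][0]
print(abs(mvn_pdf(mu) - 1 / (2 * math.pi * math.sqrt(det))) < 1e-12)  # True
```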

21
2-D Gaussian distribution

[Figure: surface plot of a 2-D Gaussian density over x, y ∈ [−3, 3]]
22
Properties of Normal Distribution

• All marginals of a Gaussian are again Gaussian
• Any conditional of a Gaussian is Gaussian
• The product of two Gaussian densities is, up to normalization, again Gaussian
• The sum of two independent Gaussian random variables is again Gaussian

Note: Many of the standard distributions belong to the exponential family of
distributions:

• Bernoulli, binomial/multinomial, Poisson, Normal (Gaussian), Beta/Dirichlet, . . .
• These share many important properties, e.g. they have conjugate priors.
23
