
SiTE

AAiT
AAU

Course Title: Machine Learning


Credit Hour: 3
Instructor: Fantahun B. (PhD)  meetfantaai@gmail.com
Office: NB #

Ch-1 Probabilistic distributions


Nov-2022, AA
1. Probabilistic Distributions



Probability Distribution General
• Discrete
• Continuous



Probability Distribution General
Discrete Probability Distribution
• A discrete probability distribution function has two
characteristics:
• Each probability is between zero and one, inclusive.
• The sum of the probabilities is one.
• The expected value is often referred to as the "long-term"
average or mean.
• Karl Pearson once tossed a fair coin 24,000 times! He recorded
the results of each toss, obtaining heads 12,012 times.
• The Law of Large Numbers states that, as the number of trials in
a probability experiment increases, the difference between
the theoretical probability of an event and the relative
frequency approaches zero (the theoretical probability and
the relative frequency get closer and closer together).
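• As an illustration (not part of the original slides), a minimal Python sketch of the Law of Large Numbers: the relative frequency of heads approaches the theoretical probability 0.5 as the number of tosses grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate fair-coin tosses and track the relative frequency of heads.
for n in [100, 1_000, 24_000]:
    tosses = rng.integers(0, 2, size=n)   # 1 = heads, 0 = tails
    rel_freq = tosses.mean()              # relative frequency of heads
    print(f"N={n:6d}  relative frequency = {rel_freq:.4f}  "
          f"|difference from 0.5| = {abs(rel_freq - 0.5):.4f}")
```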
Probability Distribution General
Continuous Probability Distribution
• The graph of a continuous probability distribution is a curve.
• Probability is represented by area under the curve.
• The curve is called the probability density function (pdf). We
use the symbol f(x) to represent the curve.
• f(x) is the function that corresponds to the graph; we use the
density function f(x) to draw the graph of the probability
distribution.
• Area under the curve is given by a different function called the
cumulative distribution function (cdf). The cumulative
distribution function is used to evaluate probability as area.



Probability Distribution General
Continuous Probability Distribution
• The outcomes are measured, not counted.
• The entire area under the curve and above the x-axis is equal to one.
• Probability is found for intervals of x values rather than for individual x
values.
• P(c<x<d) is the probability that the random variable X is in the interval
between the values c and d.
• P(c<x<d) is the area under the curve, above the x-axis, to the right of c
and to the left of d.
• P(x=c)=0. The probability that x takes on any single individual value is
zero.
• P(c<x<d) is the same as P(c≤x≤d) because probability is equal to area.
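• For instance (an illustrative sketch, not from the slides, assuming a standard normal density and scipy.stats), the area P(c < x < d) can be evaluated from the cdf:

```python
from scipy.stats import norm

# P(c < X < d) for a standard normal X, computed as F(d) - F(c)
c, d = -1.0, 2.0
prob = norm.cdf(d) - norm.cdf(c)
print(f"P({c} < X < {d}) = {prob:.4f}")   # ~0.8186

# P(X = c) is zero for a continuous variable: the area over a single point vanishes.
```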



Binary Variables: Frequentist’s Way
Bernoulli Distribution
• Consider a single binary random variable x ∈ {0, 1}.
 E.g., x might describe the outcome of flipping a coin, with x = 1 representing
‘heads’ and x = 0 representing ‘tails’.
 For a fair coin, p(head) = p(tail) = 0.5.
• We can imagine that this is a damaged coin so that the probability
of landing heads is not necessarily the same as that of landing tails.
• The probability of x = 1 will be denoted by the parameter μ so that

p(x = 1|μ) = μ (2.1)

where 0 ≤ μ ≤ 1, from which it follows that
p(x = 0|μ) = 1 − μ.
Binary Variables: Frequentist’s way
• The probability distribution over x can be written in the form
Bern(x|μ) = μ^x (1 − μ)^(1−x) (2.2)
which is known as the Bernoulli distribution.
• The maximum likelihood estimate for μ (the estimated mean) is:
μ_ML = m/N (2.8)
with m = (# observations of x = 1).
• Yet this can lead to overfitting (especially for small N),
 Suppose we flip a coin, say, 3 times and happen to observe 3
heads. Then N = m = 3 and μML = 1.
 In this case, the maximum likelihood result would predict that all
future observations should give heads. This is unreasonable.
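• A minimal sketch (assuming NumPy; not from the original slides) of the maximum likelihood estimate μ_ML = m/N and its overfitting for small N:

```python
import numpy as np

def bernoulli_mle(x):
    """Maximum likelihood estimate of mu for binary observations x."""
    x = np.asarray(x)
    return x.sum() / len(x)              # mu_ML = m / N

# Three flips, all heads: the ML estimate says heads is certain.
print(bernoulli_mle([1, 1, 1]))          # 1.0 -> predicts all future flips are heads

# With more data from a fair coin, the estimate approaches the true value 0.5.
rng = np.random.default_rng(0)
print(bernoulli_mle(rng.integers(0, 2, size=10_000)))   # ~0.5
```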
Binary Variables: Bayesian way
• We can also work out the distribution of the number m of
observations of x = 1, given that the data set has size N.
• This is called the binomial distribution, and it is proportional
to μ^m (1 − μ)^(N−m).
• In order to obtain the normalization coefficient we note
that out of N coin flips, we have to add up all of the
possible ways of obtaining m heads, so that the binomial
distribution can be written

Bin(m|N, μ) = (N choose m) μ^m (1 − μ)^(N−m) (2.9)

where

(N choose m) = N! / ((N − m)! m!) (2.10)

is the number of ways of choosing m objects out of N.


Binary Variables: Bayesian way
• A binomial probability is calculated by raising the probability of success to
the power of the number of successes, raising the probability of failure to the
power of the number of trials minus the number of successes, and then
multiplying the product by the number of ways of choosing that many successes
out of the total number of trials.
• For example, assume that a casino created a new game in which participants
can place bets on the number of heads or tails in a specified number of coin flips.
Assume a participant wants to place a $10 bet that there will be exactly six heads
in 20 coin flips. The participant wants to calculate the probability of this occurring,
and therefore, they use the calculation for binomial distribution.
• The probability was calculated as (20! / (6! × (20 − 6)!)) × (0.50)^6 × (1 − 0.50)^(20−6).
Consequently, the probability of exactly six heads occurring in 20 coin flips is
0.0369, or 3.7%. The expected value was 10 heads in this case, so the participant
made a poor bet. The graph below shows that the mean is 10 (the expected
value), and the chance of getting six heads lies in the left tail, shown in red. You can
see that six heads is less likely than seven, eight, nine, 10, 11, 12, or 13 heads.
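• The same calculation can be checked numerically; a small sketch using scipy.stats (illustrative, not part of the original slides):

```python
from math import comb
from scipy.stats import binom

n, k, p = 20, 6, 0.5

# Direct formula: C(20, 6) * 0.5^6 * 0.5^14
direct = comb(n, k) * p**k * (1 - p)**(n - k)

# Same value from the binomial pmf
pmf = binom.pmf(k, n, p)

print(direct, pmf)        # both ~0.0369644
print(binom.mean(n, p))   # expected value: 10.0 heads
```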



Binary Variables: Bayesian way
• Histogram of the above example

N = 20, p = 0.5
P(X = 6) = 0.0369644



Binary Variables: Beta distribution
• For a Bayesian treatment, we take the beta distribution as conjugate
prior:

Beta(μ|a, b) = [Γ(a + b) / (Γ(a) Γ(b))] μ^(a−1) (1 − μ)^(b−1) (2.13)

• (The gamma function extends the factorial to real numbers, i.e., Γ(n) =
(n − 1)!.) Mean and variance are given by

E[μ] = a/(a + b) (2.15)
var[μ] = ab / ((a + b)^2 (a + b + 1)) (2.16)
• The parameters a and b are often called hyperparameters because they
control the distribution of the parameter μ.
Binary Variables: Beta distribution

Figure 2.2 Plots of the beta distribution Beta(μ|a, b) given by (2.13) as a function of μ for various values of the hyperparameters a and b.


Binary Variables: Beta distribution
• Multiplying the binomial likelihood function (2.9) and the
beta prior (2.13), the posterior is a beta distribution and
has the form:

p(μ|m, l, a, b) ∝ μ^(m+a−1) (1 − μ)^(l+b−1) (2.17)

with l = N − m.
 Simple interpretation of hyperparameters a and b as effective
number of observations of x = 1 and x = 0 (a priori)
 As we observe new data, a and b are updated
 As N → ∞, the variance (uncertainty) decreases and the mean
converges to the ML estimate
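• A minimal Bayesian coin-flip sketch (illustrative; assumes scipy): observing m heads and l tails updates Beta(a, b) to Beta(a + m, b + l), and the posterior mean moves toward the ML estimate as N grows.

```python
from scipy.stats import beta

a, b = 2, 2                        # prior hyperparameters (prior pseudo-counts)
m, l = 3, 0                        # observed: 3 heads, 0 tails (N = 3)

post = beta(a + m, b + l)          # posterior is Beta(a + m, b + l)
print(post.mean())                 # ~0.714, pulled toward 0.5 by the prior
print(3 / 3)                       # ML estimate mu_ML = m/N = 1.0 (overfits)

# With many more observations the posterior mean approaches m/N and the variance shrinks.
m, l = 5_000, 5_000
print(beta(a + m, b + l).mean())   # ~0.5
```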



Multinomial Variables: Frequentist’s Way
• Binary variables can be used to describe quantities that can
take one of two possible values. Eg. H/T of a coin
• Often, however, we encounter discrete variables that can
take on one of K possible mutually exclusive states.
• Although there are various alternative ways to express such
variables, we shall see shortly that a particularly convenient
representation is the 1-of-K scheme in which the variable is
represented by a K-dimensional vector x in which one of the
elements xk equals 1, and all remaining elements equal 0.
• So, for instance if we have a variable that can take K = 6
states and a particular observation of the variable happens
to correspond to the state where x3 = 1, then x will be
represented by
x = (0, 0, 1, 0, 0, 0)^T. (2.25)
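• A small sketch of the 1-of-K (one-hot) representation in Python (illustrative, not from the slides):

```python
import numpy as np

def one_hot(state, K):
    """Return the 1-of-K vector with a 1 in position `state` (1-indexed)."""
    x = np.zeros(K, dtype=int)
    x[state - 1] = 1
    return x

print(one_hot(3, 6))   # [0 0 1 0 0 0] -> the observation x3 = 1 with K = 6
```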
Multinomial Variables: Frequentist’s Way
• A random variable with K mutually exclusive states can be
represented as a K-dimensional vector x with x_k = 1 and x_i = 0 for i ≠ k.
• If we denote the probability of x_k = 1 by the parameter μ_k, then
the distribution of x is given by:

p(x|μ) = ∏_k μ_k^{x_k} (2.26)

where μ = (μ_1, . . . , μ_K)^T, and the parameters μ_k are constrained to
satisfy μ_k ≥ 0 and Σ_k μ_k = 1 because they represent probabilities.
• The distribution (2.26) can be regarded as a generalization of the
Bernoulli distribution to more than two outcomes.
Multinomial Variables: Frequentist’s Way
• For a dataset D with N independent observations x_1, …, x_N,
the corresponding likelihood function takes the form

p(D|μ) = ∏_n ∏_k μ_k^{x_nk} = ∏_k μ_k^{m_k}, with counts m_k = Σ_n x_nk,

so the likelihood depends on the data only through the K counts m_k.


Multinomial Variables: Bayesian Way 1/3
• The multinomial distribution is a joint distribution of the
counts m_1, . . . , m_K, conditioned on μ and N:

Mult(m_1, m_2, . . . , m_K|μ, N) = (N! / (m_1! m_2! · · · m_K!)) ∏_k μ_k^{m_k}

where the variables m_k are subject to the constraint:

Σ_k m_k = N


Multinomial Variables: Bayesian Way 2/3
• For a Bayesian treatment, the Dirichlet distribution can be
taken as conjugate prior:

Dir(μ|α) = [Γ(α_0) / (Γ(α_1) · · · Γ(α_K))] ∏_k μ_k^{α_k − 1} (2.38)

where α = (α_1, . . . , α_K)^T and α_0 = Σ_k α_k.


Multinomial Variables: Dirichlet Distribution
• Some plots of a Dirichlet distribution over 3 variables:

Figure 2.5 Plots of the Dirichlet distribution over three variables, where the two horizontal axes
are coordinates in the plane of the simplex and the vertical axis corresponds to the value of the
density. Here {αk} = 0.1 on the left plot, {αk} = 1 in the centre plot, and {αk} = 10 in the right plot.
Multinomial Variables: Dirichlet Distribution
• Some plots of a Dirichlet distribution over 3 variables:

Dirichlet distribution with parameter values (clockwise from top left): α = (6, 2, 2), (3, 7, 5), (6, 2, 6), (2, 3, 4).
Multinomial Variables: Bayesian Way 3/3
• Multiplying the prior (2.38) by the likelihood function (2.34)
yields the posterior:

p(μ|D, α) = Dir(μ|α + m) ∝ ∏_k μ_k^{α_k + m_k − 1}

with m = (m_1, . . . , m_K)^⊤. Similarly to the binomial distribution with
its beta prior, α_k can be interpreted as the effective number of
observations of x_k = 1 (a priori).
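• A minimal sketch of the Dirichlet-multinomial update (illustrative; assumes NumPy): the observed counts m_k are simply added to the prior concentrations α_k.

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # symmetric Dirichlet prior over K = 3 states
counts = np.array([12, 3, 5])       # observed counts m_k from N = 20 draws

alpha_post = alpha + counts         # posterior is Dir(mu | alpha + m)
post_mean = alpha_post / alpha_post.sum()

print(alpha_post)                   # [13.  4.  6.]
print(post_mean)                    # posterior mean, close to the ML estimate m_k / N
print(counts / counts.sum())        # ML estimate for comparison
```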



The Gaussian distribution
• The Gaussian, also known as the normal distribution, is a
widely used model for the distribution of continuous variables.
• In the case of a single variable x, the Gaussian distribution
can be written in the form

N(x|μ, σ^2) = (1 / (2πσ^2)^{1/2}) exp{−(x − μ)^2 / (2σ^2)} (2.42)

where μ is the mean and σ^2 is the variance.
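• An illustrative sketch (not in the slides) evaluating this density directly and via scipy.stats.norm:

```python
import numpy as np
from scipy.stats import norm

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

x, mu, sigma2 = 1.0, 0.0, 4.0
print(gaussian_pdf(x, mu, sigma2))                 # direct formula
print(norm.pdf(x, loc=mu, scale=np.sqrt(sigma2)))  # same value from scipy
```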



The Gaussian distribution

• For a D-dimensional vector x, the multivariate Gaussian
distribution takes the form

N(x|μ, Σ) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp{−(1/2) (x − μ)^T Σ^{-1} (x − μ)} (2.43)

where μ is a D-dimensional mean vector,
Σ is a D × D covariance matrix, and
|Σ| denotes the determinant of Σ.
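• A small illustrative sketch evaluating the multivariate density with scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])           # D = 2 mean vector
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])      # 2 x 2 covariance matrix

rv = multivariate_normal(mean=mu, cov=Sigma)
x = np.array([0.5, 0.5])
print(rv.pdf(x))                    # density value N(x | mu, Sigma)
```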



The Gaussian distribution
• The Gaussian distribution arises in many different contexts and
can be motivated from a variety of different perspectives.
 The Gaussian distribution maximizes the entropy among all distributions with a given mean and variance.
 The central limit theorem (due to Laplace): subject to certain mild
conditions, the sum of a set of random variables, which is itself a
random variable, has a distribution that becomes increasingly
Gaussian as the number of terms in the sum increases (Walker, 1969).

Figure 2.6 Histogram plots of the mean of N uniformly distributed numbers for various values
of N. We observe that as N increases, the distribution tends towards a Gaussian.
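• A quick simulation of the behaviour shown in Figure 2.6 (illustrative sketch; assumes NumPy): the mean of N uniform variables becomes increasingly Gaussian as N grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# For each N, draw 100,000 means of N uniform(0, 1) numbers.
for N in [1, 2, 10]:
    means = rng.uniform(size=(100_000, N)).mean(axis=1)
    print(f"N={N:2d}  mean={means.mean():.3f}  std={means.std():.3f}")
    # As N increases, the histogram of `means` tends towards a Gaussian
    # centred at 0.5 with standard deviation 1/sqrt(12 N).
```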
The Gaussian distribution: Properties
• The density is a function of the Mahalanobis distance from x to μ:

Δ^2 = (x − μ)^T Σ^{-1} (x − μ) (2.44)

The quantity Δ is called the Mahalanobis distance from μ to x and
reduces to the Euclidean distance when Σ is the identity matrix.
• The Gaussian distribution is constant on surfaces in x-space for
which this quadratic form is constant.
• The expectation of x under the Gaussian distribution is: E[x] = μ
• The covariance matrix of x is: cov[x] = Σ
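• A short illustrative sketch computing the Mahalanobis distance and confirming that it reduces to the Euclidean distance when Σ is the identity matrix:

```python
import numpy as np

def mahalanobis(x, mu, Sigma):
    """Mahalanobis distance: sqrt((x - mu)^T Sigma^{-1} (x - mu))."""
    diff = x - mu
    return np.sqrt(diff @ np.linalg.solve(Sigma, diff))

x = np.array([2.0, 0.0])
mu = np.array([0.0, 0.0])

print(mahalanobis(x, mu, np.eye(2)))               # 2.0 -> Euclidean distance
print(mahalanobis(x, mu, np.array([[4.0, 0.0],
                                   [0.0, 1.0]])))  # 1.0 -> rescaled by the covariance
```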



The Gaussian distribution: Properties
• The density is constant on elliptical surfaces.

Figure 2.7 The red curve shows the elliptical surface of constant probability density for a Gaussian in a two-dimensional space x = (x1, x2) on which the density is exp(−1/2) of its value at x = μ. The major axes of the ellipse are defined by the eigenvectors u_i of the covariance matrix, with corresponding eigenvalues λ_i.
The Gaussian distribution: Conditional and marginal laws
• Given a Gaussian distribution N(x|μ, Σ) with the partition:

x = (x_a, x_b), μ = (μ_a, μ_b), Σ = (Σ_aa Σ_ab; Σ_ba Σ_bb)

• The conditional distribution p(x_a|x_b) is a Gaussian with
parameters:

μ_{a|b} = μ_a + Σ_ab Σ_bb^{-1} (x_b − μ_b)
Σ_{a|b} = Σ_aa − Σ_ab Σ_bb^{-1} Σ_ba

• The marginal distribution p(x_a) is a Gaussian with
parameters (μ_a, Σ_aa).
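• An illustrative sketch (assumes NumPy; notation as above) computing the conditional mean and variance of x_a given x_b for a partitioned two-dimensional Gaussian:

```python
import numpy as np

# Joint Gaussian over (x_a, x_b), both scalar here for simplicity.
mu = np.array([0.0, 1.0])              # (mu_a, mu_b)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])         # [[S_aa, S_ab], [S_ba, S_bb]]

x_b = 2.0                              # observed value of x_b

# Conditional p(x_a | x_b) parameters:
mu_a_given_b = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x_b - mu[1])
var_a_given_b = Sigma[0, 0] - Sigma[0, 1] * Sigma[1, 0] / Sigma[1, 1]

print(mu_a_given_b, var_a_given_b)     # 0.8, 1.36

# The marginal p(x_a) is simply N(mu_a, S_aa) = N(0, 2).
```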
The Gaussian distribution: Bayes’ theorem
• A linear Gaussian model is a pair of vectors (x, y) described
by the relations:

y = Ax + b + ε

where x ~ N(x|μ, Λ^{-1}) is Gaussian and ε is a zero-mean (centred) Gaussian
noise with precision L.
• Then the marginal and the posterior are again Gaussian:

p(y) = N(y | Aμ + b, L^{-1} + A Λ^{-1} A^T)
p(x|y) = N(x | Σ{A^T L (y − b) + Λμ}, Σ)

where Σ = (Λ + A^T L A)^{-1}.


The Gaussian distribution: Maximum Likelihood
• Assume X is a set of N i.i.d. observations following a
Gaussian distribution. The parameters of the distribution, estimated by ML,
are:

μ_ML = (1/N) Σ_n x_n
Σ_ML = (1/N) Σ_n (x_n − μ_ML)(x_n − μ_ML)^T

• The empirical mean is unbiased, but the empirical covariance is not.
• The bias can be corrected by multiplying Σ_ML by the factor N/(N − 1).
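• A quick illustrative check (assumes NumPy) of the ML estimators and the N/(N − 1) bias correction:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(5, 1))   # N = 5 observations, D = 1

N = len(X)
mu_ml = X.mean(axis=0)                            # ML (and unbiased) estimate of the mean
Sigma_ml = ((X - mu_ml).T @ (X - mu_ml)) / N      # biased ML covariance (divides by N)
Sigma_unbiased = Sigma_ml * N / (N - 1)           # bias-corrected covariance

print(mu_ml, Sigma_ml, Sigma_unbiased)
print(np.cov(X.ravel(), ddof=1))                  # NumPy's unbiased estimate matches
```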
The Gaussian distribution: Maximum Likelihood
• The mean estimated from N data points is a revision of the
estimate obtained from the first (N − 1) data points:

μ_ML^(N) = μ_ML^(N−1) + (1/N) (x_N − μ_ML^(N−1))

• It is a particular case of the Robbins-Monro algorithm,
which iteratively searches for the root of a regression function.
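• A minimal illustrative sketch of the sequential (online) update of the mean, which matches the batch estimate exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=1.0, size=1_000)

mu = 0.0
for n, x in enumerate(data, start=1):
    mu = mu + (x - mu) / n     # mu^(N) = mu^(N-1) + (x_N - mu^(N-1)) / N

print(mu, data.mean())         # the online and batch estimates coincide
```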



The Gaussian distribution: Bayesian inference
• The conjugate prior for μ (with the variance known) is Gaussian.
• The conjugate prior for the precision λ = 1/σ^2 (with the mean known) is a Gamma distribution.
• The conjugate prior for the pair (μ, λ) is the normal-gamma
distribution N(μ|μ_0, λ_0^{-1}) Gam(λ|a, b), where λ_0 is a linear
function of λ.
• The posterior distribution exhibits a coupling between
the precision of μ and λ.
• The multidimensional conjugate prior is the Gaussian-Wishart
distribution.



The Gaussian distribution: Limitations
• A lot of parameters to estimate, D(1 + (D + 1)/2):
simplification (diagonal covariance matrix),
• Maximum likelihood estimators are not robust to outliers:
Student's t-distribution,
• Not able to describe periodic data: von Mises distribution,
• Unimodal distribution: mixture of Gaussians.



After the Gaussian distribution : t-Student distribution
• The Student's t-distribution is an infinite mixture of Gaussians having the
same mean but different precisions (governed by a Gamma distribution):

St(x|μ, λ, ν) = [Γ(ν/2 + 1/2) / Γ(ν/2)] (λ/(πν))^{1/2} [1 + λ(x − μ)^2/ν]^{−ν/2 − 1/2}

• It is robust to outliers.
Figure 2.16 Illustration of the robustness of Student's t-distribution compared to a Gaussian.
(a) Histogram distribution of 30 data points drawn from a Gaussian distribution, together with the maximum likelihood fit obtained from a t-distribution (red curve) and a Gaussian (green curve, largely hidden by the red curve). Because the t-distribution contains the Gaussian as a special case it gives almost the same solution as the Gaussian.
(b) The same data set but with three additional outlying data points showing how the Gaussian (green curve) is strongly distorted by the outliers, whereas the t-distribution (red curve) is relatively unaffected.
After the gaussian distribution : von Mises distribution
• When the data are periodic, it is necessary to work with polar
coordinates.
• The von Mises distribution is obtained by conditioning the two-dimensional
Gaussian distribution to the unit circle:
• the distribution is:

p(θ|θ_0, m) = (1 / (2π I_0(m))) exp{m cos(θ − θ_0)}

where
 m is the concentration (precision) parameter,
 θ_0 is the mean,
 I_0(m) is the zeroth-order modified Bessel function of the first kind.
Mixtures (of Gaussians) (1/3)
• Data with distinct regimes are better modeled with mixtures

• General form: convex combination of component densities

p(x) = Σ_k π_k p(x|k), with π_k ≥ 0 and Σ_k π_k = 1
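• An illustrative sketch (assumes NumPy and scipy) evaluating a two-component Gaussian mixture density on R:

```python
import numpy as np
from scipy.stats import norm

# Mixing coefficients (a convex combination) and component parameters
pis = [0.3, 0.7]
mus = [-2.0, 1.5]
sigmas = [0.5, 1.0]

def mixture_pdf(x):
    """p(x) = sum_k pi_k N(x | mu_k, sigma_k^2)"""
    return sum(pi * norm.pdf(x, loc=mu, scale=s)
               for pi, mu, s in zip(pis, mus, sigmas))

print(mixture_pdf(0.0))                           # mixture density at x = 0
print(mixture_pdf(np.array([-2.0, 0.0, 1.5])))    # vectorised evaluation
```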



Mixtures (of Gaussians) (2/3)
• The Gaussian is a popular density, and so are mixtures thereof

• Example of a mixture of Gaussians on R

• Example of a mixture of Gaussians on R^2



Mixtures (of Gaussians) (3/3)



The Exponential Family (1/3)
• Large family of useful distributions with common properties
 Bernoulli, beta, binomial, chi-square, Dirichlet, gamma,
 Gaussian, geometric, multinomial, Poisson, Weibull, . . .
 Not in the family: Cauchy, Laplace, mixture of Gaussians, . . .
 Variable can be discrete or continuous (or vectors thereof)
• General form: log-linear interaction

p(x|η) = h(x) g(η) exp{η^T u(x)}

• Normalization determines the form of g:

g(η) ∫ h(x) exp{η^T u(x)} dx = 1

• Differentiation with respect to η, using Leibniz's rule, reveals

−∇ ln g(η) = E[u(x)]
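• For example (a standard worked case, not shown on the slide): the Bernoulli distribution is a member of the family, since Bern(x|μ) = μ^x (1 − μ)^(1−x) = (1 − μ) exp{x ln(μ/(1 − μ))}, so the sufficient statistic is u(x) = x, the natural parameter is η = ln(μ/(1 − μ)), h(x) = 1, and g(η) = σ(−η), where σ denotes the logistic sigmoid.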



The Exponential Family (2/3): Sufficient Statistics
• Maximum likelihood estimation for i.i.d. data X = {x_1, . . . , x_N}:

p(X|η) = (∏_n h(x_n)) g(η)^N exp{η^T Σ_n u(x_n)}

• Setting the gradient of the log-likelihood w.r.t. η to zero yields

−∇ ln g(η_ML) = (1/N) Σ_n u(x_n)

 Σ_n u(x_n) is all we need from the data: the sufficient statistics

• Combining with the result from the previous slide, the ML estimate η_ML makes the model expectation E[u(x)] equal to the sample average (1/N) Σ_n u(x_n).


The Exponential Family (3/3): Conjugate Priors
• Given a probability distribution p(x|η), a prior p(η) is conjugate if the
posterior p(η|x) has the same form as the prior.
• All exponential family members have conjugate priors:

p(η|χ, ν) = f(χ, ν) g(η)^ν exp{ν η^T χ}

• Combining the prior with an exponential family likelihood,
• we obtain the posterior (2.230):

p(η|X, χ, ν) ∝ g(η)^{ν+N} exp{η^T (Σ_n u(x_n) + νχ)}

so ν can be interpreted as an effective number of pseudo-observations with sufficient statistic χ.


Nonparametric methods
• So far we have seen parametric densities in this chapter
 Limitation: we are tied down to a specific functional form
 Alternatively we can use (flexible) nonparametric methods
• Basic idea: consider a small region R, with P = ∫_R p(x) dx
 For N → ∞ data points we find about K ≈ NP points in R
 For a small R with volume V: P ≈ p(x)V for x ∈ R
 Thus, combining, we find: p(x) ≈ K/(NV)
• Simplest example: histograms (see the sketch below)
 Choose bins of width Δ_i
 Estimate the density in the i-th bin as p_i = n_i/(N Δ_i)
 Tough in many dimensions: smart chopping required
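• A minimal illustrative sketch (assumes NumPy) of the histogram density estimate p_i = n_i/(N Δ_i):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1_000)                # N = 1000 samples

counts, edges = np.histogram(data, bins=20)  # n_i and the bin edges
widths = np.diff(edges)                      # Delta_i
density = counts / (len(data) * widths)      # p_i = n_i / (N * Delta_i)

print((density * widths).sum())              # the estimate integrates to 1
```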


Kernel density estimators



Kernel density estimators



Kernel density estimators: fix V , find K



Kernel density estimators: fix V , find K

