
Probabilistic ML

Extra Classes
April 03, 2024 (Wed) 6PM L20
April 10, 2024 (Wed) 6PM L20

Timing and venue same for both

In lieu of missed classes on account of Institute and Gymkhana holidays

No extra class on April 06, 2024 (Sat)


Can you Guess the Mean? 3
There is a Gaussian with unknown mean but known variance (for the sake of simplicity) from which we receive independent samples

Can we estimate the “model” from these samples?


What happens when we do see some data, namely the actual samples from the distribution?
Likelihood function: for a candidate model and a sample, it tells us how likely that sample is under that model

MLE: the model that maximizes the likelihood of all the observed samples

Suppose we believe (e.g. someone tells us), even before the samples have been presented, that the mean definitely lies in the interval [0, 2] (but could otherwise be any value within that interval)
We use the samples and the rules of probability to update our beliefs about what the mean can and cannot be
In this case we are said to have a prior belief, or simply prior, on the models – in this case a uniform prior. This means that unless we see any data to make us believe otherwise, we will think the mean could equally well be any value in [0, 2]
Let us see how to do this
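To make the setup concrete, here is a minimal sketch (not from the slides; the true mean, variance, and sample size are made up for illustration) of drawing independent samples from such a Gaussian and computing the MLE of its mean, which for known variance is simply the sample average.

import numpy as np

rng = np.random.default_rng(0)
true_mu, sigma = 1.3, 0.5            # hypothetical values, for illustration only
samples = rng.normal(true_mu, sigma, size=50)

mu_mle = samples.mean()              # MLE of a Gaussian mean = sample average
print(f"MLE estimate of the mean: {mu_mle:.3f}")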
Posterior 4
Before we see any data, we have a prior belief on the models
It tells us which models are more likely/less likely before we have seen data
Then we see data and we wish to update our belief. Basically we want to find out how likely each model is given the data
This quantity has a name: posterior belief or simply posterior
It tells us which models are more likely/less likely after we have seen data
Note that when we talk of the prior or posterior of a model here, we mean probability density and not probability mass, since the mean is a continuous r.v.
The posterior is computed using the Bayes rule, the fact that the samples are independent, and the law of total probability, starting from the prior ℙ[𝜇] = UNIF([0, 2]) (whose density is ½ for values in [0, 2] and 0 otherwise)
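A numerical sketch of this Bayes-rule update (assumed, not from the slides; the data, grid, and variance below are made up): place a Unif([0, 2]) prior on the mean, multiply by the Gaussian likelihood of the observed samples, and normalise over a grid of candidate means.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma = 0.5
samples = rng.normal(1.3, sigma, size=20)          # hypothetical observed data

grid = np.linspace(-1.0, 3.0, 2001)                # candidate values of the mean
prior = np.where((grid >= 0.0) & (grid <= 2.0), 0.5, 0.0)   # Unif([0, 2]) density

# log-likelihood of all samples under each candidate mean (samples are independent)
log_lik = norm.logpdf(samples[:, None], loc=grid[None, :], scale=sigma).sum(axis=0)

unnorm = prior * np.exp(log_lik - log_lik.max())   # subtract max to avoid underflow
posterior = unnorm / (unnorm.sum() * (grid[1] - grid[0]))   # normalise to a density

The posterior is zero wherever the prior is zero, and concentrates around the sample mean inside [0, 2] as more samples arrive.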
Maximum a Posteriori (MAP) Estimate 5
Just as MLE gave us the model that maximizes the likelihood, MAP gives us the model that maximizes the posterior
Thus, MAP returns the model that becomes the most likely one after we have seen some data
It is better to choose priors that do not completely exclude some models by giving them 0 probability (as we did)
Note: posterior probability (density) of some models may be larger than their prior probability (density) i.e. after seeing data those models seem more likely; for other models, it may go down i.e. they seem less likely after seeing the data
True! Even in general, if your priors are bad, or too strong, then you may end up getting funny models as a result of doing MAP estimation
Note: However, if the prior probability (density) of some model is 0, the posterior probability (density) has to be zero as well
Indeed! For example – if we were wrong and the mean was actually not in [0, 2], then no matter how many samples we see, we will never estimate it correctly!! We need to be careful about priors
Warning: Do not read too much into these names likelihood, prior, posterior. All of them tell us how likely something is, given or not given something else
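As a sanity check on the notes above, here is a minimal sketch (assuming the Unif([0, 2]) prior and Gaussian likelihood from the earlier slides): MAP with this prior is simply the MLE clipped to the interval, so a true mean outside [0, 2] can never be recovered, no matter how much data we see.

import numpy as np

def map_estimate_uniform_prior(samples, lo=0.0, hi=2.0):
    # The posterior is the likelihood restricted to [lo, hi] (zero elsewhere),
    # so the MAP estimate is the sample mean clipped to that interval.
    return float(np.clip(np.mean(samples), lo, hi))

# If the true mean were 2.5 (outside the prior's support), this would return 2.0
# regardless of the number of samples.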
MAP vs Regularization 6
Taking negative logs on both sides of the Bayes rule turns the MAP problem into a minimization problem
However, the negative log prior is constant for means inside the interval [0, 2] and infinite otherwise (the prior density is 0 there)
So the MAP problem becomes a least-squares problem subject to the constraint that the mean lies in [0, 2] (a sketch of the derivation follows this slide)
Thus, even MAP solutions can correspond to optimization problems!
In this case, what was the prior became a constraint
In general, the prior becomes a regularizer
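A hedged reconstruction of the derivation sketched above (the slide's own equations did not survive extraction; the notation is assumed):

\hat{\mu}_{\mathrm{MAP}} = \arg\max_{\mu}\ \mathbb{P}[\mu \mid X] = \arg\min_{\mu}\ \Big( -\sum_{i=1}^{n} \log \mathbb{P}[x^i \mid \mu] \;-\; \log \mathbb{P}[\mu] \Big)

Since -\log \mathbb{P}[\mu] is a constant for \mu \in [0, 2] and +\infty otherwise,

\hat{\mu}_{\mathrm{MAP}} = \arg\min_{\mu} \sum_{i=1}^{n} \frac{(x^i - \mu)^2}{2\sigma^2} \quad \text{s.t.} \quad \mu \in [0, 2]

i.e. what was the prior is now the constraint.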
MAP vs Regularization 7
Consider the same problem as before but a different prior
This time we do not believe the mean must have been in the interval [0, 2], but have a much milder prior that it is not too large in magnitude
A good way to express this is to use a Gaussian prior
The strength of the prior dictates the regularization constant. Be careful not to have strong priors (uninformed strong opinions are bad in real life too)
MAP: taking negative logs as before, the Gaussian prior term becomes a squared penalty on the mean
Thus, a Gaussian prior gave us L2 regularization!
Similarly, had we used a Laplacian prior, we would have obtained L1 regularization instead
Note: the prior variance effectively dictates the regularization constant – not useless!!
Note: this is basically ridge regression except in one dimension!!
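A minimal sketch in code (assumed; sigma and tau are hypothetical names for the likelihood and prior standard deviations, not the slides' notation) of the one-dimensional MAP estimate under a zero-mean Gaussian prior, which shrinks the sample mean towards 0 exactly as ridge regression would:

import numpy as np

def map_estimate_gaussian_prior(samples, sigma=0.5, tau=1.0):
    # argmin over mu of  sum_i (x_i - mu)^2 / (2 sigma^2)  +  mu^2 / (2 tau^2)
    # Setting the derivative to zero gives the shrunken sample mean below.
    n = len(samples)
    return float(np.sum(samples) / (n + sigma**2 / tau**2))

The ratio sigma^2 / tau^2 plays the role of the regularization constant: a narrower (stronger) prior, i.e. a smaller tau, shrinks the estimate more aggressively.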
Probabilistic Regression Revisited 8
To perform probabilistic regression, we need to assign a label distribution over all real values for every data point
Had it been binary classification, we would have assigned a distribution over the two possible labels
We assume a Gaussian observation model (likelihood function) for the label given the features and the model vector
Properties of the (univariate) Gaussian tell us that this is the same as saying that the label equals the model's prediction plus zero-mean Gaussian noise
Also, let us assume that we like model vectors with small L2 norm better than those with larger L2 norm i.e. have a zero-mean Gaussian prior over models
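One consistent way to write down the observation model and prior just described (a hedged reconstruction; the slide's own symbols were lost in extraction, so the variance names here are assumptions):

\mathbb{P}[y \mid x, w] = \mathcal{N}(y;\ w^\top x,\ \sigma^2), \qquad \text{equivalently} \qquad y = w^\top x + \epsilon,\ \ \epsilon \sim \mathcal{N}(0, \sigma^2),

with prior w \sim \mathcal{N}(\mathbf{0},\ \tau^2 I), which places more density on model vectors with small L2 norm.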
Probabilistic Regression Revisited 9
Recall that the prior encodes our beliefs before we have seen data
You might be wondering why we conditioned on the features, i.e. wrote the likelihood as a distribution over the label given the features and the model rather than over the label and features jointly. This is because we are currently assuming that features do not depend on the model. Thus, the chain rule lets us factor out the distribution of the features, which just does not depend on the model. Note that we also assume in calculations that the data points are independent of each other
Posterior encodes our beliefs after we have seen data – Bayes Rule

Using independence, the likelihood factorizes over the individual data points

Ignoring terms that don't involve the model, taking logs gives us the MAP estimate (a reconstruction is written out at the end of this slide)
Note: Likelihood is a distribution over labels i.e. over real values, but the prior and posterior are distributions over models i.e. over model vectors
We will soon study “generative” models where the features themselves would
become random variables dependent on a (more complicated) model
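A hedged reconstruction of the MAP calculation referred to above (using the assumed notation from the sketch after slide 8): by Bayes rule and independence,

\mathbb{P}[w \mid \{(x^i, y^i)\}_{i=1}^{n}] \;\propto\; \mathbb{P}[w] \prod_{i=1}^{n} \mathbb{P}[y^i \mid x^i, w]

so, ignoring terms that do not involve w and taking negative logs,

\hat{w}_{\mathrm{MAP}} = \arg\min_{w}\ \frac{1}{2\sigma^2}\sum_{i=1}^{n} \big(y^i - w^\top x^i\big)^2 \;+\; \frac{1}{2\tau^2}\,\|w\|_2^2.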
MAP for Probabilistic Regression 10
For Gaussian likelihood and Gaussian prior, we get a least-squares objective with an L2 penalty on the model vector
Be careful, there are two variance terms here: one from the likelihood and one from the prior
The above is exactly L2-regularized least squares i.e. ridge regression
Thus, the two variances together decide the regularization constant
Just so that we are clear, there is nothing special about the zero vector. If we believe the model is close to some other vector, we should use a Gaussian prior centred at that vector instead, and the MAP solution will change accordingly
There is a multivariate version of the Laplace distribution too!
Using it as a prior with mean at the zero vector will give us LASSO!
(Warning: the expression for the Laplace PDF for a general covariance is a bit tricky)
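A minimal sketch of the resulting closed-form MAP solution (assumed notation: sigma for the likelihood noise, tau for the prior scale; not the slides' own symbols):

import numpy as np

def map_linear_regression(X, y, sigma=1.0, tau=1.0):
    # X: (n, d) feature matrix, y: (n,) label vector
    # Solves argmin_w  ||y - X w||^2 / (2 sigma^2) + ||w||^2 / (2 tau^2),
    # i.e. ridge regression with regularization constant sigma^2 / tau^2.
    d = X.shape[1]
    lam = sigma**2 / tau**2
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)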
Bayesian Learning 11
Before we started doing probabilistic ML, we used to output a single
label. With PML we started giving a distribution over labels instead
However, we still do so using a single model
In MLE we use the model with highest likelihood function value to do so
In MAP we use the mode of the posterior distribution to do so
In Bayesian learning, we take this philosophy further – instead of
trusting a single model, we place partial trust, possibly over all models
Models with high posterior probability (density) value get high trust
Models with low posterior probability (density) value get low trust
We use Bayes rule yet again to perform these calculations
From PML to BML 12
I have with me data points, and a prior over models
For a test point, I wish to output a distribution over the set of all labels – we condition on the available data and the test point as we know these
Since we need models to predict labels, let us introduce them

Step 1 (law of total probability), Step 2 (chain rule of probability), Step 3 (get rid of conditionings that did not matter)
Note: the first factor is the distribution we would have given had that model indeed been the true one, and the second factor is our faith in that model being the true one!
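A minimal sketch (assumed, not from the slides) of this sum/integral over models: the predictive distribution averages each model's label distribution, weighted by our posterior faith in that model. When no closed form exists, we can approximate it by averaging over samples drawn from the posterior.

import numpy as np

def predictive_samples_mc(x_test, posterior_samples, sigma=1.0, rng=None):
    # posterior_samples: (S, d) array of model vectors w drawn from P[w | data]
    # For each sampled w, draw one label from P[y | x, w] = N(w^T x, sigma^2);
    # the resulting labels are (approximately) draws from the predictive posterior.
    rng = rng or np.random.default_rng(0)
    means = posterior_samples @ x_test
    return rng.normal(means, sigma)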
BML Trivia 13
is called the predictive posterior
Note: predictive posterior is a distribution over labels (not models)
For some very well behaved cases, the posterior and the predictive
posterior distributions have closed form expressions
The special cases where we have something called conjugate priors are one
such example
In all the other cases, we must use other techniques to work with the
(predictive) posteriors in an approximate manner
Powerful sampling algorithms, e.g. MCMC, Gibbs sampling, etc., exist
Discussion beyond the scope of CS771 – courses like CS772 discuss
this
Bayesian Regression 14
Suppose we have a Gaussian likelihood and a Gaussian prior; then the posterior over models is also Gaussian, with a closed-form mean and covariance
Note that the posterior mean is simply the MAP solution – makes sense since MAP is the mode of the posterior and for a Gaussian, the mean is the mode
Predictive Posterior: also Gaussian, with a closed-form mean and variance
Note: in this case, the variance of the predicted distribution depends on the test point itself – we can deduce whether the prediction is confident or not!
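A minimal sketch of the closed forms this slide refers to (assumed notation: likelihood N(y; w^T x, sigma^2) and prior N(0, tau^2 I); not the slides' own symbols):

import numpy as np

def bayesian_linear_regression(X, y, x_test, sigma=1.0, tau=1.0):
    # Posterior over models:  P[w | X, y] = N(mu_N, Sigma_N)
    # Predictive posterior:   P[y | x_test, X, y] = N(x_test^T mu_N, sigma^2 + x_test^T Sigma_N x_test)
    d = X.shape[1]
    Sigma_N = np.linalg.inv(X.T @ X / sigma**2 + np.eye(d) / tau**2)  # posterior covariance
    mu_N = Sigma_N @ (X.T @ y) / sigma**2                             # posterior mean (= MAP solution)
    pred_mean = x_test @ mu_N
    pred_var = sigma**2 + x_test @ Sigma_N @ x_test                   # depends on the test point itself
    return mu_N, Sigma_N, pred_mean, pred_var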
Conjugate Priors 15
Bayesian Logistic Regression or Softmax Regression is not nearly as
pretty – no closed form solutions for posterior or predictive posterior
However, some likelihood-prior pairs are special
Always yield a posterior that is of the same family as the prior
Such pairs of distributions are called conjugate pairs
The prior in such cases is said to be conjugate to the likelihood
Gaussian-Gaussian is one example – warning: one Gaussian is a likelihood
over reals, the other Gaussian is a prior over models (i.e. vectors)
Other conjugate pairs exist too
Discussion beyond the scope of CS771 – courses like CS772 discuss
this
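As a concrete illustration of conjugacy (a worked example, not taken from the slides), return to the mean-estimation problem from slide 3: with likelihood x^i ~ N(mu, sigma^2) (known variance) and a Gaussian prior mu ~ N(mu_0, tau^2), the posterior is again Gaussian,

\mu \mid x^1, \dots, x^n \;\sim\; \mathcal{N}\!\left( \frac{\tau^2 \sum_{i=1}^{n} x^i + \sigma^2 \mu_0}{n\tau^2 + \sigma^2},\ \frac{\sigma^2 \tau^2}{n\tau^2 + \sigma^2} \right),

so the Gaussian prior is conjugate to the Gaussian likelihood for the mean.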
