Probabilistic Regression
Be warned though – the $\sigma$ we chose will start mattering the moment we add regularization! It is just that in these simple cases it does not matter. $\sigma$ is usually treated like a hyperparameter and tuned.
Suppose I decide to use a Laplacian distribution instead and choose its parameters suitably – the likelihood function w.r.t. a data point then becomes the Laplace PDF, and MLE turns into minimizing absolute (rather than squared) errors.
So I am a bit confused. All MLEs (classification/regression) demand a model that places maximum probability on the true label. Why don't we just ask the model to predict the true label itself?
MLE: $\hat{\mathbf{w}}_{\text{MLE}} = \arg\max_{\mathbf{w}} \prod_{i=1}^n \mathbb{P}[y_i \mid \mathbf{x}_i, \mathbf{w}] = \arg\max_{\mathbf{w}} \sum_{i=1}^n \log \mathbb{P}[y_i \mid \mathbf{x}_i, \mathbf{w}]$
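To make the MLE connection concrete, here is a minimal hypothetical sketch (Python/NumPy with made-up synthetic data; none of these names come from the slides) showing that MLE under a Gaussian likelihood reduces to least squares, while MLE under a Laplace likelihood reduces to absolute-error minimization.

```python
# Hypothetical illustration: MLE for probabilistic linear regression on
# synthetic data. Gaussian likelihood -> least squares (closed form);
# Laplace likelihood -> minimize the sum of absolute errors.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.laplace(scale=0.3, size=n)

# Gaussian likelihood: maximizing it is the same as ordinary least squares
w_gauss, *_ = np.linalg.lstsq(X, y, rcond=None)

# Laplace likelihood: maximizing it is the same as minimizing absolute error
w_laplace = minimize(lambda w: np.abs(y - X @ w).sum(),
                     x0=np.zeros(d), method="Nelder-Mead").x

print("Gaussian-likelihood MLE:", w_gauss)
print("Laplace-likelihood MLE :", w_laplace)
```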
Suppose we believe (e.g. someone tells us), even before the samples have been presented, that the parameter definitely lies in a certain interval (but could otherwise be any value within that interval).
We then use the samples and the rules of probability to update our beliefs about what the parameter can and cannot be.
Let us see how to do this
Posterior
Before we see any data, we have a prior belief $\mathbb{P}[\mathbf{w}]$ on the models
It tells us which models are more likely/less likely before we have seen data
Note that when we say a model is more likely or less likely, we mean probability density and not probability mass, since $\mathbf{w}$ is a continuous r.v.
Then we see data and we wish to update our belief. Basically, we want to find out $\mathbb{P}\left[\mathbf{w} \mid \{(\mathbf{x}_i, y_i)\}_{i=1}^n\right]$
This quantity has a name: the posterior belief, or simply the posterior
Bayes rule (together with the fact that the samples are independent) gives us
$$\mathbb{P}\left[\mathbf{w} \mid \{(\mathbf{x}_i, y_i)\}\right] = \frac{\mathbb{P}\left[\{y_i\} \mid \mathbf{w}, \{\mathbf{x}_i\}\right] \cdot \mathbb{P}[\mathbf{w}]}{\mathbb{P}\left[\{y_i\} \mid \{\mathbf{x}_i\}\right]} \propto \mathbb{P}[\mathbf{w}] \cdot \prod_{i=1}^n \mathbb{P}[y_i \mid \mathbf{x}_i, \mathbf{w}]$$
The posterior tells us which models are more likely/less likely after we have seen data
Note: the likelihood is a distribution over labels, i.e. over $y$, but the prior and posterior are distributions over models, i.e. over $\mathbf{w}$
We will soon study “generative” models where the features themselves would
become random variables dependent on a (more complicated) model
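Returning to the posterior update above, here is a tiny numeric sketch (a hypothetical toy setup, not from the slides): a scalar model parameter $w$ with a uniform prior on an interval, noisy observations of $w$, and the posterior computed on a grid via Bayes rule.

```python
# Hypothetical toy example: prior says w lies in [0, 2] (uniform), we observe
# y_i = w + Gaussian noise, and Bayes rule gives the posterior density over w.
import numpy as np

rng = np.random.default_rng(1)
w_true, sigma = 1.3, 0.5
y = w_true + rng.normal(scale=sigma, size=20)          # independent samples

grid = np.linspace(0.0, 2.0, 401)                      # candidate values of w
dx = grid[1] - grid[0]
prior = np.ones_like(grid)                             # uniform prior density
log_lik = (-0.5 * ((y[:, None] - grid[None, :]) / sigma) ** 2).sum(axis=0)
post = prior * np.exp(log_lik - log_lik.max())         # prior x likelihood
post /= post.sum() * dx                                # normalize to a density

print("posterior mean of w:", np.sum(grid * post) * dx)
```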
MAP for Probabilistic Regression
Just so that we are clear, there is nothing special about $\mathbf{0}$ as the prior mean. If we believe $\mathbf{w}$ is close to some vector $\mathbf{w}_0$, we should use a Gaussian prior centred at $\mathbf{w}_0$ instead.
For a Gaussian likelihood $\mathcal{N}(y; \mathbf{w}^\top\mathbf{x}, \sigma^2)$ and a Gaussian prior $\mathcal{N}(\mathbf{w}; \mathbf{0}, \gamma^2 I)$, MAP becomes
$$\hat{\mathbf{w}}_{\text{MAP}} = \arg\min_{\mathbf{w}} \sum_{i=1}^n \left(y_i - \mathbf{w}^\top\mathbf{x}_i\right)^2 + \frac{\sigma^2}{\gamma^2}\,\|\mathbf{w}\|_2^2$$
Be careful, there are two variance terms here: $\sigma^2$ (likelihood) and $\gamma^2$ (prior)
The above is $\ell_2$-regularized least squares, i.e. ridge regression
Thus, $\sigma^2$ and $\gamma^2$ together decide the regularization constant
There is a multivariate version of the Laplace distribution too!
Using it as a prior with mean $\mathbf{0}$ (the zero vector) will give us LASSO!
(Warning: the expression for the Laplace PDF for a general covariance is a bit tricky.)
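As a quick numerical check of the Gaussian case above, here is a short hypothetical sketch (synthetic data; the symbols sigma and gamma are assumptions matching the two variance terms) that computes the MAP estimate via the ridge closed form with regularization constant $\sigma^2/\gamma^2$.

```python
# Hypothetical sketch: MAP under a Gaussian likelihood (variance sigma^2) and
# a zero-mean Gaussian prior (variance gamma^2) equals ridge regression with
# lambda = sigma^2 / gamma^2.
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 5
sigma, gamma = 0.5, 1.0
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(scale=sigma, size=n)

lam = sigma**2 / gamma**2                               # regularization constant
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print("MAP / ridge solution:", w_map)
```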
Bayesian Learning
Before we started doing probabilistic ML, we used to output a single
label. With PML we started giving a distribution over labels instead
However, we still do so using a single model
In MLE we use the model with highest likelihood function value to do so
In MAP we use the mode of the posterior distribution to do so
In Bayesian learning, we take this philosophy further – instead of trusting a single model, we place partial trust in many models, possibly all of them
Models with high posterior probability (density) value get high trust
Models with low posterior probability (density) value get low trust
We use Bayes rule yet again to perform these calculations
From PML to BML
I have with me $n$ data points $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$, and a prior $\mathbb{P}[\mathbf{w}]$ over models
For a test point $\mathbf{x}^*$, I wish to output a distribution over the set of all labels, i.e. $\mathbb{P}[y^* \mid \mathbf{x}^*, D]$ – we condition on the available data and $\mathbf{x}^*$ as we know these
Since we need models to predict labels, let us introduce them
$$\mathbb{P}[y^* \mid \mathbf{x}^*, D] \;\overset{1}{=}\; \int \mathbb{P}[y^*, \mathbf{w} \mid \mathbf{x}^*, D]\, d\mathbf{w} \;\overset{2}{=}\; \int \mathbb{P}[y^* \mid \mathbf{w}, \mathbf{x}^*, D] \cdot \mathbb{P}[\mathbf{w} \mid \mathbf{x}^*, D]\, d\mathbf{w} \;\overset{3}{=}\; \int \mathbb{P}[y^* \mid \mathbf{w}, \mathbf{x}^*] \cdot \mathbb{P}[\mathbf{w} \mid D]\, d\mathbf{w}$$
Step 1 (law of total probability), Step 2 (chain rule of probability), Step 3 (get rid of conditionings that did not matter)
Note: $\mathbb{P}[y^* \mid \mathbf{w}, \mathbf{x}^*]$ is the distribution we would have given had $\mathbf{w}$ indeed been the true model, and $\mathbb{P}[\mathbf{w} \mid D]$ is our faith in $\mathbf{w}$ being the true model!
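This averaging can also be carried out numerically. Below is a hypothetical Monte Carlo sketch reusing the earlier 1D toy setup (scalar model $w$, uniform prior on an interval): we draw models from the posterior, then draw a label from each drawn model, and the resulting samples follow the predictive posterior.

```python
# Hypothetical Monte Carlo sketch of the predictive posterior for the toy
# 1D model y = w + noise with a uniform prior on w in [0, 2].
import numpy as np

rng = np.random.default_rng(1)
w_true, sigma = 1.3, 0.5
y = w_true + rng.normal(scale=sigma, size=20)

grid = np.linspace(0.0, 2.0, 401)
dx = grid[1] - grid[0]
log_lik = (-0.5 * ((y[:, None] - grid[None, :]) / sigma) ** 2).sum(axis=0)
post = np.exp(log_lik - log_lik.max())
post /= post.sum() * dx                                  # posterior density

# Step 1: sample models w ~ posterior; Step 2: sample labels y* ~ P[y* | w]
w_samples = rng.choice(grid, size=100_000, p=post * dx)
y_samples = w_samples + rng.normal(scale=sigma, size=w_samples.size)
print("predictive mean:", y_samples.mean(), " predictive std:", y_samples.std())
```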
BML Trivia
$\mathbb{P}[y^* \mid \mathbf{x}^*, D]$ is called the predictive posterior
Note: predictive posterior is a distribution over labels (not models)
For some very well behaved cases, the posterior and the predictive
posterior distributions have closed form expressions
The special cases where we have something called conjugate priors are one
such example
In all the other cases, we must use other techniques to work with the
(predictive) posteriors in an approximate manner
Powerful sampling algorithms e.g. MCMC, Gibbs etc exist
Discussion beyond the scope of CS771 – courses like CS772 discuss
this
Bayesian Regression
Suppose we have a Gaussian likelihood $\mathbb{P}[y \mid \mathbf{x}, \mathbf{w}] = \mathcal{N}(y; \mathbf{w}^\top\mathbf{x}, \sigma^2)$ and a Gaussian prior $\mathbb{P}[\mathbf{w}] = \mathcal{N}(\mathbf{w}; \mathbf{0}, \gamma^2 I)$; then the posterior is also Gaussian, $\mathbb{P}[\mathbf{w} \mid D] = \mathcal{N}(\mathbf{w}; \boldsymbol{\mu}, \Sigma)$, with
$$\Sigma = \left(\frac{X^\top X}{\sigma^2} + \frac{I}{\gamma^2}\right)^{-1} \quad\text{and}\quad \boldsymbol{\mu} = \frac{1}{\sigma^2}\,\Sigma X^\top \mathbf{y}$$
(here $X$ stacks the training features as rows and $\mathbf{y}$ stacks the labels)
Note that $\boldsymbol{\mu}$ is simply the MAP solution – makes sense since MAP is the mode of the posterior and for a Gaussian, the mean is the mode
Predictive Posterior: $\mathbb{P}[y^* \mid \mathbf{x}^*, D] = \mathcal{N}(y^*; m, s^2)$ where $m = \mathbf{x}^{*\top}\boldsymbol{\mu}$ and $s^2 = \sigma^2 + \mathbf{x}^{*\top}\Sigma\,\mathbf{x}^*$
Note: in this case, the variance of the predicted distribution depends on the test point $\mathbf{x}^*$ itself – we can deduce whether the prediction is confident or not!
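Here is a compact sketch of these closed forms under the same assumptions (synthetic data; sigma is the likelihood noise level and gamma the prior scale): the posterior mean and covariance of $\mathbf{w}$, and the predictive mean and variance at a test point.

```python
# Hypothetical sketch: closed-form Bayesian linear regression with Gaussian
# likelihood N(y; w^T x, sigma^2) and Gaussian prior N(w; 0, gamma^2 I).
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 4
sigma, gamma = 0.3, 1.0
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(scale=sigma, size=n)

# Posterior over w is Gaussian N(mu_post, Sigma_post)
Sigma_post = np.linalg.inv(X.T @ X / sigma**2 + np.eye(d) / gamma**2)
mu_post = Sigma_post @ X.T @ y / sigma**2      # also the MAP / ridge solution

# Predictive posterior at a test point x* is Gaussian as well
x_star = rng.normal(size=d)
pred_mean = x_star @ mu_post
pred_var = sigma**2 + x_star @ Sigma_post @ x_star   # depends on x* itself!
print("prediction:", pred_mean, "+/-", np.sqrt(pred_var))
```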
Conjugate Priors
Bayesian Logistic Regression or Softmax Regression is not nearly as
pretty – no closed form solutions for posterior or predictive posterior
However, some likelihood-prior pairs are special
Always yield a posterior that is of the same family as the prior
Such pairs of distributions are called conjugate pairs
The prior in such cases is said to be conjugate to the likelihood
Gaussian-Gaussian is one example – warning: one Gaussian is a likelihood
over reals, the other Gaussian is a prior over models (i.e. vectors)
Other conjugate pairs exist too
Discussion beyond the scope of CS771 – courses like CS772 discuss
this
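For a flavour of what conjugacy buys us, here is a tiny sketch with a different conjugate pair (Beta prior, Bernoulli likelihood – an illustration, not a pair used in these slides): the posterior is again a Beta distribution, obtainable by simply updating two counts.

```python
# Hypothetical illustration of conjugacy: Beta prior + Bernoulli likelihood
# gives a Beta posterior in closed form (just update the two counts).
import numpy as np

rng = np.random.default_rng(4)
flips = rng.binomial(1, 0.7, size=100)     # observed coin flips
a0, b0 = 2.0, 2.0                          # Beta(a0, b0) prior on the bias
a_post = a0 + flips.sum()                  # posterior is Beta(a_post, b_post)
b_post = b0 + (1 - flips).sum()
print("posterior mean of the coin bias:", a_post / (a_post + b_post))
```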
Generative ML
Generative Models
So far, we looked at probability theory as a tool to express the belief of
an ML algorithm that the true label is such and such
Likelihood: given a model $\mathbf{w}$, it tells us $\mathbb{P}[y \mid \mathbf{x}, \mathbf{w}]$
We also looked at how to use probability theory to express our beliefs
about which models are preferred by us and which are not
Prior: this just tells us $\mathbb{P}[\mathbf{w}]$
Notice that in all of this, the data features $\mathbf{x}$ were always considered constant and never questioned as being random or flexible
Can we also talk about $\mathbb{P}[\mathbf{x} \mid y]$?
Very beneficial: given a label $y$, this would allow us to generate a new $\mathbf{x}$ from the distribution $\mathbb{P}[\mathbf{x} \mid y]$
Can generate new cat images, new laptop designs (GANs do this very thing!)
Generative Algorithms
ML algos that can learn distributions of the form $\mathbb{P}[\mathbf{x}]$, $\mathbb{P}[\mathbf{x} \mid y]$, or $\mathbb{P}[\mathbf{x}, y]$
A slightly funny bit of terminology used in machine learning
Discriminative Algorithms: those that only use $\mathbb{P}[y \mid \mathbf{x}]$ to do their stuff
Generative Algorithms: those that use $\mathbb{P}[\mathbf{x} \mid y]$ or $\mathbb{P}[\mathbf{x}, y]$ etc. to do their stuff
Generative Algorithms have their advantages and disadvantages
More expensive: slower train times, slower test times, larger models
An overkill: often we need only $\mathbb{P}[y \mid \mathbf{x}]$ to make predictions – disc. algos are enough!
More frugal: can work even if we have very little training data (e.g. RecSys)
More robust: can work even if features are corrupted, e.g. some features missing
A recent application of generative techniques (GANs etc) allows us to
Generate novel examples of a certain class of data points
Generate more training examples for those classes as well!
A very simple generative model
Given a few feature vectors $\mathbf{x}_1, \dots, \mathbf{x}_n \in \mathbb{R}^d$ (never mind labels for now)
We wish to learn a probability distribution with support over $\mathbb{R}^d$
This distribution should capture interesting properties about the data in a way
that allows us to do things like generate similar-looking feature vectors etc
Let us try to learn a standard (unit-covariance) Gaussian as this distribution, i.e. we wish to learn a mean $\boldsymbol{\mu}$ so that the distribution $\mathcal{N}(\boldsymbol{\mu}, I_d)$ explains the data well
One way is to look for a $\boldsymbol{\mu}$ that achieves maximum likelihood i.e. MLE!!
As before, assume that our feature vectors were independently generated
$$\hat{\boldsymbol{\mu}} = \arg\max_{\boldsymbol{\mu}} \sum_{i=1}^n \log \mathcal{N}(\mathbf{x}_i; \boldsymbol{\mu}, I_d) = \arg\min_{\boldsymbol{\mu}} \frac{1}{2}\sum_{i=1}^n \|\mathbf{x}_i - \boldsymbol{\mu}\|_2^2$$
which, upon applying first order optimality, gives us $\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i$
We just learnt $\mathcal{N}(\hat{\boldsymbol{\mu}}, I_d)$ as our generating dist. for data features!
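A short sketch of this simple model on made-up data: the MLE of the mean is just the average of the feature vectors, after which we can sample new, similar-looking feature vectors from the learnt Gaussian.

```python
# Hypothetical sketch: fit N(mu, I) to feature vectors by MLE (mu_hat is the
# sample mean), then generate new feature vectors from the learnt Gaussian.
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(loc=[2.0, -1.0, 0.5], size=(500, 3))   # the given feature vectors

mu_hat = X.mean(axis=0)                                # MLE of the mean
X_new = rng.normal(loc=mu_hat, size=(10, 3))           # newly generated samples
print("learnt mean:", mu_hat)
```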
A more powerful generative model
Recall that in general, the FO optimality technique cannot give
us the optima in presence of constraints. We just got lucky in this
case that the FO optima happened to satisfy the constraint
Suppose we are not satisfied with the above simple model
Suppose we wish to instead learn a mean $\boldsymbol{\mu}$ as well as a covariance matrix $\Sigma$ so that the distribution $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ explains the data well
Log likelihood function (be careful – we cannot ignore any terms now):
$$\log \mathcal{L}(\boldsymbol{\mu}, \Sigma) = -\frac{1}{2}\sum_{i=1}^n (\mathbf{x}_i - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) - \frac{n}{2}\log\det\Sigma - \frac{nd}{2}\log(2\pi)$$
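A sketch of this richer model under the same assumptions (made-up data): the MLE turns out to be the sample mean and the $1/n$-normalized sample covariance, and we can again sample new feature vectors from the learnt Gaussian.

```python
# Hypothetical sketch: fit N(mu, Sigma) to feature vectors by MLE and sample
# new points. mu_hat = sample mean, Sigma_hat = (1/n) sum of outer products.
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(3, 3))                               # mixing matrix
X = rng.normal(size=(500, 3)) @ A.T + np.array([1.0, 0.0, -2.0])

mu_hat = X.mean(axis=0)
Xc = X - mu_hat
Sigma_hat = Xc.T @ Xc / X.shape[0]                        # MLE covariance (1/n)
X_new = rng.multivariate_normal(mu_hat, Sigma_hat, size=10)
print("learnt mean:", mu_hat)
print("learnt covariance:\n", Sigma_hat)
```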