

J. Elder

CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

Probability & Bayesian Inference

Some of these slides were sourced and/or modified from Bishop, Microsoft UK.

Linear Regression Topics

What is linear regression?

Example: polynomial curve fitting
Other basis families
Solving linear regression problems
Regularized regression
Multiple linear regression
Bayesian linear regression

What is Linear Regression?

In classification, we seek to identify the categorical class $C_k$
associated with a given input vector x.
In regression, we seek to identify (or estimate) a continuous
variable y associated with a given input vector x.
y is called the dependent variable.
x is called the independent variable.
If y is a vector, we call this multiple regression.
We will focus on the case where y is a scalar.
y will denote the continuous model of the dependent variable
t will denote discrete noisy observations of the dependent
variable (sometimes called the target variable).

Where is the Linear in Linear Regression?

In regression we assume that y is a function of x.

The exact nature of this function is governed by an
unknown parameter vector w:
$y = y(\mathbf{x}, \mathbf{w})$
The regression is linear if y is linear in w. In other
words, we can express y as

$y = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$

where $\boldsymbol{\phi}(\mathbf{x})$ is some (potentially nonlinear) function of x.

Linear Basis Function Models


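Generally,

$y(\mathbf{x}, \mathbf{w}) = \sum_j w_j \phi_j(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$

where $\phi_j(\mathbf{x})$ are known as basis functions.

Typically, $\phi_0(\mathbf{x}) = 1$, so that $w_0$ acts as a bias.
In the simplest case, we use linear basis functions:
$\phi_d(\mathbf{x}) = x_d$.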
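As a minimal sketch of the linear basis function model, the snippet below stacks the basis function values into a design matrix and evaluates $y = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$ for each input. The helper name `design_matrix`, the inputs and the weights are illustrative assumptions, not part of the slides.

```python
import numpy as np

def design_matrix(X, basis_functions):
    """Stack phi_j(x_n) into an N x M matrix, one row per input vector."""
    return np.array([[phi(x) for phi in basis_functions] for x in X])

# Hypothetical 2-D inputs; the values are illustrative only.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# phi_0(x) = 1 (bias), phi_d(x) = x_d (linear basis functions).
linear_basis = [lambda x: 1.0, lambda x: x[0], lambda x: x[1]]

Phi = design_matrix(X, linear_basis)
w = np.array([0.5, 2.0, -1.0])   # w_0 is the bias weight

y = Phi @ w                      # y_n = w^T phi(x_n) for every row
print(y)
```

Swapping `linear_basis` for any other list of functions changes the model family while leaving it linear in w.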

Example: Polynomial Bases

Polynomial basis functions: $\phi_j(x) = x^j$

These are global:
a small change in x affects all basis functions;
a small change in a basis function affects y for all x.

Example: Polynomial Curve Fitting

Sum-of-Squares Error Function

1st Order Polynomial

3rd Order Polynomial

9th Order Polynomial
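The polynomial fits of order 1, 3 and 9 can be sketched as follows, assuming the usual sum-of-squares error $E(\mathbf{w}) = \frac{1}{2} \sum_n \{ y(x_n, \mathbf{w}) - t_n \}^2$. The data set (10 noisy samples of a sinusoid), the noise level and the helper names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed so the run is repeatable

# Assumed toy data: N = 10 noisy samples of sin(2*pi*x) on [0, 1].
N = 10
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

def fit_polynomial(x, t, order):
    """Least-squares weights for y(x, w) = sum_j w_j x**j, j = 0..order."""
    Phi = np.vander(x, order + 1, increasing=True)   # columns 1, x, x^2, ...
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def sum_of_squares_error(x, t, w):
    """E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2."""
    y = np.vander(x, len(w), increasing=True) @ w
    return 0.5 * np.sum((y - t) ** 2)

errors = {}
for order in (1, 3, 9):
    w = fit_polynomial(x, t, order)
    errors[order] = sum_of_squares_error(x, t, w)
    print(f"order {order}: training E(w) = {errors[order]:.3e}")
```

With 10 points, the 9th-order polynomial has as many weights as observations, so its training error drops to numerical zero even though the fit follows the noise.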



Penalize large coefficient values

Probabilistic View of Curve Fitting


Why least squares?

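Model noise (deviation of data from model) as
Gaussian i.i.d.:

$p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\left(t \mid y(x, \mathbf{w}), \beta^{-1}\right)$

where $\beta \equiv 1/\sigma^2$ is the precision of the noise.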

Maximum Likelihood


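The log likelihood of N i.i.d. observations is

$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi)$

We determine $\mathbf{w}_{ML}$ by minimizing the squared error E(w).

Thus least-squares regression reflects an assumption that the
noise is i.i.d. Gaussian.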


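Now given $\mathbf{w}_{ML}$, we can estimate the variance of the noise:

$\frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}_{ML}) - t_n \}^2$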
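A short sketch of this two-step estimate: fit the weights by least squares, then take the mean squared residual as the noise variance. The synthetic straight-line data and the true noise level are assumptions chosen so the estimate can be compared with a known value.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed synthetic data: a straight line plus Gaussian noise of known std.
N = 2000
true_sigma = 0.5
x = rng.uniform(-1.0, 1.0, size=N)
t = 1.0 + 2.0 * x + rng.normal(scale=true_sigma, size=N)

Phi = np.column_stack([np.ones(N), x])           # basis: phi_0 = 1, phi_1 = x
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # minimizes E(w)

residuals = Phi @ w_ml - t
var_ml = np.mean(residuals ** 2)                 # estimate of 1 / beta
beta_ml = 1.0 / var_ml

print("w_ML     =", w_ml)
print("sigma_ML =", np.sqrt(var_ml))
```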

Predictive Distribution

[Figure legend: generating function; observed data; maximum likelihood prediction; posterior over t]

MAP: A Step towards Bayes


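Prior knowledge about probable values of w can be incorporated into the
regression through a prior, here a zero-mean isotropic Gaussian with precision $\alpha$:

$p(\mathbf{w} \mid \alpha) = \mathcal{N}\left(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}\right)$

Now the posterior over w is proportional to the product of the likelihood
and the prior:

$p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) \, p(\mathbf{w} \mid \alpha)$

The result is to introduce a new quadratic term in w into the error function
to be minimized:

$\tilde{E}(\mathbf{w}) = \frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\alpha}{2} \mathbf{w}^\top \mathbf{w}$

Thus regularized (ridge) regression reflects a zero-mean isotropic Gaussian
prior on the weights.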

Gaussian Bases


Gaussian basis functions:

$\phi_j(x) = \exp\left\{ -\frac{(x - \mu_j)^2}{2 s^2} \right\}$

Think of these as interpolation functions.

These are local:
a small change in x affects only nearby basis functions;
a small change in a basis function affects y only for nearby x.

$\mu_j$ and $s$ control location and scale (width).
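The locality of Gaussian bases can be checked numerically. This is a sketch assuming the standard form $\phi_j(x) = \exp\{-(x - \mu_j)^2 / (2 s^2)\}$; the nine evenly spaced centers and the width are illustrative choices.

```python
import numpy as np

def gaussian_basis(x, centers, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)) for each center mu_j."""
    x = np.asarray(x, dtype=float)[:, None]        # shape (N, 1)
    return np.exp(-(x - centers[None, :]) ** 2 / (2.0 * s ** 2))

centers = np.linspace(0.0, 1.0, 9)   # assumed: 9 evenly spaced locations mu_j
s = 0.1                              # assumed common scale (width)

Phi = gaussian_basis([0.0, 0.5], centers, s)
# Locality: at x = 0.5 the basis centred at 0.5 responds fully,
# while the bases centred at 0 and 1 are essentially silent.
print(np.round(Phi[1], 4))
```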

Maximum Likelihood and Linear Least Squares


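Assume observations from a deterministic function with
added Gaussian noise:

$t = y(\mathbf{x}, \mathbf{w}) + \epsilon$, where $p(\epsilon \mid \beta) = \mathcal{N}(\epsilon \mid 0, \beta^{-1})$

which is the same as saying,

$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}\left(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1}\right)$

Given observed inputs, $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, and targets, $\mathbf{t} = [t_1, \ldots, t_N]^\top$,
we obtain the likelihood

$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\left(t_n \mid \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}\right)$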



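Taking the logarithm, we get

$\ln p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}\left(t_n \mid \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}\right) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(\mathbf{w})$

where

$E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) \}^2$

is the sum-of-squares error.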

Maximum Likelihood and Least Squares


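Computing the gradient and setting it to zero yields

$\nabla_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \beta \sum_{n=1}^{N} \{ t_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) \} \boldsymbol{\phi}(\mathbf{x}_n)^\top = \mathbf{0}$

Solving for w, we get

$\mathbf{w}_{ML} = \left( \boldsymbol{\Phi}^\top \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^\top \mathbf{t} = \boldsymbol{\Phi}^\dagger \mathbf{t}$

where $\boldsymbol{\Phi}$ is the N x M design matrix with elements $\Phi_{nj} = \phi_j(\mathbf{x}_n)$, and
$\boldsymbol{\Phi}^\dagger \equiv (\boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top$ is the Moore-Penrose pseudo-inverse.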
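As a sketch of the closed-form solution, the snippet below solves the normal equations, compares the result with the pseudo-inverse route, and checks that the gradient vanishes at the solution. The cubic polynomial basis, the true weights and the noise level are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed toy problem: cubic polynomial basis, N = 50 noisy targets.
N = 50
x = rng.uniform(-1.0, 1.0, size=N)
Phi = np.vander(x, 4, increasing=True)           # N x M design matrix
w_true = np.array([0.5, -1.0, 0.0, 2.0])
t = Phi @ w_true + rng.normal(scale=0.1, size=N)

# Normal equations: (Phi^T Phi) w = Phi^T t.
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Same answer through the Moore-Penrose pseudo-inverse.
w_pinv = np.linalg.pinv(Phi) @ t

# At the solution the gradient Phi^T (Phi w - t) vanishes.
gradient = Phi.T @ (Phi @ w_normal - t)

print("w (normal equations):", np.round(w_normal, 3))
print("w (pseudo-inverse):  ", np.round(w_pinv, 3))
```

In practice `np.linalg.lstsq` or `np.linalg.pinv` is usually preferred over forming $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ explicitly, since squaring the design matrix also squares its condition number.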


End of Lecture 8

Regularized Least Squares


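Consider the error function:

$E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})$

Data term + Regularization term

With the sum-of-squares error function and a
quadratic regularizer, we get

$\frac{1}{2} \sum_{n=1}^{N} \{ t_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) \}^2 + \frac{\lambda}{2} \mathbf{w}^\top \mathbf{w}$

which is minimized by

$\mathbf{w} = \left( \lambda \mathbf{I} + \boldsymbol{\Phi}^\top \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^\top \mathbf{t}$

$\lambda$ is called the regularization coefficient.

The $\lambda \mathbf{I}$ term adds a ridge along the diagonal of $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$. Thus the name ridge regression.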
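A minimal sketch of the ridge solution $\mathbf{w} = (\lambda \mathbf{I} + \boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \mathbf{t}$ applied to a 9th-order polynomial. The data, the helper name `ridge_weights` and the three values of $\lambda$ are illustrative assumptions.

```python
import numpy as np

def ridge_weights(Phi, t, lam):
    """w = (lam * I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

rng = np.random.default_rng(3)
N = 10
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)   # assumed toy data
Phi = np.vander(x, 10, increasing=True)                     # 9th-order polynomial

norms = {}
for lam in (1e-6, 1e-3, 1.0):    # illustrative values of lambda
    w = ridge_weights(Phi, t, lam)
    norms[lam] = np.linalg.norm(w)
    print(f"lambda = {lam:g}: ||w|| = {norms[lam]:.3g}")
```

Larger values of $\lambda$ shrink the weight vector, which is the sense in which the regularizer penalizes large coefficient values.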



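With a more general regularizer, we have

$\frac{1}{2} \sum_{n=1}^{N} \{ t_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) \}^2 + \frac{\lambda}{2} \sum_{j} |w_j|^q$

q = 1: Lasso (least absolute shrinkage and selection operator)
q = 2: quadratic (ridge)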



Lasso generates sparse solutions.

[Figure: iso-contours of the data term $E_D(\mathbf{w})$ and an iso-contour of the
regularization term $E_W(\mathbf{w})$]



Solving Regularized Systems


Quadratic regularization has the advantage that
the solution is closed form.
Non-quadratic regularizers generally do not have
closed-form solutions.
Lasso can be framed as minimizing a quadratic
error with linear constraints, and thus represents a
convex optimization problem that can be solved by
quadratic programming or other convex
optimization methods.
We will discuss quadratic programming when we
cover SVMs.
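To illustrate the "other convex optimization methods" route, here is a sketch that solves the lasso problem by proximal gradient descent (ISTA) rather than quadratic programming. This is a swapped-in technique chosen for brevity, not the method the slides go on to discuss; the random design matrix, the sparse true weights and $\lambda = 20$ are assumptions.

```python
import numpy as np

def soft_threshold(v, thresh):
    """Proximal operator of thresh * ||v||_1: shrink each entry towards zero."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def lasso_ista(Phi, t, lam, n_iter=2000):
    """Minimize 1/2 ||t - Phi w||^2 + lam/2 * sum_j |w_j| by proximal gradient."""
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2     # 1 / Lipschitz constant of the gradient
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ w - t)             # gradient of the data term
        w = soft_threshold(w - step * grad, step * lam / 2.0)
    return w

rng = np.random.default_rng(4)
N, M = 100, 8
Phi = rng.normal(size=(N, M))                    # assumed random design matrix
w_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 0.0])
t = Phi @ w_true + rng.normal(scale=0.1, size=N)

lam = 20.0
w_lasso = lasso_ista(Phi, t, lam)
w_ridge = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

print("lasso:", np.round(w_lasso, 3))
print("ridge:", np.round(w_ridge, 3))
```

The lasso weights for the irrelevant basis functions are driven exactly to zero by the thresholding step, whereas the ridge weights are only shrunk.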

Multiple Outputs


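Analogous to the single output case we have:

$p(\mathbf{t} \mid \mathbf{x}, \mathbf{W}, \beta) = \mathcal{N}\left(\mathbf{t} \mid \mathbf{W}^\top \boldsymbol{\phi}(\mathbf{x}), \beta^{-1} \mathbf{I}\right)$

Given observed inputs, $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, and targets, $\mathbf{T} = [\mathbf{t}_1, \ldots, \mathbf{t}_N]^\top$,
we obtain the log likelihood function

$\ln p(\mathbf{T} \mid \mathbf{X}, \mathbf{W}, \beta) = \frac{NK}{2} \ln\left(\frac{\beta}{2\pi}\right) - \frac{\beta}{2} \sum_{n=1}^{N} \left\| \mathbf{t}_n - \mathbf{W}^\top \boldsymbol{\phi}(\mathbf{x}_n) \right\|^2$

where K is the dimension of each target vector.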



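Maximizing with respect to W, we obtain

$\mathbf{W}_{ML} = \left( \boldsymbol{\Phi}^\top \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^\top \mathbf{T}$

If we consider a single target variable, $t_k$, we see that

$\mathbf{w}_k = \left( \boldsymbol{\Phi}^\top \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^\top \mathbf{t}_k = \boldsymbol{\Phi}^\dagger \mathbf{t}_k$

where $\mathbf{t}_k = [t_{1k}, \ldots, t_{Nk}]^\top$, which is identical with the single output case.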
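The decoupling across outputs can be verified directly: solving for all outputs jointly gives the same weights as solving each output column on its own. The sizes, random design matrix and noise level below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

N, M, K = 40, 3, 2                      # assumed sizes: samples, basis functions, outputs
Phi = rng.normal(size=(N, M))           # N x M design matrix
W_true = rng.normal(size=(M, K))
T = Phi @ W_true + rng.normal(scale=0.05, size=(N, K))   # N x K target matrix

# Joint solution: W_ML = pinv(Phi) @ T.
W_ml = np.linalg.pinv(Phi) @ T

# Column-by-column solution: w_k is the least-squares fit to t_k alone.
W_cols = np.column_stack(
    [np.linalg.lstsq(Phi, T[:, k], rcond=None)[0] for k in range(K)]
)

print(np.round(W_ml, 3))
```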

Some Useful MATLAB Functions



polyfit
Least-squares fit of a polynomial of specified order to given data

regress
More general function that computes linear weights for least-squares fit

Bayesian Linear Regression

Rev. Thomas Bayes, 1702 - 1761


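Define a conjugate prior over w:

$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$

Combining this with the likelihood function and using
results for marginal and conditional Gaussian
distributions gives the posterior

$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$

where

$\mathbf{m}_N = \mathbf{S}_N \left( \mathbf{S}_0^{-1} \mathbf{m}_0 + \beta \boldsymbol{\Phi}^\top \mathbf{t} \right)$
$\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta \boldsymbol{\Phi}^\top \boldsymbol{\Phi}$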




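A common choice for the prior is

$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I})$

for which

$\mathbf{m}_N = \beta \mathbf{S}_N \boldsymbol{\Phi}^\top \mathbf{t}$
$\mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^\top \boldsymbol{\Phi}$

Thus $\mathbf{m}_N$ represents the ridge regression solution with
$\lambda = \alpha / \beta$.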
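The link between the posterior mean and ridge regression can be checked numerically. This sketch assumes the zero-mean isotropic prior with precision $\alpha$ and noise precision $\beta$; the straight-line model, the data and the precision values are illustrative.

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior N(w | m_N, S_N) for the prior N(w | 0, alpha^{-1} I)."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

rng = np.random.default_rng(6)
alpha, beta = 2.0, 25.0                  # assumed prior and noise precisions
N = 20
x = rng.uniform(-1.0, 1.0, size=N)
Phi = np.column_stack([np.ones(N), x])   # straight-line model y = w_0 + w_1 x
t = -0.3 + 0.5 * x + rng.normal(scale=beta ** -0.5, size=N)

m_N, S_N = posterior(Phi, t, alpha, beta)

# Ridge solution with lambda = alpha / beta.
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(2) + Phi.T @ Phi, Phi.T @ t)

print("posterior mean:", np.round(m_N, 4))
print("ridge solution:", np.round(w_ridge, 4))
```

Unlike ridge regression, the Bayesian treatment also returns the covariance `S_N`, which quantifies the remaining uncertainty in the weights.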

Next we consider an example


0 data points observed


Data Space



1 data point observed

Likelihood for $(x_1, t_1)$


Data Space



2 data points observed

Likelihood for $(x_2, t_2)$


Data Space



20 data points observed

Likelihood for $(x_{20}, t_{20})$


Data Space

Predictive Distribution


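Predict t for new values of x by integrating over w:

$p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid \mathbf{w}, \beta) \, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta) \, d\mathbf{w} = \mathcal{N}\left(t \mid \mathbf{m}_N^\top \boldsymbol{\phi}(\mathbf{x}), \sigma_N^2(\mathbf{x})\right)$

where

$\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^\top \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x})$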
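A sketch of the predictive distribution under the zero-mean isotropic prior, using the standard Gaussian result: mean $\mathbf{m}_N^\top \boldsymbol{\phi}(x)$ and variance $1/\beta + \boldsymbol{\phi}(x)^\top \mathbf{S}_N \boldsymbol{\phi}(x)$. The sinusoidal data, the nine Gaussian bases, the precisions and the helper names are assumptions; the data sets are nested so that the effect of adding observations is visible.

```python
import numpy as np

def gaussian_basis(x, centers, s):
    """N x M matrix of Gaussian basis function values."""
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]
    return np.exp(-(x - centers[None, :]) ** 2 / (2.0 * s ** 2))

def predictive(x_new, Phi, t, centers, s, alpha, beta):
    """Mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x)."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    phi_new = gaussian_basis(x_new, centers, s)
    mean = phi_new @ m_N
    var = 1.0 / beta + np.sum((phi_new @ S_N) * phi_new, axis=1)
    return mean, var

rng = np.random.default_rng(7)
alpha, beta = 2.0, 25.0                       # assumed precisions
centers, s = np.linspace(0.0, 1.0, 9), 0.1    # assumed 9 Gaussian bases
x_grid = np.linspace(0.0, 1.0, 5)

# Assumed sinusoidal data; smaller data sets are prefixes of the largest one.
x_all = rng.uniform(0.0, 1.0, size=25)
t_all = np.sin(2 * np.pi * x_all) + rng.normal(scale=beta ** -0.5, size=25)

variances = {}
for N in (1, 4, 25):
    Phi = gaussian_basis(x_all[:N], centers, s)
    mean, var = predictive(x_grid, Phi, t_all[:N], centers, s, alpha, beta)
    variances[N] = var
    print(f"N = {N:2d}: predictive std on grid = {np.round(np.sqrt(var), 3)}")
```

The predictive variance never falls below the noise floor $1/\beta$, and it cannot increase as further data points are added.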




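Example: Sinusoidal data, 9 Gaussian basis functions, 1 data point

Notice how much bigger our uncertainty is relative to the ML method!

[Figure panels: predictive distribution $p(t \mid \mathbf{t}, \alpha, \beta)$ with mean $E[t \mid \mathbf{t}, \alpha, \beta]$; samples of $y(x, \mathbf{w})$]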



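Example: Sinusoidal data, 9 Gaussian basis functions, 2 data points

[Figure panels: predictive distribution $p(t \mid \mathbf{t}, \alpha, \beta)$ with mean $E[t \mid \mathbf{t}, \alpha, \beta]$; samples of $y(x, \mathbf{w})$]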



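Example: Sinusoidal data, 9 Gaussian basis functions, 4 data points

[Figure panels: predictive distribution $p(t \mid \mathbf{t}, \alpha, \beta)$ with mean $E[t \mid \mathbf{t}, \alpha, \beta]$; samples of $y(x, \mathbf{w})$]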



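Example: Sinusoidal data, 9 Gaussian basis functions, 25 data points

[Figure panels: predictive distribution $p(t \mid \mathbf{t}, \alpha, \beta)$ with mean $E[t \mid \mathbf{t}, \alpha, \beta]$; samples of $y(x, \mathbf{w})$]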

Equivalent Kernel


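The predictive mean can be written

$y(\mathbf{x}, \mathbf{m}_N) = \mathbf{m}_N^\top \boldsymbol{\phi}(\mathbf{x}) = \beta \boldsymbol{\phi}(\mathbf{x})^\top \mathbf{S}_N \boldsymbol{\Phi}^\top \mathbf{t} = \sum_{n=1}^{N} k(\mathbf{x}, \mathbf{x}_n) \, t_n$

where $k(\mathbf{x}, \mathbf{x}') = \beta \boldsymbol{\phi}(\mathbf{x})^\top \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x}')$ is the equivalent kernel or
smoother matrix.

This is a weighted sum of the training data target
values, $t_n$.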


Weight of $t_n$ depends on distance between x and $x_n$;
nearby $x_n$ carry more weight.
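The equivalent-kernel view can be sketched by computing the weights $k(x, x_n) = \beta \boldsymbol{\phi}(x)^\top \mathbf{S}_N \boldsymbol{\phi}(x_n)$ explicitly and confirming that the weighted sum of targets reproduces the predictive mean. The Gaussian bases, the data and the precisions are assumptions for illustration.

```python
import numpy as np

def gaussian_basis(x, centers, s):
    """N x M matrix of Gaussian basis function values."""
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]
    return np.exp(-(x - centers[None, :]) ** 2 / (2.0 * s ** 2))

rng = np.random.default_rng(8)
alpha, beta = 2.0, 25.0                       # assumed precisions
centers, s = np.linspace(0.0, 1.0, 9), 0.1    # assumed 9 Gaussian bases

N = 30
x_train = np.sort(rng.uniform(0.0, 1.0, size=N))
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=beta ** -0.5, size=N)

Phi = gaussian_basis(x_train, centers, s)
S_N = np.linalg.inv(alpha * np.eye(len(centers)) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t_train

x_query = np.array([0.5])
phi_q = gaussian_basis(x_query, centers, s)        # shape (1, M)

# k(x, x_n) = beta * phi(x)^T S_N phi(x_n): one weight per training target.
k = beta * phi_q @ S_N @ Phi.T                     # shape (1, N)

mean_from_kernel = k @ t_train
mean_from_weights = phi_q @ m_N

print("weights k(x, x_n):", np.round(k[0], 3))
print("predictive mean:  ", mean_from_kernel, mean_from_weights)
```

Printing the weights against the sorted `x_train` shows which training targets contribute most to the prediction at the query point.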