
Linear and Nonlinear Regression and Classification

Ronald J. Williams
CSG220, Spring 2007

Containing a number of slides adapted from the Andrew Moore tutorial
"Regression and Classification with Neural Networks"

Note to other teachers and users of these slides: Andrew would be delighted
if you found this source material useful in giving your own lectures. Feel
free to use these slides verbatim, or to modify them to fit your own needs.
PowerPoint originals are available. If you make use of a significant portion
of these slides in your own lecture, please include this message, or the
following link to the source repository of Andrew's tutorials:
http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully
received.

Originals © 2001, Andrew W. Moore, Modifications © 2003, Ronald J. Williams

Linear Regression

DATASET
    inputs       outputs
    x1 = 1       y1 = 1
    x2 = 3       y2 = 2.2
    x3 = 2       y3 = 2
    x4 = 1.5     y4 = 1.9
    x5 = 4       y5 = 3.1

(Figure: the datapoints plotted with a fitted line of slope w through the origin.)

Linear regression assumes that the expected value of the output given an
input, E[y|x], is linear.
Simplest case: Out(x) = wx for some unknown w.
Given the data, we can estimate w.

1-parameter linear regression

Assume that the data is formed by
    y_i = w x_i + noise_i
where...
• the noise signals are independent
• the noise has a normal distribution with mean 0 and unknown variance σ²

P(y|w,x) has a normal distribution with
• mean wx
• variance σ²

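A minimal Python/numpy sketch of this data-generating assumption, using the inputs from the DATASET slide; the values of w and σ here are illustrative choices, not taken from the slides:

import numpy as np

w_true = 0.8     # assumed unknown slope, chosen only for illustration
sigma = 0.2      # assumed noise standard deviation

rng = np.random.default_rng(0)
x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])                 # inputs from the DATASET slide
y = w_true * x + rng.normal(0.0, sigma, size=x.shape)   # y_i = w x_i + noise_i
print(y)
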

Bayesian Linear Regression

P(y|w,x) = Normal(mean wx, var σ²)

We have a set of datapoints (x1,y1), (x2,y2), ..., (xR,yR) which are
EVIDENCE about w.

We want to infer w from the data:
    P(w | x1, x2, ..., xR, y1, y2, ..., yR)

• You can use BAYES rule to work out a posterior distribution for w given
  the data.
• Or you could do Maximum Likelihood Estimation.

Maximum likelihood estimation of w

Asks the question:
"For which value of w is this data most likely to have happened?"
<=>
For what w is
    P(y1, y2, ..., yR | x1, x2, ..., xR, w)
maximized?
<=>
For what w is
    \prod_{i=1}^{R} P(y_i \mid w, x_i)
maximized?


For what w is
    \prod_{i=1}^{R} P(y_i \mid w, x_i)
maximized?

For what w is
    \prod_{i=1}^{R} \exp\!\left(-\frac{1}{2}\left(\frac{y_i - w x_i}{\sigma}\right)^{2}\right)
maximized?

For what w is
    \sum_{i=1}^{R} -\frac{1}{2}\left(\frac{y_i - w x_i}{\sigma}\right)^{2}
maximized?

For what w is
    \sum_{i=1}^{R} \left(y_i - w x_i\right)^{2}
minimized?

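One step worth spelling out (not on the original slide): passing from the product to the sum is just taking the logarithm, which is monotone and therefore does not change the maximizing w:

    \log \prod_{i=1}^{R} \exp\!\left(-\frac{1}{2}\left(\frac{y_i - w x_i}{\sigma}\right)^{2}\right)
      = \sum_{i=1}^{R} -\frac{1}{2}\left(\frac{y_i - w x_i}{\sigma}\right)^{2}
      = -\frac{1}{2\sigma^{2}} \sum_{i=1}^{R} (y_i - w x_i)^{2}

Since σ² is a fixed positive constant, maximizing this expression is the same as minimizing \sum_{i=1}^{R} (y_i - w x_i)^{2}.
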
Linear Regression

The maximum likelihood w is the one that minimizes the sum of squares of
the residuals:

    \mathrm{E}(w) = \sum_i (y_i - w x_i)^{2}
                  = \sum_i y_i^{2} - \left(2 \sum_i x_i y_i\right) w + \left(\sum_i x_i^{2}\right) w^{2}

We want to minimize a quadratic function of w.

(Figure: E(w) plotted against w, a parabola.)

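Because E(w) is a quadratic in w with positive leading coefficient \sum_i x_i^{2}, the minimizer quoted on the next slide follows from setting the derivative to zero:

    \frac{d\mathrm{E}}{dw} = -2\sum_i x_i y_i + 2\left(\sum_i x_i^{2}\right) w = 0
    \quad\Longrightarrow\quad
    w = \frac{\sum_i x_i y_i}{\sum_i x_i^{2}}
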

Linear Regression

It is easy to show that the sum of squares is minimized when

    w = \frac{\sum_i x_i y_i}{\sum_i x_i^{2}}

The maximum likelihood model is
    Out(x) = wx
We can use it for prediction.

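A minimal numpy sketch of this closed form, applied to the DATASET from the first Linear Regression slide (the variable names are mine):

import numpy as np

x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])
y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

w = np.sum(x * y) / np.sum(x ** 2)   # w = sum_i x_i y_i / sum_i x_i^2
print(w)                             # maximum likelihood slope
print(w * 2.5)                       # prediction: Out(2.5) = w * 2.5
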
Linear Regression

It is easy to show that the sum of squares is minimized when

    w = \frac{\sum_i x_i y_i}{\sum_i x_i^{2}}

The maximum likelihood model is
    Out(x) = wx
We can use it for prediction.

Note: In Bayesian stats you'd have ended up with a probability distribution
over w, and predictions would have given a probability distribution over the
expected output. It is often useful to know your confidence; maximum
likelihood can give some kinds of confidence too.

(Figure: the posterior density p(w) plotted against w.)

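To make the Bayesian remark concrete: if we put a Gaussian prior on w (say N(0, τ²)) and treat the noise variance σ² as known, both the posterior over w and the predictive distribution at a new input are Gaussian and available in closed form. A sketch under those assumptions (τ², σ², and the prior mean 0 are illustrative choices, not from the slides):

import numpy as np

x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])
y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

sigma2 = 0.04   # assumed known noise variance
tau2 = 1.0      # assumed prior variance for w (prior mean 0)

# Posterior over w is Gaussian: precisions add, mean is precision-weighted.
post_prec = 1.0 / tau2 + np.sum(x ** 2) / sigma2
post_var = 1.0 / post_prec
post_mean = post_var * np.sum(x * y) / sigma2

# Predictive distribution at a new input x_star is also Gaussian.
x_star = 2.5
pred_mean = post_mean * x_star
pred_var = x_star ** 2 * post_var + sigma2
print(post_mean, post_var, pred_mean, pred_var)
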

Multivariate Regression

What if the inputs are vectors?

(Figure: a 2-d input example; datapoints plotted in the (x1, x2) plane, each
labeled with its output value: 3, 4, 5, 6, 8, 10.)

Dataset has form
    x1   y1
    x2   y2
    x3   y3
    :    :
    xR   yR

Multivariate Regression

Write the matrices X and Y thus:

    X = \begin{bmatrix} \cdots\, x_1 \,\cdots \\ \cdots\, x_2 \,\cdots \\ \vdots \\ \cdots\, x_R \,\cdots \end{bmatrix}
      = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & & & \vdots \\ x_{R1} & x_{R2} & \cdots & x_{Rm} \end{bmatrix},
    \qquad
    Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_R \end{bmatrix}

(There are R datapoints. Each input has m components.)

The linear regression model assumes a vector w such that
    Out(x) = w^T x = w_1 x[1] + w_2 x[2] + ... + w_m x[m]

The maximum likelihood w is  w = (X^T X)^{-1} (X^T Y)

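A minimal numpy sketch of this formula; the data is made up purely for illustration, and in practice one solves the linear system rather than forming the inverse explicitly:

import numpy as np

# R = 4 datapoints, m = 2 input components (illustrative values)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
Y = np.array([5.0, 4.0, 11.0, 10.0])

# w = (X^T X)^{-1} (X^T Y), computed via a linear solve for numerical stability
w = np.linalg.solve(X.T @ X, X.T @ Y)

x_new = np.array([2.0, 2.0])
print(w, w @ x_new)   # Out(x) = w^T x
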

Multivariate Regression (con't)

The maximum likelihood w is  w = (X^T X)^{-1} (X^T Y)

X^T X is an m x m matrix: its (i,j)'th element is  \sum_{k=1}^{R} x_{ki} x_{kj}

X^T Y is an m-element vector: its i'th element is  \sum_{k=1}^{R} x_{ki} y_k

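These are just the ordinary matrix products written out elementwise; a quick numpy check, reusing the illustrative X and Y from the previous sketch:

import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
Y = np.array([5.0, 4.0, 11.0, 10.0])
R, m = X.shape

XtX = np.array([[np.sum(X[:, i] * X[:, j]) for j in range(m)] for i in range(m)])
XtY = np.array([np.sum(X[:, i] * Y) for i in range(m)])

assert np.allclose(XtX, X.T @ X)   # (i,j)'th element is sum_k x_ki x_kj
assert np.allclose(XtY, X.T @ Y)   # i'th element is sum_k x_ki y_k
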
What about a constant term?

We may expect linear data that does not go through the origin.

Statisticians and Neural Net Folks all agree on a simple obvious hack.

Can you guess??


The constant term

• The trick is to create a fake input "X0" that always takes the value 1.

    Before:                    After:
    X1   X2   Y                X0   X1   X2   Y
    2    4    16               1    2    4    16
    3    4    17               1    3    4    17
    5    5    20               1    5    5    20

    Y = w1 X1 + w2 X2          Y = w0 X0 + w1 X1 + w2 X2
    ...has to be a poor model    = w0 + w1 X1 + w2 X2
                               ...has a fine constant term

In this example, you should be able to see the MLE w0, w1, and w2 by
inspection.

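In code, the hack is literally one extra column of ones; a sketch using the table above:

import numpy as np

X = np.array([[2.0, 4.0],
              [3.0, 4.0],
              [5.0, 5.0]])
Y = np.array([16.0, 17.0, 20.0])

X0 = np.column_stack([np.ones(len(X)), X])   # fake input X0 = 1 for every row
w = np.linalg.solve(X0.T @ X0, X0.T @ Y)     # [w0, w1, w2]
print(w)   # this data is fit exactly by w0 = 10, w1 = 1, w2 = 1
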
What about higher-order terms?

Maybe we suspect a higher-order polynomial function like
    y = w0 + w1 x + w2 x^2 + w3 x^3
would fit the data better.

In that case, we can simply perform multivariate linear regression using
additional dimensions for all higher-order terms.


Higher-order terms

    Linear Fit           Quadratic Fit
    1   X   Y            1   X   X^2   Y
    1   1   2            1   1   1     2
    1   2   5            1   2   4     5
    1   3   10           1   3   9     10
    1   5   26           1   5   25    26

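The same data fitted both ways in numpy; the quadratic fit just adds an X² column and runs exactly the same linear-regression solve (a sketch, not code from the slides):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 5.0])
y = np.array([2.0, 5.0, 10.0, 26.0])

# Linear fit: columns [1, X]
X_lin = np.column_stack([np.ones_like(x), x])
w_lin = np.linalg.solve(X_lin.T @ X_lin, X_lin.T @ y)

# Quadratic fit: columns [1, X, X^2]; still plain linear regression in w
X_quad = np.column_stack([np.ones_like(x), x, x ** 2])
w_quad = np.linalg.solve(X_quad.T @ X_quad, X_quad.T @ y)

print(w_lin)    # two coefficients
print(w_quad)   # three coefficients; y = 1 + x^2 fits this data exactly
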
Maximum Likelihood Nonlinear Regression

Assume the correct function is y = f(x, w), where f is any function of the
input x parameterized by w, and observations are corrupted by additive
Gaussian noise (with some fixed variance σ²).

For example, f could be the function computed by a multilayer neural
network whose weights are w.


As before, we would like to determine for what w

    P(y1, y2, ..., yR | x1, x2, ..., xR, w)

is maximized.

And just as before, this translates into:

For what w is
    \prod_{i=1}^{R} P(y_i \mid w, x_i)
maximized?

For what w is
    \prod_{i=1}^{R} \exp\!\left(-\frac{1}{2}\left(\frac{\| y_i - f(x_i, w) \|}{\sigma}\right)^{2}\right)
maximized?

For what w is
    \sum_{i=1}^{R} -\frac{1}{2}\left(\frac{\| y_i - f(x_i, w) \|}{\sigma}\right)^{2}
maximized?

For what w is
    \sum_{i=1}^{R} \| y_i - f(x_i, w) \|^{2}
minimized?

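As one concrete instance of this, here is a small gradient-descent sketch that minimizes the sum of squared residuals for an assumed nonlinear model f(x, w) = w1 (1 - exp(-w2 x)); a neural network trained by backpropagation is the same idea with a more elaborate f and gradients computed by the chain rule:

import numpy as np

# Illustrative nonlinear model; this particular f is an assumption, standing in
# for e.g. a multilayer neural network with weight vector w.
def f(x, w):
    return w[0] * (1.0 - np.exp(-w[1] * x))

rng = np.random.default_rng(0)
x = np.linspace(0.1, 5.0, 50)
y = f(x, np.array([2.0, 1.5])) + rng.normal(0.0, 0.05, size=x.shape)   # synthetic data

w = np.array([1.0, 1.0])   # initial guess
lr = 0.1                   # learning rate
for _ in range(10000):
    resid = y - f(x, w)    # y_i - f(x_i, w)
    # Gradient of the mean squared residual (same minimizer as the sum).
    grad_w1 = -2.0 * np.mean(resid * (1.0 - np.exp(-w[1] * x)))
    grad_w2 = -2.0 * np.mean(resid * w[0] * x * np.exp(-w[1] * x))
    w = w - lr * np.array([grad_w1, grad_w2])

print(w)   # should end up near the generating parameters [2.0, 1.5]
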

• So, for example, with the usual squared-error measure, backpropagation
  can be viewed as a technique for searching for a maximum-likelihood fit
  of a neural network to a given set of training data.
• This applies when neural networks are used for regression, assuming
  additive Gaussian noise.
• What about for classification?

Maximum Likelihood Probability Estimation

• Consider a 2-class classification problem, and assume that the
  probability that an instance x is classified as positive has the
  functional form y = f(x, w).
• Then it can be shown that the correct criterion to optimize to generate
  ML estimates of the probability of belonging to the + class is not
  squared error.


Maximum Cross-Entropy

• Instead, the following cross-entropy measure should be maximized:

    \sum_{i=1}^{R} \left( y_i \log f(x_i, w) + (1 - y_i) \log\!\left(1 - f(x_i, w)\right) \right)

• In a multilayer neural network, the gradient computation for this measure
  still follows the backpropagation process.

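A small sketch of this criterion for one common choice of f, a logistic model f(x, w) = 1 / (1 + exp(-w^T x)); this particular f and the synthetic data are my assumptions, since the slides only require that f outputs a probability. Gradient ascent on the cross-entropy here plays the role that backpropagation plays for a multilayer network:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative 2-class data: 2-d inputs plus a constant feature, labels in {0, 1}
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X = np.column_stack([np.ones(len(X)), X])          # constant term, as before
y = (X[:, 1] + 0.5 * X[:, 2] > 0).astype(float)    # synthetic labels

w = np.zeros(3)
lr = 0.1
for _ in range(2000):
    p = sigmoid(X @ w)          # f(x_i, w) = estimated P(+ class | x_i)
    # Gradient of sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ] with respect to w
    grad = X.T @ (y - p)
    w += lr * grad / len(X)     # ascend the cross-entropy

print(w)
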
