
Linear and Nonlinear Regression and Classification

Ronald J. Williams
CSG220, Spring 2007

Containing a number of slides adapted from the Andrew Moore tutorial
"Regression and Classification with Neural Networks"

Note to other teachers and users of these slides: Andrew would be delighted
if you found this source material useful in giving your own lectures. Feel
free to use these slides verbatim, or to modify them to fit your own needs.
PowerPoint originals are available. If you make use of a significant portion
of these slides in your own lecture, please include this message, or the
following link to the source repository of Andrew's tutorials:
http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully
received.

Originals © 2001, Andrew W. Moore, Modifications © 2003, Ronald J. Williams

Linear Regression

DATASET
    inputs       outputs
    x1 = 1       y1 = 1
    x2 = 3       y2 = 2.2
    x3 = 2       y3 = 2
    x4 = 1.5     y4 = 1.9
    x5 = 4       y5 = 3.1

(Figure: the datapoints plotted with a fitted line of slope w through the origin.)

Linear regression assumes that the expected value of the output given an
input, E[y|x], is linear.
Simplest case: Out(x) = wx for some unknown w.
Given the data, we can estimate w.

1-parameter linear regression

Assume that the data is formed by
    y_i = w x_i + noise_i
where...
• the noise signals are independent
• the noise has a normal distribution with mean 0 and unknown variance σ²

P(y|w,x) has a normal distribution with
• mean wx
• variance σ²

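A minimal Python/numpy sketch of this data-generating assumption, using the inputs from the DATASET slide; the values of w and σ here are illustrative choices, not taken from the slides:

import numpy as np

w_true = 0.8     # assumed unknown slope, chosen only for illustration
sigma = 0.2      # assumed noise standard deviation

rng = np.random.default_rng(0)
x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])                 # inputs from the DATASET slide
y = w_true * x + rng.normal(0.0, sigma, size=x.shape)   # y_i = w x_i + noise_i
print(y)
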

Bayesian Linear Regression

P(y|w,x) = Normal(mean wx, var σ²)

We have a set of datapoints (x1,y1), (x2,y2), ..., (xR,yR) which are
EVIDENCE about w.

We want to infer w from the data:
    P(w | x1, x2, ..., xR, y1, y2, ..., yR)

• You can use BAYES rule to work out a posterior distribution for w given
  the data.
• Or you could do Maximum Likelihood Estimation.

Maximum likelihood estimation of w

Asks the question:
"For which value of w is this data most likely to have happened?"
<=>
For what w is
    P(y1, y2, ..., yR | x1, x2, ..., xR, w)
maximized?
<=>
For what w is
    \prod_{i=1}^{R} P(y_i \mid w, x_i)
maximized?


For what w is
    \prod_{i=1}^{R} P(y_i \mid w, x_i)
maximized?

For what w is
    \prod_{i=1}^{R} \exp\!\left(-\frac{1}{2}\left(\frac{y_i - w x_i}{\sigma}\right)^{2}\right)
maximized?

For what w is
    \sum_{i=1}^{R} -\frac{1}{2}\left(\frac{y_i - w x_i}{\sigma}\right)^{2}
maximized?

For what w is
    \sum_{i=1}^{R} \left(y_i - w x_i\right)^{2}
minimized?

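One step worth spelling out (not on the original slide): passing from the product to the sum is just taking the logarithm, which is monotone and therefore does not change the maximizing w:

    \log \prod_{i=1}^{R} \exp\!\left(-\frac{1}{2}\left(\frac{y_i - w x_i}{\sigma}\right)^{2}\right)
      = \sum_{i=1}^{R} -\frac{1}{2}\left(\frac{y_i - w x_i}{\sigma}\right)^{2}
      = -\frac{1}{2\sigma^{2}} \sum_{i=1}^{R} (y_i - w x_i)^{2}

Since σ² is a fixed positive constant, maximizing this expression is the same as minimizing \sum_{i=1}^{R} (y_i - w x_i)^{2}.
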
Linear Regression

The maximum likelihood w is the one that minimizes the sum of squares of
the residuals:

    \mathrm{E}(w) = \sum_i (y_i - w x_i)^{2}
                  = \sum_i y_i^{2} - \left(2 \sum_i x_i y_i\right) w + \left(\sum_i x_i^{2}\right) w^{2}

We want to minimize a quadratic function of w.

(Figure: E(w) plotted against w, a parabola.)

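Because E(w) is a quadratic in w with positive leading coefficient \sum_i x_i^{2}, the minimizer quoted on the next slide follows from setting the derivative to zero:

    \frac{d\mathrm{E}}{dw} = -2\sum_i x_i y_i + 2\left(\sum_i x_i^{2}\right) w = 0
    \quad\Longrightarrow\quad
    w = \frac{\sum_i x_i y_i}{\sum_i x_i^{2}}
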

Linear Regression

It is easy to show that the sum of squares is minimized when

    w = \frac{\sum_i x_i y_i}{\sum_i x_i^{2}}

The maximum likelihood model is
    Out(x) = wx
We can use it for prediction.

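A minimal numpy sketch of this closed form, applied to the DATASET from the first Linear Regression slide (the variable names are mine):

import numpy as np

x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])
y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

w = np.sum(x * y) / np.sum(x ** 2)   # w = sum_i x_i y_i / sum_i x_i^2
print(w)                             # maximum likelihood slope
print(w * 2.5)                       # prediction: Out(2.5) = w * 2.5
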
Linear Regression

It is easy to show that the sum of squares is minimized when

    w = \frac{\sum_i x_i y_i}{\sum_i x_i^{2}}

The maximum likelihood model is
    Out(x) = wx
We can use it for prediction.

Note: In Bayesian stats you'd have ended up with a probability distribution
over w, and predictions would have given a probability distribution over the
expected output. It is often useful to know your confidence; maximum
likelihood can give some kinds of confidence too.

(Figure: the posterior density p(w) plotted against w.)

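To make the Bayesian remark concrete: if we put a Gaussian prior on w (say N(0, τ²)) and treat the noise variance σ² as known, both the posterior over w and the predictive distribution at a new input are Gaussian and available in closed form. A sketch under those assumptions (τ², σ², and the prior mean 0 are illustrative choices, not from the slides):

import numpy as np

x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])
y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

sigma2 = 0.04   # assumed known noise variance
tau2 = 1.0      # assumed prior variance for w (prior mean 0)

# Posterior over w is Gaussian: precisions add, mean is precision-weighted.
post_prec = 1.0 / tau2 + np.sum(x ** 2) / sigma2
post_var = 1.0 / post_prec
post_mean = post_var * np.sum(x * y) / sigma2

# Predictive distribution at a new input x_star is also Gaussian.
x_star = 2.5
pred_mean = post_mean * x_star
pred_var = x_star ** 2 * post_var + sigma2
print(post_mean, post_var, pred_mean, pred_var)
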

Multivariate Regression

What if the inputs are vectors?

(Figure: a 2-d input example; datapoints plotted in the (x1, x2) plane, each
labeled with its output value: 3, 4, 5, 6, 8, 10.)

Dataset has form
    x1   y1
    x2   y2
    x3   y3
    :    :
    xR   yR

Multivariate Regression

Write the matrices X and Y thus:

    X = \begin{bmatrix} \cdots\, x_1 \,\cdots \\ \cdots\, x_2 \,\cdots \\ \vdots \\ \cdots\, x_R \,\cdots \end{bmatrix}
      = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & & & \vdots \\ x_{R1} & x_{R2} & \cdots & x_{Rm} \end{bmatrix},
    \qquad
    Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_R \end{bmatrix}

(There are R datapoints. Each input has m components.)

The linear regression model assumes a vector w such that
    Out(x) = w^T x = w_1 x[1] + w_2 x[2] + ... + w_m x[m]

The maximum likelihood w is  w = (X^T X)^{-1} (X^T Y)

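A minimal numpy sketch of this formula; the data is made up purely for illustration, and in practice one solves the linear system rather than forming the inverse explicitly:

import numpy as np

# R = 4 datapoints, m = 2 input components (illustrative values)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
Y = np.array([5.0, 4.0, 11.0, 10.0])

# w = (X^T X)^{-1} (X^T Y), computed via a linear solve for numerical stability
w = np.linalg.solve(X.T @ X, X.T @ Y)

x_new = np.array([2.0, 2.0])
print(w, w @ x_new)   # Out(x) = w^T x
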

Multivariate Regression (con't)

The maximum likelihood w is  w = (X^T X)^{-1} (X^T Y)

X^T X is an m x m matrix: its (i,j)'th element is  \sum_{k=1}^{R} x_{ki} x_{kj}

X^T Y is an m-element vector: its i'th element is  \sum_{k=1}^{R} x_{ki} y_k

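These are just the ordinary matrix products written out elementwise; a quick numpy check, reusing the illustrative X and Y from the previous sketch:

import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
Y = np.array([5.0, 4.0, 11.0, 10.0])
R, m = X.shape

XtX = np.array([[np.sum(X[:, i] * X[:, j]) for j in range(m)] for i in range(m)])
XtY = np.array([np.sum(X[:, i] * Y) for i in range(m)])

assert np.allclose(XtX, X.T @ X)   # (i,j)'th element is sum_k x_ki x_kj
assert np.allclose(XtY, X.T @ Y)   # i'th element is sum_k x_ki y_k
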
What about a constant term?

We may expect linear data that does not go through the origin.

Statisticians and Neural Net Folks all agree on a simple obvious hack.

Can you guess??


The constant term

• The trick is to create a fake input "X0" that always takes the value 1.

    Before:                    After:
    X1   X2   Y                X0   X1   X2   Y
    2    4    16               1    2    4    16
    3    4    17               1    3    4    17
    5    5    20               1    5    5    20

    Y = w1 X1 + w2 X2          Y = w0 X0 + w1 X1 + w2 X2
    ...has to be a poor model    = w0 + w1 X1 + w2 X2
                               ...has a fine constant term

In this example, you should be able to see the MLE w0, w1, and w2 by
inspection.

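In code, the hack is literally one extra column of ones; a sketch using the table above:

import numpy as np

X = np.array([[2.0, 4.0],
              [3.0, 4.0],
              [5.0, 5.0]])
Y = np.array([16.0, 17.0, 20.0])

X0 = np.column_stack([np.ones(len(X)), X])   # fake input X0 = 1 for every row
w = np.linalg.solve(X0.T @ X0, X0.T @ Y)     # [w0, w1, w2]
print(w)   # this data is fit exactly by w0 = 10, w1 = 1, w2 = 1
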
What about higher-order terms?

Maybe we suspect a higher-order polynomial function like
    y = w0 + w1 x + w2 x^2 + w3 x^3
would fit the data better.

In that case, we can simply perform multivariate linear regression using
additional dimensions for all higher-order terms.


Higher-order terms

    Linear Fit           Quadratic Fit
    1   X   Y            1   X   X^2   Y
    1   1   2            1   1   1     2
    1   2   5            1   2   4     5
    1   3   10           1   3   9     10
    1   5   26           1   5   25    26

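The same data fitted both ways in numpy; the quadratic fit just adds an X² column and runs exactly the same linear-regression solve (a sketch, not code from the slides):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 5.0])
y = np.array([2.0, 5.0, 10.0, 26.0])

# Linear fit: columns [1, X]
X_lin = np.column_stack([np.ones_like(x), x])
w_lin = np.linalg.solve(X_lin.T @ X_lin, X_lin.T @ y)

# Quadratic fit: columns [1, X, X^2]; still plain linear regression in w
X_quad = np.column_stack([np.ones_like(x), x, x ** 2])
w_quad = np.linalg.solve(X_quad.T @ X_quad, X_quad.T @ y)

print(w_lin)    # two coefficients
print(w_quad)   # three coefficients; y = 1 + x^2 fits this data exactly
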
Maximum Likelihood Nonlinear Regression

Assume the correct function is y = f(x, w), where f is any function of the
input x parameterized by w, and observations are corrupted by additive
Gaussian noise (with some fixed variance σ²).

For example, f could be the function computed by a multilayer neural
network whose weights are w.


As before, we would like to determine for what w

    P(y1, y2, ..., yR | x1, x2, ..., xR, w)

is maximized.

And just as before, this translates into:

For what w is
    \prod_{i=1}^{R} P(y_i \mid w, x_i)
maximized?

For what w is
    \prod_{i=1}^{R} \exp\!\left(-\frac{1}{2}\left(\frac{\| y_i - f(x_i, w) \|}{\sigma}\right)^{2}\right)
maximized?

For what w is
    \sum_{i=1}^{R} -\frac{1}{2}\left(\frac{\| y_i - f(x_i, w) \|}{\sigma}\right)^{2}
maximized?

For what w is
    \sum_{i=1}^{R} \| y_i - f(x_i, w) \|^{2}
minimized?

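As one concrete instance of this, here is a small gradient-descent sketch that minimizes the sum of squared residuals for an assumed nonlinear model f(x, w) = w1 (1 - exp(-w2 x)); a neural network trained by backpropagation is the same idea with a more elaborate f and gradients computed by the chain rule:

import numpy as np

# Illustrative nonlinear model; this particular f is an assumption, standing in
# for e.g. a multilayer neural network with weight vector w.
def f(x, w):
    return w[0] * (1.0 - np.exp(-w[1] * x))

rng = np.random.default_rng(0)
x = np.linspace(0.1, 5.0, 50)
y = f(x, np.array([2.0, 1.5])) + rng.normal(0.0, 0.05, size=x.shape)   # synthetic data

w = np.array([1.0, 1.0])   # initial guess
lr = 0.1                   # learning rate
for _ in range(10000):
    resid = y - f(x, w)    # y_i - f(x_i, w)
    # Gradient of the mean squared residual (same minimizer as the sum).
    grad_w1 = -2.0 * np.mean(resid * (1.0 - np.exp(-w[1] * x)))
    grad_w2 = -2.0 * np.mean(resid * w[0] * x * np.exp(-w[1] * x))
    w = w - lr * np.array([grad_w1, grad_w2])

print(w)   # should end up near the generating parameters [2.0, 1.5]
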

• So, for example, with the usual squared-error measure, backpropagation
  can be viewed as a technique for searching for a maximum-likelihood fit
  of a neural network to a given set of training data.
• This applies when neural networks are used for regression, assuming
  additive Gaussian noise.
• What about for classification?

Maximum Likelihood Probability Estimation

• Consider a 2-class classification problem, and assume that the
  probability that an instance x is classified as positive has the
  functional form y = f(x, w).
• Then it can be shown that the correct criterion to optimize to generate
  ML estimates of the probability of belonging to the + class is not
  squared error.


Maximum Cross-Entropy

• Instead, the following cross-entropy measure should be maximized:

    \sum_{i=1}^{R} \left( y_i \log f(x_i, w) + (1 - y_i) \log\!\left(1 - f(x_i, w)\right) \right)

• In a multilayer neural network, the gradient computation for this measure
  still follows the backpropagation process.

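A small sketch of this criterion for one common choice of f, a logistic model f(x, w) = 1 / (1 + exp(-w^T x)); this particular f and the synthetic data are my assumptions, since the slides only require that f outputs a probability. Gradient ascent on the cross-entropy here plays the role that backpropagation plays for a multilayer network:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative 2-class data: 2-d inputs plus a constant feature, labels in {0, 1}
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X = np.column_stack([np.ones(len(X)), X])          # constant term, as before
y = (X[:, 1] + 0.5 * X[:, 2] > 0).astype(float)    # synthetic labels

w = np.zeros(3)
lr = 0.1
for _ in range(2000):
    p = sigmoid(X @ w)          # f(x_i, w) = estimated P(+ class | x_i)
    # Gradient of sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ] with respect to w
    grad = X.T @ (y - p)
    w += lr * grad / len(X)     # ascend the cross-entropy

print(w)
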
