Regression and Classification
Ronald J. Williams
CSG220, Spring 2007

Containing a number of slides adapted from the Andrew Moore tutorial "Regression and Classification with Neural Networks".

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
Linear Regression
DATASET

  inputs       outputs
  x1 = 1       y1 = 1
  x2 = 3       y2 = 2.2
  x3 = 2       y3 = 2
  x4 = 1.5     y4 = 1.9
  x5 = 4       y5 = 3.1

[Figure: the data points with a fitted line through the origin; over a horizontal run of 1 the line rises by w, i.e. its slope is w]
1-parameter linear regression
Assume that the data is formed by
$y_i = w x_i + \mathrm{noise}_i$
where…
• the noise signals are independent
• the noise has a normal distribution with mean 0 and unknown variance $\sigma^2$
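To make the assumptions concrete, here is a minimal sketch (Python/NumPy; the "true" slope and noise level are made-up values purely for illustration) of generating data under this model:

```python
import numpy as np

rng = np.random.default_rng(0)

w_true = 0.8   # assumed "true" slope (made up for illustration)
sigma = 0.2    # standard deviation of the Gaussian noise

x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])            # inputs from the dataset above
noise = rng.normal(loc=0.0, scale=sigma, size=x.shape)
y = w_true * x + noise                             # y_i = w x_i + noise_i
```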
Maximum likelihood estimation of w

For what $w$ is

$$\prod_{i=1}^{R} P(y_i \mid w, x_i)$$

maximized? Because the noise is Gaussian, each factor is proportional to $\exp\!\big(-\tfrac{1}{2}(y_i - w x_i)^2/\sigma^2\big)$, so maximizing the product is the same as maximizing its log, which (up to constants) is $-\tfrac{1}{2\sigma^2}\sum_{i=1}^{R}(y_i - w x_i)^2$. Equivalently: for what $w$ is

$$\sum_{i=1}^{R} (y_i - w x_i)^2$$

minimized?
Linear Regression

The maximum likelihood $w$ is the one that minimizes the sum of squares of the residuals:

$$E(w) = \sum_i (y_i - w x_i)^2 = \sum_i y_i^2 - \Big(2 \sum_i x_i y_i\Big) w + \Big(\sum_i x_i^2\Big) w^2$$

[Figure: plot of $E(w)$ against $w$: an upward-opening parabola]

We want to minimize a quadratic function of $w$.
Linear Regression

Easy to show the sum of squares is minimized when

$$w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$$

The maximum likelihood model is

$$\mathrm{Out}(x) = w x$$

Note: In Bayesian stats you'd have ended up with a probability distribution over $w$ [Figure: plot of $p(w)$ against $w$], and predictions would have given a probability distribution over the expected output.
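As a sanity check, here is a minimal NumPy sketch (our own variable names) computing this closed-form estimate on the dataset from the first slide:

```python
import numpy as np

# Dataset from the first slide
x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])
y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

# Maximum likelihood estimate: w = sum(x_i y_i) / sum(x_i^2)
w = np.sum(x * y) / np.sum(x ** 2)

def out(x_new):
    """The maximum likelihood model Out(x) = w x."""
    return w * x_new

print(w)   # about 0.83 for this dataset
```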
Multivariate Regression
What if the inputs are vectors?

[Figure: a 2-d input example; points scattered in the $(x_1, x_2)$ plane, each labeled with its output value (3, 4, 6, 5, 8, 10)]

Dataset has form

  x_1   y_1
  x_2   y_2
  x_3   y_3
   :     :
  x_R   y_R

where each input $\mathbf{x}_i$ is now a vector.
Multivariate Regression

Write matrix $X$ and vector $Y$ thus:

$$X = \begin{pmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_R^T \end{pmatrix}, \qquad Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_R \end{pmatrix}$$

so $X$ is $R \times m$, with one input vector per row ($m$ is the number of input components).

$X^T Y$ is an $m$-element vector: its $i$'th element is $\sum_{k=1}^{R} x_{ki}\, y_k$. The maximum likelihood weight vector is the one that solves the normal equations $(X^T X)\,\mathbf{w} = X^T Y$.
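A minimal NumPy sketch of solving this least-squares problem (the data here are made up; `np.linalg.lstsq` solves the same problem as the normal equations, more stably than forming $(X^T X)^{-1}$ explicitly):

```python
import numpy as np

# Made-up 2-d inputs (one example per row) and outputs
X = np.array([[3.0, 1.0],
              [4.0, 2.0],
              [6.0, 5.0]])
Y = np.array([3.0, 4.0, 6.0])

# Least-squares solution of (X^T X) w = X^T Y
w, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(w)
```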
What about a constant term?

We may expect linear data that does not go through the origin. Statisticians and neural net folks all agree on a simple, obvious hack: add an extra input $X_0$ that is always 1.

Before:                     After:

  X1  X2   Y                 X0  X1  X2   Y
   2   4  16                  1   2   4  16
   3   4  17                  1   3   4  17
   5   5  20                  1   5   5  20

  Y = w1 X1 + w2 X2          Y = w0 X0 + w1 X1 + w2 X2
  …has to be a poor model      = w0 + w1 X1 + w2 X2
                             …has a fine constant term

In this example, you should be able to see the MLE $w_0$, $w_1$ and $w_2$ by inspection (confirmed in the sketch below).
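A quick NumPy sketch of the hack, which also confirms the by-inspection answer (each $Y$ above equals $10 + X_1 + X_2$):

```python
import numpy as np

# The "Before" data
X = np.array([[2.0, 4.0],
              [3.0, 4.0],
              [5.0, 5.0]])
Y = np.array([16.0, 17.0, 20.0])

# The hack: prepend a column of ones so w0 acts as the constant term
X0 = np.hstack([np.ones((X.shape[0], 1)), X])

w, *_ = np.linalg.lstsq(X0, Y, rcond=None)
print(w)   # approximately [10., 1., 1.]
```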
What about higher-order terms?

Maybe we suspect a higher-order polynomial function like

$$y = w_0 + w_1 x + w_2 x^2 + w_3 x^3$$

would fit the data better. The same hack works: treat each power of $x$ as an extra input column, then run multivariate linear regression as before.

Linear fit:            Quadratic fit:

  1  X   Y               1  X  X²   Y
  1  1   2               1  1   1   2
  1  2   5               1  2   4   5
  1  3  10               1  3   9  10
  1  5  26               1  5  25  26
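A minimal sketch of the quadratic fit. The table's data satisfy $Y = X^2 + 1$ exactly, so the fit should recover roughly $w = (1, 0, 1)$:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 5.0])
y = np.array([2.0, 5.0, 10.0, 26.0])

# Quadratic design matrix: columns 1, x, x^2
A = np.column_stack([np.ones_like(x), x, x ** 2])

w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(w)   # approximately [1., 0., 1.]  (the data are exactly y = x^2 + 1)
```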
Maximum Likelihood Nonlinear Regression

Assume the correct function is $y = f(\mathbf{x}, \mathbf{w})$, where $f$ is any function of the input $\mathbf{x}$ parameterized by $\mathbf{w}$, and observations are corrupted by additive Gaussian noise (with some fixed variance $\sigma^2$).
For what $\mathbf{w}$ is

$$\prod_{i=1}^{R} P(y_i \mid \mathbf{w}, \mathbf{x}_i)$$

maximized? Dropping the constant factors of the Gaussian density (they don't affect the maximizer): for what $\mathbf{w}$ is

$$\prod_{i=1}^{R} \exp\!\left( -\frac{1}{2} \left( \frac{\| y_i - f(\mathbf{x}_i, \mathbf{w}) \|}{\sigma} \right)^{\!2} \right)$$

maximized? Taking logs: for what $\mathbf{w}$ is

$$\sum_{i=1}^{R} -\frac{1}{2} \left( \frac{\| y_i - f(\mathbf{x}_i, \mathbf{w}) \|}{\sigma} \right)^{\!2}$$

maximized? Equivalently: for what $\mathbf{w}$ is

$$\sum_{i=1}^{R} \| y_i - f(\mathbf{x}_i, \mathbf{w}) \|^2$$

minimized?
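For a general $f$ there is no closed-form solution, so the sum of squares is minimized numerically. Here is a minimal gradient-descent sketch; the model $f(x, \mathbf{w}) = w_1 (1 - e^{-w_2 x})$ and all settings are made up purely for illustration:

```python
import numpy as np

def f(x, w):
    # Made-up nonlinear model, chosen only for illustration
    return w[0] * (1.0 - np.exp(-w[1] * x))

def sse(w, x, y):
    # Sum of squared residuals
    return np.sum((y - f(x, w)) ** 2)

def grad(w, x, y, eps=1e-6):
    # Central-difference numerical gradient of the sum of squares
    g = np.zeros_like(w)
    for j in range(len(w)):
        dw = np.zeros_like(w)
        dw[j] = eps
        g[j] = (sse(w + dw, x, y) - sse(w - dw, x, y)) / (2 * eps)
    return g

# Synthetic data: the model with w = (2.0, 1.5) plus Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0.1, 5.0, 30)
y = f(x, np.array([2.0, 1.5])) + rng.normal(0.0, 0.05, x.shape)

w = np.array([1.0, 1.0])        # initial guess
for _ in range(2000):           # plain gradient descent
    w -= 0.01 * grad(w, x, y)
print(w)                        # should end up near [2.0, 1.5]
```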
Maximum Likelihood Probability Estimation

• Consider a 2-class classification problem, and assume that the probability that an instance $\mathbf{x}$ is classified as positive has the functional form $y = f(\mathbf{x}, \mathbf{w})$.
• Then it can be shown that the correct criterion to optimize to generate ML estimates of the probability of belonging to the + class is not squared error.
Maximum Cross-Entropy

Instead, with targets $y_i \in \{0, 1\}$, maximize

$$\sum_{i=1}^{R} \Big( y_i \log f(\mathbf{x}_i, \mathbf{w}) + (1 - y_i) \log\big( 1 - f(\mathbf{x}_i, \mathbf{w}) \big) \Big)$$

(the negative of this log-likelihood is the cross-entropy, so maximizing it is the same as minimizing the cross-entropy).
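A minimal sketch of maximizing this criterion by gradient ascent, assuming (our choice, not the slides') a logistic model $f(\mathbf{x}, \mathbf{w}) = 1/(1 + e^{-\mathbf{w}^T \mathbf{x}})$, for which the gradient of the criterion is simply $\sum_i (y_i - f(\mathbf{x}_i, \mathbf{w}))\,\mathbf{x}_i$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up 2-class data: a constant-1 input column plus one feature, 0/1 labels
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5],
              [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

w = np.zeros(2)
for _ in range(5000):
    p = sigmoid(X @ w)            # f(x_i, w): predicted P(class = +)
    w += 0.1 * X.T @ (y - p)      # gradient *ascent* on the criterion above
print(sigmoid(X @ w))             # predicted probabilities track the labels
```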