Classification with
Neural Networks
Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599
Linear Regression

DATASET
  inputs      outputs
  x1 = 1      y1 = 1
  x2 = 3      y2 = 2.2
  x3 = 2      y3 = 2
  x4 = 1.5    y4 = 1.9
  x5 = 4      y5 = 3.1

[Figure: the datapoints plotted in the (x, y) plane, with a fitted line of slope w through the origin.]
1-parameter linear regression
Assume that the data is formed by
yi = wxi + noisei
where…
• the noise signals are independent
• the noise has a normal distribution with mean 0
and unknown variance σ2
Maximum likelihood estimation of w

For what w is   ∏_{i=1…n} P(yi | w, xi)   maximized?

For what w is   ∑_{i=1…n} (yi − w xi)²   minimized?
Linear Regression

The maximum likelihood w is the one that minimizes the sum-of-squares of residuals:

  E(w) = ∑i (yi − w xi)²
       = (∑i yi²) − 2(∑i xi yi) w + (∑i xi²) w²

We want to minimize a quadratic function of w.

[Figure: E(w) plotted against w is an upward-opening parabola.]
Linear Regression

Easy to show the sum of squares is minimized when

  w = (∑ xi yi) / (∑ xi²)

The maximum likelihood model is

  Out(x) = w x

Note: In Bayesian stats you'd have ended up with a prob dist of w, and predictions would have given a prob dist of expected output.

[Figure: p(w) plotted against w.]
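For concreteness, here is a minimal Python sketch (not part of the original slides) that computes the maximum-likelihood w for the 1-parameter model on the five datapoints from the DATASET slide above:

# Minimal sketch: 1-parameter linear regression, w = (sum xi yi) / (sum xi^2),
# using the five (x, y) pairs from the DATASET slide.
xs = [1.0, 3.0, 2.0, 1.5, 4.0]
ys = [1.0, 2.2, 2.0, 1.9, 3.1]

w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print("ML estimate w =", round(w, 4))               # slope of the best-fit line through the origin
print("Prediction Out(2.5) =", round(w * 2.5, 4))   # the ML model is Out(x) = w x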
Multivariate Regression

What if the inputs are vectors?

[Figure: a 2-d input example, with datapoints scattered in the (x1, x2) plane and their output values written beside them.]

Dataset has the form
  x1   y1
  x2   y2
  x3   y3
  :    :
  xR   yR
Multivariate Regression

Write matrices X and Y thus:

  X = the R × m matrix whose k'th row is xkᵀ   (there are R datapoints; each input has m components)
  Y = the R-element vector of outputs (y1, …, yR)

The linear regression model assumes a vector w such that

  Out(x) = wᵀx = w1 x[1] + w2 x[2] + … + wm x[m]

The max. likelihood w is   w = (XᵀX)⁻¹(XᵀY)

IMPORTANT EXERCISE: PROVE IT !!!!!
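A minimal numpy sketch of the normal-equations solution (not from the slides; the three datapoints are borrowed from the constant-term table a couple of slides below):

# Minimal sketch: multivariate least squares, w = (X^T X)^-1 (X^T Y).
import numpy as np

X = np.array([[2.0, 4.0],
              [3.0, 4.0],
              [5.0, 5.0]])          # R = 3 datapoints, m = 2 input components
Y = np.array([16.0, 17.0, 20.0])    # outputs

w = np.linalg.solve(X.T @ X, X.T @ Y)   # solves (X^T X) w = X^T Y without forming the inverse
print("ML weights:", w)
print("Out([4, 5]) =", w @ np.array([4.0, 5.0]))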
Multivariate Regression (con't)

  XᵀX is an m × m matrix:      its i,j'th elt is ∑_{k=1…R} xki xkj
  XᵀY is an m-element vector:  its i'th elt is   ∑_{k=1…R} xki yk

What about a constant term? Statisticians and Neural Net Folks all agree on a simple obvious hack.
The constant term

• The trick is to create a fake input "X0" that always takes the value 1

Before:                     After:
  X1  X2   Y                  X0  X1  X2   Y
   2   4  16                   1   2   4  16
   3   4  17                   1   3   4  17
   5   5  20                   1   5   5  20

  Y = w1X1 + w2X2             Y = w0X0 + w1X1 + w2X2
  …has to be a poor model       = w0 + w1X1 + w2X2 in this example
                              …has a fine constant term

You should be able to see the MLE w0, w1 and w2 by inspection, as in the sketch below.
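A minimal continuation of the previous sketch (again not from the slides) showing the hack: prepend the fake X0 = 1 column and solve the same normal equations.

# Minimal sketch: the constant-term hack on the table above.
import numpy as np

X = np.array([[2.0, 4.0],
              [3.0, 4.0],
              [5.0, 5.0]])
Y = np.array([16.0, 17.0, 20.0])

X0 = np.hstack([np.ones((X.shape[0], 1)), X])       # fake input X0 = 1 for every datapoint
w0, w1, w2 = np.linalg.solve(X0.T @ X0, X0.T @ Y)   # w0 is now the constant term
print(w0, w1, w2)    # for this table the fit is exact: w0 = 10, w1 = 1, w2 = 1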
Regression with varying noise

Assume   yi ~ N(w xi, σi²)

What's the MLE estimate of w?
MLE estimation with varying noise

  argmax_w  log p(y1, y2, …, yR | x1, x2, …, xR, σ1², σ2², …, σR², w)

= argmin_w  ∑_{i=1…R} (yi − w xi)² / σi²          (assuming i.i.d. and then plugging in the equation for a Gaussian and simplifying)

= the w such that  ∑_{i=1…R} xi (yi − w xi) / σi² = 0        (setting dLL/dw equal to zero)

= [ ∑_{i=1…R} xi yi / σi² ]  /  [ ∑_{i=1…R} xi² / σi² ]      (trivial algebra)
This is weighted regression: we are minimizing the weighted sum of squares

  argmin_w  ∑_{i=1…R} (yi − w xi)² / σi²

where the weight for the i'th datapoint is 1/σi².

[Figure: datapoints plotted with error bars; points with small σ (e.g. σ = 1/2) pull the fitted line more strongly than points with large σ (e.g. σ = 2).]
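A minimal sketch of the 1-parameter weighted estimate (not from the slides): the x and y values are those of the DATASET slide; the per-datapoint σ values are made up purely for illustration.

# Weighted least squares for yi ~ N(w xi, sigma_i^2):
#   w = [sum xi yi / sigma_i^2] / [sum xi^2 / sigma_i^2]
xs     = [1.0, 3.0, 2.0, 1.5, 4.0]
ys     = [1.0, 2.2, 2.0, 1.9, 3.1]
sigmas = [0.5, 1.0, 0.5, 2.0, 1.0]   # illustrative noise standard deviations

num = sum(x * y / s**2 for x, y, s in zip(xs, ys, sigmas))
den = sum(x * x / s**2 for x, s in zip(xs, sigmas))
print("weighted ML estimate w =", round(num / den, 4))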
Weighted Multivariate Regression

Write W for the diagonal matrix with Wkk = 1/σk. Then

  (WX)ᵀ(WX) is an m × m matrix:      its i,j'th elt is ∑_{k=1…R} xki xkj / σk²
  (WX)ᵀ(WY) is an m-element vector:  its i'th elt is   ∑_{k=1…R} xki yk / σk²

and the max. likelihood w is   w = ((WX)ᵀ(WX))⁻¹ ((WX)ᵀ(WY))
Non-linear Regression

• Suppose you know that y is related to a function of x in such a way that the predicted values have a non-linear dependence on w, e.g.:

  xi    yi
  ½     ½
  1     2.5
  2     3
  3     2

[Figure: the four datapoints plotted, with y ranging from 0 to 3.]

Assume   yi ~ N(√(w + xi), σ²)

What's the MLE estimate of w?
Non-linear MLE estimation

  argmax_w  log p(y1, y2, …, yR | x1, x2, …, xR, σ, w)

= argmin_w  ∑_{i=1…R} (yi − √(w + xi))²          (assuming i.i.d. and then plugging in the equation for a Gaussian and simplifying)

= the w such that  ∑_{i=1…R} (yi − √(w + xi)) / √(w + xi) = 0        (setting dLL/dw equal to zero)
Non-linear MLE estimation

  argmax_w  log p(y1, y2, …, yR | x1, x2, …, xR, σ, w)

= argmin_w  ∑_{i=1…R} (yi − √(w + xi))²

= the w such that  ∑_{i=1…R} (yi − √(w + xi)) / √(w + xi) = 0

We're down the algebraic toilet. So guess what we do?

Common (but not only) approach: Numerical Solutions:
• Line Search
• Simulated Annealing
• Gradient Descent
• Conjugate Gradient
• Levenberg-Marquardt
• Newton's Method

Also, special-purpose statistical-optimization-specific tricks such as E.M. (See Gaussian Mixtures lecture for introduction)
GRADIENT DESCENT

Suppose we have a scalar function   f(w): ℜ → ℜ

We want to find a local minimum. Assume our current weight is w.

GRADIENT DESCENT RULE:   w ← w − η ∂f(w)/∂w
η is called the LEARNING RATE: a small positive number, e.g. η = 0.05 (recall Andrew's favorite default value for anything).

QUESTION: Justify the Gradient Descent Rule
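As an illustration of the rule (not from the slides), the sketch below applies it to the non-linear regression problem from the preceding slides, E(w) = ∑ (yi − √(w + xi))², using those four datapoints; the starting point, η, and iteration count are arbitrary choices.

# Gradient descent w <- w - eta * dE/dw for E(w) = sum_i (yi - sqrt(w + xi))^2.
import math

xs = [0.5, 1.0, 2.0, 3.0]   # the four datapoints from the Non-linear Regression slide
ys = [0.5, 2.5, 3.0, 2.0]

def dE_dw(w):
    # dE/dw = -sum_i (yi - sqrt(w + xi)) / sqrt(w + xi)
    return -sum((y - math.sqrt(w + x)) / math.sqrt(w + x) for x, y in zip(xs, ys))

eta = 0.05    # the learning rate value suggested above
w = 1.0       # arbitrary starting weight
for _ in range(1000):
    w = w - eta * dE_dw(w)

print("gradient descent estimate: w =", round(w, 4))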
What's all this got to do with Neural Nets, then, eh??

For supervised learning, neural nets are also models with vectors of w parameters in them. They are now called weights.

As before, we want to compute the weights to minimize the sum-of-squared residuals.

  Which turns out, under the "Gaussian i.i.d. noise" assumption, to be max. likelihood.

Instead of explicitly solving for the max. likelihood weights, we use GRADIENT DESCENT to SEARCH for them.

"Why?" you ask, a querulous expression in your eyes.
"Aha!!" I reply: "We'll see later."
Linear Perceptrons

They are multivariate linear models:

  Out(x) = wᵀx

and we choose w to minimize the sum-of-squared residuals:

  E = ∑k (Out(xk) − yk)²
    = ∑k (wᵀxk − yk)²
Linear Perceptron Training Rule

  E = ∑_{k=1…R} (yk − wᵀxk)²

So what's ∂E/∂wj ?
Linear Perceptron Training Rule

  E = ∑_{k=1…R} (yk − wᵀxk)²

  ∂E/∂wj = −2 ∑_{k=1…R} δk xkj     …where…   δk = yk − wᵀxk

So the gradient descent update is

  wj ← wj + 2η ∑_{k=1…R} δk xkj

(We frequently neglect the 2, meaning we halve the learning rate.)
3) for i = 1 to R:   δi := yi − wᵀxi

4) for j = 1 to m:   wj ← wj + η ∑_{i=1…R} δi xij

(A batch implementation of this update loop is sketched below.)
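A minimal batch-mode sketch of steps 3 and 4 in Python (not from the slides); the zero initialization, iteration count, and learning rate are illustrative assumptions, and η must be small enough for the data at hand.

# Batch linear perceptron training by gradient descent.
import numpy as np

def train_linear_perceptron(X, y, eta=0.01, n_iters=200000):
    w = np.zeros(X.shape[1])             # illustrative initialization
    for _ in range(n_iters):
        delta = y - X @ w                # step 3: delta_i = yi - w^T xi, for all i at once
        w += eta * X.T @ delta           # step 4: wj += eta * sum_i delta_i xij
    return w

# toy usage on the constant-term table (X0 column already appended):
X = np.array([[1, 2, 4], [1, 3, 4], [1, 5, 5]], dtype=float)
y = np.array([16.0, 17.0, 20.0])
print(train_linear_perceptron(X, y))     # crawls toward w0 = 10, w1 = 1, w2 = 1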
A RULE KNOWN BY MANY NAMES

  δi ← yi − wᵀxi
  wj ← wj + η δi xij

Also known as: the LMS rule, the Widrow-Hoff rule, the delta rule, the adaline rule, classical conditioning.

Applied online to each datapoint as it arrives: observe (x, y), then

  δ ← y − wᵀx
  ∀j:  wj ← wj + η δ xj
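A one-update-per-datapoint sketch of the online form (not from the slides); the data stream and η are illustrative.

# Online delta / LMS update: each incoming (x, y) nudges the weights once.
import numpy as np

def lms_update(w, x, y, eta=0.01):
    delta = y - w @ x            # delta = y - w^T x
    return w + eta * delta * x   # wj <- wj + eta * delta * xj, for all j

w = np.zeros(3)
stream = [([1, 2, 4], 16.0), ([1, 3, 4], 17.0), ([1, 5, 5], 20.0)]   # the earlier table, replayed
for _ in range(50):
    for x, y in stream:
        w = lms_update(w, np.array(x, dtype=float), y)
print(w)   # drifts toward the least-squares weights without ever storing old datapoints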
Gradient Descent vs Matrix Inversion for Linear Perceptrons

GD Advantages (MI disadvantages):
• Biologically plausible
• With very very many attributes each iteration costs only O(mR). If fewer than m iterations are needed we've beaten Matrix Inversion
• More easily parallelizable (or implementable in wetware)?

GD Disadvantages (MI advantages):
• It's moronic
• It's essentially a slow implementation of a way to build the XᵀX matrix and then solve a set of linear equations
• If m is small it's especially outrageous. If m is large then the direct matrix inversion method gets fiddly but not impossible if you want to be efficient.
• Hard to choose a good learning rate
• Matrix inversion takes predictable time. You can't be sure when gradient descent will stop.
But we'll soon see that GD has an important extra trick up its sleeve.
Perceptrons for Classification

What if all outputs are 0's or 1's?

We can do a linear fit, and predict

  0 if Out(x) ≤ ½
  1 if Out(x) > ½

[Figure: blue = Out(x), the fitted line; green = the resulting classification.]

WHAT'S THE BIG PROBLEM WITH THIS???
Classification with Perceptrons I

Don't minimize   ∑ (yi − wᵀxi)².

Minimize the number of misclassifications instead:

  ∑ (yi − Round(wᵀxi))²

[Assume outputs are +1 & -1, not +1 & 0]
Classification with Perceptrons II: Sigmoid Functions

The Sigmoid:

  g(h) = 1 / (1 + exp(−h))

i.e. g(h) = 1 − g(−h)
The Sigmoid

  g(h) = 1 / (1 + exp(−h))

Now we choose w to minimize

  ∑_{i=1…R} [yi − Out(xi)]²  =  ∑_{i=1…R} [yi − g(wᵀxi)]²
Gradient descent with sigmoid on a perceptron

First, notice   g'(x) = g(x)(1 − g(x))

Because g(x) = 1 / (1 + exp(−x)), so

  g'(x) = exp(−x) / (1 + exp(−x))²
        = [1 / (1 + exp(−x))] · [(1 + exp(−x) − 1) / (1 + exp(−x))]
        = [1 / (1 + exp(−x))] · [1 − 1 / (1 + exp(−x))]
        = g(x)(1 − g(x))

The model:   Out(x) = g(∑k wk xk)

The error:   E = ∑i (yi − g(∑k wk xik))²

  ∂E/∂wj = ∑i 2 (yi − g(∑k wk xik)) · (− ∂/∂wj g(∑k wk xik))
         = −∑i 2 (yi − g(∑k wk xik)) g'(∑k wk xik) ∂/∂wj (∑k wk xik)
         = ∑i −2 δi g(neti)(1 − g(neti)) xij

where   neti = ∑k wk xik,   gi = g(neti),   δi = yi − gi

The sigmoid perceptron update rule:

  wj ← wj + η ∑_{i=1…R} δi gi (1 − gi) xij
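A minimal sketch of the sigmoid-perceptron update rule in numpy (not from the slides); the OR-function data, learning rate, and iteration count are illustrative assumptions.

# Batch gradient descent for a sigmoid perceptron:
#   wj <- wj + eta * sum_i delta_i * gi * (1 - gi) * xij,  with delta_i = yi - gi, gi = g(w^T xi).
import numpy as np

def g(h):
    return 1.0 / (1.0 + np.exp(-h))

def train_sigmoid_perceptron(X, y, eta=0.5, n_iters=5000):
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        gi = g(X @ w)                              # gi = g(net_i)
        delta = y - gi                             # delta_i = yi - gi
        w += eta * X.T @ (delta * gi * (1.0 - gi))
    return w

# toy usage: learn x1 v x2 (first input column is the constant X0 = 1)
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 1.0])
w = train_sigmoid_perceptron(X, y)
print(np.round(g(X @ w)))   # should round to [0. 1. 1. 1.]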
Perceptrons and Boolean Functions

If inputs are all 0's and 1's and outputs are all 0's and 1's…

• Can learn the function x1 ∨ x2
• Can learn any conjunction of literals, e.g.   x1 ∧ ~x2 ∧ ~x3 ∧ x4 ∧ x5

QUESTION: WHY?
Multilayer Networks

The class of functions representable by perceptrons is limited:

  Out(x) = g(wᵀx) = g(∑j wj xj)

Use a wider representation!

  Out(x) = g(∑j Wj g(∑k wjk xk))

This is a nonlinear function of a linear combination of nonlinear functions of linear combinations of inputs.
A 1-hidden-layer net

[Figure: inputs x1, x2 feed three hidden units v1, v2, v3 through weights wjk; the hidden units feed the output through weights Wk.]

  vj = g(∑_{k=1…NINS} wjk xk)      for j = 1, 2, 3

  Out = g(∑_{k=1…NHID} Wk vk)
OTHER NEURAL NETS

2-Hidden layers + Constant Term
[Figure: a network with two hidden layers plus a constant ("1") input.]

"JUMP" CONNECTIONS
[Figure: a network in which the inputs also connect directly to the output unit.]

  Out = g(∑_{k=1…NINS} w0k xk + ∑_{k=1…NHID} Wk vk)
Backpropagation

  Out(x) = g(∑j Wj g(∑k wjk xk))

Find a set of weights {Wj}, {wjk} to minimize

  ∑i (yi − Out(xi))²

by gradient descent.

That's it!
That's the backpropagation algorithm.
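A minimal numpy sketch of exactly this recipe (not from the slides): hand-derived gradients of ∑ (yi − Out(xi))² for the 1-hidden-layer net, updated by gradient descent. The XOR data, number of hidden units, learning rate, iteration count, and random initialization are all illustrative assumptions.

# Backpropagation = gradient descent on the 1-hidden-layer net's sum-of-squares error.
import numpy as np

def g(h):
    return 1.0 / (1.0 + np.exp(-h))

rng = np.random.default_rng(0)
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)  # constant X0 plus x1, x2
y = np.array([0.0, 1.0, 1.0, 0.0])                                       # XOR of x1 and x2

n_hid = 4
w = rng.normal(scale=0.5, size=(n_hid, X.shape[1]))   # input-to-hidden weights wjk
W = rng.normal(scale=0.5, size=n_hid)                 # hidden-to-output weights Wj
eta = 0.5

for it in range(20000):
    V = g(X @ w.T)                                          # hidden activations vij = g(sum_k wjk xik)
    out = g(V @ W)                                          # network outputs Out(xi)
    d_out = (y - out) * out * (1.0 - out)                   # output-layer "delta"
    d_hid = (d_out[:, None] * W[None, :]) * V * (1.0 - V)   # hidden-layer "deltas"
    W += eta * V.T @ d_out                                  # proportional to -dE/dWj
    w += eta * d_hid.T @ X                                  # proportional to -dE/dwjk
    if it == 0:
        print("initial error:", np.sum((y - out) ** 2))

print("final error:  ", np.sum((y - g(g(X @ w.T) @ W)) ** 2))
# For most random initializations the error falls near zero and the outputs round to XOR.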
Backpropagation Convergence

Convergence to a global minimum is not guaranteed.
• In practice, this is not a problem, apparently.
• Tweaking to find the right number of hidden units, or a useful learning rate η, is more hassle, apparently.
Learning-rate problems

Momentum illustration
Improving Simple Gradient Descent

Newton's method

  E(w + h) = E(w) + hᵀ ∂E/∂w + ½ hᵀ (∂²E/∂w²) h + O(|h|³)
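Completing the thought (this step is standard but is not in the extracted slide text): neglecting the O(|h|³) term and minimizing the resulting quadratic over h, assuming the Hessian is positive definite, gives the Newton step, which jumps straight to the bottom of the local quadratic bowl:

% Newton step: minimizer of the quadratic approximation
h = -\left(\frac{\partial^2 E}{\partial w^2}\right)^{-1} \frac{\partial E}{\partial w}
\qquad\text{i.e.}\qquad
w \leftarrow w - \left(\frac{\partial^2 E}{\partial w^2}\right)^{-1} \frac{\partial E}{\partial w}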
Improving Simple Gradient Descent

Conjugate Gradient

Another method which attempts to exploit the "local quadratic bowl" assumption, but does so while only needing to use ∂E/∂w and not ∂²E/∂w².
BEST GENERALIZATION
Intuitively, you want to use the smallest,
simplest net that seems to fit the data.
What You Should Know

• How to implement multivariate least-squares linear regression.
• Derivation of least squares as max. likelihood estimator of linear coefficients
• The general gradient descent rule
• Multilayer nets
  → The idea behind back prop
  → Awareness of better minimization methods
APPLICATIONS
To Discuss: