
Artificial Neural Networks

Threshold units
Gradient descent
Multilayer networks
Backpropagation
Hidden layer representations
Example: Face Recognition
Advanced topics

Connectionist Models

Consider humans:
  Neuron switching time: ~0.001 second
  Number of neurons: ~10^10
  Connections per neuron: ~10^4 to 10^5
  Scene recognition time: ~0.1 second
  100 inference steps doesn't seem like enough; much of the computation must be parallel

Connectionist Models

Properties of artificial neural nets (ANNs):
  Many neuron-like threshold switching units
  Many weighted interconnections among units
  Highly parallel, distributed processing
  Emphasis on tuning weights automatically

When to Consider Neural Networks

  Input is high-dimensional, discrete or real-valued (e.g., raw sensor input)
  Output is discrete or real-valued
  Output is a vector of values
  Possibly noisy data
  Form of target function is unknown
  Human readability of result is unimportant

When to Consider Neural Networks

Examples:
  Speech phoneme recognition [Waibel]
  Image classification [Kanade, Baluja, Rowley]
  Financial prediction

ALVINN drives 70 mph on highways

[Figure: the ALVINN network. A 30x32 sensor input retina feeds 4 hidden units, which feed 30 output units spanning steering directions from Sharp Left through Straight Ahead to Sharp Right.]

Perceptron

[Figure: a perceptron unit. Inputs x1, ..., xn arrive with weights w1, ..., wn; a fixed input x0 = 1 carries the bias weight w0. The unit computes the weighted sum Σ_{i=0..n} wi xi and thresholds it.]

o(x1, ..., xn) = 1 if w0 + w1 x1 + ... + wn xn > 0, and -1 otherwise.

Sometimes we'll use simpler vector notation:

o(x) = 1 if w · x > 0, and -1 otherwise.
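The threshold output above translates directly into code. A minimal sketch (the function name and the list-based weight layout are illustrative choices, not from the slides):

```python
def perceptron_output(w, x):
    """Perceptron output: 1 if w0 + w1*x1 + ... + wn*xn > 0, else -1.

    w holds n+1 weights, with w[0] the bias weight w0 (paired with
    the fixed input x0 = 1); x holds the n inputs x1..xn.
    """
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else -1
```

For example, with weights w = [-0.8, 0.5, 0.5] the unit outputs 1 only when both binary inputs are 1.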

Decision Surface of a Perceptron

[Figure: (a) a set of + and - points in the (x1, x2) plane that a single line separates; (b) a set of + and - points (the XOR pattern) that no single line can separate.]

Represents some useful functions:
  What weights represent g(x1, x2) = AND(x1, x2)?

But some functions are not representable, e.g., those that are not linearly separable. We will need more complicated models to represent these.

Perceptron training rule

wi <- wi + Δwi, where Δwi = η (t - o) xi

Where:
  t = c(x) is the target value
  o is the perceptron output
  η is a small constant (e.g., 0.1) called the learning rate

Perceptron training rule

Can prove it will converge if:
  Training data is linearly separable
  η is sufficiently small
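The training rule can be sketched as a small loop; the function name, default learning rate, and epoch cap are illustrative assumptions:

```python
def train_perceptron(examples, eta=0.1, epochs=100):
    """Perceptron training rule: wi <- wi + eta * (t - o) * xi.

    examples: list of (x, t) pairs with target t in {-1, +1};
    w[0] is the bias weight w0 (its input is the constant x0 = 1).
    """
    n = len(examples[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        for x, t in examples:
            o = 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else -1
            w[0] += eta * (t - o)          # update for the bias, since x0 = 1
            for i in range(n):
                w[i + 1] += eta * (t - o) * x[i]
    return w
```

On a linearly separable set such as AND, this loop settles on weights that classify every example correctly, consistent with the convergence claim above.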


Gradient Descent

To understand, consider the simpler linear unit, where

o = w0 + w1 x1 + ... + wn xn

Let's learn wi's that minimize the squared error

E[w] ≡ (1/2) Σ_{d∈D} (td - od)²

where D is the set of training examples. This defines the loss function that you want to minimize!


Gradient Descent

[Figure: the error surface E[w] over weights w0 and w1, a bowl-shaped paraboloid with a single global minimum.]

Gradient:

∇E[w] ≡ [ ∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wn ]


Training rule:

Δw = -η ∇E[w]

i.e., for each component:

Δwi = -η ∂E/∂wi

Update: wi <- wi + Δwi


Gradient Descent

Derivation:

∂E/∂wi = ∂/∂wi (1/2) Σ_d (td - od)²
        = (1/2) Σ_d ∂/∂wi (td - od)²
        = (1/2) Σ_d 2 (td - od) ∂/∂wi (td - od)
        = Σ_d (td - od) ∂/∂wi (td - w · xd)
        = Σ_d (td - od) (-xi,d)


Gradient Descent

GRADIENT-DESCENT(training_examples, η)

Each training example is a pair ⟨x, t⟩, where x is the vector of input values and t is the target output value. η is the learning rate (e.g., 0.05).

Initialize each wi to some small random value
Until the termination condition is met, Do
  Initialize each Δwi to zero
  For each ⟨x, t⟩ in training_examples, Do
    Input the instance x to the unit and compute the output o
    For each linear unit weight wi, Do
      Δwi <- Δwi + η (t - o) xi
  For each linear unit weight wi, Do
    wi <- wi + Δwi
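The batch algorithm can be transcribed almost line for line; the function name, fixed small initial weights, and iteration cap below are illustrative choices:

```python
def gradient_descent(examples, eta=0.05, iterations=1000):
    """Batch gradient descent for a linear unit o = w0 + w1*x1 + ... + wn*xn.

    examples: list of (x, t) pairs; w[0] is the bias weight w0.
    Weight changes are accumulated over all examples before updating.
    """
    n = len(examples[0][0])
    w = [0.01] * (n + 1)         # stand-in for "small random value", fixed for reproducibility
    for _ in range(iterations):
        delta = [0.0] * (n + 1)  # initialize each Delta-wi to zero
        for x, t in examples:
            o = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            delta[0] += eta * (t - o)
            for i in range(n):
                delta[i + 1] += eta * (t - o) * x[i]
        for i in range(n + 1):
            w[i] += delta[i]
    return w
```

On noiseless data drawn from t = 2x + 1, the weights converge to w0 = 1 and w1 = 2, the minimum-squared-error hypothesis.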


Summary

Perceptron training rule guaranteed to succeed if:
  Training examples are linearly separable
  Sufficiently small learning rate

Linear unit training rule uses gradient descent:
  Guaranteed to converge to the hypothesis with minimum squared error
  Given sufficiently small learning rate
  Even when training data contains noise
  Even when training data not separable by H


Incremental (Stochastic) Gradient Descent

Batch mode Gradient Descent:
  Do until satisfied:
    1. Compute the gradient ∇E_D[w]
    2. w <- w - η ∇E_D[w]

Incremental mode Gradient Descent:
  Do until satisfied:
    For each training example d in D:
      1. Compute the gradient ∇E_d[w]
      2. w <- w - η ∇E_d[w]


E_D[w] ≡ (1/2) Σ_{d∈D} (td - od)²

E_d[w] ≡ (1/2) (td - od)²

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η is made small enough.
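Incremental mode updates the weights after every single example rather than after a full pass over D. A minimal sketch for the linear unit (names and constants are illustrative):

```python
def incremental_gradient_descent(examples, eta=0.02, epochs=2000):
    """Incremental (stochastic) gradient descent for a linear unit.

    The weights move after each training example d, following the
    gradient of E_d[w] alone; examples are (x, t) pairs, w[0] is the bias.
    """
    n = len(examples[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        for x, t in examples:
            o = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            w[0] += eta * (t - o)          # per-example update, not accumulated
            for i in range(n):
                w[i + 1] += eta * (t - o) * x[i]
    return w
```

With a small η it reaches the same minimum-error weights as the batch version on noiseless data from t = 2x + 1.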


Multilayer Networks of Sigmoid Units

[Figure: a multilayer sigmoid network for speech recognition, mapping two formant-frequency inputs F1 and F2 to the vowel sounds in "head", "hid", "who'd", and "hood"; the learned decision regions in the (F1, F2) plane are highly nonlinear.]



Sigmoid Unit

[Figure: a sigmoid unit. Inputs x1, ..., xn arrive with weights w1, ..., wn; a fixed input x0 = 1 carries the bias weight w0. The unit computes net = Σ_{i=0..n} wi xi and outputs o = σ(net) = 1 / (1 + e^{-net}).]

σ(x) = 1 / (1 + e^{-x}) is the sigmoid function.

Nice property: dσ(x)/dx = σ(x)(1 - σ(x))

We can derive gradient descent rules to train:
  One sigmoid unit
  Multilayer networks of sigmoids: Backpropagation
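The sigmoid and the derivative property translate directly into code, and the property is easy to check numerically (the function names are mine):

```python
import math

def sigmoid(x):
    """The sigmoid function: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """Uses the nice property: d(sigma)/dx = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)
```

For instance, sigmoid(0) is 0.5 and sigmoid_derivative(0) is 0.25, and a central finite difference of sigmoid agrees with sigmoid_derivative at any point.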


Error Gradient for a Sigmoid Unit

∂E/∂wi = ∂/∂wi (1/2) Σ_{d∈D} (td - od)²
        = (1/2) Σ_d ∂/∂wi (td - od)²
        = (1/2) Σ_d 2 (td - od) ∂/∂wi (td - od)
        = Σ_d (td - od) (-∂od/∂wi)
        = -Σ_d (td - od) (∂od/∂net_d)(∂net_d/∂wi)

But we know:

∂od/∂net_d = ∂σ(net_d)/∂net_d = od (1 - od)

∂net_d/∂wi = ∂(w · xd)/∂wi = xi,d

So:

∂E/∂wi = -Σ_{d∈D} (td - od) od (1 - od) xi,d


Backpropagation Algorithm

Initialize all weights to small random numbers.
Until satisfied, Do
  For each training example, Do
    1. Input the training example to the network and compute the network outputs
    2. For each output unit k:
         δk <- ok (1 - ok)(tk - ok)
    3. For each hidden unit h:
         δh <- oh (1 - oh) Σ_{k∈outputs} w_{h,k} δk
    4. Update each network weight wi,j:
         wi,j <- wi,j + Δwi,j, where Δwi,j = η δj xi,j
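The algorithm above, specialized to one hidden layer, can be sketched as follows; the network shape, learning rate, epoch count, and seeding are illustrative assumptions, not part of the slides:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(w_h, w_o, x):
    """Forward pass: hidden activations, then outputs (index 0 of each row is a bias)."""
    h = [sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) for w in w_h]
    return [sigmoid(w[0] + sum(wi * hi for wi, hi in zip(w[1:], h))) for w in w_o]

def train_backprop(examples, n_hidden=3, eta=0.5, epochs=2000, seed=0):
    """Backpropagation for a one-hidden-layer sigmoid network.

    examples: list of (x, t) pairs; x and t are lists of numbers in [0, 1].
    Returns (hidden weights, output weights), each a list of weight rows.
    """
    rng = random.Random(seed)
    n_in, n_out = len(examples[0][0]), len(examples[0][1])
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_o = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in examples:
            # 1. forward pass
            h = [sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) for w in w_h]
            o = [sigmoid(w[0] + sum(wi * hi for wi, hi in zip(w[1:], h))) for w in w_o]
            # 2. output deltas: delta_k = o_k (1 - o_k)(t_k - o_k)
            d_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
            # 3. hidden deltas: delta_h = o_h (1 - o_h) sum_k w_{h,k} delta_k
            d_h = [h[j] * (1 - h[j]) * sum(w_o[k][j + 1] * d_o[k] for k in range(n_out))
                   for j in range(n_hidden)]
            # 4. weight updates: w_ij <- w_ij + eta * delta_j * x_ij
            for k in range(n_out):
                w_o[k][0] += eta * d_o[k]
                for j in range(n_hidden):
                    w_o[k][j + 1] += eta * d_o[k] * h[j]
            for j in range(n_hidden):
                w_h[j][0] += eta * d_h[j]
                for i in range(n_in):
                    w_h[j][i + 1] += eta * d_h[j] * x[i]
    return w_h, w_o
```

On the XOR examples, which no perceptron can represent, a net this small usually finds weights driving the error near zero, and in any case training lowers the squared error relative to the initial random weights.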


More on Backpropagation

Gradient descent over the entire network weight vector
Easily generalized to arbitrary directed graphs
Will find a local, not necessarily global, error minimum
  In practice, often works well (can run multiple times)
Often include weight momentum:
  Δwi,j(n) = η δj xi,j + α Δwi,j(n - 1)
Minimizes error over training examples
  Will it generalize well to subsequent examples?


Training can take thousands of iterations (slow!), but using the network after training is very fast.


Learning Hidden Layer Representations

[Figure: an 8x3x8 network; eight input units feed three hidden units, which feed eight output units.]

A target function:

Input      Output
10000000   10000000
01000000   01000000
00100000   00100000
00010000   00010000
00001000   00001000
00000100   00000100
00000010   00000010
00000001   00000001

Can this be learned??


Learning Hidden Layer Representations

A network: [Figure: the same 8x3x8 network of inputs, hidden units, and outputs.]

Learned hidden layer:

Input      Hidden Values       Output
10000000   .89  .04  .08       10000000
01000000   .01  .11  .88       01000000
00100000   .01  .97  .27       00100000
00010000   .99  .97  .71       00010000
00001000   .03  .05  .02       00001000
00000100   .22  .99  .99       00000100
00000010   .80  .01  .98       00000010
00000001   .60  .94  .01       00000001
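One way to read the learned hidden values: rounding each activation to 0 or 1 gives every input its own 3-bit code, i.e., the network has invented a binary-like encoding of the eight inputs. A quick check over the values from the table:

```python
# Hidden unit values from the learned-hidden-layer table, one row per input pattern.
hidden = [
    (.89, .04, .08), (.01, .11, .88), (.01, .97, .27), (.99, .97, .71),
    (.03, .05, .02), (.22, .99, .99), (.80, .01, .98), (.60, .94, .01),
]
codes = [tuple(round(v) for v in h) for h in hidden]
print(len(set(codes)))  # all eight inputs receive distinct 3-bit codes
```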


Training

[Figure: sum of squared errors for each output unit versus training iterations (0 to 2500); the error of every output unit falls from about 0.9 toward 0.]


Training

[Figure: hidden unit encoding for input 01000000 versus training iterations (0 to 2500); the three hidden unit activations gradually settle toward 0 or 1.]


Training

[Figure: weights from the eight inputs to one hidden unit versus training iterations (0 to 2500); the weights start near 0 and spread to values between about -5 and 4.]


Convergence of Backpropagation

Gradient descent to some local minimum:
  Perhaps not the global minimum...
  Add momentum
  Stochastic gradient descent
  Train multiple nets with different initial weights

Nature of convergence:
  Initialize weights near zero
  Therefore, initial networks are near-linear
  Increasing non-linearity becomes possible as training progresses


Expressive Capabilities of ANNs

Boolean functions:
  Every Boolean function can be represented by a network with a single hidden layer, but it might require a number of hidden units exponential in the number of inputs.

Continuous functions:
  Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989].


Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].


Overfitting in ANNs

[Figure: error versus weight updates (example 1). Over 20000 weight updates, training set error decreases steadily toward 0.002, while validation set error decreases at first and then begins to rise: the network is overfitting.]


[Figure: error versus weight updates (example 2). Over 6000 weight updates, training set error falls toward 0; validation set error falls, rises for a while, then falls again.]


Neural Nets for Face Recognition

[Figure: a network with 30x32 image inputs feeding a small hidden layer, with four output units for head pose: left, straight (strt), right (rght), and up.]


Typical input images. 90% accurate at learning head pose, and at recognizing 1 of 20 faces.


Learned Hidden Unit Weights

[Figure: the same 30x32-input network, with the learned weights from the input image to each hidden unit visualized as images; output units for left, straight (strt), right (rght), and up.]


Alternative Error Functions

Penalize large weights:

E(w) ≡ (1/2) Σ_{d∈D} Σ_{k∈outputs} (tkd - okd)² + γ Σ_{i,j} wji²

Train on target slopes as well as values:

E(w) ≡ (1/2) Σ_{d∈D} Σ_{k∈outputs} [ (tkd - okd)² + μ Σ_{j∈inputs} ( ∂tkd/∂x_d^j - ∂okd/∂x_d^j )² ]
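The weight penalty changes the gradient in a simple way: the γ Σ wji² term contributes an extra 2 γ wji to ∂E/∂wji, shrinking every weight toward zero on each update. A sketch of one such update step (the function name and defaults are illustrative):

```python
def weight_decay_step(w, grad, eta=0.1, gamma=0.01):
    """One gradient step on E(w) = squared-error term + gamma * sum(wji^2).

    grad holds the squared-error part of dE/dw for each weight; the
    penalty adds 2 * gamma * w, pulling each weight toward zero.
    """
    return [wi - eta * (gi + 2 * gamma * wi) for wi, gi in zip(w, grad)]
```

With a zero error gradient, the step still shrinks each weight by the factor (1 - 2 η γ), which is the weight-decay effect.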


Tie together weights: e.g., in a phoneme recognition network.


Recurrent Networks

[Figure: (a) a feedforward network mapping input x(t) to output y(t + 1); (b) a recurrent network whose output y(t + 1) depends on x(t) and on a context layer c(t) fed back from the previous time step; (c) the same recurrent network unfolded in time over steps t, t - 1, and t - 2, with y(t) feeding c(t), y(t - 1) feeding c(t - 1), and so on.]

