5.1 Artificial Neural Networks
Luc De Raedt
Threshold units
Gradient descent
Multilayer networks
Backpropagation
Hidden layer representations
Example: Face Recognition
Advanced topics
Connectionist Models
Consider humans:
  Neuron switching time: ~0.001 second
  Number of neurons: ~10^10
  Connections per neuron: ~10^4 to 10^5
  Scene recognition time: ~0.1 second
  0.1 s / 0.001 s allows only ~100 sequential steps, and 100 inference steps doesn't seem like enough
  → much parallel computation
Connectionist Models
Properties of artificial neural nets (ANNs):
  Many neuron-like threshold switching units
  Many weighted interconnections among units
  Highly parallel, distributed processing
  Emphasis on tuning weights automatically
Perceptron
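The perceptron slide's equations did not survive this extraction. As a hedged sketch of the standard perceptron, a threshold unit o(x) = 1 if w . x > 0 else -1, trained with the rule w_i <- w_i + eta * (t - o) * x_i; variable names are mine, not necessarily the slide's:

```python
def perceptron_output(w, x):
    # Threshold unit: o(x) = 1 if w . x > 0, else -1
    # (x is assumed to start with a constant 1 for the bias weight w[0])
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s > 0 else -1

def train_perceptron(examples, n_features, eta=0.1, epochs=50):
    # Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i
    w = [0.0] * (n_features + 1)          # +1 for the bias weight
    for _ in range(epochs):
        for x, t in examples:
            x = [1.0] + list(x)           # prepend bias input
            o = perceptron_output(w, x)
            for i in range(len(w)):
                w[i] += eta * (t - o) * x[i]
    return w

# Linearly separable example: logical OR with targets in {-1, +1}
data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train_perceptron(data, n_features=2)
```

For linearly separable data such as OR, the rule converges to a separating weight vector.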
Gradient Descent
  Gradient
  Training Rule
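The gradient-descent slides themselves are equation images lost in this extraction. For a linear unit o = w . x with squared error E(w) = 1/2 * sum_d (t_d - o_d)^2, the standard derivation gives the batch training rule delta_w_i = eta * sum_d (t_d - o_d) * x_{d,i}; a minimal sketch under that assumption:

```python
def gradient_descent_step(w, examples, eta=0.05):
    # Batch delta rule for a linear unit o = w . x:
    # delta_w_i = eta * sum_d (t_d - o_d) * x_{d,i}
    delta = [0.0] * len(w)
    for x, t in examples:
        o = sum(wi * xi for wi, xi in zip(w, x))   # linear output
        for i, xi in enumerate(x):
            delta[i] += eta * (t - o) * xi         # accumulate over all examples
    return [wi + di for wi, di in zip(w, delta)]
```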
Summary Perceptron
Applicable when
Hypothesis space continuous
Error function can be differentiated w.r.t.
parameters
Disadvantages
Convergence can be slow
Can get stuck into local minima
Sometimes stochastic gradient descent can avoid
the local minima
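The stochastic variant from the last bullet updates the weights after each example rather than after a full pass over the training set; a sketch, same conventions as the batch version above:

```python
def stochastic_gradient_descent_step(w, examples, eta=0.05):
    # Incremental (stochastic) delta rule: update after every example.
    # The per-example noise is what can help escape shallow local minima.
    for x, t in examples:
        o = sum(wi * xi for wi, xi in zip(w, x))
        w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w
```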
Sigmoid Unit
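The sigmoid-unit slides are likewise image-only here; the standard content is sigma(x) = 1 / (1 + e^(-x)) with the convenient derivative sigma'(x) = sigma(x) * (1 - sigma(x)):

```python
import math

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    # d sigma / dx = sigma(x) * (1 - sigma(x)),
    # which is what keeps backpropagation's gradients cheap to compute
    s = sigmoid(x)
    return s * (1.0 - s)
```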
Backpropagation
Derivation of backprop
More on Backpropagation
Gradient descent over entire network weight vector
Easily generalized to arbitrary directed graphs
Will find a local, not necessarily global error minimum
In practice, often works well (can run multiple times)
Often include weight momentum
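To make the bullets above concrete, a minimal one-hidden-layer backpropagation sketch including the momentum update delta_w(n) = eta * delta * x + alpha * delta_w(n-1); the network shape and names are my assumptions, following the usual textbook presentation rather than this deck's exact notation:

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_backprop(examples, n_in, n_hidden, eta=0.3, alpha=0.9, epochs=1000):
    # One hidden layer of sigmoid units, one sigmoid output unit.
    # Weights initialized near zero, so the initial network is near-linear.
    rnd = random.Random(0)
    w_h = [[rnd.uniform(-0.05, 0.05) for _ in range(n_in + 1)]
           for _ in range(n_hidden)]
    w_o = [rnd.uniform(-0.05, 0.05) for _ in range(n_hidden + 1)]
    dw_h_prev = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
    dw_o_prev = [0.0] * (n_hidden + 1)

    for _ in range(epochs):
        for x, t in examples:
            x = [1.0] + list(x)                      # bias input
            # Forward pass
            h = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_h]
            hb = [1.0] + h                           # bias for output layer
            o = sigmoid(sum(w * hi for w, hi in zip(w_o, hb)))
            # Backward pass: delta terms for squared error
            delta_o = o * (1 - o) * (t - o)
            delta_h = [h[j] * (1 - h[j]) * w_o[j + 1] * delta_o
                       for j in range(n_hidden)]
            # Weight updates with momentum: dw(n) = eta*delta*x + alpha*dw(n-1)
            for i in range(n_hidden + 1):
                dw = eta * delta_o * hb[i] + alpha * dw_o_prev[i]
                w_o[i] += dw
                dw_o_prev[i] = dw
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    dw = eta * delta_h[j] * x[i] + alpha * dw_h_prev[j][i]
                    w_h[j][i] += dw
                    dw_h_prev[j][i] = dw
    return w_h, w_o
```

Weights are initialized near zero on purpose; the convergence slide below explains why.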
Training
Convergence of Backpropagation
Gradient descent to some local minimum
Nature of convergence:
  Initialize weights near zero
  Therefore, initial networks are near-linear (checked in the sketch below)
  Increasingly non-linear functions become possible as training progresses
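The near-linearity claim can be checked directly: near 0 the sigmoid is well approximated by its tangent line sigma(x) ≈ 0.5 + x/4, so a network whose near-zero weights keep every unit's net input small computes an almost-linear function:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# With weights near zero, each unit's net input stays near zero,
# where the sigmoid is well approximated by its tangent line 0.5 + x/4.
for x in (-0.5, -0.1, 0.0, 0.1, 0.5):
    print(x, sigmoid(x), 0.5 + x / 4)   # the two columns nearly agree
```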
Overfitting in ANNs
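The overfitting slides are plot-only in this extraction. The usual remedy such plots motivate is early stopping on a held-out validation set; whether this deck presents exactly that is my assumption. A minimal sketch, where train_one_epoch and validation_error are hypothetical callbacks:

```python
def train_with_early_stopping(net, train_one_epoch, validation_error,
                              max_epochs=10000, patience=20):
    # train_one_epoch and validation_error are hypothetical callbacks:
    # one backprop pass over the training set, and error on held-out data.
    best_err, best_net, waited = float("inf"), net, 0
    for _ in range(max_epochs):
        net = train_one_epoch(net)
        err = validation_error(net)
        if err < best_err:
            best_err, best_net, waited = err, net, 0   # new best snapshot
        else:
            waited += 1
            if waited >= patience:      # validation error stopped improving
                break
    return best_net                     # weights at the validation minimum
```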
Face Recognition
Expressive Capabilities of ANNs
Boolean functions:
  Every Boolean function can be represented by a network with a single hidden layer, but this might require a number of hidden units exponential in the number of inputs (XOR is sketched after this list)
Continuous functions:
  Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
Any function:
  Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
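As a concrete instance of the Boolean-function claim, XOR is not linearly separable (no single threshold unit computes it), yet one hidden layer of two threshold units suffices; the weights below are hand-picked for illustration:

```python
def step(x):
    # Threshold unit: fires (1) when its net input is positive
    return 1 if x > 0 else 0

def xor_net(x1, x2):
    # Hidden unit A: fires on x1 OR x2   (net = x1 + x2 - 0.5)
    # Hidden unit B: fires on x1 AND x2  (net = x1 + x2 - 1.5)
    a = step(x1 + x2 - 0.5)
    b = step(x1 + x2 - 1.5)
    # Output: A AND NOT B, i.e. "or but not and" = XOR
    return step(a - b - 0.5)

assert [xor_net(*p) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]
```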
Face Recognition