09 Artificial Neural Networks and Classification


Artificial Neural Networks and Classification

An artificial neural network is a simple brain-like device that can learn by adjusting the connections between its neurons.

The brain as a computer

The brain's architecture

Human (and animal) brains have a computer architecture which consists of:
  a complex web of about 10^11 highly inter-connected processing units called neurons
Processing involves:
  signals being sent from neuron to neuron
  by a complicated electrochemical reaction
  in a highly parallel manner

The neuron

A neuron is a nerve cell consisting of:
  a cell body (soma) containing a nucleus
  branching out from the body:
    a number of fibres called dendrites
    a single long fibre called the axon, a centimetre or longer
The axon branches and connects to the dendrites of other neurons
  the connecting junction is called the synapse
  each neuron connects to between a dozen and 100,000 other neurons

A real neuron

(Figure of a real neuron omitted.)

Signal propagation

Chemical transmitter substances are released from the synapses and enter the dendrites
This raises or lowers the electrical potential of the cell body
  synapses that raise the potential are called excitatory
  those that lower it are inhibitory
When a threshold is reached, an electrical pulse, the action potential, is sent down the axon (firing)
This spreads into the axon's branches, reaching synapses and releasing transmitters into the cell bodies of other neurons

Brain versus computer

Storage capacity
  the brain has more neurons than a computer has bits
Speed
  the brain is much slower than a computer
  a neuron has a firing speed of 10^-3 secs compared to a computer switching speed of 10^-11 secs
The brain relies on massive parallelism for performance:
  you can recognise your mother in 0.1 secs
The brain is more suited to intelligent processing and learning
  it is good at forming associations
    this seems to be the basis of learning
  it is more fault tolerant
    neurons die all the time and computation continues
  task performance exhibits graceful degradation
    in contrast to the brittleness of computers

Artificial neural networks

What is an artificial neural network?

An artificial neural network (ANN) is a grossly oversimplified version of the brain's architecture
It has far fewer neurons
  several hundred or a few thousand
It has a much simpler internal structure
  the firing mechanism is less complex
  the signals consist of real numbers passed from one neuron to another

How does a network behave?

Most ANNs can be regarded as input-output devices
  numerical input is propagated through the network from neuron to neuron until it reaches the output
The connections between neurons have numerical weights which are used to combine the signals reaching a neuron
Learning involves establishing the weight values (strengths) to achieve a particular goal
  in theory the strengths could be programmed rather than learnt, but for the most part this would be impossibly tedious

Designing a network

Creating an ann requires the following to be


specified

Network topology

Transfer function

the number of units


the pattern of interconnectivity amongst them
the mathematical type of the weights
This combines the inputs impinging on the unit and produces
the unit activation level which then becomes the output signal

Representation for examples


Learning law

This states how weights are to be modified to achieve the


learning goal

Network topology - neurons and layers

Specifies how many nodes (neurons) there are and how they are connected
  in a fully connected network each node is connected to every other
Often networks are organised in layers (slabs) with no connections between nodes within a layer - only across layers
  the first layer is the input layer; the last, the output layer
  layers between the input and output layers are called hidden
The input units typically do not carry out internal computation, i.e. they do not have transfer functions
  they merely pass on their signal values
The output units send their signals directly to the outside world

Network topology - weights

Weights are usually real-valued
At the start of learning, their values are often set randomly
If there is a connection from a to b then a has influence over the activation value of b
Excitatory influence
  high activation in unit a contributes to high activation in unit b
  is modelled by a positive weight
Inhibitory influence
  high activation in unit a contributes to low activation in unit b
  is modelled by a negative weight

Network topology - flow of computation

Although connections are uni-directional, some networks have pairs of units connected in both directions
  there is a connection from unit a to unit b and one back from unit b to unit a
Networks in which there is no looping back of connections are called feed-forward
  signals are 'fed forward' from input through to output
Networks in which outputs are eventually fed back into the network as inputs are called recurrent

Examples of feed-forward topologies

(Figure: a single layer network with a 4-node input layer and a 2-node output layer, and a two layer network with one hidden layer: a 6-node input layer, a 4-node hidden layer and a 1-node output layer.)

The transfer function - combining input signals

The input signals to a neuron must be combined into a single value, the activation level, to be output
Usually this transfer takes place in two stages
  first the inputs are combined
  then the result is passed through another function to produce the output
The most common method of combination is the weighted sum:

  sum = w1*x1 + ... + wn*xn

Here xi is the signal and wi the weight on connection i, and n is the number of input signals
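As a concrete illustration (the function name is my own, not from the slides), the weighted sum is a one-liner in Python:

```python
def weighted_sum(weights, signals):
    """Combine the input signals to a neuron as sum = w1*x1 + ... + wn*xn."""
    return sum(w * x for w, x in zip(weights, signals))
```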

The transfer function - the activation level

The weighted sum is passed through an activation function to produce the output signal (activation level) y
Commonly used functions are:

Linear
  the output is just the weighted sum: y = sum
Linear threshold (step function)
  the weighted sum is thresholded at a value c: if it is less than c, then y = 0, otherwise y = 1
Sigmoid response (logistic) function
  a continuous version of the step function which produces graceful degradation around the 'step' at c:

  y = 1 / (1 + e^-(sum - c))
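A minimal Python sketch of these three activation functions, with the threshold/centre c passed in as a parameter (the function names are illustrative):

```python
import math

def linear(s):
    """Linear activation: the output is just the weighted sum."""
    return s

def step(s, c):
    """Linear threshold: 0 below the threshold c, 1 at or above it."""
    return 0 if s < c else 1

def sigmoid(s, c):
    """Logistic function: a smooth version of the step at c."""
    return 1.0 / (1.0 + math.exp(-(s - c)))
```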

Activation function graphs

(Figure: graphs of the linear, step and sigmoid activation functions; the step and sigmoid outputs lie between 0 and 1.)
Example

Step activation function with threshold c = 0.38

(Figure: a unit receiving input signals 0, 1 and 1 on connections with weights w1 = 0.3, w2 = 0.5 and w3 = -0.1.)

Sum = 0.3*0 + 0.5*1 - 0.1*1 = 0.4
Activation level y = 1 (since 0.4 > 0.38)
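Using the hypothetical weighted_sum and step helpers sketched above, the slide's arithmetic checks out:

```python
weights = [0.3, 0.5, -0.1]
signals = [0, 1, 1]

s = weighted_sum(weights, signals)   # 0.3*0 + 0.5*1 - 0.1*1 = 0.4
y = step(s, 0.38)                    # 0.4 > 0.38, so y = 1
print(s, y)                          # 0.4 1
```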

Learning with ANNs

What tasks can a network learn?

Networks can be trained for the following tasks:
  Classification
  Pattern association
    e.g. English verbs mapped to their past tense
  Content addressable/associative memory
    e.g. can recall/restore a whole image when provided with a part of it
These all involve mappings
  the mapping of input to output is determined by the settings of all the weights in the network (the weight vector) - this is what is learnt
  the network node configuration together with the weight vector is the knowledge structure

Learning laws

Learning provides a means of finding the weight settings to implement a mapping
This is only possible if the network is capable of representing the mapping
  the more complex the mapping, the larger the network that will be required, including a greater number of hidden layers
Initially, weights are set at random and altered in response to the training data
A regime for weight alteration to achieve the required mapping is called a learning law
Even if a network can represent a mapping, a particular learning law may not be able to learn it

Representation of training examples

Unlike decision trees, which handle both discrete and continuous (numeric) attributes, ANNs can handle only the latter
All discrete attributes must be converted (encoded) to be numeric
  this also applies to the class
Several encodings are available and the choice affects the success of learning

Description attributes

It is desirable for all attributes to have values in the same range
  this is usually taken to be 0 to 1
This is achieved for numeric attributes using normalisation:

  value <- (value - min value) / (max value - min value)

For discrete attributes one can use:
  1-out-of-N encoding (distributed)
    N binary (0-1) units are used to represent the N values of the attribute, one for each
  local encoding
    values are mapped to numbers in the range 0 to 1
    more suited to ordered values
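A short sketch of these encodings in Python; the function names and argument conventions are assumptions, not from the slides:

```python
def normalise(value, min_value, max_value):
    """Rescale a numeric attribute into the range 0 to 1."""
    return (value - min_value) / (max_value - min_value)

def one_out_of_n(value, values):
    """Distributed (1-out-of-N) encoding: one binary unit per possible value."""
    return [1 if v == value else 0 for v in values]

def local_encoding(value, ordered_values):
    """Local encoding: map the i-th of N ordered values to i/(N-1), onto 0..1."""
    return ordered_values.index(value) / (len(ordered_values) - 1)

print(normalise(21, -50, 150))                               # 0.355
print(one_out_of_n('sunny', ['sunny', 'overcast', 'rain']))  # [1, 0, 0]
print(local_encoding('normal', ['low', 'normal', 'high']))   # 0.5
```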

Class attribute

1-out-of-N or local encoding can be used for the class


The network output after learning is usually only
approximate

e.g. in a binary class problem with classes represented by 0 and 1,


the network might output 0.95 and this would be taken as 1

Using 1-out-of-N encoding allows for a probabilistic


interpretation, e.g.

classes for car domain: unacc, acc, good, vgood


can be represented with four binary units
e.g. acc (0, 1, 0, 0)

Output of (0.2, 0.7 ,0.1, 0) can be interpreted as most likely acc


but could be unacc or (less likely) good
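Reading a class from an approximate 1-out-of-N output amounts to picking the unit with the highest activation; a minimal sketch:

```python
def decode_class(outputs, classes):
    """Interpret a 1-out-of-N output vector as the class with the highest activation."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return classes[best]

classes = ['unacc', 'acc', 'good', 'vgood']
print(decode_class([0.2, 0.7, 0.1, 0.0], classes))  # acc
```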

Network configuration

The encoding of the training examples affects the network size
The input layer will have:
  one unit for each numeric attribute
  one for each locally encoded discrete attribute
  one for each binary discrete attribute
  k for each distributed encoding of a discrete attribute, where the attribute has k > 2 values
Usually there are a small number of hidden layers (one or two)

Pyramid structure

Hidden layers are used to reduce the dimensionality of the input
A network has a pyramid structure if:
  the first hidden layer has fewer nodes than the input layer
  each hidden layer has fewer nodes than its predecessor
  the output layer has the fewest
The pyramid structure facilitates learning
  in classification, each hidden layer appears to partially classify the examples until the actual classes are reached in the output layer

The learning process

Classification learning uses a feedback mechanism
An example is fed through the network using the existing weights
The output value is O; the correct output value, i.e. the class in the example, is T (the target)
If O ≠ T, some or all of the weights are changed slightly
The extent of the change usually depends on T - O, called the error

The delta rule

A weight wi on a connection carrying signal xi can be modified by adding an amount Δwi proportional to the error:

  Δwi = η (T - O) xi

where η is the learning rate
  η is a positive constant, usually set at about 0.1 and gradually decreased during learning
The update formula for wi is then:

  wi <- wi + Δwi
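A minimal sketch of a single delta-rule weight update (names such as eta are my own):

```python
def delta_rule_update(weights, signals, target, output, eta=0.1):
    """Return new weights with each wi increased by eta * (T - O) * xi."""
    error = target - output
    return [w + eta * error * x for w, x in zip(weights, signals)]
```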

Training epochs

For each example in the training set:
  the description attribute values are fed as input to the network
  and propagated through to the output
  each weight is updated
This constitutes one epoch or cycle of learning
The process is repeated until it is decided to stop
  many thousands of epochs may be necessary
The final set of weights represents the learned mapping
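Putting the pieces together, a training loop over epochs might look like the following sketch, assuming the weighted_sum, step and delta_rule_update helpers from earlier and a step threshold of 0 (i.e. a bias unit folded into the signals):

```python
def train(examples, weights, epochs, eta=0.1):
    """Run delta-rule learning for a number of epochs over (signals, target) pairs."""
    for _ in range(epochs):
        for signals, target in examples:
            output = step(weighted_sum(weights, signals), 0)   # forward pass
            weights = delta_rule_update(weights, signals, target, output, eta)
    return weights
```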

Worked example - golf domain

Conversion of attributes:

  Attribute     Values
  Outlook       sunny, overcast, rain
  Temperature   -50 to 150 C
  Humidity      low, normal, high
  Windy         true, false
  Class         yes, no

becomes, after encoding (Outlook and Humidity are 1-out-of-N encoded; Temperature is normalised as (T + 50)/200):

  Unit          Values
  Sunny         0, 1
  Overcast      0, 1
  Rain          0, 1
  Temperature   0 to 1, via (T + 50)/200
  Low           0, 1
  Normal        0, 1
  High          0, 1
  Windy         0, 1
  Play golf     1, 0

Network configuration

Use a single layer network (no hidden units) with a step function to illustrate the delta rule
Initialise the weights as shown below and set η = 0.1
The network includes a bias unit with constant input -1 and weight w0

  Input unit    Weight   Initial value
  (bias, -1)    w0        0.3
  Sunny         w1       -0.5
  Overcast      w2       -0.4
  Rain          w3        0.2
  Temperature   w4        0.3
  Low           w5        0.1
  Normal        w6       -0.1
  High          w7       -0.2
  Windy         w8        0.4

Feeding a training example

The first example is:

  (sunny, 21, low, false) : yes

encoded as:

  (1, 0, 0, 0.355, 1, 0, 0, 0) : 1

Calculate the weighted sum and threshold it at 0 (because a bias unit is included):

  Sum = -1*0.3 - 0.5*1 - 0.4*0 + 0.2*0 + 0.3*0.355 + 0.1*1 - 0.1*0 - 0.2*0 + 0.4*0
      = -0.5935

Since -0.5935 < 0, O = 0
T = 1, thus the error is T - O = 1 - 0 = 1
Using the delta rule, change the weights by:

  Δwi = η (T - O) xi = 0.1 * 1 * xi = 0.1 * xi
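The slide's arithmetic can be reproduced directly in Python; the bias input -1 is prepended to the encoded example:

```python
weights = [0.3, -0.5, -0.4, 0.2, 0.3, 0.1, -0.1, -0.2, 0.4]  # w0 (bias), then w1..w8
example = [-1, 1, 0, 0, 0.355, 1, 0, 0, 0]                   # bias input -1, then the encoding
target = 1                                                    # class 'yes'

s = sum(w * x for w, x in zip(weights, example))   # -0.5935
output = 0 if s < 0 else 1                          # O = 0, so the error T - O = 1
weights = [w + 0.1 * (target - output) * x for w, x in zip(weights, example)]
# note the bias weight is updated with x = -1: w0 becomes 0.3 - 0.1 = 0.2,
# while e.g. w1 becomes -0.5 + 0.1 = -0.4
```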

The backpropagation algorithm

Learning in multi-layered networks

Networks with one or more hidden layers are necessary to represent complex mappings
In such a network the basic delta learning law is insufficient
  it only defines how to update the weights of output units (it uses T - O)
To update hidden node weights, we have to define their error
This is achieved by the backpropagation algorithm

The backpropagation process

Inputs are fed through the network in the usual way
  this is the forward pass
Output layer weights are adjusted based on their errors
  then weights in the previous layer are adjusted
  and so on back to the first layer
  this is the backward pass (or backpropagation)
Errors determined in a layer are used to determine those in the previous layer

Illustrating the error contribution

A hidden node is partially credited for the errors in the next layer
  these errors are created in the forward pass

(Figure: a hidden node with output O connected by weights w1, ..., wk to k units in the next layer with errors error_1, ..., error_k.)

  error_contribution = w1 * error_1 + ... + wk * error_k

The backpropagation algorithm

A backpropagation network is:
  a multi-layered feed-forward network
  using the sigmoid response activation function

Backpropagation algorithm:
1. Initialise all network weights to small random numbers
   (between -0.05 and 0.05)
2. begin epoch
     for each training example do:
       propagate the input to the output layer;
       from the output layer, backpropagate the errors;
       update the weights
   end epoch
3. If the termination condition is met, stop; else go to 2
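The slides do not give the weight-update formulas, so the following NumPy sketch rests on common assumptions: one hidden layer, sigmoid units everywhere, squared error, and a fixed number of epochs as the termination condition:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_backprop(X, T, n_hidden, epochs=1000, eta=0.1, seed=0):
    """Backpropagation for a one-hidden-layer feed-forward network."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    # 1. initialise weights to small random numbers
    W1 = rng.uniform(-0.05, 0.05, (n_in, n_hidden))
    W2 = rng.uniform(-0.05, 0.05, (n_hidden, n_out))
    for _ in range(epochs):                        # 2. one epoch per pass
        for x, t in zip(X, T):                     # for each training example
            # forward pass: propagate the input to the output layer
            h = sigmoid(x @ W1)
            o = sigmoid(h @ W2)
            # backward pass: output errors, then errors backpropagated
            # to the hidden layer via the outgoing weights
            delta_o = (t - o) * o * (1 - o)
            delta_h = (W2 @ delta_o) * h * (1 - h)
            # update the weights
            W2 += eta * np.outer(h, delta_o)
            W1 += eta * np.outer(x, delta_h)
    return W1, W2                                  # 3. stop after a fixed number of epochs
```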

Termination conditions

Many thousands of iterations (epochs or cycles) may be necessary to learn a classification mapping
  the more complex the mapping to be learnt, the more cycles will be required
Several termination conditions are used:
  stop after a given number of epochs
  stop when the error on the training examples (or on a separate validation set) falls below some agreed level
Stopping too soon results in underfitting, too late in overfitting

Backpropagation as a search

Learning is a search for a network weight vector that implements the required mapping
The search is hill-climbing (or rather descending), called steepest gradient descent
  the heuristic used is the total of (T - O)^2 over all examples fed in a single cycle
  the weight update produces the greatest fall in overall error for the size of step
As with all hill-climbing, there is the danger of getting stuck in local minima

Problems with the search

The size of step is controlled by the learning rate parameter η
  this must be tuned for individual problems
  if the step is too large, the search becomes inefficient
The error surface tends to have:
  extensive flat areas
  troughs with very little slope
It can be difficult to reduce the error in such regions
  weights have to move large distances and it can be hard to determine the right direction
  high numerical accuracy is required, e.g. 32-bit floating point
On the bright side, there tend to be many global minima and few local minima

The trained network

After learning, a backpropagation network may be used as a classifier:
  descriptions of new examples are fed into the network and the class is read from the output layer
  for 1-out-of-N output representations, exact values of 0 and 1 will not usually be obtained
Sensitivity analysis (using test data) determines which attributes are most important for classification
  an attribute is regarded as important if small changes in its value affect the classification
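One simple way to implement such a sensitivity analysis (an assumption on my part; the slides do not specify the method) is to nudge each attribute in turn across the test set and count how often the predicted class changes. The sketch assumes a predict(x) function returning a class label:

```python
def sensitivity(predict, X_test, delta=0.05):
    """Fraction of test examples whose class changes when attribute i is nudged."""
    n_attrs = len(X_test[0])
    scores = []
    for i in range(n_attrs):
        changed = 0
        for x in X_test:
            x_perturbed = list(x)
            x_perturbed[i] += delta
            if predict(x_perturbed) != predict(x):
                changed += 1
        scores.append(changed / len(X_test))
    return scores   # a higher score marks a more important attribute
```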

Backpropagation versus ID3

These two algorithms are the giants of classification learning
Which is better?
  ... the jury is still out
There are major differences:
  ID3 favours discrete attributes, backprop favours continuous (but each handles both types)
  Backprop handles noise well; by using pruning, so does ID3
  Backprop is much slower than ID3 and may get stuck
  ID3 tells us which attributes are important; backprop does this (to some extent) with sensitivity analysis
  Backprop's learned knowledge structure (the weight vector) is not understandable, whereas an ID3 tree can be comprehended (although this is difficult if the tree is large)
