09 Artificial Neural Networks and Classification


Artificial Neural Networks and Classification

An artificial neural network is a simple brain-like device that can learn by adjusting the connections between its neurons.

The brain as a computer

The brain's architecture

Human (and animal) brains have a computer architecture which consists of:
  a complex web of about 10^11 highly inter-connected processing units called neurons
Processing involves:
  signals being sent from neuron to neuron
  by a complicated electrochemical reaction
  in a highly parallel manner

The neuron

A neuron is a nerve cell consisting of:
  a cell body (soma) containing a nucleus
  branching out from the body:
    a number of fibres called dendrites
    a single long fibre called the axon, a centimetre or longer
The axon branches and connects to the dendrites of other neurons
  the connecting junction is called the synapse
  each neuron connects to between a dozen and 100,000 other neurons

A real neuron

(Figure of a real neuron omitted.)

Signal propagation

Chemical transmitter substances are released from the synapses and enter the dendrites
This raises or lowers the electrical potential of the cell body
  synapses that raise the potential are called excitatory
  those that lower it are inhibitory
When a threshold is reached, an electrical pulse, the action potential, is sent down the axon (firing)
This spreads into the axon's branches, reaching synapses and releasing transmitters into the cell bodies of other neurons

Brain versus computer

Storage capacity
  the brain has more neurons than a computer has bits
Speed
  the brain is much slower than a computer
  a neuron has a firing speed of 10^-3 secs compared to a computer switching speed of 10^-11 secs
The brain relies on massive parallelism for performance:
  you can recognise your mother in 0.1 secs
The brain is more suited to intelligent processing and learning
  it is good at forming associations
    this seems to be the basis of learning
  it is more fault tolerant
    neurons die all the time and computation continues
  task performance exhibits graceful degradation
    in contrast to the brittleness of computers

Artificial neural networks

What is an artificial neural network?

An artificial neural network (ANN) is a grossly oversimplified version of the brain's architecture
It has far fewer neurons
  several hundred or a few thousand
It has a much simpler internal structure
  the firing mechanism is less complex
  the signals consist of real numbers passed from one neuron to another

How does a network behave?

Most ANNs can be regarded as input-output devices
  numerical input is propagated through the network from neuron to neuron until it reaches the output
The connections between neurons have numerical weights which are used to combine the signals reaching a neuron
Learning involves establishing the weight values (strengths) to achieve a particular goal
  in theory the strengths could be programmed rather than learnt, but for the most part this would be impossibly tedious

Designing a network

Creating an ann requires the following to be


specified

Network topology

Transfer function

the number of units


the pattern of interconnectivity amongst them
the mathematical type of the weights
This combines the inputs impinging on the unit and produces
the unit activation level which then becomes the output signal

Representation for examples


Learning law

This states how weights are to be modified to achieve the


learning goal

Network topology - neurons and layers

Specifies how many nodes (neurons) there are and how they are connected
  in a fully connected network each node is connected to every other
Often networks are organised in layers (slabs) with no connections between nodes within a layer - only across layers
  the first layer is the input layer; the last, the output layer
  layers between the input and output layers are called hidden
The input units typically do not carry out internal computation, i.e. they do not have transfer functions
  they merely pass on their signal values
The output units send their signals directly to the outside world

Network topology - weights

Weights are usually real-valued
At the start of learning, their values are often set randomly
If there is a connection from a to b then a has influence over the activation value of b
Excitatory influence
  high activation in unit a contributes to high activation in unit b
  is modelled by a positive weight
Inhibitory influence
  high activation in unit a contributes to low activation in unit b
  is modelled by a negative weight

Network topology - flow of computation

Although connections are uni-directional, some networks have pairs of units connected in both directions
  there is a connection from unit a to unit b and one back from unit b to unit a
Networks in which there is no looping back of connections are called feed-forward
  signals are 'fed forward' from input through to output
Networks in which outputs are eventually fed back into the network as inputs are called recurrent

Examples of feed-forward topologies

(Figure: a single layer network with a 4-node input layer and a 2-node output layer, and a two layer network with one hidden layer: a 6-node input layer, a 4-node hidden layer and a 1-node output layer.)

The transfer function - combining input signals

The input signals to a neuron must be combined into a single value, the activation level, to be output
Usually this transfer takes place in two stages
  first the inputs are combined
  then the result is passed through another function to produce the output
The most common method of combination is the weighted sum:

  sum = w1*x1 + ... + wn*xn

Here xi is the signal and wi the weight on connection i, and n is the number of input signals
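As a concrete illustration (the function name is my own, not from the slides), the weighted sum is a one-liner in Python:

```python
def weighted_sum(weights, signals):
    """Combine the input signals to a neuron as sum = w1*x1 + ... + wn*xn."""
    return sum(w * x for w, x in zip(weights, signals))
```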

The transfer function - the activation level

The weighted sum is passed through an activation function to produce the output signal (activation level) y
Commonly used functions are:

Linear
  the output is just the weighted sum: y = sum
Linear threshold (step function)
  the weighted sum is thresholded at a value c: if it is less than c, then y = 0, otherwise y = 1
Sigmoid response (logistic) function
  a continuous version of the step function which produces graceful degradation around the 'step' at c:

  y = 1 / (1 + e^-(sum - c))
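A minimal Python sketch of these three activation functions, with the threshold/centre c passed in as a parameter (the function names are illustrative):

```python
import math

def linear(s):
    """Linear activation: the output is just the weighted sum."""
    return s

def step(s, c):
    """Linear threshold: 0 below the threshold c, 1 at or above it."""
    return 0 if s < c else 1

def sigmoid(s, c):
    """Logistic function: a smooth version of the step at c."""
    return 1.0 / (1.0 + math.exp(-(s - c)))
```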

Activation function graphs

(Figure: graphs of the linear, step and sigmoid activation functions; the step and sigmoid outputs lie between 0 and 1.)
Example

Step activation function with threshold c = 0.38

(Figure: a unit receiving input signals 0, 1 and 1 on connections with weights w1 = 0.3, w2 = 0.5 and w3 = -0.1.)

Sum = 0.3*0 + 0.5*1 - 0.1*1 = 0.4
Activation level y = 1 (since 0.4 > 0.38)
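Using the hypothetical weighted_sum and step helpers sketched above, the slide's arithmetic checks out:

```python
weights = [0.3, 0.5, -0.1]
signals = [0, 1, 1]

s = weighted_sum(weights, signals)   # 0.3*0 + 0.5*1 - 0.1*1 = 0.4
y = step(s, 0.38)                    # 0.4 > 0.38, so y = 1
print(s, y)                          # 0.4 1
```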

Learning with ANNs

What tasks can a network learn?

Networks can be trained for the following tasks:
  Classification
  Pattern association
    e.g. English verbs mapped to their past tense
  Content addressable/associative memory
    e.g. can recall/restore a whole image when provided with a part of it
These all involve mappings
  the mapping of input to output is determined by the settings of all the weights in the network (the weight vector) - this is what is learnt
  the network node configuration together with the weight vector is the knowledge structure

Learning laws

Learning provides a means of finding the weight settings to implement a mapping
This is only possible if the network is capable of representing the mapping
  the more complex the mapping, the larger the network that will be required, including a greater number of hidden layers
Initially, weights are set at random and altered in response to the training data
A regime for weight alteration to achieve the required mapping is called a learning law
Even if a network can represent a mapping, a particular learning law may not be able to learn it

Representation of training examples

Unlike decision trees, which handle both discrete and continuous (numeric) attributes, ANNs can handle only the latter
All discrete attributes must be converted (encoded) to be numeric
  this also applies to the class
Several encodings are available and the choice affects the success of learning

Description attributes

It is desirable for all attributes to have values in the same range
  this is usually taken to be 0 to 1
This is achieved for numeric attributes using normalisation:

  value <- (value - min value) / (max value - min value)

For discrete attributes one can use:
  1-out-of-N encoding (distributed)
    N binary (0-1) units are used to represent the N values of the attribute, one for each
  local encoding
    values are mapped to numbers in the range 0 to 1
    more suited to ordered values
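A short sketch of these encodings in Python; the function names and argument conventions are assumptions, not from the slides:

```python
def normalise(value, min_value, max_value):
    """Rescale a numeric attribute into the range 0 to 1."""
    return (value - min_value) / (max_value - min_value)

def one_out_of_n(value, values):
    """Distributed (1-out-of-N) encoding: one binary unit per possible value."""
    return [1 if v == value else 0 for v in values]

def local_encoding(value, ordered_values):
    """Local encoding: map the i-th of N ordered values to i/(N-1), onto 0..1."""
    return ordered_values.index(value) / (len(ordered_values) - 1)

print(normalise(21, -50, 150))                               # 0.355
print(one_out_of_n('sunny', ['sunny', 'overcast', 'rain']))  # [1, 0, 0]
print(local_encoding('normal', ['low', 'normal', 'high']))   # 0.5
```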

Class attribute

1-out-of-N or local encoding can be used for the class


The network output after learning is usually only
approximate

e.g. in a binary class problem with classes represented by 0 and 1,


the network might output 0.95 and this would be taken as 1

Using 1-out-of-N encoding allows for a probabilistic


interpretation, e.g.

classes for car domain: unacc, acc, good, vgood


can be represented with four binary units
e.g. acc (0, 1, 0, 0)

Output of (0.2, 0.7 ,0.1, 0) can be interpreted as most likely acc


but could be unacc or (less likely) good
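Reading a class from an approximate 1-out-of-N output amounts to picking the unit with the highest activation; a minimal sketch:

```python
def decode_class(outputs, classes):
    """Interpret a 1-out-of-N output vector as the class with the highest activation."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return classes[best]

classes = ['unacc', 'acc', 'good', 'vgood']
print(decode_class([0.2, 0.7, 0.1, 0.0], classes))  # acc
```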

Network configuration

The encoding of the training examples affects the network size
The input layer will have:
  one unit for each numeric attribute
  one for each locally encoded discrete attribute
  one for each binary discrete attribute
  k for each distributed encoding of a discrete attribute, where the attribute has k > 2 values
Usually there are a small number of hidden layers (one or two)

Pyramid structure

Hidden layers are used to reduce the dimensionality of the input
A network has a pyramid structure if:
  the first hidden layer has fewer nodes than the input layer
  each hidden layer has fewer nodes than its predecessor
  the output layer has the fewest
The pyramid structure facilitates learning
  in classification, each hidden layer appears to partially classify the examples until the actual classes are reached in the output layer

The learning process

Classification learning uses a feedback mechanism
An example is fed through the network using the existing weights
The output value is O; the correct output value, i.e. the class in the example, is T (the target)
If O ≠ T, some or all of the weights are changed slightly
The extent of the change usually depends on T - O, called the error

The delta rule

A weight wi on a connection carrying signal xi can be modified by adding an amount Δwi proportional to the error:

  Δwi = η (T - O) xi

where η is the learning rate
  η is a positive constant, usually set at about 0.1 and gradually decreased during learning
The update formula for wi is then:

  wi <- wi + Δwi
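A minimal sketch of a single delta-rule weight update (names such as eta are my own):

```python
def delta_rule_update(weights, signals, target, output, eta=0.1):
    """Return new weights with each wi increased by eta * (T - O) * xi."""
    error = target - output
    return [w + eta * error * x for w, x in zip(weights, signals)]
```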

Training epochs

For each example in the training set:
  the description attribute values are fed as input to the network
  and propagated through to the output
  each weight is updated
This constitutes one epoch or cycle of learning
The process is repeated until it is decided to stop
  many thousands of epochs may be necessary
The final set of weights represents the learned mapping
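Putting the pieces together, a training loop over epochs might look like the following sketch, assuming the weighted_sum, step and delta_rule_update helpers from earlier and a step threshold of 0 (i.e. a bias unit folded into the signals):

```python
def train(examples, weights, epochs, eta=0.1):
    """Run delta-rule learning for a number of epochs over (signals, target) pairs."""
    for _ in range(epochs):
        for signals, target in examples:
            output = step(weighted_sum(weights, signals), 0)   # forward pass
            weights = delta_rule_update(weights, signals, target, output, eta)
    return weights
```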

Worked example - golf domain

Conversion of attributes:

  Attribute     Values
  Outlook       sunny, overcast, rain
  Temperature   -50 to 150 C
  Humidity      low, normal, high
  Windy         true, false
  Class         yes, no

becomes, after encoding (Outlook and Humidity are 1-out-of-N encoded; Temperature is normalised as (T + 50)/200):

  Unit          Values
  Sunny         0, 1
  Overcast      0, 1
  Rain          0, 1
  Temperature   0 to 1, via (T + 50)/200
  Low           0, 1
  Normal        0, 1
  High          0, 1
  Windy         0, 1
  Play golf     1, 0

Network configuration

Use a single layer network (no hidden units) with a step function to illustrate the delta rule
Initialise the weights as shown below and set η = 0.1
The network includes a bias unit with constant input -1 and weight w0

  Input unit    Weight   Initial value
  (bias, -1)    w0        0.3
  Sunny         w1       -0.5
  Overcast      w2       -0.4
  Rain          w3        0.2
  Temperature   w4        0.3
  Low           w5        0.1
  Normal        w6       -0.1
  High          w7       -0.2
  Windy         w8        0.4

Feeding a training example

The first example is:

  (sunny, 21, low, false) : yes

encoded as:

  (1, 0, 0, 0.355, 1, 0, 0, 0) : 1

Calculate the weighted sum and threshold it at 0 (because a bias unit is included):

  Sum = -1*0.3 - 0.5*1 - 0.4*0 + 0.2*0 + 0.3*0.355 + 0.1*1 - 0.1*0 - 0.2*0 + 0.4*0
      = -0.5935

Since -0.5935 < 0, O = 0
T = 1, thus the error is T - O = 1 - 0 = 1
Using the delta rule, change the weights by:

  Δwi = η (T - O) xi = 0.1 * 1 * xi = 0.1 * xi
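The slide's arithmetic can be reproduced directly in Python; the bias input -1 is prepended to the encoded example:

```python
weights = [0.3, -0.5, -0.4, 0.2, 0.3, 0.1, -0.1, -0.2, 0.4]  # w0 (bias), then w1..w8
example = [-1, 1, 0, 0, 0.355, 1, 0, 0, 0]                   # bias input -1, then the encoding
target = 1                                                    # class 'yes'

s = sum(w * x for w, x in zip(weights, example))   # -0.5935
output = 0 if s < 0 else 1                          # O = 0, so the error T - O = 1
weights = [w + 0.1 * (target - output) * x for w, x in zip(weights, example)]
# note the bias weight is updated with x = -1: w0 becomes 0.3 - 0.1 = 0.2,
# while e.g. w1 becomes -0.5 + 0.1 = -0.4
```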

The backpropagation algorithm

Learning in multi-layered networks

Networks with one or more hidden layers are necessary to represent complex mappings
In such a network the basic delta learning law is insufficient
  it only defines how to update the weights of output units (it uses T - O)
To update hidden node weights, we have to define their error
This is achieved by the backpropagation algorithm

The backpropagation process

Inputs are fed through the network in the usual way
  this is the forward pass
Output layer weights are adjusted based on their errors
  then weights in the previous layer are adjusted
  and so on back to the first layer
  this is the backward pass (or backpropagation)
Errors determined in a layer are used to determine those in the previous layer

Illustrating the error contribution

A hidden node is partially credited for the errors in the next layer
  these errors are created in the forward pass

(Figure: a hidden node with output O connected by weights w1, ..., wk to k units in the next layer with errors error_1, ..., error_k.)

  error_contribution = w1 * error_1 + ... + wk * error_k

The backpropagation algorithm

A backpropagation network is:
  a multi-layered feed-forward network
  using the sigmoid response activation function

Backpropagation algorithm:
1. Initialise all network weights to small random numbers
   (between -0.05 and 0.05)
2. begin epoch
     for each training example do:
       propagate the input to the output layer;
       from the output layer, backpropagate the errors;
       update the weights
   end epoch
3. If the termination condition is met, stop; else go to 2
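The slides do not give the weight-update formulas, so the following NumPy sketch rests on common assumptions: one hidden layer, sigmoid units everywhere, squared error, and a fixed number of epochs as the termination condition:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_backprop(X, T, n_hidden, epochs=1000, eta=0.1, seed=0):
    """Backpropagation for a one-hidden-layer feed-forward network."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    # 1. initialise weights to small random numbers
    W1 = rng.uniform(-0.05, 0.05, (n_in, n_hidden))
    W2 = rng.uniform(-0.05, 0.05, (n_hidden, n_out))
    for _ in range(epochs):                        # 2. one epoch per pass
        for x, t in zip(X, T):                     # for each training example
            # forward pass: propagate the input to the output layer
            h = sigmoid(x @ W1)
            o = sigmoid(h @ W2)
            # backward pass: output errors, then errors backpropagated
            # to the hidden layer via the outgoing weights
            delta_o = (t - o) * o * (1 - o)
            delta_h = (W2 @ delta_o) * h * (1 - h)
            # update the weights
            W2 += eta * np.outer(h, delta_o)
            W1 += eta * np.outer(x, delta_h)
    return W1, W2                                  # 3. stop after a fixed number of epochs
```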

Termination conditions

Many thousands of iterations (epochs or cycles) may be necessary to learn a classification mapping
  the more complex the mapping to be learnt, the more cycles will be required
Several termination conditions are used:
  stop after a given number of epochs
  stop when the error on the training examples (or on a separate validation set) falls below some agreed level
Stopping too soon results in underfitting, too late in overfitting

Backpropagation as a search

Learning is a search for a network weight vector that implements the required mapping
The search is hill-climbing (or rather descending), called steepest gradient descent
  the heuristic used is the total of (T - O)^2 over all examples fed in a single cycle
  the weight update produces the greatest fall in overall error for the size of step
As with all hill-climbing, there is the danger of getting stuck in local minima

Problems with the search

The size of step is controlled by the learning rate parameter η
  this must be tuned for individual problems
  if the step is too large, the search becomes inefficient
The error surface tends to have:
  extensive flat areas
  troughs with very little slope
It can be difficult to reduce the error in such regions
  weights have to move large distances and it can be hard to determine the right direction
  high numerical accuracy is required, e.g. 32-bit floating point
On the bright side, there tend to be many global minima and few local minima

The trained network

After learning, a backpropagation network may be used as a classifier:
  descriptions of new examples are fed into the network and the class is read from the output layer
  for 1-out-of-N output representations, exact values of 0 and 1 will not usually be obtained
Sensitivity analysis (using test data) determines which attributes are most important for classification
  an attribute is regarded as important if small changes in its value affect the classification
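One simple way to implement such a sensitivity analysis (an assumption on my part; the slides do not specify the method) is to nudge each attribute in turn across the test set and count how often the predicted class changes. The sketch assumes a predict(x) function returning a class label:

```python
def sensitivity(predict, X_test, delta=0.05):
    """Fraction of test examples whose class changes when attribute i is nudged."""
    n_attrs = len(X_test[0])
    scores = []
    for i in range(n_attrs):
        changed = 0
        for x in X_test:
            x_perturbed = list(x)
            x_perturbed[i] += delta
            if predict(x_perturbed) != predict(x):
                changed += 1
        scores.append(changed / len(X_test))
    return scores   # a higher score marks a more important attribute
```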

Backpropagation versus ID3

These two algorithms are the giants of classification learning
Which is better?
  ... the jury is still out
There are major differences:
  ID3 favours discrete attributes, backprop favours continuous (but each handles both types)
  Backprop handles noise well; by using pruning, so does ID3
  Backprop is much slower than ID3 and may get stuck
  ID3 tells us which attributes are important; backprop does this (to some extent) with sensitivity analysis
  Backprop's learned knowledge structure (the weight vector) is not understandable, whereas an ID3 tree can be comprehended (although this is difficult if the tree is large)
