
Additional Topics

Adaline
Delta Rule or LMS or Widrow-Hoff
Generalised delta rule or error backpropagation
Effect of Momentum term on Backpropagation
Radial Basis Function Networks
Kohonen SOM
LVQ
Simulated Annealing
Cover's Theorem
BPTT
Hard limit activation function

T-norm and T-conorm


Mamdani model
TSK (Sugeno) model
ANFIS

Module 4
Previous Solutions
Important Questions


ADALINE

The ADALINE (Adaptive Linear Neural Element, or simply Adaptive Linear Element) refers to a single neuron that receives several input signals from other neurons along with a bias signal which is always +1. An ADALINE is trained using the delta rule, which is also known as the LMS (Least Mean Square) rule or the Widrow-Hoff rule.
The net input is given as
I = Σ_{i=0}^{n} w_i x_i
where x_i represents the input from the ith neuron and w_i represents the corresponding weight. The bias signal, i.e. x_0, is always +1.
The neuron is linear and hence uses the identity function to find its output,
i.e.
y = f(I) = I = Σ_{i=0}^{n} w_i x_i

The learning rule minimises the mean squared error between the output y and the target output t, and is given as
Δw_i = η (t − y) x_i
where η is the learning rate, i represents the ith input signal and Δw_i represents the change that needs to be made in the ith weight.
So, this can also be written as
w_i(new) = w_i(old) ± η x_i
where the + sign is taken when t = 1, y = 0 and the − sign is taken when t = 0, y = 1.
An ADALINE can be used to solve linearly separable problems but it fails in case of problems that are not linearly separable, like the XOR pattern problem.
The ADALINE training algorithm is given as:
Step 1) Initialize weights randomly ( may be set to 0 for simplicity ) and
set the learning rate between 0 and 1.
Step 2) Do steps (3) to (7)
Step 3) For each training pattern, do steps (4) to (6)


Step 4) Compute net input for the single output neuron as
I = Σ_{i=0}^{n} w_i x_i
Step 5) Compute output of the neuron using
y = f(I) = I = Σ_{i=0}^{n} w_i x_i

Step 6) Modify the weights using
Δw_i = η (t − y) x_i
Step 7) If the largest weight change that occurred is not less than a specified tolerance, then go to Step (2), else stop.
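As an illustration of the algorithm above, the following Python sketch implements Steps 1 to 7 for a single ADALINE; the variable names, the learning rate of 0.1 and the AND-gate training set used at the end are assumptions made here for demonstration, not part of the original text.

import numpy as np

def train_adaline(patterns, targets, eta=0.1, tolerance=1e-4, max_epochs=1000):
    # Step 1: initialise weights (w[0] is the bias weight) and set the learning rate
    w = np.zeros(patterns.shape[1] + 1)
    for epoch in range(max_epochs):                        # Step 2
        largest_change = 0.0
        for x, t in zip(patterns, targets):                # Step 3
            x = np.insert(x, 0, 1.0)                       # bias input x_0 = +1
            I = np.dot(w, x)                               # Step 4: net input
            y = I                                          # Step 5: identity activation
            delta_w = eta * (t - y) * x                    # Step 6: delta rule update
            w += delta_w
            largest_change = max(largest_change, np.max(np.abs(delta_w)))
        if largest_change < tolerance:                     # Step 7: stopping condition
            break
    return w

# assumed example: AND gate with bipolar targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([-1, -1, -1, 1], dtype=float)
print(train_adaline(X, T))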
Several ADALINES that receive signals from the same input units can be
combined into a single layer network to form a perceptron.
If ADALINES are combined in a manner that the output from some of them becomes input for some of the others, then the network becomes multi-layered, and such a multi-layered network of ADALINES is called a MADALINE. So, a MADALINE differs from an ADALINE in that:
- MADALINES are multi-layered while ADALINES are single-layered
- MADALINES can separate patterns that are not linearly separable
- MADALINES use a learning rule which is different from the delta rule used by ADALINES, and it works in a layer-by-layer manner

Figure 1:


DELTA RULE
The delta rule is a learning rule used by neural networks which works on
the principle of minimising the mean-squared error. It is also called the
LMS (Least Mean Square ) rule or Widrow-Hoff rule.
The net input of any output neuron is given as
I = Σ_{i=0}^{n} w_i x_i
where x_i represents the input from the ith neuron and w_i represents the corresponding weight. The bias signal, i.e. x_0, is always +1.
The neuron uses an activation function to find its output, i.e.
y = f(I) = f( Σ_{i=0}^{n} w_i x_i )

The delta learning rule minimises the mean squared error between the output y and the target output t, and is given as
Δw_i = η (t − y) x_i
where η is the learning rate, i represents the ith input signal and Δw_i represents the change that needs to be made in the ith weight.
So, this can also be written as
w_i(new) = w_i(old) ± η x_i
where the + sign is taken when t = 1, y = 0 and the − sign is taken when t = 0, y = 1.
The delta rule is used to modify weights in networks like Perceptron and
Adaline.
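As a brief worked illustration (with values assumed here, not taken from the text): suppose a linear neuron has inputs (x_0, x_1, x_2) = (1, 0.5, −1), weights (w_0, w_1, w_2) = (0.2, −0.4, 0.1), learning rate η = 0.1 and target t = 1. Then I = 0.2 − 0.2 − 0.1 = −0.1, so y = −0.1, and the delta rule gives Δw_i = η (t − y) x_i = 0.1 × 1.1 × x_i, i.e. Δw_0 = 0.11, Δw_1 = 0.055 and Δw_2 = −0.11.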
The features of the delta rule are as follows :
- It is one of the simplest learning rules
- Learning is said to be distributed because weight modifications can be performed locally at each neuron
- Learning is said to be online because it takes place in a pattern-by-pattern manner, where weights get modified after presentation of each training pattern


GENERALIZED DELTA RULE


The Generalised Delta Rule or the Backward Error Propagation rule or
the Backpropagation of Errors rule is a learning rule for neural networks
that works on the principle of minimising the total squared error
computed by the network.
It is generally used for training multi-layered networks where the delta
rule fails.
The error for any output neuron j is calculated as
e_j(n) = d_j(n) − y_j(n)
where d_j(n) is the target output and y_j(n) is the output obtained for the nth training pattern.
Its error energy is given as
E_j(n) = (1/2) e_j²(n)

The total error energy is obtained by adding up the error energies of all
the output neurons and it is given as
E(n) = (1/2) Σ_{j=1}^{m} e_j²(n)

where m is the number of output nodes.


The average of these total error energies over all the N training patterns is given as
E = (1/N) Σ_{n=1}^{N} E(n) = (1/2N) Σ_{n=1}^{N} Σ_{j=1}^{m} e_j²(n)
According to the gradient-descent search rule,
Δw_ij(n) = −η ∂E(n)/∂w_ij(n)
where the −ve sign indicates that the weight change is made in the direction opposite to that of the gradient of E(n) in the weight space.
Now, the gradient can be expressed using the chain rule as
∂E(n)/∂w_ij(n) = (∂E(n)/∂e_j(n)) (∂e_j(n)/∂y_j(n)) (∂y_j(n)/∂I_j(n)) (∂I_j(n)/∂w_ij(n))
= (e_j(n)) (−1) (f'_j(I_j(n))) (y_i(n))


So, the correction rule becomes


Δw_ij(n) = η e_j(n) f'_j(I_j(n)) y_i(n) = η δ_j(n) y_i(n)
where δ_j(n) is the local gradient given as
δ_j(n) = e_j(n) f'_j(I_j(n))

The generalised delta rule is used for training in networks like the
backpropagation network.
Its features are :
- It has a wider scope than traditional learning rules like the LMS
- It is global because the error for the entire network is to be minimised
- Its execution is complex but it can be used for solving a greater range of problems, like the separation of XOR-type patterns that are not linearly separable
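A minimal Python sketch of this correction rule for the output layer of a network is shown below; the function names, the sigmoid activation and the learning rate are assumptions made here for illustration, not part of the text.

import numpy as np

def sigmoid(I):
    return 1.0 / (1.0 + np.exp(-I))

def output_layer_update(W, y_prev, d, eta=0.5):
    # W      : weight matrix of the output layer, shape (n_outputs, n_hidden)
    # y_prev : outputs y_i(n) of the previous layer, shape (n_hidden,)
    # d      : target outputs d_j(n), shape (n_outputs,)
    I = W @ y_prev                          # net inputs I_j(n)
    y = sigmoid(I)                          # outputs y_j(n)
    e = d - y                               # errors e_j(n) = d_j(n) - y_j(n)
    f_prime = y * (1.0 - y)                 # f'_j(I_j(n)) for the sigmoid
    delta = e * f_prime                     # local gradients delta_j(n)
    W = W + eta * np.outer(delta, y_prev)   # delta_w_ij(n) = eta * delta_j(n) * y_i(n)
    E = 0.5 * np.sum(e ** 2)                # total error energy E(n)
    return W, E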


EFFECT OF MOMENTUM TERM ON BACKPROPAGATION


A major demerit of the Backpropagation algorithm is its extremely slow
rate of convergence, i.e. it takes a very large number of iterations to reach
the solution. The effect of addition of a momentum term to the weight
update formula has often been found to increase the speed of convergence.
In the simplest form of this modification, a momentum term comprising the product of the weight change in the previous iteration with a momentum constant α is added to the original rule, i.e. we have
Δw_ij(n) = η δ_j(n) y_i(n) + α Δw_ij(n−1)
Advantages :
- In many situations, addition of a momentum term has been found to increase the speed of convergence
- It smoothes the weight updating and tends to resist erratic weight changes
Disadvantages :
- It needs to store the weight changes from the previous iteration in order to compute the momentum term
- It need not always make the speed of training faster
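The sketch below shows how the momentum term modifies a plain gradient-descent weight update; the names and the values eta = 0.5 and alpha = 0.9 are illustrative assumptions.

import numpy as np

def update_with_momentum(W, grad_term, prev_delta_W, eta=0.5, alpha=0.9):
    # grad_term    : matrix of products delta_j(n) * y_i(n)
    # prev_delta_W : weight changes Delta w_ij(n-1) from the previous iteration
    delta_W = eta * grad_term + alpha * prev_delta_W   # momentum-augmented update
    return W + delta_W, delta_W                        # delta_W must be kept for the next step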


RADIAL BASIS FUNCTION NETWORK


A Radial Basis Function Network or Localized Receptive Field Network is a two-layered network in which
- the hidden layer nodes, which are called receptive field nodes, produce a significant non-zero response to input signals only when the input falls within a small localized region of the input space, by using basis or kernel functions.
- the output units produce the final output by taking a weighted sum or a weighted average of the signals produced by the receptive field nodes.
The kernel function generally used is the Gaussian function :
R_i(x) = exp( −‖x − u_i‖² / (2σ_i²) )
where u_i is a pre-defined centre vector with the same dimension as x and σ_i determines the width of the kernel.
In this case, each receptive field node produces an identical output for
inputs within a fixed radial distance from the centre of the kernel, i.e. they
are radially symmetric.
The output for an output node can be calculated in two ways :
(1) Weighted sum :
y(x) = Σ_{i=1}^{n} c_i R_i(x)
where n is the number of receptive fields and x represents the input, while the c_i represent constants used for forming the linear combination.
(2) Weighted average :
y(x) = [ Σ_{i=1}^{n} c_i R_i(x) ] / [ Σ_{i=1}^{n} R_i(x) ]

Figure 2:
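A short Python sketch of the forward computation of such a network is given below; the parameter names and shapes are assumptions made here for illustration.

import numpy as np

def rbf_output(x, centres, sigmas, c, average=False):
    # centres : kernel centres u_i, shape (n, dim)
    # sigmas  : kernel widths sigma_i, shape (n,)
    # c       : output-layer coefficients c_i, shape (n,)
    R = np.exp(-np.sum((x - centres) ** 2, axis=1) / (2.0 * sigmas ** 2))  # R_i(x)
    if average:
        return np.dot(c, R) / np.sum(R)   # weighted average output
    return np.dot(c, R)                   # weighted sum output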


KOHONEN'S SELF-ORGANISING MAP


Kohonen's Self-Organising Map (SOM) is a topology-preserving single-layer feedforward neural network that uses competitive learning.
Architecture :

[Figure: a single-layer feedforward Kohonen network with input nodes x1, x2, …, xm fully connected to output-layer (Kohonen layer) nodes c1, c2, …, cn]

The figure shows a single-layer m-n feedforward Kohonen network, i.e. it has m input nodes and n output nodes. The output layer is also called the Kohonen layer.
The network is fully connected, i.e. any input node is connected to
every output node.
The SOM is generally used for clustering input patterns while
preserving topology.
Here the input patterns are presented sequentially through the input
layer, without specifying the desired output, and so the learning is
unsupervised.
The network tries to select the output node that has the least
distance from the training pattern and then modifies the weights of
links to this neuron as well as neurons in its neighbourhood, to


make them more suitable for recognising similar patterns, in


comparison to other nodes.
The network thus obtains a number of nodes forming a neighbourhood, each having a central representative node which is called the prototype for that group.
The network is said to use competitive learning, because here the prototype with maximum similarity is considered to be the winner and weights are modified only for this node and its neighbours.
Learning :
For every input pattern x = (x1, x2, …, xm) presented, calculate the Euclidean distance of each neuron k's prototype c_k from x, and find the neuron w whose prototype is closest to x:
‖x − c_w‖ = min_k ‖x − c_k‖
Neuron w is the winning neuron, called the excitation center, which becomes the center of a group of input vectors that lie closest to c_w.
For all input vectors closest to c_w, update all prototype vectors by
c_k(new) = c_k(old) + η h_kw (x − c_k(old))
where η is the learning rate and h_kw is the neighbourhood function or kernel function or excitation response. This equation is called the Kohonen Learning Rule; a short code sketch of this update step is given at the end of this Learning subsection.
h_kw is selected as a function that decreases with increasing distance between c_k and c_w, and a common choice is the Gaussian function given as
h_kw(t) = h_0 exp( −‖c_k − c_w‖² / (2σ²(t)) )
where h_0 is a positive constant and σ(t) is a decreasing function of t, which is often taken as the exponential decay function given as
σ(t) = σ_0 e^(−t/τ)
where σ_0 and τ are constants
The algorithm is said to converge when h_kw does not vary with time, i.e. it becomes constant.
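The sketch below shows one presentation of an input vector to the map using the Kohonen learning rule; the parameter values (eta, h_0, sigma_0, tau) are illustrative assumptions, not values from the text.

import numpy as np

def som_step(x, prototypes, t, eta=0.1, h0=1.0, sigma0=1.0, tau=100.0):
    # prototypes : prototype vectors c_k, shape (n_nodes, dim); t : iteration number
    dists = np.linalg.norm(prototypes - x, axis=1)        # distances ||x - c_k||
    w = np.argmin(dists)                                  # winning neuron (excitation centre)
    sigma_t = sigma0 * np.exp(-t / tau)                   # decaying width sigma(t)
    h = h0 * np.exp(-np.linalg.norm(prototypes - prototypes[w], axis=1) ** 2
                    / (2.0 * sigma_t ** 2))               # neighbourhood function h_kw
    prototypes = prototypes + eta * h[:, None] * (x - prototypes)   # Kohonen learning rule
    return prototypes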
Applications :

Clustering
Vector Quantisation
Data visualisation
Feature extraction


Demerits :
SOM is not based on minimization of any objective function
Termination is not based on optimising any model of the process or
its data
Convergence is not guaranteed and termination is often forced.


t-norm and t-conorm


t-norm :
A mapping T : [0,1] × [0,1] → [0,1] is said to be a t-norm if it satisfies the following four properties :
- Commutativity : T(x,y) = T(y,x)
- Monotonicity : T(x,y) ≤ T(x,z) if y ≤ z
- Associativity : T(x, T(y,z)) = T(T(x,y), z)
- Boundary condition : T(x,1) = x
The most commonly used t-norm operators are :
- Minimum or Standard Intersection : T_min(x,y) = min(x,y)
- Algebraic Product : T_ap(x,y) = xy
- Bounded Product : T_bp(x,y) = max(0, x + y − 1)
- Drastic Product or Drastic Intersection :
T_dp(x,y) = x, if y = 1
          = y, if x = 1
          = 0, otherwise

t-conorm or s-norm :
A mapping S : [0,1] × [0,1] → [0,1] is said to be a t-conorm if it satisfies the following four properties :
- Commutativity : S(x,y) = S(y,x)
- Monotonicity : S(x,y) ≤ S(x,z) if y ≤ z
- Associativity : S(x, S(y,z)) = S(S(x,y), z)
- Boundary condition : S(x,0) = x
The most commonly used t-conorm operators are :
- Maximum or Standard Union : S_max(x,y) = max(x,y)
- Algebraic Sum : S_as(x,y) = x + y − xy
- Bounded Sum : S_bs(x,y) = min(1, x + y)
- Drastic Sum or Drastic Union :
S_ds(x,y) = x, if y = 0
          = y, if x = 0
          = 1, otherwise
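These operators are straightforward to express in code; a minimal Python version (function names chosen here for illustration) is:

def t_min(x, y):            # standard intersection
    return min(x, y)

def t_ap(x, y):             # algebraic product
    return x * y

def t_bp(x, y):             # bounded product
    return max(0.0, x + y - 1.0)

def t_dp(x, y):             # drastic product
    return x if y == 1 else (y if x == 1 else 0.0)

def s_max(x, y):            # standard union
    return max(x, y)

def s_as(x, y):             # algebraic sum
    return x + y - x * y

def s_bs(x, y):             # bounded sum
    return min(1.0, x + y)

def s_ds(x, y):             # drastic sum
    return x if y == 0 else (y if x == 0 else 1.0)

# example: t_bp(0.7, 0.6) = 0.3 and s_as(0.7, 0.6) = 0.88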

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

INT AND DIM


The INT or Contrast Intensification operator is defined as
μ_INT(A)(x) = 2 μ_A(x)² ,             for 0 ≤ μ_A(x) ≤ 0.5
            = 1 − 2 (1 − μ_A(x))² ,   for 0.5 < μ_A(x) ≤ 1
This operator is called the Contrast Intensification operator because it increases the membership values μ_A(x) which are greater than 0.5 and reduces those that are less than 0.5.
The inverse operator of INT is DIM which is called the Contrast
Diminisher operator.
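A small Python sketch of INT, together with a closed form for DIM obtained by inverting it (the explicit DIM formula is derived here and is not given in the text), is:

def INT(mu):
    # contrast intensification of a membership value mu in [0, 1]
    if mu <= 0.5:
        return 2.0 * mu ** 2                 # values below 0.5 are reduced
    return 1.0 - 2.0 * (1.0 - mu) ** 2       # values above 0.5 are increased

def DIM(mu):
    # contrast diminisher: the inverse mapping of INT
    if mu <= 0.5:
        return (mu / 2.0) ** 0.5
    return 1.0 - ((1.0 - mu) / 2.0) ** 0.5

# example: INT(0.8) = 0.92 and INT(0.3) = 0.18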
NOTE :
For question 1(ii) of 2008, we can use the INT operator
Aggregation is another term used for composition

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

FUZZY INFERENCE SYSTEM AND FUZZY CONTROLLER


A Fuzzy Inference System ( also known as Fuzzy System or Fuzzy
Associative Memory ) is a system which forms the core of any fuzzy
controller, and whose working involves fuzzification, decision-making
based on a fuzzy knowledge base and finally defuzzification.
The structural components of an FIS ( Fuzzy Inference System ) are as
follows :
A data base consisting of the linguistic term sets considered in the
fuzzy rules and also the membership functions defining the
semantics of the linguistic variables and information about their
domains.
The rule base contains a collection of fuzzy rules that determine
the working of the system, and it contains rules like :


Data Base

Fuzzification

Crisp
Input

(i)
(ii)
(iii)

Fuzzy
Inference
Engine

Defuzzification

Fuzzy
Rule
Base

Crisp
Output

If x is A then y is B : Single Association


If x is A and y is B then z is C : Multiple Association
If x is A then y is B or z is C : Multiple Association

The fuzzification process collects the inputs and then converts


them into fuzzy sets by associating membership values with them.
The fuzzy inference engine is responsible for generating the fuzzy
output from the given input using the rule base and the data base.
The defuzzification process finally produces a crisp output which
can be used by the fuzzy controller

FIS MODELS :
FISs are capable of performing non-linear mappings between
inputs and outputs based on a set of fuzzy rules.
The interpretations of any rule in the rule base depends on the FIS
model
The two most popular FIS models are :
(i) the Mamdani model, and
(ii) the TSK model, which is also called the Sugeno model or the Takagi-Sugeno-Kang model.

The Mamdani model is a non-additive fuzzy model that combines


the output of fuzzy rules using the maximum operator.
The TSK model is an additive fuzzy model that combines the
output of fuzzy rules using the addition operator.

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

MAMDANI MODEL
The Mamdani model is one of the most popular models used in fuzzy inference systems. It is a non-additive model that uses the maximum operator to combine the outputs of fuzzy rules.
For the Mamdani model with N rules, the ith rule is given as
R_i : IF x is A_i THEN y is B_i ;   for i = 1, 2, …, N
where the A_i's and B_i's are fuzzy sets defined on the input and output spaces respectively.
An input in the form of
x is A'
produces an output of the form
y is B'
by combining the rules using

μ_B'(y) = max_i [ min( μ_{A_i'}(x), μ_{B_i}(y) ) ]   --------------- (1)
where i = 1, 2, …, N, and
μ_{A_i'}(x) = min( μ_A'(x), μ_{A_i}(x) )   ---------------- (2)

In place of the max and min, other operators like sum and product can
also be used.


Inference Procedure for Mamdani Model :


Let it be assumed that two rules are given as
R_i : IF x is A_i THEN y is B_i ,   where i = 1 and 2
and the input is given as
x is A'

For the first row ( 3 graphs ):
In the first figure, the dotted curve represents fuzzy set A_1 and the triangular curve represents A'.
In the second figure, the dotted curve represents B_1 and the triangular curve represents B'.
The dashed horizontal line shows the value of (2) for i = 1.
In the third figure the output of this rule is found by clipping B_1 at this level, i.e. by taking the minimum.
For the second row ( 3 graphs ):
In the first figure, the dotted curve represents fuzzy set A_2 and the triangular curve represents A'.
In the second figure, the dotted curve represents B_2 and the triangular curve represents B'.
The dashed horizontal line shows the value of (2) for i = 2.
In the third figure the output of this rule is found by clipping B_2 at this level, i.e. by taking the minimum.
In the third row, the value of (1) has been found by taking the maximum of the last curves in the previous two rows. This is the required output.
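A compact numerical sketch of this max-min inference on a discretised universe is shown below; the triangular membership functions, the universes of discourse and the two rules are assumed purely for illustration, and the firing level of each rule is taken as the height of (2), which is what the dashed horizontal line in the figure description represents.

import numpy as np

def tri(u, a, b, c):
    # triangular membership function with feet a, c and peak b
    return np.maximum(np.minimum((u - a) / (b - a), (c - u) / (c - b)), 0.0)

x = np.linspace(0.0, 10.0, 101)            # input universe (assumed)
y = np.linspace(0.0, 10.0, 101)            # output universe (assumed)

A = [tri(x, 0, 2, 4), tri(x, 3, 5, 7)]     # antecedent sets A_1, A_2
B = [tri(y, 0, 3, 6), tri(y, 4, 7, 10)]    # consequent sets B_1, B_2
A_dash = tri(x, 1, 3, 5)                   # the input "x is A'"

# equation (2): min(mu_A'(x), mu_Ai(x)); its height gives the firing level of rule i
levels = [np.max(np.minimum(A_dash, Ai)) for Ai in A]

# equation (1): clip each B_i at its firing level, then aggregate with the maximum
B_dash = np.max([np.minimum(l, Bi) for l, Bi in zip(levels, B)], axis=0)
print(B_dash.max())                        # height of the inferred output fuzzy set B'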

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

SUGENO (TSK) MODEL


The Sugeno Model (also known as the TSK model, or the Takagi-Sugeno-Kang model) is an additive fuzzy model used in fuzzy inference systems that combines the output of fuzzy rules using the addition operator.
For the TSK model with N rules, the ith rule is given as
R_i : IF x is A_i THEN y is f_i(x) ;   for i = 1, 2, …, N
where the A_i's are fuzzy sets defined on the input space while the f_i(x)'s are mappings to the output space.
Generally, the functions f_i are linear and are of the form
f_i(x) = a_i0 + a_i1 x_1 + a_i2 x_2 + … + a_in x_n
For an input of the form x is A', the model produces a crisp output y' by combining the rules using
y' = [ Σ_{i=1}^{N} μ_{A_i'}(x) f_i(x) ] / [ Σ_{i=1}^{N} μ_{A_i'}(x) ]   --------------- (3)
where
μ_{A_i'}(x) = min( μ_A'(x), μ_{A_i}(x) )   ---------------- (2)

- The TSK model has been found to work better than the Mamdani model with fewer rules.
- It can extract rules from the data automatically, and has been found to be stronger and more flexible than the Mamdani model.
- Its disadvantage is that the rules generated may not be meaningful.
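The sketch below evaluates equation (3) for a first-order TSK model; the Gaussian firing-strength functions and the rule parameters are assumptions chosen here only for illustration.

import numpy as np

def tsk_output(x, firing_fns, consequents):
    # firing_fns  : list of functions giving the firing strength of each rule for input x
    # consequents : rows (a_i0, a_i1, ..., a_in) of the linear consequent functions f_i
    w = np.array([mu(x) for mu in firing_fns])          # rule firing strengths
    f = consequents[:, 0] + consequents[:, 1:] @ x      # f_i(x) = a_i0 + a_i1*x_1 + ...
    return np.dot(w, f) / np.sum(w)                     # equation (3): weighted average

# assumed two-rule, two-input example
firing = [lambda x: np.exp(-np.sum((x - 1.0) ** 2)),    # rule 1
          lambda x: np.exp(-np.sum((x - 3.0) ** 2))]    # rule 2
a = np.array([[1.0, 2.0, 0.5],                          # f_1 = 1 + 2*x_1 + 0.5*x_2
              [0.0, -1.0, 1.0]])                        # f_2 = -x_1 + x_2
print(tsk_output(np.array([1.5, 2.0]), firing, a))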

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

ANFIS
The ANFIS ( Adaptive Neuro Fuzzy Inference System or Adaptive
Network-based Fuzzy Inference System) is a hybrid neuro-fuzzy
system that may be defined as an adaptive neural network that
functions like a fuzzy inference system.

It uses a hybrid learning algorithm


It generally represents the TSK model of FIS using a neural
architecture.

Architecture of ANFIS :
An ANFIS usually has an architecture consisting of six layers, of the form n-nK-K-K-K-1, where n represents the number of inputs in layer 0 and K represents the number of nodes in the next layer to which each input node is connected. In the figure, n = 2 and K = 2.


Functioning of ANFIS :
Layer 0 :
It is the input layer containing n input nodes.
The figure has a network with 2 inputs
Layer 1 :
Every node i in this layer is an adaptive node with a node function
O_1,i = μ_{A_i}(x),      for i = 1, 2
O_1,i = μ_{B_{i-2}}(y),  for i = 3, 4
Each node is associated with a linguistic label A_i or B_i
The node function value O_1,i is just the membership value associated with these fuzzy sets
A typical membership function used is the generalised bell function
μ_A(x) = 1 / ( 1 + |(x − c_i)/a_i|^(2 b_i) )
Here a_i, b_i and c_i are parameters which are called the premise parameters
Layer 2 :
Every node in this layer represents a fuzzy neuron with the algebraic product t-norm as the operator
The output is the product of all the incoming signals, i.e.
O_2,i = w_i = μ_{A_i}(x) · μ_{B_i}(y),   for i = 1, 2

Layer 3 :
Every node in this layer performs normalization, and the outputs are called normalized firing strengths
The outputs are given as
O_3,i = O_2,i / Σ_{j=1}^{2} O_2,j

Layer 4 :
Every node i in this layer is an adaptive node with a node function
O_4,i = O_3,i f_i = O_3,i ( p_i x + q_i y + r_i )
Every node has an associated parameter set {p_i, q_i, r_i} which are called the consequent parameters.

Layer 5 :
The single node in this layer finds the overall output by summing up all the incoming signals
The output is given as
O_5 = Σ_{j=1}^{2} O_4,j

Learning of ANFIS :
In the ANFIS model, the functions used at all the nodes are
differentiable, and hence the backpropagation algorithm can be
used to train the network.
The ANFIS model uses the TSK fuzzy rules
R_i : IF x is A_i THEN y is f_i(x) ;   for i = 1, 2, …, N
where the functions f_i are of the form
f_i(x) = a_i0 + a_i1 x_1 + a_i2 x_2 + … + a_in x_n
The network finds the output y, computes the error, and then adjusts the parameters of the membership functions (the premise parameters) in the ANFIS network.
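A minimal Python sketch of the forward pass through the six layers described above, for the two-input, two-rule case, is given below; the bell-function parameters and the consequent parameters are assumed values used only for illustration.

import numpy as np

def bell(x, a, b, c):
    # generalised bell membership function used by the Layer 1 nodes
    return 1.0 / (1.0 + np.abs((x - c) / a) ** (2.0 * b))

def anfis_forward(x, y, premise, consequent):
    # premise    : bell parameters (a, b, c) for the labels A1, A2, B1, B2
    # consequent : rows (p_i, q_i, r_i) of the two rules
    muA = [bell(x, *premise['A1']), bell(x, *premise['A2'])]   # Layer 1
    muB = [bell(y, *premise['B1']), bell(y, *premise['B2'])]
    w = np.array([muA[0] * muB[0], muA[1] * muB[1]])           # Layer 2: firing strengths
    w_bar = w / np.sum(w)                                      # Layer 3: normalisation
    f = consequent[:, 0] * x + consequent[:, 1] * y + consequent[:, 2]
    weighted = w_bar * f                                       # Layer 4: w_bar_i * f_i
    return np.sum(weighted)                                    # Layer 5: overall output

premise = {'A1': (2.0, 2.0, 1.0), 'A2': (2.0, 2.0, 5.0),       # assumed premise parameters
           'B1': (2.0, 2.0, 1.0), 'B2': (2.0, 2.0, 5.0)}
consequent = np.array([[1.0, 1.0, 0.0], [2.0, -1.0, 1.0]])     # assumed consequent parameters
print(anfis_forward(2.0, 3.0, premise, consequent))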
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


COVER'S THEOREM
Cover's Theorem is a result that gives us a method for dealing with the separability of patterns which are not linearly separable.
It states that :
A complex pattern-classification problem which is not linearly separable in a given n-dimensional input space, when transformed (nonlinearly) to a higher-dimensional feature space, is more likely to be linearly separable, provided that the space is not densely populated.
The theorem is often used for converting patterns which are not linearly separable into linearly separable patterns.
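As a brief illustration (an example constructed here, not taken from the text): the XOR patterns (0,0) → 0, (0,1) → 1, (1,0) → 1, (1,1) → 0 are not linearly separable in the two-dimensional input space, but under the nonlinear mapping φ(x_1, x_2) = (x_1, x_2, x_1 x_2) into three dimensions, the hyperplane x_1 + x_2 − 2 x_1 x_2 = 0.5 separates the two classes, since the left-hand side takes the value 0 for the class-0 patterns and 1 for the class-1 patterns.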

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

