Artificial Neural Network


Artificial Neural Networks

• Artificial neural networks (ANNs) mimic the neurons of the brain. There are about $10^{11}$ neurons in the brain, each connected to $10^{3}$ to $10^{5}$ other neurons. When a strong enough signal stimulates a neuron, it fires a pulse through the axon to neighboring neurons
• Parallel and distributed computation
• Large number of connections

[Figure: biological neuron and artificial neuron; the inputs x1 ... xl are multiplied by the weights w1 ... wl and summed (Σ) to give the output]

Internal network structure

[Figure: input layer, hidden layers 1 to 3 and output layer; neurons/nodes (Σ) are joined by connection weights between the network input and the network output]

Neural Network applications

• Clustering, classification, regression, prediction, optimization, image analysis, control

Neural network topologies

• The way interconnections and nodes are arranged
• Choice of topology is determined by the problem being considered

The feed-forward topology

• Nodes are hierarchically arranged
• Nodes are connected in one direction from one layer to another

The recurrent topology

• Allows feedback connections between nodes
• Nodes can store information in their output
• The state can depend on the previous state of the network, unlike in the feed-forward topology
Activation function

• Typical neuron output is

$$o = f\left(\sum_i w_i x_i - \theta\right)$$

• f is the activation function, w are the connection weights, x are the inputs and θ is the threshold
• The threshold θ is used for bias-effect
• Activation function f can take different forms

$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$$

$$\mathrm{signum}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0 \end{cases}$$

$$\mathrm{step}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$$
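
As a minimal sketch of the neuron output o = f(Σ w_i x_i − θ) with the three activation functions listed above, the following Python can be used. The function and argument names are illustrative choices, not part of the slides.

```python
import math

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def signum(x):
    """Returns 1 for x > 0, 0 for x == 0 and -1 for x < 0."""
    return 1 if x > 0 else (-1 if x < 0 else 0)

def step(x):
    """Returns 1 for x > 0 and 0 otherwise."""
    return 1 if x > 0 else 0

def neuron_output(weights, inputs, theta, f):
    """Weighted sum of the inputs minus the threshold, passed through f."""
    total = sum(w * x for w, x in zip(weights, inputs)) - theta
    return f(total)
```

For example, `neuron_output([1, 1], [1, 1], 1.5, step)` evaluates step(0.5), so the neuron fires only when both inputs are active.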

Learning algorithms

• Learning is used to update the weights w
• Neural networks can be classified by the algorithm used for learning
• The three most used learning mechanisms are
– Supervised
– Unsupervised
– Reinforced

Supervised learning

• The network is provided with input data for which the corresponding output data is known
• During learning, the output of the network is compared to the desired output
• Different learning rules are used to adjust the weights w to obtain the desired output

Unsupervised learning

• Does not require desired outputs
• Explores underlying structure or correlations in the data and organizes patterns into categories
• Example: SOM (self-organizing map)

Reinforcement learning

• Mimics the way humans learn when interacting with a physical environment
• Network is presented with inputs, but not with desired outputs
• If the network delivers the desired output, the connections leading to it are strengthened, otherwise weakened
• Can be used for control
Fundamentals of connectionist modeling

• In this section we will get familiar with some of the earlier developments in the field of neural networks
– McCulloch-Pitts model
– Perceptron
– Adaline
– Madaline

McCulloch-Pitts Model

• Output:

$$o = f\left(\sum_i w_i x_i - \theta\right)$$

[Figure: inputs x1 ... xl with weights w1 ... wl and the bias input θ feed a summation node Σ that produces the output]

• f is the step function
• No learning mechanism
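
Because the McCulloch-Pitts unit has no learning mechanism, its weights and threshold have to be set by hand. The sketch below shows one possible choice that realizes logical AND and OR; the particular values (weights of 1, thresholds 1.5 and 0.5) are illustrative assumptions, not taken from the slides.

```python
def mcculloch_pitts(inputs, weights, theta):
    """Step-function neuron: 1 if sum(w_i * x_i) - theta > 0, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) - theta
    return 1 if total > 0 else 0

def logic_and(x1, x2):
    # Hand-picked parameters: both inputs must be active to exceed 1.5
    return mcculloch_pitts((x1, x2), (1.0, 1.0), 1.5)

def logic_or(x1, x2):
    # Hand-picked parameters: a single active input already exceeds 0.5
    return mcculloch_pitts((x1, x2), (1.0, 1.0), 0.5)
```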

Perceptron

• Output:

$$o = f\left(\sum_i w_i x_i - \theta\right)$$

[Figure: perceptron with inputs x1 ... xl, weights w1 ... wl and bias input θ feeding a summation node Σ; a learning mechanism (Hebbian rule) is driven by the difference between the target output and the output of the perceptron]

• Was developed for pattern classification of linearly separable sets
• As long as the patterns are linearly separable, the learning algorithm converges in a finite number of steps (perceptron convergence theorem)
Perceptron

• Steps to train perceptron:
1. Initialize weights and thresholds to small random numbers
2. Choose an input-output pattern from the data set
3. Compute output

$$o = f\left(\sum_{i=1}^{l} w_i x_i - \theta\right)$$

4. Adjust the weights according to the perceptron rule

$$\Delta w_i = \eta \left[ t - f\left(\sum_i w_i x_i - \theta\right) \right] x_i$$

where η is a positive number ranging from 0 to 1, representing the learning rate, and t is the desired output
5. If the weights have not reached steady-state values ($\Delta w_i = 0$), go to 2.

• Two-dimensional case: the decision boundary is the line

$$\sum_i w_i x_i - \theta = 0$$

$$w_1 x_1 + w_2 x_2 - \theta = 0$$

$$x_2 = -\frac{w_1}{w_2} x_1 + \frac{\theta}{w_2}$$

[Figure: perceptron with two inputs x1 and x2, weights w1 and w2, bias input θ, and a learning mechanism (Hebbian rule) driven by the target output]
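
The five training steps can be sketched as follows. This is a minimal illustration under a few assumptions that the slides leave open: patterns are visited in a fixed cyclic order, the threshold is updated like a weight on a constant input of −1, and logical AND is used as a linearly separable toy data set.

```python
import random

def step(x):
    return 1 if x > 0 else 0

def train_perceptron(patterns, eta=0.1, max_epochs=1000, seed=0):
    rng = random.Random(seed)
    n_inputs = len(patterns[0][0])
    # Step 1: small random initial weights and threshold
    w = [rng.uniform(-0.05, 0.05) for _ in range(n_inputs)]
    theta = rng.uniform(-0.05, 0.05)
    for _ in range(max_epochs):
        changed = False
        for x, t in patterns:                      # step 2
            o = step(sum(wi * xi for wi, xi in zip(w, x)) - theta)  # step 3
            err = t - o
            if err != 0:                           # step 4: perceptron rule
                w = [wi + eta * err * xi for wi, xi in zip(w, x)]
                theta -= eta * err                 # threshold sees input -1
                changed = True
        if not changed:                            # step 5: all delta w are 0
            break
    return w, theta

# Toy data (assumption): logical AND, which is linearly separable
and_patterns = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w_and, theta_and = train_perceptron(and_patterns)
```

With the learned parameters, the decision boundary in the two-dimensional case is the line x2 = −(w1/w2) x1 + θ/w2.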
Adaline

• Adaptive linear neuron

[Figure: Adaline with inputs x1 ... xl, weights w1 ... wl and bias input θ feeding a summation node Σ; the LMS learning mechanism is driven by the difference between the target output and the output of the Adaline]

• Weights are adjusted according to the least mean square (LMS) algorithm
• Learning rule is formally derived using the gradient descent algorithm
• Adjust weights by incrementing them by an amount proportional to the negative gradient of the cumulative error of the network
Adaline

• Steps to train Adaline:
1. Initialize weights and thresholds to small random numbers
2. Choose an input-output pattern from the data set
3. Compute output before activation

$$r = \sum_{i=1}^{l} w_i x_i - \theta$$

4. Adjust the weights according to the LMS rule

$$\Delta w_i = \eta \left[ t - \left(\sum_i w_i x_i - \theta\right) \right] x_i$$

where η is a positive number ranging from 0 to 1, representing the learning rate
5. If weights do not reach steady-state values go to 2.

Adaline example

[Figure: data belonging to 2 classes and the decision boundary found with Adaline; both axes run from 0 to 1]
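
A minimal sketch of the LMS training steps is given below. Unlike the perceptron rule, the error is measured on the output before activation. The six two-dimensional points, the ±1 targets, the learning rate and the number of sweeps are made-up assumptions for illustration; they are not the data behind the figure.

```python
import random

def train_adaline(patterns, eta=0.1, epochs=500, seed=0):
    rng = random.Random(seed)
    n_inputs = len(patterns[0][0])
    w = [rng.uniform(-0.05, 0.05) for _ in range(n_inputs)]
    theta = rng.uniform(-0.05, 0.05)
    for _ in range(epochs):
        for x, t in patterns:
            # Output before activation: r = sum(w_i * x_i) - theta
            r = sum(wi * xi for wi, xi in zip(w, x)) - theta
            err = t - r
            # LMS rule: delta w_i = eta * (t - r) * x_i
            w = [wi + eta * err * xi for wi, xi in zip(w, x)]
            theta -= eta * err  # threshold acts as a weight on input -1
    return w, theta

def adaline_class(w, theta, x):
    r = sum(wi * xi for wi, xi in zip(w, x)) - theta
    return 1 if r > 0 else -1

# Toy data (assumption): two well-separated classes in the unit square
adaline_patterns = [((0.1, 0.2), -1), ((0.2, 0.1), -1), ((0.3, 0.3), -1),
                    ((0.7, 0.8), 1), ((0.8, 0.7), 1), ((0.9, 0.9), 1)]
w_ada, theta_ada = train_adaline(adaline_patterns)
```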

Adaline example

[Figure: evolution of the parameters w1, w2 and θ during training; horizontal axis from 0 to 6000, vertical axis from −2.5 to 2.5]

Madaline

• Perceptron and Adaline suffer from their inability to train on patterns belonging to nonlinearly separable spaces, for example XOR
• It was possible to solve the problem of nonlinear separability by combining a number of Adaline units in parallel → Madaline
Madaline

• Build a system of perceptrons that solves the XOR problem. Use sign as activation function

x1   x2   target
0    0    -1
0    1     1
1    0     1
1    1    -1

[Figure: the four XOR patterns plotted in the unit square]

Major classes of modern neural networks

• The multilayer perceptron
• Radial basis function networks
• Kohonen’s self-organizing network
• Recurrent neural networks
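
For the XOR exercise, one possible arrangement is two sign units working in parallel on the inputs, combined by a third sign unit. The weights and thresholds below are hand-picked assumptions (an OR-like unit, an AND-like unit and a combiner); other choices work equally well, and nothing here is trained.

```python
def sign(x):
    return 1 if x > 0 else (-1 if x < 0 else 0)

def unit(inputs, weights, theta):
    """Single sign unit: sign(sum(w_i * x_i) - theta)."""
    return sign(sum(w * x for w, x in zip(weights, inputs)) - theta)

def xor_network(x1, x2):
    h_or = unit((x1, x2), (1.0, 1.0), 0.5)    # +1 unless both inputs are 0
    h_and = unit((x1, x2), (1.0, 1.0), 1.5)   # +1 only if both inputs are 1
    # Combiner: +1 when the OR-like unit fires and the AND-like unit does not
    return unit((h_or, h_and), (1.0, -1.0), 0.5)
```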

The multilayer perceptron

• The multilayer perceptron is a feedforward network
• MLP consists of an input layer, one or more hidden layers and an output layer
• The number of hidden layers depends on the task

[Figure: internal network structure with input layer, hidden layers 1 to 3 and output layer; neurons/nodes (Σ) joined by connection weights]
The multilayer perceptron

• MLP often uses the back-propagation learning algorithm
• The algorithm is based on the gradient descent optimization method
• The sigmoid function is often used as the activation function
• It updates the network weights by using a feedback signal, which is the error between the target signal and the actual output

Backpropagation

• The equations for “on-line” backpropagation:

$$E(k) = \frac{1}{2} \sum_{i=1}^{q} \left(t_i(k) - o_{OUT,i}(k)\right)^2, \qquad q = \text{number of outputs}$$

$$\Delta w = -\eta \frac{dE}{dw}$$

$$\frac{dE}{dw} = \frac{dE}{do} \frac{do}{dtot} \frac{dtot}{dw}$$

Evaluating the derivative gives:

$$\Delta w_{ij} = \eta \delta_i o_j$$

where $\delta_i$ is

Output node: $$\delta_i^{(L)} = \left(t_i - o_i^{(L)}\right) f'\left(tot_i^{(L)}\right)$$

Hidden node: $$\delta_i^{(l)} = f'\left(tot_i^{(l)}\right) \sum_p \delta_p^{(l+1)} w_{pi}^{(l+1)}$$

For sigmoids: $$f'(tot_i) = o_i (1 - o_i)$$
Backpropagation learning

• Let’s test the on-line backpropagation equations for a simple case with one hidden layer with 3 sigmoids and a linear node in the output layer

$$o_{OUT} = -w_7 + w_8 o_1 + w_9 o_2 + w_{10} o_3$$

$$o_1 = \frac{1}{1 + \exp(-tot_1)}, \qquad tot_1 = -w_1 + w_2 \, IN$$

$$o_2 = \frac{1}{1 + \exp(-tot_2)}, \qquad tot_2 = -w_3 + w_4 \, IN$$

$$o_3 = \frac{1}{1 + \exp(-tot_3)}, \qquad tot_3 = -w_5 + w_6 \, IN$$

[Figure: network with input IN and bias inputs −1; weights 1 to 6 feed the hidden nodes O1, O2 and O3, weights 7 to 10 feed the output node OOUT]

• If the weight is connected to the output layer we get the following update rule

$$w_{new,i} = w_{old,i} + \eta (t - o_{out}) o_j$$

$$w_{new,bias} = w_{old,bias} - \eta (t - o_{out})$$

• If the weight is connected to a node in the hidden layer we get

$$w_{new,j} = w_{old,j} + \eta (t - o_{out}) w_{old,i} \, o_j (1 - o_j) \, IN$$

$$w_{new,bias} = w_{old,bias} - \eta (t - o_{out}) w_{old,i} \, o_j (1 - o_j)$$

Here $w_{old,i}$ is the output-layer weight attached to hidden node j.
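
The update rules for this 1-3-1 network can be sketched in Python as below. The weight numbering w1 ... w10 follows the slide; the learning rate, the initial weight range, the random pattern order and the use of sin(x) on 0...6.3 as toy data are assumptions made only for this illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w, x_in):
    # w[1]..w[10] follow the slide numbering; w[0] is unused padding
    o = [sigmoid(-w[2 * j + 1] + w[2 * j + 2] * x_in) for j in range(3)]
    out = -w[7] + sum(w[8 + j] * o[j] for j in range(3))
    return o, out

def online_update(w, x_in, t, eta):
    o, out = forward(w, x_in)
    err = t - out
    new = list(w)
    new[7] = w[7] - eta * err                        # output bias
    for j in range(3):
        new[8 + j] = w[8 + j] + eta * err * o[j]     # output-layer weight
        back = err * w[8 + j] * o[j] * (1.0 - o[j])  # uses the old weight
        new[2 * j + 2] = w[2 * j + 2] + eta * back * x_in  # hidden weight
        new[2 * j + 1] = w[2 * j + 1] - eta * back         # hidden bias
    return new

def mse(w, data):
    return sum((t - forward(w, x)[1]) ** 2 for x, t in data) / len(data)

def train(w, data, eta=0.01, epochs=1000, seed=0):
    rng = random.Random(seed)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)            # assumption: random pattern order
        for x, t in data:
            w = online_update(w, x, t, eta)
    return w

init_rng = random.Random(1)
w0 = [0.0] + [init_rng.uniform(-0.5, 0.5) for _ in range(10)]
data = [(0.1 * k, math.sin(0.1 * k)) for k in range(64)]
mse_before = mse(w0, data)
w_trained = train(w0, data)
mse_after = mse(w_trained, data)
```

Comparing `mse_before` with `mse_after` gives a quick check that the updates move the weights in a useful direction; how close the fit gets depends on the starting point and the number of epochs.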

Example

• Try to make a model of sin(x) in the range 0...6.3
• Let’s use 3 hidden nodes
• After 10000 epochs we get:

[Figure: NN model output and original data; horizontal axis from 0 to 7, vertical axis from −1 to 1]

Effect of starting point

• Depending on the starting point the training will go differently. Below is an example with Matlab’s NN toolbox

[Figure: example from Matlab’s NN toolbox]
Example

• Note that the derivative of the sigmoid function f(tot) is simply:

$$\frac{df(tot)}{dtot} = f(tot)\left(1 - f(tot)\right)$$

where

$$f(tot) = \frac{1}{1 + \exp(-tot)}$$

• Show this fact!!

Improvements of Backpropagation and other methods

• There are many variants of backpropagation that try to improve the basic algorithm
• One such variant is BP with momentum

$$\Delta w_{new} = -\eta \frac{dE}{dw} + \gamma \Delta w_{old}$$

• Also a variable learning rate can improve the results
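
The sigmoid derivative stated in the example can be verified in a few lines with the chain rule. The only step that needs care is recognizing that the second factor equals 1 − f(tot).

$$
\begin{aligned}
f(tot) &= \left(1 + e^{-tot}\right)^{-1} \\
\frac{df(tot)}{dtot} &= -\left(1 + e^{-tot}\right)^{-2} \cdot \left(-e^{-tot}\right)
 = \frac{e^{-tot}}{\left(1 + e^{-tot}\right)^{2}} \\
 &= \frac{1}{1 + e^{-tot}} \cdot \frac{e^{-tot}}{1 + e^{-tot}} \\
 &= f(tot)\left(1 - f(tot)\right),
 \qquad \text{since } 1 - f(tot) = \frac{\left(1 + e^{-tot}\right) - 1}{1 + e^{-tot}} = \frac{e^{-tot}}{1 + e^{-tot}}
\end{aligned}
$$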
Batch Backpropagation

• For the case of batch training, all samples are inspected before the weights are updated. The error signals for all patterns in the batch are included in the sum:

$$\Delta w_{ij} = \eta \sum_k \delta_{i,k} \, o_{j,k}$$

Backpropagation

• The main drawback of BP is that it has slow convergence
• Numerous other training approaches have been suggested
• To mention some
– Levenberg-Marquardt
– Genetic Algorithms
– Simulated annealing
– Extended Kalman-filter
– Unscented Kalman-filter
Levenberg-Marquardt training

• Levenberg-Marquardt training is based on the following equations

$$w_{k+1} = w_k - \left[J^T J + \mu I\right]^{-1} J^T e$$

$$J = \frac{\partial e}{\partial w} = \begin{bmatrix} \dfrac{\partial e_1}{\partial w_1} & \cdots & \dfrac{\partial e_1}{\partial w_M} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial e_N}{\partial w_1} & \cdots & \dfrac{\partial e_N}{\partial w_M} \end{bmatrix}$$

J = Jacobian
e = residual vector
μ = interpolation parameter

• LM interpolates between the Gauss-Newton method and the method of gradient descent

Choosing the number of hidden neurons of a neural network

• The easiest way to choose the complexity is by using independent test data
• When the model has too many neurons it starts to adapt to noise and special cases in the training data; this causes the error on the test data to start to grow

[Figure: RMS versus # of hidden neurons, with a curve for the training data and the point marked “Good model size!”]
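
The Levenberg-Marquardt update can be illustrated on a problem small enough to invert the 2×2 matrix by hand. The sketch below fits y = a·exp(b·x) to noise-free data; the model, the data, the starting point and the rule for adapting μ (divide by 10 after an accepted step, multiply by 10 after a rejected one) are all assumptions for illustration, since the slides give only the update equation.

```python
import math

def residuals_and_jacobian(w, xs, ys):
    a, b = w
    e, jac = [], []
    for x, y in zip(xs, ys):
        ex = math.exp(b * x)
        e.append(a * ex - y)            # residual e_n = model - data
        jac.append((ex, a * x * ex))    # (de_n/da, de_n/db)
    return e, jac

def sse(w, xs, ys):
    e, _ = residuals_and_jacobian(w, xs, ys)
    return sum(ei * ei for ei in e)

def lm_step(w, xs, ys, mu):
    e, jac = residuals_and_jacobian(w, xs, ys)
    # A = J^T J + mu * I and g = J^T e, written out for two parameters
    a11 = sum(j[0] * j[0] for j in jac) + mu
    a12 = sum(j[0] * j[1] for j in jac)
    a22 = sum(j[1] * j[1] for j in jac) + mu
    g1 = sum(j[0] * ei for j, ei in zip(jac, e))
    g2 = sum(j[1] * ei for j, ei in zip(jac, e))
    det = a11 * a22 - a12 * a12
    d1 = (a22 * g1 - a12 * g2) / det    # first component of A^-1 g
    d2 = (a11 * g2 - a12 * g1) / det    # second component of A^-1 g
    return (w[0] - d1, w[1] - d2)

def levenberg_marquardt(w, xs, ys, mu=1.0, iterations=100):
    err = sse(w, xs, ys)
    for _ in range(iterations):
        cand = lm_step(w, xs, ys, mu)
        cand_err = sse(cand, xs, ys)
        if cand_err < err:              # accept: move towards Gauss-Newton
            w, err, mu = cand, cand_err, mu * 0.1
        else:                           # reject: move towards gradient descent
            mu *= 10.0
    return w

xs = [0.25 * k for k in range(9)]
ys = [2.0 * math.exp(-1.5 * x) for x in xs]
w_fit = levenberg_marquardt((1.0, -0.5), xs, ys)
```

A small μ makes the step close to a Gauss-Newton step, while a large μ shrinks it towards a short gradient descent step, which is the interpolation described above.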

Radial basis function networks

• Radial basis function networks represent a special category of the feedforward networks
• Development of RBFN was inspired by the biological receptive fields of the cerebral cortex
• Consists of an input layer, a single hidden layer and an output layer

[Figure: radial basis function network with a Gaussian activation function]
Radial basis function networks

• The activation functions in the hidden layer are symmetrical and they get their maximum values at the center of the function
• Most commonly used functions are Gaussians

$$g_i(x) = \exp\left(\frac{-\left\| x - v_i \right\|^2}{2\sigma_i^2}\right)$$

• The standard learning algorithm for RBF networks is the hybrid technique
• The hybrid technique consists of two stages
• An unsupervised algorithm (k-means, maximum likelihood, self-organizing map etc.) to specify the parameters of the radial basis functions
• A supervised algorithm to update the weights between the hidden layer and the output layer. This step is normally performed with standard least squares
Kohonen’s self-organizing network

• Also known as Kohonen self-organizing map or self-organizing map
• Mapping of the input vectors to the output layer results in reduction of the dimensionality of the input space
• Output units are typically represented as a two or three dimensional grid

[Figure: Kohonen network]
Kohonen’s self-organizing network

• The learning algorithm updates the network’s weights without performance feedback
• Learning is based on the competitive learning technique, also known as the ’winner take all’ strategy
• The algorithm distributes the output units across the input space, grouping similar input vectors

[Figure: a) initial state, b) trained network]
Kohonen’s self-organizing network

• Another example

[Figure: another example of a trained network]

• Steps of the learning algorithm
• Initialize the weights
• Choose an input from the input data set
• Select the winning output unit (BMU = best matching unit)

$$\left\| x - w_c \right\| = \min_{ij} \left\| x - w_{ij} \right\|$$

• Update the weights of the winning unit
• Update the weights of the neighboring units

$$w_{ij}(k+1) = \begin{cases} w_{ij}(k) + \alpha(k)\left[x - w_{ij}(k)\right], & \text{if } (i,j) \in N_c(k) \\ w_{ij}(k), & \text{if } (i,j) \notin N_c(k) \end{cases}$$
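
The learning steps can be sketched as follows. The slides do not specify how α(k) and the neighborhood N_c(k) shrink over time; the linear decay and the square neighborhood used here, as well as the two-cluster toy data, are assumptions made for illustration.

```python
import random

def find_bmu(weights, x):
    """Grid index (i, j) of the unit whose weight vector is closest to x."""
    best, best_d = None, float("inf")
    for i, row in enumerate(weights):
        for j, w in enumerate(row):
            d = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
            if d < best_d:
                best, best_d = (i, j), d
    return best

def train_som(data, rows, cols, iterations=2000, alpha0=0.5, radius0=1, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # Initialize the weights randomly in [0, 1)
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    for k in range(iterations):
        frac = 1.0 - k / iterations
        alpha = alpha0 * frac                 # assumed linear decay of alpha(k)
        radius = int(round(radius0 * frac))   # assumed shrinking N_c(k)
        x = rng.choice(data)                  # choose an input
        ci, cj = find_bmu(weights, x)         # select the winning unit
        for i in range(rows):
            for j in range(cols):
                # Square neighborhood around the BMU (includes the BMU itself)
                if max(abs(i - ci), abs(j - cj)) <= radius:
                    w = weights[i][j]
                    for d in range(dim):
                        w[d] += alpha * (x[d] - w[d])
    return weights

data_rng = random.Random(1)
som_data = [(c + data_rng.uniform(-0.05, 0.05), c + data_rng.uniform(-0.05, 0.05))
            for c in (0.1, 0.9) for _ in range(20)]
som = train_som(som_data, rows=3, cols=3)
```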

Monitoring with neural networks

• MLP
– Regression (analytical redundancy)
– Pattern recognition (classification)
• SOM
– Pattern recognition (classification)
– ”Regression”
– Distance to BMU

Monitoring with MLP - regression

• Make a regression model using healthy data from the process
• The model output(s) represent some interesting variable(s) for which faults need to be detected
• Track the model residual: $r = y - \hat{y}$
• This approach is often called analytical redundancy

[Figure: measured outputs y1 and y2, model outputs ŷ1 and ŷ2, and the residuals r1 and r2]
Monitoring with MLP - Classification

• The neural network is trained to recognize different fault classes
• The training data has to include faulty data with knowledge of the fault

[Figure: regions for fault 1 and fault 2 in the space of the variables x1, x2 and x3]

Monitoring with SOM – Pattern recognition

• Different faults are associated with different nodes of the map
• The training data has to include faulty data with knowledge of the fault

[Figure: map with nodes labelled Fault 1 and Fault 2]
Monitoring with SOM – ”regression”

• The model is based on fault free data
• When new data is fed to the model the BMU is found based on a subset of the input vector (the monitored variable y is left out)
• Check the value of y for the BMU
• Calculate the residual between measured y and y from BMU

Monitoring with SOM – Distance to BMU

• Find the BMU for the given input
• Calculate distance between BMU and input
• If the distance exceeds the threshold a fault has been detected
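
Both SOM-based monitoring schemes reduce to a nearest-neighbor search in the trained map. The sketch below assumes the map is available as a flat list of weight vectors (called `codebook` here, a hypothetical name) and that the fault threshold is supplied by the user; neither detail is specified on the slides.

```python
import math

def bmu_index(codebook, x, dims):
    """Index of the best matching unit, comparing only the listed dimensions."""
    return min(range(len(codebook)),
               key=lambda n: sum((x[d] - codebook[n][d]) ** 2 for d in dims))

def distance_to_bmu(codebook, x):
    dims = range(len(x))
    n = bmu_index(codebook, x, dims)
    return math.sqrt(sum((x[d] - codebook[n][d]) ** 2 for d in dims))

def fault_detected(codebook, x, threshold):
    """Distance-to-BMU scheme: a fault is flagged above the threshold."""
    return distance_to_bmu(codebook, x) > threshold

def som_residual(codebook, x, y_dim):
    """'Regression' scheme: find the BMU without y, then compare y values."""
    dims = [d for d in range(len(x)) if d != y_dim]
    n = bmu_index(codebook, x, dims)
    return x[y_dim] - codebook[n][y_dim]

# Toy map (assumption): three units trained on fault-free data
codebook = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
```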

Recurrent neural networks

• With standard feed-forward networks only static mappings can be made.
• The models are without memory - they don’t have a state.
• In some cases this is not enough. E.g. if long term prediction capabilities and effective noise handling are important
• By introducing feedback to our networks we enable true dynamic modeling
• In recurrent neural networks, some outputs are directed back as inputs to nodes in the same or preceding layer
• These models have a memory – a state
• They are much more challenging to train
Introductory linear example

• Let’s consider the following linear system
• We have noisy measurements y available from this system and we want to determine the system model
• How should we determine a linear model of this system?
• The easiest way is to make an ARX (Auto Regressive eXogenous input) model using standard least squares
• Another way is to make an OE (output error) model which is analogous to a recurrent (linear) network

[Figure: NN representation of ARX model and NN representation of OE model, both with input u]
Introductory linear example

• If we make an ARX model:

    % y and u are assumed to be column vectors of equal length
    N = length(y);
    D = [y(1:N-2) y(2:N-1) u(2:N-1)];   % regressors y(t-2), y(t-1), u(t-1)
    yout = y(3:N);                      % regression target y(t)
    par = pinv(D)*yout                  % least-squares estimate

    par =
      -0.05575832188004
       0.00169077887767
       0.26950929371295

• The parameters differ considerably from the ones of the original model because of the noise introduced
• We can now use this model to make one step predictions and compare them with the true state. Looks rather good (SSQ = 0.08)

[Figure: one step predictions compared with the true state, samples 0 to 400]
Introductory linear example

• Now let’s make an OE model with Matlab

    data = iddata(y,u,1)
    M = oe(data,[1 2 1])

    Discrete-time IDPOLY model: y(t) = [B(q)/F(q)]u(t) + e(t)
    B(q) = 0.2113 q^-1
    F(q) = 1 - 0.8357 q^-1 + 0.7361 q^-2

    Estimated using OE from data set data
    Loss function 0.00259761 and FPE 0.00263717
    Sampling interval: 1

• Now the parameters are almost the correct ones
• Testing the model reveals even better results (SSQ = 0.005)

[Figure: OE model output compared with the true state, samples 0 to 400]
Introductory linear example

• We can further try to use the ARX model for long term predictions by feeding back outputs from the model. Now it turns out that the results aren’t so good

[Figure: long term predictions of the ARX model, samples 0 to 400]

• The ARX model doesn’t explain the true dynamics of the system. It is only in the limiting case that the noise approaches zero that the parameters will be correct

[Figure: parameter values versus noise variance, for noise variance from 0 to 0.025]
Architectures

• Defining feature is the use of feedback signals
• Self-loops or backward connections
• Two main categories
– Symmetrical weight connections
– Asymmetrical weight connections

Architectures – Symmetrical

• Hopfield network
– Feedback links with symmetrical weights from each output signal to all other output nodes
– No self-loops

Architectures - Asymmetrical

• Partially and fully recurrent networks
• A partially recurrent network is basically a multilayer FF network with feedback links from the hidden or output layers
• Two well-known models
– Elman network
– Jordan network
• Fully recurrent networks
– Any node can be connected to any other

The Elman and Jordan Networks

[Figure: the Elman and Jordan networks]
Fully recurrent

[Figure: fully recurrent network]

Backpropagation through time

• Any recurrent neural network can be ”unfolded in time” into an equivalent feed-forward representation to which BP can be applied
Neural networks for system identification

• Each of the main linear identification methods (ARX, OE, FIR, ARMAX) has an analogous neural variant: NARX, NOE, NFIR, NARMAX

NARX (series parallel)

[Figure: NARX (series parallel) structure]
NOE (parallel)

[Figure: NOE (parallel) structure]

Example: A pendulum

• Let’s model a pendulum that is described by the equations below with an Elman network

$$\dot{x}_1 = x_2$$

$$\dot{x}_2 = -9.8 \sin(x_1) - 0.5 x_2 + 0.5 u$$

$$y = x_1$$

• We have 2 hidden nodes in the network and two context nodes
• We evaluate the network with 2 different test sets

$$RMS_{train} = 2.3 \cdot 10^{-4}$$
$$RMS_{test1} = 2.1 \cdot 10^{-4}$$
$$RMS_{test2} = 3.2 \cdot 10^{-4}$$
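
Input-output data for such a model could be generated by integrating the pendulum equations numerically. The sketch below uses a classical fourth-order Runge-Kutta step with the input held constant over each step; the step length, the initial state and the input sequence are assumptions, since the slides do not say how the training and test sets were produced.

```python
import math

def derivatives(x1, x2, u):
    """Right-hand side of the pendulum equations from the slide."""
    return x2, -9.8 * math.sin(x1) - 0.5 * x2 + 0.5 * u

def rk4_step(x1, x2, u, dt):
    k1 = derivatives(x1, x2, u)
    k2 = derivatives(x1 + 0.5 * dt * k1[0], x2 + 0.5 * dt * k1[1], u)
    k3 = derivatives(x1 + 0.5 * dt * k2[0], x2 + 0.5 * dt * k2[1], u)
    k4 = derivatives(x1 + dt * k3[0], x2 + dt * k3[1], u)
    x1 += dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0
    x2 += dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0
    return x1, x2

def simulate(u_seq, dt=0.01, x1=0.0, x2=0.0):
    """Returns the output y = x1 after each step of the input sequence."""
    ys = []
    for u in u_seq:
        x1, x2 = rk4_step(x1, x2, u, dt)
        ys.append(x1)
    return ys

# Free swing from an initial angle of 0.3 rad with zero input (assumption)
free_swing = simulate([0.0] * 2000, x1=0.3)
```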

Example: The pendulum

[Figure: training set; x des and x net versus pattern index (0 to 500)]

[Figure: test set 1; x des and x net versus pattern index (0 to 200)]
Example: The pendulum

[Figure: test set 2; x des and x net versus pattern index (0 to 200)]

• It is necessary to make some additional comments regarding the pendulum
• Feed-forward approaches (NARX/ARX) can also be used in this case mainly because the measurements are noise free
• The pendulum model is almost linear as long as the angles are small: sin(x)≈x
• It turns out that a simple linear ARX model in this case gives a really good fit

Example: The pendulum

[Figure: linear ARX model on test data; x des and x net versus pattern index (0 to 200)]

Case study: NARX model of the Mackey-Glass system

• The Mackey-Glass system is a classical test problem for different prediction methods
• It is described by a nonlinear differential equation including a delay

$$\frac{dx}{dt} = \frac{a \, x(t - \tau)}{1 + x^c(t - \tau)} - b x$$

Case study: NARX model of the Mackey-Glass system

• We integrate the system with ODE4 (Runge-Kutta) in Matlab using a constant step length of 1 s
• We want to make a NARX model that predicts the output 6 s ahead based on the state now and three previous states

$$\hat{x}(t + 6) = f\left(x(t), x(t - 6), x(t - 12), x(t - 18)\right)$$

[Figure: x(t) versus t for t from 0 to 2000]
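
The regressor structure of this NARX model can be made concrete with a small helper that turns a sampled series into input-target patterns. It assumes the series is sampled once per second, matching the step length of 1 s, so that list indices correspond to seconds; the function name and its parameters are illustrative.

```python
def narx_patterns(series, lag=6, n_lags=4, horizon=6):
    """Builds ([x(t), x(t-lag), ..., x(t-(n_lags-1)*lag)], x(t+horizon)) pairs."""
    patterns = []
    first_t = lag * (n_lags - 1)      # earliest t with a full set of lags
    for t in range(first_t, len(series) - horizon):
        inputs = [series[t - lag * k] for k in range(n_lags)]
        patterns.append((inputs, series[t + horizon]))
    return patterns
```

With the defaults, the first usable pattern is at t = 18, where the inputs are x(18), x(12), x(6), x(0) and the target is x(24).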
Case study: NARX model of the Mackey-Glass system

• We use 500 data points for training and 1300 for testing
• Let’s see how well different sized models perform

model             # of params   RMS train   RMS test
Linear model            5       9.62E-02    9.66E-02
1 hidden node           7       6.76E-02    6.82E-02
2 hidden nodes         13       1.16E-02    1.14E-02
3 hidden nodes         19       6.28E-03    6.15E-03
4 hidden nodes         25       5.28E-03    5.33E-03
5 hidden nodes         31       4.46E-03    4.44E-03
6 hidden nodes         37       4.16E-03    4.14E-03
7 hidden nodes         43       3.85E-03    3.91E-03
8 hidden nodes         49       3.19E-03    3.17E-03
9 hidden nodes         55       2.55E-03    2.81E-03
10 hidden nodes        61       2.47E-03    2.49E-03
11 hidden nodes        67       2.15E-03    2.30E-03
12 hidden nodes        73       2.03E-03    2.11E-03
20 hidden nodes       121       1.52E-03    1.94E-03

[Figure: RMS error (RMS train and RMS test) versus # of parameters]

[Figure: x(t+6) des and x(t+6) net versus pattern index (0 to 500)]

• A model with three hidden nodes already gives quite a good result

Monitoring with recurrent neural networks

• If there are significant dynamics as well as noise in the monitoring application, a standard feed-forward neural network might not give satisfactory results
• Then a recurrent approach might be useful
