
NEURAL NETWORKS
Data Mining PhD Seminar, 31 October 2011
Gabriela Sava

What is a Neural Network?

A complex structure that has the ability to store experiential knowledge and use it in decision making
Neural Networks are biologically inspired
The main goal of using Neural Networks is to train them to learn a classification task
The main characteristics of Neural Networks are:
Learning: they can be trained to modify their behavior in response to their environment
Generalization: the ability to see through noise and distortion to the pattern that needs to be recognized
Abstraction: some networks are capable of abstracting the essence of a set of inputs
Applicability: especially to pattern-recognition tasks

Mathematical definition

A neural network is a directed graph with vertices V and arcs A, with the following restrictions:
1. V is partitioned into a set of input nodes, hidden nodes and output nodes
2. The vertices are partitioned into layers, with all input nodes in layer 1 and all output nodes in layer k. The hidden nodes are in layers 2 to k-1, the hidden layers
3. Any arc <i, j> must have node i in layer h-1 and node j in layer h
4. Any arc <i, j> is labeled with a numeric value called a weight
5. Any node i is labeled with a function f_i
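The graph restrictions above translate naturally into a small data structure. The sketch below is illustrative only (the class and field names are my own, not from the seminar) and shows one way to enforce restriction 3 when adding arcs:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    layer: int                   # 1 = input, k = output, otherwise hidden
    activation: str = "sigmoid"  # label for the function attached to the node

@dataclass
class NeuralNetGraph:
    nodes: dict = field(default_factory=dict)  # node id -> Node
    arcs: dict = field(default_factory=dict)   # (i, j) -> weight

    def add_arc(self, i, j, weight):
        # Restriction 3: an arc may only connect adjacent layers.
        if self.nodes[j].layer != self.nodes[i].layer + 1:
            raise ValueError("arcs must go from layer h-1 to layer h")
        self.arcs[(i, j)] = weight

# Example: 2 input nodes, 1 hidden node, 1 output node (k = 3)
g = NeuralNetGraph()
g.nodes = {"x1": Node(1), "x2": Node(1), "h1": Node(2), "y1": Node(3)}
g.add_arc("x1", "h1", 0.1)
g.add_arc("x2", "h1", -0.2)
g.add_arc("h1", "y1", 0.4)
```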

What is a Neural Network Model?

A neural network model is a computational model consisting of:
1. A neural network graph that defines the data structure of the neural network
2. A learning algorithm that indicates how learning takes place
3. Recall techniques that determine how information is obtained from the network

Characteristics

Multilayer neural networks with at least one hidden layer are universal approximators: they can be used to approximate any target function
Can handle redundant features, because the weights are automatically learned during the training step; the weights for redundant features become very small
Sensitive to the presence of noise in the training data; to handle the noise we can use a validation set to determine the generalization error of the model, or decrease the weights by a factor at each iteration (weight decay)

Terminology

Input nodes: nodes that accept input patterns
Bias: an extra input for a node with the value 1, which has a negative weight
Hidden nodes: nodes that accept data from the input nodes, perform computations on them and then send the results to the output nodes
Output nodes: nodes that accept data from the hidden ones and give the output to a user or a user interface, or compare the output with the target patterns

Training set: used to train and teach the network to recognize patterns
Validation set: used to tune the parameters of a classifier by choosing the number of hidden nodes or hidden layers in the network
Test set: used to test the performance of the neural network (only for a fully specified classifier)

Activation function: a function which is applied to the set of inputs coming in to a node
There have been many proposals for activation functions over the years, but the most used ones currently are:
Threshold or step: the output value is 0 or 1, depending on the sum of the products of the input values and their associated weights
The binary output values may also be -1 or 1

The threshold function has the form:
f(S) = 1 if S > T, and 0 otherwise
where: S - the sum of the inputs adjusted by the weights
T - the threshold
Networks that use the threshold activation function:
Hopfield Networks
Bidirectional Associative Memory Models (BAM)


Sigmoid: an S-shaped curve with output values between -1 and 1 (or 0 and 1) which is monotonically increasing
Although there are several types of sigmoid functions, a common one is the logistic function:
f(S) = 1 / (1 + e^(-cS))
where: c - a positive constant value that changes the slope of the function
S - the sum of the inputs adjusted by the weights

The sigmoid function is really useful because it has some nice properties:
1. It is a smooth threshold, compared with the simple threshold
2. It has a simple derivative, which is critical in finding the proper weights to use
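For the logistic form f(S) = 1 / (1 + e^(-cS)) given above, the simple derivative mentioned in point 2 can be written entirely in terms of the function's own output, which is what makes it convenient during weight training:

f'(S) = c f(S) (1 - f(S))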

Networks that use the sigmoid activation function:
Backpropagation Neural Network Model (BPNN)

Gaussian: a bell-shaped curve with output values in the range [0, 1]
A typical function is:
f(x) = e^(-(x - S)^2 / (2V))
where: S - the mean
V - the variance of the function
x - the sum of the inputs adjusted by the weights
Networks that use the Gaussian activation function:
Kohonen Networks
Probabilistic Neural Networks (PNN)
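A minimal sketch of the three activation functions discussed above, written in Python with NumPy (the parameter names T, c, mean and var are illustrative, not the seminar's notation):

```python
import numpy as np

def threshold(s, T=0.0):
    """Step activation: 1 if the weighted input sum exceeds the threshold T."""
    return np.where(s > T, 1.0, 0.0)

def sigmoid(s, c=1.0):
    """Logistic sigmoid with slope constant c; output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-c * s))

def gaussian(s, mean=0.0, var=1.0):
    """Bell-shaped activation in (0, 1], centred on `mean` with variance `var`."""
    return np.exp(-((s - mean) ** 2) / (2.0 * var))

s = np.array([-2.0, 0.0, 0.5, 2.0])
print(threshold(s), sigmoid(s, c=2.0), gaussian(s, mean=0.0, var=0.5))
```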

Training algorithm: there are various training techniques used to train neural networks:
Hebbian Learning Algorithm: unsupervised learning in which training can be described as a local phenomenon, involving only 2 nodes and a connection
Instar Training: performs pattern recognition; the network is trained to respond to a specific input vector
Self-organization: algorithm used to construct Kohonen maps

Design issues

Number of source nodes: assign an input node to each numerical or binary input variable (if the variable is categorical we can use a coding scheme)
Number of hidden layers: depending on the network complexity, one or two hidden layers are usually enough; it can be decided manually at the beginning or automatically from the training set
Number of hidden nodes: depends on the structure of the network, the activation function type, the training algorithm, the problem being solved and the amount of noise

If too few hidden nodes are used, the target function may not be learned (underfitting)
If too many nodes are used, overfitting may occur
Rules of thumb are often given based on the training set size
Number of output nodes: usually the number of output nodes is the same as the number of classes, but this is not always the case (e.g. two classes can share a single output node)

Interconnections: to select the right model complexity we can start from a fully connected network and remove some nodes, evaluating the remaining structure
Weights: the initial ones are assumed to be small positive values assigned randomly
Activation function: use the one which best describes the learning algorithm that we want to implement

Learning algorithm: the most common approach is an adaptive form of backpropagation
Training data: with too much training data the network may suffer from overfitting, while with too little it may not be able to classify accurately enough
Training examples with missing values should be removed or replaced
The training data needs to cover the full range of values for all features that the network might encounter (including the output)

When to use Neural Networks?

Instances are represented by many attribute-value pairs: the input to the target function is described as a vector
The target function output may be discrete-valued, real-valued, or a vector of several real or discrete attributes
The training set may contain errors: the learning methods are robust to noise
A long training time is acceptable
Fast evaluation of the learned target function is required: the recall process in neural networks is much faster than the learning one

The ability of humans to understand the learned target function is not important: learned neural networks are not easily communicated to humans

Architecture
Feed-forward networks: connections go only to layers later in the structure; the signal propagates in only one direction, and the nodes in the next layer use the values produced by the previous layer as input values

Feedback networks: there exist connections back to earlier layers, which allow the signals to come back to previous nodes; such networks can learn new patterns and recall old ones simultaneously

Classification
Unsupervised Learning Models: the process does not require an example of the desired output (in most models the target output is the same as the input)
Objective: to categorize or discover features or patterns in the training data
Used in a wide variety of fields under different names; the best known are cluster analysis and dimensionality reduction
The most common learning variety is Hebbian learning
For more about the Hebb laws: T. Kohonen, Self-Organizing Maps, 3rd edition, Springer, 2001, pp. 91-96

Feedback Nets

Networks which allow the output to be fed back to the input
Models in this category:
Binary Adaptive Resonance Theory (ART1)
Discrete and Continuous Hopfield (DH and CH)
Discrete Bidirectional Associative Memory (BAM)

Feed-forward Nets

Networks which do not allow any feedback from the output to the input; the connections are unidirectional
Models in this category:
Learning Matrix (LM)
Linear Associative Memory (LAM)
Fuzzy Associative Memory (FAM)
Counterpropagation Network (CPN)

Supervised learning models

Require examples of the desired output to be specified, from which rules are generated
Objective: obtain the desired output by an iterative process of adjusting the weights, to develop an input/output behavior that maximizes the probability of receiving a reward and minimizes that of receiving a penalty

Feedback Nets

The system can reduce the learning time by adjusting the weights until it learns the input patterns
Models in this category:

Brain-State-in-a-Box (BSB)
Fuzzy Cognitive Map (FCM)
Boltzmann Machine (BM)
Backpropagation Through Time (BPTT)
Real-Time Recurrent Learning (RTRL)

Feed-forward Nets

The most widely applied neural network models; if the user does not obtain the desired output, the process is iterated using the connected nodes
Models in this category:

Perceptron
Backpropagation (BP)
Adaptive Logic Network (ALN)
Learning Vector Quantization (LVQ)
Probabilistic Neural Networks (PNN)

Backpropagation Neural Network

Backpropagation training is an iterative gradient-descent algorithm designed to minimize the mean-square error between the actual output and the desired one
The process is a step-by-step one, which means the learning time is usually long
Strength: gives good performance and easily handles complex pattern-recognition tasks
Weakness: learning speed is slow and the algorithm may become trapped in local minima

Network Architecture
Input layer: the input variables, which use the linear transformation function
Hidden layer: represents the interaction among the input nodes; uses the sigmoid transformation function
Output layer: represents the output variables

Algorithm

Learning Process
1. Set the parameters of the network
2. Set uniform random values for:
- the weight matrix between the input layer and the hidden layer
- the weight matrix between the hidden layer and the output layer
- the bias vector in the hidden layer
- the bias vector in the output layer


3. Obtain an input training vector X and the desired output vector T
4. Calculate the output vector Y as follows:
4.a Calculate the output vector H in the hidden layer:
H_j = f(net_j), where net_j is the net activation for each hidden node given the inputs (the weighted input sum plus the bias) and f is the sigmoid function
4.b Calculate the output vector Y:
Y_k = f(net_k), where net_k is the net activation for each output node given the hidden-node signals


5. Calculate the sensitivity values (the output error); the sensitivity δ_j of unit j measures its contribution to the training error
5.a Calculate the δ values in the output layer:
δ_k = (T_k - Y_k) f'(net_k)
5.b Calculate the δ values in the hidden layer:
δ_j = f'(net_j) Σ_k δ_k w_jk

6. Adjust the weights
6.a at the output layer:
Δw_jk = η δ_k H_j
where: η - the learning rate, which indicates the relative size of the change in the weights
6.b at the hidden layer:
Δw_ij = η δ_j X_i

7. Update the weights W and the biases using the Hebbian learning rule, i.e. add the adjustments computed in step 6
7.a at the output layer: w_jk = w_jk + Δw_jk
7.b at the hidden layer: w_ij = w_ij + Δw_ij

8. Repeat steps 3 to 7 until the network converges
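The learning process above can be condensed into a short NumPy sketch. It assumes sigmoid activations in both layers and uses made-up names (W_xh, W_hy, eta) for the weight matrices and learning rate; it illustrates steps 1-8 on the XOR data, not the seminar's own code, and may need a different seed or learning rate if it lands in a local minimum:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Steps 1-2: network parameters and small random weights/biases
n_in, n_hid, n_out, eta = 2, 2, 1, 0.5
W_xh = rng.uniform(-0.5, 0.5, (n_in, n_hid))
W_hy = rng.uniform(-0.5, 0.5, (n_hid, n_out))
b_h = rng.uniform(-0.5, 0.5, n_hid)
b_y = rng.uniform(-0.5, 0.5, n_out)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # step 3: inputs
T = np.array([[0], [1], [1], [0]], dtype=float)              # desired outputs (XOR)

for epoch in range(20000):                        # step 8: iterate until convergence
    for x, t in zip(X, T):
        h = sigmoid(x @ W_xh + b_h)               # step 4.a: hidden outputs
        y = sigmoid(h @ W_hy + b_y)               # step 4.b: network output
        delta_y = (t - y) * y * (1 - y)           # step 5.a: output sensitivity
        delta_h = h * (1 - h) * (W_hy @ delta_y)  # step 5.b: hidden sensitivity
        W_hy += eta * np.outer(h, delta_y)        # steps 6-7: adjust and update weights
        b_y += eta * delta_y
        W_xh += eta * np.outer(x, delta_h)
        b_h += eta * delta_h

print(np.round(sigmoid(sigmoid(X @ W_xh + b_h) @ W_hy + b_y), 2))
```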


Recall Process
1. Set the network parameters
2. Read in the weight matrices and bias vectors obtained during learning
3. Read in the test vector X
4. Calculate the output vector Y as follows:
4.a Calculate the output vector H in the hidden layer:
H_j = f(net_j), where net_j is the net activation for each hidden node given the inputs
4.b Calculate the output vector Y:
Y_k = f(net_k), where net_k is the net activation for each output node given the hidden-node signals

Limitations
Local minima: occur because the algorithm always changes the weights in such a way as to cause the error to fall, but the error might briefly have to rise as part of a more general fall; if this is the case, the algorithm will get stuck (because it cannot go uphill) and the error will not decrease further
Solution:
Reset the weights and start the training again with other random values

XOR problem

The simplest problem which can be solved using backpropagation is XOR, which can be described as follows: given two inputs representing conditions which are either both true or both false, the result should be false, whereas given two inputs for which only one of the conditions represented is true, the output should be true
For example: whether I would like a new car, depending on two conditions

The situations described are summarized in the table below:

Input A   Input B   Output F
0         0         0
0         1         1
1         0         1
1         1         0

The logical expression that describes the problem is
(A or B) and not (A and B)

XOR representation

Graphically, the XOR problem can be represented using a neural network as follows:
[network diagram: two input nodes, two hidden nodes and one output node]

Solving XOR

When A and B are both zero on input, their sum is still zero on reaching the hidden layer, thus neither node is activated, resulting in a zero output
When A is 0 and B is 1, their sum is greater than 1/2 but less than 1 1/2, thus the upper node is activated (resulting in a 1) and the lower node is not activated (resulting in a 0)
These values times their respective weights of 1 and -1 result in a 1 at the output node
When A is 1 and B is 1, their sum is greater than both 1/2 and 1 1/2, resulting in both nodes being activated
However, because the weighting on the output from the lower node is inverted, the sum of the values at the output node is 0:
(1 x 1) + (1 x -1) = 1 - 1 = 0
A worked version of this fixed-weight network is sketched below.
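A minimal fixed-weight sketch of this solution, assuming the thresholds 1/2 and 1 1/2 and the output weights 1 and -1 from the walkthrough above:

```python
def step(s, threshold):
    # Threshold activation: fires (1) only when the input sum exceeds the threshold.
    return 1 if s > threshold else 0

def xor_net(a, b):
    upper = step(a + b, 0.5)    # OR-like hidden node, threshold 1/2
    lower = step(a + b, 1.5)    # AND-like hidden node, threshold 1 1/2
    return step(1 * upper + (-1) * lower, 0.5)   # output weights 1 and -1

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))   # prints the XOR truth table row by row
```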

Examples

Pattern classification
Adaptive control
Noise filtering
Data compression
Expert systems

Probabilistic Neural Network (PNN)

Probabilistic neural networks are feed-forward networks built with three layers
They use nonlinear decision boundaries that are derived from Bayes decision strategies for classifying input vectors
They train quickly, since training is done in one pass over each training vector rather than several
They estimate the probability density function for each class based on the training samples

Strengths:
High computation capability: saves time and effort when working with huge databases and improves the accuracy of the computation results
Learning: a PNN is a dynamic system which can learn quickly from the data source; the decision boundaries can also be updated in real time using new data as it becomes available
Fault tolerance: damage to the connections will only slightly decrease the functionality; incomplete or noisy input information will not stop the network from processing

Weaknesses:
Large memory requirements: because the information is stored in matrix form, as the number of training samples increases the matrix becomes very large
Slower recall process, due to the processing of the large matrices

Network Architecture
Input layer: holds the input vector which needs to be classified
Pattern layer: has one neuron for each training vector sample
Summation layer: has one neuron for each population class
Output layer: a threshold discriminator which decides which summation unit has the maximum output

Basic concepts

Assume k possible classifications: c_1, ..., c_k
Classification rule: assign the input vector to the class with the largest response
The probability of classifying the input vector into each class is determined by a function which has a Gaussian distribution:

f_k(X) = (1 / (N_k (2π)^(m/2) σ^m)) Σ_j exp(-||X - X_kj||² / (2σ²))

where:
X - input vector
N_k - total number of training patterns for category k
j - pattern number
m - space dimension
σ - smoothing parameter
X_kj - j-th training pattern for category k

In a PNN we are interested only in the relative probability between the categories, so the formula used to code the learning program is the following:

f_k(X) ∝ (1 / N_k) Σ_j exp(-||X - X_kj||² / (2σ²))
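A compact sketch of that relative-probability rule in Python; the helper name pnn_classify and the toy data set are illustrative assumptions, not from the slides:

```python
import numpy as np

def pnn_classify(x, train_X, train_y, sigma=0.5):
    """Return the class whose average Gaussian kernel response to x is largest
    (the relative-probability form given above)."""
    scores = {}
    for k in np.unique(train_y):
        Xk = train_X[train_y == k]                         # training patterns of class k
        d2 = np.sum((Xk - x) ** 2, axis=1)                 # squared distances ||x - X_kj||^2
        scores[k] = np.mean(np.exp(-d2 / (2 * sigma**2)))  # pattern + summation layers
    return max(scores, key=scores.get)

# Tiny illustrative data set
train_X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
train_y = np.array([0, 0, 1, 1])
print(pnn_classify(np.array([0.15, 0.15]), train_X, train_y))   # -> 0
```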

Algorithm

Learning process
1. Use random numbers to initialize the original network weights and set the smoothing parameter σ
2. Input the vector X of the training sample and the target vector T
3. Set the matrix W
3.a The matrix W_xh between the input layer and the hidden layer is built from the training inputs: each entry holds the value of one of the input vectors in one of the training samples
3.b The matrix W_hy between the hidden layer and the output layer is built from the training targets: each entry holds the value of one of the output vectors in one of the training samples

Recall process
1. Set the smoothing parameter σ by an educated guess based on knowledge of the data, or using a heuristic technique (e.g. jackknifing)
2. Read the matrices W_xh and W_hy
3. Input the vector X of one of the testing examples
4. Compute the deductive output vector Y
4.a Compute the output vector H of the hidden layer:
H_j = exp(-||X - W_xh,j||² / (2σ²)), where W_xh,j is the j-th stored training input
4.b Compute the deductive output Y:
Y_k = Σ_j W_hy,jk H_j, the summed response of the hidden nodes belonging to class k
5. Modify the network weights based on the difference between the deductive output vectors and the target output vectors of the training sample
6. Repeat steps 2 to 5 until convergence is achieved

Training

The training set must be thoroughly representative of the actual population for effective classification
Adding or removing training samples simply involves adding or removing neurons in the pattern layer
As the training set increases in size, the PNN asymptotically converges to the Bayes optimal classifier
The training process of a PNN is essentially the act of determining the smoothing parameter σ

Examples

Applications in databases and signal processing
Modeling knowledge in computational biology and bioinformatics (gene regulatory networks, protein structure, gene expression analysis)
Medicine (probabilistic relationships between diseases and symptoms: given the symptoms, the network can be used to compute the probabilities of the presence of various diseases)
Document classification
Information retrieval
Image processing
Decision support systems
Engineering
Gaming
Law

Kohonen Networks
Self-Organising Maps (SOM)
It is a self-organizing network: the correct output cannot be defined a priori, and therefore a numerical measure of the magnitude of the mapping error cannot be used
Main characteristic: transforms the input space into a 1-D or 2-D discrete map (for visualization and dimension reduction) in a topology-preserving way (neighboring neurons respond to similar input patterns)
For more details see: T. Kohonen, Self-Organizing Maps, 3rd edition, Springer, 2001, ch. 3, 4 and 5

Network Architecture

Self-organizing maps are an example of competitive learning: the process finds the topology directly from the data
The Kohonen model has a strong neurobiological background: the mapping is similar to that of the visual field onto the cortex
Kohonen SOMs result from the synergy of three basic processes: Competition, Cooperation and Adaptation

Competition
Each neuron in a SOM is assigned a weight vector with the same dimensionality as the input space
Any given input pattern is compared to the weight vector of each neuron, and the closest neuron is declared the winner
The Euclidean norm is commonly used to measure the distance

Cooperation

The activation of the winning neuron is spread to the neurons in its immediate neighborhood
This allows topologically close neurons to become sensitive to similar patterns; it is how the winner's neighborhood is determined
Distance in the map is a function of the number of lateral connections to the winner (as in city-block distance)

The size of the neighborhood is initially large, but shrinks over time
An initially large neighborhood promotes a topology-preserving mapping
Smaller neighborhoods allow neurons to specialize in the latter stages of training

Mathematical implementation

The neighborhood function must be:
Symmetric around the origin: the function is shift-invariant
Its amplitude must decrease monotonically to zero
Its width must be adjustable

A typical choice for the neighborhood function is the Gaussian:

h_ji(d) = exp(-d² / (2σ²))

where: d - the topological distance (the one measured on the map) between neuron j and the winner i
σ - the neighborhood width

Adaptation
During training, the winning neuron and its topological neighbors are adapted to make their weight vectors more similar to the input pattern that caused the activation
Neurons that are closer to the winner adapt more heavily than neurons that are further away

Mathematical implementation

The magnitude of the adaptation is controlled with a learning rate, which decays over time to ensure convergence of the SOM
Learning rate decay rule (e.g. exponential decay): η(t) = η_0 exp(-t / τ)
Neighborhood size decay rule (e.g. exponential decay): σ(t) = σ_0 exp(-t / τ)

Algorithm
1. Initialize the weights to some small, random values
2.a Select the next input pattern x from the database
2.b Find the unit i that best matches the input pattern (the winner)
2.c Update the weights of the winner and all of its neighbors:
w_j(t+1) = w_j(t) + η(t) h_ji(t) (x - w_j(t))
2.d Decrease the learning rate
2.e Decrease the neighborhood size
3. Repeat step 2 until convergence
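A minimal 1-D SOM sketch of steps 1-3, using exponential decay for the learning rate and neighborhood size; the constants and the toy data set are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n_units, dim = 10, 2                       # 1-D map of 10 neurons over 2-D inputs
W = rng.uniform(0, 1, (n_units, dim))      # step 1: small random weights
eta0, sigma0, tau = 0.5, 3.0, 200.0

data = rng.uniform(0, 1, (500, dim))       # toy data set

for t, x in enumerate(data):
    eta = eta0 * np.exp(-t / tau)          # step 2.d: learning-rate decay
    sigma = sigma0 * np.exp(-t / tau)      # step 2.e: neighborhood-size decay
    winner = np.argmin(np.linalg.norm(W - x, axis=1))   # step 2.b: best-matching unit
    d = np.abs(np.arange(n_units) - winner)             # lateral (map) distance
    h = np.exp(-d**2 / (2 * sigma**2))                   # Gaussian neighborhood
    W += eta * h[:, None] * (x - W)        # step 2.c: move winner and neighbors
```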

Examples

Visualization of higher-dimensional data or processes
Density estimation
Inverse kinematics

Discrete Hopfield Networks

Hopfield's basic idea was to add feedback connections to the network and show that with these connections the network is capable of having memories: content-addressable memory systems with binary threshold units
Main characteristic: a Hopfield network can memorize and reconstruct a pattern from a corrupted original (auto-associative memory)
The network operates similarly to the feed-forward ones

Network Architecture

Single-layered recurrent network
Every neuron receives feedback from all the others
The states of the neurons are binary: -1 and 1
The connections are symmetric
No self-connections
The information is stored in fixed-point attractors

Algorithm
1. Train the network using the standard patterns (the weights store the patterns to be memorized)
2. Update the state of the network according to the following thresholding rule (activation function):
s_i = +1 if Σ_j w_ij s_j ≥ θ_i, and -1 otherwise
We also need to measure the state of the network via the Lyapunov energy function, which is useful in understanding the dynamics:
E = -1/2 Σ_i Σ_j w_ij s_i s_j + Σ_i θ_i s_i
The updating of the network can be made sequentially or in random order
3. Run the trained network with a corrupted pattern
4. The network returns the reconstructed pattern
The network will always converge to a fixed-point attractor (the pattern) only if the connections are symmetric
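A small sketch of Hebbian storage and asynchronous recall for a discrete Hopfield network, under the conventions above (states in {-1, 1}, symmetric weights, no self-connections, zero thresholds); the pattern and function names are illustrative:

```python
import numpy as np

def train_hopfield(patterns):
    """Hebbian storage: sum of outer products, zero diagonal (no self-connections)."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0)
    return W

def recall(W, state, steps=5):
    """Asynchronous (random-order) updates with the +/-1 thresholding rule."""
    state = state.copy()
    for _ in range(steps):
        for i in np.random.permutation(len(state)):
            state[i] = 1 if W[i] @ state >= 0 else -1
    return state

pattern = np.array([1, -1, 1, -1, 1, -1])         # stored pattern
W = train_hopfield(pattern[None, :])
corrupted = pattern.copy()
corrupted[0] = -1                                  # flip one bit
print(recall(W, corrupted))                        # converges back to the pattern
```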

Examples

Pattern reconstruction

Limitations

The number of patterns that can be stored is approximately 14% of the number of nodes in the network
If more patterns are used then:
the stored patterns become unstable
spurious stable states appear: states which do not correspond to stored patterns
Sometimes the network misinterprets the corrupted pattern

Bidirectional Associative Memory (BAM)

The BAM network is a generalization of the Hopfield network
Main characteristic: it implements a heteroassociative memory, which means that given a pattern the network can return another pattern which is potentially of a different size
Strength: can quickly recall the original uncorrupted pattern
Weakness: poor internal capacity to hold the information required to perform reasoning

Network Architecture
The network has only two layers: input and output
The training vector takes values in {-1, 1} and is divided into 2 parts: the front part (input layer) and the rear part (output layer)
BAM rule: the network can remember the relationships from the front part to the rear part

Algorithm

Learning Process
1. Set the network parameters
2. Calculate the weight matrix:
W = Σ_p (X_p)ᵀ Y_p
where: X_p - input values for the front part
Y_p - input values for the rear part
p - index of the p-th learning pair


Recall Process
1. Read the weight matrix W
2. Input a test vector X
3. Calculate the output vector Y:
Y = sgn(X W), where sgn thresholds each component to -1 or 1
4. Calculate the vector X at the input layer as follows:
X = sgn(Y Wᵀ)
5. Repeat steps 3 and 4 until the network converges to the learned association: the output nodes are associated with the input ones
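A tiny sketch of the BAM learning and recall steps, assuming the outer-product weight rule given above; the stored pairs are made up for illustration:

```python
import numpy as np

# Two stored front/rear associations (values in {-1, 1})
X = np.array([[ 1,  1, -1, -1],
              [-1,  1,  1, -1]])     # front parts (input layer)
Y = np.array([[ 1, -1,  1],
              [-1,  1,  1]])         # rear parts (output layer)

# Learning: weight matrix as the sum of outer products of each pair
W = sum(np.outer(x, y) for x, y in zip(X, Y))

sgn = lambda v: np.where(v >= 0, 1, -1)

# Recall: pass a front pattern forward, then echo it back (steps 3 and 4)
x = X[0]
y = sgn(x @ W)          # -> [ 1, -1,  1]   (the associated rear part)
x_back = sgn(W @ y)     # -> [ 1,  1, -1, -1]  (the original front part)
print(y, x_back)
```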

Examples

Applications in databases and signal processing
Connections between names and phone numbers stored as vectors
Character recognition

Competitive Learning

An unsupervised learning model where the outputs are in competition for the inputs
During training, the output unit that provides the highest activation to a given input pattern is declared the winner and is moved closer to the input pattern, whereas the rest of the neurons are left unchanged
The strategy is also called winner-take-all, since only the winning neuron is updated

Network architecture

Output units may have lateral inhibitory connections, so that a winner neuron can inhibit the others by an amount proportional to its activation level

Algorithm
1. Normalize all input patterns to get values in (0, 1)
2. Randomly select a pattern x
2.a Find the winner neuron: the one with the maximum value given by the activation function
2.b Update the winner neuron: w_winner = w_winner + η (x - w_winner)
2.c Normalize the winner neuron
3. Go to step 2 until no changes occur
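A short winner-take-all sketch of the algorithm, assuming a dot-product activation and a learning rate eta = 0.1 (both illustrative choices, as is the toy data set):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(0, 1, (200, 2))        # step 1: inputs already in (0, 1)
W = rng.uniform(0, 1, (3, 2))             # three output units
W /= np.linalg.norm(W, axis=1, keepdims=True)
eta = 0.1

for _ in range(50):
    x = data[rng.integers(len(data))]             # step 2: random pattern
    winner = np.argmax(W @ x)                      # 2.a: highest activation wins
    W[winner] += eta * (x - W[winner])             # 2.b: move winner toward input
    W[winner] /= np.linalg.norm(W[winner])         # 2.c: re-normalize the winner
print(W)
```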

Examples

In an international competition, competitive learning exists when only one student's goal is achieved (the winner of the 1st place) and all other students fail to reach the goal

Consider bidding in the stock market: the stocks are the input and each broker competes by bidding with a value. The most suitable output is the highest value!

Advantages

Neural networks are robust structures because of the weights
A network improves its performance by learning, which can continue even after the training set has been applied
A low error rate means a high degree of accuracy once the appropriate training has been performed
More robust in noisy environments

Disadvantages

Difficult to understand for non-technical users
Generating rules from neural networks is not straightforward
Input attribute values must be numeric
Network overfitting may occur
The learning phase may fail to converge
