
NEURAL NETWORKS
Data Mining PhD Seminar, 31 October 2011
Gabriela Sava

What is a Neural Network?

A complex structure that has the ability to store experiential knowledge and use it in decision making
Neural Networks are biologically inspired
The main goal of using Neural Networks is to train them to learn a classification task
The main characteristics of Neural Networks are:
Learning: they can be trained to modify their behavior in response to their environment
Generalization: the ability to see through noise and distortion to the pattern that needs to be recognized
Abstraction: some networks are capable of abstracting the essence of a set of inputs
Applicability: especially to pattern-recognition tasks

Mathematical definition

A neural network is a directed graph with vertices V and arcs A, with the following restrictions:
1. V is partitioned into a set of input nodes, hidden nodes and output nodes
2. The vertices are partitioned into layers, with all input nodes in layer 1 and all output nodes in layer k. The hidden nodes are in layers 2 to k-1, the hidden layers
3. Any arc <i, j> must have node i in layer h-1 and node j in layer h
4. Any arc <i, j> is labeled with a numeric value called a weight
5. Any node i is labeled with a function f_i
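The graph restrictions above translate naturally into a small data structure. The sketch below is illustrative only (the class and field names are my own, not from the seminar) and shows one way to enforce restriction 3 when adding arcs:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    layer: int                   # 1 = input, k = output, otherwise hidden
    activation: str = "sigmoid"  # label for the function attached to the node

@dataclass
class NeuralNetGraph:
    nodes: dict = field(default_factory=dict)  # node id -> Node
    arcs: dict = field(default_factory=dict)   # (i, j) -> weight

    def add_arc(self, i, j, weight):
        # Restriction 3: an arc may only connect adjacent layers.
        if self.nodes[j].layer != self.nodes[i].layer + 1:
            raise ValueError("arcs must go from layer h-1 to layer h")
        self.arcs[(i, j)] = weight

# Example: 2 input nodes, 1 hidden node, 1 output node (k = 3)
g = NeuralNetGraph()
g.nodes = {"x1": Node(1), "x2": Node(1), "h1": Node(2), "y1": Node(3)}
g.add_arc("x1", "h1", 0.1)
g.add_arc("x2", "h1", -0.2)
g.add_arc("h1", "y1", 0.4)
```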

What is a Neural Network Model?

A neural network model is a computational model consisting of:
1. A neural network graph that defines the data structure of the neural network
2. A learning algorithm that indicates how learning takes place
3. Recall techniques that determine how information is obtained from the network

Characteristics

Multilayer neural networks with at least one hidden layer are universal approximators: they can be used to approximate any target function
Can handle redundant features, because the weights are automatically learned during the training step; the weights for redundant features become very small
Sensitive to the presence of noise in the training data; to handle the noise we can use a validation set to determine the generalization error of the model, or decrease the weights by a factor at each iteration (weight decay)

Terminology

Input nodes: nodes that accept input patterns
Bias: an extra input for a node with the value 1, which has a negative weight
Hidden nodes: nodes that accept data from the input nodes, perform computations on them and then send the results to the output nodes
Output nodes: nodes that accept data from the hidden ones and give the output to a user or a user interface, or compare the output with the target patterns

Training set: used to train and teach the network to recognize patterns
Validation set: used to tune the parameters of a classifier by choosing the number of hidden nodes or hidden layers in the network
Test set: used to test the performance of the neural network (only for a fully specified classifier)

Activation function: a function which is applied to the set of inputs coming in to a node
There have been many proposals for activation functions over the years, but the most used ones currently are:
Threshold or step: the output value is 0 or 1, depending on the sum of the products of the input values and their associated weights
The binary output values may also be -1 or 1

The threshold function has the form:
f(S) = 1 if S > T, and 0 otherwise
where: S - the sum of the inputs adjusted by the weights
T - the threshold
Networks that use the threshold activation function:
Hopfield Networks
Bidirectional Associative Memory Models (BAM)


Sigmoid: an S-shaped curve with output values between -1 and 1 (or 0 and 1) which is monotonically increasing
Although there are several types of sigmoid functions, a common one is the logistic function:
f(S) = 1 / (1 + e^(-cS))
where: c - a positive constant value that changes the slope of the function
S - the sum of the inputs adjusted by the weights

The sigmoid function is really useful because it has some nice properties:
1. It is a smooth threshold, compared with the simple threshold
2. It has a simple derivative, which is critical in finding the proper weights to use
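For the logistic form f(S) = 1 / (1 + e^(-cS)) given above, the simple derivative mentioned in point 2 can be written entirely in terms of the function's own output, which is what makes it convenient during weight training:

f'(S) = c f(S) (1 - f(S))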

Networks that use the sigmoid activation function:
Backpropagation Neural Network Model (BPNN)

Gaussian: a bell-shaped curve with output values in the range [0, 1]
A typical function is:
f(x) = e^(-(x - S)^2 / (2V))
where: S - the mean
V - the variance of the function
x - the sum of the inputs adjusted by the weights
Networks that use the Gaussian activation function:
Kohonen Networks
Probabilistic Neural Networks (PNN)
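A minimal sketch of the three activation functions discussed above, written in Python with NumPy (the parameter names T, c, mean and var are illustrative, not the seminar's notation):

```python
import numpy as np

def threshold(s, T=0.0):
    """Step activation: 1 if the weighted input sum exceeds the threshold T."""
    return np.where(s > T, 1.0, 0.0)

def sigmoid(s, c=1.0):
    """Logistic sigmoid with slope constant c; output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-c * s))

def gaussian(s, mean=0.0, var=1.0):
    """Bell-shaped activation in (0, 1], centred on `mean` with variance `var`."""
    return np.exp(-((s - mean) ** 2) / (2.0 * var))

s = np.array([-2.0, 0.0, 0.5, 2.0])
print(threshold(s), sigmoid(s, c=2.0), gaussian(s, mean=0.0, var=0.5))
```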

Training algorithm: there are various training techniques used to train neural networks:
Hebbian Learning Algorithm: unsupervised learning in which training can be described as a local phenomenon, involving only 2 nodes and a connection
Instar Training: performs pattern recognition; the network is trained to respond to a specific input vector
Self-organization: algorithm used to construct Kohonen maps

Design issues

Number of source nodes: assign an input node to each numerical or binary input variable (if the variable is categorical we can use a coding scheme)
Number of hidden layers: depending on the network complexity, one or two hidden layers are usually enough; it can be decided manually at the beginning or automatically from the training set
Number of hidden nodes: depends on the structure of the network, the activation function type, the training algorithm, the problem being solved and the amount of noise

If too few hidden nodes are used, the target function may not be learned (underfitting)
If too many nodes are used, overfitting may occur
Rules of thumb are often given based on the training set size
Number of output nodes: usually the number of output nodes is the same as the number of classes, but this is not always the case (e.g. two classes can share a single output node)

Interconnections: to select the right model complexity we can start from a fully connected network and remove some nodes, evaluating the remaining structure
Weights: the initial ones are assumed to be small positive values assigned randomly
Activation function: use the one which best describes the learning algorithm that we want to implement

Learning algorithm: the most common approach is an adaptive form of backpropagation
Training data: with too much training data the network may suffer from overfitting, while with too little it may not be able to classify accurately enough
Training examples with missing values should be removed or replaced
The training data needs to cover the full range of values for all features that the network might encounter (including the output)

When to use Neural Networks?

Instances are represented by many attribute-value pairs: the input to the target function is described as a vector
The target function output may be discrete-valued, real-valued, or a vector of several real or discrete attributes
The training set may contain errors: the learning methods are robust to noise
A long training time is acceptable
Fast evaluation of the learned target function is required: the recall process in neural networks is much faster than the learning one

The ability of humans to understand the learned target function is not important: learned neural networks are not easily communicated to humans

Architecture
Feed-forward networks: connections go only to layers later in the structure; the signal propagates in only one direction, and the nodes in the next layer use the values produced by the previous layer as input values

Feedback networks: there exist connections back to earlier layers, which allow the signals to come back to previous nodes; such networks can learn new patterns and recall old ones simultaneously

Classification
Unsupervised Learning Models: the process does not require an example of the desired output (in most models the target output is the same as the input)
Objective: to categorize or discover features or patterns in the training data
Used in a wide variety of fields under different names; the best known are cluster analysis and dimensionality reduction
The most common learning variety is Hebbian learning
For more about the Hebb laws: T. Kohonen, Self-Organizing Maps, 3rd edition, Springer, 2001, pp. 91-96

Feedback Nets

Networks which allow the output to be fed back to the input
Models in this category:
Binary Adaptive Resonance Theory (ART1)
Discrete and Continuous Hopfield (DH and CH)
Discrete Bidirectional Associative Memory (BAM)

Feed-forward Nets

Networks which do not allow any feedback from the output to the input; the connections are unidirectional
Models in this category:
Learning Matrix (LM)
Linear Associative Memory (LAM)
Fuzzy Associative Memory (FAM)
Counterpropagation Network (CPN)

Supervised learning models

Require examples of the desired output to be specified, from which rules are generated
Objective: obtain the desired output by an iterative process of adjusting the weights, to develop an input/output behavior that maximizes the probability of receiving a reward and minimizes that of receiving a penalty

Feedback Nets

The system can reduce the learning time by adjusting the weights until it learns the input patterns
Models in this category:

Brain-State-in-a-Box (BSB)
Fuzzy Cognitive Map (FCM)
Boltzmann Machine (BM)
Backpropagation Through Time (BPTT)
Real-Time Recurrent Learning (RTRL)

Feed-forward Nets

The most widely applied neural network models; if the user does not obtain the desired output, the process is iterated using the connected nodes
Models in this category:

Perceptron
Backpropagation (BP)
Adaptive Logic Network (ALN)
Learning Vector Quantization (LVQ)
Probabilistic Neural Networks (PNN)

Backpropagation Neural Network

Backpropagation training is an iterative gradient-descent algorithm designed to minimize the mean-square error between the actual output and the desired one
The process is a step-by-step one, which means the learning time is usually long
Strength: gives good performance and easily handles complex pattern-recognition tasks
Weakness: learning speed is slow and the algorithm may become trapped in local minima

Network Architecture
Input layer: the input variables, which use the linear transformation function
Hidden layer: represents the interaction among the input nodes; uses the sigmoid transformation function
Output layer: represents the output variables

Algorithm

Learning Process
1. Set the parameters of the network
2. Set uniform random values for:
- the weight matrix between the input layer and the hidden layer
- the weight matrix between the hidden layer and the output layer
- the bias vector in the hidden layer
- the bias vector in the output layer


3. Obtain an input training vector X and the desired output vector T
4. Calculate the output vector Y as follows:
4.a Calculate the output vector H in the hidden layer:
H_j = f(net_j), where net_j is the net activation for each hidden node given the inputs (the weighted input sum plus the bias) and f is the sigmoid function
4.b Calculate the output vector Y:
Y_k = f(net_k), where net_k is the net activation for each output node given the hidden-node signals


5. Calculate the sensitivity values (the output error); the sensitivity δ_j of unit j measures its contribution to the training error
5.a Calculate the δ values in the output layer:
δ_k = (T_k - Y_k) f'(net_k)
5.b Calculate the δ values in the hidden layer:
δ_j = f'(net_j) Σ_k δ_k w_jk

6. Adjust the weights
6.a at the output layer:
Δw_jk = η δ_k H_j
where: η - the learning rate, which indicates the relative size of the change in the weights
6.b at the hidden layer:
Δw_ij = η δ_j X_i

7. Update the weights W and the biases using the Hebbian learning rule, i.e. add the adjustments computed in step 6
7.a at the output layer: w_jk = w_jk + Δw_jk
7.b at the hidden layer: w_ij = w_ij + Δw_ij

8. Repeat steps 3 to 7 until the network converges
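The learning process above can be condensed into a short NumPy sketch. It assumes sigmoid activations in both layers and uses made-up names (W_xh, W_hy, eta) for the weight matrices and learning rate; it illustrates steps 1-8 on the XOR data, not the seminar's own code, and may need a different seed or learning rate if it lands in a local minimum:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Steps 1-2: network parameters and small random weights/biases
n_in, n_hid, n_out, eta = 2, 2, 1, 0.5
W_xh = rng.uniform(-0.5, 0.5, (n_in, n_hid))
W_hy = rng.uniform(-0.5, 0.5, (n_hid, n_out))
b_h = rng.uniform(-0.5, 0.5, n_hid)
b_y = rng.uniform(-0.5, 0.5, n_out)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # step 3: inputs
T = np.array([[0], [1], [1], [0]], dtype=float)              # desired outputs (XOR)

for epoch in range(20000):                        # step 8: iterate until convergence
    for x, t in zip(X, T):
        h = sigmoid(x @ W_xh + b_h)               # step 4.a: hidden outputs
        y = sigmoid(h @ W_hy + b_y)               # step 4.b: network output
        delta_y = (t - y) * y * (1 - y)           # step 5.a: output sensitivity
        delta_h = h * (1 - h) * (W_hy @ delta_y)  # step 5.b: hidden sensitivity
        W_hy += eta * np.outer(h, delta_y)        # steps 6-7: adjust and update weights
        b_y += eta * delta_y
        W_xh += eta * np.outer(x, delta_h)
        b_h += eta * delta_h

print(np.round(sigmoid(sigmoid(X @ W_xh + b_h) @ W_hy + b_y), 2))
```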


Recall Process
1. Set the network parameters
2. Read in the weight matrices and bias vectors obtained during learning
3. Read in the test vector X
4. Calculate the output vector Y as follows:
4.a Calculate the output vector H in the hidden layer:
H_j = f(net_j), where net_j is the net activation for each hidden node given the inputs
4.b Calculate the output vector Y:
Y_k = f(net_k), where net_k is the net activation for each output node given the hidden-node signals

Limitations
Local minima: occur because the algorithm always changes the weights in such a way as to cause the error to fall, but the error might briefly have to rise as part of a more general fall; if this is the case, the algorithm will get stuck (because it cannot go uphill) and the error will not decrease further
Solution:
Reset the weights and start the training again with other random values

XOR problem

The simplest problem which can be solved using backpropagation is XOR, which can be described as follows: given two inputs representing conditions which are either both true or both false, the result should be false, whereas given two inputs for which only one of the conditions represented is true, the output should be true
For example: whether I would like a new car, depending on two conditions

The situations described are summarized in the table below:

Input A   Input B   Output F
0         0         0
0         1         1
1         0         1
1         1         0

The logical expression that describes the problem is
(A or B) and not (A and B)

XOR representation

Graphically, the XOR problem can be represented using a neural network as follows:
[network diagram: two input nodes, two hidden nodes and one output node]

Solving XOR

When A and B are both zero on input, their sum is still zero on reaching the hidden layer, thus neither node is activated, resulting in a zero output
When A is 0 and B is 1, their sum is greater than 1/2 but less than 1 1/2, thus the upper node is activated (resulting in a 1) and the lower node is not activated (resulting in a 0)
These values times their respective weights of 1 and -1 result in a 1 at the output node
When A is 1 and B is 1, their sum is greater than both 1/2 and 1 1/2, resulting in both nodes being activated
However, because the weighting on the output from the lower node is inverted, the sum of the values at the output node is 0:
(1 x 1) + (1 x -1) = 1 - 1 = 0
A worked version of this fixed-weight network is sketched below.
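A minimal fixed-weight sketch of this solution, assuming the thresholds 1/2 and 1 1/2 and the output weights 1 and -1 from the walkthrough above:

```python
def step(s, threshold):
    # Threshold activation: fires (1) only when the input sum exceeds the threshold.
    return 1 if s > threshold else 0

def xor_net(a, b):
    upper = step(a + b, 0.5)    # OR-like hidden node, threshold 1/2
    lower = step(a + b, 1.5)    # AND-like hidden node, threshold 1 1/2
    return step(1 * upper + (-1) * lower, 0.5)   # output weights 1 and -1

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))   # prints the XOR truth table row by row
```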

Examples

Pattern classification
Adaptive control
Noise filtering
Data compression
Expert systems

Probabilistic Neural Network (PNN)

Probabilistic neural networks are feed-forward networks built with three layers
They use nonlinear decision boundaries that are derived from Bayes decision strategies for classifying input vectors
They train quickly, since training is done in one pass over each training vector rather than several
They estimate the probability density function for each class based on the training samples

Strengths:
High computation capability: saves time and effort when working with huge databases and improves the accuracy of the computation results
Learning: a PNN is a dynamic system which can learn quickly from the data source; the decision boundaries can also be updated in real time using new data as it becomes available
Fault tolerance: damage to the connections will only slightly decrease the functionality; incomplete or noisy input information will not stop the network from processing

Weaknesses:
Large memory requirements: because the information is stored in matrix form, as the number of training samples increases the matrix becomes very large
Slower recall process, due to the processing of the large matrices

Network Architecture
Input layer: holds the input vector which needs to be classified
Pattern layer: has one neuron for each training vector sample
Summation layer: has one neuron for each population class
Output layer: a threshold discriminator which decides which summation unit has the maximum output

Basic concepts

Assume k possible classifications: c_1, ..., c_k
Classification rule: assign the input vector to the class with the largest response
The probability of classifying the input vector into each class is determined by a function which has a Gaussian distribution:

f_k(X) = (1 / (N_k (2π)^(m/2) σ^m)) Σ_j exp(-||X - X_kj||² / (2σ²))

where:
X - input vector
N_k - total number of training patterns for category k
j - pattern number
m - space dimension
σ - smoothing parameter
X_kj - j-th training pattern for category k

In a PNN we are interested only in the relative probability between the categories, so the formula used to code the learning program is the following:

f_k(X) ∝ (1 / N_k) Σ_j exp(-||X - X_kj||² / (2σ²))
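A compact sketch of that relative-probability rule in Python; the helper name pnn_classify and the toy data set are illustrative assumptions, not from the slides:

```python
import numpy as np

def pnn_classify(x, train_X, train_y, sigma=0.5):
    """Return the class whose average Gaussian kernel response to x is largest
    (the relative-probability form given above)."""
    scores = {}
    for k in np.unique(train_y):
        Xk = train_X[train_y == k]                         # training patterns of class k
        d2 = np.sum((Xk - x) ** 2, axis=1)                 # squared distances ||x - X_kj||^2
        scores[k] = np.mean(np.exp(-d2 / (2 * sigma**2)))  # pattern + summation layers
    return max(scores, key=scores.get)

# Tiny illustrative data set
train_X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
train_y = np.array([0, 0, 1, 1])
print(pnn_classify(np.array([0.15, 0.15]), train_X, train_y))   # -> 0
```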

Algorithm

Learning process
1. Use random numbers to initialize the original network weights and set the smoothing parameter σ
2. Input the vector X of the training sample and the target vector T
3. Set the matrix W
3.a The matrix W_xh between the input layer and the hidden layer is built from the training inputs: each entry holds the value of one of the input vectors in one of the training samples
3.b The matrix W_hy between the hidden layer and the output layer is built from the training targets: each entry holds the value of one of the output vectors in one of the training samples

Recall process
1. Set the smoothing parameter σ by an educated guess based on knowledge of the data, or using a heuristic technique (e.g. jackknifing)
2. Read the matrices W_xh and W_hy
3. Input the vector X of one of the testing examples
4. Compute the deductive output vector Y
4.a Compute the output vector H of the hidden layer:
H_j = exp(-||X - W_xh,j||² / (2σ²)), where W_xh,j is the j-th stored training input
4.b Compute the deductive output Y:
Y_k = Σ_j W_hy,jk H_j, the summed response of the hidden nodes belonging to class k
5. Modify the network weights based on the difference between the deductive output vectors and the target output vectors of the training sample
6. Repeat steps 2 to 5 until convergence is achieved

Training

The training set must be thoroughly representative of the actual population for effective classification
Adding or removing training samples simply involves adding or removing neurons in the pattern layer
As the training set increases in size, the PNN asymptotically converges to the Bayes optimal classifier
The training process of a PNN is essentially the act of determining the smoothing parameter σ

Examples

Applications in databases and signal processing
Modeling knowledge in computational biology and bioinformatics (gene regulatory networks, protein structure, gene expression analysis)
Medicine (probabilistic relationships between diseases and symptoms: given the symptoms, the network can be used to compute the probabilities of the presence of various diseases)
Document classification
Information retrieval
Image processing
Decision support systems
Engineering
Gaming
Law

Kohonen Networks
Self-Organising Maps (SOM)
It is a self-organizing network: the correct output cannot be defined a priori, and therefore a numerical measure of the magnitude of the mapping error cannot be used
Main characteristic: transforms the input space into a 1-D or 2-D discrete map (for visualization and dimension reduction) in a topology-preserving way (neighboring neurons respond to similar input patterns)
For more details see: T. Kohonen, Self-Organizing Maps, 3rd edition, Springer, 2001, ch. 3, 4 and 5

Network Architecture

Self-organizing maps are an example of competitive learning: the process finds the topology directly from the data
The Kohonen model has a strong neurobiological background: the mapping is similar to that of the visual field onto the cortex
Kohonen SOMs result from the synergy of three basic processes: Competition, Cooperation and Adaptation

Competition
Each neuron in a SOM is assigned a weight vector with the same dimensionality as the input space
Any given input pattern is compared to the weight vector of each neuron, and the closest neuron is declared the winner
The Euclidean norm is commonly used to measure the distance

Cooperation

The activation of the winning neuron is spread to the neurons in its immediate neighborhood
This allows topologically close neurons to become sensitive to similar patterns; it is how the winner's neighborhood is determined
Distance in the map is a function of the number of lateral connections to the winner (as in city-block distance)

The size of the neighborhood is initially large, but shrinks over time
An initially large neighborhood promotes a topology-preserving mapping
Smaller neighborhoods allow neurons to specialize in the latter stages of training

Mathematical implementation

The neighborhood function must be:
Symmetric around the origin: the function is shift-invariant
Its amplitude must decrease monotonically to zero
Its width must be adjustable

A typical choice for the neighborhood function is the Gaussian:

h_ji(d) = exp(-d² / (2σ²))

where: d - the topological distance (the one measured on the map) between neuron j and the winner i
σ - the neighborhood width

Adaptation
During training, the winning neuron and its topological neighbors are adapted to make their weight vectors more similar to the input pattern that caused the activation
Neurons that are closer to the winner adapt more heavily than neurons that are further away

Mathematical implementation

The magnitude of the adaptation is controlled with a learning rate, which decays over time to ensure convergence of the SOM
Learning rate decay rule (e.g. exponential decay): η(t) = η_0 exp(-t / τ)
Neighborhood size decay rule (e.g. exponential decay): σ(t) = σ_0 exp(-t / τ)

Algorithm
1. Initialize the weights to some small, random values
2.a Select the next input pattern x from the database
2.b Find the unit i that best matches the input pattern (the winner)
2.c Update the weights of the winner and all of its neighbors:
w_j(t+1) = w_j(t) + η(t) h_ji(t) (x - w_j(t))
2.d Decrease the learning rate
2.e Decrease the neighborhood size
3. Repeat step 2 until convergence
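A minimal 1-D SOM sketch of steps 1-3, using exponential decay for the learning rate and neighborhood size; the constants and the toy data set are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n_units, dim = 10, 2                       # 1-D map of 10 neurons over 2-D inputs
W = rng.uniform(0, 1, (n_units, dim))      # step 1: small random weights
eta0, sigma0, tau = 0.5, 3.0, 200.0

data = rng.uniform(0, 1, (500, dim))       # toy data set

for t, x in enumerate(data):
    eta = eta0 * np.exp(-t / tau)          # step 2.d: learning-rate decay
    sigma = sigma0 * np.exp(-t / tau)      # step 2.e: neighborhood-size decay
    winner = np.argmin(np.linalg.norm(W - x, axis=1))   # step 2.b: best-matching unit
    d = np.abs(np.arange(n_units) - winner)             # lateral (map) distance
    h = np.exp(-d**2 / (2 * sigma**2))                   # Gaussian neighborhood
    W += eta * h[:, None] * (x - W)        # step 2.c: move winner and neighbors
```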

Examples

Visualization of higher-dimensional data or processes
Density estimation
Inverse kinematics

Discrete Hopfield Networks

Hopfield's basic idea was to add feedback connections to the network and show that with these connections the network is capable of having memories: content-addressable memory systems with binary threshold units
Main characteristic: a Hopfield network can memorize and reconstruct a pattern from a corrupted original (auto-associative memory)
The network operates similarly to the feed-forward ones

Network Architecture

Single-layered recurrent network
Every neuron receives feedback from all the others
The states of the neurons are binary: -1 and 1
The connections are symmetric
No self-connections
The information is stored in fixed-point attractors

Algorithm
1. Train the network using the standard patterns (the weights store the patterns to be memorized)
2. Update the state of the network according to the following thresholding rule (activation function):
s_i = +1 if Σ_j w_ij s_j ≥ θ_i, and -1 otherwise
We also need to measure the state of the network via the Lyapunov energy function, which is useful in understanding the dynamics:
E = -1/2 Σ_i Σ_j w_ij s_i s_j + Σ_i θ_i s_i
The updating of the network can be made sequentially or in random order
3. Run the trained network with a corrupted pattern
4. The network returns the reconstructed pattern
The network will always converge to a fixed-point attractor (the pattern) only if the connections are symmetric
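A small sketch of Hebbian storage and asynchronous recall for a discrete Hopfield network, under the conventions above (states in {-1, 1}, symmetric weights, no self-connections, zero thresholds); the pattern and function names are illustrative:

```python
import numpy as np

def train_hopfield(patterns):
    """Hebbian storage: sum of outer products, zero diagonal (no self-connections)."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0)
    return W

def recall(W, state, steps=5):
    """Asynchronous (random-order) updates with the +/-1 thresholding rule."""
    state = state.copy()
    for _ in range(steps):
        for i in np.random.permutation(len(state)):
            state[i] = 1 if W[i] @ state >= 0 else -1
    return state

pattern = np.array([1, -1, 1, -1, 1, -1])         # stored pattern
W = train_hopfield(pattern[None, :])
corrupted = pattern.copy()
corrupted[0] = -1                                  # flip one bit
print(recall(W, corrupted))                        # converges back to the pattern
```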

Examples

Pattern reconstruction

Limitations

The number of patterns that can be stored is approximately 14% of the number of nodes in the network
If more patterns are used then:
the stored patterns become unstable
spurious stable states appear: states which do not correspond to stored patterns
Sometimes the network misinterprets the corrupted pattern

Bidirectional Associative Memory (BAM)

The BAM network is a generalization of the Hopfield network
Main characteristic: it implements a heteroassociative memory, which means that given a pattern the network can return another pattern which is potentially of a different size
Strength: can quickly recall the original uncorrupted pattern
Weakness: poor internal capacity to hold the information required to perform reasoning

Network Architecture
The network has only two layers: input and output
The training vector takes values in {-1, 1} and is divided into 2 parts: the front part (input layer) and the rear part (output layer)
BAM rule: the network can remember the relationships from the front part to the rear part

Algorithm

Learning Process
1. Set the network parameters
2. Calculate the weight matrix:
W = Σ_p (X_p)ᵀ Y_p
where: X_p - input values for the front part
Y_p - input values for the rear part
p - index of the p-th learning pair


Recall Process
1. Read the weight matrix W
2. Input a test vector X
3. Calculate the output vector Y:
Y = sgn(X W), where sgn thresholds each component to -1 or 1
4. Calculate the vector X at the input layer as follows:
X = sgn(Y Wᵀ)
5. Repeat steps 3 and 4 until the network converges to the learned association: the output nodes are associated with the input ones
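A tiny sketch of the BAM learning and recall steps, assuming the outer-product weight rule given above; the stored pairs are made up for illustration:

```python
import numpy as np

# Two stored front/rear associations (values in {-1, 1})
X = np.array([[ 1,  1, -1, -1],
              [-1,  1,  1, -1]])     # front parts (input layer)
Y = np.array([[ 1, -1,  1],
              [-1,  1,  1]])         # rear parts (output layer)

# Learning: weight matrix as the sum of outer products of each pair
W = sum(np.outer(x, y) for x, y in zip(X, Y))

sgn = lambda v: np.where(v >= 0, 1, -1)

# Recall: pass a front pattern forward, then echo it back (steps 3 and 4)
x = X[0]
y = sgn(x @ W)          # -> [ 1, -1,  1]   (the associated rear part)
x_back = sgn(W @ y)     # -> [ 1,  1, -1, -1]  (the original front part)
print(y, x_back)
```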

Examples

Applications in databases and signal processing
Connections between names and phone numbers stored as vectors
Character recognition

Competitive Learning

An unsupervised learning model where the outputs are in competition for the inputs
During training, the output unit that provides the highest activation to a given input pattern is declared the winner and is moved closer to the input pattern, whereas the rest of the neurons are left unchanged
The strategy is also called winner-take-all, since only the winning neuron is updated

Network architecture

Output units may have lateral inhibitory connections, so that a winner neuron can inhibit the others by an amount proportional to its activation level

Algorithm
1. Normalize all input patterns to get values in (0, 1)
2. Randomly select a pattern x
2.a Find the winner neuron: the one with the maximum value given by the activation function
2.b Update the winner neuron: w_winner = w_winner + η (x - w_winner)
2.c Normalize the winner neuron
3. Go to step 2 until no changes occur
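A short winner-take-all sketch of the algorithm, assuming a dot-product activation and a learning rate eta = 0.1 (both illustrative choices, as is the toy data set):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(0, 1, (200, 2))        # step 1: inputs already in (0, 1)
W = rng.uniform(0, 1, (3, 2))             # three output units
W /= np.linalg.norm(W, axis=1, keepdims=True)
eta = 0.1

for _ in range(50):
    x = data[rng.integers(len(data))]             # step 2: random pattern
    winner = np.argmax(W @ x)                      # 2.a: highest activation wins
    W[winner] += eta * (x - W[winner])             # 2.b: move winner toward input
    W[winner] /= np.linalg.norm(W[winner])         # 2.c: re-normalize the winner
print(W)
```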

Examples

In an international competition, competitive learning exists when only one student's goal is achieved (the winner of the 1st place) and all other students fail to reach the goal

Consider bidding in the stock market: the stocks are the input and each broker competes by bidding with a value. The most suitable output is the highest value!

Advantages

Neural networks are robust structures because of the weights
A network improves its performance by learning, which can continue even after the training set has been applied
A low error rate means a high degree of accuracy once the appropriate training has been performed
More robust in noisy environments

Disadvantages

Difficult to understand for non-technical users
Generating rules from neural networks is not straightforward
Input attribute values must be numeric
Network overfitting may occur
The learning phase may fail to converge
