
CIT 6261:

Advanced Artificial Intelligence

Learning in Neural Networks

Lecturer: Dr. Md. Saiful Islam

Biological inspirations

• Some numbers…
  - The human brain contains about 10 billion nerve cells (neurons)
  - Each neuron is connected to other neurons through about 10,000 synapses

• Properties of the brain
  - It can learn and reorganize itself from experience
  - It adapts to the environment
  - It is robust and fault tolerant
Biological neuron

• A neuron has
  - A branching input structure (the dendrites)
  - A branching output structure (the axon)
• Information flows from the dendrites to the axon via the cell body
• The axon connects to the dendrites of other neurons via synapses
  - Synapses vary in strength
  - Synapses may be excitatory or inhibitory

What is an artificial neuron?

• Definition: a non-linear, parameterized function with a restricted output range

  y = f( w_0 + Σ_{i=1}^{n-1} w_i x_i )

[Figure: a single neuron with inputs x1, x2, x3, bias weight w_0, and output y]
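A minimal Python/NumPy sketch of this weighted-sum-plus-activation computation; the particular weights, bias, and tanh activation below are illustrative assumptions, not values from the slides:

```python
import numpy as np

def neuron(x, w, w0, f=np.tanh):
    """Single artificial neuron: y = f(w0 + sum_i w_i * x_i)."""
    return f(w0 + np.dot(w, x))

# Example with three inputs, as in the figure above (all values illustrative)
y = neuron(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, -0.3]), w0=0.2)
```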

Activation functions
• Linear: y = x
• Logistic (sigmoid): y = 1 / (1 + exp(-x))
• Hyperbolic tangent: y = (exp(x) - exp(-x)) / (exp(x) + exp(-x))

[Figure: plots of the linear, logistic, and hyperbolic tangent activation functions]
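For reference, the three activation functions above written directly in Python/NumPy:

```python
import numpy as np

def linear(x):
    return x

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def hyperbolic_tangent(x):
    # identical to np.tanh(x)
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))
```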

Neural Networks

• A mathematical model to solve engineering problems
  - A group of highly connected neurons realizing compositions of non-linear functions
• Tasks
  - Classification
  - Discrimination
  - Estimation
• 2 types of networks
  - Feed-forward neural networks
  - Recurrent neural networks
Feed Forward Neural Networks
• Input units / output units / hidden units
• The information is propagated from the inputs to the outputs (see the forward-pass sketch below)
• Time plays no role (NO cycle between outputs and inputs)
• No links between units in the same layer, no links backward to a previous layer, no links that skip a layer
• Perceptron: no hidden units
• Multilayer networks: one or more hidden layers

[Figure: layered network with inputs x1, x2, …, xn feeding a 1st hidden layer, a 2nd hidden layer, and an output layer]
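A minimal sketch of a forward pass through such a network in Python/NumPy; the layer sizes, random weights, and logistic activation are illustrative assumptions:

```python
import numpy as np

def forward(x, layers):
    """Propagate an input vector through a feed-forward network.

    layers: list of (W, b) pairs, one per layer; W has shape (n_out, n_in),
            b has shape (n_out,).
    """
    o = x
    for W, b in layers:
        o = 1.0 / (1.0 + np.exp(-(W @ o + b)))  # logistic activation at each layer
    return o

# Example: 3 inputs -> 4 hidden units -> 2 outputs, with random weights
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 3)), np.zeros(4)),
          (rng.normal(size=(2, 4)), np.zeros(2))]
y = forward(np.array([0.2, -0.5, 1.0]), layers)
```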

Recurrent Neural Networks

• Can have arbitrary topologies
• Delays are associated with specific weights (a single recurrent time step is sketched below)
• Training is more difficult
• Performance may be problematic
  - Stable outputs may be more difficult to evaluate
  - Unexpected behavior (oscillation, chaos, …)

[Figure: recurrent network over inputs x1, x2 with delayed (0/1) feedback connections]
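A minimal sketch of one time step of a simple (Elman-style) recurrent unit in Python/NumPy, just to illustrate how the delayed state feeds back; the weight shapes, tanh activation, and toy sequence are illustrative assumptions:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_in, W_rec, b):
    """One recurrent time step: the previous hidden state h_prev is fed
    back (with a one-step delay) alongside the new input x_t."""
    return np.tanh(W_in @ x_t + W_rec @ h_prev + b)

# Example: 2 inputs, 3 hidden units, run over a short input sequence
rng = np.random.default_rng(0)
W_in, W_rec, b = rng.normal(size=(3, 2)), rng.normal(size=(3, 3)), np.zeros(3)
h = np.zeros(3)
for x_t in [np.array([0.0, 1.0]), np.array([1.0, 0.0])]:
    h = rnn_step(x_t, h, W_in, W_rec, b)
```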

Learning

• The procedure of estimating the parameters of the neurons so that the whole network can perform a specific task

• 2 types of learning
  - Supervised learning
  - Unsupervised learning

Supervised learning

• The (supervised) learning process
  - Present the network with a number of inputs and their corresponding outputs
  - See how closely the actual outputs match the desired ones
  - Modify the parameters to better approximate the desired outputs

• The desired response of the neural network for particular inputs is well known.
• A "professor" may provide examples and teach the neural network how to fulfill a certain task.
Unsupervised learning

• Idea: group typical input data according to resemblance criteria that are unknown a priori
• Data clustering
• No need for a professor
  - The network finds the correlations between the data by itself
  - Example of such networks:
    • Kohonen feature maps

Properties of Neural Networks

• Supervised (non-recurrent) networks are universal approximators
• Theorem: any bounded function can be approximated to arbitrary precision by a neural network with a finite number of hidden neurons
• Types of approximators
  - Linear approximators (e.g. polynomials): for a given precision, the number of parameters grows exponentially with the number of variables
  - Non-linear approximators: the number of parameters grows linearly with the number of variables
Other properties

• Adaptivity
  - Weights adapt to the environment, and the network can easily be retrained
• Generalization ability
  - Helps cope with a lack of data
• Fault tolerance
  - Graceful degradation of performance if damaged => the information is distributed across the entire net

Classification (Discrimination)

• Classify objects into defined categories
• Rough (hard) decision, OR
• Estimation of the probability that a certain object belongs to a specific class
  - Example: data mining
• Applications: economy, speech and pattern recognition, sociology, etc.
Example

[Figure: examples of handwritten postal codes drawn from a database available from the US Postal Service]

What do we need to use NNs?

• Determination of pertinent inputs
• Collection of data for the learning and testing phases of the neural network
• Finding the optimum number of hidden nodes
• Estimation of the parameters (learning)
• Evaluation of the performance of the network
• If the performance is not satisfactory, review all the preceding points
Classical neural architectures

• Perceptron
• Multi-Layer Perceptron (MLP)
• Radial Basis Function networks (RBF)
• Self-Organizing Maps (SOM)
• Many other architectures

Perceptron

• Rosenblatt (1962)
• Linear separation
• Inputs: vector of real values
• Outputs: 1 or -1

  y = sign(v),   v = c_0 + c_1 x_1 + c_2 x_2

  The decision boundary is the line c_0 + c_1 x_1 + c_2 x_2 = 0.

[Figure: a summing unit computing v from inputs 1, x_1, x_2 with weights c_0, c_1, c_2, and a scatter plot of two classes (y = +1, y = -1) separated by this line]
Learning (The perceptron rule)

• Minimization of the cost function:  J(c) = Σ_{k∈M} -y_k^p v_k
• J(c) is always >= 0 (M is the set of misclassified examples)
• y_k^p is the target value
• Partial cost
  - If x_k is misclassified:  J_k(c) = -y_k^p v_k
  - If x_k is well classified:  J_k(c) = 0
• Partial cost gradient:  ∂J_k(c)/∂c = -y_k^p x_k

• The perceptron algorithm

  if y_k^p v_k > 0 (x_k is well classified):   c(k) = c(k-1)
  if y_k^p v_k < 0 (x_k is misclassified):     c(k) = c(k-1) + y_k^p x_k

• The perceptron algorithm converges if the examples are linearly separable (a runnable sketch follows below)
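A minimal Python/NumPy sketch of this update rule on a toy 2-D dataset; the data, number of epochs, and the prepended constant input for c_0 are illustrative assumptions:

```python
import numpy as np

def train_perceptron(X, y, epochs=20):
    """Perceptron rule: update c only on misclassified examples."""
    X1 = np.hstack([np.ones((len(X), 1)), X])   # prepend constant input 1 for c0
    c = np.zeros(X1.shape[1])
    for _ in range(epochs):
        for xk, yk in zip(X1, y):
            v = np.dot(c, xk)
            if yk * v <= 0:                     # misclassified (or on the boundary)
                c = c + yk * xk                 # c(k) = c(k-1) + y_k^p x_k
    return c

# Toy linearly separable data, labels in {+1, -1}
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
c = train_perceptron(X, y)
```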

Multi-Layer Perceptron

• One or more hidden layers
• Sigmoid activation functions

[Figure: input data feeding a 1st hidden layer, a 2nd hidden layer, and an output layer]

Learning
• Back-propagation algorithm

  net_j = w_{j0} + Σ_i w_{ji} o_i,        o_j = f_j(net_j)

• Credit assignment

  δ_j = -∂E/∂net_j

  Δw_{ji} = -α ∂E/∂w_{ji} = -α (∂E/∂net_j)(∂net_j/∂w_{ji}) = α δ_j o_i

  δ_j = -(∂E/∂o_j)(∂o_j/∂net_j) = -(∂E/∂o_j) f'(net_j)

  E = ½ (t_j - o_j)²   =>   ∂E/∂o_j = -(t_j - o_j)

  If the j-th node is an output unit:   δ_j = (t_j - o_j) f'(net_j)
  If the j-th node is a hidden unit:

  ∂E/∂o_j = Σ_k (∂E/∂net_k)(∂net_k/∂o_j) = -Σ_k δ_k w_{kj}

  δ_j = f'_j(net_j) Σ_k δ_k w_{kj}

• Momentum term to smooth the weight changes over time (a sketch combining these updates follows below)

  Δw_{ji}(t) = α δ_j(t) o_i(t) + γ Δw_{ji}(t-1)
  w_{ji}(t) = w_{ji}(t-1) + Δw_{ji}(t)
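A minimal Python/NumPy sketch of these update rules for a one-hidden-layer network with logistic units, including the momentum term; the network size, learning rate α, momentum γ, and the single training example are illustrative assumptions (biases are omitted for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))   # 2 inputs -> 3 hidden -> 1 output
dW1, dW2 = np.zeros_like(W1), np.zeros_like(W2)              # previous weight changes (momentum)
alpha, gamma = 0.5, 0.9

x, t = np.array([0.0, 1.0]), np.array([1.0])                 # one training example
for _ in range(100):
    # forward pass
    o1 = sigmoid(W1 @ x)
    o2 = sigmoid(W2 @ o1)
    # backward pass: f'(net) = o (1 - o) for the logistic function
    delta2 = (t - o2) * o2 * (1 - o2)                        # output unit: (t_j - o_j) f'(net_j)
    delta1 = o1 * (1 - o1) * (W2.T @ delta2)                 # hidden unit: f'(net_j) sum_k delta_k w_kj
    # weight updates with momentum
    dW2 = alpha * np.outer(delta2, o1) + gamma * dW2
    dW1 = alpha * np.outer(delta1, x) + gamma * dW1
    W2, W1 = W2 + dW2, W1 + dW1
```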


Different non-linearly separable problems

  Structure     | Types of Decision Regions
  --------------+-------------------------------------------------------
  Single-Layer  | Half plane bounded by a hyperplane
  Two-Layer     | Convex open or closed regions
  Three-Layer   | Arbitrary (complexity limited by the number of nodes)

  [Figure: for each structure, the decision regions it can form on the
  Exclusive-OR problem, on classes with meshed regions, and the most
  general region shapes it can realize (classes A and B)]

(Neural Networks – An Introduction, Dr. Andrew Hunter)


Radial Basis Functions (RBFs)
• Features
  - One hidden layer
  - The activation of a hidden unit is determined by the distance between the input vector and a prototype vector

[Figure: inputs feeding a layer of radial units, which feed the outputs]

Decision surfaces of MLP vs RBF

[Figure: comparison of the decision surfaces produced by an MLP and by an RBF network]
• RBF hidden layer units have a receptive field with a centre
• Generally, the hidden unit function is Gaussian
• The output layer is linear
• Realized function

  f(x) = Σ_{j=1}^{K} W_j Φ(||x - c_j||)

  Φ(||x - c_j||) = exp( -( ||x - c_j|| / σ_j )² )

  where c_j is the centre of a region and σ_j is the width of the receptive field
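A minimal Python/NumPy sketch of this realized function; the number of radial units, their centres and widths, and the output weights are illustrative assumptions:

```python
import numpy as np

def rbf_forward(x, centers, widths, W):
    """f(x) = sum_j W_j * exp(-(||x - c_j|| / sigma_j)^2)."""
    dists = np.linalg.norm(centers - x, axis=1)      # ||x - c_j|| for each radial unit
    phi = np.exp(-(dists / widths) ** 2)             # Gaussian activations
    return W @ phi                                   # linear output layer

# Example: 3 radial units in a 2-D input space
centers = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])
widths = np.array([0.5, 1.0, 0.8])
W = np.array([0.2, -0.7, 1.1])
y = rbf_forward(np.array([0.3, 0.4]), centers, widths, W)
```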


Learning

• The training is performed by deciding on
  - How many hidden nodes there should be
  - The centres and the sharpness (widths) of the Gaussians

• 2 steps (sketched below)
  - In the 1st stage, the input data set is used to determine the parameters of the radial basis functions (centre and width)
  - In the 2nd stage, those functions are kept fixed while the second-layer weights are estimated (a simple BP algorithm, as for MLPs)
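A minimal sketch of this two-stage procedure in Python/NumPy. The centre-selection heuristic (random training points), the shared width, the toy data, and the use of a least-squares fit in place of the iterative BP step for the output weights are all illustrative assumptions, not prescribed by the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                       # toy inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)    # toy targets

# Stage 1: choose centres and widths from the input data
K = 10
centers = X[rng.choice(len(X), size=K, replace=False)]
sigma = np.mean(np.linalg.norm(centers[:, None] - centers[None, :], axis=2)) + 1e-8
widths = np.full(K, sigma)

# Stage 2: keep the radial functions fixed, fit the linear output weights
D = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)   # distances to centres
Phi = np.exp(-(D / widths) ** 2)                                   # hidden activations
W, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```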

Learning formula
• Linear weights (output layer)

  ∂E(n)/∂w_i(n) = Σ_{j=1}^{N} e_j(n) G(||x_j - t_i(n)||_{C_i})

  w_i(n+1) = w_i(n) - η_1 ∂E(n)/∂w_i(n),   i = 1, 2, ..., M

• Positions of centers (hidden layer)

  ∂E(n)/∂t_i(n) = 2 w_i(n) Σ_{j=1}^{N} e_j(n) G'(||x_j - t_i(n)||_{C_i}) Σ_i^{-1} [x_j - t_i(n)]

  t_i(n+1) = t_i(n) - η_2 ∂E(n)/∂t_i(n),   i = 1, 2, ..., M

• Spreads of centers (hidden layer)

  ∂E(n)/∂Σ_i^{-1}(n) = -w_i(n) Σ_{j=1}^{N} e_j(n) G'(||x_j - t_i(n)||_{C_i}) Q_{ji}(n),   where Q_{ji}(n) = [x_j - t_i(n)][x_j - t_i(n)]^T

  Σ_i^{-1}(n+1) = Σ_i^{-1}(n) - η_3 ∂E(n)/∂Σ_i^{-1}(n)

MLPs versus RBFs


• Classification
  - MLPs separate classes via hyperplanes
  - RBFs separate classes via hyperspheres
  - MLPs compute faster
• Learning
  - MLPs use distributed learning
  - RBFs use localized learning
  - RBFs train faster
• Structure
  - MLPs have one or more hidden layers
  - RBFs have only one hidden layer
  - RBFs require more hidden neurons

[Figure: in the (X1, X2) plane, an MLP draws hyperplane boundaries while an RBF draws hypersphere boundaries]
Self-organizing maps

• The purpose of a SOM is to map a multidimensional input space onto a lower-dimensional, topology-preserving map of neurons
  - The topological structure is preserved so that neighboring neurons respond to similar input patterns
  - The topological structure is often a 2- or 3-dimensional space
• Each neuron is assigned a weight vector with the same dimensionality as the input space
• Input patterns are compared to each weight vector and the closest one wins (Euclidean distance)

• The activation of the winning neuron is spread in its direct neighborhood => neighbors become sensitive to the same input patterns
• The size of the neighborhood is initially large but is reduced over time => specialization of the network

• Algorithm (a runnable sketch follows below)
  1. Initialize the weights w
  2. Find the nearest cell:  i(x) = argmin_j || x(n) - w_j(n) ||
  3. Update the weights of i(x) and its neighbors:  w_j(n+1) = w_j(n) + η(n) [x(n) - w_j(n)]
  4. Reduce the neighborhood and η
  5. Go to 2

[Figure: a grid of neurons with the winning cell, its 1st neighborhood, and its 2nd neighborhood highlighted]
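A minimal Python/NumPy sketch of this algorithm for a 1-D chain of neurons; the map size, learning-rate and neighborhood schedules, and the toy data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))          # toy 2-D input patterns
n_cells = 10
w = rng.uniform(0, 1, size=(n_cells, 2))      # one weight vector per map neuron

eta, radius = 0.5, 3.0
for n in range(1000):
    x = X[rng.integers(len(X))]
    i_win = np.argmin(np.linalg.norm(x - w, axis=1))      # 2. nearest cell (Euclidean distance)
    for j in range(n_cells):
        if abs(j - i_win) <= radius:                      # 3. update winner and its neighbors
            w[j] += eta * (x - w[j])
    eta *= 0.995                                          # 4. reduce eta and the neighborhood
    radius = max(1.0, radius * 0.995)
```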
Adaptation

• During training, the "winner" neuron and its neighborhood adapt to make their weight vectors more similar to the input pattern that caused the activation
• The neurons are moved closer to the input pattern
• The magnitude of the adaptation is controlled via a learning parameter that decays over time

Neural Networks (Applications)

• Face recognition
• Time series prediction
• Process identification
• Process control
• Optical character recognition
• Adaptive filtering
• Etc.
Conclusion on Neural Networks
• Neural networks are used as statistical tools
  - They adjust non-linear functions to fulfill a task
  - They need many representative examples, but fewer than other methods
• Neural networks make it possible to model complex static phenomena (FF) as well as dynamic ones (RNN)
• NNs are good classifiers, BUT
  - Good representations of the data have to be formulated
  - Training vectors must be statistically representative of the entire input space
  - Unsupervised techniques can help
• The use of NNs requires a good understanding of the problem
