
Neural Networks

Kernel-based Algorithms


Generalization Problem
Generalization is affected by:
1. Size of the training set
2. Architecture of the neural net (e.g. MLP)
3. Physical complexity of the data (uncontrollable)
For a given architecture, the size of the training set is
N = O(W / ε)
W => number of free parameters (weights)
ε => fraction of permissible classification error
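As a quick worked illustration (the specific numbers here are an assumption for the sake of example, not from the slides): a network with W = 1000 free parameters and a permissible error fraction ε = 0.1 would need on the order of N ≈ W / ε = 1000 / 0.1 = 10,000 training examples.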
Different Viewpoint
• Finding a surface
– Very similar to fitting a curve, as in the plots below
[Figure: three Price-vs-Size scatter plots, each with a fitted curve]
Remember these are discrete points; the job of the net is to find the function (the blue line) so that the net can generalize well.
Transforming from LD (Low Dimension) to HD (High Dimension)
Example: the XOR problem
• Input space: the four points (0,0), (0,1), (1,0), (1,1) in the (x1, x2) plane
[Figure: the XOR points in the x1-x2 plane; not linearly separable]
• Output space: {0, 1} along the y line
• Construct a pattern classifier such that:
(0,0) and (1,1) are mapped to 0, class C1
(1,0) and (0,1) are mapped to 1, class C2
[Figure: network with inputs x1, ..., xm, a hidden layer of units φ1, ..., φm1, and a single output y with weights w1, ..., wm1]
• One hidden layer with activation functions φ1, ..., φm1
• Output layer with linear activation function.
y = w1 φ1(||x - t1||) + ... + wm1 φm1(||x - tm1||)
||x - t|| = distance of x = (x1, ..., xm) from the vector t
Consider 2 non-linear (Gaussian) activation functions:
φ1(||x - t1||) = exp(-||x - t1||^2)
φ2(||x - t2||) = exp(-||x - t2||^2)
with t1 = (1,1) and t2 = (0,0)
Example: the XOR problem
• In the feature (hidden layer) space:
φ1(||x - t1||) = exp(-||x - t1||^2)
φ2(||x - t2||) = exp(-||x - t2||^2)
with t1 = (1,1) and t2 = (0,0)
[Figure: the four XOR points plotted in the (φ1, φ2) plane; (1,1) and (0,0) lie on one side of a linear decision boundary, (0,1) and (1,0) on the other]
• When mapped into the feature space <φ1, φ2> (hidden layer), C1 and C2 become linearly separable. So a linear classifier with φ1(x) and φ2(x) as inputs can be used to solve the XOR problem.
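For concreteness, the mapped feature values follow directly from the Gaussian definitions above:

(1,1):           φ1 = exp(0) = 1,       φ2 = exp(-2) ≈ 0.135
(0,0):           φ1 = exp(-2) ≈ 0.135,  φ2 = exp(0) = 1
(0,1) and (1,0): φ1 = φ2 = exp(-1) ≈ 0.368

So (0,0) and (1,1) sit near the axes while (0,1) and (1,0) map to the same interior point, and a straight line in the (φ1, φ2) plane separates the two classes.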
NN for the XOR problem
φ1(||x - t1||) = exp(-||x - t1||^2)    with t1 = (1,1) and t2 = (0,0)
φ2(||x - t2||) = exp(-||x - t2||^2)
φ3(||x - t3||) = exp(-||x - t3||^2)    with t3 = (1,0) and t4 = (0,1)
φ4(||x - t4||) = exp(-||x - t4||^2)
[Figure: network with inputs x1, x2, hidden units centered at t1 and t2 with output weights -1 and -1, and a bias of +1 feeding the output y]
y = -exp(-||x - t1||^2) - exp(-||x - t2||^2) + 1
If y > 0 then class 1, otherwise class 0.
But all these 4 φ's may not be required to get linear separability.
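A minimal Python/NumPy sketch of this two-center classifier (the weights -1, -1 and the bias +1 are read off the network above; the function names are my own):

import numpy as np

def gaussian_rbf(x, t):
    # phi(||x - t||) = exp(-||x - t||^2)
    return np.exp(-np.sum((x - t) ** 2))

t1, t2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])

def classify_xor(x):
    # y = -phi1 - phi2 + 1; class 1 if y > 0, else class 0
    y = -gaussian_rbf(x, t1) - gaussian_rbf(x, t2) + 1.0
    return 1 if y > 0 else 0

for p in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(p, "->", classify_xor(np.array(p, dtype=float)))
# Prints: (0,0)->0, (0,1)->1, (1,0)->1, (1,1)->0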

|| x t1 ||2


1 (|| x  t1 ||)  e
|| x t2 ||2
with t1  (1,1) and t2  (0,0)
 2 (|| x  t2 ||)  e

• This phi maps data from input space to


hidden/feature space i.e. acts as a BASIS with
radius t1 and t2 ;
• Hence the name Radial Basis Function / RBF
network
Radial Basis Functions
• Transforming from the input space (low dimension m0) to the feature space (higher dimension m1)
• m0 < m1
• The transformation is carried out with the help of a non-linear function
Cover’s Theorem
“A complex pattern-classification problem, cast in a high-dimensional space nonlinearly, is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated.” (Cover, 1965)
• Consider a family of surfaces which divides the input into 2 classes
• N patterns (x1, x2, ..., xN) (e.g. (0,0), (0,1), (1,0), (1,1))
φ(x) = [φ1(x), φ2(x), ..., φm1(x)]^T

• A dichotomy (C1, C2) of the set of patterns C = {x1, ..., xN} is said to be φ-separable if there exists an m1-dimensional vector w such that
w^T φ(x) > 0  for x ∈ C1
w^T φ(x) < 0  for x ∈ C2
The hyperplane w^T φ(x) = 0 is the separating surface.
Corollary
“The expected maximum number of randomly assigned patterns (vectors) that are linearly separable in a space of dimensionality m1 is equal to 2m1.”
• The separating capacity is 2m1, for m1 degrees of freedom.
• Loosely: this is related to VC-dimension

Interpolation (3rd ed. Haykin): find the weights w1, ..., wN such that

[ φ11  φ12  ...  φ1N ] [ w1 ]   [ d1 ]
[ φ21  φ22  ...  φ2N ] [ w2 ] = [ d2 ]
[  .    .          . ] [ .  ]   [ .  ]
[ φN1  φN2  ...  φNN ] [ wN ]   [ dN ]

Φ = {φij}, i, j = 1, ..., N,  where φij = φ(||xj - xi||)

Φ w = d  =>  w = Φ^(-1) d
Φ w = d  =>  w = Φ^(-1) d, but does the inverse Φ^(-1) exist? (3rd ed. Haykin)
Types of φ
Here r = ||x - t||.
• Multiquadrics:
φ(r) = (r^2 + c^2)^(1/2),  c > 0
• Inverse multiquadrics:
φ(r) = 1 / (r^2 + c^2)^(1/2),  c > 0
• Gaussian functions (most used):
φ(r) = exp(-r^2 / (2σ^2)),  σ > 0
Radial-Basis Function Networks
• A function is a radial basis function (RBF) if its output depends on (is a non-increasing function of) the distance of the input from a given stored vector.
• RBFs represent local receptors, as illustrated below, where each green point is a stored vector used in one RBF.
• In an RBF network one hidden layer uses neurons with RBF activation functions describing local receptors. Then one output node is used to combine linearly the outputs of the hidden neurons.
[Figure: a blue query point surrounded by three green stored vectors with weights w1, w2, w3]
The output at the blue vector is “interpolated” using the three green vectors, where each vector gives a contribution that depends on its weight and on its distance from the blue point. In the picture we have w1 ≥ w3 ≥ w2.
RBF ARCHITECTURE
[Figure: network with inputs x1, ..., xm, hidden RBF units φ1, ..., φm1, and a single output y with weights w1, ..., wm1]
• One hidden layer with RBF activation functions φ1, ..., φm1
• Output layer with linear activation function.
y = w1 φ1(||x - t1||) + ... + wm1 φm1(||x - tm1||)
||x - t|| = distance of x = (x1, ..., xm) from the vector t
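A compact sketch of this forward pass (the centers, weights, and spread below are placeholders; a real network would learn them as described later):

import numpy as np

def rbf_forward(x, centers, weights, sigma=1.0):
    # Gaussian activations phi_j = exp(-||x - t_j||^2 / (2 * sigma^2))
    dists = np.linalg.norm(centers - x, axis=1)
    phi = np.exp(-dists ** 2 / (2.0 * sigma ** 2))
    # Linear output: y = w1*phi1 + ... + wm1*phim1
    return weights @ phi

# Placeholder parameters: 3 centers in a 2-D input space
centers = np.array([[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]])
weights = np.array([0.5, -0.3, 1.2])
print(rbf_forward(np.array([0.2, 0.8]), centers, weights))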
HIDDEN NEURON MODEL
• Hidden units use radial basis functions: the output φ(||x - t||) depends on the distance of the input x from the center t.
[Figure: a hidden unit receiving x1, ..., xm and computing φ(||x - t||)]
• t is called the center
• σ is called the spread
• Center and spread are the parameters of the hidden unit
Hidden Neurons
• A hidden neuron is more sensitive to data points near its center.
• For a Gaussian RBF this sensitivity may be tuned by adjusting the spread σ, where a larger spread implies less sensitivity.
• Example: cochlear cells have locally tuned frequency responses.
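A quick numerical illustration of the spread's effect (the values of σ are chosen arbitrarily):

import numpy as np

r = 1.0  # fixed distance ||x - t|| from the center
for sigma in (0.25, 0.5, 1.0, 2.0):
    response = np.exp(-r ** 2 / (2.0 * sigma ** 2))
    print(f"sigma = {sigma}: phi = {response:.4f}")
# Larger sigma -> response closer to 1 -> the neuron reacts almost equally
# to near and far points, i.e. it is less sensitive to distance from its center.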
RBF network parameters
• What do we have to learn for an RBF NN with a given architecture?
– The centers of the RBF activation functions
– The spreads of the Gaussian RBF activation functions
– The weights from the hidden to the output layer
• Different learning algorithms may be used for learning the RBF network parameters. We describe three possible methods for learning centers, spreads and weights.
Learning Algorithm: Centers
• K-means clustering algorithm for finding the centers (a code sketch follows these steps)
1 Initialization: choose tk(0) at random, k = 1, ..., m1
2 Sampling: draw a sample x(n) from the input space
3 Similarity matching: find the index of the center closest to x(n)
   k(x) = arg min_k ||x(n) - tk(n)||
4 Updating: adjust the centers
   tk(n+1) = tk(n) + η [x(n) - tk(n)]   if k = k(x)
   tk(n+1) = tk(n)                      otherwise
   (η is a small learning-rate parameter)
5 Continuation: increment n by 1, go to step 2 and continue until no noticeable changes of the centers occur
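A minimal NumPy sketch of this online center-update loop, assuming the data is a 2-D array X and a fixed small learning rate eta (both illustrative choices, not from the slides):

import numpy as np

def learn_centers(X, m1, eta=0.1, epochs=20, seed=0):
    # Online K-means-style learning of m1 RBF centers from the data X (N x d)
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick m1 random training points as the initial centers
    centers = X[rng.choice(len(X), size=m1, replace=False)].copy()
    for _ in range(epochs):
        # 2. Sampling: present the inputs one at a time, in random order
        for x in X[rng.permutation(len(X))]:
            # 3. Similarity matching: index of the closest center
            k = np.argmin(np.linalg.norm(centers - x, axis=1))
            # 4. Updating: move only the winning center toward x
            centers[k] += eta * (x - centers[k])
    return centers

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
print(learn_centers(X, m1=2))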
Learning Algorithm 2: summary
• Hybrid Learning Process (sketched in code below):
• Clustering for finding the centers.
• Spreads chosen by normalization.
• LMS algorithm for finding the weights, or RLS (the Recursive Least Squares method).
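A rough end-to-end sketch of such a hybrid scheme, assuming the common normalization σ = d_max / sqrt(2·m1) for the spread (the slides only say "chosen by normalization") and using a batch least-squares solve in place of iterative LMS/RLS for the output weights:

import numpy as np

def train_rbf(X, d, m1, eta=0.1, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Centers by online clustering (as in the previous sketch); m1 >= 2 assumed
    centers = X[rng.choice(len(X), size=m1, replace=False)].copy()
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            k = np.argmin(np.linalg.norm(centers - x, axis=1))
            centers[k] += eta * (x - centers[k])
    # 2. Spread by normalization (assumed formula: sigma = d_max / sqrt(2 * m1))
    d_max = np.max(np.linalg.norm(centers[:, None] - centers[None, :], axis=-1))
    sigma = d_max / np.sqrt(2.0 * m1)
    # 3. Output weights by batch least squares (stand-in for iterative LMS / RLS)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
    Phi = np.exp(-dists ** 2 / (2.0 * sigma ** 2))
    w, *_ = np.linalg.lstsq(Phi, d, rcond=None)
    return centers, sigma, w

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
d = np.concatenate([np.zeros(50), np.ones(50)])
print(train_rbf(X, d, m1=2)[2])  # learned output weights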
Comparison with multilayer NN

• Architecture:
• RBF networks have a single hidden layer.
• FFNNs may have more than one hidden layer.

• Neuron Model:
• In an RBF network the neuron model of the hidden neurons is different from that of the output nodes.
• Typically in an FFNN, hidden and output neurons share a common neuron model.
• The hidden layer of an RBF network is non-linear; the output layer is linear.
• Hidden and output layers of an FFNN are usually non-linear.
Comparison with multilayer NN

• Activation functions:
• The argument of the activation function of each hidden neuron in an RBF NN is the Euclidean distance between the input vector and the center of that unit.
• The argument of the activation function of each hidden neuron in an FFNN is the inner product of the input vector and the synaptic weight vector of that neuron.
• Approximation:
• RBF NNs using Gaussian functions construct local approximations to the non-linear I/O mapping.
• FFNNs construct global approximations to the non-linear I/O mapping.
Pseudo Inverse
• Left inverse (A has full column rank):
(A^T A)^(-1) A^T · A = I (n×n)
• Right inverse (A has full row rank):
A · A^T (A A^T)^(-1) = I (m×m)
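A small NumPy check of these two identities (the matrix sizes are arbitrary examples); np.linalg.pinv computes the Moore-Penrose pseudo-inverse, which for a full-column-rank matrix coincides with the left inverse:

import numpy as np

A = np.random.randn(5, 3)   # tall matrix: full column rank with probability 1

# Left inverse: (A^T A)^(-1) A^T, so that A_left @ A = I (3 x 3)
A_left = np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(A_left @ A, np.eye(3)))      # True

B = np.random.randn(3, 5)   # wide matrix: full row rank with probability 1

# Right inverse: B^T (B B^T)^(-1), so that B @ B_right = I (3 x 3)
B_right = B.T @ np.linalg.inv(B @ B.T)
print(np.allclose(B @ B_right, np.eye(3)))     # True

# The pseudo-inverse computed directly
print(np.allclose(np.linalg.pinv(A), A_left))  # True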
