
Proceedings of the 6th World Congress on Intelligent Control and Automation, June 21 - 23, 2006, Dalian, China

The Design of RBF Neural Networks for Solving the Overfitting Problem*

Zhigang Yu, Shenmin Song, Guangren Duan, Run Pei
School of Aerospace, Harbin Institute of Technology, Harbin, Heilongjiang Province, China
yzglgy@163.com

Wenjun Chu
Heilongjiang Mobile Communications Corporation, Harbin, Heilongjiang Province, China
laochuhit@163.com

Abstract - One of the biggest problems in designing or training RBF neural networks is the overfitting problem. The traditional design of RBF neural networks may be pursued in a variety of ways. In this paper, we present a method for the design of RBF networks that solves the overfitting problem. In practical applications, frequency information is usually available for designing RBF networks by frequency domain analysis, which has a sound mathematical basis. We include this frequency information in the design of RBF networks, which then achieve the task of approximating a function in a certain frequency range and have the property of structural risk minimization. After the structure of the designed network is determined, the linear weights of the output layer are the only set of adjustable parameters. The design approach is verified by approximation cases.

Index Terms - radial basis function, overfitting problem, structural risk minimization.

I. INTRODUCTION

For approximation problems solved by RBF neural networks, many methods have been presented in the literature [1]-[3]. There are different learning strategies that we can follow in the design of an RBF network, in other words, different ways in which the centers of the radial basis functions of the network are specified. But most of the existing methods may not work well in practical cases with limited samples, and easily lead to the problem of overfitting.

With radial basis functions providing the foundation for the design of the hidden units, the theory of RBF neural networks is linked closely with the choice of radial basis functions. The technique of structural risk minimization developed in this paper is an attempt to overcome the problem of choosing appropriate parameters of the RBF network, based on frequency information. We accept the viewpoint that it is mainly the structure of an RBF neural network that determines its generalization performance [4]-[5].

Consider a given set of functions $f(x, \theta) \in L_2$, where $\theta = \{C, \sigma, W\}$ denotes the parameter vector of all hidden nodes: $C$ is the location of the centers, $\sigma$ is the width of the centers, and $W$ is the weight of the network. The ultimate aim in designing an RBF network is to find the function $f(x, \theta^*)$, $\theta^* = \{C^*, \sigma^*, W^*\}$, which minimises the risk $R[f] = \int (y - f(x))^2 P(x, y)\,dx\,dy$, where $P(x, y)$ is a probability distribution depending on the input vector $x$ and the output $y$. It is unknown, but we are given the data $(x_1, y_1), \ldots, (x_N, y_N)$.

The structural risk minimization principle is to find the optimal structural parameters $\{C^*, \sigma^*\}$, yielding an RBF network with optimal generalization. The empirical risk minimisation principle is to find an optimal weight $\{W^*\}$ of the RBF network that minimises the empirical risk: $f(x, W^*) = \arg\min_{f \in L_2} R_{emp}[f]$, where $R_{emp}[f] = (1/l) \sum_{i=1}^{l} (y_i - f(x_i))^2$. According to the structural risk minimization principle, the goal in designing an RBF network is to find a set of parameters $\theta = \{C, \sigma, W\}$ such that $\|f(x, \theta) - f(x, \theta^*)\| \le \varepsilon$ and $\|\theta - \theta^*\| \le \gamma$.

Empirical risk minimisation makes sense only if $\lim_{l \to \infty} R_{emp}[f] = R[f]$, which is true by the law of large numbers. To cope with the limited samples available in practice, once the structural parameters $\{C^*, \sigma^*\}$ have been determined, it is possible to find an approximation according to the empirical risk minimisation principle, i.e., to minimise the empirical risk $R_{emp}[f] = (1/l) \sum_{i=1}^{l} (y_i - f(x_i))^2$ and take $f(x, \theta^*) = \arg\min_{f \in L_2} R_{emp}[f]$.
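To make these two principles concrete, the following minimal sketch fixes the structural parameters $\{C, \sigma\}$ and computes the optimal linear weights $W^*$ by empirical risk minimisation over a noisy sample. The centers, widths and target function here are illustrative assumptions, not the frequency-based construction of Section II.

    import numpy as np

    # A minimal sketch: with the structural parameters {C, sigma} fixed,
    # W* = argmin_W (1/l) * sum_i (y_i - f(x_i))^2 is a linear least
    # squares problem. Centers, widths and target are illustrative only.

    def hidden(x, centers, sigma):
        # Gaussian hidden-layer outputs, shape (len(x), len(centers))
        return np.exp(-(x[:, None] - centers[None, :]) ** 2 / sigma ** 2)

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 2.0, 50)
    y = np.sin(0.8 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

    centers = np.linspace(0.0, 2.0, 6)    # fixed structure {C*, sigma*}
    sigma = 0.5

    Phi = hidden(x, centers, sigma)
    W, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # empirical risk minimiser

    R_emp = np.mean((y - Phi @ W) ** 2)
    print(f"empirical risk R_emp = {R_emp:.4f}")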
A continuous function can be decomposed into a superposition of multiple sine and cosine functions. The structural risk minimization RBF (SRM-RBF) neural network based on Fourier series yields a Fourier series approximation to a given function. Thus such a network possesses all the approximation properties of Fourier series representations: it can approximate any square integrable function. More importantly, a neural network based on the topological structure of the multiple Fourier series possesses the property of structural risk minimization. In this paper, according to this viewpoint, the optimal structural parameters $\{C^*, \sigma^*\}$ of the SRM-RBF neural network are determined by the approach presented in part D of Section II; here, "optimal" means optimal in the sense of structural risk minimization.

* This work is supported by foundation HIT2002.12.

1-4244-0332-4/06/$20.00 ©2006 IEEE


We present a method for the design of RBF neural networks that solves the overfitting problem. For a practical application, frequency information is usually available for the design of RBF networks by frequency domain analysis, which has a sound mathematical basis. We include this frequency information in the design of RBF neural networks, which then achieve the task of approximating a function in a certain frequency range. This is the first motivation for this study of design by Fourier series.

[Fig. 1 The neural network with a bad generalization property: the output of a neural network model with overfitting, compared against the physical model, as a function of the network input.]

II. THE DESIGN OF STRUCTURAL RISK MINIMIZATION OF RBF NETWORKS FOR AVOIDING OVERFITTING

A. The overfitting problem of RBF networks

To develop a deeper description of the overfitting problem, we first state the interpolation problem in the strict sense: given a set of $N$ different points $\{x_i \in R^m \mid i = 1, \ldots, N\}$ and a corresponding set of $N$ real numbers $\{d_i \in R \mid i = 1, \ldots, N\}$, find a function $f: R^m \to R$ that satisfies the interpolation condition $f(x_i) = d_i$, $i = 1, \ldots, N$. The interpolation function is constrained to pass through all the training data points. The RBF technique consists of choosing a function $f(x)$ of the following form:

$f(x, \theta) = \sum_{i=1}^{h} w_i \phi_i(x)$,

where we consider $m$-$h$-$1$ RBF networks: $m$ is the number of input neurons, $h$ is the number of hidden neurons, and $w_i$ is the weight between hidden node $i$ and the output node. $\theta$ denotes the parameter vector of all hidden nodes, including the centers, widths and weights. $f(x, \theta) \in R$ denotes the output of the network with parameters $\theta$ and input $x$. An RBF network is uniquely determined by the parameters of its hidden nodes. In general, the Gaussian function is used as the radial basis function, given by

$\phi_i(x) = \exp( -\|x - c_i\|^2 / r_i^2 )$,

where $x \in R^m$ is the input, $c_i \in R^m$ is the center of the $i$th RBF, and $r_i \in R^m$ is the corresponding width vector.

The strict interpolation procedure described above may not be a good strategy for the training of RBF neural networks for certain classes of tasks, because of poor generalization to new data, for the following reason: when the number of data points in the training sample is much larger than the number of degrees of freedom of the underlying physical process, the problem is overdetermined. Consequently, the network may end up fitting misleading variations due to idiosyncrasies or noise in the input data, thereby resulting in degraded generalization performance [6].

For a problem to be solved, the physical phenomenon responsible for generating the training data (e.g., speech, pictures, radar signals, sonar signals, seismic data) is a well-posed direct problem. However, learning from such physical data, viewed as a multidimensional mapping reconstruction problem, is an ill-posed inverse problem [7]. There is no way to overcome this difficulty unless some prior information about the input-output mapping is available. In this context, it is appropriate to remind ourselves of a statement made by Lanczos [8]: "A lack of information cannot be remedied by any mathematical trickery." The important issue of how to utilize the available useful information to design RBF networks is discussed in this section. The idea for avoiding overfitting is that we design an RBF network to approximate certain frequency ingredients, whereas traditional methods cannot ascertain the frequency property of RBF networks.
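The overdetermination argument can be seen in a few lines: a strict interpolant with one Gaussian unit per training point reproduces the noisy targets exactly, yet behaves badly between them. The target $\cos(x)$, the noise level and the (deliberately narrow) width are assumptions chosen to exaggerate the effect.

    import numpy as np

    # Strict interpolation: one Gaussian unit per data point, so the
    # network passes through every (noisy) target exactly -- the
    # overfitting situation described above. Numbers are illustrative;
    # the narrow width makes the collapse between samples visible.

    def gaussian(x, c, r):
        return np.exp(-(x - c) ** 2 / r ** 2)

    rng = np.random.default_rng(1)
    xi = np.linspace(-3.0, 3.0, 20)                       # N training points
    di = np.cos(xi) + 0.2 * rng.standard_normal(xi.size)  # noisy targets

    Phi = gaussian(xi[:, None], xi[None, :], 0.1)    # Phi[i, j] = phi_j(x_i)
    w = np.linalg.solve(Phi, di)                     # exact fit: f(x_i) = d_i

    x_new = np.linspace(-3.0, 3.0, 200)
    f_new = gaussian(x_new[:, None], xi[None, :], 0.1) @ w
    print("max train error:", np.max(np.abs(Phi @ w - di)))         # ~ 0
    print("mean test error:", np.mean((f_new - np.cos(x_new)) ** 2))  # large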
B. The design of RBF networks for structural risk minimization

In $L_2(R^m) = \{ f(x) \mid \int |f(x)|^2\,dx < +\infty \}$, let $f(x): S \to R$, with $X = (x_1, x_2, \ldots, x_n)^T$, denote the function to be approximated, where $S \subset R^n$ is a closed bounded region. To be specific, if the $n \times 1$ vectors $a$ and $b$ denote the lower and upper limits of $x$, respectively, then $S = \{ x \in R^n \mid a_i \le x_i \le b_i,\ 1 \le i \le n \}$.

It is assumed that the function of $n$ variables $f(x_1, x_2, \ldots, x_n)$ satisfies the Dirichlet conditions and has periods $T_1, T_2, \ldots, T_n$ in the variables $x_1, x_2, \ldots, x_n$ respectively. Let $F(\omega_1, \omega_2, \ldots, \omega_n)$ denote the Fourier transform of the function $f(x)$; for every $f(x) \in L_2(R^n)$ considered here there exist $B_i > 0$ such that $F(\omega_i) = 0$ for $|\omega_i| \ge B_i$, $i = 1, 2, \ldots, n$ (the function is band-limited). The $n$-dimensional Fourier series for a single output is introduced as

$f(X) = \sum_{p_1=0}^{N_1} \cdots \sum_{p_n=0}^{N_n} ( w_{pc} \cos(PX) + w_{ps} \sin(PX) )$,

where $N_1, \ldots, N_n$ denote the numbers of harmonics for the corresponding input variables, and

$PX = (p_1\omega_1, p_2\omega_2, \ldots, p_n\omega_n)(x_1, x_2, \ldots, x_n)^T = p_1\omega_1 x_1 + \cdots + p_n\omega_n x_n$.

Here $\omega_1, \ldots, \omega_n$ are the base frequencies for the corresponding input variables. Nonlinear functions $f(x)$ are usually not periodic. In order to represent $f(x)$ as a Fourier series, we solve this problem by periodic extension.
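A direct transcription of this truncated $n$-dimensional series may clarify the bookkeeping. The base frequencies, harmonic counts and random placeholder weights below are assumptions for illustration; in the SRM-RBF network the weights are the quantities to be learned.

    import numpy as np
    from itertools import product

    # Truncated n-dimensional Fourier series:
    #   f(X) = sum over p = (p_1..p_n) of w_pc*cos(PX) + w_ps*sin(PX),
    #   PX = p_1*omega_1*x_1 + ... + p_n*omega_n*x_n.
    # Weights are random placeholders for illustration.

    def fourier_series(X, omega, N, w_cos, w_sin):
        total = 0.0
        for p in product(*(range(Ni + 1) for Ni in N)):
            PX = float(np.dot(np.asarray(p) * omega, X))
            total += w_cos[p] * np.cos(PX) + w_sin[p] * np.sin(PX)
        return total

    omega = np.array([0.8 * np.pi, 1.2 * np.pi])  # base frequencies (n = 2)
    N = (2, 2)                                    # harmonics per variable
    rng = np.random.default_rng(0)
    w_cos = rng.standard_normal((N[0] + 1, N[1] + 1))
    w_sin = rng.standard_normal((N[0] + 1, N[1] + 1))
    print(fourier_series(np.array([0.3, 0.7]), omega, N, w_cos, w_sin))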
C. Cosine functions approximated by Gaussian functions

In order to develop the theory of RBF networks with structural risk minimization, we first study the relationship between the sine or cosine function and the Gaussian function from the viewpoint of function approximation. A sine or cosine function over one period can be approximated by an RBF network. Fig. 2 shows the fitting result for the cosine function $y = \cos(x)$ over half of its period, approximated by a 1-3-1 RBF network. The output of the RBF network is

$z = -\exp( -(x - \pi)^2 / 2.38 ) + \exp( -x^2 / 2.38 ) - \exp( -(x + \pi)^2 / 2.38 )$.

[Fig. 2 The fitting result for a cosine function over one period, approximated by three Gaussian radial basis functions.]

The sum of squares error is $\varepsilon = \sum_{p=1}^{100} (d_p - y_p)^2 = 0.027$. It is obvious that a single-period cosine function within $[-3, 3]$ can be fitted quite well by three Gaussian radial basis functions. For a cosine function with period $T$, the centers of the RBFs are located at $-3T/4$, $0$, $3T/4$ respectively, and their widths satisfy the empirical formula $\sigma_{p_i\omega_j}^2 = 0.35 (T/2)^{1.75}$, $T \in [1.04, 3.14]$. Moreover, after this initial selection of the centers and widths of the RBFs, the approximation accuracy can be further improved with a gradient descent algorithm. Using initial values of the weighting parameters in the vicinity of these values accelerates convergence.
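This three-Gaussian fit is easy to verify numerically. The sketch below evaluates the quoted $z(x)$ against $\cos(x)$; the 100-point grid on $[-3, 3]$ is an assumption about how the reported error $\varepsilon = 0.027$ was computed.

    import numpy as np

    # Verify the 1-3-1 fit of y = cos(x) by three Gaussian units:
    #   z(x) = -g(x - pi) + g(x) - g(x + pi),  g(u) = exp(-u^2 / 2.38).

    def z(x):
        g = lambda u: np.exp(-u ** 2 / 2.38)
        return -g(x - np.pi) + g(x) - g(x + np.pi)

    x = np.linspace(-3.0, 3.0, 100)          # assumed evaluation grid
    eps = np.sum((np.cos(x) - z(x)) ** 2)    # sum of squares error
    print(f"eps = {eps:.3f}")                # close to the reported 0.027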
D. The structure of the Gaussian radial basis function neural network based on the n-dimensional Fourier series

We now have insight into the ability of Gaussian functions to approximate cosine functions. According to this view, the remaining tasks are to determine the number, locations and widths of the radial basis functions. To be specific, we can design a Gaussian RBF network with $n$ input variables as follows. According to the $n$-dimensional Fourier series, we have

$f(X) = \sum_{p_1=0}^{N_1} \cdots \sum_{p_n=0}^{N_n} ( w_{pc} \cos(PX) + w_{ps} \sin(PX) )$
$\quad = \sum_{p_1=0}^{N_1} \cdots \sum_{p_n=0}^{N_n} \left( \sum_{k=0,1} w_{kp} \cos( PX + k\pi/2 ) \right)$
$\quad = \sum_{p_1=0}^{N_1} \cdots \sum_{p_n=0}^{N_n} \left( \sum_{k=0,1} w_{kp} \exp( -(x_1 - c_{k,1}^{p_1\omega_1})^2 / \sigma_{p_1\omega_1}^2 - \cdots - (x_n - c_{k,n}^{p_n\omega_n})^2 / \sigma_{p_n\omega_n}^2 ) \right)$
$\quad = \sum_{p_1=0}^{N_1} \cdots \sum_{p_n=0}^{N_n} \left( \sum_{k=0,1} w_k^p \exp( -\| x - c^p \|_{C^p}^2 / ( \sigma_{p_1\omega_1} \cdots \sigma_{p_n\omega_n} ) ) \right)$,

where the center vector is $c^P = ( c_{k_1, r_{m1}}^{p_1\omega_1}, c_{k_2, r_{m2}}^{p_2\omega_2}, \ldots, c_{k_n, r_{mn}}^{p_n\omega_n} )$, and $c_{k, r_{mi}}^{p_j\omega_k}$ is the $r_{mi}$th center corresponding to the function $\cos( p_j\omega_k x_i + k\pi/2 )$. Here $r_{mi}$ is a constant related to the harmonic $p_j\omega_k$, $0 \le r_{mi} \le \lfloor p_j\omega_k (b_i - a_i) / \pi \rfloor$, where $\lfloor \cdot \rfloor$ denotes the integer part. $\| x - c^p \|_{C^p}$ denotes a weighted norm of the input vector, the squared form of which is defined by

$\| x - c^p \|_{C^p}^2 = (C^p x)^T (C^p x) = x^T C^{pT} C^p x = \sigma_2 \cdots \sigma_n (x_1 - c_1)^2 + \sigma_1 \sigma_3 \cdots \sigma_n (x_2 - c_2)^2 + \cdots + \sigma_1 \sigma_2 \cdots \sigma_{n-1} (x_n - c_n)^2$,

where $C^p$ is an $n \times n$ norm weighting matrix, $n$ is the dimension of the input vector $X$, $C^{pT} C^p = \mathrm{diag}( c_{11}^p, c_{22}^p, \ldots, c_{nn}^p )$, and $c_{jj}^p = \sigma_1 \sigma_2 \cdots \sigma_{j-1} \sigma_{j+1} \cdots \sigma_n$, $j = 1, 2, \ldots, n$.

Specifically, we obtain the following design rules.

(1) The number of hidden units is $N = \prod_{i=1}^{n} N_i \sum_{j=0}^{N_i} \lfloor 2 (b_i - a_i) j / T_i \rfloor$, where $N_i = \lfloor B_i T_i / 2\pi \rfloor$.

(2) The centers are located at the points $c_{k, r_{mi}}^{p_j\omega_k} = \pi r_{mi} / (p_j\omega_k) + \pi k / (4 p_j\omega_k)$, where $k = 0, 1$ for the function $\cos( p_j\omega_k x_i + k\pi/2 )$, and the corresponding width is $\sigma_{p_j\omega_k}^2 = 0.35 (\tau_{jk}/2)^{1.75}$, where $\tau_{jk} = 2\pi / (p_j\omega_k)$.
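The two design rules translate directly into code. In the sketch below the width rule $\sigma^2 = 0.35(\tau/2)^{1.75}$ is as stated in the text, while the center formula is a reconstruction of the typographically damaged original and should be read as an assumption. For the harmonics $0.8\pi$ and $1.2\pi$ of Example 1 the width rule reproduces the values 0.52 and 0.26 quoted there.

    import numpy as np

    # Structural design rules of part D. The width rule is as given in
    # the text; the center rule c = pi*r/(p*omega) + pi*k/(4*p*omega)
    # is reconstructed from the garbled original (an assumption).

    def width_sq(p_omega):
        tau = 2.0 * np.pi / p_omega          # period of the harmonic
        return 0.35 * (tau / 2.0) ** 1.75

    def centers(p_omega, k, a, b):
        r_max = int(np.floor(p_omega * (b - a) / np.pi))
        return [np.pi * r / p_omega + np.pi * k / (4.0 * p_omega)
                for r in range(r_max + 1)]

    for p_omega in (0.8 * np.pi, 1.2 * np.pi):   # harmonics of Example 1
        print(f"p*omega = {p_omega:.3f}:",
              f"sigma^2 = {width_sq(p_omega):.2f},",
              f"centers (k=0) = {np.round(centers(p_omega, 0, 0.0, 2.0), 3)}")

For the interval $(0, 2.0)$ this yields two centers for the harmonic $0.8\pi$ and three for $1.2\pi$, matching $r_x = 2$ and $r_y = 3$ in Example 1 below.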
weighting parameters in the vicinity of these values can
E. Learning algorithm for RBF neural network
accelerate their convergence.
In general, a good learning algorithm must have fast
D. The structure of Gaussian radial basis function neural learning as well as good computational capacity and
network based on N-dimensional Fourier series generalization capacity. The number of adjusted weight is
We have an insight into the property of Gaussian n

function to approximate cosine function. According to this ∏N


i =1
i that is least than the number of hidden units, because
view, the following tasks are to determine the number,
location and width of the radial basis function. To be specific, some hidden units hold the same weight. The sole coefficients
we can design a Gaussian RBF network with the n dimension of the Fourier series made the following special algorithm
of input variables as follow. According to n-dimensional convergence to the sole solution. The parameters adjustable
Fourier series, we have are the weights between the hidden layer and the output layer.
Under the condition of definited performance index, getting

2754
optimum parameters of a linear mapping is a linear Bx = 2.51 , By = 2.51 , Tx = 2 π B x = 2.5 , Ty = 2 π B y = 1.67 .
optimization problem.
Here, the weights $w_k^p$ between the hidden layer and the output layer are initialized randomly. The training goal is to minimize the objective function

$E = \frac{1}{l} \sum_{i=1}^{l} e_i^2 = \frac{1}{l} \sum_{i=1}^{l} ( y_i - f(x_i, \theta) )^2$,

where $y_i$ is the desired output. The adaptation formulas for the linear weights of the RBF network are

$\partial E(n) / \partial w_k^p(n) = \sum_{i=1}^{M} \sum_{j=0}^{2(b_i - a_i)/T_{ij}} e_i(n) \phi(x_j)$,
$w_k^p(n+1) = w_k^p(n) - \eta\, \partial E(n) / \partial w_k^p(n)$.

Therefore, we obtain the correct mapping relationship of the Fourier neural network after obtaining the correct weights between the hidden layer and the output layer units. The linear weights of the output layer are the only set of adjustable parameters. In doing so, the likelihood of converging to an undesirable local minimum in center space and width space is reduced. The centers and widths of the RBFs are kept fixed during the learning process to avoid substantial fluctuations of the error.

The training of the neural-network models utilized in the experimental study was terminated according to the following stopping criterion: the algorithm is considered to have converged when the absolute rate of change in the average squared error $\varepsilon$ per epoch is sufficiently small, or when the iteration count reaches a pre-defined maximum number of iterations. After the learning iterations terminated by this stopping criterion, the network satisfies the pre-defined generalization performance, provided we have selected an appropriate value of $\varepsilon$, because the SRM-RBF network has the property of structural risk minimization.
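Putting the update rule and the stopping criterion together, the learning stage might look like the following sketch; the gradient is written in the standard least-mean-squares form, which we assume to be equivalent to the adaptation formula above, and the learning rate is illustrative.

    import numpy as np

    # Gradient descent on the linear output weights only, with the
    # stopping criterion of the text: stop when the change in average
    # squared error per epoch is small or a maximum count is reached.

    def train_weights(Phi, y, eta=0.1, tol=1e-8, max_iter=10000):
        # Phi: (l, h) fixed hidden-layer outputs; y: (l,) desired outputs
        l, h = Phi.shape
        w = 0.1 * np.random.default_rng(0).standard_normal(h)
        E_prev = np.inf
        for _ in range(max_iter):
            e = y - Phi @ w                    # e_i = y_i - f(x_i, theta)
            E = np.mean(e ** 2)
            if abs(E_prev - E) < tol:          # converged
                break
            E_prev = E
            w += eta * (2.0 / l) * Phi.T @ e   # w <- w - eta * dE/dw
        return w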
To verify the performance of our network, two examples comparing it with an RBF neural network with supervised selection of centers are presented below.

III. COMPUTER EXPERIMENTATION

Example 1: Approximation of a periodic function.

In this example, the approach of using the available frequency information to design RBF networks is embodied. The periodic function to be approximated is

$z = 0.2 \sin(0.8\pi x) \cos(1.2\pi y)$, $x, y \in (0, 2.0)$.

The sampling points are $\{ (x_i, y_i) \mid x_i = 2(i-1)/128,\ y_i = 2(i-1)/128 \}$ for the training set and $\{ (x_i, y_i) \mid x_i = 2(i-0.5)/128,\ y_i = 2(i-0.5)/128 \}$ for the test set, $i = 1, 2, \ldots, 128$. There are $128 \times 128$ patterns in the training and test sets respectively. The training set is corrupted with $0.2\,N(0,1)$ normal noise. In this case, $B_x = 2.51$, $B_y = 3.77$, $T_x = 2\pi / B_x = 2.5$, $T_y = 2\pi / B_y = 1.67$.

For the sine function $\sin(0.8\pi x)$, the number of centers is $r_x = 2$, from $r_x \ge 2(b - a)/T_x$. For the cosine function $\cos(1.2\pi y)$, the number of centers is $r_y = 3$, from $r_y \ge 2(b - a)/T_y$. The total number of centers is $r = r_x r_y = 6$. The center vectors are

$c_1 = ( c_{0,1}^{0.8\pi}, c_{1,1}^{1.2\pi} )$, $c_2 = ( c_{0,1}^{0.8\pi}, c_{1,2}^{1.2\pi} )$, ..., $c_6 = ( c_{0,2}^{0.8\pi}, c_{1,3}^{1.2\pi} )$,

where $c_{k, r_{mi}}^{p_j\omega_k} = \pi r_{mi} / (p_j\omega_k) + \pi k / (4 p_j\omega_k)$, $p_i\omega_j \in \{0.8\pi, 1.2\pi\}$, $0 \le p_i \le N_i$, $i \in \{1, 2, \ldots, n\}$, $k \in \{0, 1\}$. According to the formula $\sigma_{p_i\omega_j}^2 = 0.35 (\tau_{ij}/2)^{1.75}$, with $\tau_{ij} \in \{2.5, 1.67\}$, the widths are $\sigma_{0.8\pi}^2 = 0.52$ and $\sigma_{1.2\pi}^2 = 0.26$. The initial structure of the network is thus a 2-6-1 network. The approximation accuracy of the RBF network is evaluated by the formula

$\varepsilon = \sum_{x=1}^{128} \| e(x) \|^2$,

where $e(x) = y - y^*$, $y$ is the output of the network and $y^*$ is the desired output. The 2-6-1 network reaches the error $\varepsilon = 0.148$ on the training set and $\varepsilon = 0.163$ on the test set.

Remark: The number of centers inevitably increases with the input dimensionality; computational cost and approximation quality are a pair of contradictions, and we must reach a compromise between them.
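The data of Example 1 can be generated as follows; the error measure is the $\varepsilon$ defined above, and the random seed is an arbitrary assumption.

    import numpy as np

    # Example 1 data: 128 x 128 training grid with 0.2*N(0,1) noise,
    # and the shifted 128 x 128 test grid.

    i = np.arange(1, 129)
    xg_train = 2 * (i - 1) / 128
    xg_test = 2 * (i - 0.5) / 128

    def target(x, y):
        return 0.2 * np.sin(0.8 * np.pi * x) * np.cos(1.2 * np.pi * y)

    X, Y = np.meshgrid(xg_train, xg_train)
    rng = np.random.default_rng(0)                 # arbitrary seed
    Z_train = target(X, Y) + 0.2 * rng.standard_normal(X.shape)

    Xt, Yt = np.meshgrid(xg_test, xg_test)
    Z_test = target(Xt, Yt)                        # noise-free test targets

    def eps(y_net, y_star):
        # approximation accuracy: sum of squared errors
        return np.sum((y_net - y_star) ** 2)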
Example 2: Nonlinear system identification using the SRM-RBF neural network.

This example is adopted to evaluate the generalization of the SRM-RBF network. The plant is assumed to be of the form

$y(k+1) = f[ y(k), y(k-1), y(k-2), u(k), u(k-1) ]$.

This example has been taken from Narendra and Parthasarathy [9] for the purpose of comparison. The unknown function has the form

$f(x_1, x_2, x_3, x_4, x_5) = ( x_1 x_2 x_3 x_5 (x_3 - 1) + x_4 ) / ( 1 + x_2^2 + x_3^2 )$.

The SRM-RBF network described above is used. The centers and weights are initialized appropriately, and then tuned and adjusted by the presented method to improve the performance of the network. During the training of 20 learning epochs, the input to the model and the plant consists of uniformly distributed random numbers in the interval $[-1, 1]$. Each learning epoch consists of 800 training patterns. The input to the plant and the identified model during testing of the network performance is given by

$u(k) = \sin( 2\pi k / 250 )$, $k \le 500$;
$u(k) = 0.8 \sin( 2\pi k / 250 ) + 0.2 \sin( 2\pi k / 25 )$, $k > 500$.

Let $T_i = 2$ and extend the curves of $f(x_1, x_2, \ldots, x_5)$ for the variables $x_1, x_2, \ldots, x_5$ periodically on $(-\infty, \infty)$. Then $f(x_1, x_2, \ldots, x_5)$ turns into a periodic function.
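For reference, the plant of Example 2 and its test input can be simulated directly; zero initial conditions are an assumption (they are conventional for this benchmark but not stated in the text).

    import numpy as np

    # The Narendra-Parthasarathy plant of Example 2:
    #   y(k+1) = f[y(k), y(k-1), y(k-2), u(k), u(k-1)]
    # driven by the composite sinusoid test input.

    def f(x1, x2, x3, x4, x5):
        return (x1 * x2 * x3 * x5 * (x3 - 1) + x4) / (1 + x2 ** 2 + x3 ** 2)

    def u(k):
        if k <= 500:
            return np.sin(2 * np.pi * k / 250)
        return 0.8 * np.sin(2 * np.pi * k / 250) + 0.2 * np.sin(2 * np.pi * k / 25)

    y = np.zeros(802)                  # zero initial conditions (assumed)
    for k in range(2, 801):
        y[k + 1] = f(y[k], y[k - 1], y[k - 2], u(k), u(k - 1))
    # y[3:802] is the plant trajectory over the 800 test steps of Fig. 3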
The simulation result for the SRM-RBF neural network is shown in Fig. 3(a). For comparison, the experiment above is repeated using multi-layer perceptrons (MLPs) as the identification model, trained with the delta-bar-delta rule with adaptive neuron gains and with the standard back-propagation algorithm [9]. The identified output and the actual output of the plant are given in Fig. 3(b) and (c) for these two multi-layer perceptrons. As shown in Fig. 3, the SRM-RBF neural network model outperforms the MLP models in terms of accuracy measured in MSE.

[Fig. 3 The simulation results over 800 time steps for the SRM-RBF network (a), the MLP with delta-bar-delta and adaptive neuron gains (b), and the MLP with the standard back-propagation algorithm (c). The actual output is denoted by a solid line and the output of the network by a dotted line.]
IV. CONCLUSIONS

The RBF neural network restructured by means of structural risk minimization results in better generalization properties, minimizing the risk of overfitting. If the design of RBF networks lacks the necessary frequency information, the input-output mapping reconstructed by the learning algorithm to avoid overfitting has nothing to do with the true solution. There is no way to overcome this difficulty unless some prior frequency information about the input-output mapping is available. The approach of using the available frequency information to design RBF networks contrasts with the traditional approaches. Simulation results demonstrate that the presented SRM-RBF network has powerful overfitting-avoidance properties, at least as good as the Fourier approximation.

REFERENCES

[1] Robert J. Schilling, James J. Carroll, "Approximation of nonlinear systems with radial basis function neural networks," IEEE Transactions on Neural Networks, vol. 12, no. 1, pp. 1-15, January 2001.
[2] Irwin W. Sandberg, "Gaussian radial basis functions and the approximation of input-output maps," Proceedings of the 42nd IEEE Conference on Decision and Control, pp. 3635-3639, 2003.
[3] Tianping Chen and Hong Chen, "Approximation capability to functions of several variables, nonlinear functions, and operators by radial basis function neural networks," IEEE Transactions on Neural Networks, vol. 6, no. 4, pp. 904-910, July 1995.
[4] Chitra P., Marimuthu P., "Effects of moving the centers in an RBF network," IEEE Transactions on Neural Networks, vol. 13, no. 6, pp. 1299-1307, Nov 2002.
[5] Ying-hua Lu, Chun-guo Wu, "Center selection for RBF neural network in prediction of nonlinear time series," Proceedings of the Second International Conference on Machine Learning and Cybernetics, Xi'an, pp. 1355-1359, Nov 2003.
[6] Broomhead, D. S., and D. Lowe, "Multivariable functional interpolation and adaptive networks," Complex Systems, vol. 2, no. 2, pp. 321-355, 1988.
[7] Kirsch, A., An Introduction to the Mathematical Theory of Inverse Problems. New York: Springer-Verlag, 1996.
[8] Lanczos, C., Linear Differential Operators. London: Van Nostrand, 1964.
[9] K. S. Narendra, K. Parthasarathy, "Identification and control of dynamical systems using neural networks," IEEE Transactions on Neural Networks, vol. 1, no. 1, pp. 4-27, Mar 1990.