
Biological Clues for Up-to-Date Artificial Neurons

Article · January 2008


DOI: 10.1007/0-387-37452-3_6

2 authors:

Francisco Javier Ropero Peláez J.R.C. Piqueira


Universidade Federal do ABC (UFABC) University of São Paulo


Computational Intelligence
for Engineering and Manufacturing

Edited by

Diego Andina
Technical University of Madrid (UPM), Spain

Duc Truong Pham


Manufacturing Engineering Center, Cardiff University, Cardiff
A C.I.P. Catalogue record for this book is available from the Library of Congress.

ISBN-10 0-387-37450-7 (HB)


ISBN-13 978-0-387-37450-5 (HB)
ISBN-10 0-387-37452-3 (e-book)
ISBN-13 978-0-387-37452-9 (e-book)

Published by Springer,
P.O. Box 17, 3300 AA Dordrecht, The Netherlands.

www.springer.com

Printed on acid-free paper

All Rights Reserved


© 2007 Springer
No part of this work may be reproduced, stored in a retrieval system, or transmitted
in any form or by any means, electronic, mechanical, photocopying, microfilming, recording
or otherwise, without written permission from the Publisher, with the exception
of any material supplied specifically for the purpose of being entered
and executed on a computer system, for exclusive use by the purchaser of the work.
This book is dedicated to the
memory of
Roberto Carranza E., who
inspired in the authors the
enthusiasm to jointly prepare
this book.
CONTENTS

Contributing Authors ix

Preface xi

Acknowledgements xiii

1. Soft Computing and its Applications in Engineering and Manufacture 1


D. T. Pham, P. T. N. Pham, M. S. Packianather, A. A. Afify

2. Neural Networks Historical Review 39


D. Andina, A. Vega-Corona, J. I. Seijas,
J. Torres-García

3. Artificial Neural Networks 67


D. T. Pham, M. S. Packianather, A. A. Afify

4. Application of Neural Networks 93


D. Andina, A. Vega-Corona, J. I. Seijas,
M. J. Alarcón

5. Radial Basis Function Networks and their Application in Communication Systems 109


Ascensión Gallardo Antolín, Juan Pascual García,
José Luis Sancho Gómez

6. Biological Clues for Up-to-Date Artificial Neurons 131


Javier Ropero Peláez, Jose Roberto Castillo Piqueira

7. Support Vector Machines 147


Jaime Gómez Sáenz de Tejada, Juan Seijas Martínez-Echevarría

8. Fractals as Pre-Processing Tool for Computational Intelligence Application 193


Ana M. Tarquis, Valeriano Méndez, Juan B. Grau, José M. Antón,
Diego Andina

CONTRIBUTING AUTHORS

D. Andina, J. I. Seijas, J. Torres-García, M. J. Alarcón, A. Tarquis, J. B. Grau
and J. M. Antón work for Technical University of Madrid (UPM), Spain, where
they form the Group for Automation and Soft Computing (GASC).

D. T. Pham, P. T. N. Pham, M. S. Packianather and A. A. Afify work for Cardiff
University.

Javier Ropero Peláez and José Roberto Castillo Piqueira work for Escola Politécnica
da Universidade de São Paulo, Departamento de Engenharia de Telecomunicações
e Controle, Brazil.

A. Gallardo Antolín, J. Pascual García and J. L. Sancho Gómez work for
University Carlos III of Madrid, Spain.

A. Vega-Corona, V. Méndez and J. Gómez Sáenz de Tejada work for University
of Guanajuato, Mexico, Technical University of Madrid and Universidad Autónoma
of Madrid, Spain, respectively.

PREFACE

This book presents a selected collection of contributions on a focused treatment
of important elements of Computational Intelligence. Unlike traditional computing,
Computational Intelligence (CI) is tolerant of imprecise information, partial
truth and uncertainty. The principal components of CI that currently have frequent
application in Engineering and Manufacturing are: Neural Networks (NN), Fuzzy
Logic (FL) and Support Vector Machines (SVM). In CI, NN and SVM are concerned
with learning, while FL is concerned with imprecision and reasoning.
This volume mainly covers a key element of Computational Intelligence:
learning. All the contributions in this volume have a direct relevance to neural
network learning, from neural computing fundamentals to advanced networks such
as Multilayer Perceptrons (MLP), Radial Basis Function Networks (RBF), and their
relations with fuzzy set and support vector machine theory. The book also discusses
different applications in Engineering and Manufacturing. These are among the
applications where CI has excellent potential for use.
Both novice and expert readers should find this book a useful reference in the field
of Computational Intelligence. The editors and the authors hope to have contributed
to the field by paving the way for learning paradigms to solve real-world problems.

D. Andina

ACKNOWLEDGEMENTS

This document has been produced with the financial assistance of the European
Community, ALFA project II-0026-FA. The views expressed herein are those of
the Authors and can therefore in no way be taken to reflect the official opinion of
the European Community.
The editors wish to thank Dr A. Afify of Cardiff University and Mr A. Jevtic of
the Technical University of Madrid for their support and helpful comments during
the revision of this text.
The editors also wish to thank Nagib Callaos, President of the International
Institute of Informatics and Systemics, IIIS, for his permission and freedom to
reproduce in Chapters 2 and 4 of this book contents from the book by D. Andina
and F. Ballesteros (Eds), “Recent Advances in Neural Networks”, Ed. IIIS Press,
ILL, USA (2000).

CHAPTER 1
SOFT COMPUTING AND ITS APPLICATIONS
IN ENGINEERING AND MANUFACTURE

D. T. PHAM, P. T. N. PHAM, M. S. PACKIANATHER, A. A. AFIFY


Manufacturing Engineering Centre, Cardiff University, Cardiff CF24 3AA, United Kingdom

INTRODUCTION
Soft computing is a recent term for a computing paradigm that has been in existence
for almost fifty years. This chapter reviews five soft computing tools. They are:
knowledge-based systems, fuzzy logic, inductive learning, neural networks and
genetic algorithms. All of these tools have found many practical applications.
Examples of applications in engineering and manufacture will be given in the
chapter.

1. KNOWLEDGE-BASED SYSTEMS
Knowledge-based systems, or expert systems, are computer programs embodying
knowledge about a narrow domain for solving problems related to that domain.
An expert system usually comprises two main elements, a knowledge base and an
inference mechanism. The knowledge base contains domain knowledge which may
be expressed as any combination of “IF-THEN” rules, factual statements (or asser-
tions), frames, objects, procedures and cases. The inference mechanism is that part
of an expert system which manipulates the stored knowledge to produce solutions
to problems. Knowledge manipulation methods include the use of inheritance and
constraints (in a frame-based or object-oriented expert system), the retrieval and
adaptation of case examples (in a case-based expert system) and the application of
inference rules such as modus ponens (If A Then B; A Therefore B) and modus
tollens (If A Then B; NOT B Therefore NOT A) according to “forward chaining” or
“backward chaining” control procedures and “depth-first” or “breadth-first” search
strategies (in a rule-based expert system). With forward chaining or data-driven
inferencing, the system tries to match available facts with the IF portion of the
D. Andina and D.T. Pham (eds.), Computational Intelligence, 1–38.
© 2007 Springer.
IF-THEN rules in the knowledge base. When matching rules are found, one of them
is “fired”, i.e. its THEN part is made true, generating new facts and data which in
turn cause other rules to “fire”. Reasoning stops when no more new rules can fire.
In backward chaining or goal-driven inferencing, a goal to be proved is specified.
If the goal cannot be immediately satisfied by existing facts in the knowledge base,
the system will examine the IF-THEN rules for rules with the goal in their THEN
portion. Next, the system will determine whether there are facts that can cause any
of those rules to fire. If such facts are not available they are set up as subgoals.
The process continues recursively until either all the required facts are found and
the goal is proved or any one of the subgoals cannot be satisfied, in which case
the original goal is disproved. Both control procedures are illustrated in Figure 1.
Figure 1a shows how, given the assertion that a lathe is a machine tool and a set of
rules concerning machine tools, a forward-chaining system will generate additional
assertions such as “a lathe is power driven” and “a lathe has a tool holder”. Figure 1b
details the backward-chaining sequence producing the answer to the query “does a
lathe require a power source?”.
In the forward chaining example of Figure 1a, both rules R2 and R3 simulta-
neously qualify for firing when inferencing starts as both their IF parts match the
presented fact F1. Conflict resolution has to be performed by the expert system to
decide which rule should fire. The conflict resolution method adopted in this exam-
ple is “first come, first served”: R2 fires as it is the first qualifying rule encountered.
Other conflict resolution methods include “priority”, “specificity” and “recency”.
The search strategies can also be illustrated using the forward chaining example of
Figure 1a. Suppose that, in addition to F1, the knowledge base also initially contains
the assertion “a CNC turning centre is a machine tool”. Depth-first search involves
firing rules R2 and R3 with X instantiated to “lathe” (as shown in Figure 1a) before
firing them again with X instantiated to “CNC turning centre”. Breadth-first search
will activate rule R2 with X instantiated to “lathe” and again with X instantiated to
“CNC turning centre”, followed by rule R3 and the same sequence of instantiations.
Breadth-first search finds the shortest line of inferencing between a start position
and a solution if it exists. When guided by heuristics to select the correct search
path, depth-first search might produce a solution more quickly, although the search
might not terminate if the search space is infinite [Jackson, 1999].
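
The difference between the two search strategies in the example above comes down to the order in which (rule, instantiation) pairs are tried. The following is a minimal Python sketch of that ordering only; the function names and the list representation are illustrative assumptions, not part of any expert system shell.

```python
# Rules that match the two initial assertions, and the objects X can be bound to.
RULES = ["R2", "R3"]
OBJECTS = ["lathe", "CNC turning centre"]


def depth_first_order(rules, objects):
    """Fire every rule for one object before moving on to the next object."""
    return [(rule, obj) for obj in objects for rule in rules]


def breadth_first_order(rules, objects):
    """Fire one rule for every object before moving on to the next rule."""
    return [(rule, obj) for rule in rules for obj in objects]


if __name__ == "__main__":
    print("depth-first:  ", depth_first_order(RULES, OBJECTS))
    print("breadth-first:", breadth_first_order(RULES, OBJECTS))
```

Depth-first yields R2 and R3 for “lathe” followed by R2 and R3 for “CNC turning centre”, whereas breadth-first yields R2 for both objects followed by R3 for both objects.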
For more information on the technology of expert systems, see [Pham and Pham,
1988; Durkin, 1994; Giarratano and Riley, 1998; Darlington, 1999; Jackson, 1999;
Badiru and Cheung, 2002; Nurminen et al., 2003].
Most expert systems are nowadays developed using programs known as “shells”.
These are essentially ready-made expert systems complete with inferencing and
knowledge storage facilities but without the domain knowledge. Some sophisticated
expert systems are constructed with the help of “development environments”. The
latter are more flexible than shells in that they also provide means for users to
implement their own inferencing and knowledge representation methods. More
details on expert systems shells and development environments can be found in
[Price, 1990].
KNOWLEDGE BASE
(Initial State)
Fact :
F1 - A lathe is a machine tool
Rules :
R1 - If X is power driven Then X requires a power source
R2 - If X is a machine tool Then X has a tool holder
R3 - If X is a machine tool Then X is power driven

F1 & R2 match

KNOWLEDGE BASE
(Intermediate State)
Fact :
F1 - A lathe is a machine tool
F2 - A lathe has a tool holder
Rules :
R1 - If X is power driven Then X requires a power source
R2 - If X is a machine tool Then X has a tool holder
R3 - If X is a machine tool Then X is power driven

F1 & R3 match

KNOWLEDGE BASE
(Intermediate State)
Fact :
F1 - A lathe is a machine tool
F2 - A lathe has a tool holder
F3 - A lathe is power driven
Rules :
R1 - If X is power driven Then X requires a power source
R2 - If X is a machine tool Then X has a tool holder
R3 - If X is a machine tool Then X is power driven

F3 & R1 match

KNOWLEDGE BASE
(Final State)
Fact :
F1 - A lathe is a machine tool
F2 - A lathe has a tool holder
F3 - A lathe is power driven
F4 - A lathe requires a power source
Rules :
R1 - If X is power driven Then X requires a power source
R2 - If X is a machine tool Then X has a tool holder
R3 - If X is a machine tool Then X is power driven

Figure 1a. An example of forward chaining
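
The sequence of Figure 1a can be reproduced with a short Python sketch. It assumes a deliberately simplified representation (facts as subject/property pairs, each rule with a single IF property and a single THEN property) and uses the “first come, first served” conflict resolution described in the text; the names `RULES` and `forward_chain` are hypothetical.

```python
# Each rule: (name, IF property, THEN property), listed in knowledge-base order.
RULES = [
    ("R1", "is power driven", "requires a power source"),
    ("R2", "is a machine tool", "has a tool holder"),
    ("R3", "is a machine tool", "is power driven"),
]


def forward_chain(initial_facts, rules):
    """Data-driven inference: fire the first qualifying rule until none can fire."""
    facts = list(initial_facts)
    fired = []
    while True:
        for name, if_part, then_part in rules:
            # A rule qualifies if some fact matches its IF part and its
            # conclusion for that subject is not yet in the knowledge base.
            match = next(
                (subject for subject, prop in facts
                 if prop == if_part and (subject, then_part) not in facts),
                None,
            )
            if match is not None:
                facts.append((match, then_part))  # the THEN part is made true
                fired.append(name)
                break  # first come, first served: rescan from the first rule
        else:
            return facts, fired  # no rule fired during a full scan


if __name__ == "__main__":
    facts, fired = forward_chain([("lathe", "is a machine tool")], RULES)
    print(fired)
    for subject, prop in facts:
        print(f"A {subject} {prop}")
```

Starting from F1, the rules fire in the order R2, R3, R1, which yields the four facts of the final state in Figure 1a.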


KNOWLEDGE BASE
(Initial State)
Fact :
F1 - A lathe is a machine tool
Rules :
R1 - If X is power driven Then X requires a power source
R2 - If X is a machine tool Then X has a tool holder
R3 - If X is a machine tool Then X is power driven
GOAL STACK
Goal : Satisfied
G1 - A lathe requires a power source : ?

G1 & R1

KNOWLEDGE BASE
(Intermediate State)
Fact :
F1 - A lathe is a machine tool
Rules :
R1 - If X is power driven Then X requires a power source
R2 - If X is a machine tool Then X has a tool holder
R3 - If X is a machine tool Then X is power driven
GOAL STACK
Goal : Satisfied
G1 - A lathe requires a power source : ?
G2 - A lathe is power driven : ?

G2 & R3

KNOWLEDGE BASE
(Intermediate State)
Fact :
F1 - A lathe is a machine tool
Rules :
R1 - If X is power driven Then X requires a power source
R2 - If X is a machine tool Then X has a tool holder
R3 - If X is a machine tool Then X is power driven
GOAL STACK
Goal : Satisfied
G1 - A lathe requires a power source : ?
G2 - A lathe is power driven : ?
G3 - A lathe is a machine tool : ?

KNOWLEDGE BASE
(Intermediate State)
Fact :
F1 - A lathe is a machine tool
Rules :
R1 - If X is power driven Then X requires a power source
R2 - If X is a machine tool Then X has a tool holder
R3 - If X is a machine tool Then X is power driven
GOAL STACK
Goal : Satisfied
G1 - A lathe requires a power source : ?
G2 - A lathe is power driven : ?
G3 - A lathe is a machine tool : Yes

F1 & R3

KNOWLEDGE BASE
(Intermediate State)
Fact :
F1 - A lathe is a machine tool
F2 - A lathe is power driven
Rules :
R1 - If X is power driven Then X requires a power source
R2 - If X is a machine tool Then X has a tool holder
R3 - If X is a machine tool Then X is power driven
GOAL STACK
Goal : Satisfied
G1 - A lathe requires a power source : ?
G2 - A lathe is power driven : Yes

F2 & R1

KNOWLEDGE BASE
(Final State)
Fact :
F1 - A lathe is a machine tool
F2 - A lathe is power driven
F3 - A lathe requires a power source
Rules :
R1 - If X is power driven Then X requires a power source
R2 - If X is a machine tool Then X has a tool holder
R3 - If X is a machine tool Then X is power driven
GOAL STACK
Goal : Satisfied
G1 - A lathe requires a power source : Yes

Figure 1b. An example of backward chaining
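
The goal-driven sequence of Figure 1b can be sketched as a recursive procedure. As an illustration only, the Python code below assumes single-condition rules and a set of subject/property facts; `prove` is a hypothetical name, and the sketch does not guard against cyclic rule sets, which would make the recursion non-terminating.

```python
# Each rule: (name, IF property, THEN property).
RULES = [
    ("R1", "is power driven", "requires a power source"),
    ("R2", "is a machine tool", "has a tool holder"),
    ("R3", "is a machine tool", "is power driven"),
]


def prove(subject, goal, facts, rules):
    """Goal-driven inference: True if (subject, goal) can be established.

    Goals proved along the way are added to `facts`.
    """
    if (subject, goal) in facts:
        return True  # satisfied directly by an existing fact
    for _name, if_part, then_part in rules:
        # Look for rules with the goal in their THEN portion and set up
        # their IF portion as a subgoal.
        if then_part == goal and prove(subject, if_part, facts, rules):
            facts.add((subject, goal))
            return True
    return False  # no fact and no rule can satisfy the goal


if __name__ == "__main__":
    facts = {("lathe", "is a machine tool")}
    print(prove("lathe", "requires a power source", facts, RULES))
    print(sorted(facts))
```

Only the facts needed for the query are derived: “a lathe has a tool holder” is never added, because rule R2 plays no part in proving the goal.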


Among the five tools considered in this chapter, expert systems are probably
the most mature, with many commercial shells and development tools available
to facilitate their construction. Consequently, once the domain knowledge to be
incorporated in an expert system has been extracted, the process of building the
system is relatively simple. The ease with which expert systems can be developed
has led to a large number of applications of the tool. In engineering, applications
can be found for a variety of tasks including selection of materials, machine ele-
ments, tools, equipment and processes, signal interpreting, condition monitoring,
fault diagnosis, machine and process control, machine design, process planning,
production scheduling and system configuring. Some recent examples of specific
tasks undertaken by expert systems are:
• identifying and planning inspection schedules for critical components of an
offshore structure [Peers et al., 1994];
• automating the evaluation of manufacturability in CAD systems [Venkatachalam,
1994];
• choosing an optimal robot for a particular task [Kamrani et al., 1995];
• monitoring the technical and organisational problems of vehicle maintenance in
coal mining [Streichfuss and Burgwinkel, 1995];
• configuring paper feeding mechanisms [Koo and Han, 1996];
• training technical personnel in the design and evaluation of energy cogeneration
plants [Lara Rosano et al., 1996];
• storing, retrieving and adapting planar linkage designs [Bose et al., 1997];
• designing additive formulae for engine oil products [Shi et al., 1997];
• carrying out automatic remeshing during a finite-elements analysis of forging
deformation [Yano et al., 1997];
• designing of products and their assembly processes [Zha et al., 1998];
• modelling and control of combustion processes [Kalogirou, 2003];
• optimising the transient performances in the adaptive control of a planar robot
[De La Sen et al., 2004].

2. FUZZY LOGIC

A disadvantage of ordinary rule-based expert systems is that they cannot handle
new situations not covered explicitly in their knowledge bases (that is, situations
not fitting exactly those described in the “IF” parts of the rules). These rule-based
systems are completely unable to produce conclusions when such situations are
encountered. They are therefore regarded as shallow systems which fail in a “brittle”
manner, rather than exhibit a gradual reduction in performance when faced with
increasingly unfamiliar problems, as human experts would.
The use of fuzzy logic [Zadeh, 1965] which reflects the qualitative and inexact
nature of human reasoning can enable expert systems to be more resilient. With
fuzzy logic, the precise value of a variable is replaced by a linguistic description,
the meaning of which is represented by a fuzzy set, and inferencing is carried
out based on this representation. Fuzzy set theory may be considered an extension
of classical set theory. While classical set theory is about “crisp” sets with sharp
boundaries, fuzzy set theory is concerned with “fuzzy” sets whose boundaries are
“grey”.
In classical set theory, an element $u_i$ can either belong or not belong to a set $A$, i.e.
the degree to which element $u_i$ belongs to set $A$ is either 1 or 0. However, in fuzzy
set theory, the degree of belonging of an element $u_i$ to a fuzzy set $\underset{\sim}{A}$ is a real number
between 0 and 1. This is denoted by $\mu_{\underset{\sim}{A}}(u_i)$, the grade of membership of $u_i$ in $\underset{\sim}{A}$. Fuzzy
set $\underset{\sim}{A}$ is a fuzzy set in $U$, the “universe of discourse” or “universe” which includes all
objects to be discussed. $\mu_{\underset{\sim}{A}}(u_i)$ is 1 when $u_i$ is definitely a member of $\underset{\sim}{A}$ and $\mu_{\underset{\sim}{A}}(u_i)$ is
0 when $u_i$ is definitely not a member of $\underset{\sim}{A}$. For instance, a fuzzy set defining the term
“normal room temperature” might be:

(1) normal room temperature ≡ 0.0/below 10 °C + 0.3/10 °C–16 °C
    + 0.8/16 °C–18 °C + 1.0/18 °C–22 °C + 0.8/22 °C–24 °C
    + 0.3/24 °C–30 °C + 0.0/above 30 °C

The values 0.0, 0.3, 0.8 and 1.0 are the grades of membership to the given
fuzzy set of temperature ranges below 10 °C (above 30 °C), between 10 °C and
16 °C (24 °C–30 °C), between 16 °C and 18 °C (22 °C–24 °C) and between 18 °C and
22 °C. Figure 2(a) shows a plot of the grades of membership for “normal room
temperature”. For comparison, Figure 2(b) depicts the grades of membership for a
crisp set defining room temperatures in the normal range.
Knowledge in an expert system employing fuzzy logic can be expressed as
qualitative statements (or fuzzy rules) such as “If the room temperature is normal,
then set the heat input to normal”, where “normal room temperature” and “normal
heat input” are both fuzzy sets.
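A fuzzy rule relating two fuzzy sets $\underset{\sim}{A}$ and $\underset{\sim}{B}$ is effectively the Cartesian product
$\underset{\sim}{A} \times \underset{\sim}{B}$, which can be represented by a relation matrix $\underset{\sim}{R}$. Element $R_{ij}$ of $\underset{\sim}{R}$ is the
membership to $\underset{\sim}{A} \times \underset{\sim}{B}$ of pair $(u_i, v_j)$, $u_i \in \underset{\sim}{A}$ and $v_j \in \underset{\sim}{B}$. $R_{ij}$ is given by:

(2) $R_{ij} = \min\left(\mu_{\underset{\sim}{A}}(u_i),\ \mu_{\underset{\sim}{B}}(v_j)\right)$

For example, with “normal room temperature” defined as before and “normal
heat input” described by:

(3) normal heat input ≡ 0.2/1 kW + 0.9/2 kW + 0.2/3 kW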


Figure 2. (a) Fuzzy set of “normal temperature” (b) Crisp set of “normal temperature”; both panels plot the grade of membership against temperature (°C)

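$\underset{\sim}{R}$ can be computed as:

(4) $\underset{\sim}{R} = \begin{bmatrix}
0.0 & 0.0 & 0.0 \\
0.2 & 0.3 & 0.2 \\
0.2 & 0.8 & 0.2 \\
0.2 & 0.9 & 0.2 \\
0.2 & 0.8 & 0.2 \\
0.2 & 0.3 & 0.2 \\
0.0 & 0.0 & 0.0
\end{bmatrix}$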
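
Equation (2) is straightforward to evaluate. As a minimal Python sketch, the code below builds the relation matrix from the membership grades of equations (1) and (3), taken in the order in which the temperature ranges and heat inputs are listed; the variable and function names are illustrative assumptions.

```python
# Grades of membership from equation (1), ordered from "below 10 C" to "above 30 C".
NORMAL_ROOM_TEMPERATURE = [0.0, 0.3, 0.8, 1.0, 0.8, 0.3, 0.0]
# Grades of membership from equation (3), for 1 kW, 2 kW and 3 kW.
NORMAL_HEAT_INPUT = [0.2, 0.9, 0.2]


def relation_matrix(mu_a, mu_b):
    """Equation (2): R[i][j] = min(mu_A(u_i), mu_B(v_j))."""
    return [[min(a, b) for b in mu_b] for a in mu_a]


R = relation_matrix(NORMAL_ROOM_TEMPERATURE, NORMAL_HEAT_INPUT)

if __name__ == "__main__":
    for row in R:
        print(row)
```

Each row corresponds to one temperature range and each column to one heat input, so the result has seven rows and three columns, as in equation (4).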

A reasoning procedure known as the compositional rule of inference, which
is the equivalent of the modus-ponens rule in rule-based expert systems, enables
conclusions to be drawn by generalisation (extrapolation or interpolation) from the
qualitative information stored in the knowledge base. For instance, when the room
temperature is detected to be “slightly below normal”, a temperature-controlling
fuzzy expert system might deduce that the heat input should be set to “slightly
above normal”. Note that this conclusion might not be contained in any of the
fuzzy rules stored in the system. A well-known compositional rule of inference is
the max-min rule. Let $\underset{\sim}{R}$ represent the fuzzy rule “If $\underset{\sim}{A}$ Then $\underset{\sim}{B}$” and $\underset{\sim}{a} \equiv \sum_i \mu_i / u_i$
a fuzzy assertion. $\underset{\sim}{A}$ and $\underset{\sim}{a}$ are fuzzy sets in the same universe of discourse. The
max-min rule enables a fuzzy conclusion $\underset{\sim}{b} \equiv \sum_j \lambda_j / v_j$ to be inferred from $\underset{\sim}{a}$ and $\underset{\sim}{R}$
as follows:

(5) $\underset{\sim}{b} = \underset{\sim}{a} \circ \underset{\sim}{R}$

(6) $\lambda_j = \max_i \left[ \min\left( \mu_i,\ R_{ij} \right) \right]$

For example, given the fuzzy rule “If the room temperature is normal, then set the
heat input to normal” where “normal room temperature” and “normal heat input”
are as defined previously, and a fuzzy temperature measurement of

(7) temperature ≡ 0.0/below 10 °C + 0.4/10 °C–16 °C + 0.8/16 °C–18 °C
    + 0.8/18 °C–22 °C + 0.2/22 °C–24 °C + 0.0/24 °C–30 °C
    + 0.0/above 30 °C

the heat input will be deduced as:

(8) heat input = temperature $\circ\ \underset{\sim}{R}$
    = 0.2/1 kW + 0.8/2 kW + 0.2/3 kW
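
The deduction in equation (8) can be checked numerically. The Python sketch below applies the max-min rule to the temperature measurement of equation (7) and the relation matrix of equation (4); the list representation and the function name `max_min_compose` are assumptions made for illustration.

```python
# Relation matrix of equation (4): rows are temperature ranges, columns are heat inputs.
R = [
    [0.0, 0.0, 0.0],
    [0.2, 0.3, 0.2],
    [0.2, 0.8, 0.2],
    [0.2, 0.9, 0.2],
    [0.2, 0.8, 0.2],
    [0.2, 0.3, 0.2],
    [0.0, 0.0, 0.0],
]
# Fuzzy temperature measurement of equation (7), in the same range order.
TEMPERATURE = [0.0, 0.4, 0.8, 0.8, 0.2, 0.0, 0.0]


def max_min_compose(assertion, relation):
    """Max-min rule: for each column j, take the maximum over i of
    min(assertion[i], relation[i][j])."""
    n_cols = len(relation[0])
    return [
        max(min(grade, row[j]) for grade, row in zip(assertion, relation))
        for j in range(n_cols)
    ]


heat_input = max_min_compose(TEMPERATURE, R)

if __name__ == "__main__":
    for grade, level in zip(heat_input, ["1 kW", "2 kW", "3 kW"]):
        print(f"{grade}/{level}")
```

The result is 0.2/1 kW, 0.8/2 kW and 0.2/3 kW, in agreement with equation (8).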

For further information on fuzzy logic, see [Kaufmann, 1975; Klir and Yuan,
1995; 1996; Ross, 1995; Zimmermann, 1996; Dubois and Prade, 1998].
Fuzzy logic potentially has many applications in engineering where the domain
knowledge is usually imprecise. Notable successes have been achieved in the area
of process and machine control although other sectors have also benefited from this
tool. Recent examples of engineering applications include:
• controlling the height of the arc in a welding process [Bigand et al., 1994];
• controlling the rolling motion of an aircraft [Ferreiro Garcia, 1994];
• controlling a multi-fingered robot hand [Bas and Erkmen, 1995];
• analysing the chemical composition of minerals [Da Rocha Fernandes and Cid
Bastos, 1996];
• monitoring of tool-breakage in end-milling operations [Chen and Black, 1997];
• modelling of the set-up and bend sequencing process for sheet metal bending
[Ong et al., 1997];
• determining the optimal formation of manufacturing cells [Szwarc et al., 1997;
Zülal and Arikan, 2000];
• classifying discharge pulses in electrical discharge machining [Tarng et al.,
1997];
• modelling an electrical drive system [Costa Branco and Dente, 1998];
• improving the performance of hard disk drive final assembly [Zhao and De
Souza, 1998; 2001];
• analysing chatter occurring during a machine tool cutting process [Kong et al.,
1999];
• addressing the relationships between customer needs and design requirements
[Sohen and Choi, 2001; Vanegas and Labib, 2001; Karsak, 2004];
• assessing and selecting advanced manufacturing systems [Karsak and
Kuzgunkaya, 2002; Bozdağ et al., 2003; Beskese et al., 2004; Kulak and Kahraman,
2004];
• evaluating cutting force uncertainty in turning [Wang et al., 2002];
• reducing defects in automotive coating operations [Lou and Huang, 2003].

3. INDUCTIVE LEARNING

The acquisition of domain knowledge to build into the knowledge base of an expert
system is generally a major task. In some cases, it has proved a bottleneck in
the construction of an expert system. Automatic knowledge acquisition techniques
have been developed to address this problem. Inductive learning is an automatic
technique for knowledge acquisition. The inductive approach produces a structured
representation of knowledge as the outcome of learning. Induction involves gener-
alising a set of examples to yield a selected representation which can be in terms
of a set of rules, concepts or logical inferences or a decision tree.
An inductive learning program usually requires as input a set of examples. Each
example is characterised by the values of a number of attributes and the class
to which it belongs. In one approach to inductive learning, through a process of
“dividing-and-conquering” where attributes are chosen according to some strategy
(for example, to maximise the information gain) to divide the original example set
into subsets, the inductive learning program builds a decision tree that correctly
classifies the given example set. The tree represents the knowledge generalised
from the specific examples in the set. This can subsequently be used to handle
situations not explicitly covered by the example set.
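
To illustrate the attribute selection step of the divide-and-conquer approach, the Python sketch below computes the information gain of each attribute for the cutting tool examples listed in Table 1 of this chapter. It uses the standard entropy-based definition of information gain; the data layout and the function names are assumptions for illustration and are not taken from any particular induction program.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Sensor_1", "Sensor_2", "Sensor_3"]
# (Sensor_1, Sensor_2, Sensor_3, tool state), copied from Table 1.
EXAMPLES = [
    (0, -1, 0, "Normal"), (1, 0, 0, "Normal"), (1, -1, 1, "Worn"),
    (1, 0, 1, "Normal"), (0, 0, 1, "Normal"), (1, 1, 1, "Worn"),
    (1, -1, 0, "Normal"), (0, -1, 1, "Worn"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(examples, attr_index):
    """Reduction in class entropy obtained by splitting on one attribute."""
    labels = [e[-1] for e in examples]
    remainder = 0.0
    for value in {e[attr_index] for e in examples}:
        subset = [e[-1] for e in examples if e[attr_index] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder


gains = {name: information_gain(EXAMPLES, i) for i, name in enumerate(ATTRIBUTES)}
best_attribute = max(gains, key=gains.get)

if __name__ == "__main__":
    for name, gain in gains.items():
        print(f"{name}: {gain:.4f}")
    print("split on:", best_attribute)
```

For this example set, Sensor_2 gives the largest gain, so a program following this strategy would place it at the root of the decision tree and then repeat the procedure on each resulting subset.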
In another approach known as the “covering approach”, the inductive learning
program attempts to find groups of attributes uniquely shared by examples in given
classes and forms rules with the IF part as conjunctions of those attributes and the
THEN part as the classes. The program removes correctly classified examples from
consideration and stops when rules have been formed to classify all examples in
the given set.
A new approach to inductive learning, “inductive logic programming”, is a
combination of induction and logic programming. Unlike conventional inductive
learning which uses propositional logic to describe examples and represent new
concepts, inductive logic programming (ILP) employs the more powerful predicate
logic to represent training examples and background knowledge and to express new
concepts. Predicate logic permits the use of different forms of training examples
and background knowledge. It enables the results of the induction process, that is
the induced concepts, to be described as general first-order clauses with variables
and not just as zero-order propositional clauses made up of attribute-value pairs.
There are two main types of ILP systems, the first, based on the top-down general-
isation/specialisation method, and the second, on the principle of inverse resolution
[Muggleton, 1992; Lavrac, 1994].
A number of inductive learning programs have been developed. Some of the
well known programs are CART [Breiman et al., 1998], ID3 and its descen-
dants C4.5 and C5.0 [Quinlan, 1983; 1986; 1993; ISL, 1998; RuleQuest, 2000]
which are divide-and-conquer programs, the AQ family of programs [Michalski,
1969; 1990; Michalski et al., 1986; Cervone et al., 2001; Michalski and Kaufman,
2001] which follow the covering approach, the FOIL program [Quinlan, 1990;
Quinlan and Cameron-Jones, 1995] which is an ILP system adopting the gener-
alisation/specialisation method and the GOLEM program [Muggleton and Feng,
1990] which is an ILP system based on inverse resolution. Although most pro-
grams only generate crisp decision rules, algorithms have also been developed to
produce fuzzy rules [Wang and Mendel, 1992; Janikow, 1998; Hang and Chen,
2000; Baldwin and Martin, 2001; Wang et al., 2001; Baldwin and Karale, 2003;
Wang et al., 2003].
Figure 3 shows the main steps in RULES-3 Plus, an induction algorithm in the
covering category [Pham and Dimov, 1997] and belonging to the RULES family of
rule extraction systems [Pham and Aksoy, 1994; 1995a; 1995b; Pham et al., 2000;
Pham et al., 2003; Pham and Afify, 2005a]. The simple problem of detecting the
state of a metal cutting tool is used to explain the operation of RULES-3 Plus.
Three sensors are employed to monitor the cutting process and, according to the
signals obtained from them (1 or 0 for sensors 1 and 3; −1, 0, or 1 for sensor 2),
the tool is inferred as being “normal” or “worn”. Thus, this problem involves three
attributes which are the states of sensors 1, 2 and 3 and the signals that they emit
constitute the values of those attributes. The example set for the problem is given
in Table 1.

Table 1. Training set for the Cutting Tool problem

Example   Sensor_1   Sensor_2   Sensor_3   Tool State
1         0          −1         0          Normal
2         1          0          0          Normal
3         1          −1         1          Worn
4         1          0          1          Normal
5         0          0          1          Normal
6         1          1          1          Worn
7         1          −1         0          Normal
8         0          −1         1          Worn
Step 1. Take an unclassified example and form array SETAV.
Step 2. Initialise arrays PRSET and T_PRSET (PRSET and T_PRSET will consist of
        m_PRSET expressions with null conditions and zero H measures) and set n_co = 0.
Step 3. IF n_co < n_a
        THEN n_co = n_co + 1 and set m = 0;
        ELSE the example itself is taken as a rule and STOP.
Step 4. DO
          m = m + 1;
          Specialise expression m in PRSET by appending to it a condition from SETAV
          that differs from the conditions already included in the expression;
          Compute the H measure for the expression;
          IF its H measure is higher than the H measure of any expression in T_PRSET
          THEN replace the expression having the lowest H measure with the newly
               formed expression;
          ELSE discard the new expression;
        WHILE m < m_PRSET.
Step 5. IF there are consistent expressions in T_PRSET
        THEN choose as a rule the expression that has the highest H measure and discard
             the others;
        ELSE copy T_PRSET into PRSET;
             initialise T_PRSET and go to step 3.

Figure 3. Rule forming procedure of RULES-3 Plus

Notes: n_co – number of conditions; n_a – number of attributes;
m_PRSET – number of expressions stored in PRSET (m_PRSET is user-provided);
T_PRSET – a temporary array of partial rules of the same dimension as PRSET

In step 1, example 1 is used to form the attribute-value array SETAV which will
contain the following attribute-value pairs: [Sensor_1 = 0], [Sensor_2 = −1] and
[Sensor_3 = 0].
In step 2, the partial rule set PRSET and T_PRSET, the temporary version of
PRSET used for storing partial rules in the process of rule construction, are
initialised. This creates for each of these sets three expressions having null conditions
and zero H measures. The H measure for an expression is defined as:

(9) $H = \sqrt{\dfrac{E^c}{E}} \left[ 2 - 2\sqrt{\dfrac{E_i^c}{E^c}\,\dfrac{E_i}{E}} - 2\sqrt{\left(1 - \dfrac{E_i^c}{E^c}\right)\left(1 - \dfrac{E_i}{E}\right)} \right]$

where $E^c$ is the number of examples covered by the expression (the total number
of examples correctly classified and misclassified by a given rule), $E$ is the total
number of examples, $E_i^c$ is the number of examples covered by the expression and
belonging to the target class $i$ (the number of examples correctly classified by a
given rule), and $E_i$ is the number of examples in the training set belonging to the
target class $i$. In Equation (9), the first term

(10) $G = \sqrt{\dfrac{E^c}{E}}$

relates to the generality of the rule and the second term

(11) $A = 2 - 2\sqrt{\dfrac{E_i^c}{E^c}\,\dfrac{E_i}{E}} - 2\sqrt{\left(1 - \dfrac{E_i^c}{E^c}\right)\left(1 - \dfrac{E_i}{E}\right)}$

indicates its accuracy.
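
As a worked illustration of Equation (9), the Python sketch below evaluates the H measure of an expression against the training set of Table 1. The representation of an expression as a mapping from attribute index to required value, and the function name `h_measure`, are assumptions made for this example.

```python
from math import sqrt

# (Sensor_1, Sensor_2, Sensor_3, tool state), copied from Table 1.
EXAMPLES = [
    (0, -1, 0, "Normal"), (1, 0, 0, "Normal"), (1, -1, 1, "Worn"),
    (1, 0, 1, "Normal"), (0, 0, 1, "Normal"), (1, 1, 1, "Worn"),
    (1, -1, 0, "Normal"), (0, -1, 1, "Worn"),
]


def h_measure(conditions, target_class, examples):
    """Equation (9). `conditions` maps an attribute index to its required value."""
    covered = [e for e in examples
               if all(e[i] == value for i, value in conditions.items())]
    e_total = len(examples)                                   # E
    e_c = len(covered)                                        # E^c
    e_i_c = sum(1 for e in covered if e[-1] == target_class)  # E_i^c
    e_i = sum(1 for e in examples if e[-1] == target_class)   # E_i
    if e_c == 0:
        return 0.0  # an expression covering nothing has no generality
    generality = sqrt(e_c / e_total)                          # Equation (10)
    accuracy = (2                                             # Equation (11)
                - 2 * sqrt((e_i_c / e_c) * (e_i / e_total))
                - 2 * sqrt((1 - e_i_c / e_c) * (1 - e_i / e_total)))
    return generality * accuracy


if __name__ == "__main__":
    # The three conditions of example 1, each tested against the class "Normal".
    for index, value in [(2, 0), (1, -1), (0, 0)]:
        h = h_measure({index: value}, "Normal", EXAMPLES)
        print(f"Sensor_{index + 1} = {value}: H = {h:.4f}")
```

For the three single-condition expressions derived from example 1, the sketch gives H values of 0.2565, 0.0113 and 0.0012 when rounded to four decimal places.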


In steps 3 and 4, by specialising PRSET using the conditions stored in SETAV,
the following expressions are formed and stored in T_PRSET:

1  Sensor_3 = 0 ⇒ Alarm = OFF    H = 0.2565
2  Sensor_2 = −1 ⇒ Alarm = OFF   H = 0.0113
3  Sensor_1 = 0 ⇒ Alarm = OFF    H = 0.0012

In step 5, a rule is produced as the first expression in T_PRSET applies to only
one class:

Rule 1  IF Sensor_3 = 0 THEN Alarm = OFF    H = 0.2565

Rule 1 can classify examples 2 and 7 in addition to example 1. Therefore, these
examples are marked as classified and the induction proceeds.
In the second iteration, example 3 is considered. T_PRSET, formed in step 4
after specialising the initial PRSET, now consists of the following expressions:

1 Sensor_3 = 1 ⇒ Alarm = ON H = 00406


2 Sensor_2 = −1 ⇒ Alarm = ON H = 00079
3 Sensor_1 = 1 ⇒ Alarm = ON H = 00005

As none of the expressions cover only one class, T_PRSET is copied into PRSET
(step 5) and the new PRSET has to be specialised further by appending the existing
expressions with conditions from SETAV. Therefore the procedure returns to step
3 for a new pass. The new T_PRSET formed at the end of step 4 contains the
following three expressions:

1. Sensor_2 = −1 AND Sensor_3 = 1 ⇒ Alarm = ON, H = 0.3876
2. Sensor_1 = 1 AND Sensor_3 = 1 ⇒ Alarm = ON, H = 0.0534
3. Sensor_1 = 1 AND Sensor_2 = −1 ⇒ Alarm = ON, H = 0.0008

As the first expression applies to only one class, the following rule is obtained:

Rule 2 IF Sensor_2 = −1 AND Sensor_3 = 1 THEN Alarm = ON, H = 0.3876

Rule 2 can classify examples 3 and 8, which again are marked as classified.
In the third iteration, example 4 is used to obtain the next rule:

Rule 3 IF Sensor_2 = 0 THEN Alarm = OFF, H = 0.2565

This rule can classify examples 4 and 5 and so they are also marked as classified.
In iteration 4, the last unclassified example 6 is employed for rule extraction,
yielding:

Rule 4 IF Sensor_2 = 1 THEN Alarm = ON, H = 0.2741

There are no remaining unclassified examples in the example set and the proce-
dure terminates at this point.
Due to its requirement for a set of examples in a rigid format (with known
attributes and of known classes), inductive learning has found rather limited appli-
cations in engineering as not many engineering problems can be described in terms
of such a set of examples. Another reason for the paucity of applications is that
inductive learning is generally more suitable for problems where attributes have
discrete or symbolic values than for those with continuous-valued attributes as in
many engineering problems. Some recent examples of applications of inductive
learning are:
• controlling a laser cutting robot [Luzeaux, 1994];
• controlling the functional electrical stimulation of spinally-injured humans
[Kostov et al., 1995];
• modelling job complexity in clothing production systems [Hui et al., 1997];
• analysing the constructability of a beam in a reinforced-concrete frame
[Skibniewski et al., 1997];
• analysing the results of tests on portable electronic products to discover useful
design knowledge [Zhou, 2001];
• accelerating rotogravure printing [Evans and Fisher, 2002];
• predicting JIT factory performance from past data that includes both good and
poor factory performance [Mathieu et al., 2002];
• developing an intelligent monitoring system for improving the reliability of a
manufacturing process [Peng, 2004];
• analysing data in a steel bar manufacturing company to help intelligent decision
making [Pham et al., 2004].
More information on inductive learning techniques and their applications in
engineering and manufacture can be found in [Pham et al., 2002; Pham and Afify,
2005b].

4. NEURAL NETWORKS
Like inductive learning programs, neural networks can capture domain knowledge
from examples. However, they do not archive the acquired knowledge in an explicit
form such as rules or decision trees and they can readily handle both continuous
and discrete data. They also have a good generalisation capability as with fuzzy
expert systems.
A neural network is a computational model of the brain. Neural network models
usually assume that computation is distributed over several simple units called
neurons which are interconnected and which operate in parallel (hence, neural
networks are also called parallel-distributed-processing systems or connectionist
systems). Figure 4 illustrates a typical model of a neuron. Output signal $y_j$ is a
function f of the sum of weighted input signals $x_i$. The activation function f can
be a linear, simple threshold, sigmoidal, hyperbolic tangent or radial basis function.
Instead of being deterministic, f can be a probabilistic function, in which case $y_j$
will be a binary quantity, for example, +1 or −1. The net input to such a stochastic
neuron – that is, the sum of weighted input signals $x_i$ – will then give the probability
of $y_j$ being +1 or −1.
How the inter-neuron connections are arranged and the nature of the connections
determine the structure of a network. How the strengths of the connections are
adjusted or trained to achieve a desired overall behaviour of the network is governed
by its learning algorithm. Neural networks can be classified according to their
structures and learning algorithms.
In terms of their structures, neural networks can be divided into two types:
feedforward networks and recurrent networks. Feedforward networks can perform a
static mapping between an input space and an output space: the output at a given
instant is a function only of the input at that instant. The most popular feedforward
neural network is the multi-layer perceptron (MLP): all signals flow in a single
direction from the input to the output of the network. Figure 5 shows an MLP with
three layers: an input layer, an output layer and an intermediate or hidden layer.
Neurons in the input layer only act as buffers for distributing the input signals xi to
neurons in the hidden layer. Each neuron j in the hidden layer operates according
to the model of Figure 4. That is, its output $y_j$ is given by:
(12) $y_j = f\left(\sum_i w_{ji} x_i\right)$

Figure 4. Model of a neuron (inputs $x_1, \ldots, x_n$ are weighted by $w_{j1}, \ldots, w_{jn}$, summed, and passed through $f(\cdot)$ to give $y_j$)



Figure 5. A multi-layer perceptron (input layer $x_1, x_2, \ldots, x_m$; hidden layer with weights $w_{11}, w_{12}, \ldots, w_{1m}$; output layer $y_1, \ldots, y_n$)

The outputs of neurons in the output layer are computed similarly.
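The forward pass of such a three-layer MLP can be sketched in a few lines. This is only an illustration of Equation (12) applied layer by layer: the sigmoid activation, the weight values and the function names are assumptions chosen for the example, and no bias terms are included because the neuron model of Figure 4 does not show any.

```python
import math

def sigmoid(net):
    """One possible activation function f."""
    return 1.0 / (1.0 + math.exp(-net))

def layer_output(weights, inputs, f=sigmoid):
    """Equation (12) for every neuron j of a layer: y_j = f(sum_i w_ji * x_i)."""
    return [f(sum(w * x for w, x in zip(row, inputs))) for row in weights]

def mlp_forward(hidden_w, output_w, x):
    """Input neurons only buffer x; hidden and output layers apply Equation (12)."""
    hidden = layer_output(hidden_w, x)
    return layer_output(output_w, hidden)

# Illustrative weights: 2 inputs, 2 hidden neurons, 1 output neuron.
hidden_w = [[0.5, -0.5], [1.0, 1.0]]
output_w = [[1.0, -1.0]]
y = mlp_forward(hidden_w, output_w, [1.0, 1.0])
print(y)
```

Each row of a weight matrix holds the weights $w_{ji}$ of one neuron j, so adding a neuron means adding a row.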


Other feedforward networks [Pham and Liu, 1999] include the learning vector
quantisation (LVQ) network, the cerebellar model articulation controller (CMAC)
network and the group method of data handling (GMDH) network.
Recurrent networks are networks where the outputs of some neurons are fed back
to the same neurons or to neurons in layers before them. Thus signals can flow
in both forward and backward directions. Recurrent networks are said to have
a dynamic memory: the output of such networks at a given instant reflects the
current input as well as previous inputs and outputs. Examples of recurrent networks
[Pham and Liu, 1999] include the Hopfield network, the Elman network and the
Jordan network. Figure 6 shows a well-known, simple recurrent neural network,
the Grossberg and Carpenter ART-1 network. The network has two layers, an input
layer and an output layer. The two layers are fully interconnected, with connections
in both the forward (or bottom-up) direction and the feedback (or top-down)
direction. The vector Wi of weights of the bottom-up connections to an output
neuron i forms an exemplar of the class it represents. All the Wi vectors constitute
the long-term memory of the network. They are employed to select the winning
neuron, the latter again being the neuron whose Wi vector is most similar to the
current input pattern. The vector Vi of the weights of the top-down connections
from an output neuron i is used for vigilance testing, that is, determining whether
an input pattern is sufficiently close to a stored exemplar. The vigilance vectors Vi
form the short-term memory of the network. Vi and Wi are related in that Wi is a
normalised copy of Vi , viz.

(13) $W_i = \dfrac{V_i}{\varepsilon + \sum_j V_{ji}}$

Figure 6. An ART-1 network (input layer and output layer joined by bottom-up weights W and top-down weights V)

where $\varepsilon$ is a small constant and $V_{ji}$ is the jth component of $V_i$ (i.e. the weight of the
connection from output neuron i to input neuron j).
Implicit “knowledge” is built into a neural network by training it. Neural networks
are trained and categorised according to two main types of learning algorithms:
supervised and unsupervised. In addition, there is a third type, reinforcement learn-
ing, which is a special case of supervised learning. In supervised training, the neural
network can be trained by being presented with typical input patterns and the cor-
responding expected output patterns. The error between the actual and expected
outputs is used to modify the strengths, or weights, of the connections between the
neurons. The backpropagation (BP) algorithm, a gradient descent algorithm, is the
most commonly adopted MLP training algorithm. It gives the change $\Delta w_{ji}$ in the
weight of a connection between neurons i and j as follows:
(14) $\Delta w_{ji} = \eta\,\delta_j\,x_i$
where $\eta$ is a parameter called the learning rate and $\delta_j$ is a factor depending on
whether neuron j is an output neuron or a hidden neuron. For output neurons,
 
(15) $\delta_j = \dfrac{\partial f}{\partial net_j}\left(y_j^{(t)} - y_j\right)$
and for hidden neurons,

(16) $\delta_j = \dfrac{\partial f}{\partial net_j}\sum_q w_{qj}\,\delta_q$

In Equation (15), $net_j$ is the total weighted sum of input signals to neuron j and
$y_j^{(t)}$ is the target output for neuron j.
As there are no target outputs for hidden neurons, in Equation (16), the difference
between the target and actual output of a hidden neuron j is replaced by the
weighted sum of the $\delta_q$ terms already obtained for neurons q connected to the output
of j. Thus, iteratively, beginning with the output layer, the $\delta$ term is computed
for neurons in all layers and weight updates determined for all connections. The
weight updating process can take place after the presentation of each training pattern
(pattern-based training) or after the presentation of the whole set of training patterns
(batch training). In either case, a training epoch is said to have been completed
when all training patterns have been presented once to the MLP.
For all but the most trivial problems, several epochs are required for the MLP to
be properly trained. A commonly adopted method to speed up the training is to add
a “momentum” term to Equation (14) which effectively lets the previous weight
change influence the new weight change, viz:
(17) $\Delta w_{ji}(k+1) = \eta\,\delta_j\,x_i + \mu\,\Delta w_{ji}(k)$
where $\Delta w_{ji}(k+1)$ and $\Delta w_{ji}(k)$ are weight changes in epochs k + 1 and k
respectively and $\mu$ is the “momentum” coefficient.
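The update rules of Equations (14) to (17) can be sketched as follows for a network with one hidden layer. This is a minimal illustration, not a tuned implementation: it assumes a sigmoid activation (so that the derivative of f with respect to the net input is y(1 − y)), applies the momentum term after every pattern (pattern-based training), omits bias weights, and uses hypothetical names and parameter values throughout.

```python
import math
import random

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def forward(w_hid, w_out, x):
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hid]
    y = [sigmoid(sum(w * hi for w, hi in zip(row, h))) for row in w_out]
    return h, y

def train_pattern(w_hid, w_out, dw_hid, dw_out, x, target, eta=0.5, mu=0.9):
    """One pattern-based update; weights and previous changes are modified in place.

    Returns the squared error measured before the update.
    """
    h, y = forward(w_hid, w_out, x)
    # Equation (15): delta for output neurons, with df/dnet = y * (1 - y)
    d_out = [yj * (1 - yj) * (t - yj) for yj, t in zip(y, target)]
    # Equation (16): delta for hidden neuron j from the deltas of the neurons it feeds
    d_hid = [h[j] * (1 - h[j]) * sum(w_out[q][j] * d_out[q] for q in range(len(w_out)))
             for j in range(len(w_hid))]
    # Equation (17): new change = eta * delta_j * input_i + mu * previous change
    for j, row in enumerate(w_out):
        for i in range(len(row)):
            dw_out[j][i] = eta * d_out[j] * h[i] + mu * dw_out[j][i]
            row[i] += dw_out[j][i]
    for j, row in enumerate(w_hid):
        for i in range(len(row)):
            dw_hid[j][i] = eta * d_hid[j] * x[i] + mu * dw_hid[j][i]
            row[i] += dw_hid[j][i]
    return sum((t - yj) ** 2 for yj, t in zip(y, target)) / 2

# Illustrative run: 3 inputs, 2 hidden neurons, 1 output, a single training pattern.
rng = random.Random(0)
w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(2)]]
dw_hid = [[0.0] * 3 for _ in range(2)]
dw_out = [[0.0] * 2]
errors = [train_pattern(w_hid, w_out, dw_hid, dw_out, [1.0, 0.0, 1.0], [1.0])
          for _ in range(50)]
print(errors[0], errors[-1])
```

Setting `mu=0` reduces the update to plain Equation (14), which makes the effect of the momentum term easy to compare.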
Some neural networks are trained in an unsupervised mode where only the input
patterns are provided during training and the networks learn automatically to cluster
them in groups with similar features. For example, training an ART-1 network
involves the following steps:
(i) initialising the exemplar and vigilance vectors Wi and Vi for all output neurons
by setting all the components of each Vi to 1 and computing Wi according
to Equation (13). An output neuron with all its vigilance weights set to 1
is known as an uncommitted neuron in the sense that it is not assigned to
represent any pattern classes;
(ii) presenting a new input pattern x;
(iii) enabling all output neurons so that they can participate in the competition for
activation;
(iv) finding the winning output neuron among the competing neurons, i.e. the
neuron for which $x \cdot W_i$ is largest; a winning neuron can be an uncommitted
neuron as is the case at the beginning of training or if there are no better
output neurons;
(v) testing whether the input pattern x is sufficiently similar to the vigilance
vector $V_i$ of the winning neuron. Similarity is measured by the fraction r of
bits in x that are also in $V_i$, viz.
(18) $r = \dfrac{x \cdot V_i}{\sum_j x_j}$
x is deemed to be sufficiently similar to $V_i$ if r is at least equal to the vigilance
threshold $\rho$ ($0 < \rho \le 1$);
(vi) going to step (vii) if $r \ge \rho$ (i.e. there is resonance); else disabling the winning
neuron temporarily from further competition and going to step (iv), repeating
this procedure until there are no further enabled neurons;
(vii) adjusting the vigilance vector Vi of the most recent winning neuron by log-
ically ANDing it with x, thus deleting bits in Vi that are not also in x;
computing the bottom-up exemplar vector Wi using the new Vi according to
Equation (13); activating the winning output neuron;
(viii) going to step (ii).
The above training procedure ensures that if the same sequence of training pat-
terns is repeatedly presented to the network, its long-term and short-term memories
are unchanged (i.e. the network is stable). Also, provided there are sufficient output
neurons to represent all the different classes, new patterns can always be learnt, as
a new pattern can be assigned to an uncommitted output neuron if it does not match
previously stored exemplars well (i.e. the network is plastic).
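Steps (i) to (viii) can be sketched for binary patterns as below. The sketch makes several assumptions that the text does not fix: patterns are lists of 0/1 with at least one bit set, ties in the competition go to the lowest-numbered neuron, the small constant in Equation (13) is 0.5, and a pattern that no enabled neuron accepts is simply left unassigned. All names are hypothetical.

```python
def art1_train(patterns, n_outputs, rho=0.7, eps=0.5, passes=1):
    """Cluster binary patterns; returns (V, assignments of the last pass)."""
    n = len(patterns[0])
    # (i) uncommitted neurons: all vigilance weights 1, W_i from Equation (13)
    V = [[1] * n for _ in range(n_outputs)]
    W = [[v / (eps + sum(Vi)) for v in Vi] for Vi in V]
    assignments = []
    for _ in range(passes):
        assignments = []
        for x in patterns:                        # (ii) present a pattern
            enabled = set(range(n_outputs))       # (iii) enable all output neurons
            winner = None
            while enabled:
                # (iv) winner = enabled neuron with the largest x . W_i
                i = max(sorted(enabled),
                        key=lambda k: sum(a * b for a, b in zip(x, W[k])))
                # (v) Equation (18): fraction of the bits of x also set in V_i
                r = sum(a * b for a, b in zip(x, V[i])) / sum(x)
                if r >= rho:                      # (vi) resonance
                    winner = i
                    break
                enabled.discard(i)                # (vi) disable, compete again
            if winner is not None:
                # (vii) V_i <- V_i AND x, then recompute W_i with Equation (13)
                V[winner] = [a & b for a, b in zip(V[winner], x)]
                W[winner] = [v / (eps + sum(V[winner])) for v in V[winner]]
            assignments.append(winner)            # (viii) next pattern
    return V, assignments

V, groups = art1_train([[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 0, 0]], n_outputs=3)
print(groups)  # [0, 1, 0]
```

Raising `rho` makes the vigilance test stricter, so more output neurons are committed; lowering it merges more patterns into one class.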
In reinforcement learning, instead of requiring a teacher to give target outputs
and using the differences between the target and actual outputs directly to modify
the weights of a neural network, the learning algorithm employs a critic only to
evaluate the appropriateness of the neural network output corresponding to a given
input. According to the performance of the network on a given input vector, the
critic will issue a positive or negative reinforcement signal. If the network has
produced an appropriate output, the reinforcement signal will be positive (a reward).
Otherwise, it will be negative (a penalty). The intention of this is to strengthen the
tendency to produce appropriate outputs and to weaken the propensity for generating
inappropriate outputs. Reinforcement learning is a trial-and-error operation designed
to maximise the average value of the reinforcement signal for a set of training input
vectors. An example of a simple reinforcement learning algorithm is a variation
of the associative reward-penalty algorithm [Hassoun, 1995]. Consider a single
stochastic neuron j with inputs $(x_1, x_2, x_3, \ldots, x_n)$. The reinforcement rule may be
written as [Hassoun, 1995]

(19) $w_{ji}(k+1) = w_{ji}(k) + l\,r(k)\,[y_j(k) - E(y_j(k))]\,x_i(k)$

where $w_{ji}$ is the weight of the connection between input i and neuron j, l is the learning
coefficient, r (which is +1 or −1) is the reinforcement signal, $y_j$ is the output of
neuron j, $E(y_j)$ is the expected value of the output, and $x_i(k)$ is the ith component
of the kth input vector in the training set. When learning converges, $w_{ji}(k+1) = w_{ji}(k)$
and so $E(y_j(k)) = y_j(k) = +1$ or −1. Thus, the neuron effectively becomes
deterministic. Reinforcement learning is typically slower than supervised learning.
It is more applicable to small neural networks used as controllers where it is difficult
to determine the target network output.
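A single update of Equation (19) can be sketched as follows. The text does not specify how the stochastic neuron turns its net input into a probability, so the sketch assumes P(y = +1) = 1 / (1 + exp(−2·net)), which gives E(y) = tanh(net); this choice, the function names and the learning coefficient are illustrative assumptions only.

```python
import math
import random

def expected_output(w, x):
    # Assumption: P(y = +1) = 1 / (1 + exp(-2 * net)), hence E(y) = tanh(net)
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))

def sample_output(w, x, rng):
    """Draw the binary output (+1 or -1) of the stochastic neuron."""
    p_plus = (1 + expected_output(w, x)) / 2
    return 1 if rng.random() < p_plus else -1

def reinforce(w, x, y, r, l=0.1):
    """Equation (19): w_ji(k+1) = w_ji(k) + l * r * (y_j - E(y_j)) * x_i."""
    e_y = expected_output(w, x)
    return [wi + l * r * (y - e_y) * xi for wi, xi in zip(w, x)]

rng = random.Random(0)
w = [0.0, 0.0]
x = [1.0, -1.0]
y = sample_output(w, x, rng)
r = 1 if y == 1 else -1          # hypothetical critic: reward an output of +1
w = reinforce(w, x, y, r)
print(w)
```

With a reward, the weights move so that the sampled output becomes more likely; with a penalty, they move the other way.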
For more information on neural networks, see [Michie et al., 1994; Hassoun,
1995; Pham and Liu, 1999; Yao, 1999; Jiang et al., 2002; Duch et al., 2004].
Neural networks can be employed as mapping devices, pattern classifiers or
pattern completers (auto-associative content addressable memories and pattern asso-
ciators). Like expert systems, they have found a wide spectrum of applications in
almost all areas of engineering, addressing problems ranging from modelling, pre-
diction, control, classification and pattern recognition, to data association, clustering,
signal processing and optimisation. Some recent examples of such applications are:
• predicting the tensile strength of composite laminates [Teti and Caprino, 1994];
• controlling a flexible assembly operation [Majors and Richards, 1995];
• choosing sheet metal working conditions [Lin and Chang, 1996];
• determining suitable cutting conditions in operation planning [Park et al., 1996;
Schultz et al., 1997];
• recognising control chart patterns [Pham and Oztemel, 1996];
• analysing vibration spectra [Smith et al., 1996];
• deducing velocity vectors in uniform and rotating flows by tracking the move-
ment of groups of particles [Jambunathan et al., 1997];
• setting the number of kanbans in a dynamic JIT factory [Wray et al., 1997;
Markham et al., 2000];
• generating knowledge for scheduling a flexible manufacturing system [Kim et al.,
1998; Priore et al., 2003];
• modelling and controlling dynamic systems including robot arms [Pham and Liu,
1999];
• acquiring and refining operational knowledge in industrial processes [Shigaki
and Narazaki, 1999];
• improving yield in a semiconductor manufacturing company [Shin and Park,
2000];
• identifying arbitrary geometric and manufacturing categories in CAD databases
[Ip et al., 2003];
• minimising the makespan in a flowshop scheduling problem [Akyol, 2004].

5. GENETIC ALGORITHMS

Conventional search techniques, such as hill-climbing, are often incapable of optimising
non-linear or multimodal functions. In such cases, a random search method
is generally required. However, undirected search techniques are extremely inef-
ficient for large domains. A genetic algorithm (GA) is a directed random search
technique, invented by Holland [Holland, 1975], which can find the global optimal
solution in complex multi-dimensional search spaces. A GA is modelled on natural
evolution in that the operators it employs are inspired by the natural evolution
process. These operators, known as genetic operators, manipulate individuals in a
population over several generations to improve their fitness gradually. Individuals
in a population are likened to chromosomes and usually represented as strings of
binary numbers.
The evolution of a population is described by the “schema theorem” [Holland,
1975; Goldberg, 1989]. A schema represents a set of individuals, i.e. a subset of the
population, in terms of the similarity of bits at certain positions of those individuals.
For example, the schema 1∗0∗ describes the set of individuals whose first and
third bits are 1 and 0, respectively. Here, the symbol ∗ means any value would be
acceptable. In other words, the values of bits at positions marked ∗ could be either
0 or 1. A schema is characterised by two parameters: defining length and order.
The defining length is the length between the first and last bits with fixed values.
The order of a schema is the number of bits with specified values. According to
the schema theorem, the distribution of a schema through the population from one
generation to the next depends on its order, defining length and fitness.
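The two schema parameters are easy to compute directly. The sketch below writes the wildcard as `*` and takes the defining length to be the distance between the positions of the first and last fixed bits, which is one common convention and is assumed here; the function names are hypothetical.

```python
def schema_order(schema):
    """Number of positions with a specified (non-wildcard) value."""
    return sum(ch != '*' for ch in schema)

def defining_length(schema):
    """Distance between the first and last fixed positions (0 if none are fixed)."""
    fixed = [i for i, ch in enumerate(schema) if ch != '*']
    return fixed[-1] - fixed[0] if fixed else 0

def matches(schema, individual):
    """True if the individual belongs to the set described by the schema."""
    return len(schema) == len(individual) and all(
        s in ('*', b) for s, b in zip(schema, individual))

print(schema_order('1*0*'), defining_length('1*0*'))  # 2 2
print(matches('1*0*', '1101'), matches('1*0*', '1111'))  # True False
```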
GAs do not use much knowledge about the optimisation problem under study
and do not deal directly with the parameters of the problem. They work with codes
which represent the parameters. Thus, the first issue in a GA application is how to
code the problem, i.e. how to represent its parameters. As already mentioned, GAs
operate with a population of possible solutions. The second issue is the creation
of a set of possible solutions at the start of the optimisation process as the initial
population. The third issue in a GA application is how to select or devise a suitable
set of genetic operators. Finally, as with other search algorithms, GAs have to know
the quality of the solutions already found to improve them further. An interface
between the problem environment and the GA is needed to provide this information.
The design of this interface is the fourth issue.

5.1 Representation

The parameters to be optimised are usually represented in a string form since this
type of representation is suitable for genetic operators. The method of representation
has a major impact on the performance of the GA. Different representation schemes
might cause different performances in terms of accuracy and computation time.
There are two common representation methods for numerical optimisation problems
[Blickle and Thiele, 1995; Michalewicz, 1996]. The preferred method is the
binary string representation method. The reason for this method being popular is
that the binary alphabet offers the maximum number of schemata per bit compared
to other coding techniques. Various binary coding schemes can be found in the
literature, for example, Uniform coding, Gray scale coding, etc. The second repre-
sentation method is to use a vector of integers or real numbers with each integer or
real number representing a single parameter.
When a binary representation scheme is employed, an important step is to decide
the number of bits to encode the parameters to be optimised. Each parameter should
be encoded with the optimal number of bits covering all possible solutions in the
solution space. When too few or too many bits are used the performance can be
adversely affected.
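To make the trade-off in the number of bits concrete, the following sketch encodes a real parameter from a known range as a fixed-length binary string and decodes it again. The plain binary quantisation, the Gray-code conversion and all names are illustrative assumptions rather than a specific scheme from the literature cited above.

```python
def encode(value, lo, hi, n_bits):
    """Map a real value in [lo, hi] to an n_bits-long binary string."""
    levels = 2 ** n_bits - 1
    k = round((value - lo) / (hi - lo) * levels)
    return format(k, f'0{n_bits}b')

def decode(bits, lo, hi):
    """Map a binary string back to a real value in [lo, hi]."""
    levels = 2 ** len(bits) - 1
    return lo + int(bits, 2) * (hi - lo) / levels

def to_gray(bits):
    """Convert a plain binary string to its Gray-coded equivalent."""
    k = int(bits, 2)
    return format(k ^ (k >> 1), f'0{len(bits)}b')

code = encode(0.3, 0.0, 1.0, 8)
print(code, decode(code, 0.0, 1.0), to_gray(code))
```

With n bits the parameter can be resolved to within (hi − lo) / (2^n − 1), so each extra bit halves the quantisation step but lengthens the string the GA has to search.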

5.2 Creation of Initial Population

At the start of optimisation, a GA requires a group of initial solutions. There are
two ways of forming this initial population. The first consists of using randomly
produced solutions created by a random number generator, for example. This method
is preferred for problems about which no a priori knowledge exists or for assessing
the performance of an algorithm.
The second method employs a priori knowledge about the given optimisation
problem. Using this knowledge, a set of requirements is obtained and solutions
which satisfy those requirements are collected to form an initial population. In this
case, the GA starts the optimisation with a set of approximately known solutions
and therefore convergence to an optimal solution can take less time than with the
previous method.

5.3 Genetic Operators

The flowchart of a simple GA is given in Figure 7. There are basically four genetic
operators: selection, crossover, mutation and inversion. Some of these operators
were inspired by nature. In the literature, many versions of these operators can be
found. It is not necessary to employ all of these operators in a GA because each
operates independently of the others. The choice or design of operators depends
on the problem and the representation scheme employed. For instance, operators
designed for binary strings cannot be directly used on strings coded with integers
or real numbers.

5.3.1 Selection
The aim of the selection procedure is to reproduce more copies of individuals whose fitness
values are high than of those whose fitness values are low. The selection procedure
has a significant influence on driving the search towards a promising area and
finding good solutions in a short time. However, the diversity of the population

Figure 7. Flowchart of a basic genetic algorithm (boxes, in order: Initial Population, Evaluation, Selection, Crossover, Mutation, Inversion)


must be maintained to avoid premature convergence and to reach the global optimal
solution. In GAs there are mainly two selection procedures: proportional selection,
also called stochastic selection, and ranking-based selection [Whitely, 1989].
Proportional selection is usually called “Roulette Wheel” selection, since its
mechanism is reminiscent of the operation of a Roulette Wheel. Fitness values of
individuals represent the widths of slots on the wheel. After a random spinning of
the wheel to select an individual for the next generation, slots with large widths
representing individuals with high fitness values will have a higher chance to be
selected.
One way to prevent premature convergence is to control the range of trials
allocated to any single individual, so that no individual produces too many offspring.
The ranking system is one such alternative selection algorithm. In this algorithm,
each individual generates an expected number of offspring which is based on the
rank of its performance and not on the magnitude [Baker, 1985].
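The two selection procedures can be contrasted with a short sketch. It assumes strictly positive fitness values, and the linear ranking weights (worst individual 1, best individual N) are just one possible ranking scheme chosen for illustration; the function names are hypothetical.

```python
import random

def proportional_probabilities(fitness):
    """Roulette wheel: slot width proportional to fitness."""
    total = sum(fitness)
    return [f / total for f in fitness]

def ranking_probabilities(fitness):
    """Ranking: slot width depends on rank only (assumed weights 1..N)."""
    order = sorted(range(len(fitness)), key=lambda i: fitness[i])
    ranks = [0] * len(fitness)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    total = sum(ranks)
    return [r / total for r in ranks]

def spin(probabilities, rng):
    """One spin of the wheel: index of the slot in which the pointer stops."""
    pointer, cumulative = rng.random(), 0.0
    for i, p in enumerate(probabilities):
        cumulative += p
        if pointer < cumulative:
            return i
    return len(probabilities) - 1  # guard against rounding error

fitness = [1.0, 100.0, 10.0]
print(proportional_probabilities(fitness))
print(ranking_probabilities(fitness))
rng = random.Random(0)
parents = [spin(ranking_probabilities(fitness), rng) for _ in range(4)]
```

In this example the fittest individual takes about 90% of the proportional wheel but only half of the ranking wheel, which illustrates how ranking limits the number of offspring of a single dominant individual.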

5.3.2 Crossover
This operation is considered the one that makes the GA different from other algo-
rithms, such as dynamic programming. It is used to create two new individuals
(children) from two existing individuals (parents) picked from the current popula-
tion by the selection operation. There are several ways of doing this. Some common
crossover operations are one-point crossover, two-point crossover, cycle crossover
and uniform crossover.
One-point crossover is the simplest crossover operation. Two individuals are
randomly selected as parents from the pool of individuals formed by the selection
procedure and cut at a randomly selected point. The tails, which are the parts after
the cutting point, are swapped and two new individuals (children) are produced.
Note that this operation does not change the values of bits. An example of one-point
crossover is shown in Figure 8.
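One-point crossover on bit strings amounts to swapping tails at the cut point. The sketch below uses a hypothetical function name and takes the cut position as an argument so that the strings of Figure 8 can be reproduced; in a GA the cut would be drawn at random.

```python
def one_point_crossover(parent1, parent2, cut):
    """Swap the tails of two equal-length strings after position `cut`."""
    child1 = parent1[:cut] + parent2[cut:]
    child2 = parent2[:cut] + parent1[cut:]
    return child1, child2

# Strings of Figure 8, cut after the third bit.
print(one_point_crossover('100010011110', '001011000110', 3))
# ('100011000110', '001010011110')
```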

5.3.3 Mutation
In this procedure, all individuals in the population are checked bit by bit and the bit
values are randomly reversed according to a specified rate. Unlike crossover, this is

Parent 1 100|010011110

Parent 2 001|011000110

New string 1 100|011000110

New string 2 001|010011110

Figure 8. Crossover

Old string 1100|0|1011101

New string 1100|1|1011101

Figure 9. Mutation

a monadic operation. That is, a child string is produced from a single parent string.
The mutation operator forces the algorithm to search new areas. Eventually, it helps
the GA to avoid premature convergence and find the global optimal solution. An
example is given in Figure 9.

5.3.4 Inversion
This operator is employed for a group of problems, such as the cell placement
problem, layout problem and travelling salesman problem. It also operates on one
individual at a time. Two points are randomly selected from an individual and the
part of the string between those two points is reversed (see Figure 10).
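Mutation and inversion both act on a single string and can be sketched as follows. The function names are hypothetical; `flip_bit` and `invert` take explicit positions so that Figures 9 and 10 can be reproduced, whereas `mutate` applies a mutation rate to every bit as described for the mutation procedure.

```python
import random

def flip_bit(bits, position):
    """Reverse the value of one bit."""
    flipped = '1' if bits[position] == '0' else '0'
    return bits[:position] + flipped + bits[position + 1:]

def mutate(bits, rate, rng):
    """Check the string bit by bit and flip each bit with probability `rate`."""
    return ''.join(flip_bit(b, 0) if rng.random() < rate else b for b in bits)

def invert(bits, start, end):
    """Reverse the segment bits[start:end]."""
    return bits[:start] + bits[start:end][::-1] + bits[end:]

print(flip_bit('110001011101', 4))   # Figure 9:  110011011101
print(invert('10110011101', 2, 6))   # Figure 10: 10001111101
print(mutate('110001011101', 0.05, random.Random(0)))
```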

5.4 Control Parameters

Important control parameters of a simple GA include the population size (number
of individuals in the population), crossover rate, mutation rate and inversion rate.
Several researchers have studied the effect of these parameters on the performance
of a GA [Schaffer et al., 1989; Grefenstette, 1986; Fogarty, 1989; Mahfoud, 1995;
Smith and Fogarty, 1997]. The main conclusions are as follows. A large population
size means the simultaneous handling of many solutions and increases the compu-
tation time per iteration; however since many samples from the search space are
used, the probability of convergence to a global optimal solution is higher than with
a small population size.
The crossover rate determines the frequency of the crossover operation. It is
useful at the start of optimisation to discover promising regions in the search space.
A low crossover frequency decreases the speed of convergence to such areas. If the
frequency is too high, it can lead to saturation around one solution. The mutation
operation is controlled by the mutation rate. A high mutation rate introduces high
diversity in the population and might cause instability. On the other hand, it is
usually very difficult for a GA to find a global optimal solution with too low a
mutation rate.

Old string 10|1100|11101

New string 10|0011|11101

Figure 10. Inversion of a binary string segment



5.5 Fitness Evaluation Function

The fitness evaluation unit in a GA acts as an interface between the GA and the
optimisation problem. The GA assesses solutions for their quality according to the
information produced by this unit and not by directly using information about their
structure. In engineering design problems, functional requirements are specified to
the designer who has to produce a structure which performs the desired functions
within predetermined constraints. The quality of a proposed solution is usually
calculated depending on how well the solution performs the desired functions
and satisfies the given constraints. In the case of a GA, this calculation must be
automatic and the problem is how to devise a procedure which computes the quality
of solutions.
Fitness evaluation functions might be complex or simple depending on the opti-
misation problem at hand. Where a mathematical equation cannot be formulated
for this task, a rule-based procedure can be constructed for use as a fitness function
or in some cases both can be combined. Where some constraints are very important
and cannot be violated, the structures or solutions which do so can be eliminated in
advance by appropriately designing the representation scheme. Alternatively, they
can be given low probabilities by using special penalty functions.
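A penalty-based fitness function of the kind mentioned above might look like the sketch below. It assumes a maximisation problem, constraints written as functions g with g(s) ≤ 0 when satisfied, and a fixed penalty weight; the objective, the constraint and all names are made up for illustration.

```python
def penalised_fitness(solution, objective, constraints, penalty=1000.0):
    """Objective value minus a penalty proportional to total constraint violation."""
    violation = sum(max(0.0, g(solution)) for g in constraints)  # g(s) <= 0 is feasible
    return objective(solution) - penalty * violation

# Hypothetical problem: maximise x * y subject to x + y <= 10.
area = lambda s: s[0] * s[1]
budget = lambda s: s[0] + s[1] - 10

print(penalised_fitness((2, 3), area, [budget]))  # 6.0
print(penalised_fitness((6, 6), area, [budget]))  # -1964.0
```

Infeasible solutions remain in the population but score so poorly that selection rarely reproduces them, which is the effect the penalty is meant to have.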
For further information on genetic algorithms, see [Holland, 1975; Goldberg,
1989; Davis, 1991; Mitchell, 1996; Pham and Karaboga, 2000; Freitas, 2002].
Genetic algorithms have found applications in engineering problems involving
complex combinatorial or multi-parameter optimisation. Some recent examples of
those applications are:
• configuring transmission systems [Pham and Yang, 1993];
• designing the knowledge base of fuzzy logic controllers [Pham and Karaboga,
1994];
• generating hardware description language programs for high-level specifi-
cation of the function of programmable logic devices [Seals and Whapshott,
1994];
• planning collision-free paths for mobile and redundant robots [Ashiru et al.,
1995; Wilde and Shellwat, 1997; Nearchou and Aspragathos, 1997];
• scheduling the operations of a job shop [Cho et al., 1996; Drake and Choudhry,
1997; Lee et al., 1997; Chryssolouris and Subramaniam, 2001; Pérez et al.,
2003];
• generating dynamic schedules for the operation and control of a flexible manu-
facturing cell [Jawahar et al., 1998];
• optimising the performance of an industrially designed inventory control system
[Disney, 2000];
• forming manufacturing cells and determining machine layout information for
cellular manufacturing [Wu et al., 2002];
• optimising assembly process plans to improve productivity [Li et al., 2003];
• improving the convergence speed and reducing the computational complexity of
neural networks [Öztürk and Öztürk, 2004].

6. SOME APPLICATIONS IN ENGINEERING AND MANUFACTURE

This section briefly reviews five engineering applications of the aforementioned
soft computing tools.

6.1 Expert Statistical Process Control

Statistical process control (SPC) is a technique for improving the quality of pro-
cesses and products through closely monitoring data collected from those processes
and products and using statistically-based tools such as control charts.
XPC is an expert system for facilitating and enhancing the implementation of sta-
tistical process control [Pham and Oztemel, 1996]. A commercially available shell
was employed to build XPC. The shell allows a hybrid rule-based and pseudo object-
oriented method of representing the standard SPC knowledge and process-specific
diagnostic knowledge embedded in XPC. The amount of knowledge involved is
extensive, which justifies the adoption of a knowledge-based systems approach.
XPC comprises four main modules. The construction module is used to set up
a control chart. The capability analysis module is for calculating process capabil-
ity indices. The on-line interpretation and diagnosis module assesses whether the
process is in control and determines the causes for possible out-of-control situa-
tions. It also provides advice on how to remedy such situations. The modification
module updates the parameters of a control chart to maintain true control over a
time-varying process. XPC has been applied to the control of temperature in an
injection moulding machine producing rubber seals. It has recently been enhanced
by integrating a neural network module with the expert system modules to detect
abnormal patterns in the control chart (see Figure 11).

6.2 Fuzzy Modelling of a Vibratory Sensor for Part Location

Figure 12 shows a six-degree-of-freedom vibratory sensor for determining the
coordinates of the centre of mass $(x_G, y_G)$ and orientation $\gamma$ of bulky rigid parts.
The sensor is designed to enable a robot to pick up parts accurately for machine
feeding or assembly tasks. The sensor consists of a rigid platform (P) mounted on
a flexible column (C). The platform supports one object (O) to be located at a time.
O is held firmly with respect to P. The static deflections of C under the weight of
O and the natural frequencies of vibration of the dynamic system comprising O,
P and C are measured and processed using a mathematical model of the system
to determine $x_G$, $y_G$ and $\gamma$ for O. In practice, the frequency measurements have
low repeatability, which leads to inconsistent location information. The problem
worsens when $\gamma$ is in the region 80°–90° relative to a reference axis of the sensor
because the mathematical model becomes ill-conditioned. In this “ill-conditioning”
region, an alternative to using a mathematical model to compute $\gamma$ is to adopt an
experimentally derived fuzzy model. Such a fuzzy model has to be obtained for
26 CHAPTER 1

Range Chart Mean Chart


UCL : 9 CL : 4 LCL : 0.00 UCL : 93 CL : 78 LCL : 63

15 30 45 60 75 98 15 30 45 60 75 98
Mean : 4.5 St. Dev : 1.5 PCI: 1.7 Mean : 72.5 St. Dev : 4.4 PSD : 4.0
State of the process: in-control State of the process: in-control

Warning !!!!!!

Process going out of control!

press any key to continue

the pattern is normal (%) : 0.00


the pattern is inc. trend (%) : 0.00
the pattern is dec. trend (%) : 100.00
the pattern is up. shift (%) : 0.00
the pattern is down. shift (%) : 0.00
the pattern is cyclic (%) : 0.00

press 999 to exit

Figure 11. XPC output screen

each specific object through calibration. A possible calibration procedure involves


placing the object at different positions xG  yG  and orientations  and recording
the periods of vibration T of the sensor. Following calibration, fuzzy rules relating
xG , yG and T to  could be constructed to form a fuzzy model of the behaviour of the
sensor for the given object. A simpler fuzzy model is achieved by observing that xG
SOFT COMPUTING AND ITS APPLICATIONS IN ENGINEERING AND MANUFACTURE 27

Platform Object O
P
Z Orientation y
z
yG Y

Column C

End of robot
arm

Figure 12. Schematic diagram of a vibratory sensor mounted on a robot wrist

and yG only affect the reference level of T and, if xG and yG are employed to define
that level, the trend in the relationship between T and  is the same regardless of
the position of the object. Thus, a simplified fuzzy model of the sensor consists of
rules such as “IF T-Tref  is small THEN -ref  is small” where Tref is the value
of T when the object is at position xG  yG  and orientation ref . ref could be chosen
as 80 , the point at which the fuzzy model is to replace the mathematical model.
Tref could be either measured experimentally or computed from the mathematical
model. To counteract the effects of the poor repeatability of period measurements
which are particularly noticeable in the “ill-conditioning” region, the fuzzy rules
are modified so that they take into account the variance in T. An example of a
modified fuzzy rule is:

“IF T-Tref  is small and T is small, THEN  − ref  is small”

In the above rule, T denotes the standard deviation in the measurement of T.


Fuzzy modelling of the vibratory sensor is detailed in Pham and Hafeez (1992).
Using a fuzzy model, the orientation  can be determined to ±2 accuracy in the
region 80 -90 . The adoption of fuzzy logic in this application has produced a
compact and transparent model from a large amount of noisy experimental data.
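To make the rule form above concrete, the following minimal sketch evaluates a small rule base of the type “IF (T − Tref) is small and the standard deviation of T is small THEN (orientation − reference orientation) is small”. It is not the model of Pham and Hafeez (1992): the triangular membership functions, their breakpoints, the units, the output centres, the assumption that T increases with orientation, the use of the same deviation condition in every rule, and the weighted-average defuzzification are all assumptions chosen for illustration.

```python
GAMMA_REF = 80.0  # degrees; the text suggests 80 as a possible reference orientation

# Hypothetical fuzzy sets on dT = T - T_ref, as (left foot, peak, right foot).
DT_SETS = {
    "small": (-0.5, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "large": (0.5, 1.0, 1.5),
}
# Hypothetical centres of the output sets on (gamma - gamma_ref), in degrees.
DGAMMA_CENTRES = {"small": 0.0, "medium": 5.0, "large": 10.0}
# Hypothetical fuzzy set "sigma_T is small".
SIGMA_SMALL = (-0.2, 0.0, 0.2)


def tri(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


def estimate_gamma(t, t_ref, sigma_t):
    """Fire each rule with min() as AND, then take the weighted average of centres."""
    d_t = t - t_ref
    mu_sigma = tri(sigma_t, *SIGMA_SMALL)
    numerator = denominator = 0.0
    for label, params in DT_SETS.items():
        strength = min(tri(d_t, *params), mu_sigma)
        numerator += strength * DGAMMA_CENTRES[label]
        denominator += strength
    if denominator == 0.0:
        return None  # no rule fires, e.g. the measurement is too noisy to trust
    return GAMMA_REF + numerator / denominator


print(estimate_gamma(10.25, 10.0, 0.0))
print(estimate_gamma(10.0, 10.0, 0.5))
```

Returning no estimate when the deviation is large is one possible design choice; a fuller model could instead include rules whose conclusions widen as the deviation grows.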

6.3 Induction of Feature Recognition Rules in a Geometric Reasoning System for Analysing 3D Assembly Models

Pham et al. (1999) have described a concurrent engineering approach involving
generating assembly strategies for a product directly from its 3D CAD model.
A feature-based CAD system is used to create assembly models of products.
A geometric reasoning module extracts assembly-oriented data for a product from
the CAD system after creating a virtual assembly tree that identifies the components
and sub-assemblies making up the given product (Figure 13a). The assembly
information extracted by the module includes: placement constraints and dimensions
used to specify the relevant position of a given component or sub-assembly;
geometric entities (edges, surfaces, etc.) used to constrain the component or
sub-assembly; and the parents and children of each entity employed as a placement
constraint. An example of the information extracted is shown in Figure 13b.

Feature recognition is applied to the extracted information to identify each feature
used to constrain a component or sub-assembly. The rule-based feature recognition
process has three possible outcomes:
1. The feature is recognised as belonging to a unique class.
2. The feature shares attributes with more than one class (see Figure 13c).
3. The feature does not belong to any known class.
Cases 2 and 3 require the user to decide the correct class of the feature and the
rule base to be updated. The updating is implemented via a rule induction program.
The program employs RULES-3 Plus, which automatically extracts new feature
recognition rules from examples provided to it in the form of characteristic vectors
representing different features and their respective class labels.

Rule induction is very suitable for this application because of the complexity
of the characteristic vectors and the difficulty of defining feature classes
manually.
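The three recognition outcomes listed above can be sketched as a simple dispatch over a rule base. This is a hypothetical illustration, not the authors' system and not an implementation of RULES-3 Plus: the attribute names (`through`, `side_walls`, `end_wall`) and the two rules are invented, and only the class labels BSL_1 and BSL_2 are taken from Figure 13c.

```python
# Hypothetical rule base: (class label, conditions that must all hold).
RULES = [
    ("BSL_1", {"through": False, "side_walls": "planar"}),
    ("BSL_2", {"through": False, "end_wall": "cylindrical"}),
]


def recognise(feature, rules=RULES):
    """Return (outcome, matching classes) for a feature's characteristic vector."""
    matches = sorted(
        {label for label, conditions in rules
         if all(feature.get(key) == value for key, value in conditions.items())}
    )
    if len(matches) == 1:
        return "unique", matches       # case 1
    if matches:
        return "ambiguous", matches    # case 2: attributes shared by several classes
    return "unknown", matches          # case 3: no known class


def record_example(examples, feature, user_label):
    """Store a user-labelled feature as a training example for later rule induction."""
    examples.append((dict(feature), user_label))
    return examples


new_feature = {"through": False, "side_walls": "planar", "end_wall": "cylindrical"}
outcome, classes = recognise(new_feature)
training_examples = []
if outcome in ("ambiguous", "unknown"):
    # In the described system the user decides the class; "BSL_2" is a stand-in here.
    record_example(training_examples, new_feature, "BSL_2")
print(outcome, classes, len(training_examples))
```

The stored examples stand for the characteristic vectors and class labels that would be passed to the rule induction program in order to update the rule base.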

Figure 13a. An assembly model


Bolt:
• Child of Block
• Placement constraints:
  1: alignment of two axes
  2: mating of the bottom surface of the bolt head and the upper surface of the block
• No child part in the assembly hierarchy

Block:
• No parents
• No constraints (root component)
• Next part in the assembly: Bolt

Figure 13b. An example of assembly information

Figure 13c. An example of feature recognition (a new form feature and the similar feature
classes detected for it: Partial Round Non-through Slot (BSL_2) and Rectangular Non-through
Slot (BSL_1))

6.4 Neural-network-based Automotive Product Inspection

Figure 14 depicts an intelligent inspection system for engine valve stem seals
[Pham and Oztemel, 1996]. The system comprises four CCD cameras connected
to a computer that implements neural-network-based algorithms for detecting and
classifying defects in the seal lips. Faults on the lip aperture are classified by a
multilayer perceptron. The input to the network is a 20-component vector, where
the value of each component is the number of times a particular geometric feature
is found on the aperture being inspected. The outputs of the network indicate
the type of defect on the seal lip aperture. A similar neural network is used to
classify defects on the seal lip surface. The accuracy of defect classification in both
perimeter and surface inspection is in excess of 80%. Note that this figure is not
the same as that for the accuracy in detecting defective seals, that is, differentiating
between good and defective seals. The latter task is also implemented using a
neural network, which achieves an accuracy of almost 100%. Neural networks are
necessary for this application because of the difficulty of describing precisely the
various types of defects and the differences between good and defective seals.
The neural networks are able to learn the classification task automatically from
examples.

Figure 14. Valve stem seal inspection system (block diagram: bowl feeder, chute and
indexing machine; lighting ring and four CCD cameras of 512 x 512 resolution; vision
system linked by Ethernet to a host PC; material handling and lighting controller on a
databus; good, reject and rework outlets)
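The input encoding described above, a vector of counts of geometric features, can be sketched as follows together with the forward pass of a small multilayer perceptron. Only the vector length of 20 comes from the text; the hidden layer size, the number of defect classes, the feature identifiers and the random (untrained) weights are assumptions, so the output shows the data flow rather than a meaningful classification.

```python
import math
import random
from collections import Counter

N_FEATURE_TYPES = 20   # from the text: a 20-component input vector
N_HIDDEN = 8           # assumption
N_DEFECT_CLASSES = 4   # assumption


def count_vector(detected_feature_ids):
    """Component i holds the number of times feature type i was detected."""
    counts = Counter(detected_feature_ids)
    return [counts[i] for i in range(N_FEATURE_TYPES)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One fully connected layer with sigmoid activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def random_layer(n_out, n_in, rng):
    """Untrained placeholder weights; a real network would be trained on examples."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def classify(x, hidden, output):
    """Return the index of the strongest output unit and all output activations."""
    scores = layer(layer(x, *hidden), *output)
    return scores.index(max(scores)), scores


rng = random.Random(0)
hidden = random_layer(N_HIDDEN, N_FEATURE_TYPES, rng)
output = random_layer(N_DEFECT_CLASSES, N_HIDDEN, rng)

x = count_vector([0, 0, 3, 19, 3, 3])  # hypothetical feature detections on one aperture
label, scores = classify(x, hidden, output)
print(x, label)
```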

6.5 GA-based Conceptual Design


TRADES is a system using GA techniques to produce conceptual designs of transmission
units [Pham and Yang, 1993]. The system has a set of basic building
blocks, such as gear pairs, belt drives and mechanical linkages, and generates
conceptual designs to satisfy given specifications by assembling the building blocks
into different configurations. The crossover, mutation and inversion operators of
the GA are employed to create new configurations from an existing population of
configurations. Configurations are evaluated for their compliance with the design
specifications. Potential solutions should provide the required speed reduction ratio
and motion transformation while not containing incompatible building blocks or
exceeding specified limits on the number of building blocks to be adopted. A fitness
function codifies the degree of compliance of each configuration. The maximum
fitness value is assigned to configurations that satisfy all functional requirements
without violating any constraints. As in a standard GA, information concerning
the fitness of solutions is employed to select solutions for reproduction, thus guiding
the process towards increasingly fitter designs as the population evolves. In
addition to the usual GA operators, TRADES incorporates new operators to avert
premature convergence to non-optimal solutions and facilitate the generation of a
variety of design concepts. Essentially, these operators reduce the chances of any
one configuration or family of configurations dominating the solution population
by avoiding crowding around very fit configurations and preventing multiple copies
of a configuration, particularly after it has been identified as a potential solution.

TRADES is able to produce design concepts from building blocks without requiring
much additional a priori knowledge. The manipulation of the building blocks to
generate new concepts is carried out by the GA in a stochastic but guided manner.
This enables good conceptual designs to be found without the need to search the
design space exhaustively.

Due to the very large size of the design space and the quasi-random operation
of the GA, novel solutions not immediately evident to a human designer are sometimes
generated by TRADES. On the other hand, impractical configurations could also
arise. TRADES incorporates a number of heuristics to filter out such design proposals.
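The general scheme described in this section (building blocks assembled into configurations, a fitness function measuring compliance, and crossover, mutation and inversion operators) can be sketched as a toy genetic algorithm. This is not TRADES: the block names, their speed ratios, the target ratio, the block limit, the operator probabilities and the fitness formula are all hypothetical, and motion transformation, block compatibility and the anti-crowding operators mentioned in the text are left out.

```python
import random

BLOCKS = {"gear_pair": 3.0, "belt_drive": 2.0, "worm_gear": 10.0}  # hypothetical ratios
MAX_BLOCKS = 4        # hypothetical limit on the number of building blocks
TARGET_RATIO = 12.0   # hypothetical required speed reduction ratio


def ratio(config):
    """Overall speed reduction ratio of a chain of building blocks."""
    result = 1.0
    for block in config:
        result *= BLOCKS[block]
    return result


def fitness(config):
    """1.0 when the target ratio is met within the block limit, lower otherwise."""
    if not config or len(config) > MAX_BLOCKS:
        return 0.0
    error = abs(ratio(config) - TARGET_RATIO) / TARGET_RATIO
    return 1.0 / (1.0 + error)


def crossover(a, b, rng):
    """One-point crossover: head of a joined to tail of b (child has len(b) blocks)."""
    cut = rng.randint(0, min(len(a), len(b)))
    return a[:cut] + b[cut:]


def mutate(config, rng):
    """Replace one randomly chosen block with a randomly chosen block type."""
    child = list(config)
    child[rng.randrange(len(child))] = rng.choice(sorted(BLOCKS))
    return child


def invert(config, rng):
    """Reverse the order of a randomly chosen segment of the configuration."""
    i, j = sorted(rng.sample(range(len(config) + 1), 2))
    return config[:i] + config[i:j][::-1] + config[j:]


def evolve(generations=30, pop_size=20, seed=0):
    rng = random.Random(seed)
    names = sorted(BLOCKS)
    population = [[rng.choice(names) for _ in range(rng.randint(1, MAX_BLOCKS))]
                  for _ in range(pop_size)]
    for _ in range(generations):
        weights = [fitness(c) for c in population]
        offspring = []
        while len(offspring) < pop_size:
            # Fitness-proportionate selection of two parents.
            a, b = rng.choices(population, weights=weights, k=2)
            child = crossover(a, b, rng)
            if rng.random() < 0.2:
                child = mutate(child, rng)
            if rng.random() < 0.1:
                child = invert(child, rng)
            offspring.append(child)
        population = offspring
    return max(population, key=fitness)


best = evolve()
print(best, ratio(best), fitness(best))
```

In this toy fitness, inversion leaves the ratio unchanged because multiplication is commutative; it would matter only in a model where the order of blocks affects the motion transformation.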

7. CONCLUSION

Over the past fifty years, the field of soft computing has produced a number of
powerful tools. This chapter has reviewed five of those tools, namely,
knowledge-based systems, fuzzy logic, inductive learning, neural networks and genetic
algorithms. Applications of the tools in engineering and manufacture have become
more widespread due to the power and affordability of present-day computers. It is
anticipated that many new applications will emerge and that, for demanding tasks,
greater use will be made of hybrid tools combining the strengths of two or more
of the tools reviewed here [Michalski and Tecuci, 1994; Medsker, 1995]. Other
technological developments in soft computing that will have an impact in engineering
include data mining, or the extraction of information and knowledge from
large databases [Limb and Meggs, 1994; Witten and Frank, 2000; Braha, 2001; Han
and Kamber, 2001; Pham and Afify, 2002; Klösgen and Żytkow, 2002; Giudici,
2003], and multi-agent systems, or distributed self-organising systems employing
entities that function autonomously in an unpredictable environment concurrently
with other entities and processes [Wooldridge and Jennings, 1994; Rzevski, 1995;
Márkus et al., 1996; Tharumarajah et al., 1996; Bento and Feijó, 1997; Monostori,
2002]. The appropriate deployment of these new soft computing tools and of the
tools presented in this chapter will contribute to the creation of more competitive
engineering systems.

8. ACKNOWLEDGEMENTS

This work was carried out within the ALFA project “Novel Intelligent Automation
and Control Systems II” (NIACS II), the ERDF (Objective One) projects “Innovation
in Manufacturing Centre” (IMC), “Innovative Technologies for Effective
Enterprises” (ITEE) and “Supporting Innovative Product Engineering and Responsive
Manufacturing” (SUPERMAN) and within the project “Innovative Production
Machines and Systems” (I*PROMS).

REFERENCES
Akyol D E, (2004), “Application of neural networks to heuristic scheduling algorithms”, Computers Ind.
Engng, 46, 679–696.
Ashiru I, Czanecki C and Routen T, (1995), “Intelligent operators and optimal genetic-based path
planning for mobile robots”, Proc. Int. Conf. on Recent Advances in Mechatronics, Istanbul, Turkey,
1018–1023.
Badiru A B and Cheung J Y, (2002), Fuzzy Engineering Expert Systems with Neural Network
Applications, John Wiley & Sons, New York.
Baker J E, (1985), “Adaptive selection methods for genetic algorithms”, Proc. 1st Int. Conf. on Genetic
Algorithms and Their Applications, Pittsburgh, PA, 101–111.
Baldwin J F and Karale S B, (2003), “New concepts for fuzzy partitioning, defuzzification and derivation
of probabilistic fuzzy decision trees”, Proc. 22nd Int. Conf. of the North American Fuzzy Information
Processing Society (NAFIPS-03), Chicago, Illinois, USA, 484–487.
Baldwin J F and Martin T P, (2001), “Towards inductive support logic programming”, Proc. Joint 9th
IFSA World Congress and 20th NAFIPS Int. Conf., Vancouver, Canada, 4, 1875–1880.
Bas K and Erkmen A M, (1995), “Fuzzy preshape and reshape control of Anthrobot-III 5-fingered robot
hand”, Proc. Int. Conf. on Recent Advances in Mechatronics, Istanbul, Turkey, 673–677.
Bento J and Feijó B, (1997), “An agent-based paradigm for building intelligent CAD systems”, Artificial
Intelligence in Engineering, 11 (3), 231–244.
Beskese A, Kahraman C and Irani Z, (2004), “Quantification of flexibility in advanced manufacturing
systems using fuzzy concepts”, Int. J. Production Economics, 89 (1), 45–56.
Bigand A, Goureau P and Kalemkarian J, (1994), “Fuzzy control of a welding process”, Proc. IMACS
Int. Symp. on Signal Processing, Robotics and Neural Networks (SPRANN 94), Villeneuve d’Ascq,
France, 379–342.
Blickle T and Thiele L, (1995), “A comparison of selection schemes used in genetic algorithms”,
Computer Engineering and Communication Networks Lab (TIK)-Report, No. 11, Version 1.1b, Swiss
Federal Institute of Technology (ETH), Zurich, Switzerland.
Bose A, Gini M and Riley D, (1997), “A case-based approach to planar linkage design”, Artificial
Intelligence in Engineering, 11 (2), 107–119.
Bozdağ C E, Kahraman C and Ruan D, (2003), “Fuzzy group decision making for selection among
computer integrated manufacturing systems”, Computers in Industry, 15 (1), 13–29.
Braha D, (2001), Data Mining for Design and Manufacturing: Methods and Applications. Kluwer
Academic Publishers, Boston.
Breiman L, Friedman J H, Olshen R A and Stone C J, (1984), Classification and Regression Trees,
Belmont, Wadsworth.
Cervone G, Panait L A and Michalski R S, (2001), “The development of the AQ20 learning system and
initial experiments”, Proc. 10th Inter. Symposium on Intelligent Information Systems, Poland.
Chen J C and Black J T, (1997), “A fuzzy-nets in-process (FNIP) system for tool-breakage monitoring
in end-milling operations”, Int. J Machine Tools Manufacturing, 37 (6), 783–800.
Cho B J, Hong S C and Okoma S, (1996), “Job shop scheduling using genetic algorithm”, Proc. 3rd
World Congress on Expert Systems, Seoul, Korea, 351–358.
Chryssolouris G and Subramaniam V, (2001), “Dynamic scheduling of manufacturing job shops using
genetic algorithms”, J. Intelligent Manufacturing, 12, 281–293.
Costa Branco P J and Dente J A, (1998), “An experiment in automatic modelling an electrical drive
system using fuzzy logic”, IEEE Trans on Systems, Man, and Cybernetics, 28 (2), 254–262.
Da Rocha Fernandes A M and Cid Bastos R, (1996), “Fuzzy expert systems for qualitative analysis of
minerals”, Proc. 3rd World Congress on Expert Systems, Seoul, Korea, February, 673–680.
Darlington K W, (1999), The Essence of Expert Systems, Prentice Hall.
Davis L, (1991), Handbook of Genetic Algorithms, Van Nostrand, New York, NY.
De La Sen M, Miñambres J J, Garrido A J, Almansa A and Soto J C, (2004), “Basic theoretical results
for expert systems: Application to the supervision of adaptation transients in planar robots”, Artificial
Intelligence, 152 (2), 173–211.
Disney S M, Naim M M and Towill D R, (2000), “Genetic algorithm optimisation of a class of inventory
control systems”, Inter. J. Production Economics, 68, 259–278.
Drake P R and Choudhry I A, (1997), “From apes to schedules”, Manufacturing Engineer, 76 (1), 43–45.
Dubois D and Prade H, (1998), “An introduction to fuzzy systems”, Clinica Chimica Acta, 270 (1),
3–29.
Duch W, Setiono R and Zurada J M, (2004), “Computational intelligence methods for rule-based data
understanding”, Proc. IEEE, 92 (5), 771–805.
Durkin J, (1994), Expert Systems Design and Development, Macmillan, New York.
Evans B and Fisher D, (2002), “Using decision tree induction to minimize process delays in printing
industry”, In: Handbook of Data Mining and Knowledge Discovery (W. Klösgen and J.M. Zytkow
(Eds.)), Oxford University Press.
Kong F, Yu J and Zhou X, (1999), “Analysis of fuzzy dynamic characteristics of machine cutting
process: Fuzzy stability analysis in regenerative-type-chatter”, Int. J. Machine Tools and Manufacture,
39 (8), 1299–1309.
Ferreiro Garcia R, (1994), “FAM rule as basis for poles shifting applied to the roll control of an aircraft”,
SPRANN 94 (ibid), 375–378.
Fogarty T C, (1989), “Varying the probability of mutation in the genetic algorithm”, Proc. Third Int.
Conf. on Genetic Algorithms and Their Applications, George Mason University, 104–109.
Freitas A A, (2002), Data Mining and Knowledge Discovery with Evolutionary Algorithms,
Springer-Verlag, Berlin, New York.
Giarratano J C and Riley G D, (1998), Expert Systems: Principles and Programming, 3rd Edition, PWS
Publishing Company, Boston, MA.
Giudici P, (2003), Applied Data Mining: Statistical Methods for Business and Industry, John Wiley &
Sons, England.
Goldberg D E, (1989), Genetic Algorithms in Search, Optimisation and Machine Learning, Addison
Wesley, Reading, MA.
Grefenstette J J, (1986), “Optimization of control parameters for genetic algorithms”, IEEE Trans on
Systems, Man and Cybernetics, 16 (1), 122–128.
Han J and Kamber M, (2001), Data Mining: Concepts and Techniques, Academic Press, USA.
Hassoun M H, (1995), Fundamentals of Artificial Neural Networks, MIT Press, Cambridge, MA.
Holland J H, (1975), Adaptation in Natural and Artificial Systems, The University of Michigan Press,
Ann Arbor, MI.
Hong T P and Chen J B, (2000), “Processing individual fuzzy attributes for fuzzy rule induction”, Fuzzy
Sets and Systems, 112 (1), 127–140.
Hui P C L, Chan K C K and Yeung K W, (1997), “Modelling job complexity in garment manufacturing
by inductive learning”, Inter. J. Clothing Science and Technology, 9 (1), 34–44.
Ip C Y, Regli W C, Sieger L and Shokoufandeh A, (2003), “Automated learning of model classification”,
Proc. 8th ACM Symposium on Solid Modeling and Applications, Seattle, Washington, USA, ACM
Press, 322–327.
ISL, (1998), Clementine Data Mining Package. SPSS UK Ltd., 1st Floor, St. Andrew’s House, West
Street, Woking, Surrey GU21 1EB, United Kingdom.
Jackson P, (1999), Introduction to Expert Systems, 3rd Edition, Addison-Wesley, Harlow, Essex.
Jambunathan K, Fontama V N, Hartle S L and Ashforth-Frost S, (1997), “Using ART 2 networks to
deduce flow velocities”, Artificial Intelligence in Engineering, 11 (2), 135–141.
Janikow C Z, (1998), “Fuzzy decision trees: Issues and methods”, IEEE Trans on Systems, Man, and
Cybernetics, 28 (1), 1–14.
Jawahar N, Aravindan P, Ponnambalam S G and Karthikeyan A A, (1998), “A genetic algorithm-based
scheduler for setup-constrained FMC”, Computers in Industry, 35, 291–310.
Jiang Y, Zhou Z-H and Chen Z-Q, (2002), “Rule learning based on neural network ensemble”, Proc.
Inter. Joint Conf. on Neural Networks, Honolulu, HI, 1416–1420.
Kalogirou S A, (2003), “Artificial Intelligence for modelling and control of combustion processes:
A review”, Progress in Energy and Combustion Science, 29 (6), 515–566.
Kamrani A K, Shashikumar S and Patel S, (1995), “An intelligent knowledge-based system for robotic
cell design”, Computers Ind. Engng, 29 (1–4), 141–145.
Karsak E E, (2004), “Fuzzy multiple objective programming framework to prioritize design requirements
in quality function deployment”, Computers Ind. Engng, (Submitted and accepted).
Karsak E E and Kuzgunkaya O, (2002), “A fuzzy multiple objective programming approach for the
selection of a flexible manufacturing system”, Int. J. Production Economics, 79 (2), 101–111.
Kaufmann A, (1975), Introduction to the Theory of Fuzzy Subsets, Vol.1, Academic Press, New York.
Kim C-O, Min Y-D and Yih Y, (1998), “Integration of inductive learning and neural networks for
multi-objective FMS scheduling”, Inter. J. Production Research, 36 (9), 2497–2509.
Klir G J and Yuan B, (1995), Fuzzy Sets and Fuzzy Logic: Theory and Applications, Prentice Hall,
Upper Saddle River, NJ.
Klir G J and Yuan B, (Eds.), (1996), Fuzzy Sets, Fuzzy Logic, and Fuzzy Systems – selected papers by
L A Zadeh, World Scientific, Singapore.
Klösgen W and Żytkow J M, (2002), Handbook of Data Mining and Knowledge Discovery, Oxford
University Press, New York.
Koo D Y and Han S H, (1996), “Application of the configuration design methods to a design expert
system for paper feeding mechanism”, Proc. 3rd World Congress on Expert Systems, Seoul, Korea,
February, 49–56.
Kostov A, Andrews B, Stein R B, Popovic D and Armstrong W W, (1995), “Machine learning in control
of functional electrical stimulation systems for locomotion”, IEEE Trans on Biomedical Engineering,
44 (6), 541–551.
Kulak O and Kahraman C, (2004), “Multi-attribute comparison of advanced manufacturing systems
using fuzzy vs. crisp axiomatic design approach”, Int. J. Production Economics, (Submitted and
accepted).
Lara Rosano F, Kemper Valverde N, De La Paz Alva C and Alcántara Zavala J, (1996), “Tutorial expert
system for the design of energy cogeneration plants”, Proc. 3rd World Congress on Expert Systems,
Seoul, Korea, February, 300–305.
Lavrac N and Dzeroski S, (1994), Inductive Logic Programming: Techniques and Applications, Ellis
Horwood, New York.
Lee C-Y, Piramuthu S and Tsai Y-K, (1997), “Job shop scheduling with a genetic algorithm and machine
learning”, Inter. J. Production Research, 35 (4), 1171–1191.
Li J R, Khoo L P and Tor S B, (2003), “A Tabu-enhanced genetic algorithm approach for assembly
process planning”, J. Intelligent Manufacturing, 14, 197–208.
Limb P R and Meggs G J, (1994), “Data mining tools and techniques”, British Telecom Technology
Journal, 12 (4), 32–41.
Lin Z-C and Chang D-Y, (1996), “Application of a neural network machine learning model in the
selection system of sheet metal bending tooling”, Artificial Intelligence in Engineering, 10, 21–37.
Lou H H and Huang Y L, (2003), “Hierarchical decision making for proactive quality control: System
development for defect reduction in automotive coating operations”, Engineering Applications of
Artificial Intelligence, 16, 237–250.
Luzeaux D, (1994), “Process control and machine learning: rule-based incremental control”, IEEE Trans
on Automatic Control, 39 (6), 1166–1171.
Mahfoud S W, (1995), Niching Methods for Genetic Algorithms, Ph.D. Thesis, Department of General
Engineering, University of Illinois at Urbana-Champaign.
Majors M D and Richards R J, (1995), “A topologically-evolving neural network for robotic flexible
assembly control”, Proc. Int. Conf. on Recent Advances in Mechatronics, Istanbul, Turkey, August,
894–899.
Markham I S, Mathieu R G and Wray B A, (2000), “Kanban setting through artificial intelligence:
A comparative study of artificial neural networks and decision trees”, Integrated Manufacturing
Systems, 11 (4), 239–246.
Márkus A, Kis T, Váncza J and Monostori L, (1996), “A market approach to holonic manufacturing”,
CIRP Annals, 45 (1), 433–436.
Mathieu R G, Wray B A and Markham I S, (2002), “An approach to learning from both good and
poor factory performance in a kanban-based just-in-time production system”, Production Planning &
Control, 13 (8), 715–724.
Medsker L R, (1995), Hybrid Intelligent Systems, Kluwer Academic Publishers, Boston, 298 pp.
Michalewicz Z, (1996), Genetic Algorithms + Data Structures = Evolution Programs, 3rd Edition,
Springer-Verlag, Berlin.
Michalski R S, (1990), “A theory and methodology of inductive learning”, in Readings in Machine
Learning, Eds. Shavlik J W and Dietterich T G, Kaufmann, San Mateo, CA, 70–95.
Michalski R S and Kaufman K A, (2001), “The AQ19 system for machine learning and pattern discovery:
A general description and user guide”, Reports of the Machine Learning and Inference Laboratory,
MLI 01-2, George Mason University, Fairfax, VA, USA.
Michalski R S and Larson J B, (1983), “Incremental generation of VL1 hypotheses: The underlying
methodology and the descriptions of program AQ11”, ISG 83–5, Department of Computer Science,
University of Illinois at Urbana-Champaign, Urbana, Illinois.
Michalski R S, Mozetic I, Hong J and Lavrac N, (1986), “The multi-purpose incremental learning
system AQ15 and its testing application to three medical domains”, American Association of Artificial
Intelligence, Los Altos, CA, Morgan Kaufmann, 1041–1045.
Michalski R and Tecuci G, (1994), Machine Learning: A Multistrategy Approach, 4, Morgan Kaufmann
Publishers, San Francisco, CA, USA.
Michie D, Spiegelhalter D J and Taylor C C, (1994), Machine Learning, Neural and Statistical
Classification, Ellis Horwood, New York.
Mitchell M, (1996), An Introduction to Genetic Algorithms, MIT Press.
Monostori L, (2002), “AI and machine learning techniques for managing complexity, changes and
uncertainties in manufacturing”, Proc. 15th Triennial World Congress, Barcelona, Spain, 119–130.
Muggleton S (ed), (1992), Inductive Logic Programming, Academic Press, London, 565 pp.
Muggleton S and Feng C, (1990), “Efficient induction of logic programs”, Proc. 1st Conf. on Algorithmic
Learning Theory, Tokyo, Japan, 368–381.
Nearchou A C and Aspragathos N A, (1997), “A genetic path planning algorithm for redundant articulated
robots”, Robotica, 15 (2), 213–224.
Nurminen J K, Karonen O and Hatonen K, (2003), “What makes expert systems survive over 10-years –
empirical evaluation of several engineering applications”, Expert Systems with Applications, 24 (2),
199–211.
Ong S K, De Vin L J, Nee A Y C and Kals H J J, (1997), “Fuzzy set theory applied to bend sequencing
for sheet metal bending”, Int. J. Materials Processing Technology, 69, 29–36.
Öztürk N and Öztürk F, (2004), “Hybrid neural network and genetic algorithm based machining feature
recognition”, J. Intelligent Manufacturing, 15, 278–298.
Park M-W, Rho H-M and Park B-T, (1996), “Generation of modified cutting conditions using neural
networks for an operation planning system”, Annals of the CIRP, 45 (1), 475–478.
Peers S M C, Tang M X and Dharmavasan S, (1994), “A knowledge-based scheduling system for
offshore structure inspection”, Artificial Intelligence in Engineering IX (AIEng 9), Eds. Rzevski G,
Adey R A and Russell D W, Computational Mechanics, Southampton, 181–188.
Peng Y, (2004), “Intelligent condition monitoring using fuzzy inductive learning”, J. Intelligent
Manufacturing, 15 (3), 373–380.
Pérez E, Herrera F and Hernández C, (2003), “Finding multiple solutions in job shop scheduling by
niching genetic algorithms”, J. Intelligent Manufacturing, 14, 323–339.
Pham D T and Afify A A, (2002), “Machine learning: Techniques and trends”, Proc. 9th Inter. Workshop
on Systems, Signals and Image Processing (IWSSIP – 02), Manchester Town Hall, UK, World
Scientific, 12–36.
Pham D T and Afify A A, (2005a), “RULES-6: A simple rule induction algorithm for handling large
data sets”, Proc. of the Institution of Mechanical Engineers, Part C, 219 (10), 1119–1137.
Pham D T and Afify A A, (2005b), “Machine learning techniques and their applications in
manufacturing”, Proc. of the Institution of Mechanical Engineers, Part B, 219 (5), 395–412.
Pham D T, Afify A A and Dimov S S, (2002), “Machine learning in manufacturing”, Proc. 3rd CIRP
Inter. Seminar on Intelligent Computation in Manufacturing Engineering – (ICME 2002), Ischia, Italy,
III–XII.
Pham D T and Aksoy M S, (1994), “An algorithm for automatic rule induction”, Artificial Intelligence
in Engineering, 8, 277–282.
Pham D T and Aksoy M S, (1995a), “RULES: A rule extraction system”, Expert Systems with
Applications, 8, 59–65.
Pham D T and Aksoy M S, (1995b), “A new algorithm for inductive learning”, Journal of Systems
Engineering, 5, 115–122.
Pham D T, Bigot S and Dimov S S, (2003), “RULES-5: A rule induction algorithm for problems
involving continuous attributes”, Proc. of the Institution of Mechanical Engineers, 217 (Part C),
1273–1286.
Pham D T and Dimov S S, (1997), “An efficient algorithm for automatic knowledge acquisition”, Pattern
Recognition, 30 (7), 1137–1143.
Pham D T, Dimov S S and Salem Z, (2000), “Technique for selecting examples in inductive learning”,
ESIT 2000 European Symposium on Intelligent Techniques, Erudit Aachen Germany, 119–127.
Pham D T, Dimov S S and Setchi R M, (1999), “Concurrent engineering: a tool for collaborative working”,
Human Systems Management, 18, 213–224.
Pham D T and Hafeez K, (1992), “Fuzzy qualitative model of a robot sensor for locating
three-dimensional objects”, Robotica, 10, 555–562.
Pham D T and Karaboga D, (1994), “Some variable mutation rate strategies for genetic algorithms”,
SPRANN 94 (ibid), 73–96.
Pham D T and Karaboga D, (2000), Intelligent Optimisation Techniques: Genetic Algorithms, Tabu
Search, Simulated Annealing and Neural Networks, Springer-Verlag, London, Berlin and Heidelberg,
2nd printing, 302 pp.
Pham D T and Liu X, (1999), Neural Networks for Identification, Prediction and Control, Springer
Verlag, London, Berlin and Heidelberg, 4th printing, 238 pp.
Pham D T and Oztemel E, (1996), Intelligent Quality Systems, Springer Verlag, London, Berlin and
Heidelberg, 201 pp.
Pham D T, Packianather M S, Dimov S, Soroka A J, Girard T, Bigot S and Salem Z, (2004),
“An application of data mining and machine learning techniques in the metal industry”, Proc. 4th
CIRP Inter. Seminar on Intelligent Computation in Manufacturing Engineering (ICME-04), Sorrento
(Naples), Italy.
Pham D T and Pham P T N, (1988), “Expert systems in mechanical and manufacturing engineering”,
Int. J. Adv. Manufacturing Technology, Special Issue on Knowledge Based Systems, 3(3), 3–21.
Pham D T and Yang Y, (1993), “A genetic algorithm based preliminary design system”, Proc. IMechE,
Part D: J. Automobile Engineering, 207, 127–133.
Price C J, (1990), Knowledge Engineering Toolkits, Ellis Horwood, Chichester.
Priore P, Fuente D, Pino R and Puente J, (2003), “Dynamic scheduling of manufacturing systems using
neural networks and inductive learning”, Integrated Manufacturing Systems, 14 (2), 160–168.
Quinlan J R, (1983), “Learning efficient classification procedures and their application to chess
endgames”, In: Machine Learning: An Artificial Intelligence Approach (Michalski R S, Carbonell
J G and Mitchell T M (Eds.)), I, Tioga Publishing Co., 463–482.
Quinlan J R, (1986), “Induction of decision trees”, Machine Learning, 1, 81–106.
Quinlan J R, (1990), “Learning logical definitions from relations”, Machine Learning, 5, 239–266.
Quinlan J R, (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA.
Quinlan J R and Cameron-Jones R M, (1995), “Induction of logic programs: FOIL and related systems”,
New Generation Computing, 13, 287–312.
Ross T J, (1995), Fuzzy Logic with Engineering Applications, McGraw-Hill, New York.
RuleQuest, (2001), Data Mining Tools C5.0, Pty Ltd, 30 Athena Avenue, St Ives NSW 2075, Australia.
Available from: http://www.rulequest.com/see5-info.html.
Rzevski G, (1995), “Artificial intelligence in engineering: past, present and future”, Artificial Intelligence
in Engineering X, Eds Rzevski G, Adey R A and Tasso C, Computational Mechanics, Southampton,
3–16.
Schaffer J D, Caruana R A, Eshelman L J and Das R, (1989), “A study of control parameters affecting
on-line performance of genetic algorithms for function optimisation”, Proc. Third Int. Conf. on Genetic
Algorithms and Their Applications, George Mason University, 51–61.
Schultz G, Fichtner D, Nestler A and Hoffmann J, (1997), “An intelligent tool for determination of
cutting values based on neural networks”, Proc. 2nd World Congress on Intelligent Manufacturing
Processes and Systems, Budapest, Hungary, 66–71.
Seals R C and Whapshott G F, (1994), “Design of HDL programmes for digital systems using genetic
algorithms”, AI Eng 9 (ibid), 331–338.
Shi Z Z, Zhou H and Wang J, (1997), “Applying case-based reasoning to engine oil design”, Artificial
Intelligence in Engineering, 11 (2), 167–172.
Shigaki I and Narazaki H, (1999), “A machine-learning approach for a sintering process using a neural
network”, Production Planning & Control, 10 (8), 727–734.
Shin C K and Park S C, (2000), “A machine learning approach to yield management in semiconductor
manufacturing”, Inter. J. Production Research, 38 (17), 4261–4271.
Skibniewski M, Arciszewski T and Lueprasert K, (1997), “Constructability analysis : machine learning
approach”, ASCE J of Computing in Civil Engineering, 12 (1), 8–16.
Smith J E and Fogarty T C, (1997), “Operator and parameter adaptation in genetic algorithms”, Soft
Computing, 1 (2), 81–87.
Smith P, MacIntyre J and Husein S, (1996), “The application of neural networks in the power industry”,
Proc. 3rd World Congress on Expert Systems, Seoul, Korea, February, 321–326.
Sohen S Y and Choi I S, (2001), “Fuzzy QFD for supply chain management with reliability considera-
tion”, Reliability Eng. and Systems Safety, 72, 327–334.
Streichfuss M and Burgwinkel P, (1995), “An expert-system-based machine monitoring and maintenance
management system”, Control Eng. Practice, 3 (7), 1023–1027.
Szwarc D, Rajamani D and Bector C R, (1997), “Cell formation considering fuzzy demand and machine
capacity”, Int. J. Advanced Manufacturing Technology, 13 (2), 134–147.
Tarng Y S, Tseng C M and Chung L K, (1997), “A fuzzy pulse discriminating systems for electrical
discharge machining”, Int. J. Machine Tools and Manufacture, 37 (4), 511–522.
Teti R and Caprino G, (1994), “Prediction of composite laminate residual strength based on a neural
network approach”, AI Eng 9 (ibid), 81–88.
Tharumarajah A, Wells A J and Nemes L, (1996), “Comparison of the bionic, fractal and holonic
manufacturing system concepts”, Int. J. Computer Integrated Manufacturing, 9 (3), 217–226.
Vanegas L V and Labib A W, (2001), “A fuzzy quality function deployment (FQFD) model for deriving
optimum targets”, Int. J. Production Research, 39 (1), 99–120.
Venkatachalam A R, (1994), “Automating manufacturability evaluation in CAD systems through expert
systems approaches”, Expert Systems with Applications, 7 (4), 495–506.
Wang L X and Mendel J M, (1992), “Generating fuzzy rules by learning from examples”, IEEE Trans on
Systems, Man and Cybernetics, 22 (6), 1414–1427.
Wang W P, Peng Y H and Li X Y, (2002), “Fuzzy-grey prediction of cutting force uncertainty in
turning”, J Materials Processing Technology, 129, 663–666.
Wang C-H, Tsai C-J, Hong T-P and Tseng S-S, (2003), “Fuzzy Inductive Learning Strategies”, Applied
Intelligence, 18 (2), 179–193.
Wang X Z, Wang Y D, Xu X F, Ling W D and Yeung D S, (2001), “A new approach to fuzzy rule
generation: Fuzzy extension matrix”, Fuzzy Sets and Systems, 123 (3), 291–306.
Whitely D, (1989), “The GENITOR algorithm and selection pressure: why rank-based allocation of
reproductive trials is best”, Proc. Third Int. Conf. on Genetic Algorithms and Their Applications,
George Mason University, 116–123.
Wilde P and Shellwat H, (1997), “Implementation of a genetic algorithm for routing an autonomous
robot”, Robotica, 15 (2), 207–211.
Witten I H and Frank E, (2000), Data Mining: Practical Machine Learning Tools and Techniques with
Java Implementations, Morgan Kaufmann Publishers, USA.
Wooldridge M J and Jennings N R, (1994), “Agent theories, architectures and languages : a survey”,
Proc. ECAI 94 Workshop on Agent Theories, Architectures and Languages, Amsterdam, 1–32.
Wray B A, Rakes T R and Rees L, (1997), “Neural network identification of critical factors in dynamic
just-in-time kanban environment”, J. Intelligent Manufacturing, 8, 83–96.
Wu X, Chu C-H, Wang Y and Yan W, (2002), “A genetic algorithm for integrated cell formation and
layout decisions”, Proc. of the 2002 Congress on Evolutionary Computation (CEC-02), 2, 1866–1871.
Yano H, Akashi T, Matsuoka N, Nakanishi K, Takata O and Horinouchi N, (1997), “An expert system
to assist automatic remeshing in rigid plastic analysis”, Toyota Technical Review, 46 (2), 87–92.
Yao X, (1999), “Evolving artificial neural networks”, Proceedings of the IEEE, 87 (9), 1423–1447.
Zadeh L A, (1965), “Fuzzy Sets”, Information and Control, 8, 338–353.
Zha X F, Lim S Y E and Fok S C, (1998), “Integrated knowledge-based assembly sequence planning”,
Int. J. Adv. Manufacturing Technology, 12 (3), 211–237.
Zha X F, Lim S Y E and Fok S C, (1999), “Integrated knowledge-based approach and system for product
design and assembly”, Int. J. Computer Integrated Manufacturing, 14, 50–64.
Zhao Z Y and De Souza R, (1998), “On improving the performance of hard disk drive final assembly
via knowledge intensive simulation”, J. Electronics Manufacturing, 1, 23–25.
Zhao Z Y and De Souza R, (2001), “Fuzzy rule learning during simulation of manufacturing resources”,
Fuzzy Sets and Systems, 122, 469–485.
Zhou C, Nelson P C, Xiao W, Tirpak T M and Lane S A, (2001), “An intelligent data mining system
for drop test analysis of electronic products”, IEEE Trans on Electronics Packaging Manufacturing,
24 (3), 222–231.
Zimmermann H-J, (1996), Fuzzy Set Theory and its Applications, 3rd Edition, Kluwer Academic
Publishers, Boston.
Zülal G and Arikan F, (2000), “Application of fuzzy decision making in part-machine grouping”,
Int. J. Production Economics, 63, 181–193.
CHAPTER 2
NEURAL NETWORKS HISTORICAL REVIEW

D. ANDINA¹, A. VEGA-CORONA², J. I. SEIJAS³, J. TORRES-GARCÍA

¹ Departamento de Señales, Sistemas y Radiocomunicaciones (SSR), Universidad Politécnica
de Madrid (UPM), Ciudad Universitaria C.P. 28040, Madrid, España. andina@gc.ssr.upm.es
² Facultad de Ingeniería, Mecánica, Eléctrica y Electrónica (FIMEE), Universidad de Guanajuato
(UG), Salamanca, Gto., México. tono@salamanca.ugto.mx
³ Departamento de Señales, Sistemas y Radiocomunicaciones (SSR), Universidad Politécnica
de Madrid (UPM), Ciudad Universitaria C.P. 28040, Madrid, España. jseijas@gc.ssr.upm.es

Abstract: This chapter starts with a historical summary of the evolution of Neural Networks, from
the first models, which are very limited in application capabilities, to the present ones, which
make it possible to consider applying automatic processing to tasks that were formerly
reserved to human intelligence. After the historical review, Neural Networks are treated
from a computational point of view. This perspective helps to compare Neural Systems
with classical Computing Systems and leads to a formal and common presentation that
will be used throughout the book.

INTRODUCTION

Computers used nowadays can perform a great variety of tasks (provided they are
well defined) at a higher speed and with more reliability than human beings can.
None of us will, for example, be able to solve complex mathematical
equations at the speed of a personal computer. Nevertheless, the mental capacity
of human beings is still higher than that of machines in a wide variety of tasks.
No artificial image recognition system is able to compete with the capacity of
a human being to discern between objects of diverse shapes and orientations; in fact, it
would not even be able to compete with the capacity of an insect. In the same way,
whereas a computer requires an enormous amount of computation and restrictive
conditions to recognize, for example, phonemes, an adult human effortlessly
recognizes words pronounced by different people, at different speeds, with different
accents and intonations, even in the presence of environmental noise.
D. Andina and D.T. Pham (eds.), Computational Intelligence, 39–65.
© 2007 Springer.

It is observed that, by means of rules learned from experience, human beings
are much more effective than computers at solving imprecise (ambiguous)
problems, or problems that require a great amount of information. Our brain
achieves these objectives by means of thousands of millions of
simple cells, called neurons, which are interconnected with each other.
However, it is estimated that operational amplifiers and logic gates can
perform operations several orders of magnitude faster than neurons. If the same
processing technique used by biological elements were implemented with operational
amplifiers and logic gates, one could construct relatively cheap machines
able to process at least as much information as a biological brain does.
Of course, we are still far from knowing whether such machines will ever be
constructed.
Therefore, there are strong reasons to consider tackling certain
problems by means of parallel systems that process information and learn according to
principles taken from the brain systems of living beings. Such systems are called
Artificial Neural Networks, connectionist models or parallel distributed processing
models. Artificial Neural Networks (ANNs or, simply, NNs) thus arise from the
intention of simulating the biological brain system in an artificial way.

1. HISTORICAL PERSPECTIVE

The science of Artificial Neural Networks made its first significant appearance
during the 1940s. Researchers who tried to emulate the functions of the human brain
developed physical models (later, simulations by means of programs) of biological
neurons and their interconnections. As neurobiologists deepened their knowledge
of the human neural system, these first models came to be considered
more and more rudimentary approximations. Nevertheless, some of the results
obtained in these early times were impressive, which encouraged further research and
the development of sophisticated and powerful Artificial Neural Networks.

1.1 First Computational Model of Nervous Activity: The Model of McCulloch and Pitts

McCulloch and Pitts published the first systematic studies of artificial neural
networks [McCulloch and Pitts, 1943] [Pitts and McCulloch, 1947]. These studies
were framed as a computational model of the nervous activity of the cells of the
human nervous system.
Most of their work focuses on the behavior of a single neuron, whose mathematical
or computational model is shown in Figure 1. Inside the artificial neuron, each
input $x_i$ is multiplied by a scale factor (or weight $w_i$) and the products are summed.
The inputs emulate the excitations received by biological neurons. The weights represent
the strength of the synaptic connection: a positive weight represents an excitatory effect,
and a negative weight an inhibitory effect. If the sum exceeds a certain threshold
value or bias (represented by the weight $w_0$), the cell activates and
provides a positive value (normally +1); otherwise, the output takes
a negative value (usually −1) or zero. The output is therefore binary. In general,
the model follows neurobiological behavior: nerve cells produce nonlinear
responses when excited by a given input. In particular, McCulloch
and Pitts proposed an activation function, representing the nonlinearity of the
model, called the hard threshold function (see Figure 1).

Figure 1. Artificial model [McCulloch and Pitts, 1943] of a biological neuron. As can be observed,
the relation between the input and the output follows a nonlinear function called the activation
function. In the first model, shown in this figure, the activation is a hard threshold function
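
As a minimal sketch of the neuron just described, the following Python function computes the
weighted sum and applies the hard threshold. The function name `mcculloch_pitts` and the example
weights are illustrative assumptions, not taken from the original work; returning −1 when the
sum is exactly zero is also a convention chosen here.

```python
def mcculloch_pitts(x, w, w0):
    """Hard threshold neuron: +1 if w0 + sum_i w_i * x_i > 0, otherwise -1."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))  # linear output Z
    return 1 if z > 0 else -1                      # hard threshold f(Z)


# Illustrative (assumed) weights: with inputs in {-1, +1}, this neuron fires
# only when both inputs are +1, i.e. it behaves as a logical AND.
and_weights, and_bias = [1.0, 1.0], -1.0
for pattern in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
    print(pattern, mcculloch_pitts(pattern, and_weights, and_bias))
```

A positive weight pushes the sum towards activation (excitatory effect) and a negative one away
from it (inhibitory effect); changing `and_bias` moves the threshold.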
Although this first model can only perform very simple tasks, as will be
described below, the potential of neural systems lies essentially in the
interconnection of neurons to form networks. This interconnection is normally
arranged in layers of nodes (artificial neurons). Neural networks of this kind
are called Multi-Layer Perceptrons (MLP). In general, one distinguishes
Feed Forward Neural Networks, in which information is always transmitted
from the input layer towards the output layer, from Feedback Neural Networks,
in which information can be transmitted in both directions; that is, connections
from nodes of higher layers to nodes of lower layers are allowed. Figure 2
shows a Feed Forward Neural Network of two layers: a hidden layer (located right
after the input layer) and the output layer. The input layer is usually not
considered a proper layer of the network. Each component of the input vector
$\mathbf{x} = [1, x_1, \ldots, x_M]^T$ is connected to all the nodes of the first hidden layer.
The strengths of these connections are determined by the weight associated with each of
them. When the same scheme is applied to the rest of the network’s layers, the
network is said to be fully connected.

Figure 2. Two Layer Feed Forward Neural Networks

Proceeding chronologically, we set the Multi-Layer Perceptron (MLP) aside
for the moment. The first artificial neural network proposals were networks
of a single layer, like the one shown in Figure 2 with the output layer removed.
This parallel arrangement of the first neuron model (see Figure 1) was
suggested in order to overcome the limitations of a neuron acting alone. It is easy to
verify that the model of McCulloch and Pitts divides the input space into two parts
by means of the hyperplane described by the equation

(1)    $h(\mathbf{x}) = \sum_{j=1}^{M} w_j x_j + w_0 = 0$

This effect can be observed in Figure 3, which shows this hyperplane for the particular
case of M = 2.
A single neuron can solve two-class classification problems of M-dimensional
data, provided that the classes are linearly separable. That is, it can assign an output
equal to 1 to all the data of class “A” (which fall on the same side of the hyperplane),
whereas it assigns a value equal to −1 to the rest of the data, which fall on the opposite
side. Mathematically, we can express this classification as

(2)    $\sum_{j=1}^{M} w_j x_j \;\underset{C_A}{\overset{C_B}{\gtrless}}\; -w_0$

where $C_A$ and $C_B$ denote class A and class B, respectively.

We now have a very simple model of neuron behavior, which does not consider
many of the biological characteristics of the neuron that it tries to emulate. For example,
it does not consider the real delays existing in all inter-neural transmission (which have an
important effect on the system dynamics) or, more importantly, it does not include
effects of synchronism or frequency modulation, which are considered crucial
by many researchers.

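[Figure 3: plot in the $(x_1, x_2)$ plane of the line $h(\mathbf{x}) = 0$, that is,
$x_2 = -\frac{w_1}{w_2} x_1 - \frac{w_0}{w_2}$; the side where $h(\mathbf{x}) > 0$ is labelled Class 1
and the side where $h(\mathbf{x}) < 0$ is labelled Class 2.]

Figure 3. The hyperplane determined by the McCulloch and Pitts neuron model for the case of
two-dimensional inputs. This hyperplane depends on the neuron’s parameters (weights $w_j$ and
threshold value $w_0$) according to the mathematical expression $\sum_{j=1}^{M} w_j x_j + w_0 = 0$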

In spite of their limitations, the networks designed in this way have characteristics
traditionally restricted to biological systems. Perhaps researchers have been able to
capture the main operations of the biological neuron in this model, or perhaps the
similarities in some applications are mere coincidence. Only further research over
time will settle this question.

1.2 Training of Neural Networks: Hebbian Learning

The equation of the hyperplane boundary that characterizes the operation of the
artificial neuron depends on the synaptic weights $w_1, \ldots, w_M$ and on the threshold value
$w_0$, which is normally considered as another weight of the network. The remaining
problem is how to choose, determine or search for the appropriate
values of these weights to solve the problem at hand. This task is called learning
or training of the network.
From a historical point of view, Hebbian Learning is the oldest and one of the
most studied learning procedures. In 1949, Hebb proposed a learning model that has
given rise to many of the learning systems that exist nowadays for training neural
networks. Hebb proposed that the strength of a synaptic connection is increased
whenever the input and the output of a neuron are simultaneously active. In
this way, the network connections that are used more frequently are reinforced, emulating
the biological phenomena of habit and of learning by repetition.
A neural network is said to use Hebbian learning when it increases the value
of its weights according to the product of the excitation levels of the source
and destination neurons.
The Hebbian learning of the network is performed by means of successive
iterations using only the information of the network input and output; it never
uses the desired output or target. For this reason, this type of learning is called
unsupervised learning. This distinguishes it from other learning models that use the
additional information of the desired output values, acting as a teacher, which
are presented next.
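
The Hebbian principle can be sketched in a few lines of Python. This is only an illustration
under stated assumptions: the neuron output is taken to be the plain weighted sum of its inputs,
and the function name `hebbian_update`, the rate `eta` and the numeric values are hypothetical.

```python
def hebbian_update(w, x, eta=0.1):
    """One Hebbian step: each weight grows with the product input * output."""
    y = sum(wj * xj for wj, xj in zip(w, x))          # output (destination) activity
    return [wj + eta * xj * y for wj, xj in zip(w, x)]


w = [0.5, 0.0]            # assumed initial weights
x = [1.0, 0.0]            # only the first input is active
for _ in range(3):        # repeated presentation of the same pattern
    w = hebbian_update(w, x)
print(w)                  # the first weight has grown; the second is unchanged
```

No desired output appears anywhere in the update, which is what makes the rule unsupervised.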

1.3 Supervised Learning: Rosenblatt and Widrow

Although many learning methods following the Hebbian model have been developed,
it seems logical to expect that the most efficient results can be achieved by
those methods that use information about the desired network output (supervised learning).
Learning is thus guided towards performing a given function. Around 1960, Rosenblatt
[Rosenblatt, 1962] dedicated his efforts to developing supervised learning algorithms
to train a neural network that he called the perceptron. A perceptron is a Feed
Forward neural network like the one shown in Figure 2, where the nonlinearities of the
neurons are of the hard threshold type. Some of the common functions used as alternatives to
the hard threshold function will be shown later on. In this way, the McCulloch and
Pitts model can be considered the simplest kind of hard threshold perceptron.
Specifically, Rosenblatt showed that a one layer perceptron is able to learn
many practical functions. He proposed a learning rule for the perceptron called
the perceptron rule. Let us consider the simplest case of a one layer perceptron
composed of a single neuron, that is, the model proposed by McCulloch
and Pitts. If certain pairs of inputs and corresponding desired outputs are known,
$D_N = \{(\mathbf{x}_1, d_1), (\mathbf{x}_2, d_2), \ldots, (\mathbf{x}_N, d_N)\}$, then, for a given input
pattern $\mathbf{x}_k$ of the input data set, the perceptron rule updates the network weights
$\mathbf{w} = [w_0, w_1, \ldots, w_M]^T$ in the following way

(3)    $\mathbf{w}(k+1) = \mathbf{w}(k) + \eta\,[d_k - o_k]\,\mathbf{x}_k$

The parameter $\eta$ controls the magnitude of the updates, and hence the convergence
speed of the algorithm. It is called the learning rate and usually takes values in
the range between 0 and 1. The set $D_N$ is called the learning set and, as it includes
the values of the desired outputs, the learning is of the supervised type.
If the training data set is linearly separable, Rosenblatt
showed that the algorithm always converges in a finite number of steps, independently
of the value of $\eta$. In contrast, if the problem is not linearly separable, the algorithm
has to be forced to stop, as there will always be at least one erroneously
classified pattern.
Usually, training starts by giving small random values to the perceptron weights.
In each step of the algorithm, a new input $\mathbf{x}_k$ is applied to the network, the
corresponding output $o_k$ is calculated, and the weights are updated only if the error
$(d_k - o_k)$ is not equal to zero. It is interesting to note that if the learning rate $\eta$ has
a value close to 0, the weights vary little with each new input and
learning is slow; if the value is close to 1, there can be large differences between
the weight values of one iteration and the next, reducing the influence of past
iterations, and the algorithm might not converge. This problem is called instability.
Therefore, the learning rate should be adapted to changes in the distribution of the input
patterns, balancing training time against stable updating of the weights.
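
The procedure above can be sketched as follows. This is a minimal illustration, not Rosenblatt's
original formulation: the names `train_perceptron` and `predict`, the stopping criterion (one full
pass without errors) and the `max_epochs` limit are assumptions, and the weights start at zero
rather than at small random values so that the run is reproducible.

```python
def sign(z):
    """Hard threshold activation with outputs in {-1, +1}."""
    return 1 if z > 0 else -1


def train_perceptron(samples, eta=0.5, max_epochs=100):
    """Perceptron rule, Equation (3): w <- w + eta * (d - o) * x.

    samples: list of (x, d) pairs with d in {-1, +1}.
    Returns (w, converged), where w = [w0, w1, ..., wM].
    """
    n_inputs = len(samples[0][0])
    w = [0.0] * (n_inputs + 1)            # zeros instead of random values (assumption)
    for _ in range(max_epochs):
        errors = 0
        for x, d in samples:
            xa = [1.0] + list(x)          # constant input 1 multiplies the threshold w0
            o = sign(sum(wi * xi for wi, xi in zip(w, xa)))
            if d != o:                    # update only when the error is non-zero
                w = [wi + eta * (d - o) * xi for wi, xi in zip(w, xa)]
                errors += 1
        if errors == 0:                   # a full pass without errors: stop
            return w, True
    return w, False                       # forced stop (e.g. non-separable data)


def predict(w, x):
    return sign(sum(wi * xi for wi, xi in zip(w, [1.0] + list(x))))


# Linearly separable example (logical AND on {-1, +1} inputs).
and_data = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
w_and, ok_and = train_perceptron(and_data)
print(ok_and, [predict(w_and, x) for x, _ in and_data])

# Non-separable example (the Xor patterns of Section 1.4): training is forced to stop.
xor_data = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), 1)]
_, ok_xor = train_perceptron(xor_data)
print(ok_xor)
```

The `max_epochs` limit plays the role of the forced stop mentioned in the text for problems that
are not linearly separable.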
Also in the early 1960s, Widrow and Hoff [Widrow and Hoff, 1960] performed
several demonstrations on perceptron-like systems, which they called ADALINE
(“ADAptive LINear Elements”), proposing a learning rule called the LMS algorithm (“Least
Mean Square” algorithm) or Widrow-Hoff algorithm. This rule minimizes the Sum
of Square Errors (SSE) between the desired output and the
output given by the perceptron before the hard threshold activation function. That
is, it minimizes the error function

(4)    $E(\mathbf{w}) = \frac{1}{2} \sum_{j=1}^{N} (d_j - z_j)^2$

through a gradient algorithm. The linear output $z$ can be observed in Figure 1.
When the gradient of Equation (4) with respect to $\mathbf{w}$ is computed and the weights are
updated in the direction opposite to the gradient, the LMS rule is obtained.


(5)    $\mathbf{w}(k+1) = \mathbf{w}(k) + \eta \sum_{j=1}^{N} [d_j - z_j(k)]\,\mathbf{x}_j$

where $z_j(k) = \mathbf{w}^T(k)\,\mathbf{x}_j$. This “block” version of the LMS (in the sense that it
uses all training patterns in each iteration) is usually replaced by a “stochastic
approximation” (pattern by pattern), as shown in the equation

(6)    $\mathbf{w}(k+1) = \mathbf{w}(k) + \eta\,[d_k - z_k]\,\mathbf{x}_k$

Unlike the perceptron rule, the LMS algorithm delivers reasonable results (the
best that can be achieved by a linear discriminator in the SSE sense) even when the
training set is not linearly separable.
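
The two LMS variants of Equations (5) and (6) can be compared with a short sketch. The helper
names, the learning rates, the number of iterations and the small one-dimensional data set are
all assumptions chosen for illustration; the data alternate between the two targets along the
axis, so no single threshold separates them.

```python
def linear_output(w, x):
    """Linear output z = w^T x, with x augmented by a constant 1 for w0."""
    return sum(wi * xi for wi, xi in zip(w, [1.0] + list(x)))


def sse(w, samples):
    """Error function of Equation (4)."""
    return 0.5 * sum((d - linear_output(w, x)) ** 2 for x, d in samples)


def lms_block(samples, eta=0.05, iterations=200):
    """Block LMS, Equation (5): one update per pass, using all patterns."""
    w = [0.0] * (len(samples[0][0]) + 1)
    for _ in range(iterations):
        step = [0.0] * len(w)
        for x, d in samples:
            err = d - linear_output(w, x)
            for i, xi in enumerate([1.0] + list(x)):
                step[i] += err * xi
        w = [wi + eta * si for wi, si in zip(w, step)]
    return w


def lms_stochastic(samples, eta=0.01, epochs=200):
    """Pattern-by-pattern LMS, Equation (6)."""
    w = [0.0] * (len(samples[0][0]) + 1)
    for _ in range(epochs):
        for x, d in samples:
            err = d - linear_output(w, x)
            w = [wi + eta * err * xi for wi, xi in zip(w, [1.0] + list(x))]
    return w


# Assumed 1-D data whose targets alternate along the axis (not linearly separable).
data = [((-2.0,), -1.0), ((-1.0,), 1.0), ((1.0,), -1.0), ((2.0,), 1.0)]
w_block = lms_block(data)
w_stoch = lms_stochastic(data)
print(w_block, sse(w_block, data))
print(w_stoch, sse(w_stoch, data))
```

For this data set the block version settles at the least-squares solution, while the stochastic
version with a constant learning rate stays close to it; both reduce the SSE relative to the
all-zero starting weights.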
During these years, researchers all around the world became enthusiastic about
the application possibilities that these systems promised.

1.4 Partial eclipse of Neural Networks: Minsky and Papert

The initial euphoria aroused in the early sixties was replaced by disappointment
when Minsky and Papert [Minsky and Papert, 1969] rigorously analyzed the problem
and showed that there exist severe restrictions on the class of functions that
a perceptron can implement. One of their results shows that a one layer perceptron
with two inputs and one output is unable to implement a function as simple as the
or-exclusive (Xor). The inputs of this function take the values 1 or −1, and the
output is −1 when the two inputs are different and 1 when they are equal. This
problem is illustrated in Figure 4. It can be observed that a linear discriminator is unable
to separate the patterns of the two classes.

Figure 4. The or-exclusive (Xor) problem
[Figure 4: the four input patterns plotted in the $(x_1, x_2)$ plane, with desired outputs $d$:
(1, 1) → 1, (1, −1) → −1, (−1, 1) → −1, (−1, −1) → 1.]

This limitation was well known by the end of the sixties, and it was also known that
the problem could be solved by adding more layers to the system. As an example, let us
analyze a two layer perceptron. The first layer can classify input vectors separated
by hyperplanes. The second layer can implement the logical functions AND and
OR, because both problems are linearly separable. In this way, a perceptron like the
one shown in Figure 5 (a) can implement boundaries like the one shown in Figure 5
(b) and so solve the Xor problem. In the general case, it can be shown that a two
layer perceptron can implement simple convex and connected regions. A region is
said to be convex if any straight line segment joining two points of its boundary passes only
through points included in the region limited by the boundary. Convex regions are
limited by the (hyper)planes implemented by each node of the first layer, and can be
open or closed.
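
The following sketch shows one way a two layer perceptron with hard thresholds can solve the Xor
problem as defined above (output 1 when the inputs are equal, −1 when they differ). The weights
are hand-picked for illustration and are an assumption; they are not necessarily those of the
network in Figure 5.

```python
def sign(z):
    """Hard threshold activation with outputs in {-1, +1}."""
    return 1 if z > 0 else -1


def two_layer_xor(x1, x2):
    """Two first-layer hyperplanes combined by a second-layer logical OR."""
    h1 = sign(x1 + x2 - 1)      # first-layer node 1: fires only for (1, 1)
    h2 = sign(-x1 - x2 - 1)     # first-layer node 2: fires only for (-1, -1)
    return sign(h1 + h2 + 1)    # second layer: logical OR of h1 and h2


for x1, x2 in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
    print((x1, x2), two_layer_xor(x1, x2))
```

Each first-layer node draws one straight boundary in the input plane, and the second layer
combines the two half-planes with a linearly separable logical function.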
It has to be noted that the capabilities of Multi Layer Perceptrons rely on the
nonlinearities of their neurons. If the activation function implemented by these neurons
were linear, the capabilities of the MLP would be the same as those of the single layer
perceptron. For example, consider a two layer perceptron with threshold
value $w_0 = 0$ and a linear activation function, $f(z) = z$ (see Figure 1). In
this case, the outputs of the first layer can be expressed in matrix form as
$O_1 = W_1^T X$, and those of the second layer as $O_2 = W_2^T O_1$. Then, the output as a
function of the input is obtained as

(7)    $O_2 = W_2^T O_1 = W_2^T W_1^T X = W_{total}^T X$

[Figure 5: (a) a network with inputs $x_1$ and $x_2$, two first-layer nodes and one output $O$;
(b) the $(x_1, x_2)$ plane with the decision boundaries of node 1 and node 2 separating the
Class A and Class B patterns.]

Figure 5. (a) Two layer perceptron, able to solve the Xor problem, implementing a boundary as
shown in (b)

This function could be implemented by a single layer perceptron whose weights
were $W_{total}$. Therefore, if the nodes are linear elements, the performance of the
structure is not improved by adding new layers, as an equivalent one layer perceptron
can be found.
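
Equation (7) can be checked numerically with a small sketch. The helper names and the integer
weight matrices below are arbitrary assumptions; matrices are stored as lists of rows, with one
row per input and one column per node, so that a layer computes $W^T X$.

```python
def matvec_T(W, x):
    """Return W^T x for a matrix W stored as a list of rows (inputs x nodes)."""
    n_nodes = len(W[0])
    return [sum(W[i][j] * x[i] for i in range(len(x))) for j in range(n_nodes)]


def matmul(A, B):
    """Plain matrix product A B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


W1 = [[1, -2, 0], [3, 1, 2]]    # assumed weights: 2 inputs -> 3 linear nodes
W2 = [[2], [-1], [4]]           # assumed weights: 3 nodes -> 1 linear output
x = [5, -3]

two_layer = matvec_T(W2, matvec_T(W1, x))   # O2 = W2^T (W1^T X)
W_total = matmul(W1, W2)                    # W_total = W1 W2, so W_total^T = W2^T W1^T
one_layer = matvec_T(W_total, x)            # equivalent single layer
print(two_layer, one_layer)
```

Both computations give the same output, which is why stacking linear layers adds no
representational power.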
In spite of the possibilities opened by the MLP, Minsky and Papert, prestigious
scientists of their time, emphasized that algorithms to train such structures were
not known, and expressed scepticism that they would ever be
developed. The book by Minsky and Papert [Minsky and Papert, 1969], which showed
some critical examples of the disadvantages of NNs versus classical computers in terms
of their capabilities for storing information, was a strong blow to the enthusiasm for
NN research, eclipsing its development for the next twenty years.

1.5 Backpropagation algorithm: Werbos, Rumelhart et al. and Parker

It is true that the single layer perceptron has the limitation of being a simple
linear discriminator. There are reasons to affirm that it is only able to solve “toy”
problems. Although these limitations diminish as the number of layers increases, it is
difficult to find adequate weights to solve a given problem.
This problem was solved by incorporating “soft”, differentiable nonlinearities
in the neurons in place of the classical hard threshold. Specifically, the sigmoidal
function is very appropriate (Figure 6).
Among others, there exists an especially relevant theorem on the capabilities of
the MLP with soft activation functions, Cybenko’s Theorem [Cybenko, 1989]: a two
layer perceptron whose first layer nodes (in indefinite number) implement
sigmoidal activation functions is sufficient to approximate any continuous
correspondence between $\mathbb{R}^{N_0}$ and $[-1, 1]^{N_L}$ (therefore, it will also be possible
to establish any classification). For a first review of the capabilities of perceptrons as
“approximators” in the case of soft nonlinearities, it is worth mentioning the work of
Hornik et al. [Hornik et al., 1989, Hornik et al., 1990].

Figure 6. Multilayer Perceptron with sigmoidal nonlinearities

But, again, we must return to the question of how to train the network
weights. In a way completely analogous to the LMS algorithm described previously,
the backpropagation algorithm updates the network weights (in this case, of an MLP)
in the direction opposite to the gradient of the error function that we aim to minimize
(i.e. the SSE). For that purpose, the chain rule is applied as many times as required
to calculate that gradient for all the weights in the network. As the output is a
differentiable function, this calculation is relatively easy [Haykin, 1994].
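
A minimal sketch of this gradient computation for a two layer perceptron with a single output
follows. It assumes the hyperbolic tangent as the sigmoidal nonlinearity (outputs between −1 and
1) and the squared error of one pattern as the error function; the function names, the starting
weights and the learning rate are hypothetical choices for illustration.

```python
import math


def forward(x, W1, W2):
    """Forward pass of a two layer perceptron with tanh (sigmoidal) nodes.

    W1: one weight row [w0, w1, ..., wM] per hidden node.
    W2: weights [v0, v1, ..., vH] of the single output node.
    """
    xa = [1.0] + list(x)                                   # constant 1 for the threshold
    h = [math.tanh(sum(w * xi for w, xi in zip(row, xa))) for row in W1]
    ha = [1.0] + h
    o = math.tanh(sum(v * hi for v, hi in zip(W2, ha)))
    return xa, ha, o


def backprop(x, d, W1, W2):
    """Gradient of E = 0.5 * (d - o)^2 with respect to W1 and W2 (chain rule)."""
    xa, ha, o = forward(x, W1, W2)
    delta_out = -(d - o) * (1.0 - o * o)                   # dE/dz at the output node
    grad_W2 = [delta_out * hi for hi in ha]
    grad_W1 = []
    for j in range(len(W1)):
        hj = ha[j + 1]
        delta_j = delta_out * W2[j + 1] * (1.0 - hj * hj)  # error propagated back to node j
        grad_W1.append([delta_j * xi for xi in xa])
    return grad_W1, grad_W2


def sgd_step(x, d, W1, W2, eta=0.1):
    """Move the weights against the gradient, as in the LMS rule."""
    g1, g2 = backprop(x, d, W1, W2)
    new_W1 = [[w - eta * g for w, g in zip(row, grow)] for row, grow in zip(W1, g1)]
    new_W2 = [v - eta * g for v, g in zip(W2, g2)]
    return new_W1, new_W2


# Assumed starting weights and a single training pattern, for illustration only.
W1 = [[0.1, 0.5, -0.3], [-0.2, 0.4, 0.6]]
W2 = [0.05, 0.7, -0.8]
x, d = (1.0, -1.0), -1.0
error_before = 0.5 * (d - forward(x, W1, W2)[2]) ** 2
W1_new, W2_new = sgd_step(x, d, W1, W2)
error_after = 0.5 * (d - forward(x, W1_new, W2_new)[2]) ** 2
print(error_before, error_after)
```

The factor `1 - o * o` is the derivative of the hyperbolic tangent; a hard threshold has no
useful derivative, which is why the soft nonlinearity is needed here.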
The backpropagation algorithm was proposed independently and consecutively
by Werbos [Werbos, 1974], Rumelhart et al. [Rumelhart et al., 1986] and Parker
[Parker, 1985]. It can be said that the pessimism aroused by the book by Minsky
and Papert had its counterpart twenty years later in the development of the
backpropagation algorithm.

2. NEURAL NETWORKS VS CLASSICAL COMPUTERS

Classical digital computers process information at two basic levels: hardware
and software. The computations performed are algorithmic and sequential. Each
problem is solved through an algorithm coded in a program that is physically located
in the computer memory. Problems are solved one after the other. Algorithms are
executed as many times as needed, with the same reliability and at electronic
speed. Nevertheless, there are many real problems to which computers cannot yet be
successfully applied. For example, consider a little mosquito finding its
way to survive in the world. Such a problem is an unsolved challenge for any
automatic device. The difference probably lies in the fact that living beings
do not follow the computer processing scheme.
Biological brains process information in a massively parallel, non-sequential way.
Problems are solved by the cooperative participation of millions of highly
interconnected elementary processors, called neurons. The neurons do not need to be
programmed. From the stimuli they receive from other neurons, they are able to
modify, adapt or learn their functioning. The system does not need a central
processing unit to control its activities. It is interesting to note that biological
neural systems work at a speed several orders of magnitude lower than electronic
systems.
Therefore, the brain is an adaptive, non-linear, sophisticated processing system.
Knowledge is distributed in the activation state of the neurons, and memory is not addressed
through fixed labels. The architecture of Artificial Neural Networks tries to emulate the
basic neural features of brains, and the networks are designed by learning from examples.
They could be defined as networks that massively connect simple units (usually adaptive
units), hierarchically organized, that try to interact with real world objects in the
fashion of biological systems.
Advantages of NNs over classical computers are:
1 Adaptive Learning: they are able to learn and to perform tasks by means of an adaptive
procedure.
2 Self-organization: they are able to build their own internal organization or
representation of the information provided in a learning phase.
3 Fault Tolerance: they are able to perform the same function despite the partial
destruction of the network.
4 Real time operation: their hardware architecture is oriented to massively parallel
processing of the information.
5 Simplicity of integration with present technology: these systems can be easily
simulated on present computers and can also be implemented in specific neural
hardware, which allows their modular integration into present systems.

3. BIOLOGICAL AND ARTIFICIAL NEURONS

3.1 The Biological Neuron


The biological neuron, whose basic operation is not yet completely known or
understood, is composed of a cell body and a series of branching ramifications
called dendrites. Among all these branches, one is particularly
long and receives the name of axon. It starts from the cell body and ends in
another series of dendrites. These nerve terminals are used by neurons
to be in contact with each other by means of synaptic connections. A cell
receives signals from other cells; these signals can be excitatory or inhibitory.
When their global effect is an excitation that exceeds a certain threshold value, the cell
responds by transmitting a nervous signal through the axon to the adjacent cells
by means of the synapses at its nerve terminals.
The human nervous system is made up of these cells and is of fascinating complexity.
It is estimated that $10^{11}$ neurons participate in more than $10^{15}$ interconnections
over channels that can measure more than a meter. Studies of human brain anatomy
conclude that there are more than 1000 synapses at the input and output of each
neuron. It is important to note that, although the switching time of a neuron
(a few milliseconds) is almost a million times slower than that of current computer
elements, biological neurons have a much higher connectivity (thousands
of times higher) than current supercomputers.
Neurons are composed of the cell core, or soma, and several branches called the
axon and the dendrites. The dendrites of different neurons are connected at what are
called synapses, which establish the connection with neighboring
neurons and make communication among them possible.
Each neuron has two basic states: activation and rest. When a neuron is activated, it emits through the axon a train of electrical impulses whose frequency depends on its level of activation. Information is coded in the frequency of the impulses and not in their amplitude.
The signal produced in the neuron body propagates along the axon to other neurons through chemical exchanges that take place in the synapses.
The chemical substances released at the synapses are called neurotransmitters, and they increase or inhibit the activation level of the neuron that receives them. Due to the action of the neurotransmitters, which are basically chemical signals, ionic channels are opened in the receiving neuron and ions flow in, contributing to the overall electrical charge of the neuron, or excitation level.
When the excitation level surpasses a certain activation threshold, the neuron is activated. The efficiency of the synapse depends on several factors: the number of neurotransmitter glands, the concentration in the membrane of the neighboring neuron, the efficiency of the ionic channels and other physical and chemical variables.
As to the learning procedure, recent discoveries suggest that it is also of electrochemical nature, taking place among neighboring neurons that are hierarchically close in a layered structure. The chemical released in the learning process seems to be nitric oxide (NO). Its molecules are able to go through the membrane and travel to the neighboring neurons, controlling the efficiency of the connection by reacting with other chemicals in the receiving neuron.
This regulation of the efficiency of the electrochemical connection among neighboring neurons is responsible for the learning procedure.

3.2 The Artificial Neuron


The simplest model of an artificial neuron, as presented in Figure 1, is obtained by approximating the action of all neuron inputs by a linear function. Let us call this function the Base Function, $u(\cdot)$. In this case, the Base Function is a weighted sum

$$u = w_0 + \sum_{j=1}^{n_i} w_j x_j$$

where $w_0$ is a threshold and $w_j$ are the synaptic weights, which correspond to the effect of the inputs on the activation function.
The output function of an artificial neuron can be expressed as

$$y = f(x) = f\left(w_0 + \sum_{j=1}^{n_i} w_j x_j\right)$$

In an artificial neuron, this function can be computed in three steps: weighting the input values $x_j$ by the synaptic weights $w_j$, adding them together with the threshold value $w_0$ to obtain the base function value $u$, and applying a non-linear activation function $f(u)$.
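The computation of the neuron output can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `base_function`, `logistic` and `neuron_output`, the choice of a logistic activation and the numeric values are assumptions made for the example and do not come from the text.

```python
import math

def base_function(weights, inputs, w0):
    """Base function u = w0 + sum_j w_j * x_j (weighted sum plus threshold)."""
    return w0 + sum(w * x for w, x in zip(weights, inputs))

def logistic(u, beta=1.0):
    """One possible non-linear activation (assumed here for illustration)."""
    return 1.0 / (1.0 + math.exp(-beta * u))

def neuron_output(weights, inputs, w0, activation=logistic):
    """Output y = f(u) of a single artificial neuron."""
    return activation(base_function(weights, inputs, w0))

# Hypothetical weights, inputs and threshold.
u = base_function([0.5, -0.25], [2.0, 4.0], 0.1)
y = neuron_output([0.5, -0.25], [2.0, 4.0], 0.1)
print(u, y)
```

Passing the activation as a parameter keeps the base function separate from the non-linearity, which mirrors the split made in the text.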
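Typical activation functions, some of which are shown in Figure 7, are:
• Step function

$$u(t) = \begin{cases} 0 & \text{if } t < 0 \\ 1 & \text{if } t \ge 0 \end{cases}$$

• Sign function

$$\mathrm{sgn}(t) = \begin{cases} -1 & \text{if } t < 0 \\ 1 & \text{if } t \ge 0 \end{cases}$$

• Gaussian function

$$f(x) = a\,e^{-x^2/2}$$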

Figure 7. Some typical activation functions. Panels: sign function, f(x − a) = 1 if x ≥ a and −1 if x < a; hyperbolic function, f(x) = tanh(βx), β > 0.

• Exponential function

$$f(x) = \frac{1}{1 + e^{-\beta x}}, \quad \beta > 0$$

• Hyperbolic function

$$f(x) = \tanh(\beta x), \quad \beta > 0$$

Hyperbolic and exponential functions are classified as sigmoids or sigmoidal functions. They are real-valued, bounded and monotonic functions ($f'(x) > 0$).
In the case of sigmoidal functions, the value of the slope at the origin is called the gain, and it measures the steepness of the transition. Therefore, if the gain tends to infinity, the sigmoid tends to a Sign function. Accordingly, the Exponential and Hyperbolic functions have a gain of $\beta/4$ and $\beta$, respectively.
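The gain values can be checked numerically. The sketch below implements the activation functions listed above and estimates the slope at the origin with a central difference. The helper `slope_at_origin`, the step size `h` and the value β = 3 are assumptions chosen only for the illustration.

```python
import math

def step(t):
    return 1.0 if t >= 0 else 0.0

def sign(t):
    return 1.0 if t >= 0 else -1.0

def exponential(x, beta=1.0):
    """Exponential (logistic) sigmoid with steepness parameter beta."""
    return 1.0 / (1.0 + math.exp(-beta * x))

def hyperbolic(x, beta=1.0):
    """Hyperbolic tangent sigmoid with steepness parameter beta."""
    return math.tanh(beta * x)

def slope_at_origin(f, h=1e-6):
    """Central-difference estimate of f'(0)."""
    return (f(h) - f(-h)) / (2.0 * h)

beta = 3.0
gain_exponential = slope_at_origin(lambda x: exponential(x, beta))  # close to beta / 4
gain_hyperbolic = slope_at_origin(lambda x: hyperbolic(x, beta))    # close to beta
print(gain_exponential, gain_hyperbolic)
```

With a large β (for instance β = 50), `hyperbolic(0.1, 50)` is already very close to 1 and `hyperbolic(-0.1, 50)` to −1, which illustrates how the sigmoid approaches the sign function as the gain grows.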
As assumed in the previous point, the activation function of a neuron is non-linear. If the function $f(u)$ is linear, $f(u) = u$, then the artificial neuron is called a Linear Neuron or Linear Node of the NN.

4. NEURAL NETWORKS: CHARACTERISTICS AND TAXONOMY


A Neural Network can be represented as an ordered pair (G, E), composed of a set of basic processing elements G, also called processing units, artificial neurons or nodes, and a set of interconnections E among them. The set of nodes G is partitioned into different sets called layers.
Each processing unit always has a transfer function and can also have a local memory. The output y is computed by applying this function to the weighted input values and the values stored in the local memory.
There are four main aspects that characterize all NNs:
a) Data Representation. According to the input-output form, NNs can be classified as continuous type NNs, digital NNs or hybrid NNs.
In the continuous type, input-output data are of analog nature: their values are real and continuous. In digital NNs, input-output data are of digital nature. In the hybrid case, inputs are analog and outputs are binary.

b) Topology. The architecture or topology of the NN refers to the way the nodes are physically arranged in the network. The nodes form layers, or groups of nodes that share a common input and feed their output to common nodes.
Only neurons in the input and output layers interact with external systems. The rest of the nodes in the network have only internal connections, forming what are called hidden layers.
Therefore, the topology of a NN is characterized by the number of layers, the number of neurons in each layer, the degree of connectivity and the type of connections among the nodes.
c) Input-Output Association. With respect to the type of input-output association, NNs can be classified as heteroassociative or autoassociative.
Heteroassociative NNs implement a certain function, frequently one that is difficult to express analytically. They associate a set of inputs with a set of outputs in such a way that each input has a corresponding output.
Autoassociative NNs aim to rebuild input information that has been corrupted, by associating with each input the most similar stored data.
d) Learning Procedure. All the connections or synapses of the nodes in a NN have an associated synaptic weight, or efficiency factor. Each connection between node i and node j is weighted by $w_{ji}$. These weights are responsible for the learning of the neural network.
In the learning phase, the NN modifies its weights in response to new input information. Weights are modified following a convergent algorithm, so that when all the weight values have stabilized and the learning phase ends, it is said that the NN has “learnt”.
For the learning process, it is crucial to establish the weight updating algorithm that lets the NN correctly learn the new input information. According to the learning criterion, NNs can be classified as supervised learning NNs or unsupervised learning NNs.
Figure 8 represents the most common classification of NNs.

5. FEED FORWARD NEURAL NETWORKS: THE PERCEPTRON

First presented in Section 1.1, Feed Forward Neural Networks are generally defined as those networks composed of one or more layers whose nodes are connected in such a way that their inputs come only from nodes in the previous layer and their outputs connect exclusively to neurons of the following layer. Their name comes from the fact that the output of each layer feeds the units of the following layer.
Of all feed forward NNs, the most popular is the Multilayer Perceptron, developed as an extension of the Perceptron proposed by Rosenblatt in 1962 [Rosenblatt, 1962].
In this type of network, the learning is supervised because it uses information about the output that the network must provide for the current input. The learning phase or training phase consists in presenting to the network a set of input-output pairs, called training patterns,

$$D_N = \{(x_1, d_1), (x_2, d_2), \ldots, (x_N, d_N)\}$$

with $x_i \in \mathbb{R}^p$ and $d_i \in \mathbb{R}^k$, $i = 1, 2, \ldots, N$, in such a way that the weights are adjusted.

Figure 8. Neural Networks basic taxonomy


Once the training phase is completed, the network is designed and ready to work in what is called the direct mode phase. In this phase, the network classifies the inputs by the following binary decision rule

$$g = \begin{cases} 1 & \text{if } \varphi(x, w) > 0 \\ 0 & \text{if } \varphi(x, w) < 0 \end{cases}$$

where $\varphi(x, w)$ is the discriminating function, that is, the space $\mathbb{R}^p$ is divided into two regions by the decision boundary $\varphi(x, w) = 0$.
Logically, the choice of the discriminating function $\varphi(x, w)$ depends on the distribution of the training patterns.

5.1 One Layer Perceptron


It basically consists of a set of nodes whose activation is produced by the weighted sum of the input values and, consequently, the discriminating function takes the form

$$\varphi(x, w) = \sum_{i=1}^{p} w_i x_i + \theta = 0 \qquad (8)$$

Also, if we make $\theta = w_0$ and consider the inputs in the space $\mathbb{R}^{p+1}$, such that $x = (x_1, x_2, \ldots, x_p, 1)$ and $w = (w_1, w_2, \ldots, w_p, w_0)$, Equation (8) can be expressed as

$$\varphi(x, w) = w x^T = 0$$
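The equivalence between the two forms of the discriminating function can be illustrated with a short sketch. The weights below (a unit weight on each input and a threshold term of −1.5) are hypothetical values chosen so that the boundary separates the point (1, 1) from the other binary points. Assigning the boundary case to class 0 is also an assumption, since the decision rule leaves it undefined.

```python
def discriminant(weights, theta, x):
    """Weighted sum of the inputs plus the threshold term, as in Equation (8)."""
    return sum(w * xi for w, xi in zip(weights, x)) + theta

def discriminant_augmented(w_aug, x):
    """Same function with the threshold absorbed as the last weight,
    acting on the input extended with a constant 1."""
    x_aug = list(x) + [1.0]
    return sum(w * xi for w, xi in zip(w_aug, x_aug))

def decide(value):
    """Binary decision: 1 on the positive side of the boundary, else 0.
    (Assumption: points exactly on the boundary are assigned to 0.)"""
    return 1 if value > 0 else 0

weights, theta = [1.0, 1.0], -1.5          # hypothetical values
w_aug = weights + [theta]
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
decisions = [decide(discriminant_augmented(w_aug, p)) for p in points]
print(decisions)
```

Working in the augmented space means that a single dot product covers both the weights and the threshold, which simplifies the training rules.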
Among other things, it serves to perform the pattern classification task, through a discriminating function of the form [Karayiannis and Venetsanopoulos, 1993], [Hush and Horne, 1993]:

$$u_k(x_n) = \sum_{j=0}^{N} w_{kj} x_{nj}$$

The classification rule assigns class k to the input pattern if the k-th network output is the highest of all outputs. The network must be trained, following an appropriate algorithm, to produce the desired output for each pattern:

$$u_k(x_n) \ge u_j(x_n) \;\; \forall j \ne k \;\Longrightarrow\; x_n \in W_k$$

This decision rule is sometimes replaced by a binary decision rule with a decision threshold. The Perceptron is a system that operates in this way. After the learning or training, the Perceptron structure can separate the classification space into regions, one region for each class.
The decision boundaries are composed of hyperplane segments defined as:

$$u_k(x_n) - u_j(x_n) = 0$$
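The multi-class rule amounts to evaluating one linear discriminant per class and picking the largest. A minimal sketch follows. The weight matrix `W`, the convention that the first input component is a constant 1 and the function names are assumptions made for the example.

```python
def class_scores(W, x):
    """u_k(x) = sum_j w_kj * x_j for every output k.
    W[k][j] is the weight from input j to output k; x[0] is a constant 1."""
    return [sum(wkj * xj for wkj, xj in zip(row, x)) for row in W]

def classify(W, x):
    """Assign the class whose output is the highest of all outputs."""
    scores = class_scores(W, x)
    return max(range(len(scores)), key=lambda k: scores[k])

# Hypothetical weights for three classes and two input features.
W = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, -1.0, -1.0],
]
labels = [classify(W, x) for x in ([1.0, 2.0, 0.5], [1.0, 0.2, 0.9], [1.0, -1.0, -1.0])]
print(labels)
```

The boundary between classes k and j is the set of inputs where the two scores are equal, which is a hyperplane because each score is linear in the input.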
The Perceptron was initially proposed by Rosenblatt and a group of his students. Their work showed the versatility of the Perceptron. Unfortunately, the problem of linear separability caused interest in its use to decline.

5.1.1 Perceptron Training


It can be summarized in five steps:
1. Weights and threshold initialization. Each of the weights $w_i$ is initialized to a small random value, with $w_0 = \theta$.
2. Presentation of the training patterns. For $i = 1, 2, \ldots, N$, a new input/output training pair is presented, composed of a new input $X_p = (x_1, x_2, \ldots, x_N)$ and its corresponding desired output $d(t)$.
3. Computation of the present output

$$y_i(t) = f\left(\sum_{j=1}^{M} w_{ij} x_j(t) - \theta_i\right) = f\left(\sum_{j=0}^{M} w_{ij} x_j(t)\right) = f(\mathrm{Net}_i)$$

where, in the second expression, the threshold is absorbed as the weight $w_{i0} = \theta_i$ of a constant input $x_0(t) = -1$.
4. Weight adaptation: $\Delta W_i = \eta\,[d(t) - y(t)]\,x_i(t)$.
• $\eta$: learning rate, $0 < \eta < 1$.
• $d(t)$: desired output; $y(t)$: present output.
• This process is repeated until the error $e(t) = d(t) - y(t)$ for each of the patterns is zero or less than a preset value.
5. Back to step 2.
The convergence of the perceptron training is established by the following theorem: if the training set of a multiple classification problem is linearly separable, then the perceptron training algorithm converges to a correct solution in a finite number of iterations. The mathematical proof of this theorem can be found in [Rosenblatt, 1962], and its significance lies in the fact that a multiple class problem can be reduced to a binary classification. Two typical examples of this situation are shown in Figure 9.
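The training steps above can be sketched for the two linearly separable problems mentioned, the logical OR and AND functions. Several choices in this sketch are assumptions made for the illustration: the weights start at zero rather than at small random values, so that the run is reproducible; the threshold is handled as a bias weight on a constant input of +1; the activation is a step function; and η = 0.5.

```python
def step(u):
    return 1 if u >= 0 else 0

def predict(w, x):
    """Output of the perceptron; w[0] acts on a constant input of 1."""
    x_aug = [1.0] + list(x)
    return step(sum(wi * xi for wi, xi in zip(w, x_aug)))

def train_perceptron(patterns, eta=0.5, max_epochs=100):
    """Repeats the present / compute / adapt cycle until every pattern
    is classified without error or max_epochs is reached."""
    w = [0.0] * (len(patterns[0][0]) + 1)
    for _ in range(max_epochs):
        errors = 0
        for x, d in patterns:
            e = d - predict(w, x)                      # e(t) = d(t) - y(t)
            if e != 0:
                errors += 1
                x_aug = [1.0] + list(x)
                w = [wi + eta * e * xi for wi, xi in zip(w, x_aug)]
        if errors == 0:
            break
    return w

OR_PATTERNS = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
AND_PATTERNS = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

w_or = train_perceptron(OR_PATTERNS)
w_and = train_perceptron(AND_PATTERNS)
print(w_or, w_and)
```

Because both problems are linearly separable, the convergence theorem guarantees that the loop stops with zero errors after a finite number of passes.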

6. LMS LEARNING RULE


Figure 9. Logical functions OR and AND reduced to a binary classification problem (panels: OR-function and AND-function; axes x1 and x2; input points 00, 01, 10 and 11)

Nevertheless, even with the simple Perceptron structure, a reasonable solution can be achieved for a set that does not satisfy the linear separability property, by using the Least Mean Square (LMS) convergence algorithm to update the NN weights during the learning phase. In general, the error function of Equation (4), also called cost function or objective function, to be minimized by the LMS algorithm can be expressed as follows [Hush and Horne, 1993]:

$$E = \sum_{k=1}^{M} \sum_{x_n \in W_k} \left\| u(x_n) - \delta_k \right\|^2$$

where $\delta_k$ is a vector with all its components of zero value except the k-th one, which corresponds to the correct classification.
Therefore, for a given training set $D_N$, if $d_k$ is the desired output for the k-th input vector and $y_k$ is the computed value, then the Mean Square Error (MSE) over the input-output pairs is given by

$$\langle \varepsilon_k^2 \rangle = \frac{1}{N}\sum_{k=1}^{N} \varepsilon_k^2 = \frac{1}{N}\sum_{k=1}^{N} (d_k - y_k)^2$$

or, in vector notation, with $y_k = w^T x_k$,

$$\langle \varepsilon_k^2 \rangle = \langle d_k^2 \rangle - 2\,w^T \langle d_k x_k \rangle + w^T \langle x_k x_k^T \rangle\, w$$

The minimum square error corresponds to the weight vector w that satisfies the equation

$$\frac{\partial \langle \varepsilon_k^2 \rangle}{\partial w} = 0$$
In the two-weight case (N = 2), the equation describes an error paraboloid, as shown in Figure 10. From Figure 10 it can be observed that the optimum value for the weights of the network is the one that makes the gradient null.
A possible search procedure is steepest descent. The gradient direction is perpendicular to the contour lines at each point of the error surface. At the starting point of the algorithm, the descent direction does not point directly at the minimum, except in the case of circular level curves. The weight updates in each iteration step must be small, or the weight vector could wander over the hypersurface without ever reaching the desired minimum.
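A small numerical sketch may help to see the descent on the error paraboloid. It uses a linear neuron with two weights and applies a small correction proportional to the error after each pattern, in the style of the Widrow-Hoff rule. The training samples, generated from the hypothetical target weights (2, −1), the learning rate and the number of passes are all assumptions chosen for the illustration.

```python
def linear_output(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def mean_square_error(w, samples):
    return sum((d - linear_output(w, x)) ** 2 for x, d in samples) / len(samples)

def lms_train(samples, eta=0.1, epochs=200):
    """Small steps against the gradient of the squared error of each pattern."""
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for x, d in samples:
            e = d - linear_output(w, x)
            w = [wi + eta * e * xi for wi, xi in zip(w, x)]
    return w

# Desired outputs generated by d = 2 * x1 - 1 * x2 (hypothetical target).
samples = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0), ((2.0, 1.0), 3.0)]
initial_error = mean_square_error([0.0, 0.0], samples)
w = lms_train(samples)
final_error = mean_square_error(w, samples)
print(w, initial_error, final_error)
```

With a learning rate that is too large for the input magnitudes, the same loop would oscillate or diverge instead of settling at the bottom of the paraboloid, which is why the text asks for small update steps.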

6.1 The Multilayer Perceptron

A Perceptron of n layers is composed of n + 1 layers $L_l$ ($l = 0, 1, \ldots, n$), each with several processing units, where $L_0$ is the input layer, $L_n$ is the output layer and $L_l$ ($l = 1, \ldots, n - 1$) are the hidden layers.
The nodes in the hidden and output layers are individual processing units. The overall output of each unit is obtained by adding all its weighted inputs and passing the result through a non-linear function of sigmoidal type (see Figure 6).

Figure 10. Error Paraboloid of the LMS learning (horizontal axes x and y, from −2 to 2; vertical axis from 0 to 80)

Usually, in a Multilayer Perceptron, the nodes in each layer are fully interconnected with the neurons in the adjacent layer. This pattern is repeated layer by layer throughout the network.
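The layer-by-layer computation of a fully connected Multilayer Perceptron can be sketched as follows. The data layout is an assumption made for the example: each layer is a list of rows, one per node, and each row holds a bias weight followed by one weight per node of the previous layer. The sigmoid is taken with unit steepness.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(layers, x):
    """Returns the outputs of every layer, starting with the input layer.
    layers[l][j] = [bias, w_1, ..., w_n] for node j of layer l + 1."""
    outputs = [list(x)]
    for W in layers:
        prev = [1.0] + outputs[-1]          # constant input for the bias weight
        outputs.append([sigmoid(sum(w * u for w, u in zip(row, prev))) for row in W])
    return outputs

# Hypothetical 2-2-1 network.
layers = [
    [[0.1, 0.4, -0.3], [-0.2, 0.25, 0.6]],  # hidden layer: 2 nodes
    [[0.05, -0.5, 0.7]],                    # output layer: 1 node
]
outputs = forward(layers, [1.0, 0.5])
print(outputs[-1])
```

Keeping the outputs of all layers, and not only the last one, is convenient because the training algorithm needs the intermediate values.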

6.1.1 Learning Algorithm (“Backpropagation”)


Before detailing the learning algorithm, let us introduce the following nomenclature:

$u_{lj}$: output of the j-th node in layer l.
$w_{lji}$: weight that connects the i-th node in layer l − 1 to the j-th node in layer l.
$x_p$: p-th training pattern.
$u_{0i}$: i-th component of the input vector.
$d_j(x_p)$: desired output of the j-th node in the output layer when the p-th pattern is presented at the network input.
$N_l$: number of nodes in layer l.
L: number of layers.
P: number of training patterns.

Obviously, in a Perceptron-like structure, outputs depend upon the synaptic weights that connect neurons in the different layers. Such weights are updated in the following way:
1. Associating a set of input patterns to a set of desired outputs. In a pattern classification problem, this amounts to the designer making a primary classification of the patterns (supervised training).
2. Presenting all training patterns to the network. The network then processes all patterns and presents an output. The classification offered by the network can be erroneous, and the error is easily quantified.
3. Defining an objective function. For example, the Mean Square Error (MSE) between the desired and real outputs of the units in the output layer [Hush and Horne, 1993]:

$$J_p(w) = \frac{1}{2} \sum_{q=1}^{N_L} \left( u_{Lq}(x_p) - d_q(x_p) \right)^2$$
This objective function represents an error surface in the parametric hyperspace of the weights. The training or learning then consists in searching for the minimum of that surface through a gradient descent algorithm, which moves in the direction opposite to the surface gradient to find a set of weights that minimizes the error.
Each weight is modified or adapted in each iteration step by an amount proportional to the partial derivative of the function with respect to that weight

$$w_{lji}(k+1) = w_{lji}(k) - \eta\,\frac{\partial J_p(w)}{\partial w_{lji}} \qquad (9)$$

In Equation (9), the constant $\eta$ is the learning rate. The speed of convergence of the algorithm depends on $\eta$, because the amount of the weight modification in each iteration step is proportional to the gradient in the weight direction, weighted by the constant value of the learning rate. At this point, the training algorithm can be designed if we know how to calculate the partial derivative with respect to each weight of the network. This derivative can be easily calculated using the chain rule:

$$\frac{\partial J_p(w)}{\partial w_{lji}} = \frac{\partial J_p(w)}{\partial u_{lj}}\,\frac{\partial u_{lj}}{\partial w_{lji}}$$

that is,

$$\frac{\partial J_p(w)}{\partial w_{lji}} = \frac{\partial J_p(w)}{\partial u_{lj}}\; f'\!\left(\sum_{m=0}^{N_{l-1}-1} w_{ljm}\, u_{l-1,m}\right) u_{l-1,i}$$

where $f(\cdot)$ represents the sigmoidal function previously defined. This function has a very simple derivative:

$$f'(x) = \frac{df(x)}{dx} = f(x)\,(1 - f(x))$$

when the parameter $\beta$ is of unit value.
In this expression we can observe that the sensitivity of the objective function to each weight depends on the sensitivity of this function to the output of the neuron that is fed by the synaptic weight.
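The chain rule above, applied from the output layer backwards, can be sketched for a small network and checked against a finite-difference estimate of the gradient. The weight layout (one row per node, bias weight first), the 2-2-1 network, the numeric values and the helper names are assumptions made for this illustration. The sigmoid has unit steepness, so its derivative is f(1 − f).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(layers, x):
    outputs = [list(x)]
    for W in layers:
        prev = [1.0] + outputs[-1]
        outputs.append([sigmoid(sum(w * u for w, u in zip(row, prev))) for row in W])
    return outputs

def cost(layers, x, d):
    """J = 1/2 * sum_q (u_Lq - d_q)^2 for one pattern."""
    return 0.5 * sum((u - dq) ** 2 for u, dq in zip(forward(layers, x)[-1], d))

def backprop(layers, x, d):
    """Partial derivatives of J with respect to every weight."""
    outputs = forward(layers, x)
    sens = [u - dq for u, dq in zip(outputs[-1], d)]   # dJ/du at the output layer
    grads = [None] * len(layers)
    for l in range(len(layers) - 1, -1, -1):
        prev = [1.0] + outputs[l]
        # dJ/du_j times f'(net_j), with f' = f * (1 - f)
        delta = [s * u * (1.0 - u) for s, u in zip(sens, outputs[l + 1])]
        grads[l] = [[dj * ui for ui in prev] for dj in delta]
        # sensitivities to the outputs of the previous layer (bias column skipped)
        sens = [sum(delta[j] * layers[l][j][i + 1] for j in range(len(delta)))
                for i in range(len(outputs[l]))]
    return grads

def numerical_gradient(layers, x, d, l, j, i, h=1e-6):
    original = layers[l][j][i]
    layers[l][j][i] = original + h
    plus = cost(layers, x, d)
    layers[l][j][i] = original - h
    minus = cost(layers, x, d)
    layers[l][j][i] = original
    return (plus - minus) / (2.0 * h)

layers = [[[0.1, 0.4, -0.3], [-0.2, 0.25, 0.6]], [[0.05, -0.5, 0.7]]]
x, d = [1.0, 0.5], [1.0]
grads = backprop(layers, x, d)
max_diff = max(abs(grads[l][j][i] - numerical_gradient(layers, x, d, l, j, i))
               for l in range(len(layers))
               for j in range(len(layers[l]))
               for i in range(len(layers[l][j])))

eta = 0.5  # learning rate of Equation (9), hypothetical value
updated = [[[w - eta * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(W, G)] for W, G in zip(layers, grads)]
print(max_diff, cost(layers, x, d), cost(updated, x, d))
```

The finite-difference comparison is a common way to check a hand-written gradient before using it for training.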

This last sensitivity can in turn be calculated from the sensitivities of the objective function with respect to the node outputs of the following layer, and so on [Hush and Horne, 1993]. This process is repeated until the output layer is reached.
The sensitivity of the objective function to each node output can therefore be calculated recursively, starting from the output layer. The sensitivities to the outputs of nodes in hidden layers are also called “errors” although, strictly speaking, they do not represent a real error. In order to calculate the error in the hidden layers, the error in the output layer must be computed and backpropagated to the previous layers. That is performed by the Backpropagation algorithm.
In this algorithm, training usually starts with small random values of the synaptic weights in order to provide a safe starting point for the backpropagation algorithm. Once the structure of the network is chosen, the key parameter to be controlled is the learning rate. A value that is too small will slow the learning process. A value that is too high will accelerate the learning, but can cause the algorithm to miss the minimum of the error surface. The optimal value of this parameter has to be found empirically. Once the learning has started, it must continue until a minimum error is found, or until the weight values no longer change. At that point, the network is said to have finished learning. It is not always practical to wait until this point, and several other stopping criteria are adopted, among them:
1. When the gradient of the error surface is sufficiently small, which means that the gradient learning algorithm has found a set of weight values at a local minimum of the error surface.
2. When the error between the real network output and the desired one is under a certain tolerable value for our application. Obviously, this case requires knowledge of the maximum tolerable error for the given application.
3. In pattern classification problems, when all the learning patterns have been correctly classified, the training procedure can be stopped.
4. Training can be stopped after a fixed number of iterations.
5. Finally, a more appropriate and developed procedure is to train the network with one set of patterns and monitor the error over a different set, called the test set. The training phase is stopped when a minimum error on the test set is found.
This last method prevents the overspecialization of the network on the training set, a phenomenon that occurs when the error on the training set is lower than the error over another set of patterns of the same application, showing that the network has lost generalization capabilities. The method requires twice the number of patterns, which can be expensive or even impossible. Therefore, in order to apply neural networks efficiently to real problems, it is very important to have a sufficient number of patterns.
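The last stopping criterion can be sketched as a small helper that scans the error measured on the test set after each training pass and reports where the minimum was found. The `patience` parameter, which decides how many passes without improvement are tolerated before stopping, is an assumption added for the illustration and is not part of the description above. The error values are invented.

```python
def early_stopping_epoch(test_errors, patience=3):
    """Returns (epoch, error) of the minimum test error seen before the
    error failed to improve for `patience` consecutive epochs."""
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(test_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            break                      # stop: no improvement for `patience` epochs
    return best_epoch, best_error

# Hypothetical test-set errors recorded after each training epoch.
errors = [0.9, 0.6, 0.45, 0.40, 0.42, 0.41, 0.47, 0.5, 0.3]
stop = early_stopping_epoch(errors)
print(stop)  # (3, 0.4)
```

In this invented sequence the scan stops at epoch 6, so the later value 0.3 is never seen. This shows the trade-off of the criterion: a small `patience` saves training time but may stop before a later, deeper minimum.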

6.2 Acceleration of the training procedure

The training procedure described in the previous section presents two main problems: on the one hand, the convergence or training phase is very slow and, on the other hand, it is not easy to choose the appropriate learning rate precisely. A simple solution to accelerate the network training is the use of second order methods, which use the information contained in the matrix of second derivatives (Hessian). These methods reduce the number of iterations needed in the training phase to reach a local or global minimum of the error surface. Nevertheless, they require more computation per iteration, and this increases the training time. For this reason, usually only the diagonal of the Hessian is used.
Another solution is to reinforce the gradient step by adding a term that is a fraction of the past changes in the weights. This term, usually known as the momentum term, is weighted by a new constant, usually designated by $\alpha$:

$$w_{kj}(k+1) = w_{kj}(k) - \eta\,\frac{\partial J(w)}{\partial w_{kj}} + \alpha\,\Delta w_{kj}(k)$$

This term tends to smooth the changes in the weights, increasing the learning speed by avoiding divergent learning fluctuations.
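The effect of the momentum term can be illustrated on a simple quadratic cost with an elongated error surface. The cost function J(w) = ½(w₁² + 10 w₂²), the starting point and the constants used for the learning rate and the momentum are assumptions chosen only for this sketch.

```python
def gradient(w):
    """Gradient of the illustrative cost J(w) = 0.5 * (w1^2 + 10 * w2^2)."""
    return [w[0], 10.0 * w[1]]

def cost(w):
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)

def descend(w, eta, alpha, steps):
    """Gradient descent where each change adds a fraction alpha of the
    previous change (alpha = 0 gives plain gradient descent)."""
    delta = [0.0] * len(w)
    for _ in range(steps):
        g = gradient(w)
        delta = [-eta * gi + alpha * di for gi, di in zip(g, delta)]
        w = [wi + di for wi, di in zip(w, delta)]
    return w

start = [4.0, 1.0]
plain = descend(start, eta=0.05, alpha=0.0, steps=100)
with_momentum = descend(start, eta=0.05, alpha=0.5, steps=100)
print(cost(plain), cost(with_momentum))
```

On this surface the plain descent is slowed by the shallow direction, while the accumulated past changes let the momentum version advance faster along it for the same learning rate.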
It has been shown that adding noise to the training patterns decreases the training time and helps to avoid local minima in the learning process.
Another way to decrease the training time consists in the use of alternative transfer functions in the network nodes. When a function is allowed to take positive and negative values in a symmetric dynamic range, it is probable that several activations will be close to zero and their corresponding weights will not need to be updated. An example of this type of activation function is the hyperbolic one.
In Table 1, typical parameters of this kind of network and their influence on the processing are summarized.

6.3 On-Line and Off-Line training


During the training, weight updates can be carried out in two different ways [Bourlard and Morgan, 1994]:
• Off-line or “Block training”: in this case, the modifications of the weights over the whole training set are accumulated. The weights are modified only when all the training patterns have been presented to the network.
• On-line training: the network weights are modified each time a new training pattern is presented to the network. It can be proved that this method leads to the same result as off-line training [Widrow and Stearns, 1985]. In practice, this method shows some advantages that make it much more attractive: it converges much more quickly to a minimum of the error surface and usually avoids local minima. A possible explanation is that on-line training introduces some “noise” over the set of training patterns.

Table 1. Design Properties of NNs

| Transfer function | Sign | Exponential (β = 1) | Hyperbolic (β = 1) |
|---|---|---|---|
| Derivative of transfer function | f'(x) = | f'(x) = f(x)(1 − f(x)) | f'(x) = 1 − f²(x) |
| Learning rate | η = 1 | η = 0.1 | η = 0.01 |
| Effects on the NN | Learning not guaranteed | Quick but not precise convergence | Precise and slow convergence |

Momentum: with a small value, the weight increment vectors take very divergent directions; with a big value, the weight increment vectors take similar directions, helping the convergence of the training.
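The difference between the two update schemes lies only in when the accumulated correction is applied, as the following sketch shows for a linear neuron with squared error. The samples, generated from the hypothetical target weights (2, −1), the learning rate and the number of passes are assumptions made for the example.

```python
def error_gradient(w, x, d):
    """Gradient of 0.5 * (d - w.x)^2 with respect to w for one pattern."""
    e = d - sum(wi * xi for wi, xi in zip(w, x))
    return [-e * xi for xi in x]

def offline_epoch(w, samples, eta):
    """Block training: accumulate over the whole set, then update once."""
    total = [0.0] * len(w)
    for x, d in samples:
        total = [t + g for t, g in zip(total, error_gradient(w, x, d))]
    return [wi - eta * t for wi, t in zip(w, total)]

def online_epoch(w, samples, eta):
    """On-line training: update after every pattern."""
    for x, d in samples:
        w = [wi - eta * g for wi, g in zip(w, error_gradient(w, x, d))]
    return w

samples = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0), ((2.0, 1.0), 3.0)]
w_off = [0.0, 0.0]
w_on = [0.0, 0.0]
for _ in range(500):
    w_off = offline_epoch(w_off, samples, eta=0.05)
    w_on = online_epoch(w_on, samples, eta=0.05)
print(w_off, w_on)
```

On this small, noise-free problem both schemes reach the same weights. The practical differences mentioned in the text appear on larger problems with non-convex error surfaces.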

6.4 Selection of the Network size

The selection of the appropriate network size is a task of the utmost importance: if the network is too small, it will not be able to reach an efficient solution for the problem it represents, while if it is too big, the network may be able to represent many solutions that solve the problem over the training patterns, none of which is optimal for the application problem.
If there is no preliminary experience, the dimensioning of the network is a matter of trial and error. One option is to start with a small network and to increase its size progressively until an efficient dimension is found. The other option is to start with a big network and reduce its size progressively, removing the nodes or weights that have no significant effect on the overall output of the NN.
Several studies have established size limits that should not be exceeded. In this sense, one proposal is that the number of nodes in the hidden layer should not exceed the number of training patterns. In practice, this is always satisfied, as the number of nodes will always be much lower than the number of training patterns. In fact, big networks can memorize the whole training set, losing generalization capabilities.

7. KOHONEN NETWORKS

A main principle of biological brain organization is that neurons are grouped in such a way that those that are physically close collaborate in processing the same stimulus. That is the way nerve connections are organized. For example, at each level of the auditory pathway, nerve cells and fibers are arranged according to the frequency that produces the highest output in each neuron [Lippmann, 1987]. Therefore, the physical arrangement of the neurons in the brain structure is somehow related to the function they perform.
Kohonen [Kohonen, 1984] proposed an algorithm to adjust the weights of a network whose input is a vector of N components and whose output is another vector of a different dimension M (M < N). In this way, the dimension of the input subspace is reduced, physically grouping the data.
Vectors defined over a continuous variable are used as input to the network. The network is trained without supervision, so that the network itself establishes the criteria for grouping the input data, extracting regularities and correlations. When a sufficient number of input vectors has been presented, the weights self-organize in such a way that, topologically speaking, close nodes are sensitive to similar inputs. Nodes that are physically far apart remain completely inactive. Clusters that have their topological equivalent in the network are produced. For this reason, this kind of network is known as a Self-Organized Feature Map (SOFM).
The algorithm that assigns values to the synaptic weights is based on the concepts of neighborhood and competitive learning. The distance between the input and the weights of each node is computed, and the closest node is established as the winner node. The weights are updated for this node and its neighbor nodes. The rest are not updated, which favors a specific physical organization.
This kind of network always has two layers: the input layer and the output layer. The dimension of the input vector establishes the number of nodes of the input layer: one node for each component of the input vector. The input neurons drive the input vectors to the output layer through the connection weights. In this type of network, it is very important to establish a neighborhood and a distance measure in the network.
In the example of Figure 11, the nodes are configured in a two-dimensional structure. The algorithm used to compute the output is designed in such a way that only one output neuron is activated when an input vector is applied to the network. The fired node corresponds to the classification category of the input vector. Similar input vectors activate the same output, while different vectors activate different neurons. Only the neuron with the minimum difference between the input vector and its weight vector is activated. When the training algorithm starts, the adjustment is done in a wide zone surrounding the fired node or winner node. As the training progresses, the neighborhood area is progressively reduced. Through these small adjustments, the network follows any systematic change in the input vectors: the network self-organizes. Therefore, this algorithm behaves as a vector quantizer when the number of desired clusters can be specified a priori and a sufficient amount of data relative to the number of desired clusters is known. However, the results depend on the order of presentation of the input data, especially when the amount of input data is small.

Figure 11. Structure of the Kohonen Network (input layer connected to a layer of output nodes)

7.1 Training
Training of the SOFM network can be summarized in the following steps:
1. Weights initialization: The network structure is N input nodes and M output nodes. Random values are assigned to each of the connection weights $w_{ij}$. An initial neighborhood radius is fixed for the neighborhood mask.
2. Presentation of a new input pattern: A new pattern is presented at the input, $X_p(t) = (x_1(t), x_2(t), \ldots, x_N(t))$.
3. Computation of the distance $d_j$ between the input and each one of the output nodes

$$d_j = \sum_{i=0}^{N-1} \left( x_i(t) - w_{ij}(t) \right)^2$$

where $x_i(t)$ is the input to node i at iteration t, and $w_{ij}(t)$ is the weight from input i to output j at iteration t.
4. Selection of the output node with the minimum distance: node $j^*$ is selected as the node with the minimum distance $d_j$.
5. Updating node $j^*$ and its neighbors: weights are updated for node $j^*$ and all its neighbors in the vicinity defined by $NE_{j^*}(t)$. The new weights are:

$$w_{ij}(t+1) = w_{ij}(t) + \eta(t)\left( x_i(t) - w_{ij}(t) \right)$$

for $j \in NE_{j^*}(t)$, $0 \le i \le N - 1$. The term $\eta(t)$ is a gain term ($0 < \eta(t) < 1$) that decreases with time.
6. Back to step 2.
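The steps above can be sketched for a small two-dimensional map. Several details are assumptions made for this illustration and are not prescribed by the text: the output nodes are arranged on a rectangular grid, the neighborhood is a square window around the winner, both the gain term and the neighborhood radius decrease linearly with time, and the data are random points in the unit square.

```python
import random

def squared_distance(x, w):
    """d_j = sum_i (x_i - w_ij)^2 between an input and one node's weights."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))

def winner_node(weights, x):
    """Output node whose weight vector is closest to the input."""
    return min(weights, key=lambda j: squared_distance(x, weights[j]))

def train_sofm(data, rows, cols, iterations=2000, eta0=0.5, seed=0):
    rng = random.Random(seed)
    n = len(data[0])
    # Step 1: random initial weights, one vector per output node (r, c).
    weights = {(r, c): [rng.random() for _ in range(n)]
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    for t in range(iterations):
        remaining = 1.0 - t / iterations
        eta = eta0 * remaining            # gain term, decreasing with time
        radius = radius0 * remaining      # shrinking neighborhood
        x = rng.choice(data)              # Step 2: present a pattern
        jr, jc = winner_node(weights, x)  # Steps 3 and 4: closest node
        # Step 5: update the winner and its neighbors only.
        for (r, c), w in list(weights.items()):
            if max(abs(r - jr), abs(c - jc)) <= radius:
                weights[(r, c)] = [wi + eta * (xi - wi) for xi, wi in zip(x, w)]
    return weights

data_rng = random.Random(1)
data = [[data_rng.random(), data_rng.random()] for _ in range(200)]
weights = train_sofm(data, rows=4, cols=4)
print(winner_node(weights, data[0]))
```

Since every update moves a weight vector a fraction of the way towards an input, the weights stay inside the region covered by the data and the initial values.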
A standard example introduced by Kohonen illustrates the capacity of self-organized networks to learn random distributions of the input vectors presented to the network. For example, if the input is a two-component vector with uniformly distributed components and the output is designed as two-dimensional, then the network weights will organize in a reticular fashion, as shown in Figure 12.

Figure 12. Kohonen Map for the two-dimensional case



8. FUTURE PERSPECTIVES

Artificial neural networks are inspired by the biological functioning of the human brain, which they attempt to emulate. This is the main link between biological and artificial neural networks. From this starting point, both disciplines follow separate ways. The present understanding of the brain mechanisms is so limited that the systems designer does not have sufficient data to emulate its functioning. Therefore, the engineer has to go one step ahead of biological knowledge, searching for and devising useful algorithms and structures that efficiently solve given problems. In the vast majority of cases, this search delivers a result that diverges completely from biological reality, and the similarities with the brain become metaphors.
Despite this faint and often nonexistent analogy between biology and artificial neural networks, the results of the latter frequently evoke comparisons with the former, because they are often reminiscent of the functioning of the brain. Unfortunately, these comparisons are not benign and produce unrealistic expectations that lead to disappointment. Research based on false expectations can evaporate when illuminated by the light of reality, as happened in the sixties. This promising research field could be eclipsed again if we do not resist the temptation of comparing our results with those of the brain.
It has been said that NNs are capable of being applied to all activities specific to the human brain. Currently, they are considered an alternative for all those tasks where conventional computation does not achieve satisfactory results. There has been speculation about a near future in which NNs will take a place alongside classical computation. However, this will only happen if researchers achieve sufficient knowledge for that development. Currently, the theoretical knowledge is not robust enough to justify such predictions.

REFERENCES
W.W. McCulloch and W. Pitts, A Logical Calculus of the Ideas Inminent in Nervous Activity, Bulletin
of Mathematical Biophysics, 5:115–133, 1943.
W. Pitts and W.W. McCulloch, How We Know Universals, Bulletin of Mathematical Biophysics, 9:127–
147, 1947.
D.O. Hebb, Organization of Behaviour, Science Editions, New York, 1961.
F. Rosenblatt, Principles of Neurodynamics, Science Editions, New York, 1962.
B. Widrow, M. E. Hoff, Adaptive Switching Circuits, In IRE WESCON Convention Record, pages
96–104, 1960.
M. Minsky, S. Papert, Perceptrons, MIT press, Cambridge, MA, 1969.
G. Cybenko, Approximation by Superposition of a Sigmoidal Function, Mathematics of Control, Signals,
and Systems, 2:303–314, 1989.
K. Hornik, M. Stinchcombe and H. White, Multilayer Feedforward Networks are Universal Approxi-
mators, Neural Networks, 2(5):359–366, 1989.
K. Hornik, M. Stinchcombe and H. White, Universal Approximation of an Unknown Mapping and Its
Derivatives using Multilayer Feedforward Networks, Neural Networks, 3:551–560, 1990.
S. Haykin, Neural Networks. A Comprehensive Foundation, Macmillan College Publishing, Ontario,
1994.
P.J. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioural Sciences,
PhD thesis, Harvard University, Boston, 1974.
D.E. Rumelhart, G. E. Hinton and R. J. Williams, Learning Internal Representations by Error Propaga-
tion, In D. E. Rumelhart, J. L. McClelland and the PDP Research Group, editors, Parallel Distributed
Processing: Explorations in the Microstructure of Cognition, volume 1: Foundations, pages 318–362,
MIT Press, Cambridge, MA, 1986.
D.B. Parker, Learning Logic, Technical report, Technical Report TR-47, Cambridge, MA: MIT Center
for Research in Computational Economics and Management Science, 1985.
D.R. Hush and B.G. Horne, Progress in Supervised Neural Networks. What’s new since Lippmann?,
IEEE Signal Processing Magazine, 2:721–729, January, 1993.
N.B. Karayiannis and A.N. Venetsanopoulos, Artificial Neural Networks, Learning Algorithms, Performance
Evaluation and Applications, Kluwer Academic Publishers, Boston, MA, 1993.
H.A. Bourlard and N. Morgan, Connectionist Speech Recognition. A Hybrid Approach, Kluwer Academic
Publishers, Boston, MA, 1994.
B. Widrow and S.D. Stearns, Adaptive Signal Processing, Prentice-Hall, Signal Processing Series,
Englewood Cliffs, NJ, 1985.
R.P. Lippmann, An Introduction to Computing with Neural Nets, IEEE ASSP Magazine, 328–339, April,
1987.
T. Kohonen, Self-Organization and Associative Memory, Springer-Verlag, Berlin, 1984.
CHAPTER 3
ARTIFICIAL NEURAL NETWORKS

D. T. PHAM, M. S. PACKIANATHER, A. A. AFIFY


Manufacturing Engineering Centre, Cardiff University, Cardiff CF24 3AA, United Kingdom

INTRODUCTION
Artificial neural networks are computational models of the brain. There are many
types of neural networks representing the brain’s structure and operation with
varying degrees of sophistication. This chapter provides an introduction to the main
types of networks and presents examples of each type.

1. TYPES OF NEURAL NETWORKS


Neural networks generally consist of a number of interconnected processing ele-
ments (PEs) or neurons. How the inter-neuron connections are arranged and the
nature of the connections determine the structure of a network. How the strengths of
the connections are adjusted or trained to achieve a desired overall behaviour of the
network is governed by its learning algorithm. Neural networks can be classified
according to their structures and learning algorithms.

1.1 Structural Categorisation


In terms of their structures, neural networks can be divided into two types: feed-
forward networks and recurrent networks.
Feedforward networks: In a feedforward network, the neurons are generally
grouped into layers. Signals flow from the input layer through to the output layer
via unidirectional connections, the neurons being connected from one layer to the
next, but not within the same layer. Examples of feedforward networks include the
multi-layer perceptron (MLP) [Rumelhart and McClelland, 1986], the radial basis
function (RBF) network [Broomhead and Lowe, 1988; Moody and Darken, 1989],
the learning vector quantization (LVQ) network [Kohonen, 1989], the cerebellar
model articulation control (CMAC) network [Albus, 1975a], the group-method of
data handling (GMDH) network [Hecht-Nielsen, 1990] and some spiking neural
networks [Maass, 1997]. Feedforward networks can most naturally perform static
mappings between an input space and an output space: the output at a given instant
is a function only of the input at that instant.

D. Andina and D.T. Pham (eds.), Computational Intelligence, 67–92.
© 2007 Springer.

Recurrent networks: In a recurrent network, the outputs of some neurons are
fed back to the same neurons or to neurons in preceding layers. Thus, signals can flow
in both forward and backward directions. Examples of recurrent networks include
the Hopfield network [Hopfield, 1982], the Elman network [Elman, 1990] and the
Jordan network [Jordan, 1986]. Recurrent networks have a dynamic memory: their
outputs at a given instant reflect the current input as well as previous inputs and
outputs.

1.2 Learning Algorithm Categorisation

Neural networks are trained by two main types of learning algorithms: supervised
and unsupervised learning algorithms. In addition, there exists a third type, rein-
forcement learning, which can be regarded as a special form of supervised learning.
Supervised learning: A supervised learning algorithm adjusts the strengths or
weights of the inter-neuron connections according to the difference between the
desired and actual network outputs corresponding to a given input. Thus, supervised
learning requires a teacher or supervisor to provide desired or target output signals.
Examples of supervised learning algorithms include the delta rule [Widrow and
Hoff, 1960], the generalised delta rule or backpropagation algorithm [Rumelhart
and McClelland, 1986] and the LVQ algorithm [Kohonen, 1989].
Unsupervised learning: Unsupervised learning algorithms do not require the
desired outputs to be known. During training, only input patterns are presented to the
neural network which automatically adapts the weights of its connections to cluster
the input patterns into groups with similar features. Examples of unsupervised
learning algorithms include the Kohonen [Kohonen, 1989] and Carpenter-Grossberg
Adaptive Resonance Theory (ART) [Carpenter and Grossberg, 1988] competitive
learning algorithms.
Reinforcement learning: As mentioned before, reinforcement learning is a spe-
cial case of supervised learning. Instead of using a teacher to give target outputs,
a reinforcement learning algorithm employs a critic only to evaluate the goodness
of the neural network output corresponding to a given input. An example of a
reinforcement learning algorithm is the genetic algorithm (GA) [Holland, 1975;
Goldberg, 1989].

2. EXAMPLE NEURAL NETWORKS

This section briefly describes the example neural networks and associated learning
algorithms cited previously.

2.1 Multi-layer Perceptron (MLP)

MLPs are perhaps the best known type of feedforward networks. Figure 1a shows
an MLP with three layers: an input layer, an output layer and an intermediate or
hidden layer. Neurons in the input layer only act as buffers for distributing the
input signals $x_i$ to neurons in the hidden layer. Each neuron $j$ (Figure 1b) in the
hidden layer sums up its input signals $x_i$ after weighting them with the strengths of
the respective connections $w_{ji}$ from the input layer and computes its output $y_j$ as a
function $f$ of the sum, viz.

(1) $y_j = f\left(\sum_i w_{ji} x_i\right)$

$f$ can be a simple threshold function or a sigmoidal, hyperbolic tangent or radial
basis function (see Table 1).
The output of neurons in the output layer is computed similarly.
Figure 1a. A multi-layer perceptron

Figure 1b. Details of a neuron

Table 1. Activation functions

Type of function        Function
Linear                  $f(s) = s$
Threshold               $f(s) = +1$ if $s > s_t$, $-1$ otherwise
Sigmoid                 $f(s) = 1/(1 + \exp(-s))$
Hyperbolic tangent      $f(s) = (1 - \exp(-2s))/(1 + \exp(-2s))$
Radial basis function   $f(s) = \exp(-s^2/\beta^2)$

The backpropagation (BP) algorithm, a gradient descent algorithm, is the most
commonly adopted MLP training algorithm. It gives the change $\Delta w_{ji}$ in the weight
of a connection between neurons $i$ and $j$ as follows:

(2) $\Delta w_{ji} = \eta \delta_j x_i$

where $\eta$ is a parameter called the learning rate and $\delta_j$ is a factor depending on
whether neuron $j$ is an output neuron or a hidden neuron. For output neurons,

(3) $\delta_j = \frac{\partial f}{\partial net_j}\left(y_j^{(t)} - y_j\right)$

and for hidden neurons,

(4) $\delta_j = \frac{\partial f}{\partial net_j} \sum_q w_{qj} \delta_q$

In Equation (3), $net_j$ is the total weighted sum of input signals to neuron $j$ and
$y_j^{(t)}$ is the target output for neuron $j$.
As there are no target outputs for hidden neurons, in Equation (4), the difference
between the target and actual output of a hidden neuron $j$ is replaced by the
weighted sum of the $\delta_q$ terms already obtained for neurons $q$ connected to the output
of $j$. Thus, iteratively, beginning with the output layer, the $\delta$ term is computed
for neurons in all layers and weight updates determined for all connections. The
weight updating process can take place after the presentation of each training pattern
(pattern-based training) or after the presentation of the whole set of training patterns
(batch training). In either case, a training epoch is said to have been completed
when all training patterns have been presented once to the MLP.
For all but the most trivial problems, several epochs are required for the MLP
to be properly trained. A commonly adopted method to speed up the training is to
add a “momentum” term to Equation (2) which effectively lets the previous weight
change influence the new weight change, viz:

(5) $\Delta w_{ji}(k+1) = \eta \delta_j x_i + \mu \Delta w_{ji}(k)$

where $\Delta w_{ji}(k+1)$ and $\Delta w_{ji}(k)$ are weight changes in epochs $(k+1)$ and $k$
respectively and $\mu$ is the “momentum” coefficient.
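Equations (1) to (5) can be illustrated with a small pattern-based training routine. The following Python sketch is only an illustration under stated assumptions: the class name `MLP`, the helper `sse`, the sigmoid activation, the bias input appended to each layer and the initial weight range are choices made here and do not come from the text. The momentum term is applied at every pattern presentation rather than once per epoch.

```python
import math
import random


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


class MLP:
    """One-hidden-layer MLP with sigmoid units (illustrative sketch).

    Each weight row ends with a bias weight fed by a constant input of 1;
    the bias is an assumption added here, it is not part of Equation (1).
    """

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                    for _ in range(n_hidden)]
        self.w_o = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                    for _ in range(n_out)]
        # Previous weight changes, needed for the momentum term.
        self.dw_h = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        self.dw_o = [[0.0] * (n_hidden + 1) for _ in range(n_out)]

    def forward(self, x):
        xb = list(x) + [1.0]
        h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w_h]
        hb = h + [1.0]
        y = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in self.w_o]
        return xb, hb, y

    def train_pattern(self, x, target, eta=0.5, mu=0.5):
        xb, hb, y = self.forward(x)
        # Equation (3); for a sigmoid, df/dnet = y (1 - y).
        d_out = [yj * (1 - yj) * (t - yj) for yj, t in zip(y, target)]
        # Equation (4): weighted sum of the deltas of the following layer.
        d_hid = [hb[j] * (1 - hb[j])
                 * sum(self.w_o[q][j] * d_out[q] for q in range(len(d_out)))
                 for j in range(len(self.w_h))]
        # Equation (5): delta-rule change plus momentum.
        for W, dW, deltas, inputs in ((self.w_o, self.dw_o, d_out, hb),
                                      (self.w_h, self.dw_h, d_hid, xb)):
            for j, dj in enumerate(deltas):
                for i, xi in enumerate(inputs):
                    dW[j][i] = eta * dj * xi + mu * dW[j][i]
                    W[j][i] += dW[j][i]


def sse(net, data):
    """Sum-squared error of the network over a list of (input, target) pairs."""
    return sum((t - y) ** 2
               for x, target in data
               for t, y in zip(target, net.forward(x)[2]))
```

Calling `train_pattern` once for every pattern in the training set corresponds to one epoch of pattern-based training; batch training would instead accumulate the changes over the whole set before applying them.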

Another learning method suitable for training MLPs is the genetic algorithm
(GA). This is an optimisation algorithm based on evolution principles. The weights
of the connections are considered genes in a chromosome. The goodness or fitness
of the chromosome is directly related to how well trained the MLP is. The algorithm
starts with a randomly generated population of chromosomes and applies genetic
operators to create new and fitter populations. The most common genetic operators
are the selection, crossover and mutation operators. The selection operator chooses
chromosomes from the current population for reproduction. Usually, a biased selec-
tion procedure is adopted which favours the fitter chromosomes. The crossover
operator creates two new chromosomes from two existing chromosomes by cutting
them at a random position and exchanging the parts following the cut. The muta-
tion operator produces a new chromosome by randomly changing the genes of an
existing chromosome. Together, these operators simulate a guided random search
method which can eventually yield the optimum set of weights to minimise the
differences between the actual and target outputs of the neural network. Further
details of genetic algorithms can be found in the chapter on Soft Computing and
its Applications in Engineering and Manufacture.
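The three genetic operators can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by the authors: real-valued genes, tournament selection as the biased selection scheme, Gaussian mutation and the names `crossover`, `mutate`, `select` and `evolve` are all assumptions made here. For MLP training, `fitness_fn` would typically return the negative of the network error obtained with the weights encoded in the chromosome.

```python
import random


def crossover(a, b, rng):
    """Cut both parents at one random position and exchange the tails."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def mutate(chrom, rng, rate=0.1, scale=0.5):
    """Randomly perturb genes; each gene changes with probability `rate`."""
    return [g + rng.gauss(0.0, scale) if rng.random() < rate else g
            for g in chrom]


def select(pop, fitness, rng):
    """Biased selection: the fitter of two randomly drawn chromosomes wins."""
    a, b = rng.sample(range(len(pop)), 2)
    return pop[a] if fitness[a] >= fitness[b] else pop[b]


def evolve(fitness_fn, n_genes, pop_size=20, generations=50, seed=0):
    """Return the fittest chromosome seen (n_genes must be at least 2)."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1.0, 1.0) for _ in range(n_genes)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness_fn)
    for _ in range(generations):
        fitness = [fitness_fn(c) for c in pop]
        nxt = []
        while len(nxt) < pop_size:
            c1, c2 = crossover(select(pop, fitness, rng),
                               select(pop, fitness, rng), rng)
            nxt += [mutate(c1, rng), mutate(c2, rng)]
        pop = nxt[:pop_size]
        best = max(pop + [best], key=fitness_fn)
    return best
```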

2.2 Radial Basis Function (RBF) Network

Large multi-layer perceptron (MLP) networks take a long time to train. This has led
to the construction of alternative networks such as the Radial Basis Function (RBF)
network [Cichocki and Unbehauen, 1993; Hassoun, 1995; Haykin, 1999]. The RBF
network is the most widely used network after the MLP. Figure 2 shows the structure of an RBF
network which consists of three layers. The input layer neurons receive the inputs
$x_1, \ldots, x_M$. The hidden layer neurons provide a set of activation functions that constitute
an arbitrary “basis” for the input patterns in the input space to be expanded into the
hidden space by way of non-linear transformation. At the input of each hidden neuron,
the distance between the centre of each activation or basis function and the input vector
is calculated. Applying the basis function to this distance produces the output of the
hidden neuron. The RBF network output $y$ is formed by the neuron in the output layer
as a weighted sum of the hidden layer neuron activations.

Figure 2. The RBF network

Figure 3. The Radial Basis Function ($K(x)$ plotted against $x$, with peak value 1.0 at $x = 0$ and standard deviation $\sigma = 1$)

The basis function is generally chosen to be a standard function which is positive
at its centre $x = 0$ and then decreases uniformly to zero on either side as shown in
Figure 3. A common choice is the Gaussian distribution function:

(6) $K(x) = \exp\left(-\frac{x^2}{2}\right)$

This function can be shifted to an arbitrary centre, $x = c$, and stretched by varying
its standard deviation $\sigma$ as follows:

(7) $K\left(\frac{x - c}{\sigma}\right) = \exp\left(-\frac{(x - c)^2}{2\sigma^2}\right)$

The output of the RBF network $y$ is given by:

(8) $y = \sum_{i=1}^{N} w_i K\left(\frac{\|x - c_i\|}{\sigma_i}\right) \quad \forall x$

where $w_i$ is the weight of the hidden neuron $i$, $c_i$ the centre of basis function $i$ and
$\sigma_i$ the standard deviation of the function. $\|x - c_i\|$ is the norm of $(x - c_i)$. There
are various ways to calculate the norm. The most common is the Euclidean norm
given by:

(9) $\|x - c_i\| = \sqrt{(x_1 - c_{i1})^2 + (x_2 - c_{i2})^2 + \cdots + (x_M - c_{iM})^2}$

This norm gives the distance between the two points $x$ and $c_i$ in $M$-dimensional
space. All points $x$ that are the same radial distance from $c_i$ give the same value
for the norm and hence the same value for the basis function. Hence the basis
functions are called Radial Basis Functions. Obtaining the values for $w_i$, $c_i$ and $\sigma_i$
requires training the RBF network. Because the basis functions are differentiable,
back-propagation could be used as with MLP networks. Training of a multiple-input
single-output RBF network can proceed as follows:
(i) choose the number N of hidden units;
There is no firm guidance available for this. The selection of N is normally
made by trial and error. In general, the smallest N that gives the RBF network
an acceptable performance is adopted.
(ii) choose the centres, $c_i$;
Centre selection could be performed in three different ways [Haykin, 1999]:
a) Trial and error:
Centres can be selected by trial and error. This is not always easy if little is
known about the underlying functional behaviour of the data. Usually, the centres
are spread evenly or randomly over the $M$-dimensional input space.
b) Self-organized selection:
An adaptive unsupervised method can be used to learn where to place the
centres.
c) Supervised selection:
A supervised learning process, commonly error correction learning, can be
deployed to fix the centres.
(iii) choose stretch constants, $\sigma_i$;
Several heuristics are available. A popular way is to set $\sigma_i$ equal to the distance
to the nearest neighbouring centre. First the distances between centres are computed, then
the nearest distance is chosen to be the value of $\sigma_i$.
(iv) calculate weights, $w_i$.
When $c_i$ and $\sigma_i$ are known, the outputs of hidden units $(O_1, \ldots, O_N)^T$
can be calculated for any pattern of inputs $x = (x_1, \ldots, x_M)$. Assuming there
are $P$ input patterns $x$ in the training set, there will be $P$ sets of hidden unit
outputs that can be calculated. These can be assembled in an $N \times P$ matrix:

(10) $O = \begin{bmatrix} O_1^{(1)} & O_1^{(2)} & \cdots & O_1^{(P)} \\ O_2^{(1)} & O_2^{(2)} & \cdots & O_2^{(P)} \\ \vdots & \vdots & & \vdots \\ O_N^{(1)} & O_N^{(2)} & \cdots & O_N^{(P)} \end{bmatrix}$

If the output $y_i$ of the RBF network corresponding to the $i$th training input pattern
is $y_i = O_1^{(i)} w_1 + O_2^{(i)} w_2 + \cdots + O_N^{(i)} w_N$, the following equation can be
obtained:

(11) $y = \begin{bmatrix} y_1 \\ \vdots \\ y_P \end{bmatrix} = \begin{bmatrix} O_1^{(1)} & \cdots & O_N^{(1)} \\ \vdots & & \vdots \\ O_1^{(P)} & \cdots & O_N^{(P)} \end{bmatrix} \cdot \begin{bmatrix} w_1 \\ \vdots \\ w_N \end{bmatrix} = O^T \cdot w$

$y$ is the vector of actual outputs corresponding to the training inputs $x$. Ideally,
$y$ should be equal to $d$, the desired/target outputs. Unknown coefficients $w_i$
can be chosen to minimise the sum-squared-error of $y$ compared with $d$. It can
be shown that this is achieved when:

(12) $w = (O O^T)^{-1} O d$
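Steps (iii) and (iv) and Equations (8) to (12) can be sketched as follows, assuming that the centres have already been chosen. The function names and the small Gaussian elimination routine `solve` are illustrative choices made here so that the example needs only the Python standard library; a linear algebra package would normally be used instead.

```python
import math


def dist(a, b):
    """Euclidean norm of (a - b), Equation (9)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def gaussian(r, sigma):
    """Gaussian basis function of Equation (7) applied to a distance r."""
    return math.exp(-(r * r) / (2.0 * sigma * sigma))


def nearest_neighbour_sigmas(centres):
    """Step (iii): sigma_i is the distance to the nearest other centre."""
    return [min(dist(c, o) for j, o in enumerate(centres) if j != i)
            for i, c in enumerate(centres)]


def hidden_outputs(x, centres, sigmas):
    return [gaussian(dist(x, c), s) for c, s in zip(centres, sigmas)]


def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w


def fit_weights(X, d, centres, sigmas):
    """Step (iv): solve (O O^T) w = O d, which is Equation (12)."""
    H = [hidden_outputs(x, centres, sigmas) for x in X]  # P x N, i.e. O^T
    n = len(centres)
    OOt = [[sum(h[i] * h[j] for h in H) for j in range(n)] for i in range(n)]
    Od = [sum(h[i] * t for h, t in zip(H, d)) for i in range(n)]
    return solve(OOt, Od)


def predict(x, centres, sigmas, w):
    """Network output of Equation (8)."""
    return sum(wi * o for wi, o in zip(w, hidden_outputs(x, centres, sigmas)))
```

When every training input is used as a centre, the fitted network reproduces the training targets; with fewer centres than patterns, the weights give the least-squares fit.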

2.3 Learning Vector Quantization (LVQ) Network

Figure 4 shows an LVQ network which comprises three layers of neurons: an input
buffer layer, a hidden layer and an output layer. The network is fully connected
between the input and hidden layers and partially connected between the hidden
and output layers, with each output neuron linked to a different cluster of hidden
neurons. The weights of the connections between the hidden and output neurons are
fixed to 1. The weights of the input-hidden neuron connections form the components
of reference vectors (one reference vector is assigned to each hidden neuron). They
are modified during the training of the network. Both the hidden neurons (also
known as Kohonen neurons) and the output neurons have binary outputs. When an
input pattern is supplied to the network, the hidden neuron whose reference vector
is closest to the input pattern is said to win the competition for being activated and
thus allowed to produce a “1”. All other hidden neurons are forced to produce a
“0”. The output neuron connected to the cluster of hidden neurons that contains
the winning neuron also emits a “1” and all other output neurons a “0”. The output
neuron that produces a “1” gives the class of the input pattern, each output neuron
being dedicated to a different class. The simplest LVQ training procedure is as
follows:
(i) initialise the weights of the reference vectors;
(ii) present a training input pattern to the network;
(iii) calculate the (Euclidean) distance between the input pattern and each reference
vector;
(iv) update the weights of the reference vector that is closest to the input pattern,
that is, the reference vector of the winning hidden neuron. If the latter belongs
to the cluster connected to the output neuron in the class that the input pattern is
known to belong to, the reference vector is brought closer to the input pattern.
Otherwise, the reference vector is moved away from the input pattern;
(v) return to (ii) with a new training input pattern and repeat the procedure until
all training patterns are correctly classified (or a stopping criterion is met).
For other LVQ training procedures, see for example [Pham and Oztemel, 1994].

Figure 4. Learning Vector Quantization network
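Steps (iii) and (iv) of the simplest LVQ procedure can be sketched as below. The text does not give the update formula, so the proportional update with a learning rate `lr` is an assumption made here, as are the function names; `labels[k]` stands for the class of the output neuron to which hidden neuron `k` is connected.

```python
def sq_dist(a, b):
    """Squared Euclidean distance; it ranks neurons like the distance itself."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def lvq_step(refs, labels, x, x_class, lr=0.1):
    """Present one training pattern and update the winning reference vector.

    refs    : list of reference vectors, one per hidden neuron (updated in place)
    labels  : class of the output neuron linked to each hidden neuron
    x_class : known class of the input pattern x
    """
    win = min(range(len(refs)), key=lambda k: sq_dist(refs[k], x))
    # Move towards x if the winner is in the correct class, away otherwise.
    sign = 1.0 if labels[win] == x_class else -1.0
    refs[win] = [r + sign * lr * (xi - r) for r, xi in zip(refs[win], x)]
    return win


def classify(refs, labels, x):
    """Class of the output neuron connected to the winning hidden neuron."""
    return labels[min(range(len(refs)), key=lambda k: sq_dist(refs[k], x))]
```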

2.4 CMAC Network

CMAC (Cerebellar Model Articulation Control) [Albus, 1975a, 1975b, 1979a,
1979b; An et al., 1994] can be considered a supervised feedforward neural network
with the characteristics of a fuzzy associative memory. A basic CMAC module is
shown in Figure 5.
CMAC consists of a series of mappings:

(13) $S \xrightarrow{e} M \xrightarrow{f} A \xrightarrow{g} u$

where

$S$ = input vectors
$M$ = intermediate variables
$A$ = association cell vectors
$u$ = output of CMAC $\equiv h(S)$
$h \equiv g \cdot f \cdot e$

(a) Input encoding (S → M mapping)


The $S \to M$ mapping is a set of submappings, one for each input variable:

(14) $S \to M = \begin{bmatrix} s_1 \to m_1 \\ s_2 \to m_2 \\ \vdots \\ s_n \to m_n \end{bmatrix}$

Figure 5. A basic CMAC module

The range of each input variable $s_i$ is coarsely discretised using the quantising functions
$q_1, q_2, \ldots, q_k$. Each function divides the range into $k$ intervals. The intervals
produced by function $q_{j+1}$ are offset by one $k$th of the range compared to their
counterparts produced by function $q_j$. $m_i$ is a set of $k$ intervals generated by $q_1$ to $q_k$
respectively.
An example is given in Figure 6 to illustrate the internal mappings within a
CMAC module. The S → M mapping is shown in the leftmost part of the figure.
In Figure 6, two input variables $s_1$ and $s_2$ are represented with unity resolution
in the range of 0 to 8. The range of each input variable is described using three
quantising functions. For example, the range of $s_1$ is described by functions $q_1$, $q_2$,
and $q_3$. $q_1$ divides the range into intervals $A$, $B$, $C$ and $D$. $q_2$ gives intervals $E$, $F$,
$G$, and $H$ and $q_3$ provides intervals $I$, $J$, $K$ and $L$. That is,

$q_1 = \{A, B, C, D\}$
$q_2 = \{E, F, G, H\}$
$q_3 = \{I, J, K, L\}$

For every value of $s_1$, there exists a set of elements, $m_1$, which are the intersection
of the functions $q_1$ to $q_3$, such that the value of $s_1$ uniquely defines set $m_1$ and
vice versa. For example, value $s_1 = 5$ maps to set $m_1 = \{B, G, K\}$ and vice versa.
Similarly, value $s_2 = 4$ maps to set $m_2 = \{b, g, j\}$ and vice versa.
The $S \to M$ mapping gives CMAC two advantages: the first is that a single
precise variable $s_i$ can be transmitted over several imprecise information channels.
Each channel carries only a small part of the information of $s_i$. This increases
the reliability of the information transmission. The other advantage is that small
changes in the value of $s_i$ have no influence on most of the elements in $m_i$. This
leads to the property of input generalisation which is important in an environment
where random noise exists.

Figure 6. Internal mappings within a CMAC module

(b) Address computing (M → A mapping)

$A$ is a set of address vectors associated with weight tables. $A$ is obtained by
combining the elements of $m_i$. For example, in Figure 6, the sets $m_1 = \{B, G, K\}$
and $m_2 = \{b, g, j\}$ are combined to give the set of elements $A = \{a_1, a_2, a_3\} =
\{Bb, Gg, Kj\}$.

(c) Output computing (A → u mapping)

This mapping involves looking up the weight tables and adding the contents of the
addressed locations to yield the output of the network. The following formula is
employed:

(15) $u = \sum_i w_i(a_i)$

That is, only the weights associated with the addresses $a_i$ in $A$ are summed. For
this given example, these weights are:

$w(Bb) = x_1$
$w(Gg) = x_2$
$w(Kj) = x_3$

Thus the output is:

(16) $u = x_1 + x_2 + x_3$

Training a CMAC module consists of adjusting the stored weights. Assuming
that $f$ is the function that CMAC has to learn, the following training steps could
be adopted:
(i) select a point $S$ in the input space and obtain the current output $u$ corresponding
to $S$;
(ii) let $\bar{u}$ be the desired output of CMAC, that is, $\bar{u} = f(S)$;
(iii) if $|\bar{u} - u| \le \varepsilon$, where $\varepsilon$ is an acceptable error, then do nothing; the desired
value is already stored in CMAC. However, if $|\bar{u} - u| > \varepsilon$, then add to every
weight which contributed to $u$ the quantity

(17) $\Delta = \eta \frac{\bar{u} - u}{|A|}$

where $|A|$ is the number of weights which contributed to $u$ and $\eta$ is the learning
rate.
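The three mappings and the training rule of Equation (17) can be sketched as a small Python class. The quantisation used here is a simplification and an assumption: inputs are non-negative integers, each of the `k` quantising functions uses intervals `k` units wide, and function number `layer` is shifted by `layer` units, so the interval boundaries do not reproduce those of Figure 6. The class and attribute names are likewise illustrative.

```python
class CMAC:
    """Minimal CMAC sketch: one weight per (quantising function, interval tuple)."""

    def __init__(self, k=3, eta=1.0, eps=1e-3):
        self.k = k          # number of quantising functions per input
        self.eta = eta      # learning rate
        self.eps = eps      # acceptable error
        self.weights = {}   # weight table, addressed by tuples

    def addresses(self, s):
        """S -> M -> A: combine the interval indices of all inputs, per function."""
        return [(layer,) + tuple((int(si) + layer) // self.k for si in s)
                for layer in range(self.k)]

    def output(self, s):
        """A -> u: sum the addressed weights, as in Equation (15)."""
        return sum(self.weights.get(a, 0.0) for a in self.addresses(s))

    def train(self, s, desired):
        """Apply Equation (17) if the error exceeds the acceptable error."""
        u = self.output(s)
        if abs(desired - u) > self.eps:
            addrs = self.addresses(s)
            delta = self.eta * (desired - u) / len(addrs)
            for a in addrs:
                self.weights[a] = self.weights.get(a, 0.0) + delta
        return u
```

Because neighbouring input points share most of their addresses, training at one point also changes the output at nearby points, which is the input generalisation property described above.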

2.5 Group Method of Data Handling (GMDH) Network


Figure 7 shows a GMDH network and the details of one of its neurons. Unlike
the feedforward neural networks previously described which have a fixed structure,
a GMDH network has a structure which grows during training. Each neuron in a
GMDH network usually has two inputs $x_1$ and $x_2$ and produces an output $y$ that is
a quadratic combination of these inputs, viz.

(18) $y = w_0 + w_1 x_1 + w_2 x_1^2 + w_3 x_1 x_2 + w_4 x_2^2 + w_5 x_2$

Figure 7a. A trained GMDH network
Note: Each GMDH neuron is an N-Adaline, which is an Adaptive Linear Element with a nonlinear preprocessor

Figure 7b. Details of a GMDH Neuron

Training a GMDH network consists of configuring the network starting with the
input layer, adjusting the weights of each neuron, and increasing the number of
layers until the accuracy of the mapping achieved with the network deteriorates.

The number of neurons in the first layer depends on the number of external
inputs available. For each pair of external inputs, one neuron is used.
Training proceeds with presenting an input pattern to the input layer and adapting
the weights of each neuron according to a suitable learning algorithm, such as the
delta rule (see for example [Pham and Liu, 1994]), viz.

(19) $W_{k+1} = W_k + \alpha \frac{X_k}{|X_k|^2}\left(y_k^d - W_k^T X_k\right)$

where $W_k$, the weight vector of a neuron at time $k$, and $X_k$, the modified input vector
to the neuron at time $k$, are defined as

(20) $W_k = [w_0, w_1, w_2, w_3, w_4, w_5]^T$
(21) $X_k = [1, x_1, x_1^2, x_1 x_2, x_2^2, x_2]^T$

and $y_k^d$ is the desired network output at time $k$.


Note that, for this description, it is assumed that the GMDH network only has one
output. Equation (19) shows that the desired network output is presented to each
neuron in the input layer and an attempt is made to train each neuron to produce
that output. When the sum of the mean square errors SE over all the desired outputs
in the training data set for a given neuron reaches the minimum for that neuron,
the weights of the neuron are frozen and its training halted. When the training
has ended for all neurons in a layer, the training for the layer stops. Neurons that
produce SE values below a given threshold when another set of data (known as the
selection data set) is presented to the network are selected to grow the next layer.
At each stage, the smallest SE value achieved for the selection data set is recorded.
If the smallest SE value for the current layer is less than that for the previous layer
(that is, the accuracy of the network is improving), a new layer is generated, the
size of which depends on the number of neurons just selected. The training and
selection processes are repeated until the SE value deteriorates. The best neuron in
the immediately preceding layer is then taken as the output neuron for the network.
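The behaviour of a single GMDH neuron, Equations (18) to (21), can be sketched as follows. Only the neuron and its weight update are shown; layer growth and neuron selection are omitted. The function names and the name `alpha` for the step size in Equation (19) are assumptions made for this illustration.

```python
def features(x1, x2):
    """Modified input vector X_k of Equation (21)."""
    return [1.0, x1, x1 * x1, x1 * x2, x2 * x2, x2]


def neuron_output(w, x1, x2):
    """Quadratic combination of Equation (18), i.e. W^T X."""
    return sum(wi * xi for wi, xi in zip(w, features(x1, x2)))


def delta_rule_step(w, x1, x2, y_desired, alpha=0.5):
    """One update of Equation (19); returns the new weight vector."""
    X = features(x1, x2)
    err = y_desired - sum(wi * xi for wi, xi in zip(w, X))
    norm2 = sum(xi * xi for xi in X)
    return [wi + alpha * err * xi / norm2 for wi, xi in zip(w, X)]
```

Because the correction is divided by $|X_k|^2$, a single update reduces the error on the presented pattern by the factor $(1 - \alpha)$; with `alpha = 1` the neuron reproduces the desired output for that pattern exactly.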

2.6 Hopfield Network


Figure 8 shows one version of a Hopfield network. This network normally accepts
binary and bipolar inputs (+1 or −1). It has a single “layer” of neurons, each
connected to all the others, giving it a recurrent structure, as mentioned earlier. The
training of a Hopfield network takes only one step, the weights $w_{ij}$ of the network
being assigned directly as follows:

(22) $w_{ij} = \begin{cases} \frac{1}{N} \sum_{c=1}^{P} x_i^c x_j^c, & i \ne j \\ 0, & i = j \end{cases}$

where $w_{ij}$ is the connection weight from neuron $i$ to neuron $j$, and $x_i^c$ (which is
either +1 or −1) is the $i$th component of the training input pattern for class $c$, $P$
the number of classes and $N$ the number of neurons (or the number of components
in the input pattern). Note from Equation (22) that $w_{ij} = w_{ji}$ and $w_{ii} = 0$, a set of
conditions that guarantee the stability of the network. When an unknown pattern
is input to the network, its outputs are initially set equal to the components of the
unknown pattern, viz.

(23) $y_i(0) = x_i, \quad 1 \le i \le N$

Figure 8. A Hopfield network

Starting with these initial values, the network iterates according to the following
equation until it reaches a minimum energy state, i.e. its outputs stabilise to constant
values:

(24) $y_i(k+1) = f\left(\sum_{j=1}^{N} w_{ij} y_j(k)\right), \quad 1 \le i \le N$

where $f$ is a hard limiting function defined as

(25) $f(x) = \begin{cases} -1, & x < 0 \\ 1, & x > 0 \end{cases}$
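Equations (22) to (25) translate almost directly into code. In the sketch below all neurons are updated together at each iteration, as Equation (24) is written. Two details are assumptions made here: the output is left unchanged when the weighted sum is exactly zero, a case Equation (25) does not define, and the iteration is capped by `max_iters`. The function names are illustrative.

```python
def train_hopfield(patterns):
    """One-step training of Equation (22); patterns hold +1/-1 components."""
    n = len(patterns[0])
    return [[0.0 if i == j else sum(p[i] * p[j] for p in patterns) / n
             for j in range(n)]
            for i in range(n)]


def hard_limit(s, previous):
    """Equation (25); keeping the previous output at s == 0 is an assumption."""
    if s > 0:
        return 1
    if s < 0:
        return -1
    return previous


def recall(w, x, max_iters=100):
    """Iterate Equation (24) from y(0) = x until the outputs stop changing."""
    y = list(x)
    for _ in range(max_iters):
        new = [hard_limit(sum(w[i][j] * y[j] for j in range(len(y))), y[i])
               for i in range(len(y))]
        if new == y:
            break
        y = new
    return y
```

For a network storing the single pattern `[1, -1, 1, -1, 1, -1]`, presenting a copy with its first component flipped brings the outputs back to the stored pattern after one iteration.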

2.7 Elman and Jordan Nets

Figures 9a and b show an Elman net and a Jordan net, respectively. These networks
have a multi-layered structure similar to the structure of MLPs. In both nets, in
addition to an ordinary hidden layer, there is another special hidden layer sometimes
called the context or state layer. This layer receives feedback signals from the
ordinary hidden layer (in the case of an Elman net) or from the output layer (in
the case of a Jordan net). The Jordan net also has connections from each neuron
in the context layer back to itself. With both nets, the outputs of neurons in the
context layer are fed forward to the hidden layer. If only the forward connections
are to be adapted and the feedback connections are preset to constant values, these
networks can be considered ordinary feedforward networks and the BP algorithm
used to train them. Otherwise, a GA could be employed [Pham and Karaboga,
1993b; Karaboga, 1994]. For improved versions of the Elman and Jordan nets, see
[Pham and Liu, 1992; Pham and Oh, 1992].

Figure 9a. An Elman network

Figure 9b. A Jordan network
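The role of the context layer can be seen in a forward pass of an Elman net. The sketch below assumes feedback connections fixed at 1, so that the context units simply hold a copy of the previous hidden outputs, a hyperbolic tangent activation in the hidden layer and linear output units; these choices and the function names are assumptions made for illustration only.

```python
import math


def elman_step(x, context, w_in, w_ctx, w_out):
    """One time step: returns the outputs and the new context (hidden) values.

    w_in[j]  : weights from the inputs to hidden neuron j
    w_ctx[j] : weights from the context units to hidden neuron j
    w_out[q] : weights from the hidden neurons to output neuron q
    """
    hidden = [math.tanh(sum(w * v for w, v in zip(w_in[j], x))
                        + sum(w * c for w, c in zip(w_ctx[j], context)))
              for j in range(len(w_in))]
    y = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    # The hidden outputs are copied to the context layer for the next step.
    return y, hidden


def run_sequence(xs, w_in, w_ctx, w_out):
    """Feed a sequence of input vectors, starting from a zero context."""
    context = [0.0] * len(w_in)
    outputs = []
    for x in xs:
        y, context = elman_step(x, context, w_in, w_ctx, w_out)
        outputs.append(y)
    return outputs
```

With a non-zero context weight, the output at a given step depends on earlier inputs as well as the current one, which is the dynamic memory mentioned in Section 1.1. A Jordan net would feed back the outputs `y`, together with the previous context values, instead of the hidden outputs.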

2.8 Kohonen Network

A Kohonen network or a self-organising feature map has two layers, an input buffer
layer to receive the input pattern and an output layer (see Figure 10). Neurons in the
output layer are usually arranged into a regular two-dimensional array. Each output
neuron is connected to all input neurons. The weights of the connections form the
components of the reference vector associated with the given output neuron.
Training a Kohonen network involves the following steps:
(i) initialise the reference vectors of all output neurons to small random values;
(ii) present a training input pattern;
(iii) determine the winning output neuron, i.e. the neuron whose reference vector is
closest to the input pattern. The Euclidean distance between a reference vector
and the input vector is usually adopted as the distance measure;
(iv) update the reference vector of the winning neuron and those of its neighbours.
These reference vectors are brought closer to the input vector. The adjustment
is greatest for the reference vector of the winning neuron and decreased for
reference vectors of neurons further away. The size of the neighbourhood of a
neuron is reduced as training proceeds until, towards the end of training, only
the reference vector of a winning neuron is adjusted.
In a well-trained Kohonen network, output neurons that are close to one another
have similar reference vectors. After training, a labelling procedure is adopted where
input patterns of known classes are fed to the network and class labels are assigned to
output neurons that are activated by those input patterns. As with the LVQ network,
an output neuron is activated by an input pattern if it wins the competition against
other output neurons, that is, if its reference vector is closest to the input pattern.
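The training steps above can be sketched as follows. The linearly decreasing learning rate and neighbourhood radius and the Gaussian weighting of neighbours are common choices assumed here, not details given in the text; the names `winner` and `train_som` are also illustrative.

```python
import math
import random


def winner(refs, x):
    """Grid position of the output neuron whose reference vector is closest to x."""
    return min(refs, key=lambda pos: sum((r - v) ** 2
                                         for r, v in zip(refs[pos], x)))


def train_som(data, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a rows x cols map; returns {grid position: reference vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Step (i): small random initial reference vectors.
    refs = {(r, c): [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        shrink = 1.0 - epoch / epochs          # decreases but stays above 0
        lr, radius = lr0 * shrink, radius0 * shrink
        for x in data:                          # step (ii)
            win = winner(refs, x)               # step (iii)
            for pos, ref in refs.items():       # step (iv)
                d = math.hypot(pos[0] - win[0], pos[1] - win[1])
                if d <= radius:
                    # Largest adjustment for the winner, smaller further away.
                    h = math.exp(-(d * d) / (2.0 * radius * radius))
                    for i in range(dim):
                        ref[i] += lr * h * (x[i] - ref[i])
    return refs
```

Towards the end of training the radius falls below the spacing of the grid, so that only the reference vector of the winning neuron is adjusted.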

Figure 10. A Kohonen network


2.9 ART Networks

There are different versions of the ART network. Figure 11 shows the ART-1
version for dealing with binary inputs. Later versions, such as ART-2 can also
handle continuous-valued inputs.

ART-1
As illustrated in Figure 11, an ART-1 network has two layers, an input layer and an
output layer. The two layers are fully interconnected, the connections are in both
the forward (or bottom-up) direction and the feedback (or top-down) direction. The
vector Wi of weights of the bottom-up connections to an output neuron i forms
an exemplar of the class it represents. All the Wi vectors constitute the long-term
memory of the network. They are employed to select the winning neuron, the latter
again being the neuron whose Wi vector is most similar to the current input pattern.
The vector Vi of the weights of the top-down connections from an output neuron
i is used for vigilance testing, that is, determining whether an input pattern is
sufficiently close to a stored exemplar. The vigilance vectors Vi form the short-term
memory of the network. $V_i$ and $W_i$ are related in that $W_i$ is a normalised copy of
$V_i$, viz.

(26) $W_i = \frac{V_i}{\varepsilon + \sum_j V_{ji}}$

where $\varepsilon$ is a small constant and $V_{ji}$ is the $j$th component of $V_i$ (i.e. the weight of the
connection from output neuron $i$ to input neuron $j$).

Figure 11. An ART-1 network


Training an ART-1 network occurs continuously when the network is in use and
involves the following steps:
(i) initialise the exemplar and vigilance vectors Wi and Vi for all output neurons,
setting all the components of each Vi to 1 and computing Wi according to
Equation (26). An output neuron with all its vigilance weights set to 1 is
known as an uncommitted neuron in the sense that it is not assigned to
represent any pattern classes;
(ii) present a new input pattern x;
(iii) enable all output neurons so that they can participate in the competition for
activation;
(iv) find the winning output neuron among the competing neurons, i.e. the neuron
for which $x \cdot W_i$ is largest; a winning neuron can be an uncommitted neuron
as is the case at the beginning of training or if there are no better output
neurons;
(v) test whether the input pattern $x$ is sufficiently similar to the vigilance vector
$V_i$ of the winning neuron. Similarity is measured by the fraction $r$ of bits in
$x$ that are also in $V_i$, viz.

(27) $r = \frac{x \cdot V_i}{\sum_j x_j}$

$x$ is deemed to be sufficiently similar to $V_i$ if $r$ is at least equal to the vigilance
threshold $\rho$ ($0 < \rho \le 1$);
(vi) go to step (vii) if $r \ge \rho$ (i.e. there is resonance); else disable the winning
neuron temporarily from further competition and go to step (iv) repeating this
procedure until there are no further enabled neurons;
(vii) adjust the vigilance vector Vi of the most recent winning neuron by logically
ANDing it with x, thus deleting bits in Vi that are not also in x; compute the
bottom-up exemplar vector Wi using the new Vi according to Equation (26);
activate the winning output neuron;
(viii) go to step (ii).
The above training procedure ensures that if the same sequence of training
patterns is repeatedly presented to the network, its long-term and short-term memories
are unchanged (i.e. the network is stable). Also, provided there are sufficient output
neurons to represent all the different classes, new patterns can always be learnt, as
a new pattern can be assigned to an uncommitted output neuron if it does not match
previously stored exemplars well (i.e. the network is plastic).
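
The ART-1 training steps (i) to (vii) can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the class name `ART1`, the helper `normalise`, the value 0.5 for the small constant in Equation (26), the vigilance value 0.7 and the tie-breaking rule (lowest index wins) are all assumptions made for the example.

```python
def normalise(v, eps=0.5):
    """Equation (26): bottom-up exemplar W_i computed from vigilance vector V_i."""
    total = eps + sum(v)
    return [vj / total for vj in v]


class ART1:
    def __init__(self, n_inputs, n_outputs, rho=0.7, eps=0.5):
        self.rho, self.eps = rho, eps
        # Step (i): all neurons start uncommitted (every top-down weight is 1).
        self.V = [[1] * n_inputs for _ in range(n_outputs)]
        self.W = [normalise(v, eps) for v in self.V]

    def train(self, x):
        """Present one binary pattern x (steps ii-vii); return the resonating
        neuron's index, or None if every neuron fails the vigilance test."""
        if sum(x) == 0:
            raise ValueError("pattern must contain at least one set bit")
        enabled = list(range(len(self.V)))               # step (iii)
        while enabled:
            # Step (iv): the winner maximises x . W_i (ties go to the lowest index).
            i = max(enabled, key=lambda k: sum(a * b for a, b in zip(x, self.W[k])))
            # Step (v): Equation (27), fraction of the bits of x also set in V_i.
            r = sum(a & b for a, b in zip(x, self.V[i])) / sum(x)
            if r >= self.rho:                            # step (vi): resonance
                self.V[i] = [a & b for a, b in zip(x, self.V[i])]   # step (vii)
                self.W[i] = normalise(self.V[i], self.eps)
                return i
            enabled.remove(i)                            # disable and search again
        return None
```

Presenting the same pattern twice returns the same neuron and leaves its weights unchanged (stability), while a pattern that fails the vigilance test on every committed neuron is assigned to an uncommitted one (plasticity).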

ART-2
The architecture of an ART-2 network [Carpenter and Grossberg, 1987; Pham and
Chan, 1998; 2001] is depicted in Figure 12. In this particular configuration, the
“feature representation” field F1 consists of 4 loops. An input pattern will be
circulated in the lower two loops first. Inherent noise in the input pattern will be
suppressed (this is controlled by the parameters a and b and the feedback function
f(·)) and prominent features in it will be accentuated. Then the enhanced input
pattern will be passed to the upper two F1 loops and will excite the neurons in the
“category representation” field F2 via the bottom-up weights. The “established
class” neuron in F2 that receives the strongest stimulation will fire. This neuron will
read out a “top-down expectation” in the form of a set of top-down weights
sometimes referred to as class templates. This top-down expectation will be compared
against the enhanced input pattern by the vigilance mechanism. If the vigilance test
is passed, the top-down and bottom-up weights will be updated and, along with the
enhanced input pattern, will circulate repeatedly in the two upper F1 loops until
stability is achieved. The time taken by the network to reach a stable state depends
on how close the input pattern is to passing the vigilance test. If it passes the
test comfortably, i.e. the input pattern is quite similar to the top-down expectation,
stability will be quick to achieve. Otherwise, more iterations are required. After the
top-down and bottom-up weights have been updated, the current firing neuron will
become an established class neuron. If the vigilance test fails, the current firing
neuron will be disabled. Another search within the remaining established class
neurons in the F2 layer will be conducted. If none of the established class neurons has
a top-down expectation similar to the input pattern, an unoccupied F2 neuron will
be assigned to classify the input pattern. This procedure repeats itself until either
all the patterns are classified or the memory capacity of F2 has been exhausted.
The basic ART-2 training algorithm can be summarised as follows:
(i) initialising the top-down and bottom-up long term memory traces;
(ii) presenting an input pattern from the training data set to the network;
(iii) triggering the neuron with the highest total input in the category representation
field;
(iv) checking the match between the input pattern and the exemplar in the top-
down filter (long term memory) using a vigilance parameter;
(v) starting the learning process if the mismatch is within the tolerance level
defined by the vigilance parameter and then going to step (viii); otherwise,
moving to the next step;
(vi) disabling the current active neuron in the category representation field and
returning to step (iii); go to step (vii) if all the established classes have been
tried;
(vii) establishing a new class for the given input pattern;
(viii) repeating (ii) to (vii) until the network stabilises or a specified number of
iterations are completed.
In the recall mode, only steps (ii), (iii), (iv) and (viii) will be utilised.

Dynamics of ART-2: The dynamics of the ART-2 network illustrated in Figure 12
are controlled by a set of mathematical equations. They are as follows:

(28)  $w'_i = I_i + a u'_i$

(29)  $x'_i = \dfrac{w'_i}{\|W'\|}$

(30)  $v'_i = f(x'_i) + b f(q'_i)$

(31)  $u'_i = \dfrac{v'_i}{\|V'\|}$

(32)  $p'_i = u'_i$

(33)  $q'_i = \dfrac{p'_i}{\|P'\|}$

(34)  $w_i = q'_i + a u_i$

(35)  $x_i = \dfrac{w_i}{\|W\|}$

(36)  $v_i = f(x_i) + b f(q_i)$

(37)  $u_i = \dfrac{v_i}{\|V\|}$

(38)  $p_i = u_i + \sum_j g(Y_j) z_{ji}$

(39)  $q_i = \dfrac{p_i}{\|P\|}$

Figure 12. Architecture of an ART-2 network (field F1 with the lower loops $w'_i, x'_i, v'_i, u'_i, p'_i, q'_i$ and the upper loops $w_i, x_i, v_i, u_i, p_i, q_i$; field F2 with outputs $Y_j$, $g(Y_j) = d$ and weights $Z_{ij}$, $Z_{ji}$; vigilance mechanism with $r_i$, $c p_i$, $\rho$ and F2 reset)

The symbol $\|X\|$ represents the L2 norm of the vector X. If $X = (x_1, x_2, \ldots, x_n)$,
then $\|X\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$. The output of the jth neuron in the classification
layer is denoted by $g(Y_j)$. The L2 norm is used in the equations for the purpose of
normalising the input data. The function f(·) used in Equations (30) and (36) is a
non-linear function whose purpose is to suppress the noise in the input
pattern down to a prescribed level. The definition of f(·) is

(40)  $f(x) = \begin{cases} 0 & \text{if } 0 \le x < \theta \\ x & \text{if } x \ge \theta \end{cases}$

where $\theta$ is a user-defined parameter with a value between 0 and 1.
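
As an illustration of how the lower two F1 loops enhance a pattern, the following sketch iterates Equations (28) to (33) with the threshold function of Equation (40). The parameter values (a = b = 10, threshold 0.2, 20 iterations), the function names and the small constant added to each norm to avoid division by zero are assumptions chosen for the example, and the input is assumed to be non-negative.

```python
import math


def f(x, theta):
    """Equation (40): suppress components below the noise threshold theta."""
    return x if x >= theta else 0.0


def unit(vec, tiny=1e-12):
    """Divide a vector by its L2 norm; `tiny` (an assumption) avoids 0/0."""
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / (norm + tiny) for c in vec]


def lower_loops(I, a=10.0, b=10.0, theta=0.2, n_iter=20):
    """Iterate Equations (28)-(33) and return the enhanced pattern q'."""
    u = [0.0] * len(I)
    q = [0.0] * len(I)
    for _ in range(n_iter):
        w = [Ii + a * ui for Ii, ui in zip(I, u)]                        # (28)
        x = unit(w)                                                      # (29)
        v = [f(xi, theta) + b * f(qi, theta) for xi, qi in zip(x, q)]    # (30)
        u = unit(v)                                                      # (31)
        p = u[:]                                                         # (32)
        q = unit(p)                                                      # (33)
    return q


q_prime = lower_loops([0.9, 0.1, 0.8, 0.05])
```

In this example the two small components of the input fall below the threshold after normalisation and are removed, while the two prominent components are retained in a unit-length vector.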

Learning Mechanism of ART-2: When an input pattern is applied to the ART-2
network, it will pass through the 4 loops comprising F1 and then stimulate the
classification neurons in F2. The total excitation received by the jth neuron in the
classification layer is equal to $T_j$ where

(41)  $T_j = \sum_i p_i z_{ij}$

The neuron which is stimulated by the strongest total input signal will fire by
generating an output with the constant value d. Therefore, for the winning neuron,
$g(Y_j)$ equals d. When a winning neuron is determined, all the other neurons will
be prohibited from firing. The value d will be used to multiply the top-down
expectation of the firing class before the top-down expectation pattern is read out
for comparison in the vigilance test. When the winning neuron fires, all the other
neurons are inhibited from firing so it can be inferred that when there is a firing
neuron (say j), Equation (38) becomes:

(42)  $p_i = u_i + d z_{ji}$

otherwise, if there is no winning neuron, it can be simplified as:

(43)  $p_i = u_i$

The top-down expectation pattern is merged with the enhanced input pattern at
point $r_i$ before they enter the vigilance test (see Figure 12). $r_i$ is defined by

(44)  $r_i = \dfrac{q'_i + c p_i}{\|Q'\| + \|cP\|}$

The vigilance test is failed and the firing neuron will be reset if the following
condition is true:

(45)  $\dfrac{\rho}{\|R\|} > 1$

where $\rho$ is the vigilance parameter.
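
The vigilance test of Equations (42), (44) and (45) can be sketched as follows. The function name `vigilance_reset`, the argument names and the numerical values used in the checks are hypothetical; the sketch assumes c > 0 and that the vectors are not all zero.

```python
import math


def l2(vec):
    """L2 norm of a vector."""
    return math.sqrt(sum(c * c for c in vec))


def vigilance_reset(q_prime, u, z_top_down, d, c, rho):
    """Return (reset, r) for a firing F2 neuron with top-down weights z_top_down."""
    p = [ui + d * zi for ui, zi in zip(u, z_top_down)]          # Equation (42)
    denom = l2(q_prime) + c * l2(p)                             # ||Q'|| + ||cP||, c > 0
    r = [(qi + c * pi) / denom for qi, pi in zip(q_prime, p)]   # Equation (44)
    return rho / l2(r) > 1, r                                   # Equation (45)
```

With all top-down weights at zero, p equals u; if u coincides with the enhanced input q', the norm of r is 1 and no reset occurs for any vigilance value below 1. A top-down expectation that points away from the input shortens r and can trigger a reset.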
On the other hand, if the vigilance test is passed (in other words, the current
input pattern can be accepted as a member of the firing neuron), the top-down and
the bottom-up weights are updated so that the special features present in the current
input pattern can be incorporated into the class exemplar represented by the firing
neuron. The updating equations are as follows:
(46)  $\dfrac{\mathrm{d}}{\mathrm{d}t} z_{ji} = d\,(p_i - z_{ji})$

(47)  $\dfrac{\mathrm{d}}{\mathrm{d}t} z_{ij} = d\,(p_i - z_{ij})$

The bottom-up weights are denoted by $z_{ij}$ and the top-down weights by $z_{ji}$.
According to the recommendations in [Carpenter and Grossberg, 1987], all the top-down
weights should be initialised with the value 0 at the beginning of the learning
process. This can be expressed by the following equation:

(48)  $z_{ji}(0) = 0$

This measure is designed to prevent a neuron from being reset when it is allocated
to classify an input pattern for the first time. The bottom-up weights are initialised
using the equation:

(49)  $z_{ij}(0) = \dfrac{1}{(1 - d)\sqrt{M}}$

where M is the number of neurons in the input layer. This number is equal to the
dimension of the input vectors. This arrangement ensures that after all the neurons
with the top-down expectations similar to the input pattern have been searched, it
would be easy for the input pattern to access a previously uncommitted neuron.
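
A small sketch of the weight initialisation in Equations (48) and (49) and of the learning equations (46) and (47) is given below. The explicit Euler integration, the step size `dt` and the function names are assumptions made for illustration, and p is held constant during the update for simplicity, whereas in the network p itself depends on the top-down weights through Equation (42).

```python
import math


def init_weights(M, N, d):
    """Initial top-down (48) and bottom-up (49) weights for M inputs, N F2 neurons."""
    top_down = [[0.0] * M for _ in range(N)]
    bottom_up = [[1.0 / ((1.0 - d) * math.sqrt(M))] * M for _ in range(N)]
    return top_down, bottom_up


def euler_update(z, p, d, dt=0.1):
    """One explicit Euler step of dz/dt = d (p - z), Equations (46) and (47)."""
    return [zi + dt * d * (pi - zi) for zi, pi in zip(z, p)]


top_down, bottom_up = init_weights(M=4, N=3, d=0.9)
target = [1.0, 2.0, 0.0, 0.5]          # a fixed, made-up pattern p
z = top_down[0]
for _ in range(200):
    z = euler_update(z, target, d=0.9)
```

With M = 4 and d = 0.9 each bottom-up weight starts at 5, and under a fixed p the updated weight vector relaxes towards p.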

2.10 Spiking Neural Network


Experiments with biological neural systems have shown that they use the timing
of electrical pulses or “spikes” to encode and transmit information. Spiking neural
networks, also known as pulsed neural networks, are attempts at modelling the
operation of biological neural systems more closely than is the case with other
artificial neural networks.
An example of a spiking neural network is shown in Figure 13. Each connection
between neurons i and j may consist of multiple sub-connections, each with its own
weight and delay [Natschläger and Ruf, 1998].
Figure 13. Spiking neural network topology showing a single connection composed of multiple
weights $w_{ij}^k$ with corresponding delays $d_{ij}^k$

Figure 14. Different shapes of response functions $\varepsilon_{ij}(t - s)$. a) Excitatory post synaptic potential (EPSP)
function b) Inhibitory post synaptic potential (IPSP) function

In the leaky integrate-and-fire model proposed by Maass [Maass, 1997], a neuron
is regarded as a homogeneous unit that generates spikes when the total excitation
exceeds a threshold value.
Consider a network that consists of a finite set V of such spiking neurons, a set
$E \subseteq V \times V$ of synapses, a set of weights $w_{uv} \ge 0$, a response function
$\varepsilon_{uv}: R^+ \to R$ for each synapse $(u, v) \in E$, where $R^+ = \{x \in R : x \ge 0\}$, and a threshold
function $\Theta_v: R^+ \to R$ for each neuron $v \in V$. If $F_u \subseteq R^+$ is the set of firing times of a neuron
u, then the potential at the trigger zone of each neuron v at time t is given by:

(50)  $P_v(t) = \sum_{u:\,(u,v) \in E} \; \sum_{s \in F_u:\, s < t} w_{uv}\, \varepsilon_{uv}(t - s)$

In the simplest model of a spiking neuron, a neuron v fires whenever its potential
$P_v(t)$ reaches a certain threshold $\Theta_v(t)$. This potential $P_v(t)$ is the sum of the
so-called excitatory post synaptic potentials (EPSPs) and inhibitory post synaptic
potentials (IPSPs), which result from the firing of other neurons u that are connected
through a synapse to neuron v. The firing of a presynaptic neuron u at time s
contributes to the potential $P_v(t)$ at time t an amount that is modeled by the term
$w_{uv}\, \varepsilon_{uv}(t - s)$, where $w_{uv}$ is the weight of the connection between neurons u and
v, and $\varepsilon_{uv}(t - s)$ is the response function. Some biologically realistic shapes of the
post synaptic potentials (PSPs) are shown in Figure 14. The change in threshold as
a function of time is illustrated in Figure 15.
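
Equation (50) can be illustrated with a short sketch. The response function used here, $(t/\tau)\,e^{1 - t/\tau}$ for t > 0, is only one commonly used EPSP shape and is an assumption, as are the time constant, the constant threshold of 1.5, the neuron labels and the spike times.

```python
import math


def epsp(t, tau=5.0):
    """Assumed response function: (t / tau) * exp(1 - t / tau) for t > 0, else 0."""
    return (t / tau) * math.exp(1.0 - t / tau) if t > 0 else 0.0


def potential(t, firing_times, weights, response=epsp):
    """Equation (50): sum w_uv * eps_uv(t - s) over presynaptic spikes with s < t."""
    return sum(
        weights[u] * response(t - s)
        for u, spikes in firing_times.items()
        for s in spikes
        if s < t
    )


def fires(t, firing_times, weights, threshold=1.5):
    """Neuron v fires when P_v(t) reaches the (here constant) threshold."""
    return potential(t, firing_times, weights) >= threshold


firing_times = {"u1": [0.0, 20.0], "u2": [0.0]}   # made-up presynaptic spike times
weights = {"u1": 1.0, "u2": 1.0}
```

At t = 5 each of the two spikes emitted at time 0 contributes its peak value of 1, so the potential is 2 and the neuron fires; the spike at t = 20 lies in the future and is ignored.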
Learning can be achieved in spiking neural networks as in traditional neural
networks for tasks such as classification, pattern recognition and function
approximation [Iannella and Back, 2001; Bohte et al., 2002a; 2002b]. Different learning
algorithms have been proposed for the training of spiking neural networks [Maass
and Bishop, 1998; Gerstner and Kistler, 2002].

Figure 15. Firing threshold of a neuron (threshold plotted against time)

A supervised learning algorithm has been proposed [Bohte et al., 2002b], where
it is shown that a feedforward network of spiking neurons can be trained for
classification tasks by means of error backpropagation. Unsupervised learning can also
be achieved for spiking neural networks [Bohte et al., 2002a] through self-organisation,
as in Radial Basis Function (RBF) networks.

3. SUMMARY
This chapter has presented the main types of existing neural networks and has
described examples of each type. For an overview of the different systems engi-
neering applications of these neural networks, see the chapter on Soft Computing
and its Applications in Engineering and Manufacture for example.

4. ACKNOWLEDGEMENTS
This work was carried out within the ALFA project “Novel Intelligent Automation
and control systems 11” (NIACS 11), the ERDF (Objective One) projects
“Innovation in Manufacturing Centre” (IMC), “Innovative Technologies for
Effective Enterprises” (ITEE) and “Supporting Innovative Product Engineering and
Responsive Manufacturing” (SUPERMAN) and within the project “Innovative Production
Machines and Systems” (I*PROMS).

REFERENCES
Albus J S, (1975a), “A new approach to manipulator control: cerebellar model articulation control
(CMAC)”, Trans. ASME, J. of Dynamics Syst., Meas. and Contr., 97, 220–227.
Albus J S, (1975b), “Data storage in the cerebellar model articulation controller (CMAC)”, Trans. ASME,
J. of Dynamics Syst., Meas. and Contr., 97, 228–233.
Albus J S, (1979a), “A model of the brain for robot control”, Byte, 54–95.
Albus J S, (1979b), “Mechanisms of planning and problem solving in the brain”, Math. Biosci., 45,
247–293.
An P E, Brown M, Harris C J, Lawrence A J and Moore C J, (1994), “Associative memory neural
networks: adaptive modelling theory, software implementations and graphical user”, Engng. Appli.
Artif. Intell., 7 (1), 1–21.
Bohte S M, La Poutre H and Kok J N, (2002a), “Unsupervised clustering with spiking neurons by sparse
temporal coding and multilayer RBF networks”, IEEE Trans. on Neural Networks, 13 (2), 415–425.
Bohte S M, La Poutre H and Kok J N, (2002b), “Error-backpropagation in temporally encoded networks
of spiking neurons”, Neurocomputing, 48, 17–37.
Broomhead D S and Lowe D, (1988), “Multivariable functional interpolation and adaptive networks”,
Complex Systems, 2, 321–355.
Carpenter G A and Grossberg S, (1987), “ART2: Self-organisation of stable category recognition codes
for analog input patterns”, Appl. Optics, 26 (23), 4919–4930.
Carpenter G A and Grossberg S, (1988), “The ART of adaptive pattern recognition by a self-organising
neural network”, Computer, 77–88.
Cichocki A and Unbehauen R, (1993), Neural Networks for Optimisation and Signal Processing,
Chichester: Wiley.
Elman J L, (1990), “Finding structure in time”, Cognitive Science, 14, 179–211.
Gerstner W and Kistler W M, (2002), Spiking Neuron Models: Single Neurons, Populations and
Plasticity, Cambridge University Press, UK.
Goldberg D, (1989), Genetic Algorithms in Search, Optimisation and Machine Learning, Reading, MA:
Addison-Wesley.
Hassoun M H, (1995), Fundamentals of Artificial Neural Networks, MIT Press, Cambridge, MA.
Haykin S, (1999), Neural Networks: A Comprehensive Foundation, 2nd Edition, Upper Saddle River,
NJ: Prentice Hall.
Hecht-Nielsen R, (1990), Neurocomputing, Reading, MA: Addison-Wesley.
Holland J H, (1975), Adaptation in Natural and Artificial Systems, Ann Arbor, MI: University of
Michigan Press.
Hopfield J J, (1982), “Neural networks and physical systems with emergent collective computational
abilities”, Proc. National Academy of Sciences, 79, 2554–2558.
Iannella N and Back A D, (2001), “Spiking neural network architecture for nonlinear function
approximation”, Neural Networks, Special Issue, 14 (6), 922–931.
Jordan M I, (1986), “Attractor dynamics and parallelism in a connectionist sequential machine”, Proc.
8th Annual Conf. of the Cognitive Science Society, 531–546.
Karaboga D, (1994), Design of Fuzzy Logic Controllers Using Genetic Algorithms, PhD thesis, University
of Wales, Cardiff, UK.
Kohonen T, (1989), Self-Organising and Associative Memory (3rd ed.), Berlin: Springer-Verlag.
Maass W, (1997), “Networks of spiking neurons: The third generation of neural network models”,
Neural Networks, 10, 1659–1671.
Maass W and Bishop C M, (1998), Pulsed Neural Networks, Cambridge: MIT Press.
Moody J and Darken C J, (1989), “Fast learning in networks of locally-tuned processing units”, Neural
Computation, 1 (2), 281–294.
Natschläger T and Ruf B, (1998), “Spatial and temporal pattern analysis via spiking neurons”, Network:
Computation in Neural systems, 9 (3), 319–332.
Pham D T and Chan A J, (1998), “Control chart pattern recognition using a new type of self-organising
neural network”, Proc. of the Institution of Mechanical Engineers, 212 (Part I), 115–127.
Pham D T and Chan A J, (2001), “Unsupervised adaptive resonance theory neural networks for control
chart pattern recognition”, Proc. of the Institution of Mechanical Engineers, 215 (Part B), 59–67.
Pham D T and Karaboga D, (1993), “Dynamic system identification using recurrent neural networks and
genetic algorithms”, Proc. 9th Int. Conf. on Mathematical and Computer Modelling, San Francisco.
Pham D T and Liu X, (1992), “Dynamic system modelling using partially recurrent neural networks”,
Journal of Systems Engineering, 2 (2), 90–97.
Pham D T and Liu X, (1994), “Modelling and prediction using GMDH networks of Adalines with
nonlinear preprocessors”, Int. J. Systems Science, 25 (11), 1743–1759.
Pham D T and Oh S J, (1992), “A recurrent backpropagation neural network for dynamic system
identification”, Journal of Systems Engineering, 2 (4), 213–223.
Pham D T and Oztemel E, (1994), “Control chart pattern recognition using learning vector quantization
networks”, Int. J. Production Research, 32 (3), 721–729.
Rumelhart D and McClelland J, (1986), Parallel Distributed Processing: Explorations in the
Microstructure of Cognition, volumes 1 and 2, Cambridge: MIT Press.
Widrow B and Hoff M E, (1960), “Adaptive switching circuits”, Proc. 1960 IRE WESCON Convention
Record, Part 4, IRE, New York, 96–104.
CHAPTER 4
APPLICATION OF NEURAL NETWORKS

D. ANDINA1 , A. VEGA-CORONA2 , J. I. SEIJAS3 , M. J. ALARCÓN


1 Departamento de Señales, Sistemas y Radiocomunicaciones (SSR), Universidad Politécnica
de Madrid (UPM), Ciudad Universitaria C.P. 28040, Madrid, España. andina@gc.ssr.upm.es
2 Facultad de Ingeniería, Mecánica, Eléctrica y Electrónica (FIMEE), Universidad de Guanajuato
(UG), Salamanca, Gto., México. tono@salamanca.ugto.mx
3 Departamento de Señales, Sistemas y Radiocomunicaciones (SSR), Universidad Politécnica
de Madrid (UPM), Ciudad Universitaria C.P. 28040, Madrid, España. jseijas@gc.ssr.upm.es

Abstract: This chapter first discusses the factors that should be considered when deciding
whether a Neural Network (NN) solution is suitable for a given problem. This is
followed by a detailed example of a successful and useful application: a Neural Binary
Detector.

D. Andina and D.T. Pham (eds.), Computational Intelligence, 93–108.
© 2007 Springer.

INTRODUCTION
There are three main criteria which need to be applied when deciding whether a
given problem lends itself to a neural solution.
a) Unknown Algorithm. The solution to the problem cannot be explicitly described
by an algorithm, a set of equations or a set of rules. Making the decision between
a conventional and a neural computing solution on the basis of this criterion
is not always clear-cut. There are problems where both conventional and
neural computing may be able to provide appropriate solutions. The choice then
depends on the resources available and the ultimate goals of the designer.
b) Conviction of the existence of a solution. There must exist some evidence
that an input-output mapping exists between a set of input variables x and
corresponding output data y, such that y = f(x). The form of f, however, is
not known. For example, let us suppose that we have input data patterns and
corresponding desired output patterns. We intend to use a supervised algorithm
to train a NN that will establish high-order (non-linear) correlations between the
input features x and the output data y in order to minimize the output error. However,
the fact that a database with a set of input and output pairs can be assembled
to train the NN does not necessarily mean that a mapping can be constructed
between the input variables and the desired data. Also, no matter how good the
algorithm is, efficient results will not be achieved unless relevant features are
used as the input to the network.
c) Availability of data. There should be a large amount of data available, i.e. many
different examples with which to train the network. If there are any doubts about
their availability then, in all probability, the NN will not solve
the problem effectively. Lack of suitable data is one of the main causes of problems in
neural computing applications. Such data has to be collected and compiled in
a computer readable form. The designer of the NN application must take into account
that there may be considerable practical problems associated with collecting
and processing data; special instrumentation and recording facilities may be
required, and specific experiments may be needed to ensure that the data cover
the necessary range of conditions. For example, if a NN is to be applied to
detect signals in noisy environments, data corresponding to all expected values
of noise and signal should be used to train the network.

1. FEASIBILITY STUDY

If a given application meets the three main criteria stated above, the designer
has to choose the type of NN to apply. A thorough literature search for references
to related work should be carried out at this design stage. Nowadays, neural
computing has a large applications literature and it is highly likely that there are
papers on applications similar to the one under consideration.
Once the literature search has been carried out, an outline design should be
prepared as early as possible in the feasibility study stage. This may only be a paper
exercise, but it could extend to the collection of a sample of data and even some
prototype work. Especially relevant are the pre-processing of the input
to extract and select the features that characterize the problem, and a preliminary
assessment of the required neural network architecture. If the prototype work is
successful, and the designer has sufficient training data available, the design of
the NN can start, a task that involves a lot of empirical research. That will be
shown in the following section.

2. APPLICATION OF NNs TO BINARY DETECTION

It is expected that neural hardware will provide crucial tools for the enhancement and
development of algorithms applied to automatic classification. In this sense, some
relevant studies about the possibilities of neural nets [Kim and Guest, 1990, Decatur,
1992, Roth, 1990] have selected four principal areas of application: automatic
target recognition, speech recognition, seismic signal processing and sonar signal
processing.
The suitability of Neural Networks (NNs) as detectors is based on their capability
to incorporate a great quantity of information of several classes, their ability
to generalize from noisy or incomplete information, and their robustness to
ambiguities. Another advantage of NNs is the computational simplicity of their nodes.
Although they are computationally intensive on general purpose computers, they can
be implemented in specific massively parallel processing hardware that can outperform
implementations on the most powerful computers.
Neural networks can have interesting robust capabilities when applied as binary
detectors. This type of network has proved its abilities in classification problems,
and the binary detection problem can be reduced to deciding whether an input
has to be classified as one of two outputs, 0 or 1. However, NN detectors have
some typical drawbacks: slow and unpredictable training, and difficulty in
adding new information to the net, since retraining takes too much time.
These problems become more critical as the size of the net increases.
This application example deals with the possibilities of applying a Multilayer
Perceptron (MLP) Neural Network to binary detection. After an optimization process,
the performance of the proposed neural detector is compared with the optimal one,
given by the Optimum Neyman-Pearson detector, commonly
used in radar detection applications.
The detector input is modeled as a signal with additive noise (given by its complex
envelope). The binary output is 1 or 0. As the MLP output can accurately estimate
the a posteriori probabilities of the input classes [Ruck et al., 1990], a threshold
T is established at the NN output to satisfy the Neyman-Pearson requirements
[Van Trees, 1968]. The performance is evaluated by Monte-Carlo simulations over
several input models, from which the Receiver Operating Characteristics (ROC) and detection
curves are obtained.
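
One simple way to set the output threshold in a Monte-Carlo setting is sketched below: T is taken from the ranked network outputs obtained for noise-only inputs so that the empirical false alarm rate does not exceed the required value, and the detection probability is then estimated on outputs obtained with a target present. This empirical-quantile procedure, the function names and the sample values are assumptions for illustration and are not taken from the chapter; tied output values could make the empirical false alarm rate exceed the target.

```python
import math


def np_threshold(h0_outputs, pfa):
    """Pick T so that at most a fraction pfa of noise-only outputs satisfy y >= T."""
    ranked = sorted(h0_outputs, reverse=True)
    k = math.floor(pfa * len(ranked) + 1e-9)   # false alarms allowed in this sample
    if k == 0:
        return 1.0                             # outputs lie in (0, 1): never decide D1
    return ranked[k - 1]                       # k-th largest noise-only output


def rate(outputs, T):
    """Fraction of outputs classified as 1 (decision D1)."""
    return sum(y >= T for y in outputs) / len(outputs)


h0 = [0.05, 0.10, 0.20, 0.30, 0.02, 0.40, 0.15, 0.25, 0.35, 0.01]  # made-up values
h1 = [0.90, 0.30, 0.60, 0.36]                                       # made-up values
T = np_threshold(h0, pfa=0.2)
```

In a realistic simulation the number of noise-only trials has to be much larger than the reciprocal of the required false alarm probability for the estimated threshold to be meaningful.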
The main topics in designing the network are:
(a) The network structure design. Variations in performance are analyzed for
different numbers of inputs and hidden layers, and different numbers of nodes in
these hidden layers.
(b) The network training. Using the BackPropagation (BP) algorithm with a momentum
term, we study the training parameters: initial set of weights, threshold
value for training, momentum value, and the training method. The preparation
of the training set is discussed; preliminary experimental results show its key
parameter to be the Training Signal-to-Noise Ratio (TSNR). Also, it is shown
how to find the TSNR value that maximizes the detection probability ($P_d$) for
a given false alarm probability ($P_{fa}$), as is required by the Neyman-Pearson
criterion.
(c) The dependence of the training and the performance on the criterion function
to be minimized by the BP algorithm. The following criterion functions
are analyzed: Least Mean Square (LMS) Error [Hush and Horne, 1993],
Minimum Misclassification Error (MME) [Telfer and Szu, 1994, Telfer and Szu,
1992] and El-Jaroudi and Makhoul (JM) criterion [El-Jaroudi and Makhoul,
1990].
(d) The results (training time, threshold values, error probability curves, ROC and
detection curves) are analyzed, and appropriate conclusions are drawn.

3. THE NEURAL DETECTOR


The detector under consideration is a modified envelope detector [Andina and
Sanz-González, 1995, Root, 1970], as is shown in Figure 1. The binary detection
problem is reduced to deciding whether an input complex value (the complex envelope
involving signal and noise) has to be classified as one of two outputs, 0 or 1.
Processing complex signals with an all-real coefficient NN requires
splitting the input into its real and imaginary parts (the number of inputs is twice the
number of integrated pulses); then, a threshold T is established at the NN output.
The input $r(t)$ is a band-pass signal, and the complex envelope $x(t) = x_C(t) + j x_S(t)$
is sampled every $T_0$ seconds. Then

(1)  $x(kT_0) = x_C(kT_0) + j x_S(kT_0), \quad k = 1, \ldots, M; \quad j = \sqrt{-1}$

At the neural network output, values in $(0, 1)$ are obtained. A threshold value
$T \in (0, 1)$ is chosen so that output values in $(0, T)$ will be considered as binary
output 0 (decision $D_0$) and values in $[T, 1)$ will represent 1 (decision $D_1$).
The two hypotheses $H_0$ (target absent) and $H_1$ (target present) are defined as
follows

(2a)  $H_0: \; x(kT_0) = n(kT_0)$

(2b)  $H_1: \; x(kT_0) = S(kT_0) \cdot e^{j\theta(kT_0)} + n(kT_0)$

where $T_0$ is the pulse repetition period, k varies from 1 to the number of integrated
pulses M, $S(kT_0)$ is the signal amplitude sequence, $\theta(kT_0)$ is the signal phase and
$n(kT_0)$ is the complex envelope of the noise sequence, i.e. $n(kT_0) = n_C(kT_0) + j n_S(kT_0)$.
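
The following sketch generates one input pattern for the network under either hypothesis, splitting the M complex samples into 2M real inputs. Several choices are assumptions made for illustration only: a constant amplitude, a single unknown phase drawn uniformly for the whole observation, Gaussian noise whose power is shared equally by the in-phase and quadrature components, and an input ordering with all real parts first.

```python
import math
import random


def make_pattern(M, amplitude, noise_power, target_present, rng):
    """Return the 2M real NN inputs for one observation under H0 or H1."""
    sigma = math.sqrt(noise_power / 2.0)       # assumed: equal power in n_C and n_S
    theta = rng.uniform(0.0, 2.0 * math.pi)    # assumed: one unknown phase per observation
    real, imag = [], []
    for _ in range(M):
        s = amplitude if target_present else 0.0
        real.append(s * math.cos(theta) + rng.gauss(0.0, sigma))   # x_C(kT0)
        imag.append(s * math.sin(theta) + rng.gauss(0.0, sigma))   # x_S(kT0)
    return real + imag                         # assumed ordering: all x_C, then all x_S


rng = random.Random(0)
h1_pattern = make_pattern(8, amplitude=2.0, noise_power=1.0, target_present=True, rng=rng)
h0_pattern = make_pattern(8, amplitude=2.0, noise_power=1.0, target_present=False, rng=rng)
```

With M = 8 integrated pulses each pattern has 16 real components, which matches a network with 16 input nodes.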

3.1 The Multi-Layer Perceptron (MLP)


NNs have clear advantages over classical detectors as nonparametric detectors.
In this case, the statistical description of the input is not available. The only
information available in the design process is the performance of the detector on a
group of patterns called training patterns. For this task, the BP algorithm implicitly
builds a histogram of the input distribution, adapting freely to the distributions
of each class; NNs thus add a new level of sophistication to the classical
techniques of nonparametric detection.

Figure 1. The Neural Detector (quadrature demodulation of r(t) with low-pass filters (LPF), delay lines feeding the sampled $x_C(k)$ and $x_S(k)$ to the MLP, and a comparison device with threshold T producing D0 or D1)
For detection purposes, the Multi-Layer Perceptron (MLP) trained with the
BackPropagation (BP) algorithm has been found more powerful than other types of NNs
[Kim and Guest, 1990]. While other types of NNs can learn topological maps using
lateral inhibition and Hebbian learning, a MLP trained with BP can also discover
topological maps. Furthermore, the MLP can even be superior to parametric
Bayesian detectors when the input distribution departs from the assumptions.

3.2 The MLP Structure


Since mathematical methods to calculate the dimensions of the net are not yet
available, one main question is: “How should the net be designed in order to obtain
the desired decision regions in a reasonable amount of time?”. Generally, each
problem requires different capacities of the net and it is not clear that a general rule
to calculate its size will be found. If the net is very small, it may not be capable
of forming a good model for the problem; if it is too large, it could implement
several solutions, and many of them would probably be suboptimal.
For the majority of problems, only one hidden layer has proved
necessary, and this seems to be the case for binary detection. After a thorough
study of different MLP structures, which included growing and pruning of the NN,
no performance improvement was found by adding more than
one hidden layer, while the training time increased critically.
So, the structure we have chosen is a MLP with one hidden layer, and we write
2M/N/1 for a MLP with 2M input nodes, N hidden layer nodes and one output
node.
Also, there is no way of establishing a priori the number of nodes in the hidden
layer. For realizing exactly the input-output relation, it has been demonstrated that
an upper bound for the number of nodes in the hidden layer of a MLP is of the order of the
number of training patterns [Hush and Horne, 1993]. But, for practical purposes, the
number of hidden nodes should be much lower than the number of training patterns;
otherwise, the net simply “memorizes” the training set, losing its generalization
capability. In general, the size of the net has to be determined by means of a trial
and error procedure. Therefore, to choose the size of the hidden layer, empirical
curves such as the one presented in Figure 2 have been used [Andina, 1995].
The parameter Training Signal-to-Noise Ratio (TSNR) is the Signal-to-Noise
Ratio used to generate the training set patterns, and is one of the key design
parameters of the NN. If $\sigma^2$ is the noise power and A is the received signal amplitude
(Marcum model [Marcum, 1960]), the Signal-to-Noise Ratio (SNR) is defined as

(3)  $SNR = \dfrac{A^2}{\sigma^2}$

More details will be given later.
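
As a short worked example, and assuming that TSNR values quoted in dB follow the usual power-ratio convention (an assumption, since the convention is not stated here), the signal amplitude needed to generate training patterns can be obtained from the SNR taken as the ratio of $A^2$ to the noise power, written $\sigma^2$ below:

```latex
\[
\mathrm{SNR} = 10^{\mathrm{TSNR_{dB}}/10}, \qquad A = \sqrt{\mathrm{SNR}\cdot\sigma^{2}}
\]
\[
\mathrm{TSNR} = 6\ \mathrm{dB}: \quad \mathrm{SNR} \approx 3.98, \quad A \approx 2.00\,\sigma
\qquad\qquad
\mathrm{TSNR} = 12\ \mathrm{dB}: \quad \mathrm{SNR} \approx 15.85, \quad A \approx 3.98\,\sigma
\]
```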
Figure 2. Error probability ($P_e$, in %) vs. number of nodes (N) in the hidden layer for an MLP, depending
on Training Signal-to-Noise Ratio (TSNR); curves are shown for TSNR = 0, 3, 6 and 12 dB

In Figure 2, each curve presents a knee where the most effective relation between
complexity and performance is verified. After a thorough study of the values of M,
N and TSNR the structure 16/8/1 has been chosen as the most efficient one. In
section 2 of this chapter, it will be shown that there is a range on the number of
integrated pulses M (half the number of inputs) where the NN works efficiently.

4. THE TRAINING ALGORITHM

4.1 The BackPropagation (BP) Algorithm


Even if the size of the net has been precisely determined, finding the adequate
weights is a difficult problem. The BackPropagation (BP) algorithm modifies the
output layer weights during the training as (see Appendix A, section 1)

L L Wt


(4) wij t + 1 = wij t − L
wij t
L
where wij is the weight connecting output from i-th node of the layer L − 1 to the
j-th node in the output layer L; the iteration counter is t, is the learning rate and
W is the criterion or error function, that depends on the weights matrix, W .
APPLICATION OF NEURAL NETWORKS 99

In order to improve the learning time, it is common to include a momentum term
within the basic BP algorithm. This term is $\alpha \, (w_{ij}^{l}(t) - w_{ij}^{l}(t-1))$, where $0 < \alpha < 1$
and $w_{ij}^{l}(t)$ is the weight that connects the i-th output of layer $l-1$ to the j-th node
in layer $l$. By means of the inclusion of this term, the current search direction is
an exponentially weighted average of past directions, helping to keep the weights
moving across flat portions of the error surface after they have descended from
the steep portions. In the case of our binary detector, $\alpha = 0.8$ has been used. This
value has been chosen from empirical results [Andina, 1995].
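
As an illustration of the weight update with the momentum term, the following minimal Python sketch applies the rule to a flat list of weights. The function name `momentum_step`, the learning rate value 0.1 and the toy quadratic criterion are assumptions made only for this example; only the momentum coefficient 0.8 comes from the text.

```python
def momentum_step(w, grad, prev_delta, eta=0.1, alpha=0.8):
    """One BP update with momentum for a flat list of weights.

    w, grad and prev_delta are lists of equal length; prev_delta holds
    w(t) - w(t-1) from the previous iteration (zeros at the start).
    eta = 0.1 is an assumed learning rate; alpha = 0.8 is the momentum value.
    """
    new_w, new_delta = [], []
    for wi, gi, di in zip(w, grad, prev_delta):
        delta = -eta * gi + alpha * di    # gradient step plus momentum term
        new_w.append(wi + delta)
        new_delta.append(delta)
    return new_w, new_delta


# Toy criterion eps(w) = 0.5 * w**2 (hypothetical), whose gradient is simply w.
w, delta = [1.0], [0.0]
for _ in range(200):
    w, delta = momentum_step(w, [w[0]], delta)
```

On this toy criterion the weight oscillates slightly around the minimum while decaying towards zero, which is the typical behaviour of a momentum term with a coefficient close to one.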
The learning procedure used is cross-validation, a method that monitors
the generalization capability of the net. This method requires splitting the learning
data into two sets: a training set, employed to modify the net weights, and a
test set, which is used to measure its generalization capability. During
training, the performance of the net on the training set keeps improving,
but its performance on the test set improves only up to a point, beyond which it
begins to degrade. At this point the net begins to be overtrained (it
is excessively specialized on the training set) and loses generalization capacity;
consequently, the training should be stopped. In practice, the training has been
stopped after a typical number of iterations, choosing the net
with the smallest error probability (Pe) over the test set. Although there is no
guarantee that an absolute minimum has been reached, for the optimized network
the smallest Pe is typically achieved in the order of 1,500 iterations.
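
The stopping rule described above can be sketched as follows. This is only an outline under stated assumptions: `train_step` and `test_error` are hypothetical callbacks standing for one training iteration and for the estimation of Pe over the test set, and the toy "network" at the end is a dictionary used purely to exercise the loop.

```python
import copy


def train_with_validation(net, train_step, test_error, n_iter=1500):
    """Run a fixed number of iterations and keep the net with the lowest test error."""
    best_net, best_pe, best_iter = copy.deepcopy(net), test_error(net), 0
    for t in range(1, n_iter + 1):
        train_step(net)                # modifies the weights using the training set
        pe = test_error(net)           # error estimated over the test set
        if pe < best_pe:               # remember the best network seen so far
            best_net, best_pe, best_iter = copy.deepcopy(net), pe, t
    return best_net, best_pe, best_iter


# Toy stand-ins (hypothetical): the "weights" drift past their best value at 7.
toy_net = {"w": 0}


def toy_step(n):
    n["w"] += 1


def toy_error(n):
    return abs(n["w"] - 7)


best, pe, it = train_with_validation(toy_net, toy_step, toy_error, n_iter=20)
```

Keeping a copy of the best network, rather than halting at the first increase of the test error, matches the practice of running a typical number of iterations and selecting afterwards.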
The Least-Mean-Squares (LMS) criterion is the most widely used for training a
Multi-Layer Perceptron (MLP). However, depending on the application, there is no
reason to think that this criterion is the optimal one [Barnard and Casasent, 1989].
With the purpose of using an adequate criterion function for our detector, we have
analyzed the following criterion functions (see also Appendix A, section 2):
(a) Least-Mean-Squares (LMS). This criterion is the most widely used in the Back
Propagation (BP) learning algorithm. It has been proved that it approximates
the Bayes optimal discriminant function and yields a least-squares estimate of
the a posteriori probability of the class given the input [Ruck et al., 1990].
It minimizes the expression

(5)    $E_{LMS} = \dfrac{1}{P} \sum_{p=1}^{P} \dfrac{1}{2} (y_p - \hat{y}_p)^2$

where $P$ is the number of training pairs, $p$ is the training pair counter, $\hat{y}_p \in (0, 1)$
is the net output (the neural detector has only one output, as
mentioned in Section 3), and $y_p$ is the desired output, 0 or 1 for the binary
case.
One of the drawbacks of LMS is that a least-squares estimate of the probability
can lead to gross errors at low probabilities [El-Jaroudi and Makhoul, 1990].
(b) Minimum Misclassification Error (MME). This criterion minimizes the number
of misclassified training samples. It approximates class boundaries directly from
the criterion function, and it could perform better than LMS for lower-complexity
networks [Telfer and Szu, 1992, Telfer and Szu, 1994]. In our detector, it
minimizes the expression

(6)    $E_{MME} = \dfrac{1}{P} \left[ P - \sum_{p=1}^{P} f\left( (2y_p - 1) \, W_p^{T} Y_p^{L-1} \right) \right]$

where $Y_p^{L-1}$ is the output vector of layer $L-1$, $W_p^{T}$ is the weight vector
of the output layer $L$, and $f(\cdot)$ is the node activation function (see Appendix A,
section 3).
(c) El-Jaroudi and Makhoul (JM) criterion. It is similar to the Kullback-Leibler
information measure and results in superior probability estimates when compared
to least squares [El-Jaroudi and Makhoul, 1990]. For the binary detection
case it minimizes

(7)    $E_{JM} = -\dfrac{1}{P} \sum_{p=1}^{P} \ln\left( 1 - |y_p - \hat{y}_p| \right)$

Details about how this criterion has been implemented are given in Appendix A,
section 4.
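
To make the difference between the three criteria concrete, the sketch below evaluates their per-pattern terms in Python. The function names and the sample values are illustrative assumptions. The MME term is written in terms of the output-node activation `a`, and a logistic sigmoid is assumed for the activation function f, which the criterion definitions themselves do not fix.

```python
import math


def sigmoid(a):
    # Assumed activation function (logistic); not fixed by the criterion definitions.
    return 1.0 / (1.0 + math.exp(-a))


def lms(y, y_hat):
    """Per-pattern LMS term, the summand of Equation (5)."""
    return 0.5 * (y - y_hat) ** 2


def mme(y, a):
    """Per-pattern MME term, 1 - f((2y - 1) a), with a the output-node activation."""
    return 1.0 - sigmoid((2 * y - 1) * a)


def jm(y, y_hat):
    """Per-pattern JM term, the summand of Equation (7)."""
    return -math.log(1.0 - abs(y - y_hat))


def mean_criterion(criterion, targets, outputs):
    return sum(criterion(y, o) for y, o in zip(targets, outputs)) / len(targets)


targets = [0, 1, 1, 0]
outputs = [0.1, 0.9, 0.2, 0.4]          # hypothetical network outputs
e_lms = mean_criterion(lms, targets, outputs)
e_jm = mean_criterion(jm, targets, outputs)
```

For a confidently wrong output, such as a target of 1 with an output of 0.01, the JM term is much larger than the LMS term, since the logarithm grows without bound as the output approaches the wrong extreme.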

4.2 The Training Sets

The training set has been formed by an equal number of signal-plus-noise patterns
and noise-only patterns (i.e., $P(H_0) = P(H_1)$ during the training), presented
alternately, so the desired output switches between $D_0$ (detecting noise) and $D_1$ (detecting
target) at each iteration. Other ways of presenting the training pairs, such as a
random sequence, are also suitable.
The pattern sets are classical input models for radar detection [Marcum, 1960,
Swerling, 1960]. The net inputs are samples of the complex envelope of the signal.
There are NNs with complex coefficients [Kim and Guest, 1990] to process complex
signals, but it seems more convenient to sample the “in-phase” and “quadrature”
components of the signal, obtaining a net with all-real coefficients. This provides generality to this
study, because the same NN can be utilized with a different preprocessing [Andina,
1995].
There are no methods that indicate the exact number of training patterns to be
used. Assuming that the test and training patterns have the same distribution,
bounds have been found on the number of training patterns necessary to achieve a
given error probability: this number is approximately the number of weights divided
by the desired error. If we use this upper limit, the number of patterns necessary
for training becomes prohibitive. After an empirical study, an upper limit of 2,000
training patterns has proved to be sufficient for any criterion function or target
model.
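
A short calculation shows why the theoretical bound becomes prohibitive. The Python sketch below counts the weights of a fully connected 16/8/1 network and divides by a few error values; the helper name `count_weights`, the decision to count one bias weight per node and the chosen error values are assumptions made for illustration.

```python
def count_weights(layers, bias=True):
    """Number of adjustable weights in a fully connected MLP such as 16/8/1."""
    extra = 1 if bias else 0           # one bias weight per receiving node
    return sum((n_in + extra) * n_out for n_in, n_out in zip(layers, layers[1:]))


n_w = count_weights([16, 8, 1])        # 17*8 + 9*1 = 145 weights
for pe in (1e-1, 1e-2, 1e-3):
    print(f"desired error {pe:g}: about {n_w / pe:,.0f} training patterns")
```

Under these assumptions, an error of the order of one in a thousand would already call for more than a hundred thousand patterns, far above the 2,000 found sufficient in the empirical study.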
The use of a model for the laboratory experiments is, in part, unavoidable.
The acquisition of real patterns representative of the environment is expensive,
difficult and sometimes even impossible. We must not overestimate the value of real
data, and it is really difficult to obtain data under all propagation conditions.
Learning from a model during the construction of the system lets the designer
train the MLP easily. The robustness of the system could then be sufficient to
achieve quasi-optimal results (see Section 5.2) over a real input distribution; or, if
there is time for on-line training, the NN could then be refined.

5. COMPUTER RESULTS

In order to make the resulting network independent of the initial weight values, each
network has been trained four times, with random initial values of the weights ranging
between −0.1 and 0.1, selecting as the final network the NN that provides the smallest
probability of error. The training threshold has been set to 0.5.
The signal input model in the following figures corresponds to the Marcum
model [Marcum, 1960], that is

(8)    $x(kT_0) = A \cdot e^{j\theta_0} + n(kT_0)$

where

$A$: signal amplitude, of constant value.
$\theta_0$: initial phase, constant for each input pattern and uniformly distributed between 0 and $2\pi$ from pattern to pattern.
$n(kT_0)$: complex white Gaussian noise of zero mean and variance $\sigma^2$ in each component.

The signal-to-noise ratio for training or testing is defined as

(9)    $\mathrm{TSNR} \ \text{or} \ \mathrm{SNR} = \dfrac{A^2}{\sigma^2}$
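
The generation of training patterns for this input model can be sketched in Python as follows. The function names, the default of 8 pulses (giving 16 real inputs) and the unit noise deviation are assumptions for illustration; the amplitude is derived from the signal-to-noise ratio in dB by taking the ratio as A²/σ², with σ² the variance of each noise component.

```python
import math
import random


def marcum_pattern(m, snr_db, target, sigma=1.0, rng=random):
    """Return 2*m real inputs: in-phase and quadrature samples of m pulses."""
    amplitude = sigma * math.sqrt(10.0 ** (snr_db / 10.0)) if target else 0.0
    theta0 = rng.uniform(0.0, 2.0 * math.pi)    # constant within one pattern
    pattern = []
    for _ in range(m):
        pattern.append(amplitude * math.cos(theta0) + rng.gauss(0.0, sigma))  # in-phase
        pattern.append(amplitude * math.sin(theta0) + rng.gauss(0.0, sigma))  # quadrature
    return pattern


def training_set(n_pairs, m=8, tsnr_db=13.0, seed=0):
    """Alternate noise-only (y = 0) and signal-plus-noise (y = 1) patterns."""
    rng = random.Random(seed)
    data = []
    for p in range(n_pairs):
        y = p % 2
        data.append((marcum_pattern(m, tsnr_db, bool(y), rng=rng), y))
    return data


data = training_set(10)
```

Sampling the in-phase and quadrature components separately keeps all network inputs real, so the same generator can feed a network with real coefficients.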

The parameters for the study are the following:

Training signal-to-noise ratio: TSNR
Input signal-to-noise ratio: SNR
Structure: $N_0/N_1/1$ ($N_0$ inputs / $N_1$ nodes in the hidden layer / 1 output)

Let us call probability of detection, $P_d$, the probability that the detector decides
$D_1$ (target present) under the hypothesis $H_1$ (target present), $P_d = \Pr(D_1/H_1)$, and
false alarm probability, $P_{fa}$, the probability of deciding $D_1$ under $H_0$ (target absent),
$P_{fa} = \Pr(D_1/H_0)$. The performance of the detector is measured by detection curves
($P_d$ vs. SNR, for a fixed $P_{fa}$) and ROC curves (Receiver Operating Characteristic
curves, $P_d$ vs. $P_{fa}$ for a fixed SNR).
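
Both probabilities can be estimated empirically from the outputs of a trained network. The Python sketch below is one simple way of doing so, not necessarily the procedure used in the experiments: a threshold T is chosen from noise-only outputs so that a given fraction exceeds it, and the detection probability is then the fraction of target-present outputs above T. The function names and the integer test values are hypothetical.

```python
def threshold_for_pfa(noise_outputs, pfa):
    """Pick T so that about a fraction pfa (0 <= pfa < 1) of noise-only outputs exceed it."""
    ranked = sorted(noise_outputs, reverse=True)
    k = int(round(pfa * len(ranked)))   # number of false alarms allowed
    return ranked[k]                    # only outputs strictly above T raise an alarm


def rate_above(outputs, threshold):
    """Fraction of outputs that lead to decision D1 for the given threshold."""
    return sum(o > threshold for o in outputs) / len(outputs)


noise_out = list(range(100))            # stand-in for outputs under H0
signal_out = list(range(50, 150))       # stand-in for outputs under H1
T = threshold_for_pfa(noise_out, 0.1)
pfa_est = rate_above(noise_out, T)
pd_est = rate_above(signal_out, T)
```

Repeating the estimate of the detection probability for several signal-to-noise ratios at a fixed threshold gives one detection curve, while sweeping the threshold at a fixed signal-to-noise ratio gives one ROC curve.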

5.1 The Criterion Function

In this section we present significant results of the comparison of criterion functions.

5.1.1 Convergence of the algorithm


The best results are those of the JM criterion, followed by the LMS one. The MME
requires, in general, a higher number of training iterations than the others. In Figure 3, the
results for the JM and LMS criteria are presented for a Training Signal-to-Noise Ratio
(TSNR) of 13 dB, showing that significant convergence improvements are obtained
by using JM instead of LMS. This conclusion does not depend on the value of
TSNR.
Another general result is that raising the TSNR decreases the number of training
iterations, as the decision regions to be separated by the MLP become more distinct.

5.1.2 Detection curves


First, we compare the detection characteristics of the networks with a fixed value of
$P_{fa}$. In Figure 4, we present the best results for each criterion (these results depend
on TSNR [Andina et al., 1995]) for two $P_{fa}$ values: $10^{-2}$ and $10^{-3}$, respectively. As
we can see, the performance differences become clearer as the design conditions
become more restrictive (i.e. lower values of the false alarm probability, $P_{fa}$). The results
show that the JM criterion is the best for our detection problem. These results support
the idea suggested in [El-Jaroudi and Makhoul, 1990] that the estimation of the a
posteriori probabilities carried out by the JM criterion is more accurate than the
LMS one.
The error surface for the JM criterion is also more appropriate for the gradient
search (faster convergence of learning).
The value of TSNR that achieves the best performance is 13 dB for the JM criterion
and 6 dB for LMS. The MME criterion presents the worst characteristics (see also
Figure 5). This criterion does not adapt to the application, because it minimizes the
classification error for a training threshold different from the threshold T used to
achieve the design value of Pfa. Unless a way can be found to estimate the value
of T before the training, the network will be forced to suboptimal performance in
the direct operating mode.

Figure 3. Error probability (Pe) vs. iteration for TSNR = 13 dB (a) JM criterion (b) LMS criterion.

Figure 4. Detection probability (Pd) vs. Signal-to-Noise Ratio (SNR) for an MLP trained with different
criterion functions and Training Signal-to-Noise Ratio (TSNR). LMS6dB means LMS criterion and
TSNR = 6 dB, MME6dB means MME criterion and TSNR = 6 dB, and so on. False alarm probability,
Pfa = (a) 0.01 (curves LMS6dB, MME6dB, JM13dB) (b) 0.001 (curves LMS13dB, MME13dB, JM13dB)

Figure 5. Detection probability (Pd) vs. false alarm probability (Pfa) for an MLP trained with all
the criteria under study (curves LMS6dB, MME13dB, JM13dB). SNR = 6 dB

5.1.3 Receiver Operating Characteristic (ROC) curves


Now, the performance of the networks, as Pd vs. Pfa , under a fixed SNR is presented.
In Figure 5, it is observed that JM provides the best results. The MME criterion
presents the worst characteristics, as in the case of the detection curves.

5.2 Robustness Under Different Target Models

For more complex target models, such as the classical Swerling I (SWI) target model
[Swerling, 1960], where S in Equation (2b) has a Rayleigh distribution, the results
are also very near those of the optimal detector (Neyman-Pearson detector), as shown in
Figure 6.
In Figure 6, an interesting phenomenon is observed. In this case, for the purpose
of evaluating the robustness of the network, a net trained with the simple Marcum
model is evaluated over the SWI case. The network is not only capable of generalizing
and obtaining good results over the more complicated input model, but its results
are also superior to those of the network trained over the same SWI distribution. This
result suggests that the neural network may achieve better results when it is trained over
simple hypotheses and then left to generalize over more complicated ones.
This characteristic may be very useful in real cases, where input distributions will
generally be different from those assumed in the design stage.

Figure 6. Detection probability (Pd) vs. Signal-to-Noise Ratio (SNR, in dB) for an MLP of structure 16/8/1
(16 inputs, 8 nodes in the hidden layer and one output) and different TSNR (curves TSNR6, TSNR13, TSNR15
and Optimum). Pfa = 0.01 (a) Swerling I for training and testing (b) Net trained with Marcum model and
tested over Swerling I target model.

APPENDIX: ON BACKPROPAGATION AND THE CRITERION FUNCTIONS
1. THE BACKPROPAGATION ALGORITHM
The BackPropagation algorithm modifies the weight values of input i in node j,
$w_{ij}$, following the expression

(A.1)    $w_{ij}^{l}(t+1) = w_{ij}^{l}(t) - \eta \, \Delta_{ij}^{l}(t)$

where

$l$ is the layer counter: $l = 0, \ldots, L$. $l = 1$ is the first hidden layer, $l = L$ is the output layer.
$t$ is the iteration counter.
$\eta$ is the learning rate.
$\Delta$ is the gradient of the criterion function, $\varepsilon(W)$.

Let us define the error term for each node as

(A.2)    $\delta_j^{l} = -\dfrac{\partial \varepsilon(W)}{\partial \hat{y}_j^{l}} \cdot f_j'^{\,l} = -\dfrac{\partial \varepsilon(W)}{\partial w_{ij}^{l}} \cdot \dfrac{1}{\hat{y}_i^{l-1}}$

where $\hat{y}_j^{l}$ is the actual output of node j in layer l, $f_j'^{\,l}$ is the derivative of the
activation function of the same node, and

(A.3)    $W_j^{l} = (w_{0j}^{l}, w_{1j}^{l}, \ldots, w_{N_{l-1} j}^{l})$

Then, we have

(A.4)    $\Delta_{ij}^{l} = \dfrac{\partial \varepsilon(W)}{\partial w_{ij}^{l}} = -\delta_j^{l} \cdot \hat{y}_i^{l-1}$

Applying the chain rule, the error gradient for each node can also be expressed as a
function of the error terms in the previous layers, and Equation (A.1) can be expressed
as

(A.5)    $w_{ij}^{l}(t+1) = \begin{cases} w_{ij}^{L}(t) + \eta \, \hat{y}_i^{L-1}(t) \, \delta_j^{L}(t), & l = L \\ w_{ij}^{l}(t) + \eta \, \hat{y}_i^{l-1}(t) \, \delta_j^{l}(t), & l = L-1, L-2, \ldots, 1 \end{cases}$

with $0 \le i \le N_{l-1} - 1$, $0 \le j \le N_l - 1$, where $N_l$ is the number of nodes in layer l
and

(A.6)    $\delta_j^{l}(t) = f_j'^{\,l} \sum_{n=0}^{N_{l+1}-1} \delta_n^{l+1}(t) \, w_{jn}^{l+1}(t)$

Now, to adapt the algorithm to each criterion, it is only necessary to calculate the
value of $\delta_j^{L}$ and apply Equation (A.5) (bias values, always set to “1”, must be
added to properly train the network).

2. LEAST MEAN SQUARES (LMS)


The criterion function to minimize is the mean square error between the actual
output $\hat{y}$ and the desired output $y$ in the training set, i.e.

(A.7)    $E_{LMS} = \dfrac{1}{P} \sum_{p=1}^{P} \sum_{j=1}^{N_s} \dfrac{1}{2} (y_{jp} - \hat{y}_{jp})^2$

where $P$ is the number of training pairs and $N_s$ the number of output nodes.
Approximating Equation (A.7) by its value over one pattern,

(A.8)    $\varepsilon_{LMS} = \sum_{j=1}^{N_s} \dfrac{1}{2} (y_j - \hat{y}_j)^2$

As $L$ is the number of layers of the MLP, $\hat{y}_j = \hat{y}_j^{L}$. For the binary case, $N_s = 1$ and
from Equations (A.8) and (A.2)

(A.9)    $\dfrac{\partial \varepsilon(W)}{\partial \hat{y}^{L}} = -(y - \hat{y}^{L}) \;\Rightarrow\; \delta^{L} = (y - \hat{y}^{L}) \cdot f'^{L}$

3. MINIMUM MISCLASSIFICATION ERROR (MME)

This criterion minimizes the classification error on the training set and is proposed
in [Telfer and Szu, 1994, Telfer and Szu, 1992]. The criterion function, when the
network has one output node whose output values are in (0,1) and whenever the
final values of the components of the weight matrix are large in magnitude (which
usually happens [Telfer and Szu, 1994, Telfer and Szu, 1992]), is

(A.10)   $E_{MME} = \dfrac{1}{P} \left[ P - \sum_{p=1}^{P} f\left( (2y_p - 1) \, W^{T} \hat{Y}_p^{L-1} \right) \right]$

where $f(\cdot)$ is the node activation function.
Minimizing Equation (A.10) in the step-by-step training is equivalent to minimizing

(A.11)   $\varepsilon_{MME}(W) = 1 - f\left( (2y - 1) \, W^{T} \hat{Y}^{L-1} \right)$

For the binary case,

(A.12)   $\delta^{L} = (2y - 1) \cdot f'^{L}$

4. THE JM CRITERION
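This criterion is similar to the Kullback-Leibler information measure, and it minimizes

(A.13)   $H(Q/P) = \sum_{i} P_i \cdot \ln \dfrac{P_i}{Q_i}$

where $P_i = y_i = \Pr(H_i \mid X)$ is the a posteriori probability of hypothesis $H_i$ ($i = 0, 1$)
and $Q_i = \hat{y}_i = \widehat{\Pr}(H_i \mid X)$. When there are enough training vectors, minimizing
Equation (A.13) is equivalent to minimizing

(A.14)   $E_{JM} = -\sum_{i} \int_{\forall x} p(x) \cdot \ln \hat{y}_i(x \mid H_i, W) \, dx$

where $x$ is the input vector. For the binary case,

(A.15)   $E_{JM} = -\dfrac{1}{P} \left[ \sum_{X_p \in H_0} \ln \hat{y}_0(x \mid H_0, W) + \sum_{X_p \in H_1} \ln \hat{y}_1(x \mid H_1, W) \right]$

where $P$ is the number of input patterns. As there is just one output node with values
in (0, 1), $\widehat{\Pr}(H_1 \mid X) = \hat{y}$, $\widehat{\Pr}(H_0 \mid X) = 1 - \hat{y}$ and Equation (A.15) can be written as

(A.16)   $E_{JM} = -\dfrac{1}{P} \left[ \sum_{X_p \in H_0} \ln\left( 1 - \hat{y}(x \mid H_0, W) \right) + \sum_{X_p \in H_1} \ln \hat{y}(x \mid H_1, W) \right]$

Since $y_p = 0$ for patterns in $H_0$ and $y_p = 1$ for patterns in $H_1$, the term $1 - |y_p - \hat{y}_p|$
equals $1 - \hat{y}_p$ in the first case and $\hat{y}_p$ in the second, so Equation (A.16) can be written
as a function of the output $\hat{y}$

(A.17)   $E_{JM} = -\dfrac{1}{P} \sum_{p=1}^{P} \ln\left( 1 - |y_p - \hat{y}_p| \right)$

and, for each iteration step,

(A.18)   $\varepsilon_{JM}(W) = -\ln\left( 1 - |y - \hat{y}(W)| \right)$

Finally, we have

(A.19)   $\delta^{L} = \dfrac{\mathrm{sgn}(y - \hat{y})}{1 - |y - \hat{y}|} \cdot f'^{L}$

where $\mathrm{sgn}(\cdot)$ is the sign function.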
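
The output-layer error terms of the three criteria can be compared with a few lines of Python. The sketch assumes a logistic sigmoid as the activation function of the output node, so that its derivative is the output times one minus the output; this choice, the function names and the finite-difference check are assumptions added for illustration and are not part of the appendix.

```python
import math


def sigmoid(a):
    # Assumed output activation (logistic); its derivative is y_hat * (1 - y_hat).
    return 1.0 / (1.0 + math.exp(-a))


def delta_lms(y, a):
    y_hat = sigmoid(a)
    return (y - y_hat) * y_hat * (1.0 - y_hat)             # Equation (A.9)


def delta_mme(y, a):
    y_hat = sigmoid(a)
    return (2 * y - 1) * y_hat * (1.0 - y_hat)             # Equation (A.12)


def delta_jm(y, a):
    y_hat = sigmoid(a)
    sign = 1.0 if y > y_hat else -1.0                      # sgn(y - y_hat)
    return sign / (1.0 - abs(y - y_hat)) * y_hat * (1.0 - y_hat)   # Equation (A.19)


def minus_slope(criterion, y, a, h=1e-6):
    """Central-difference estimate of -d(criterion)/da, to compare with delta."""
    return -(criterion(y, a + h) - criterion(y, a - h)) / (2.0 * h)


def eps_lms(y, a):
    return 0.5 * (y - sigmoid(a)) ** 2


def eps_jm(y, a):
    return -math.log(1.0 - abs(y - sigmoid(a)))
```

Under the logistic assumption the JM error term simplifies to the difference between the desired and the actual output, because the derivative of the activation cancels against the denominator of Equation (A.19). The LMS and MME terms keep that derivative as a factor, so they shrink when the output node saturates.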

REFERENCES
M.S.Kim, C.C.Guest, Modification of backpropagation networks for complex-valued signal processing
in frequency domain, IEEE Proc. Int. Conf. Neural Networks, IJCNN, Vol. III, pp. 27–31, San Diego,
June 1990.
S. E. Decatur, Application of neural networks to terrain classification, Proc. Int. Conf. Neural Networks,
pp. 283–288, 1989.
N. Miller, M.W. McKenna, T.C. Lau, Office of Naval Research Contributions to Neural Networks and
Signal Processing in Oceanic Engineering, IEEE Journal of Oceanic Engineering, vol. 17, no. 4, Oct.
1992.
M.W. Roth, Survey of neural network technology for automatic target recognition, IEEE Trans. Neural
Networks, vol. 1, no. 1, pp. 28–43, Mar. 1990.
D.W. Ruck, S.K. Rogers, M. Kabrisky, M.E. Oxley, B.W. Suter, The Multilayer Perceptron as an
Approximation to a Bayes Optimal Discriminant Function, IEEE Trans. on Neural Networks, vol. 1,
no. 4, pp. 296–298, Dec. 1990.
H.L. Van Trees, Detection, Estimation and Modulation Theory, Part I, Eds. Wiley and Sons, New York,
1968.
D.R. Hush and B.G. Horne, Progress in Supervised Neural Networks. What’s new since Lippmann?,
IEEE Signal Processing Magazine, pp. 8–51, Jan. 1993.
B.A. Telfer, H.H. Szu, Implementing the Minimum-Misclassification-Error Energy Function for Target
Recognition, Neural Networks, vol. 7, no. 5, pp. 809–818, 1994.
B.A. Telfer, H.H. Szu, Energy Functions for Minimizing Misclassification Error With Minimum-
Complexity Networks, Proc. of Int. Joint Conf. Neural Networks, IJCNN, vol IV, pp. 214–219, 1992.
A. El-Jaroudi, J. Makhoul, A New Error Criterion For Posterior Probability Estimation With Neural
Nets, Proc. of Int. Joint. Conf. Neural Networks, IJCNN, vol. I, no. 5, pp. 185–192, 1990.
D. Andina, J.L. Sanz-González, On the problem of Binary Detection with Neural Networks, Proc. of 38
Midwest Symposium on Circuits and Systems, Rio de Janeiro, Brazil, vol. I, pp. 554–557, Aug. 1995.
W.L. Root, An Introduction to the Theory of the Detection of Signals in Noise, Proc. of the IEEE, vol
58, pp. 610–622, May 1970.
D.Andina, Optimización de Detectores Neuronales: Aplicación a Radar y Sonar, Ph. D. Dissertation (in
Spanish), ETSIT-M, Polytechnic University of Madrid, Spain, Dec. 1995.
J.L. Marcum, A Statistical Theory of Target Detection by Pulsed Radar, IRE Trans. on Information
Theory, vol. IT-6, no. 2, pp. 59–144. Apr. 1960.
E. Barnard, D. Casasent, A Comparison Between Criterion Functions for Linear Classifiers, with Appli-
cation to Neural Nets, IEEE Trans. Systems, Man, and Cybernetics, vol. 19, no. 5, pp. 1030–1040,
Oct. 1989.
P. Swerling, Probability of detection for fluctuating targets, IRE Trans. on Information Theory, vol.
IT-6, no. 2, pp. 269–308, Apr. 1960.
D. Andina, J.L. Sanz-González, J.A. Jiménez-Pajares, A Comparison of Criterion Functions for a Neural
Network Applied to Binary Detection, Proc. of Int. Conf. Neural Networks, ICNN, Perth, Australia,
Vol I, pp. 329–333, Nov. 1995.
CHAPTER 5
RADIAL BASIS FUNCTION NETWORKS AND THEIR
APPLICATION IN COMMUNICATION SYSTEMS

ASCENSIÓN GALLARDO ANTOLÍN¹, JUAN PASCUAL GARCÍA², JOSÉ LUIS SANCHO GÓMEZ³

¹ Departamento de Teoría de la Señal y Comunicaciones, EPS-Universidad Carlos III de Madrid,
Avda. de la Universidad, 30, 28911-Leganés (Madrid), SPAIN. gallardo@tsc.uc3m.es
² Departamento de las Tecnologías de la Información y las Comunicaciones, Universidad Politécnica
de Cartagena, Campus de la Muralla del Mar, s/n, 30202, Cartagena (Murcia), SPAIN.
juan.pascual@upct.es
³ Departamento de las Tecnologías de la Información y las Comunicaciones, Universidad Politécnica
de Cartagena, Campus de la Muralla del Mar, s/n, 30202, Cartagena (Murcia), SPAIN.
josel.sancho@upct.es

Abstract: Among the different types of Neural Networks (NN), the most popular and frequently used
architectures are the Multilayer Perceptron (MLP) and Radial Basis Function (RBF) networks, due
to their approximation capabilities. In this chapter we discuss the use of RBF networks
to solve problems in different areas related to communication systems. In the first part
of the chapter, we review the structure of RBF networks and the main procedures to
train them. In the second part, the main applications of RBF networks in communication
systems are presented and described. In particular, we focus our attention on antenna
array signal processing (direction-of-arrival estimation and beamforming) and channel
equalization (intersymbol interference and co-channel interference). Other applications
such as coding/decoding, system identification, fault detection in access networks and
automatic recognition of wireless standards are also mentioned.

Keywords: Neural networks, radial basis functions, communication systems, channel equalization,
antenna array signal processing

1. RADIAL BASIS FUNCTION NETWORKS


The Radial Basis Function (RBF) networks are a type of layered feedforward
neural network (NN) capable of approximating any continuous function with a
certain precision level. The operation of an RBF network can be viewed as
function approximation, that is, as a multidimensional curve fitting problem.
D. Andina and D.T. Pham (eds.), Computational Intelligence, 109–130.
© 2007 Springer.

The function approximation is realized by means of a first nonlinear mapping from
the input space to a high-dimensional hidden space and a second linear mapping from
the hidden space to the final output space. This operation is used in complex pattern
classification problems because, once the input space has been mapped into a
high-dimensional space in a nonlinear way, the patterns are more easily separable by means
of a linear transformation. The support for this operation is given by Cover's
theorem on the separability of patterns [Cover, 1965]. According to this theorem,
a complex pattern-classification problem is more likely to be linearly separable if
it is cast nonlinearly into a high-dimensional space. The aforementioned mapping
used in the pattern classification problem can be viewed as a surface construction
that can also be used in interpolation problems.
This approach to RBF network design was first taken in [Broomhead
and Lowe, 1988]. In RBF networks the function that interpolates the data is of
the form [Powell, 1988]:

(1)    $F(x) = \sum_{i=1}^{N} w_i \, \varphi(\|x - c_i\|)$

where $\varphi(\cdot)$ is the radial basis function, $\|\cdot\|$ is the Euclidean norm, $N$ is the number
of training data pairs, and $c_i$ is the center of the i-th radial basis function. In this
approximation, there are as many centers as input vectors, and their values are
the same as the input vectors, i.e., $c_i = x_i$. During the training stage a known set of
input and output data pairs is delivered to the RBF network to select the centers
and compute the output layer weights. The function $F$ has to satisfy the interpolation
condition $F(x_i) = d_i$, where $d_i$ is the desired training output value. The method
applied to compute the output layer weights is easier to explain if the RBF network
operation is expressed in matrix form:

(2)    $d = \Phi w$

where $d = [d_1, d_2, \ldots, d_N]^T$ is the desired output vector, $\Phi$ is an N-by-N matrix
with elements $\varphi_{ji} = \varphi(\|x_j - c_i\|)$ with $j, i = 1, 2, \ldots, N$, and $w = [w_1, w_2, \ldots, w_N]^T$
is the weight vector. This weight vector is computed as follows:

(3)    $w = \Phi^{-1} d$

where $\Phi^{-1}$ is the inverse of the matrix $\Phi$.
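
As a small worked example of strict interpolation, the Python sketch below builds the square matrix of Equation (2) for four one-dimensional training points and solves for the weights. The Gaussian basis function with unit width, the data values and the helper `solve` (plain Gaussian elimination, used here in place of an explicit matrix inverse) are assumptions chosen for illustration.

```python
import math


def gaussian(r, sigma=1.0):
    """Assumed radial basis function: a Gaussian of the distance r."""
    return math.exp(-r * r / (2.0 * sigma * sigma))


def solve(a, b):
    """Solve the square system a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]       # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):                     # back substitution
        w[r] = (m[r][n] - sum(m[r][c] * w[c] for c in range(r + 1, n))) / m[r][r]
    return w


xs = [0.0, 1.0, 2.0, 3.0]        # training inputs, also used as centers (c_i = x_i)
ds = [0.0, 0.8, 0.9, 0.1]        # desired outputs (hypothetical values)
phi = [[gaussian(abs(xj - ci)) for ci in xs] for xj in xs]
w = solve(phi, ds)


def F(x):
    return sum(wi * gaussian(abs(x - ci)) for wi, ci in zip(w, xs))
```

The resulting function reproduces every desired output at the training inputs, which is exactly the interpolation condition; its behaviour between and beyond those points is not constrained by the construction.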


Micchelli's theorem [Micchelli, 1986] gives the conditions under which the RBF
network produces an output surface or function F that passes through all the training
points. To measure the generalization capabilities of the trained network, i.e., its
behavior with patterns that have not been used in the training phase, new points are
delivered to the RBF network. The strict interpolation surface usually implies a poor
generalization capability, especially when the number of training points is high. The
overfitting produced by strict interpolation is undesirable in most cases
because the RBF network outputs for new inputs, different from the training ones, will not
be correct. Due to the noise present in the training data or the lack of data in some
regions of the input and output space, the construction of the interpolation surface is
an ill-posed problem. Regularization theory was proposed by Poggio and Girosi
to solve the ill-posed surface reconstruction problem [Poggio and Girosi, 1990]. It
can be proved that this regularization network is a universal approximator and that the
approximation performed is optimal. Because training the regularization network
is computationally very demanding, a suboptimal solution was proposed in
[Poggio and Girosi, 1990]. This proposal allows different RBFs, so that the function
approximation is realized by:

(4)    $F(x) = \sum_{i=1}^{m} w_i \, \varphi_i(\|x - x_i\|)$

where the number of radial basis functions is $m \le N$. Typical examples of radial basis functions are
inverse multiquadric functions and Gaussian functions.
In [Park and Sandberg, 1991], the universal approximation theorem for RBF networks
is proved. This theoretical support makes it possible to design RBF networks that
are capable of approximating any continuous function when a set of appropriate
parameters is selected.

2. ARCHITECTURE
An RBF network consists of two layers, as seen in Figure 1. The input level (not
considered as a layer) is responsible for the data acquisition. The first layer is
called the hidden layer because it lies between the input level and the output layer;
it performs a nonlinear transformation by means of the radial basis functions.
Finally, the output layer produces the output of the network by means of a linear
transformation.
Figure 1 shows how each input is delivered to every neuron (RBF)
in the hidden layer. A bias term is included as a constant function ($\varphi = 1$) multiplied by its
linear weight ($w_0 = b$). Every radial basis function output is multiplied by the corresponding
linear weight and all the products are summed to produce the RBF network output. The
architecture depicted in Figure 1 refers to an output space of one dimension. It
is straightforward to generalize the model to the case of a multidimensional output
space.

Figure 1. Radial Basis Function Network structure (inputs $x_1, \ldots, x_n$; hidden layer of radial basis
functions $\varphi$ plus a constant unit $\varphi = 1$; output layer with weights $w_0 = b, w_1, \ldots, w_m$ producing $F(x)$)

Among all possible radial basis functions, one of the most used is the
Gaussian function. Thus:

(5)    $\varphi_i(x) = \exp\left( -\dfrac{1}{2\sigma_i^2} \|x - c_i\|^2 \right)$

where $c_i$ is the radial basis center, $\sigma_i^2$ is the variance of the Gaussian function, and
$x$ is the input vector. The variances are usually chosen to be common to all the
Gaussian functions, although other alternatives have been proposed; one of them is
described in the next section. The Gaussian radial basis functions considered above
can be generalized to allow for arbitrary covariance matrices $\Sigma_i$. Thus, the basis
functions take the form:

(6)    $\varphi_i(x) = \exp\left( -\dfrac{1}{2} (x - c_i)^T \Sigma_i^{-1} (x - c_i) \right)$

It is sometimes useful to consider the covariance matrices equal and introduce a
weighted norm of the form:

(7)    $\dfrac{1}{2} \Sigma^{-1} = C^T C$

(8)    $C = \mathrm{diag}\left( \dfrac{1}{\sigma_1}, \dfrac{1}{\sigma_2}, \cdots, \dfrac{1}{\sigma_m} \right)$

where $C$ is the norm weighting matrix and $\Sigma^{-1}$ is the inverse of the covariance matrix. When this
weighting matrix is diagonal with different values along the diagonal, the
Gaussian function has different variances along the different dimensions of
the input space.
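
The two Gaussian forms can be evaluated directly, as the following Python sketch suggests. The function names are hypothetical, and the second function covers only the special case of a diagonal covariance matrix, which is an assumption made to keep the example free of matrix inversion.

```python
import math


def gaussian_rbf(x, c, sigma):
    """Equation (5): isotropic Gaussian centred at c with variance sigma**2."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def gaussian_rbf_diag(x, c, sigmas):
    """Equation (6) for a diagonal covariance matrix with entries sigmas[k]**2."""
    q = sum(((xi - ci) / s) ** 2 for xi, ci, s in zip(x, c, sigmas))
    return math.exp(-0.5 * q)


x, c = (1.0, 0.0), (0.0, 0.0)
same_width = gaussian_rbf_diag(x, c, (1.0, 1.0))     # reduces to the isotropic case
wide_first = gaussian_rbf_diag(x, c, (3.0, 1.0))     # larger variance along dimension 1
```

With a larger deviation along the first dimension, the same displacement along that dimension produces a higher activation, which is what different variances along the input dimensions mean in practice.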

3. TRAINING ALGORITHMS
The design of an RBF network is composed of two phases: the training phase and
the testing phase. During the first one, the network learns from a well-known set of
input-output data pairs called the training set. After the training process is completed,
the generalization capability of the trained network is checked in the testing phase.
To do this, a set of input-output data pairs different from those used in the training
phase is delivered to the RBF network to measure the performance of the network.
The aim of a training algorithm is to choose the centers, the variances (or the
covariance matrix if norm weighting is used) and the linear weights of the
output layer. The most important issue in RBF network design is the selection
of the centers. Usually, training strategies combine supervised and unsupervised
algorithms. The supervised algorithms take into account the desired output values
to compute the weights. An unsupervised algorithm only uses the input data to
calculate the radial basis centers and variances. Following this explanation, four
different training algorithms are presented.

3.1 Fixed Centers Selected at Random

This unsupervised strategy consists of selecting the centers of the Gaussian radial
basis functions randomly from the input data of the training set. Each center, $c_k$, is equal to
one training input pattern. The variance of each Gaussian function is calculated as:

(9)    $\sigma^2 = \dfrac{d_{max}^2}{2m}$

where $d_{max}$ is the maximum distance between the selected centers, and $m$ is the
number of centers [Haykin, 1999]. Choosing Gaussian functions that are too peaked or too flat
should be avoided, since extreme variance values produce poor performance.
The output layer weights can be computed with the pseudo-inverse procedure. The
RBF network operation can be expressed in matrix form as:

(10)   $\begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_N \end{pmatrix} = \begin{pmatrix} \varphi(x_1, c_1) & \varphi(x_1, c_2) & \cdots & \varphi(x_1, c_m) \\ \varphi(x_2, c_1) & \varphi(x_2, c_2) & \cdots & \varphi(x_2, c_m) \\ \vdots & \vdots & & \vdots \\ \varphi(x_N, c_1) & \varphi(x_N, c_2) & \cdots & \varphi(x_N, c_m) \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_m \end{pmatrix}$

(11)   $d = \Phi w$

where $d$ is the vector of the desired training outputs, $w$ is the weight vector, and
$\Phi$ is a non-square matrix with components equal to the output of the radial basis
functions evaluated at the training points.
The weights are calculated as:

(12)   $w = \Phi^{+} d$

(13)   $\Phi^{+} = (\Phi^T \Phi)^{-1} \Phi^T$

where $\Phi^{+}$ is the pseudo-inverse of the matrix $\Phi$. If the training set is large,
the pseudo-inverse calculation requires extensive computation; nevertheless, there
exist many efficient algorithms to compute the pseudo-inverse matrix [Golub and
Van Loan, 1996]. Another problem arises when $\Phi^T \Phi$ is singular or nearly singular.
In this case, the direct solution given by Equation (12) can lead to numerical
difficulties. In practice, such problems are best resolved by using the technique of
Singular Value Decomposition (SVD) to find a solution for the weights [Golub
and Van Loan, 1996].
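
The whole strategy of this section can be sketched in a few lines of Python for one-dimensional inputs. This is a simplified illustration under stated assumptions: the function names, the sine data and the seed are hypothetical, and the weights are obtained by solving the normal equations with plain Gaussian elimination instead of the SVD that would be preferable for nearly singular cases.

```python
import math
import random


def solve(a, b):
    """Solve the square system a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (m[r][n] - sum(m[r][c] * w[c] for c in range(r + 1, n))) / m[r][r]
    return w


def fit_rbf_random_centers(xs, ds, m, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(xs, m)                        # centers are training inputs
    d_max = max(abs(ci - cj) for ci in centers for cj in centers)
    sigma = d_max / math.sqrt(2.0 * m)                 # square root of Equation (9)

    def phi(x, c):
        return math.exp(-(x - c) ** 2 / (2.0 * sigma ** 2))

    design = [[phi(x, c) for c in centers] for x in xs]          # N-by-m matrix
    gram = [[sum(row[i] * row[j] for row in design) for j in range(m)]
            for i in range(m)]
    rhs = [sum(row[i] * d for row, d in zip(design, ds)) for i in range(m)]
    weights = solve(gram, rhs)                         # solves (Phi^T Phi) w = Phi^T d

    def predict(x):
        return sum(w * phi(x, c) for w, c in zip(weights, centers))

    return centers, sigma, weights, predict


xs = [i / 10.0 for i in range(21)]
ds = [math.sin(math.pi * x) for x in xs]               # hypothetical target function
centers, sigma, weights, predict = fit_rbf_random_centers(xs, ds, m=5)
sse = sum((predict(x) - d) ** 2 for x, d in zip(xs, ds))
```

Because the centers are drawn at random, the quality of the fit changes with the seed, which illustrates why this strategy may need many centers to perform well.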

3.2 Self-Organized Selection of Centers

Selecting fixed centers at random is a simple and fast training strategy, but it
usually leads to poor performance or to large RBF networks. Moreover, if the chosen
centers are close together, the pseudo-inverse problem becomes ill-conditioned.
The training algorithms included in the self-organized selection of centers category
try to avoid these problems by means of clustering procedures. In this
section two algorithms are explained: k-means clustering and self-organizing
map clustering. The RBF network training is completed with the estimation of the last
layer weights. Procedures such as the pseudo-inverse or a gradient descent algorithm may be
used for this last part of the training. The most frequently used gradient
descent algorithm is the Widrow-Hoff or Least-Mean-Square (LMS) rule, due to
its simplicity, its ability to operate satisfactorily in an unknown environment, and its
ability to track variations of the input statistics [Haykin, 2002].

3.2.1 K-means clustering algorithm


This learning strategy places the centers in the regions of the input space
where the relevant data are present. The algorithm begins with the random selection
of $m$ centers. At each iteration of the algorithm the training input pattern $x$ is
assigned to the k-th center if $\|x - c_k\| < \|x - c_j\|$ for $j = 1, 2, \ldots, m$ and $j \neq k$. The
new center is computed as $c_k = \frac{1}{N_k} \sum_{x \in S_k} x$, where $N_k$ is the number of input patterns
that belong to the cluster $S_k$, defined as:

(14)   $S_k = \{ x \ \text{with} \ \|x - c_k\| < \|x - c_j\| \ \text{for} \ j = 1, 2, \ldots, m \ \text{and} \ j \neq k \}$

The procedure stops when there is no variation in the position of the centers.
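
A compact Python sketch of the procedure is given below. The function names, the tie-breaking rule (the lowest index wins), the handling of empty clusters (the old center is kept) and the six sample points are assumptions introduced for the example.

```python
import random


def dist2(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean(pts):
    return tuple(sum(col) / len(pts) for col in zip(*pts))


def kmeans(points, m, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, m)                    # random initial centers
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(m)]
        for x in points:                               # assign x to its nearest center
            k = min(range(m), key=lambda j: dist2(x, centers[j]))
            clusters[k].append(x)
        new_centers = [mean(s) if s else c for s, c in zip(clusters, centers)]
        if new_centers == centers:                     # no center moved: stop
            break
        centers = new_centers
    return centers, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, m=2)
```

For these two well-separated groups the centers settle on the two group means, whichever pair of points is drawn as the starting centers.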
After the centers have been calculated, several ways to calculate the variance
of the Gaussian functions may be used. Normally, the variance of each Gaussian
function is the mean of the distances of some certain group of neighboring centers
or input patterns; for example, we may select as variance value of the k-th Gaussian

function, the mean of the distances to the p nearest centers by k =  p1 pi=1 li ,
where li is the distance between the k-th center and the i-th nearest center and l1 ≤
   ≤ li ≤    ≤ lm . Usually, the Euclidean norm is used to compute the distances.
Another alternative to select k consists of computing the distance between the
current center and the nearest center belonging to other class and establishing
k = mincj − ck  with j = 1 2     m j = k, and  a scale factor. The k
can also be estimated making use of the mean of the input patterns belonging to
the cluster Sk .
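
The procedure can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function names, the handling of empty clusters and the two-blob test data are assumptions made for the example, and the width rule shown is the p-nearest-centers mean described above.

```python
import numpy as np

def kmeans_centers(X, m, rng, max_iter=100):
    """Batch k-means: assign every pattern to its nearest center, then recompute means."""
    centers = X[rng.choice(len(X), size=m, replace=False)].copy()  # random initial centers
    for _ in range(max_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)            # index k of the closest center
        new_centers = centers.copy()
        for k in range(m):
            members = X[labels == k]            # cluster S_k
            if len(members) > 0:                # assumption: an empty cluster keeps its center
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):   # stop when the centers no longer move
            break
        centers = new_centers
    return centers, labels

def widths_p_nearest(centers, p):
    """Width of each Gaussian as the mean distance to its p nearest centers."""
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    dist.sort(axis=1)                           # column 0 is the zero distance to itself
    return dist[:, 1:p + 1].mean(axis=1)

# Toy data (assumed): two well-separated groups of patterns.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(5.0, 0.3, size=(50, 2))])
centers, labels = kmeans_centers(X, m=2, rng=rng)
sigmas = widths_p_nearest(centers, p=1)
```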

3.2.2 Self-Organizing map clustering: SOM


The Self-Organizing Map is a method to classify the input patterns into groups
or clusters, each one characterized by a particular center. This iterative algorithm,
developed by T. Kohonen, can be divided into two steps [Kohonen, 1990]. First, an
input training pattern is randomly selected and assigned to the closest (“winner”)
cluster; this is called the competitive phase. In the second step, the center of the
winner cluster, together with those in a predefined neighborhood, is updated so that
they move towards the input vector. This is called the updating phase.
The centers are initialized to random values. All the centers have to be different,
and it is recommended that they have small Euclidean norms. In the neural network
structure, a neighborhood $N_c$ is defined around each neuron. At the beginning of the
algorithm, because of the random initialization of the centers, the centers of the
neurons belonging to each neighborhood set $N_c$ can be located in distant positions
in the input space. The aim of the algorithm is therefore to place the centers of the
neurons of each neighborhood set in nearby positions in the input space.
During the i-th iteration of the competitive phase, all the Euclidean distances
between the given input pattern and the cluster centers are calculated. The winner
center, $c_k$, is the one with the minimum distance, i.e.,
$k(x_i) = \arg\min_j \|x_i - c_j(i)\|$ for $j = 1, 2, \ldots, m$. During the i-th
iteration of the updating phase, the centers belonging to the neighborhood, $N_c$, of
the winner center, $c_k$, are moved towards the input; the rest of the centers are
not updated. Thus, the updating equations are:

(15) $c_k(i+1) = \begin{cases} c_k(i) + \eta(i)\,[x - c_k(i)] & \text{if } k \in N_c \\ c_k(i) & \text{if } k \notin N_c \end{cases}$

The parameter $\eta(i)$ is an adaptation rate that takes values between 0 and 1. It is
related to the gain used in stochastic approximation processes and, as in those
methods, it should decrease with time. Once the centers are calculated
according to Equation (15), the final structure is such that similar input vectors
activate the same output (center or neuron), while different vectors activate different
neurons.
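
The two phases can be sketched in a few lines. The sketch below assumes a one-dimensional chain of neurons, a neighborhood defined by index distance to the winner, and linearly decreasing rate and neighborhood radius; none of these specific choices comes from the text above, which only requires a neighborhood and a decreasing adaptation rate.

```python
import numpy as np

def som_train(X, m, rng, n_iter=2000, eta0=0.5, radius0=2):
    """Train a 1-D chain of m SOM centers on the patterns in X."""
    # Small, distinct random initial centers.
    centers = rng.uniform(-0.05, 0.05, size=(m, X.shape[1]))
    for i in range(n_iter):
        x = X[rng.integers(len(X))]                        # randomly selected pattern
        # Competitive phase: the winner is the closest center.
        winner = np.linalg.norm(x - centers, axis=1).argmin()
        # Assumed schedules: rate and neighborhood radius shrink linearly with time.
        frac = i / n_iter
        eta = eta0 * (1.0 - frac)
        radius = int(round(radius0 * (1.0 - frac)))
        lo, hi = max(0, winner - radius), min(m, winner + radius + 1)
        # Updating phase: only the winner's neighborhood moves towards x.
        centers[lo:hi] += eta * (x - centers[lo:hi])
    return centers

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 2))   # toy input patterns (assumed)
centers = som_train(X, m=10, rng=rng)
```

Because every update is a convex combination of a center and an input pattern, the centers remain inside the region spanned by the initial centers and the data.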

3.3 Supervised Selection of Centers

One approach to the supervised selection of centers is to apply a gradient
descent algorithm to a network error function. The supervised selection of centers
tries to avoid the inherent problems of the previous training strategies: heuristic
methods such as selecting the centers at random or k-means clustering usually
lead to large networks for a given error or do not generalize well. On the other hand,
the time to converge to a solution is longer in the supervised selection than in the
algorithms previously explained. The supervised procedure can also be extended to the
calculation of the Gaussian variances and the output layer weights. In the most general
form of the supervised training algorithm, the gradient descent procedure is also
applied to the evaluation of the inverse covariance matrices $\Sigma_k^{-1}$.
Once the error function has been defined, its gradients with respect to the
corresponding parameters are calculated. The change in every iteration
is proportional to the evaluated gradient. One possible error function is the Sum of
Square Errors (SSE), defined as:

(16) $E = \frac{1}{2}\sum_{j=1}^{N} e_j^2$

where N is the number of training patterns and $e_j$ is the error committed by
the RBF network when the j-th pattern is presented. This error can be written as:

(17) $e_j = d_j - \sum_{i=1}^{m} w_i\,\varphi(\|x_j - c_i\|)$

The following equations implement the updating step of this supervised training
method for the case of a one-dimensional output space [Haykin, 1999]. The
generalization to a multidimensional output space is straightforward. The initial
values of the centers, covariance matrices, and weights are chosen to lie in a useful
region of the parameter space. This initial constraint reduces the probability of
getting stuck in a local minimum.
The updating equations for the k-th radial basis center in the j-th iteration are
as follows:

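(18) $\frac{\partial E(j)}{\partial c_k(j)} = 2\,w_k(j)\sum_{i=1}^{N} e_i(j)\,\varphi'(\|x_i - c_k(j)\|)\,\Sigma_k^{-1}\,[x_i - c_k(j)]$

(19) $c_k(j+1) = c_k(j) - \eta_1\,\frac{\partial E(j)}{\partial c_k(j)}$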

The linear weight updating is carried out by means of the equations:

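(20) $\frac{\partial E(j)}{\partial w_k(j)} = \sum_{i=1}^{N} e_i(j)\,\varphi(\|x_i - c_k(j)\|)$

(21) $w_k(j+1) = w_k(j) - \eta_2\,\frac{\partial E(j)}{\partial w_k(j)}$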

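Finally, the inverse covariance matrix $\Sigma_k^{-1}$ is adjusted with the following equations:

(22) $\frac{\partial E(j)}{\partial \Sigma_k^{-1}(j)} = -w_k(j)\sum_{i=1}^{N} e_i(j)\,\varphi'(\|x_i - c_k(j)\|)\,Q_{ik}(j)$

(23) $Q_{ik}(j) = [x_i - c_k(j)]\,[x_i - c_k(j)]^T$

(24) $\Sigma_k^{-1}(j+1) = \Sigma_k^{-1}(j) - \eta_3\,\frac{\partial E(j)}{\partial \Sigma_k^{-1}(j)}$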

In the above equations, the parameters are updated once all the training
patterns have been presented to the network. This is a batch update. Each network
parameter can also be adjusted continuously; in this second case, the parameter
updates are applied after every pattern presentation. This procedure is known as
sequential or pattern-based updating. The terms $\eta_1$, $\eta_2$ and $\eta_3$ are
the learning rate parameters of the centers, weights, and covariance matrices,
respectively.
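
A simplified batch version of this procedure is sketched below. It is not a transcription of Equations (18) to (24): to keep the example short we assume Gaussian basis functions with a single fixed width shared by all units (so no covariance matrix is adapted), and the gradients are derived directly from the SSE for that special case. The function names, learning rates and toy data are hypothetical.

```python
import numpy as np

def rbf_forward(X, centers, w, sigma):
    """Return activations, pattern-minus-center differences and network outputs."""
    diff = X[:, None, :] - centers[None, :, :]                  # N x m x dim
    Phi = np.exp(-(diff ** 2).sum(axis=2) / (2.0 * sigma ** 2)) # N x m
    return Phi, diff, Phi @ w

def sse(X, d, centers, w, sigma):
    """Sum of square errors E = 1/2 * sum_j e_j^2."""
    return 0.5 * np.sum((d - rbf_forward(X, centers, w, sigma)[2]) ** 2)

def batch_step(X, d, centers, w, sigma, eta_c, eta_w):
    """One batch gradient-descent step on the centers and the output weights."""
    Phi, diff, y = rbf_forward(X, centers, w, sigma)
    e = d - y
    grad_w = -Phi.T @ e                                          # dE/dw_k
    # dE/dc_k = -(w_k / sigma^2) * sum_i e_i * phi_ik * (x_i - c_k)
    grad_c = -(w[:, None] / sigma ** 2) * np.einsum("n,nk,nkd->kd", e, Phi, diff)
    return centers - eta_c * grad_c, w - eta_w * grad_w

# Toy regression problem (assumed data and rates).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(40, 1))
d = np.sin(3.0 * X[:, 0])
centers = np.linspace(-1.0, 1.0, 4).reshape(-1, 1)
w = np.zeros(4)
sigma = 0.5
E_start = sse(X, d, centers, w, sigma)
for _ in range(300):
    centers, w = batch_step(X, d, centers, w, sigma, eta_c=0.002, eta_w=0.01)
E_end = sse(X, d, centers, w, sigma)
```

A sequential variant would apply the same corrections inside a loop over individual patterns instead of summing over the whole training set.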

3.4 Orthogonal Least Squares (OLS)

The Orthogonal Least Squares algorithm can be applied to the RBF network center
selection problem [Chen et al., 1991]. The objective of OLS is to select an
appropriate set of centers in a rational way, maximizing the contribution of the
selected centers to the desired response. This iterative algorithm completes the
training process by evaluating the linear output weights. From the point of view of
this procedure, the RBF network can be considered a special case of the linear
regression model. In matrix notation, the output of the RBF network is given by
Pw, so we can write:

(25) d = Pw + E

where $d = [d_1, d_2, \ldots, d_N]^T$ is the desired output vector of the training data
set, $P = [p_1, \ldots, p_M]$ is the regression matrix, $w = [w_1, \ldots, w_M]^T$ is
the weight vector, and $E = [e_1, e_2, \ldots, e_N]^T$ is the error vector of the
training patterns. In this case, the output space is one-dimensional. The objective of
the OLS algorithm is to find w and the regressors $p_j$, with M as low as possible,
such that the energy of E is minimized.
Each column of the regression matrix is a regressor vector obtained by evaluating
the radial basis nonlinear function with the corresponding center on every input
pattern: $p_i = [\varphi(\|x_1 - c_i\|), \varphi(\|x_2 - c_i\|), \ldots, \varphi(\|x_N - c_i\|)]^T$.
We can consider that the regressor vectors $p_i$ form a set of basis vectors. The
least-squares solution ŵ of the above problem satisfies the condition that Pŵ is the
projection of d onto the space spanned by the regressors. The squared norm of this
projection is part of the desired output energy. Since the initial regressors are
generally correlated, and in order to measure the individual contribution of each one
to the output energy, OLS transforms the initial set of M regressors into a set of
m ≤ M orthogonal basis vectors. The individual contribution of each orthogonal vector
to the desired output energy is then easy to find. It is usual to select as the
initial regressor set the one that results from using all the training input data as
centers. In this last case, we have N regressor vectors. The orthogonalization of the
initial regressor vectors is performed by means of the well-known Gram-Schmidt method.
This method decomposes the regression matrix into a product of two matrices U and A:

(26) P = UA

where U is a matrix with orthogonal columns and A is a triangular matrix. At
each iteration of the algorithm, a column of U and the corresponding column of A
are calculated. Each column of U is the result of the orthogonalization of a selected
regressor p. An error reduction ratio is calculated to select the regressor to be
orthogonalized in the corresponding iteration. Each column of U represents
the selection of a certain center. The elements of the corresponding column of
the matrix A are calculated from the selected regressor and the already calculated
orthogonal columns of U. The algorithm stops when a preset maximum error is
reached. The triangular matrix A is used in the computation of the output weights.
In this procedure the variance is common to all the radial basis functions and
must be set before the algorithm begins.
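
The selection loop can be sketched as follows, using classical Gram-Schmidt orthogonalization and the error reduction ratio as the selection criterion. This is a minimal sketch under stated assumptions: the function name `ols_select`, the stopping tolerance on the unexplained energy fraction, the numerical threshold for discarding nearly dependent regressors and the toy data are all illustrative choices.

```python
import numpy as np

def ols_select(P, d, tol=1e-3, max_centers=None):
    """Forward OLS selection of regressors (columns of P) for the target d."""
    N, M = P.shape
    max_centers = M if max_centers is None else max_centers
    selected, U, g, A_cols = [], [], [], []
    dd = d @ d                                   # energy of the desired output
    unexplained = 1.0                            # fraction of energy not yet explained
    while len(selected) < max_centers and unexplained > tol:
        best = None
        for i in range(M):
            if i in selected:
                continue
            # Orthogonalize candidate p_i against the columns of U chosen so far.
            a = [(u @ P[:, i]) / (u @ u) for u in U]
            u_i = P[:, i] - sum((a_j * u for a_j, u in zip(a, U)), np.zeros(N))
            uu = u_i @ u_i
            if uu < 1e-12:                       # numerically dependent candidate
                continue
            g_i = (u_i @ d) / uu
            err_i = g_i ** 2 * uu / dd           # error reduction ratio
            if best is None or err_i > best[0]:
                best = (err_i, i, u_i, g_i, a)
        if best is None:
            break
        err_i, i, u_i, g_i, a = best
        selected.append(i); U.append(u_i); g.append(g_i); A_cols.append(a)
        unexplained -= err_i
    # Unit upper-triangular A from P_selected = U A; the weights solve A w = g.
    k = len(selected)
    A = np.eye(k)
    for col, a in enumerate(A_cols):
        if col:
            A[:col, col] = a
    w = np.linalg.solve(A, np.array(g))
    return selected, w

# Toy problem (assumed): every training input is a candidate center.
X = np.linspace(-1.0, 1.0, 30).reshape(-1, 1)
sigma = 0.3                                      # common width, fixed in advance
P = np.exp(-((X - X.T) ** 2) / (2.0 * sigma ** 2))
d = 1.0 * P[:, 5] - 0.7 * P[:, 15] + 0.5 * P[:, 25]
selected, w = ols_select(P, d, tol=1e-3)
```

The entries of `selected` index the training inputs retained as centers, and `w` holds the corresponding output weights.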
In [Shertinsky and Picard, 1996], it is shown that OLS does not produce
the smallest set of centers if a nonorthogonal basis is used. When the basis is
nonorthogonal, the energy contributions of the basis vectors are not independent. In
this case, OLS is not able to determine the regressor that produces the maximal
alignment in a global sense. Despite this suboptimal performance, OLS has
proved its validity in numerous applications and it usually leads to a more compact
set of centers than the previous learning strategies [Haykin, 1999]. In recent years
some variants of OLS have been developed. The next sections briefly explain two
of them.

3.4.1 Recursive Orthogonal Least Squares (ROLS)


The recursive orthogonal least squares algorithm is useful in time-varying problems
[Gomm and Yu, 2000]. In this proposal, the classical OLS is transformed into a set of
matrix equations in which the orthogonal decomposition is performed recursively:
the matrices in each iteration depend on the previous ones. A recursive residual
equation is introduced to measure the accuracy of the RBF network. Lower values of the
residual represent better accuracy levels. In [Gomm and Yu, 2000], two operation
modes of the ROLS are described. In the backward mode, a center is removed at
each iteration: the center whose removal causes the smallest increase in the
residual. In the forward mode, a center is added at each iteration: the center that
allows the maximum reduction in the residual. In [Gomm and Yu, 2000], Givens rotations
are used to achieve an efficient implementation of the forward and backward
techniques.

3.4.2 Locally Regularised Orthogonal Least-Squares (LROLS)


The locally regularised orthogonal least-squares algorithm was developed to
design RBF networks that generalize well in situations with high noise levels
[Chen, 2002]. Moreover, the local regularization enforces the sparsity of the subset
model obtained by the original OLS training algorithm. The LROLS technique
combines the local regularization approach with OLS training. To achieve the
local regularization, each orthogonal regressor $u_j$ has an associated regularization
parameter $\lambda_j$. This new parameter is included in a new error reduction ratio
equation. Just as in the classical OLS, this new error reduction ratio is evaluated to
select the orthogonal regressor $u_j$. Since the optimal value of the regularization
parameter is unknown, an iterative procedure is carried out. At each iteration a
subset of regressors is selected with the current values of the regularization
parameters. The iterative procedure stops when there are no changes in the
regularization parameters.

4. RELATION WITH SUPPORT VECTOR MACHINES (SVM)

Support Vector Machines are another kind of feedforward neural network
developed by Vapnik [Boser, Guyon and Vapnik, 1992], [Cortes and Vapnik,
1995], [Vapnik, 1995] and [Vapnik, 1998]. The aim of an SVM is to construct
an optimal hyperplane that allows maximum-margin classification in the case
of linearly separable patterns. When the patterns are nonseparable, the constructed
hyperplane minimizes the probability of classification error. The hyperplane solution
is found by applying the method of structural risk minimization,
so that the optimal hyperplane is the one that minimizes the Vapnik-Chervonenkis
dimension.
As seen in the introduction, Cover’s theorem assures that a set of nonlinearly
separable patterns is linearly separable with high probability if the input space is
mapped into a higher-dimensional space. This new feature space must have a high
enough dimension and the mapping has to be nonlinear. In an SVM, the optimal
hyperplane is not constructed in the input space but in the new feature space. The
hyperplane construction involves the evaluation of an inner-product kernel.
Mercer’s theorem [Hochstadt, 1989] makes it possible to determine whether a certain
kernel is an inner-product kernel and therefore whether it is acceptable in the design
of an SVM. Once the kernel has been chosen, the optimum SVM design implies the
maximization of an objective function subject to certain constraints by means of
Lagrange multipliers. The design finishes when a weight matrix and a set of
support vectors have been evaluated. The SVM output calculation consists of several
operations. First, the inner-product kernel between the input and each support vector
is evaluated. Each inner-product kernel is then multiplied by the corresponding weight.
Finally, all the products of inner-product kernels and weights are summed up to
produce the desired output. It is easy to implement the SVM operation as a neural
network structure. Depending on the inner-product kernel there are different types
of support vector machines. The polynomial learning machine uses the function
$(x^T x_i + 1)^p$. A two-layer perceptron is constructed if the function
$\tanh(\beta_0 x^T x_i + \beta_1)$ is applied. If the Gaussian function
$\exp\left(-\frac{1}{2\sigma^2}\|x - x_i\|^2\right)$ is used, then an RBF network
is obtained. RBF networks are, from this point of view, a special case of SVM.
The SVM design algorithm makes it possible to construct RBF networks without the
heuristics needed in the conventional RBF network training algorithms.
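
The three kernels and the output computation described above can be written compactly. The sketch below only evaluates a machine whose support vectors and weights are already known; it does not solve the constrained optimization problem. The function names, default parameter values and the two-vector example are assumptions made for illustration.

```python
import numpy as np

def poly_kernel(x, xi, p=2):
    """Polynomial learning machine kernel (x^T x_i + 1)^p."""
    return (x @ xi + 1.0) ** p

def tanh_kernel(x, xi, beta0=0.5, beta1=-1.0):
    """Two-layer perceptron kernel tanh(beta0 * x^T x_i + beta1)."""
    return np.tanh(beta0 * (x @ xi) + beta1)

def gaussian_kernel(x, xi, sigma=1.0):
    """Gaussian kernel exp(-||x - x_i||^2 / (2 sigma^2)), which yields an RBF network."""
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

def svm_output(x, support_vectors, weights, kernel):
    """Weighted sum of the kernels between the input and every support vector."""
    return sum(w * kernel(x, sv) for w, sv in zip(weights, support_vectors))

# Hypothetical machine with two support vectors acting as RBF centers.
support_vectors = np.array([[0.0, 0.0], [1.0, 1.0]])
weights = np.array([1.0, -1.0])
out = svm_output(np.array([0.0, 0.0]), support_vectors, weights, gaussian_kernel)
```

With the Gaussian kernel, the support vectors play the role of the RBF centers and the weights that of the output layer.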

5. APPLICATIONS OF RADIAL BASIS FUNCTION NETWORKS TO COMMUNICATION SYSTEMS

The general aim of this section is to cover the applications of radial basis function
(RBF) networks to different areas related to communication systems. In particular,
we will focus our attention on antenna array signal processing and channel
equalization. We will also mention other applications such as coding/decoding,
system identification, fault detection in access networks and automatic recognition
of wireless standards.

5.1 Antenna Array Signal Processing

Recently, antenna array signal processing (AASP) has been receiving growing attention
from researchers as a possible means of improving the performance and capacity
of many communication systems, such as mobile radio systems.
AASP comprises basically two research areas: direction-of-arrival (DoA) estimation
and beamforming.
Conventional methods for AASP are usually linear. However, they can be
improved by using non-linear techniques, typically neural networks. Interested
readers can find a complete review of neural methods for antenna array signal
processing in [Du et al., 2002].
Radial basis function networks stand out from other neural approaches for several
reasons. Firstly, RBFs have the property of universal approximation, i.e. they can
approximate continuous functions arbitrarily well when a large number of neurons
is considered. The importance of this property will be highlighted in the subsection
devoted to RBF-based DoA estimators. Secondly, it has been demonstrated that
RBF methods are very robust against noise, so they perform well in a wide range of
signal-to-noise ratios (SNRs). Finally, their training process is faster than that of
other neural networks such as the multilayer perceptron (MLP), allowing an on-line
adjustment of the network. This last property is very important because DoA
estimation and beamforming are usually applications in which both adaptive adjustment
to time-varying conditions and real-time processing are required.
The rest of the section is devoted to the description of some of the most relevant
contributions in the field of RBF-based methods applied to AASP problems.

5.1.1 Direction-of-Arrival (DoA) estimation


The purpose of DoA algorithms is to obtain the direction of arrival of signals
from the information contained in the measurements of antenna array outputs. The
antenna array can be viewed as a device which performs a non-linear mapping,
$G: \Theta \rightarrow S$, from the space of angles of arrival $\Theta$ to the space of
sensor outputs S. So, one method for estimating the values of $\theta$ is to
approximate the inverse mapping $F: S \rightarrow \Theta$. Owing to their universal
approximation property, RBFs can be used for this purpose.
Figure 2 is a block diagram of a DoA estimator based on RBF networks. As
can be observed, this system is composed of the sensor array, a preprocessing stage
and an RBF network. The array output S is passed through the preprocessing module
to remove irrelevant information. This step is fundamental because it helps
to minimize the required size of the RBF network. The preprocessed data are the
input of the RBF network, which performs the non-linear mapping by approximating
the function F. Finally, the RBF output is the estimate of the directions of
arrival.
[Block diagram: sensor array → sensor outputs s1, s2, ..., sk → preprocessing → RBF network → DoA estimation (θ)]
Figure 2. Block diagram of a DoA estimator based on RBF networks

Several RBF-based DoA estimators have been developed following the generic
scheme in Figure 2. Two representative examples are [Lo et al., 1994] and [El
Zooghby et al., 1997]. In both cases, the preprocessing stage computes the normalized
covariance matrix of the sensor output in order to eliminate
the initial phase and gain of the array output. A similar scheme was used in [Southall
et al., 1995], but, in this case, the RBF input consisted of the sines and cosines of
the phase differences between the measured signals of the elements and the reference
element. All these approaches outperform one of the most widely used algorithms in
this field, the MUSIC (Multiple Signal Classification) algorithm [Schmidt,
1986], in both accuracy and speed.
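
To illustrate why a normalized covariance matrix removes the initial phase and gain, the sketch below builds one possible feature vector from array snapshots and checks that it does not change when the whole array output is multiplied by a complex gain. The exact preprocessing used in the cited works is not given here, so the layout chosen (upper triangle of the sample covariance, real and imaginary parts, unit norm), the uniform linear array with half-wavelength spacing and the function name `doa_features` are all assumptions.

```python
import numpy as np

def doa_features(snapshots):
    """Map a K x T block of complex array snapshots to a unit-norm real feature vector."""
    K, T = snapshots.shape
    R = snapshots @ snapshots.conj().T / T        # sample covariance matrix (K x K)
    upper = R[np.triu_indices(K)]                 # R is Hermitian: upper triangle suffices
    z = np.concatenate([upper.real, upper.imag])
    return z / np.linalg.norm(z)                  # normalization removes the overall gain

# Hypothetical scenario: one source at 20 degrees, 4-element array.
rng = np.random.default_rng(0)
K, T = 4, 200
theta = np.deg2rad(20.0)
steering = np.exp(1j * np.pi * np.arange(K) * np.sin(theta))   # half-wavelength spacing
signal = rng.standard_normal(T) + 1j * rng.standard_normal(T)
noise = 0.1 * (rng.standard_normal((K, T)) + 1j * rng.standard_normal((K, T)))
snapshots = steering[:, None] * signal[None, :] + noise

z = doa_features(snapshots)
# Same data with an unknown common gain (3.0) and initial phase (0.7 rad).
z_scaled = doa_features(3.0 * np.exp(0.7j) * snapshots)
```

The common phase cancels in the product of each snapshot with its conjugate transpose, and the common gain cancels in the final normalization, so `z` and `z_scaled` coincide and the RBF network sees the same input in both cases.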
Another example of a tracking system based on RBF networks is proposed in
[Mukai et al., 2002]. In this article, the authors used an adaptive RBF network in
conjunction with an array feed compensation system for the acquisition and continuous
tracking of a Deep Space Network (DSN) antenna for communications at Ka-band (32 GHz).
However, the extension of these methods to multiple-source tracking is not
straightforward. In fact, for detecting more than three sources, the resulting RBF
network is large and its training is impracticable [Du et al., 2002]. In
addition, knowledge of the number of sources is required.
The solution proposed in [El Zooghby et al., 2000] overcomes these limitations.
In this work, El Zooghby et al. developed a new DoA algorithm (the N-MUST
algorithm) with application to smart antennas for wireless terrestrial and satellite
mobile communications. N-MUST consists of two stages. In the first stage (detection), a
coarse detection of sources is performed. This result is refined in the second stage
(estimation), in which several RBF networks are trained for different ranges of
angles of arrival.

5.1.2 Beamforming
The main objective of beamforming techniques is to acquire and reconstruct the
original signal of the desired source while rejecting the non-desired sources.
Beamforming is of special importance in modern mobile satellite communication
systems and global positioning systems (GPS) because they require the presence
of smart antennas capable of distinguishing signals from multiple sources. These
systems must adapt the radiation pattern of the antenna in order to cancel the
interfering signals and emphasize the desired ones.
In [El Zooghby et al., 1998] it was demonstrated that RBF networks can be
used for this purpose. In this case, RBF-based beamforming was applied to one-
and two-dimensional adaptive arrays, obtaining results very close to the optimum
Wiener solution.
In the context of digital beamforming (DBF) antenna arrays, Xu et al. proposed the
combination of the so-called Constant Modulus Algorithm (CMA) and a novel RBF-based
beamformer for cancelling interference and Gaussian and non-Gaussian
noise [Xu et al., 2000]. The presence of non-Gaussian noise (for example,
atmospheric noise) is typical of satellite communications. In these noisy conditions,
the authors showed that their approach outperformed the conventional CMA.

5.2 Channel Equalization


Digital communication systems are often impaired by several types of distortion
which produce significant changes in both the amplitude and the phase of transmitted
signals. One of the most common distortions is intersymbol interference
(ISI), which causes an overlap of the transmitted symbols over successive time
intervals. ISI is usually a result of the restricted bandwidth of the communication
channel, and it is also produced when several versions of the transmitted signal
arrive at the receiver due to multi-path propagation. Other sources of distortion
which degrade the performance of the communication system are noise, the intrinsic
characteristics of the transmission channel, co-channel and adjacent-channel
interference, non-linear distortions (for example, those produced by memoryless
non-linear devices such as the travelling-wave tube (TWT) amplifiers included in
satellite communication systems), fading, time-varying characteristics, etc.
Figure 3 illustrates a digital communication system in which the channel includes
the effects of the transmitter filter, the transmission medium and the receiver filter.
Stationary linear dispersive channels can be characterized by an N-tap finite impulse
response (FIR) digital filter, given by the coefficient vector
$h = [h_0, h_1, \ldots, h_{N-1}]$. As can be observed in Figure 3, the digital data
sequence $x(k)$ is passed through this FIR filter and corrupted by additive zero-mean
Gaussian noise, $n(k)$, producing the distorted received sequence, $y(k)$. The
relationship between the transmitted signal, $x(k)$, and the received signal, $y(k)$,
can be expressed as

(27) $y(k) = \sum_{i=0}^{N-1} h_i\,x(k-i) + n(k)$

[Block diagram: transmitted signal x(k) → channel → adder with noise n(k) → received signal y(k) → equalizer → estimated signal x̂(k)]
Figure 3. Digital communications system

For a more general model of the channel (for example, non-linear channels) the
received sequence, $y(k)$, is related to the channel input, $x(k)$, through the
expression

(28) $y(k) = f_h(x(k), x(k-1), \ldots, x(k-N+1)) + n(k)$

in which $f_h$ is some linear or non-linear function which models the behavior of the
channel.
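
Both channel models are easy to simulate. In the sketch below, the linear channel implements Equation (27) with zero initial state, and the non-linear channel adds a memoryless quadratic term to the FIR output as one possible choice of $f_h$. The quadratic form, its coefficient `a2`, the channel taps and the function names are assumptions made for illustration.

```python
import numpy as np

def linear_channel(x, h, noise_std, rng):
    """y(k) = sum_i h_i x(k-i) + n(k), with x(k) = 0 for k < 0."""
    clean = np.convolve(x, h)[:len(x)]           # FIR filtering, truncated to len(x)
    return clean + noise_std * rng.standard_normal(len(x))

def nonlinear_channel(x, h, noise_std, rng, a2=0.2):
    """FIR filter followed by an assumed memoryless nonlinearity z + a2 * z^2."""
    z = np.convolve(x, h)[:len(x)]
    return z + a2 * z ** 2 + noise_std * rng.standard_normal(len(x))

rng = np.random.default_rng(0)
h = np.array([1.0, 0.5])                         # hypothetical two-tap channel
x = rng.choice([-1.0, 1.0], size=1000)           # binary transmitted symbols
y = linear_channel(x, h, noise_std=0.1, rng=rng)
```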
In this context, adaptive channel equalization is a major issue in digital
communication systems. The role of channel equalization is to recover the transmitted
symbols from the distorted received signal. Therefore, the equalizer is the part of
the receiver that mitigates channel disturbances such as noise,
intersymbol interference and the other distortions previously mentioned. In addition,
in most practical communication systems, the channel characteristics vary
with time. Therefore, it is necessary to build adaptive equalizers.
Equalizers can be classified into two groups: sequence and symbol-by-symbol
equalizers. Estimation theory suggests that the best performance for symbol detection
is obtained by using equalizers belonging to the first group [Proakis, 2001].
Such equalizers are maximum likelihood sequence estimators (MLSE) and they are
usually implemented by means of the well-known Viterbi algorithm. However, the
entire transmitted sequence and a certain knowledge of the channel characteristics
(i.e. a channel estimator) are required for decoding the optimum sequence of symbols.
In addition, sequence equalizers present a high computational complexity and
a large decision delay. All these requirements seriously limit their use in many
practical communication systems.
Symbol-by-symbol equalizers, in which only one output symbol is estimated in
each symbol period, are the most popular alternative to MLSE equalizers. There
exist three common symbol-decision architectures: transversal equalizers (TE),
decision feedback equalizers (DFE) and growing memory structures [Mulgrew, 1996]
[Mulgrew, 1998].
In TE architectures, the equalizer reconstructs each input symbol $x(k)$ using
M consecutive channel outputs $y(k), y(k-1), \ldots, y(k-M+1)$. This way, the
equalizer output, $\hat{x}(k)$, is an estimate of the channel input. The integer M is
known as the equalizer order and determines the behavior of the detector. In fact,
both the performance and the complexity of the system increase with the number M of
received symbols used for making the decision. Often, the equalizer operates with
a decision delay d, so that at time k the equalizer produces an estimate of the input
symbol $x(k-d)$.
The DFE architecture is an extension of the TE one. In this case, the estimation
of $x(k-d)$ uses not only the M most recent channel observations but also the
P past detected symbols $\hat{x}(k-d-1), \hat{x}(k-d-2), \ldots, \hat{x}(k-d-P)$, where
P is the equalizer feedback order.
Finally, equalizers with a growing memory structure are usually implemented
using recurrent networks. Recurrent networks provide the possibility of improving
the quality of symbol-by-symbol decisions by using recursively all the previously
received symbols instead of only the last M channel outputs as in the TE
architecture, while the memory requirements of the equalizer are kept as small as
possible. Thus, recurrent networks offer a compromise between performance and
complexity.
A traditional method of adaptively compensating for the ISI is to construct an
approximation to the inverse of the channel [Qureshi, 1985]. From this perspective,
equalization is viewed as a deconvolution problem and the equalizer can be
implemented as a linear adaptive filter. Linear transversal equalizers (LTE) are the
most common example of this approach. However, several shortcomings of this
method have been reported [Chen et al., 1993a], [Mulgrew, 1996]: firstly, adaptive
filters cannot effectively mitigate the additive noise component and secondly, the
linear approach completely ignores the fact that, in the absence of noise, a received
sequence $y(k)$ can take only values belonging to a predefined finite alphabet of
symbols.
It is precisely the latter consideration that allows channel equalization to be viewed
as a decision problem. This is the reason why non-linear approaches (many of them based
on neural networks) have attracted the interest of several researchers in the field of
channel equalization. In fact, from Bayesian decision theory [Proakis, 2001], it
can be seen that the optimal solution for the symbol-detection decision corresponds
to a non-linear classification problem.
Several neural networks have been proposed for channel equalization, such as
multilayer feedforward neural networks (MLP) and radial basis function networks (RBF)
[Ibnkahla, 2000]. The main drawback of MLP networks is their training process,
which requires a great number of examples and is very time-consuming [Mulgrew,
1996]. This fact restricts the use of MLP networks in scenarios in which adaptation
is a fundamental issue, as occurs in channel equalization problems. On the
other hand, RBF networks have a structure with a close relationship to the optimal
Bayesian equalizers [Chen et al., 1993a] and they are well suited for solving
non-linear classification problems. For these reasons, in recent years, RBFs have been
considered as an attractive alternative to linear-based approaches.
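
The relationship between RBF networks and the Bayesian equalizer can be made concrete for a known channel. In the sketch below, which assumes binary symbols, equiprobable channel states, a known two-tap channel and known noise level, every noise-free received vector becomes a Gaussian center and the sign of the weighted sum of Gaussians gives the decision. The function names and parameter values are hypothetical; in practice the centers would be learned from a training sequence.

```python
import itertools
import numpy as np

def channel_states(h, M, d):
    """Enumerate the noise-free received vectors and the symbol x(k-d) behind each one."""
    L = M + len(h) - 1                      # number of symbols influencing the M outputs
    centers, labels = [], []
    for seq in itertools.product([-1.0, 1.0], repeat=L):
        x = np.array(seq)                   # x[0] = x(k), x[1] = x(k-1), ...
        # Noise-free outputs y(k-r) = sum_i h_i x(k-r-i) for r = 0, ..., M-1.
        c = np.array([h @ x[r:r + len(h)] for r in range(M)])
        centers.append(c)
        labels.append(x[d])
    return np.array(centers), np.array(labels)

def bayesian_decision(y_vec, centers, labels, noise_std):
    """RBF-form decision: sign of the label-weighted sum of Gaussian activations."""
    phi = np.exp(-((y_vec - centers) ** 2).sum(axis=1) / (2.0 * noise_std ** 2))
    return 1.0 if labels @ phi >= 0 else -1.0

# Hypothetical setting: two-tap channel, order M = 2, decision delay d = 0.
rng = np.random.default_rng(1)
h = np.array([1.0, 0.5])
M, d, noise_std = 2, 0, 0.2
x = rng.choice([-1.0, 1.0], size=2000)
y = np.convolve(x, h)[:len(x)] + noise_std * rng.standard_normal(len(x))
centers, labels = channel_states(h, M, d)

start = M + len(h) - 2                      # first k for which all needed symbols exist
errors = sum(
    bayesian_decision(y[k - np.arange(M)], centers, labels, noise_std) != x[k - d]
    for k in range(start, len(x))
)
error_rate = errors / (len(x) - start)
```

The number of centers is 2 raised to the power M + N - 1 (eight in this example), which shows how quickly the network grows with the equalizer order and the channel length.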
In the next subsections, we summarize some of the most important research works
related to the application of RBF networks to channel equalization. In particular,
we will focus on the methods developed for the mitigation of intersymbol and
co-channel interferences.

5.2.1 Intersymbol interference


Numerous transversal adaptive equalizers based on RBF networks have been introduced
in the literature to overcome ISI and nonlinear distortions in communication
systems. In all cases, the training phase is the most important process involved in the
equalizer design. Many communication channels transmit at certain time intervals
a training sequence which can be used for estimating the network parameters.
Early adaptive RBF-based equalizers were designed for real-valued binary
channels. In these systems, the training of the RBF networks is usually carried out
in two stages [Chen et al., 1993a]. The first stage consists of a clustering algorithm
which computes the optimal centers of the network. If the training symbol sequence
provided by the communication system is available, the learning of the network
centers can be done in a supervised manner. If not, an unsupervised clustering
algorithm must be used (generally, the unsupervised k-means algorithm). In the
second stage the network weights are updated using, for example, a least mean
square (LMS) algorithm.
Chen et al. [Chen et al., 1994] developed a complex-valued version of their previous
equalizer which can be used for two-dimensional signalling such as quadrature
amplitude modulation (QAM) channels. In [Cha, 1995] another approach for training
complex-valued RBF network equalizers was proposed: a stochastic gradient
(SG) algorithm. The main advantage of this algorithm is the possibility of training
simultaneously all the free parameters of the network (centers and weights). In both
cases, it has been shown that the proposed approach is capable of successfully
equalizing slowly time-varying nonlinear channels and outperforms classical
linear equalizers.
The achievement of RBF networks with a minimum size (and complexity) is
one of the main issues considered in this application area. Gan et al. [Gan et al.,
1999] developed an equalizer in which the number of centers depended only on
the decision delay and not on the equalizer order as in previous approaches. They
obtained very good results in the equalization of fast time-varying complex-valued
channels (4-QAM). In this context, it is worth mentioning the application
of Minimal Resource Allocation Networks (MRAN) for equalization of real-valued
[Chandra Kumar et al., 1998] or complex-valued channels [Jianping et al., 2002].
MRAN is a network which uses a scheme for adding or removing RBF centers
depending on the channel conditions, leading to a minimum network structure.
Satellite mobile communication channels are characterized by the presence of
linear distortions due to the emitter and receiver filters, and nonlinearities caused
by on-board power amplifiers and multipath propagation. Bouchired et al. reported
that RBF networks have also been applied to satellite channel equalization,
outperforming LTE and MLP-based equalizers [Bouchired et al.,1998a]. In the same
application area, Bouchired et al. obtained very successful results in the
equalization of satellite channels by using Kohonen’s self-organizing maps (SOMs) for
improving the clustering stage in an RBF network equalizer [Bouchired et al.,1998b].
The application of RBF networks to decision feedback equalizers has also been
considered. Chen et al. [Chen et al., 1993b] obtained significant improvements
in both performance and reduction of computational complexity by combining
RBF networks with DFE structures. They applied this novel equalizer to highly
non-stationary channels (such as fading mobile radio channels) and demonstrated the
superior performance of their approach in comparison to a conventional Viterbi-based
equalizer.
As mentioned before, higher-order equalizers perform better than low-order ones.
In RBF-based equalization, an increase in the order implies exponential growth
in the number of RBF centers involved. To overcome this drawback,
recurrent RBF networks have been proposed in [Cid-Sueiro, 1994], [Mimura and
Furukawa, 2001] for linear and non-linear channels.

5.2.2 Co-channel interference


Many communication systems such as cellular mobile channels are impaired by
the so-called Multiple Access Interference (MAI). It originates from the frequency
reuse plan, which allows several cells to share the same set of frequencies. As a
result, the signal received at the mobile station consists of the sum of all signals sent
by the base station in the same cell and by those in other cells transmitting in the
same frequency band or in adjacent frequency bands (co-channel interference, CCI,
or adjacent-channel interference, ACI). Figure 4 illustrates a digital communication
system with co-channel interference. Although the main objective of channel
equalization has traditionally been to reduce the effects of ISI, several works have
shown that equalizers are able to mitigate the distortions due to co-channel and
adjacent-channel interference.
Adaptive RBF networks have been applied successfully to overcome CCI. In
[Chen and Mulgrew, 1992] an RBF-based transversal equalizer was constructed
for this purpose. This approach exploits the differences between the interfering
signals and the noise to improve the performance of a conventional equalizer.
Chen et al. extended their previous work to an equalizer with decision feedback
architecture [Chen et al., 1996] and adaptive implementation. They showed that in
the presence of severe CCI, the RBF approach produced significant improvements in
system performance in comparison to the conventional MLSE approach.
Finally, in [Chen et al., 2003] an RBF-based adaptive equalizer for channels with
intersymbol interference, additive Gaussian noise and co-channel interference was
proposed. The novelty of this study was the definition of a new algorithm (LBER)
for training the RBF network. LBER is a stochastic gradient adaptive algorithm
which tries to minimize the bit error rate (BER) instead of the mean square error
(MSE), which is the criterion used in conventional equalizers. As the real goal of
equalization is the minimum BER, this method significantly improved on the results
obtained with a linear MMSE equalizer.

Figure 4. Digital communication system with co-channel interference. The transmitted
signal x(k) passes through the own channel, the interfering signals xc1(k), ..., xcN(k)
pass through co-channels #1 to #N, and all channel outputs are added to the noise n(k)
to give the received signal y(k)

5.2.3 Blind equalization


The equalization techniques presented in the previous subsections rely on the existence
of a training signal which is transmitted at regular intervals. This allows the
regular adjustment of the RBF network parameters in order to improve its performance
in the presence of time-varying channels. However, when the training
sequence is not available or the channel is highly time-varying (for example,
Rician fading channels typical of outdoor mobile communication environments),
it is necessary to develop more sophisticated techniques, generally known as blind
equalization methods.
Most blind equalization methods include a channel estimator in order to
adapt the centers and weights of the RBF network using unsupervised training
techniques. For example, in [Cid-Sueiro, 1993] the channel response was estimated
through an unsupervised, non-decision-directed algorithm applied to a recurrent
RBF network equalizer.
However, other approaches do not require a channel estimator. For example,
Gomes and Barroso modified the RBF equalizer proposed by Chen et al. [Chen et al.,
1993a] for blind equalization. In this case, an unsupervised clustering algorithm
was used for updating the RBF centers and radii, while the network weights were
adjusted by minimizing an appropriate cost function [Gomes and Barroso, 1997].
Finally, in [Lin and Yamashita, 2002] the RBF equalizer was designed directly
using only the received signal and a novel cluster map algorithm.

5.3 Other RBF Networks Applications

In the previous sections, some of the most common applications of radial basis func-
tion networks in the field of digital communication systems have been described.
However, in this context, there are other important areas in which RBFs have
demonstrated their usefulness and good performance. These areas will be summa-
rized in the next paragraphs.
Several authors have applied RBFs to coding-decoding systems. In [Kaminsky
and Deshpande, 2003], the functional and structural similarities between directed
graphs in conventional Viterbi decoding and RBF networks were exploited to
develop an adaptive RBF-based decoder for a trellis coded modulation (TCM)
system, which was able to take advantage of its learning capabilities to improve
the decoding decisions. Müller and Elmirghani employed RBFs to construct a
novel chaotic-based coding/decoding method [Müller and Elmirghani, 2002] and
showed that this new strategy was highly robust to noise compared to
other conventional schemes for the same problem.
System modelling and identification is another fundamental issue in many com-
munication systems. It can be used for channel estimation as a part of a blind
equalization system or for computer-based simulation of communication systems.
Two good examples are the studies conducted in [Yingwei et al., 1996], [Leong
et al., 2002], in which RBFs were used for adaptive identification of non-linear
systems taking advantage of their universal approximation property.
RBFs can also be applied successfully to the problem of fault detection in access
networks as shown in [Zhou and Austin, 1999] in which RBF networks were used for
automatically learning the relationship between several telephone line parameters
and fault occurrences.
Finally, RBF networks can be used for automatic recognition of modulation
schemes or identification of wireless standards. These issues have a direct application
in the design of wireless reconfigurable receivers [Palicot and Roland, 2003].
The interest of this novel kind of device is clear, given the rapid growth of
services that may have to be carried over several heterogeneous wireless
networks such as the Global System for Mobile Communications (GSM) or the
Universal Mobile Telecommunication System (UMTS) standards.

REFERENCES
Cortes, C. and Vapnik, V.N., Support vector networks. Machine learning, 20: 273–297, 1995.
Boser, B., Guyon, I. and Vapnik, V.N., A training algorithm for optimal margin classifiers. Fifth annual
workshop on computational learning theory, :144–152, San Mateo, CA, 1992.
Bouchired, S., Ibnkahla, M., Roviras, D. and Castanié, F., Equalization of satellite mobile communication
channels using combined self-organizing maps and RBF networks, In Proceedings of IEEE Inter-
national Conference on Acoustics, Speech and Signal Processing (ICASSP’98), 3377–3380, Seattle,
WA, USA, April 1998b.
Bouchired, S., Ibnkahla, M., Roviras, D. and Castanié, F., “Equalization of satellite UMTS channels
using RBF networks”, In Proceedings of IEEE Workshop on Personal Indoor and Mobile Radio
Communications (PIMRC), Boston, USA, September 1998a.
Broomhead, D.S. and Lowe, D., Multivariable functional interpolation and adaptive networks. Complex
Systems, 2:321–355, 1988.
Cid-Sueiro, J. and Figueiras-Vidal, A. R., Recurrent radial basis function networks for optimal blind
equalization, In IEEE-SP Workshop on Neural Networks for Signal Processing, 562–571, Baltimore,
MA (USA), September 1993.
Cid-Sueiro, J., Artés-Rodríguez A. and Figueiras-Vidal, A. R., Recurrent radial basis function networks
for optimal symbol-by-symbol equalization, In Signal Processing, vol. 40, no. 1, 53–63, October
1994.
Cha, I. and Kassam, S. A., Channel equalization using adaptive complex radial basis function networks,
In IEEE Journal on Selected Areas in Communications, vol. 13, no. 1, 122–131, January 1995.
Chandra Kumar, P., Saratchandran, P. and Sundararajan, N., Non-linear channel equalisation using
minimal radial basis function neural networks, In Proceedings of IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP’98), 3373–3376, Seattle, WA, USA, April 1998.
Chen, S., Cowan, C. F. N. and Grant, P. M., Orthogonal least squares learning algorithm for radial basis
function networks, In IEEE Transactions on Neural Networks, vol. 2, 302–309, March 1991.
Chen, S. and Mulgrew, B., Overcoming co-channel interference using an adaptive radial basis function
equalizer, In EURASIP Signal Processing Journal, vol. 28, no. 1, 91–107, 1992.
Chen, S., Mulgrew, B. and Grant, P. M., A clustering technique for digital communications channel
equalization using radial basis function networks, In IEEE Transactions on Neural Networks, vol. 4,
no. 4, 570–579, July 1993a.
Chen, S., Mulgrew, B. and McLaughlin, S., Adaptive bayesian equalizer with decision feedback, In
IEEE Transactions on Signal Processing, vol. 41, no. 9, 2918–2927, September 1993b.
Chen, S., McLaughlin, S. and Mulgrew, B., Complex-valued radial basis function networks, Part II:
Application to digital communication channel equalization, In Signal Processing, vol. 36, no. 2,
175–188, March 1994.
Chen, S., McLaughlin, S. and Mulgrew, B. and Grant, P. M., Bayesian decision feedback equalizer for
overcoming co-channel interference, In Proc. Inst. Elect. Eng., vol. 143, 219–225, August 1996.
Chen, S., Multi-output regression using a locally regularised orthogonal least-squares algorithm. IEE
proceedings-vision image and signal processing, 149 (4): 185–195, 2002.
Chen, S., Mulgrew, B. and Hanzo, L., “Least bit error rate adaptive nonlinear equalisers for binary
signalling”, In IEE Proc. Communications, vol. 150, no. 1, 29–36, February 2003.
Cover, T.M., Geometrical and statistical properties of systems of linear inequalities with applications in
pattern recognition. IEEE Transactions on Electronic Computers, EC-14:326–334, 1965.
Du, K.-L., Lai, A. K. Y., Cheng, K. K. M. and Swamy, M. N. S., Neural methods for antenna array
signal processing: a review, In Signal Processing, vol. 82, 547–561, 2002.
El Zooghby, A. H., Christodoulou, C. G. and Georgiopoulos, M., Performance of radial basis function
with antenna arrays, In IEEE Transactions on Antennas and Propagation, vol. 45, no. 11, 1611–1617,
1997.
El Zooghby, A. H., Christodoulou, C. G. and Georgiopoulos, M., Neural network-based adaptive
beamforming for one- and two-dimensional antenna arrays, In IEEE Transactions on Antennas and
Propagation, vol. 46, no. 12, 1891–1893, 1998.
El Zooghby, A. H., Christodoulou, C. G. and Georgiopoulos, M., A neural network-based smart antenna
for multiple source tracking, In IEEE Transactions on Antennas and Propagation, vol. 48, no. 5,
768-776, May 2000.
Gan, Q., Saratchandran, P., Sundararajan, N. and Subramanian, K. R., A complex valued radial basis
function network for equalization of fast time varying channels, In IEEE Transactions on Neural
Networks, vol. 10, no. 4, 958–960, July 1999.
Golub, G.H. and Van Loan, C.G., Matrix computations. Johns Hopkins University Press, 1996.
Gomes, J. and Barroso, V., Using a RBF network for blind equalization: design and performance
evaluation, In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP’97), vol. 4, 3285–3288, April 1997.
Gomm, B.J. and Yu, D.L., Selecting radial basis function network centers with recursive orthogonal
least squares training. IEEE transactions on neural networks, 11 (2) :306–314, 2000.
Haykin, S., Neural Networks: a comprehensive foundation Prentice-Hall, New York, 1999.
Haykin, S., Adaptive Filter Theory Prentice Hall, New Jersey, 2002.
Hochstadt, H., Integral equations. Wiley Classics Library. John Wiley and Sons Inc., New York, 1989.
ISBN 0-471-50404-1. Reprint of the 1973 original, A Wiley-Interscience Publication.
Ibnkahla, M., Applications of neural networks to digital communications - a survey, In Signal Processing,
vol. 80, 1185–1215, 2000.
Jianping, D., Sundararajan, N. and Saratchandran, P., Communication channel equalization using
complex-valued radial basis function neural networks, In IEEE Transactions on Neural Networks,
vol. 13, no. 3, 687–696, May 2002.
Kaminsky, E. J. and Deshpande, N., TCM decoding using neural networks, In Engineering Applications
of Artificial Intelligence, vol. 16, 473-489, 2003.
Kohonen, T., The Self-Organizing Map. Proceedings IEEE, 78 (9) :1464–1480, 1990.
Leong, T. K., Saratchandran, P. and Sundararajan, N., Real-time performance evaluation of the minimal
radial basis function network for identification of time varying nonlinear systems, In Computers and
Electrical Engineering, vol. 28, 103–117, 2002.
Lin, H. and Yamashita, K., Blind RBF equalizer for received signal constellation independent channel,
In Proceedings of 8th International Conference on Communication Systems (ICCS’02), 82–86, 2002.
Lo, T., Leung, H. and Litva, J., Radial basis function neural network for direction-of-arrivals estimation,
In IEEE Signal Processing Letters, vol. 1, no. 2, 45–47, February 1994.
Micchelli, C.A., Interpolation of scattered data: distance matrices and conditionally positive definite
functions, 2:11–22, 1986.
Mimura, M., Furukawa, T., A recurrent RBF network for non-linear channel, In Proceedings of IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP’01), 1297–1300, 2001.
Müller, A. and Elmirghani, J. M. H., “Novel approaches to signal transmission based on chaotic signals
and artificial neural networks”, In IEEE Transactions on Communications, vol. 50, no. 3, 384–390,
March 2002.
Mukai, R., Vilnrotter, V. A., Arabshahi, P. and Jamnejad, V., Adaptive acquisition and tracking for deep
space array feed antennas, In IEEE Transactions on Neural Networks, vol. 13, no. 5, 1149–1162,
September 2002.
Mulgrew, B., Applying radial basis functions, In IEEE Signal Processing Magazine, vol. 13, no. 2,
50–65, March 1996.
Mulgrew, B., Nonlinear signal processing for adaptive equalisation and multi-user detection, In Proc. IX
European Signal Processing Conference (EUSIPCO’98), 537–544, Rhodes, Greece, September 1998.
Palicot, J. and Roland, C., A new concept for wireless reconfigurable receivers, In IEEE Communication
Magazine, 124–132, July 2003.
Park, J., and Sandberg, I.W., Universal approximation using radial basis function networks. Neural
Computation,3 :246–257, 1991.
Poggio, T. and Girosi, F., Networks for approximation and learning. Proceedings of the IEEE, 78:
1481–1497, 1990.
Powell, M.J.D., Radial basis function approximations to polynomials Numerical Analysis 1987 proceed-
ings, :223–241, 1988.
Proakis, J. G., Digital communications, McGraw-Hill, Boston, 4th edition, 2001.
Qureshi, S. U. H., Adaptive equalization, In Proc. IEEE, vol. 73, 1349–1387, 1985.
Schmidt, R., Multiple emitter location and signal parameter estimation In Antennas and Propagation,
IEEE Transactions on (legacy, pre - 1988) Volume 34, Issue 3, Mar 1986 Page(s):276–280
Shertinsky, A. and Picard, R.W., On the efficiency of the orthogonal least squares training method for
radial basis function networks. IEEE transactions on neural networks, 7 (1) :195–200, 1996.
Southall, H. L., Simmers, J. A. and O’Donnel, T. H., Direction finding in phased arrays with a neural
network beamformer, In IEEE Transactions on Antennas and Propagation, vol. 43, no. 12, 1369–1374,
1995.
Vapnik, V.N, The nature of statistical learning theory. Wiley, New York, 1995.
Vapnik, V.N., Statistical learning theory. Wiley, New York, 1998.
Xu, C. Q., Law, C. L. and Yoshida, S., Interference rejection in non-Gaussian noise for satellite
communications using non-linear beamforming, In International Journal of Satellite Communications
and Networking, vol. 21, 13–22, 2003.
Yingwei, L., Sundararajan, N. and Saratchandran, P., Adaptive nonlinear system identification using
minimal radial basis function neural networks, In Proceedings of IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP’96), vol. 6, 3521–3524, 1996.
Zhou, P. and Austin, J., Neural network approach to improving fault location in local telephone networks,
In Proc. Artificial Neural Networks, 958–963, 1999.
CHAPTER 6
BIOLOGICAL CLUES FOR UP-TO-DATE ARTIFICIAL
NEURONS

JAVIER ROPERO PELÁEZ, JOSE ROBERTO CASTILLO PIQUEIRA


Escola Politécnica da Universidade de São Paulo, Departamento de Engenharia
de Telecomunicações e Controle

Abstract: The McCulloch-Pitts neuron is a simple yet powerful building block for most of today’s
neural networks. However, recent advances in neuroscience show that this classical
paradigm can be improved. For example, biological synaptic weights have
properties such as synaptic normalization and metaplasticity that are crucial for
developing new neural network architectures. Other peculiar biological mechanisms, such as the
synchronization among neurons, which allows the identification of the neuron with maximal
activation, and the dual behavior (high/low frequency) of some biological neurons, can be
used to improve the performance of artificial neural networks. As an example, a new
neural network that mimics the human thalamus will be analyzed and tested.

INTRODUCTION

In 1943 Warren McCulloch and Walter Pitts published the pioneering work entitled
“A Logical Calculus of the Ideas Immanent in Nervous Activity” [McCulloch and
Pitts, 1943], in which they proposed that neurons are essentially binary units
performing Boolean computations. Their proposal was based on the limited knowledge
about the nervous system available at that time. Their work was the source of
inspiration for many neural network designers: from Frank Rosenblatt, the creator
of the Perceptron [Rosenblatt, 1956], a neural network for recognizing characters,
to Rumelhart, Hinton and Williams [Rumelhart et al., 1986], who developed the
backpropagation algorithm for adjusting the synaptic weights in multi-layered neural
networks [McClelland, 1986] [McClelland and Rumelhart, 1986]. Many other
neural networks inspired by the McCulloch and Pitts model were developed, such as
Kohonen’s self-organizing map [Kohonen, 1982], Hopfield’s auto-associative
model [Hopfield, 1982], Grossberg’s ART [Carpenter and Grossberg, 1988], etc.
All these neural networks essentially use McCulloch’s neuron model, although they
131
D. Andina and D.T. Pham (eds.), Computational Intelligence, 131–146.
© 2007 Springer.

lack many recently discovered properties of real neurons. For these networks to
function properly, a great many mathematical algorithms are needed to compensate for
the lack of up-to-date neuronal properties.
The purpose of this work is to review some of the recently found properties of
individual neurons that can improve the conventional model of the neuron. The
following section introduces these properties in bottom-up order: first the synaptic
plasticity properties, then the neuron properties and finally the network
properties that are relevant for the functioning of individual neurons. Afterwards an
updated paradigm will be presented in which synapses, neurons and networks
are revisited. Finally, we integrate the updated neuron model into a model of a real
brain structure, the thalamus, located at the core of the brain.
In the following sections the essentials are explained in the text, while the finer
details are given in the figures, which are self-explanatory.

1. BIOLOGICAL PROPERTIES

1.1 Synaptic Plasticity Properties


Synaptic plasticity, the occurrence of sustained changes in synapses, was envisioned
in 1894 by Santiago Ramón y Cajal, who pointed out that learning could produce
changes in the communication between neurons and that these changes could
be the essential mechanism of memory. In 1948 Konorski alluded to persistent
plastic changes in memory, and in 1949 Hebb postulated that during learning synaptic
connections are strengthened due to the correlated activity of presynaptic and
postsynaptic neurons. The mathematical formalization of the Hebb rule is:

(1) $\Delta w(n) = \varepsilon \cdot A(n)\,B(n)$

in which the increment of weight at instant n is proportional (through the factor $\varepsilon$)
to the presynaptic activity at time n, A(n), and to the postsynaptic activity, B(n).
Unfortunately this rule is not able to produce reversible changes in synapses. Such
reversible changes appeared in later models such as Sejnowski’s covariance
rule and the BCM (Bienenstock, Cooper and Munro) model [Bienestock
et al., 1982].
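
As a minimal sketch of rule (1), assuming non-negative activities (for example normalized firing rates) and an illustrative value for the proportionality factor (the name `epsilon`, its value and the activity sequences are assumptions, not taken from the original work), the following Python fragment shows why the plain Hebb rule cannot reverse a change: every increment is non-negative, so the weight can only grow or stay constant.

```python
def hebb_update(w, a, b, epsilon=0.1):
    """Return the weight after one Hebbian step: w + epsilon * a * b."""
    return w + epsilon * a * b


# Hypothetical presynaptic (A) and postsynaptic (B) activities over four steps.
A = [1.0, 1.0, 0.0, 1.0]
B = [1.0, 0.0, 1.0, 1.0]

w = 0.5
history = [w]
for a, b in zip(A, B):
    w = hebb_update(w, a, b)
    history.append(w)

# The weight grows only when A and B are active together and never decreases.
print(history)
```

With these inputs the weight changes only at the first and last steps, when both activities are present; a presynaptic event without a postsynaptic one leaves the weight untouched.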
None of these models was formulated considering the real nature of plasticity in
terms of molecular interactions (see Figure 1) [Bear et al., 2001]. From this point of
view it is relevant to notice that events A and B, which appear equal in importance
in the Hebbian learning rule, are not equally important during the plastic changes
that take place in real synapses. We will call this property directionality. The
presynaptic event A, a presynaptic action potential, and the postsynaptic event B,
a postsynaptic increase of potential above a certain value, do not play the same
role in real synapses. For example, in the Hebbian learning rule, when event A is
present and event B is not, no alteration of synaptic strength takes place. However,
in these same circumstances real synapses are depressed, because the low calcium
concentration in the postsynaptic area leads to synaptic depression (see Figure 1).

Figure 1. Potentiation and depression are accomplished in biological synapses in the following way:
the neurotransmitter glutamate is released into the synaptic cleft from the presynaptic neuron. This
neurotransmitter acts like a key that opens the postsynaptic gates (channels), which are of two types:
NMDA and AMPA. To open the NMDA gates it is also necessary to expel the magnesium (Mg2+) that
blocks the NMDA channels. For this purpose, a certain voltage must be attained in the postsynaptic
area. Once the channels are open, Na+ ions are allowed to enter through both types of channels. The
hydrated Ca2+ ion is big and for this reason it only passes through the NMDA channels. A large amount
of Ca2+ entering the postsynaptic space produces an increase in the number and size of postsynaptic
channels (synaptic potentiation or LTP), while a moderate amount produces a decrease (synaptic
depression or LTD)

When event A takes place without event B, little activation is expected in
the postsynaptic neuron. In contrast, when events A and B both occur,
a high activation probably takes place in the postsynaptic neuron. In the
BCM model mentioned above [Bienestock et al., 1982] these two intervals of high and
low activation correspond respectively to regions of increase and decrease of
synaptic strength. Artola et al. [Artola et al., 1990] corroborated the BCM model in
slices of rat visual cortex. This model is explained in more detail in Figure 2, where
two thresholds separate the different regions: the LTD (Long Term Depression)
threshold separates the region of no change of synaptic strength from the region of
synaptic depression, and the LTP (Long Term Potentiation) threshold separates the
region of synaptic depression from the region of synaptic potentiation.
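
The three regions delimited by the two thresholds can be sketched as a simple function of postsynaptic activity. The following Python fragment is only an illustration: the threshold values, the `rate` factor and the particular curve shape (a parabola between the thresholds, a straight line above the LTP threshold) are assumptions chosen to reproduce the sign pattern described above, not a fit to experimental data.

```python
def weight_change(activity, ltd_threshold=0.3, ltp_threshold=0.6, rate=1.0):
    """Sign pattern of the plasticity curve (hypothetical shape and values).

    activity <= LTD threshold       -> no change
    LTD threshold < activity < LTP  -> depression (negative change)
    activity >= LTP threshold       -> potentiation (positive above the threshold)
    """
    if activity <= ltd_threshold:
        return 0.0
    if activity < ltp_threshold:
        # Negative lobe that vanishes at both thresholds.
        return -rate * (activity - ltd_threshold) * (ltp_threshold - activity)
    return rate * (activity - ltp_threshold)


for activity in (0.1, 0.45, 0.9):
    print(activity, weight_change(activity))
```

In this sketch, event A without event B would correspond to a low activity falling between the two thresholds (depression), while A together with B would push the activity above the LTP threshold (potentiation).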

1.1.1 Synaptic metaplasticity


Synaptic metaplasticity [Abraham and Tate, 1997] [Pompa and Friedrich, 1998],
which some authors consider the plasticity of synaptic plasticity [Abraham and
Bear, 1996], consists in the dependence of the LTP threshold on the initial synaptic
weight. If the initial weight is low the LTP threshold is also low, and when the
initial weight is high the LTP threshold moves to higher values of postsynaptic
activation (see the detailed explanation in Figure 3).
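
A minimal sketch of this dependence, assuming for illustration that the LTP threshold grows linearly with the initial weight (the linear law, the constants `base` and `slope`, and the curve shape are all assumptions, not part of the original model):

```python
def ltp_threshold(initial_weight, base=0.4, slope=0.5):
    """Hypothetical linear dependence of the LTP threshold on the initial weight."""
    return base + slope * initial_weight


def weight_change(activity, initial_weight, ltd_threshold=0.2, rate=1.0):
    """Plasticity curve whose LTP threshold shifts with the initial weight."""
    ltp = ltp_threshold(initial_weight)
    if activity <= ltd_threshold:
        return 0.0
    if activity < ltp:
        return -rate * (activity - ltd_threshold) * (ltp - activity)
    return rate * (activity - ltp)


# The same postsynaptic activity potentiates a weak synapse
# and depresses a strong one.
activity = 0.6
for w0 in (0.1, 0.9):
    print(w0, ltp_threshold(w0), weight_change(activity, w0))
```

The same postsynaptic activity therefore lies above the LTP threshold of a weak synapse and below that of a strong one, which is the behaviour the shifted curves of Figure 3 describe.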

Figure 2. Change of synaptic strength (vertical axis) as a function of postsynaptic activity
(horizontal axis) in biological neurons. If the postsynaptic activity is above the LTP threshold,
a positive change of synaptic strength takes place. If the postsynaptic activity is between the
LTD and LTP thresholds, a negative change of synaptic strength takes place. For lower values of
postsynaptic activity neither an increase nor a decrease of synaptic strength is observed

Figure 3. Synaptic metaplasticity consists in the shift of the LTP threshold according to the initial
weight of the synapse. The two panels, each plotting the change in synaptic strength (Δw) against
postsynaptic activity (voltage) for a different initial weight, show this idea graphically. For higher
values of the initial synaptic weight the curve is elongated, so that the LTP threshold corresponds to
higher values of postsynaptic activity

None of the models of synaptic weight alteration presented so far considers the
property of synaptic metaplasticity. In section 3 we will propose an alternative
model that also exhibits this important property.

1.1.2 Synaptic normalization


Synaptic normalization is a property of biological synapses [Turrigiano, 1998] consisting
in the normalization of the synaptic weights of a neuron. This is equivalent
to the mathematical operation of dividing the synaptic weights by their norm, which
is calculated as the overall sum of the weights. For example, if the set of weights of
a neuron is expressed as the vector W = (0.4, 0.5, 0.1, 0.2, 0.7, 0.1), its norm is
calculated as normW = 0.4 + 0.5 + 0.1 + 0.2 + 0.7 + 0.1 = 2. Dividing W by this
norm, normW = 2, a new W’ is obtained: W’ = (0.2, 0.25, 0.05, 0.1, 0.35, 0.05). In
biological neurons the norm can be multiplied by an arbitrary multiplicative factor k.
This norm is therefore calculated as $\mathrm{norm}_W = k \sum_{i=1}^{n} w_i$, where $w_i$ is each one
of the n synaptic weights. The biological counterpart of synaptic normalization is
explained in more detail in Figure 4.
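
The operation just described can be sketched in a few lines of Python. The function name `normalize` and the error handling are illustrative choices; the numbers are those of the example above and of the two synapses of Figure 4.

```python
def normalize(weights, k=1.0):
    """Divide each weight by the norm k * sum(weights)."""
    norm = k * sum(weights)
    if norm == 0:
        raise ValueError("cannot normalize weights whose norm is zero")
    return [w / norm for w in weights]


# Worked example from the text: the norm is 2, so every weight is halved.
W = [0.4, 0.5, 0.1, 0.2, 0.7, 0.1]
W_prime = normalize(W)
print([round(w, 3) for w in W_prime])

# Two synapses with 2 and 6 ion channels brought back to four channels
# in total (k = 1/4), keeping the 1:3 proportion.
print(normalize([2.0, 6.0], k=0.25))
```

With k = 1 the normalized weights sum to one; other values of k only change the total to which the weights are rescaled (1/k), while the proportions between the weights are preserved.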
Synaptic metaplasticity and synaptic normalization are considered homeostatic
properties of synapses. Homeostasis is the property of a system to return to its
equilibrium when it is forced to leave it. In this sense
metaplasticity favours synaptic depression when the initial synaptic weight is high
and favours synaptic potentiation when the synaptic weight is low. Synaptic
normalization is also a homeostatic property because, when the total synaptic
weight of a set of nearby synapses is altered, the tendency is to bring this set back
to its original overall weight.

Figure 4. Synaptic normalization. The figure shows the evolution of two nearby synapses
before (moderate activity), during (high activity) and after (moderate activity) the stimulation
of the second of them. In the initial stage, the two synapses have two ion channels each, giving
four ion channels in total. During the stimulation (center) the second synapse increases the
number of its ion channels to six, while the number of ion channels in the first synapse remains
the same: two. Therefore the weight of the second synapse becomes three times bigger than that
of the first, in a proportion 6:2=3. After some hours the total number of channels in both
synapses goes back to the initial value, four: one channel in the first synapse and three channels
in the second. However, the relative proportion of ion channels obtained during the stimulation,
6:2=3, is maintained, being 3:1=3 in the last stage

Up to this point some homeostatic properties of synapses have been explained. The
following section starts with a homeostatic property of neurons: spike
threshold adaptation.

1.2 Neuron’s Properties

1.2.1 Spike threshold adaptation


The firing rate of a biological neuron depends on the value of its synaptic inputs
according to the solid line in Figure 5. If the synaptic input is high the neuron fires
at its maximal rate, and if it is low it stops firing. In an intermediate range the firing
rate is almost proportional to the synaptic input. This is the only useful range for
the activity of the neuron: if the synaptic inputs vary inside either of the two
extreme ranges, this variation produces no variation at the neuron’s output. For this
reason it is very important that the neuron is tuned to the intermediate region, which
is the region of maximal firing rate variation.
This tuning actually occurs in real neurons according to the studies of Desai and
colleagues [Desai et al., 1999]. When the synaptic input activity is very high the
sigmoidal activation function shifts to the right, and when the synaptic input activity
is very low this function shifts to the left (see Figure 5). The shift of the neuron’s
activation function was a characteristic of Rosenblatt’s Perceptron [Rosenblatt,
1956], which was a pioneering neural network for character recognition.
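
The effect of such a shift can be illustrated with a logistic activation function whose horizontal position tracks the recent synaptic input. This is a minimal sketch: the logistic form, the exponential tracking rule and the parameters `gain` and `speed` are assumptions made for illustration, not the mechanism reported by Desai and colleagues.

```python
import math


def firing_rate(x, shift, gain=1.0, max_rate=1.0):
    """Sigmoidal activation centred at `shift` (hypothetical logistic form)."""
    return max_rate / (1.0 + math.exp(-gain * (x - shift)))


def adapt_shift(shift, x, speed=0.1):
    """Move the centre of the sigmoid a fraction `speed` toward the input x."""
    return shift + speed * (x - shift)


high_input = 5.0
shift = 0.0
saturated = firing_rate(high_input, shift)   # close to the maximal rate

# Sustained high input: the curve drifts to the right.
for _ in range(100):
    shift = adapt_shift(shift, high_input)

retuned = firing_rate(high_input, shift)     # back in the sensitive range
print(saturated, shift, retuned)
```

Before adaptation the high input saturates the neuron; after the curve has shifted to the right the same input falls in the middle of the sigmoid, where small input variations again change the firing rate.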

Figure 5. The activation function of a biological neuron (firing rate versus synaptic input) is
capable of shifting to the right or to the left depending on the synaptic input of the neuron.
Let us suppose that the activation function is given by the solid line. If synaptic input levels
become very high the curve shifts to the right to avoid the saturation of the firing rate.
Conversely, if synaptic input levels are very low the activation function shifts to the left in
order to prevent the neuron from staying silent

1.2.2 Burst and tonic firing


The purpose of this section is to illustrate that neurons are capable of
dual frequency behaviour. Thalamo-cortical neurons, which are located in the
thalamus, at the core of the brain, are able to fire either in tonic or in burst mode,
as shown in Figure 6. The main characteristic of the tonic mode is that the spiking
frequency is proportional to the stimulus, lying in the range of 10 to 165 Hz.
In the burst mode, however, the frequency is not related to the input activation
and lies in the range of 150 to 320 Hz. The burst mode is very interesting because
it takes place only after a precise sequence of events. For the burst mode to
happen, the thalamo-cortical neuron needs to be positively stimulated and afterwards
inhibited for at least 100 msec. After these two events, burst firing
is produced when a slight positive stimulation is given to the neuron. For a deeper
study of these mechanisms see [Llinas and Jahnsen, 1982], [Llinas, 1994], [Steriade
and Llinas, 1988].

Figure 6. Some types of neurons, like thalamo-cortical neurons, present a dual firing behaviour:
in their tonic firing mode the frequency of their response is proportional to the stimulus
(10–165 Hz). However, when they are stimulated and afterwards inhibited for at least 100 msec,
their response changes to burst firing with much higher frequencies (150–320 Hz)
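
The sequence of events that leads to burst firing can be summarized as a small decision function. The following Python fragment is a toy sketch, not a biophysical model: the function name, the normalized stimulus in [0, 1], the linear mapping onto the tonic range and the use of the midpoint of the burst range as a representative burst frequency are all assumptions.

```python
TONIC_RANGE_HZ = (10.0, 165.0)   # range quoted in the text
BURST_RANGE_HZ = (150.0, 320.0)  # range quoted in the text
MIN_INHIBITION_MS = 100.0


def firing_mode(was_stimulated, inhibition_ms, stimulus):
    """Return (mode, frequency in Hz) for a stimulus normalized to [0, 1]."""
    if stimulus <= 0.0:
        return ("silent", 0.0)
    if was_stimulated and inhibition_ms >= MIN_INHIBITION_MS:
        # Burst frequency does not depend on the stimulus; the midpoint of
        # the quoted range is used here as an arbitrary representative value.
        low, high = BURST_RANGE_HZ
        return ("burst", (low + high) / 2.0)
    # Tonic mode: frequency grows with the stimulus (assumed linear mapping).
    low, high = TONIC_RANGE_HZ
    return ("tonic", low + min(stimulus, 1.0) * (high - low))


print(firing_mode(True, 120.0, 0.1))   # stimulated, inhibited long enough
print(firing_mode(True, 50.0, 0.5))    # inhibition too short
print(firing_mode(False, 120.0, 0.5))  # no previous stimulation
```

Only the first call satisfies the full sequence (previous stimulation, inhibition of at least 100 msec, then a slight stimulation) and therefore yields a burst; the other two remain in tonic mode.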
The purpose of this dual behaviour is still a matter of controversy. Ropero
[Ropero, 1997] proposed that the tonic mode serves for intrathalamic operations.
When these intrathalamic operations are concluded, the result is relayed
to the cortex via the burst firing mode.

1.3 Network Properties


1.3.1 Synchronization among neurons
Some types of neurons, for example the granule cells of the olfactory bulb and the
reticular cells in the thalamus, are able to synchronize their activity and, afterwards,
oscillate together [McCormick and Pape, 1990], [Steriade et al., 1987]. One of the
causes of this behaviour is that these neurons possess dendro-dendritic [Deschenes
et al., 1985] electric contacts in which the potential is communicated directly from
one neuron to the other without any kind of neurotransmitter in between. The
situation is as if we had a set of ping-pong balls tied together by fine cords and used
two very big bats to play with them. The movement of the balls becomes more and
more uniform and synchronized during play.
The kinetic energy given by each one of the bats to the balls corresponds to
the electric energy of ions entering the neurons. One type of ion raises the
inner potential of the neurons when it is below a certain threshold, and another type of
ion reduces the potential when it is above an upper voltage threshold.
138 CHAPTER 6

This interplay between the ions and the potential sharing among dendro-dendritically connected
neurons generates the synchronized oscillations. This behaviour was modelled and
programmed in Matlab [Ropero, 2003] with the results shown in Figure 7.
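
The mechanism just described can be illustrated with a minimal Python sketch. It is not the Matlab model of [Ropero, 2003]: the function name, the two voltage thresholds (`lower`, `upper`), the constant ionic drive (`rate`), the sharing strength (`coupling`) and the choice of sharing the potential through the mean of the net are all assumptions made only for illustration. Each of the 7 × 7 = 49 units is charged until it crosses the upper threshold, then discharged until it crosses the lower one, while part of its potential is shared with the other units.

```python
import numpy as np

def simulate_synchronization(n_units=49, steps=4000, dt=0.01, rate=1.0,
                             coupling=0.5, lower=-1.0, upper=1.0):
    """Toy net of oscillating units that partially share their potential.

    All parameter values are illustrative assumptions, not measured data.
    Returns the spread (max - min) of the potentials at every time step.
    """
    # Different initial potentials, all units starting in the charging state.
    v = np.linspace(-0.9, 0.9, n_units)
    direction = np.ones(n_units)          # +1: charging, -1: discharging
    spreads = np.empty(steps)
    for t in range(steps):
        shared = coupling * (v.mean() - v)    # potential sharing (mean field)
        v = v + dt * (direction * rate + shared)
        direction[v >= upper] = -1.0          # upper threshold: start discharging
        direction[v <= lower] = 1.0           # lower threshold: start charging
        spreads[t] = v.max() - v.min()
    return spreads

spreads = simulate_synchronization()
print("initial spread:", spreads[0], "final spread:", spreads[-1])
```

Under these assumptions the spread between the most and the least depolarized units shrinks as the simulation proceeds, which is the qualitative behaviour illustrated in Figure 7.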

1.3.2 Normalizing inhibition


Inhibitory neurons were long supposed to perform only subtraction [Carandini and
Heeger, 1994] over other neurons, and this property was used for biasing the neurons
in conventional neural network models like backpropagation or radial basis
networks. The operation of biasing the neurons is equivalent to shifting the activation
function of these neurons to the right or to the left in a similar way to
the one explained in section 2.2. This kind of subtracting or biasing inhibition
is performed in real neurons by the neurotransmitter GABA (gamma-aminobutyric acid)
acting on GABA-B receptors. However, in many cases inhibition is mediated
by GABA-A instead of GABA-B, and the effect of GABA-A
inhibition is divisive rather than subtractive.
We postulate that this GABA-A inhibition could perform a scaling or normalizing
effect on the input patterns arriving at a certain layer of the brain.
Many structures in the brain have a layered organization. The input to each layer
goes to two types of neurons: (A) neurons that perform an excitatory projection
onto the following layer; (B) GABA-A neurons that produce inhibition inside their
own layer, thereby creating an inhibitory divisive field in the layer (see Figure 8).

Figure 7. The height of each intersection of lines over the surface represents the activation of a 7 × 7
net of neurons. If each of the neurons in this net has an oscillatory activity and the potential of each of
them is partially shared with the other neurons, a synchronization of the activities takes place.
From top to bottom and from left to right, a computer simulation of the synchronization of a 7 × 7 net
of neurons is shown
[Figure 8 diagram labels: inputs I1–I6 (lower layer); excitatory connections (+); field of inhibitory interneurons (−); normalized inputs i1–i6]

Figure 8. Normalization of synaptic inputs due to an inhibitory field of GABA-A inhibitory
interneurons. The six neurons of the lower layer of the figure form an excitatory input
$I = (I_1, I_2, I_3, I_4, I_5, I_6)$ impinging on a second layer of neurons (middle). This pattern produces an
excitation (+) over the six neurons in the middle layer and over GABA-A inhibitory interneurons that
are not shown. Once these inhibitory interneurons are activated, they create an inhibitory field that
divides the activation of these middle layer neurons by
$n(I) = n(I_1) + n(I_2) + n(I_3) + n(I_4) + n(I_5) + n(I_6)$. In this way the neuron at the top receives a
normalized input $i = (i_1, i_2, i_3, i_4, i_5, i_6)$ that is the result of dividing each of the components of pattern $I$
by the constant $n(I)$

The activations of the excitatory and inhibitory neurons in each layer have almost the same
absolute value because the input pattern impinges at the same time on excitatory and
inhibitory neurons. Therefore the inhibitory divisive field is proportional to this
activation.
This divisive inhibition is able to produce a sort of normalization of the input
patterns (see Figure 8 for more details).
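
As a hedged illustration of this normalization, the following Python sketch divides each component of an input pattern by the sum of all components, following the caption of Figure 8. The function name and the numerical pattern are hypothetical, and the handling of an all-zero pattern is an assumption added only to keep the example well defined.

```python
import numpy as np

def divisive_normalization(pattern):
    """Divide each input component by the total input n(I), as in Figure 8."""
    pattern = np.asarray(pattern, dtype=float)
    total = pattern.sum()                  # strength of the inhibitory field
    if total == 0.0:                       # assumption: a silent input stays silent
        return np.zeros_like(pattern)
    return pattern / total

raw = [2.0, 4.0, 6.0, 8.0, 0.0, 0.0]       # hypothetical input pattern I
normalized = divisive_normalization(raw)
print(normalized, normalized.sum())
```

With this rule a pattern and any scaled copy of it produce the same normalized input, so only the shape of the pattern, not its overall intensity, reaches the next layer.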

2. UPDATING McCULLOCH-PITTS MODEL

Up to this point we have introduced several properties of real neurons of remarkable
interest for computational purposes. Using some of them, we tried to update some
of the characteristics of the McCulloch-Pitts paradigm of neural computation.

2.1 Up-to-date Synaptic Model

The classical model of synaptic weight alteration due to Hebb lacked many of
the properties that were mentioned in previous sections. Here we propose another
model that not only mimics the way biological reinforcement and depression are
produced but also exhibits the property of metaplasticity [Ropero and Simões,
1999].
In our model the synaptic weight between the presynaptic neuron A and the
postsynaptic neuron B is calculated as:

(2) $w_{AB} = P(B/A)$

where B is a postsynaptic activation above a specific threshold and A a presynaptic
action potential. As shown, the synaptic weight is calculated as a conditional
probability. The above expression can also be written as:

(3) $w_{AB} = P(B/A) = \frac{n(A \cap B)}{n(A)}$

in which the operator “n( )”, number of times, quantifies how many times a certain
event takes place, for example how many times event A, event B or the intersection
of A and B occurs. Starting with different values of the numerator and denominator,
i.e. different initial weights, and allowing the postsynaptic neuron to fire according
to a non-linear squashing function (logistic) a 3-D version of Figure 2 is obtained
in Figure 9.
In this figure a continuous line drawn on the surface shows the evolution of
the LTP threshold as a function of the initial weight. It can be noticed that a very
simple statistical expression is able to account for a wide variety of properties like

[Figure 9 plot labels: surface "Weight = P(B/A)"; axes: change in synaptic strength, initial weight, normalized postsynaptic activity (voltage); marked curve: LTP threshold]

Figure 9. The computer simulation above shows that metaplasticity takes place when the synaptic
weight is calculated using the conditional probability P(B/A), where B is a suprathreshold activation of
the postsynaptic neuron and A is the presynaptic action potential. A line joins the different LTP
thresholds, each one of them corresponding to a different initial synaptic weight
reinforcement, depression and metaplasticity. Therefore, taking into account that
conditional probabilities can be computed in synapses, the most obvious question
that arises here is: are real synapses the tiniest pieces of the most fascinating
statistical computer ever imagined?
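
The counting rule of equation (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, the method names and the initial counts are hypothetical, and the sketch does not reproduce the simulation behind Figure 9 (in particular, it takes the suprathreshold event B as a given input instead of computing it from a squashing function).

```python
class ProbabilisticSynapse:
    """Synaptic weight kept as the conditional probability P(B/A) = n(A and B) / n(A)."""

    def __init__(self, n_joint=1.0, n_pre=2.0):
        # Initial counts are arbitrary assumptions; they set the initial weight.
        self.n_joint = n_joint   # n(A and B): presynaptic spike with suprathreshold postsynaptic activation
        self.n_pre = n_pre       # n(A): presynaptic spikes

    @property
    def weight(self):
        return self.n_joint / self.n_pre

    def update(self, pre_spike, post_above_threshold):
        # Only presynaptic events change the counts, so P(B/A) stays a proper
        # conditional probability between 0 and 1.
        if pre_spike:
            self.n_pre += 1
            if post_above_threshold:
                self.n_joint += 1

synapse = ProbabilisticSynapse()          # initial weight 1/2
synapse.update(True, True)                # reinforcement: weight becomes 2/3
synapse.update(True, False)               # depression: weight becomes 2/4
print(synapse.weight)
```

In this sketch a presynaptic spike followed by a suprathreshold postsynaptic activation raises the weight, while a presynaptic spike without it lowers the weight. Because the counts keep growing, each new event changes the weight less than the previous one.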

2.2 Up-to-date Neuron Model

We propose a neuron model (see Figure 10) that uses the equation just presented for
modelling the synaptic weights [Ropero, 1996]. In the soma, the excitatory
postsynaptic potentials (EPSPs) are summed. An EPSP is obtained by multiplying
the probability $P(I_i)$ of an action potential in the presynaptic neuron by the
corresponding weight $P(O/I_i)$.
Although there are no probabilities in the presynaptic space, only action potentials at
different frequencies, the product $P(I_i)P(O/I_i)$ can approximate each of the EPSPs.
Each EPSP is formed by the sum of voltage humps, each one of them produced
by a presynaptic action potential, in a process known as temporal summation.
When these humps are close together, they ride over previous humps, creating a
taller EPSP. When they are far apart, as for example when the presynaptic firing
frequency is low, they can hardly ride over each other and the resulting EPSP is
low. Given that the maximal frequency of presynaptic action potentials is limited,
the height of the resulting EPSP is also limited. This maximal height corresponds
to a $P(I_i)P(O/I_i)$ of value 1.
All the EPSPs travel from the dendrites to the soma, where they are summed. This
sum is the so-called activation of the neuron, which is afterwards transformed into
a frequency of action potentials by means of a logistic or sigmoidal function.
To prevent the saturation of the weights, a normalization of the input pattern by
means of divisive inhibition is commonplace in the brain.
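
A minimal Python sketch of this neuron model follows. The weighted sum is the expression given in Figure 10, while the function names, the numerical values and the `gain` and `midpoint` parameters of the logistic function are assumptions chosen only for illustration.

```python
import math

def neuron_activation(p_inputs, weights):
    """Sum of EPSPs, each approximated by P(I_i) * P(O/I_i)."""
    return sum(p * w for p, w in zip(p_inputs, weights))

def firing_rate(activation, gain=10.0, midpoint=0.5):
    """Logistic squashing of the activation (gain and midpoint are assumed values)."""
    return 1.0 / (1.0 + math.exp(-gain * (activation - midpoint)))

p_inputs = [0.2, 0.5, 0.3]   # hypothetical presynaptic probabilities P(I_i)
weights = [0.9, 0.1, 0.4]    # hypothetical synaptic weights P(O/I_i)

activation = neuron_activation(p_inputs, weights)   # 0.18 + 0.05 + 0.12
rate = firing_rate(activation)                      # normalized output frequency in (0, 1)
print(activation, rate)
```

Since every product lies between 0 and 1, the activation is bounded by the number of synapses, and the logistic function maps it to a normalized firing frequency.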

[Figure 10 diagram labels: inputs P(I1), P(I2), P(I3); synaptic weights P(O/I1), P(O/I2), P(O/I3); output P(O/I) = P(O/I1)P(I1) + P(O/I2)P(I2) + P(O/I3)P(I3)]

Figure 10. Model of a neuron based on conditional probabilities for calculating the synaptic weights.
In each synapse the probability of presynaptic action potential is multiplied by the synaptic weight and
the result gives the postsynaptic activation in each synapse. The sum of postsynaptic activations gives
the activation of the neuron, which is calculated as
$P(O/I) = P(O/I_1)P(I_1) + P(O/I_2)P(I_2) + P(O/I_3)P(I_3)$
2.3 Up-to-date Network Model

If the same pattern is input to several neurons, instead of only one, a competitive
process can take place so that only one neuron, the one whose activation is maximal,
becomes the winner of the competition. When the winner fires, the remaining
neurons are kept silent. Silencing the non-winning neurons is usually done by an
inhibitory feedback or lateral inhibition. To prevent a single neuron from becoming
the winner for every pattern, the probabilistic synapses should be normalized over
time (see Figure 11). This is one of the possible roles of biological synaptic
normalization: giving every neuron the same opportunity to fire.
But what biological mechanisms are involved in the selection of this winning
neuron? In section 1.3.1 it was explained that the synchronized oscillation of
neurons is a mechanism found at least in the thalamus and the olfactory bulb. This
synchronized oscillation of neurons can allow the neuron with maximal
activation to be found: if a common oscillatory potential is added to the activations of
a layer of neurons, the neuron whose total activation first reaches a certain firing
threshold is also the one with the biggest activation [Ropero, 2003].
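
The selection mechanism can be sketched as follows in Python. As a simplifying assumption, only the rising phase of the common oscillation is modelled, as a linear ramp. The function name, the `threshold`, the `ramp_rate` and the activation values are hypothetical.

```python
import numpy as np

def first_to_fire(activations, threshold=1.0, ramp_rate=0.001, max_steps=2000):
    """Add a common rising potential to every neuron and return the index of
    the first neuron to reach the firing threshold (None if none does)."""
    activations = np.asarray(activations, dtype=float)
    for step in range(max_steps):
        total = activations + ramp_rate * step      # common potential, same for all
        fired = np.flatnonzero(total >= threshold)
        if fired.size > 0:
            # If several cross in the same step, keep the most activated one.
            return int(fired[np.argmax(total[fired])])
    return None

activations = [0.38, 0.50, 0.42]       # hypothetical activations of three neurons
winner = first_to_fire(activations)
print(winner, int(np.argmax(activations)))
```

Because the added potential is the same for every neuron, the order in which the neurons reach the threshold is the order of their activations, so the first neuron to fire is the one that a direct maximum search would select.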

[Figure 11 panels, all with inputs t1 = 0.2, t2 = 0.4, t3 = 0.5:
a. weights (0.6, 0.4, 0.2), y1 = Σj w1j·tj = 0.38
b. weights (0.2, 0.4, 0.6), y2 = Σj w2j·tj = 0.50
c. weights (0.4, 0.6, 0.2), y3 = Σj w3j·tj = 0.42]

Figure 11. Synaptic normalization allows a competitive process among neurons. The neuron whose
synaptic weight distribution $w_{ij}$ is most similar to the input pattern of frequencies $T = (t_1, t_2, t_3)$ is
also the one with maximal activation. This is the case of neuron b, whose weights [0.2, 0.4, 0.6] are
most similar to the vector T = (0.2, 0.4, 0.5). Therefore the sum of the products of the input frequencies
multiplied by its weights yields the maximal value. Notice that, due to the synaptic normalization, the
number of ionic channels is the same in the three neurons. In summary, synaptic normalization is the
property that ensures that the neuron whose weight distribution is most similar to the input pattern also
exhibits the maximal activation
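
The arithmetic of Figure 11 can be checked with a short Python sketch. The input frequencies, the weights and the three activations are those printed in the figure. The variable names are hypothetical.

```python
import numpy as np

T = np.array([0.2, 0.4, 0.5])          # input frequencies t1, t2, t3
W = np.array([[0.6, 0.4, 0.2],         # neuron a
              [0.2, 0.4, 0.6],         # neuron b
              [0.4, 0.6, 0.2]])        # neuron c

y = W @ T                              # y_i = sum_j w_ij * t_j
totals = W.sum(axis=1)                 # total synaptic weight of each neuron
winner = "abc"[int(np.argmax(y))]
print(y, totals, winner)
```

The three weight vectors are permutations of the same values and sum to 1.2 each, which is the normalization condition. The activations are 0.38, 0.50 and 0.42, so neuron b wins the competition.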
3. JOINING THE BLOCKS: A NEURAL NETWORK MODEL OF THE THALAMUS
Probabilistic synapses, synchronized oscillations, weight normalization and normalizing
inhibition are all properties that were used to implement a realistic
computational model of the thalamus. The thalamus is a structure at the
core of the brain that relays sensory information from the senses to the cortex.
The function of this structure was unknown. The model we propose helps
to understand the role of the thalamus in the computation performed by the brain
[Ropero, 1996], [Ropero, 1997], [Ropero, 2000]. The thalamus is basically a two-layered
brain structure. The first layer is formed by thalamo-cortical neurons that
receive sensory patterns and, after approximately 100 msec, send the result of the
inner thalamic computation to the cortex. The second layer, formed by reticular
neurons that oscillate synchronously, performs a competitive process by which each
one of the neurons fires in the presence of specific characteristics of the input patterns.
When several of these neurons fire, they produce several inhibitory masks that,
when superposed, create over the first layer a negative replica of the input pattern,
as shown in Figure 12. If the input pattern is damaged or noisy, the negative
replica recreates a version of the input without defects or noise. Pattern
reconstruction and noise rejection are two of the tasks that we postulate the thalamus
is able to perform. For these tasks, a process of learning must take place at the level
of the thalamus.
Our computer model of the thalamus, programmed in Matlab, has these two layers,
each one of 9 × 9 = 81 neurons. The two layers are completely interconnected with
each other, giving 2 × 81 × 81 = 13122 connections. It learned 36 characters during
several epochs and is able to recognize and complete damaged or noisy patterns
(see Figure 12). The learning capability of the model suggests that the real thalamus
may also have learning capabilities, a possibility that has been largely ignored until now in
thalamic research.

4. CONCLUSIONS
In this review we have presented several properties of synapses, neurons and
networks that were not considered in previous neural network models but that have
interesting computational potential.
The McCulloch-Pitts neuron model was based on the limited knowledge about
neurons that existed in the 1940s. Nowadays a more comprehensive knowledge
of the properties of neurons can be used to update the McCulloch-Pitts
model. In the case of synaptic plasticity, we presented several properties of synaptic
weights such as directionality, the existence of both potentiation and depression thresholds,
metaplasticity and normalization. Regarding neurons, relevant properties were introduced
to the reader, such as spike threshold adaptation and the dual frequency
behaviour of some types of neurons. Finally, concerning networks of neurons,
we studied the synchronization of a set of neurons and the normalizing inhibition
produced by a set of GABA-A neurons over the input pattern of another neuron.
Figure 12. A biologically realistic computer model of the thalamus constituted by two layers of
9 × 9 = 81 neurons each. An example of the pattern reconstruction capability of the thalamus model is
shown. (a) After the model has been trained with 36 different characters (letters and numbers), a very noisy and
damaged testing pattern that vaguely resembles a B is input. (b) An “I” shaped sustained feedback
inhibition over the first layer is produced by a reticular neuron in the second layer. After firing, the
reticular neuron remains refractory. This inhibition reduces the subsequent activation in the first
layer. (c) Another neuron fires and immediately enters the refractory period, producing another
sustained inhibition that is superposed over the previous one. Together, both inhibitions are shaped like an E. (d)
Finally, another reticular neuron fires and the total inhibition completely reconstructs the letter B, showing
the reconstruction capability of the thalamic model. The central figure of each screen gives the value
of the activations of a net of reticular neurons

With all these elements in mind, we proposed a new equation for synaptic reinforcement
based on conditional probabilities. The paradigm of a neuron was also
modified, taking into account that the neuron is always integrated in a network. For
example, if the neuron were detached from the inhibitory field that normalizes its
inputs, its active synaptic weights would increase without bound and the neuron would
be saturated most of the time.
It was also shown that the normalization of synaptic weights is an important
condition for allowing a competitive process between neurons. An example of such
competition, and of all the mentioned properties working together, is the model of
the thalamus that we programmed in Matlab. It learned 36 characters and exhibits
the property of completing damaged or noisy patterns.
We hope that the reader will benefit from this paper’s account of recently found
neural properties when creating new artificial neural networks or trying to emulate
the functioning of the brain.

REFERENCES
Abraham, W.C., and Bear, M.F. (1996) Metaplasticity: the plasticity of synaptic plasticity. Trends in
Neurosciences 19:126–130.
Abraham, W.C., and Tate, W.P. (1997) Metaplasticity: a new vista across the field of synaptic plasticity,
Progress in Neurobiology 52:303–323.
Artola, A., Brocher, S., and Singer, W. (1990) Different voltage-dependent thresholds for inducing
long-term depression and long-term potentiation in slices of rat visual cortex. Nature 347:69–72.
Bear, M.F., Connors, B.W., and Paradiso, M.A. (2001) Neuroscience. Exploring the Brain. Lippincott,
Williams & Wilkins. USA.
Bienenstock, E.L., Cooper, L.N., and Munro, P.W. (1982) Theory for the development of neuron selectivity:
orientation specificity and binocular interaction in visual cortex. The Journal of Neuroscience
2(1):32–48.
Carandini, M., and Heeger, D.J. (1994) Summation and division by neurons in primate visual cortex.
Science 264(5163):1333–6.
Carpenter, G., and Grossberg, S. (1988) The ART of adaptive pattern recognition by a self-organizing
neural network. Computer 21(3):77–88
Deschenes, M., Madariaga-Domich, A., and Steriade, M. (1985) Dendrodendritic synapses in the cat
reticularis thalami nucleus: a structural basis for thalamic spindle synchronization. Brain Research
334:165–168.
Desai, N.S., Rutherford, L.C., and Turrigiano, G.G. (1999) Plasticity in the intrinsic excitability of
cortical pyramidal neurons. Nature Neuroscience 2:515–520.
Hopfield, J.J. (1982) Neural Networks and physical systems with emergent collective computational
abilities. Proceedings of the National Academy of Sciences 79:2554–2558
Kohonen, T. (1982) Self-organized formation of topologically correct feature maps. Biological Cyber-
netics 43:59–69.
Llinás, R., and Jahnsen, H. (1982) Electrophysiology of mammalian thalamic neurones in vitro. Nature
297:406–408
Llinas, R., Ribary, U., Joliot, M., and Wang, X.J. (1994). Content and Context in Temporal Thalamocortical
Binding. In G. Buzsaki et al. (Eds.), Temporal Coding in the Brain (pp. 151–72). Berlin:
Springer-Verlag.
McClelland, J.L., Rumelhart, D.E., and The PDP Research Group. (1986). Parallel distributed processing:
Exploration in the microstructure of cognition. Cambridge, MA: MIT Press.
McClelland, J.L., and Rumelhart, D.E. (1988). Explorations in parallel distributed processing. Cambridge,
MA: MIT Press.
McCormick, D.A., and Pape, H.-C. (1990) Properties of a hyperpolarization activated cation current and
its role in rhythmic oscillation in thalamic relay neurons. Journal of Physiology (London) 431:291–318.
McCulloch, W. and Pitts, W. (1943) A logical Calculus of the Ideas Immanent in Nervous Activity.
Bulletin of Mathematical Biophysics 5:115–133.
Ropero Peláez, J. (1996) A Formal Representation of Thalamus and Cortex Computation. Proceedings
of the International Conference of Brain Processes, Theories and Models. Edited by Roberto Moreno-
Díaz and José Mira-Mira. MIT Press.
Ropero Peláez, J. (1997) Plato’s theory of ideas revisited. Neural Networks, 1997 Special issue 10(7):
1269–1288.
Ropero Pelaez, J., and Godoy Simoes, M. (1999) A computational model of synaptic metaplasticity.
Proceedings of the International Joint Conference of Neural Networks 1999. Washington DC.
Ropero Peláez, J. (2000) Towards a neural network based therapy for hallucinatory disorders. Neural
Networks, 2000 Special Issue 13(2000):1047–1061.
Ropero Peláez, J. (2003) PhD Thesis in Neuroscience: Aprendizaje en un modelo computacional del
tálamo. Faculty of Medicine. Autónoma University of Madrid.
Rosenblatt, F. (1958) The perceptron: A probabilistic model for information storage and organization in
the brain. Psychological Review 65:386–408.
Steriade, M., Domich, L., Oakson, G., and Deschenes, M. (1987) The deafferented reticular thalamic
nucleus generates spindle rhythmicity. The Journal of Neurophysiology 57:260–273.
Steriade, M., and Llinas, R.R. (1988) The Functional State of the Thalamus and the Associated Neuronal
Interplay. Physiological Reviews 68(3):649–739.
Tompa, P., and Friedrich, P. (1998). Synaptic metaplasticity and the local charge effect in postsynaptic
densities. Trends in Neurosciences 21(3):97–101.
Turrigiano, G.G., Leslie, K.R., Desai, N.S., Rutherford, L.C., and Nelson, S.B. (1998) Activity-dependent
scaling of quantal amplitude in neocortical neurons. Nature 391:892–896.
CHAPTER 7
SUPPORT VECTOR MACHINES

JAIME GÓMEZ SÁENZ DE TEJADA¹, JUAN SEIJAS MARTÍNEZ-ECHEVARRÍA²
¹ Escuela Politécnica Superior, Universidad Autónoma de Madrid
² Escuela Técnica Superior de Ingenieros de Telecomunicaciones, Universidad Politécnica de Madrid

Abstract: Support Vector Machines are among the most recent algorithms in the Machine Learning
community. After a bit less than a decade of life, they have displayed many advantages with respect
to the best older methods: generalization capacity, ease of use, solution uniqueness. They have
also shown some disadvantages: limits on the amount of data that can be handled and on speed in the training phase.
However, these disadvantages are expected to be overcome in the near future, as computer power
increases, leaving an all-purpose learning method that is both cheap to use and gives the best
performance. This chapter provides an overview of the main SVM configuration, its
mathematical applications and the easiest implementation

Keywords: Support Vector Machines, Machine Learning

INTRODUCTION

Machine Learning has become one of the main fields in artificial intelligence
today. Whether in pattern recognition or in function estimation, statistical
Machine Learning tries to find a numerical hypothesis which adapts correctly to
the given data, that is, to build machines able to generalize the statistical distribution of a
representative data set. Once the hypothesis has been generated, future unknown
patterns following the same distribution are expected to be correctly classified.
From the principles of statistical mechanics, a handful of algorithms have been
devised to solve the classification problem, such as decision trees, k-nearest neighbour,
neural networks, Bayesian classifiers, radial basis function classifiers, and,
as a newcomer, support vector machines (from now on SVM).
The basic SVM is a supervised classification algorithm introduced by Vladimir
Vapnik, motivated by VC (Vapnik Chervonenkis) theory [Vapnik, 1995], from
which the Structural Risk Minimization concept was derived. In the late 70’s,
147
D. Andina and D.T. Pham (eds.), Computational Intelligence, 147–191.
© 2007 Springer.
148 CHAPTER 7

Vapnik studied the numeric solution of convex quadratic problems applied to
Machine Learning, and defined an immediate ancestor of SVM called ‘Generalization
portrait’. In the early 90’s, Vapnik joined Bell Laboratories, where his ideas
evolved until the creation of the term ‘support vector machines’ in 1995.
Nevertheless, the basic mathematics behind SVM were developed much earlier.
The concept of generating a hyperplane in a space other than the input space in order to separate data in input
space, the heart of SVM, was settled in 1964. The study of constrained optimization
problems gave the Karush-Kuhn-Tucker optimality conditions (Karush in 1939, Kuhn and Tucker in 1951), while the
definition of valid kernel functions for the transformation described above was
formulated by Mercer in 1909.
This chapter provides an introductory view of SVM, so that any computer scientist
or engineer can develop their own SVM implementation and apply it to any
real-world machine-learning problem. For that purpose, we will sacrifice some
mathematical completeness for the sake of clarity. The chapter has four sections: first, the
SVM will be defined and analysed; second, the main mathematical
uses of the SVM principle will be developed; third, SVM and neural networks will
be compared; last, the best current implementation approach will be shown.
Support Vector Machines are easy to understand, not too difficult to implement,
and child’s play to use. If you need a generic Machine Learning method, forget
about neural networks or any other method you previously learnt: the SVM family
globally outperforms them all.

1. SVM DEFINITION
1.1 Structural Risk

Classifiers having a large number of adjustable parameters (and so, great capacity)
will most probably overfit, learning the training data set without
errors but with poor generalization ability. On the contrary, a classifier with insufficient
capacity will not be able to generate a hypothesis complex enough to model
the data. A mid-point must be found where the adjustable parameters are neither too
many nor too few, for both the training and test sets. For that reason, it is essential
to choose the kind of functions a learning machine can implement.
For a given problem, the machine must have a low classification error and
also a small capacity. Capacity is defined as the ability of a given machine to learn
any training set without errors. For example, the 1-nearest neighbour classifier has infinite
capacity, but is a poor classifier for unseen test data with complex distributions and
noisy sets. A machine with great capacity will tend to overfit the
data, making it no longer useful because it does not learn. For extended information
about these issues, see [Burges, 1998].
There are a handful of mathematical bounds that define the relation
between the learning ability of a machine and its performance. The underlying theory tries
to find under which circumstances and how fast the performance measure converges
as the number of input data for training increases. In the limit, with an infinite
number of points, we could have a correct performance value, better than just an
SUPPORT VECTOR MACHINES 149

estimation. With respect to the SVM, we will use one bound in particular,
which will take us to the Structural Risk Minimization (SRM) principle [Vapnik,
1995].
Suppose we have $l$ observations, the input data in the training phase. Each observation
consists of a pair of values $(x_i, y_i)$, where $x_i$ is a vector in $\mathbb{R}^n$, $i = 1, \dots, l$, and $y_i \in \{1, -1\}$ is the
fixed associated label, given by a consistent data source. We assume
there is an unknown probability distribution $P(x, y)$ from which the data points have
been drawn. Data are always assumed to be independently drawn and identically
distributed.
Suppose we have a machine whose task is to learn the mapping $x_i \to y_i$. This
generic machine is really defined by a set of possible mappings $x \to f(x, \alpha)$, where
the functions $f(x, \alpha)$ are generic, defined by the set of adjustable parameters $\alpha$. This
machine is by definition deterministic, that is, for a given input vector $x_i$ and a
parameter set $\alpha$, we will always obtain the same output $f(x_i, \alpha)$. Choosing the
parameter set $\alpha$ gives a trained machine. For example, a neural network with a
fixed architecture and fixed weights (parameter set $\alpha$) would be a trained machine
as defined in these paragraphs.
Thus, the expected error in the test phase for a trained machine is:

(1) $R(\alpha) = \int \frac{1}{2} |y - f(x, \alpha)| \, dP(x, y)$

The value $R(\alpha)$ is called expected risk. Nevertheless, this expression is difficult
to use because the probability distribution P(x,y) is unknown. Thus, a variation of
the formula is developed to use the finite number of available observations. It is
called empirical risk, and is defined as the measured mean error rate on the training
set:

(2) $R_{emp}(\alpha) = \frac{1}{2l} \sum_{i=1}^{l} |y_i - f(x_i, \alpha)|$

$R_{emp}(\alpha)$ is a fixed number for a given parameter set $\alpha$ and training set. It has been shown
in [Vapnik, 1995] that the following bound holds with high probability:

(3) $R(\alpha) \le R_{emp}(\alpha) + g(h)$

where g(h) is a real number which is directly related to the VC dimension.
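
As a small worked illustration of the empirical risk of equation (2), the following Python sketch computes it for labels and outputs in {1, −1}. The function name and the data are hypothetical.

```python
import numpy as np

def empirical_risk(labels, outputs):
    """Empirical risk of equation (2): (1 / 2l) * sum |y_i - f(x_i)|.

    With labels and outputs in {+1, -1}, each term |y_i - f(x_i)| / 2 is 0 for
    a correct answer and 1 for an error, so the result is the training error rate.
    """
    labels = np.asarray(labels, dtype=float)
    outputs = np.asarray(outputs, dtype=float)
    return np.abs(labels - outputs).sum() / (2 * len(labels))

labels = [1, -1, 1, 1]       # hypothetical training labels y_i
outputs = [1, 1, 1, -1]      # hypothetical outputs of a trained machine
risk = empirical_risk(labels, outputs)
print(risk)
```

Two of the four hypothetical points are misclassified, so the empirical risk is 0.5.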


Again, a learning machine is defined as a set of parameterised functions (called
a family of functions) having a similar structure. The Vapnik-Chervonenkis
(VC) dimension is a non-negative integer that measures the capacity
previously defined. The VC dimension of a given learning machine is defined
as the maximum number of points that can be correctly classified, whatever their labels, using functions
belonging to the family. In other words: if the VC dimension is h, then there
exists a set of h points that can be classified with functions of the family regardless of
the point labels. Note that, first, there cannot exist a set of h + 1 points satisfying
this condition; second, only one set of h points is needed for the definition to be
applicable (it does not say “for all sets of h points”).
Figure 1. Three wisely chosen points

Let’s try an example. Suppose we are in $\mathbb{R}^2$ and the learning machine
L1 is defined as the set of “one straight line” classifiers. In Figure 1 we choose
three points. We see (you can try) that, for every combination of labels (8 possible
combinations using 3 points with two labels), the points can be separated using one
straight line. Each combination may require a different straight line, but each such line is
still a member of the family. Therefore the VC dimension of the analysed learning machine
is at least 3. If we try 4 points (any set of 4 points), we will not be able to
separate every combination of labels, so we can state that “one straight line” classifiers in $\mathbb{R}^2$
have VC dimension equal to 3.
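
The example can be checked by brute force with a short Python sketch. It is only an illustration: the function names and the point coordinates are hypothetical, and linear separability is tested with a perceptron that is given a fixed budget of epochs. The perceptron stops with no mistakes only when a separating line exists, so a positive answer is always reliable, whereas a negative answer only means that no line was found within the budget. For the labelling of four points used here (opposite corners of a square sharing a label) no separating line exists at all.

```python
import itertools
import numpy as np

def linearly_separable(points, labels, max_epochs=1000):
    """Perceptron test: True if a separating line is found within max_epochs."""
    X = np.hstack([np.asarray(points, dtype=float), np.ones((len(points), 1))])
    y = np.asarray(labels, dtype=float)
    w = np.zeros(X.shape[1])               # weights plus bias term
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:         # misclassified or on the line
                w += yi * xi
                mistakes += 1
        if mistakes == 0:
            return True
    return False

def shattered_by_lines(points):
    """True if every labelling of the points is separated by some straight line."""
    labellings = itertools.product([1, -1], repeat=len(points))
    return all(linearly_separable(points, labels) for labels in labellings)

three_points = [(0, 0), (1, 0), (0, 1)]
four_points = [(0, 0), (1, 1), (1, 0), (0, 1)]
print(shattered_by_lines(three_points))    # all 8 labellings are separable
print(shattered_by_lines(four_points))     # the "opposite corners" labelling fails
```

The first call examines the 8 labellings of three non-collinear points, and the second fails on the labelling in which opposite corners of the square share a label.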
Another example. Suppose we are in $\mathbb{R}^2$ and the learning machine L2 is
defined as the set of “two-segment line” classifiers (continuous but non-differentiable
at the junction point). In Figure 2 we choose five points. Again, try all possible label
combinations (now 32). Using a two-segment line you can separate all 32 cases, but
it would not be possible to separate every labelling of 6 points, however well they are chosen. Therefore
the VC dimension of this learning machine is 5.

Figure 2. Five wisely chosen points


Figure 3. Training set and two valid classifiers, “straight-line”(dashed line) and “two-segment-line”
(solid line)

When facing a problem that can be classified using different learning machines,
as can be seen in Figure 3, which one is better?
The SRM principle will try to find the learning machine with the lowest VC
dimension that correctly classifies all data points. The consequences are analysed
in section 2.12.
As far as the SVM definition is concerned, the SRM principle and the VC dimension concept
require that the chosen classifier be the one with the largest margin (linear SVM
use the family of linear hyperplanes in input space), which is defined in the next section.

1.2 Linear SVM for Separable Data

The simplest case for an SVM is that of linear machines trained with a separable
data set (see Figure 4a).

Figure 4a. Linearly separable training set


Suppose we have a training data set made of pairs $(x_i, y_i)$, $i = 1, \dots, l$, such
that $x_i \in \mathbb{R}^d$, $y_i \in \{1, -1\}$. Suppose there exists a hyperplane in $\mathbb{R}^d$ which separates
positive from negative examples (according to their $y_i$ value). Points that lie exactly on the
hyperplane satisfy the condition:

(4) $w \cdot x + b = 0$

where $w$ is a vector perpendicular to the hyperplane (regardless of its norm), $|b|/\|w\|$
(absolute value of the term $b$ divided by the modulus of vector $w$) is the distance from the
origin to the hyperplane, and the operator $\cdot$ is the dot product in the
Euclidean space to which the data belong (we will use the scalar product between
two d-dimensional vectors).
Let $d_+$ ($d_-$) be the shortest distance between the hyperplane and a positive (negative)
example; the margin of the hyperplane is defined as $d_+ + d_-$. By
maximizing the classifier margin, we decrease the risk limit defined in (3).
This is the basis for the following SVM mathematical development.
For the linear and separable case, the SVM algorithm calculates the separator
hyperplane that maximizes the classifier margin. Thus, all training data must satisfy
the following constraints:

(5) $w \cdot x_i + b \ge +1$ for $y_i = +1$

(6) $w \cdot x_i + b \le -1$ for $y_i = -1$

which can be formulated in one expression:

(7) $y_i (w \cdot x_i + b) - 1 \ge 0 \quad \forall i$

All points for which the equality in (5) holds lie on the hyperplane $H_1$:
$w \cdot x_i + b = 1$, parallel to the separator hyperplane and at distance $|1 - b|/\|w\|$ from
the origin. In much the same way, those points for which the equality in
(6) holds lie on the hyperplane $H_2$: $w \cdot x_i + b = -1$, parallel to $H_1$ and to the separator
hyperplane and at distance $|-1 - b|/\|w\|$ from the origin.
Thus, $d_+ = d_- = 1/\|w\|$, and so the margin is $2/\|w\|$. We must find the pair
of hyperplanes $(H_1, H_2)$ that maximizes the margin, which is done by minimizing $\|w\|^2$ subject to the
constraints defined in inequality (7).
Note that, in the training phase, no data point will lie between $H_1$ and $H_2$ or on
the wrong side of its class plane (that is the reason for calling it the separable case).
Those points that satisfy the equality in (7) (those placed on $H_1$ or $H_2$)
and that, if eliminated from the training set, would give a different solution (by
definition they would change $d_+$ or $d_-$) are called support vectors.
The name comes from the fact that the learning machine is completely defined
with these points and their weight on the hyperplane. All other training points,
which are at a greater distance from the hyperplane than the support vectors, serve
no purpose: if we had begun the training without them, the solution would have
remained the same (see Figure 4b).
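The constraints and the margin formula above can be checked numerically. The following is a minimal sketch, assuming a small hypothetical toy set and a hand-picked candidate hyperplane; the helper names `functional_margins` and `margin_width` are ours and do not come from any SVM library.

```python
import math

def dot(u, v):
    """Euclidean dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def functional_margins(w, b, points, labels):
    """Return y_i * (w . x_i + b) for each point; constraint (7) asks for >= 1."""
    return [y * (dot(w, x) + b) for x, y in zip(points, labels)]

def margin_width(w):
    """Distance between H1 and H2, i.e. 2 / ||w||."""
    return 2.0 / math.sqrt(dot(w, w))

# Hypothetical toy data and candidate hyperplane (illustration only).
points = [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0)]
labels = [+1, -1, -1]
w, b = (-2.0, 0.0), 3.0

margins = functional_margins(w, b, points, labels)
is_feasible = all(m >= 1.0 for m in margins)                   # constraint (7)
on_h1_or_h2 = [i for i, m in enumerate(margins) if m == 1.0]   # equality in (7)
print(margins, is_feasible, on_h1_or_h2, margin_width(w))
```

Points listed in `on_h1_or_h2` are the only candidates for support vectors; the remaining point satisfies (7) with room to spare and could be dropped without moving this hyperplane.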

Figure 4b. Linear SVM classifier. Support vectors are encircled, the margin is shown with two dashed
lines and the separator hyperplane is shown with a solid line

The problem can be reformulated using Lagrange multipliers. This will help us to
add constraints to the problem more easily, and will let the training data appear
only in the form of dot products between vectors, which in turn will let us generalize the
SVM algorithm to the non-linear case. The general rule for creating the Lagrangian
formulation is: for constraints of the type $c \geq 0$, the constraint equation is multiplied
by a Lagrange multiplier and subtracted from the objective function. Thus, we
introduce non-negative Lagrange multipliers $\alpha_i$, $i = 1, \dots, l$, one for each constraint in
inequality (7), that is, one for each training point. The Lagrangian we obtain is:

(8) $L_P = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{l} \alpha_i y_i (w \cdot x_i + b) + \sum_{i=1}^{l} \alpha_i$

We want to minimize $L_P$ with respect to $w$ and $b$ (the variables that define
the plane), and require that the partial derivatives of $L_P$ with respect to the $\alpha_i$ be 0.
This is a convex quadratic optimisation problem, because the objective
function is convex and the constraints also define a convex set [Burges, 1998].
This means we can solve the problem using the dual formulation [Fletcher, 1987].
This Wolfe dual formulation has the following property: the maximization of $L_D$ (in
contrast with the primal formulation $L_P$) under the defined constraints occurs at the same
values of $w$ and $b$ as the minimization of $L_P$ described in the previous paragraph.
All partial derivatives must be zero at the optimum. Calculating the partial derivatives
of $L_P$ with respect to $b$ and $w$, we obtain the following conditions:


(9) $w = \sum_{i=1}^{l} \alpha_i y_i x_i$

(10) $\sum_{i=1}^{l} \alpha_i y_i = 0$

which, substituted into equation (8), give:

(11) $L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i y_i \alpha_j y_j (x_i \cdot x_j)$

Therefore, the problem is now written as “Maximize $L_D$ with respect to all $\alpha_i$,
subject to $\alpha_i \geq 0$ and satisfying conditions (7) and (10)”.
There is a Lagrange multiplier for each training point, but only those having
$\alpha_i > 0$ are of any importance in calculating the separator hyperplane with
equation (9). These are the support vectors, which were defined in previous paragraphs.
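As an illustration of equations (9) and (11), the sketch below evaluates the weight vector and the dual objective for a given set of multipliers. The two-point data set and the multiplier values are hypothetical, chosen only so that condition (10) holds; the function names are ours.

```python
def dot(u, v):
    """Euclidean dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def weight_vector(alphas, points, labels):
    """Equation (9): w = sum_i alpha_i * y_i * x_i."""
    dim = len(points[0])
    return tuple(
        sum(a * y * x[k] for a, y, x in zip(alphas, labels, points))
        for k in range(dim)
    )

def dual_objective(alphas, points, labels):
    """Equation (11): sum_i alpha_i - 1/2 sum_ij alpha_i y_i alpha_j y_j (x_i . x_j)."""
    n = len(points)
    quadratic = sum(
        alphas[i] * labels[i] * alphas[j] * labels[j] * dot(points[i], points[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(alphas) - 0.5 * quadratic

# Hypothetical data: one negative point at the origin, one positive at (2, 0).
points = [(0.0, 0.0), (2.0, 0.0)]
labels = [-1, +1]
alphas = (0.5, 0.5)          # satisfies condition (10): -0.5 + 0.5 = 0

w = weight_vector(alphas, points, labels)
l_d = dual_objective(alphas, points, labels)
print(w, l_d)
```

Because $w$ is defined by (9), the double sum in (11) equals $\|w\|^2$, so the same value can be obtained as `sum(alphas) - 0.5 * dot(w, w)`; the double loop is kept here only to mirror the formula term by term.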
The geometric interpretation of (11) is easier if the second term is rewritten using
(9). Suppose we are in an intermediate optimisation state, and we want to calculate
the second term at step $i = 0$. The term is then:

$\sum_{j=1}^{l} \alpha_0 \alpha_j y_0 y_j (x_0 \cdot x_j) = \alpha_0 y_0 \sum_{j=1}^{l} \alpha_j y_j (x_0 \cdot x_j) = \alpha_0 y_0 \, x_0 \cdot \sum_{j=1}^{l} \alpha_j y_j x_j = \alpha_0 y_0 (x_0 \cdot w)$

The scalar product of a point and a vector normal to the hyperplane gives the
projection of the point onto that vector, that is, the relative distance between point and
hyperplane. The concept underlying the formula is:

term = $\alpha_0$ × (correctness of classification) × (distance between the point and the
currently defined separator plane)

At each step $i$, the relation between $x_i$ and the current-state $w$ is calculated. Therefore,
we can deduce some hand-made optimisation rules:
A) If classification is correct, the term is negative, so $\alpha_i$ should decrease, and thus
reduce its weight (its importance) in the calculation of the current $w$, in case the
optimum has not been reached.
B) If the distance is big with respect to other points of the same class, and the point is correctly
classified, $\alpha_i$ should decrease, while the multiplier $\alpha_k$ of another same-class point closer to the
margin should increase.
Note that when evaluating the correctness of a point during the training phase,
the point itself is used. If a point is misclassified, the algorithm will increase its
multiplier as much as needed, forcing the hyperplane definition until this point
condition is satisfied. For the linear separable case this strategy is valid, because
sooner or later the point must be correctly classified. But for non-linear or non-
separable cases, this strategy may give poor results. If we have some noise in the
training data, the algorithm will try to force the hyperplane definition to classify
points that are wrong. This will generate overfitting over the data so the performance
will be poorer.
Therefore, the SVM training algorithm consists of the following basic steps:

1. Identify all training data points, and their labels.
2. Optimize (maximize) the dual Lagrangian, maintaining the constraints defined in (7)
and (10). For that purpose, there are many convex quadratic problem optimisation
methods described in the mathematical literature [Fletcher, 1987]. The result of the
optimisation phase is the set of all Lagrange multiplier values $\alpha_i$. Basic optimization
methods are severely limited by the resources (time and memory) they need on
big problems (more than 10,000 patterns). Thus, at the beginning of SVM history,
efficient optimization algorithms were the basic research line. In section 5 the
best SVM algorithm will be shown: SMO.
3. Throw away all those points which are not support vectors after the training
process (i.e. those having $\alpha_i = 0$), and calculate the value of $w$ and $b$ from
the support vectors and formulas (9) and (7). Then, we will have a completely
defined optimum separator hyperplane.
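Step 3 can be sketched as follows. This is only an illustration: the multipliers are assumed to have been produced by some optimiser in step 2 (none is implemented here), the data are hypothetical, and averaging the per-support-vector values of $b$ is our own choice for numerical robustness.

```python
def dot(u, v):
    """Euclidean dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def hyperplane_from_multipliers(alphas, points, labels, tol=1e-9):
    """Keep the support vectors (alpha > tol), then apply (9) and equality in (7)."""
    support = [i for i, a in enumerate(alphas) if a > tol]
    dim = len(points[0])
    # Equation (9), summed over support vectors only (the other terms are zero).
    w = tuple(
        sum(alphas[i] * labels[i] * points[i][k] for i in support)
        for k in range(dim)
    )
    # Equality in (7): y_i (w . x_i + b) = 1, hence b = y_i - w . x_i for y_i = +1 or -1.
    b_values = [labels[i] - dot(w, points[i]) for i in support]
    b = sum(b_values) / len(b_values)
    return w, b, support

# Hypothetical training set and multipliers (assumed output of step 2).
points = [(0.0, 0.0), (2.0, 0.0), (5.0, 1.0)]
labels = [-1, +1, +1]
alphas = [0.5, 0.5, 0.0]

w, b, support = hyperplane_from_multipliers(alphas, points, labels)
print(w, b, support)
```

The third point has a zero multiplier, so it is discarded and plays no role in the resulting hyperplane.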

1.3 Karush-Kuhn-Tucker Conditions

Karush-Kuhn-Tucker (KKT) conditions are necessary and sufficient
conditions for a point to be a solution of the problem defined in step 2 of the previous algorithm.
This solution identifies the optimum value of the objective function $L_P$ with respect to
all available parameters (all $\alpha_i$).
Many SVM algorithms use these KKT conditions to identify if the machine’s
current state is the optimum, and if not so, which are the points that violate these
optimality conditions the most. For the basic SVM definition, given in this chapter,
optimality conditions are:

(7.9) $w = \sum_{i=1}^{l} \alpha_i y_i x_i$

(7.10) $\sum_{i=1}^{l} \alpha_i y_i = 0$

(7.7) $y_i (w \cdot x_i + b) - 1 \geq 0$

(7.7 bis) $\alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] = 0$

$\alpha_i \geq 0$
Most of them have been introduced in previous sections of this chapter, but they
are repeated here for better comprehension of the optimisation process.
The new equation (7.7 bis) is easy to interpret. It concerns the points that
must hold equality in inequality (7). It can be stated in the following words:
“Any training point either holds equality in inequality (7), or its Lagrange multiplier
is null, i.e. $\alpha_i = 0$”.
If a point holds equality in (7) and $\alpha_i > 0$, then the point is on the margin hyperplane and
is a support vector. It can also happen that both conditions hold, that is, equality
in (7) holds and $\alpha_i = 0$. In that case, the point is on its class margin hyperplane but
it is not needed for the hyperplane definition, and therefore it is not a support vector.
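The optimality conditions above lend themselves to a direct numerical check. The sketch below is one possible way to test them for a candidate state; the function name `kkt_report`, the tolerance and the data are assumptions made for illustration.

```python
def dot(u, v):
    """Euclidean dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def kkt_report(alphas, points, labels, b, tol=1e-9):
    """Evaluate each KKT condition for the state (alphas, b); w follows from (7.9)."""
    dim = len(points[0])
    w = tuple(
        sum(a * y * x[k] for a, y, x in zip(alphas, labels, points))
        for k in range(dim)
    )
    # Left-hand side of (7.7) for every point.
    slack = [y * (dot(w, x) + b) - 1 for x, y in zip(points, labels)]
    return {
        "balance (7.10)": abs(sum(a * y for a, y in zip(alphas, labels))) <= tol,
        "non-negative multipliers": all(a >= -tol for a in alphas),
        "feasible (7.7)": all(s >= -tol for s in slack),
        "complementary (7.7 bis)": all(abs(a * s) <= tol for a, s in zip(alphas, slack)),
    }

# Hypothetical data; the first state satisfies every condition, the second does not.
points = [(0.0, 0.0), (2.0, 0.0), (5.0, 1.0)]
labels = [-1, +1, +1]

optimal = kkt_report([0.5, 0.5, 0.0], points, labels, b=-1.0)
not_optimal = kkt_report([1.0, 1.0, 0.0], points, labels, b=-1.0)
print(optimal)
print(not_optimal)
```

In the second state the multipliers still balance and every point is on the correct side, but the second point has a positive multiplier without lying on its margin hyperplane, so (7.7 bis) fails.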

1.4 Optimisation Example

To show the optimisation process more clearly, we will introduce an example.
Suppose we have 3 points $(1, 1)$, $(2, 1)$, $(3, 1) \in \mathbb{R}^2$ with labels $+1$, $-1$, $-1$
respectively (see figure 5).
Suppose the initialisation routine defines the Lagrange multipliers as $\alpha_1 = 2$, $\alpha_2 = 1$,
$\alpha_3 = 1$ (satisfying condition (10)). We use formulas (9) and (11) to calculate the
following:

$w_1 = 2(1, 1) - 1(2, 1) - 1(3, 1) = (-3, 0)$

$L_{D1} = 4 - \frac{1}{2}(-6 + 6 + 9) = -0.5$

Then, we check if this is a valid solution for our SVM. For that purpose, we use the
KKT conditions, especially condition (7). Note that all three points would be support
vectors, so they must yield the same value of $b$ when substituted into condition (7) with equality.
At this optimisation stage this is not true for $w_1$, because we obtain $b = 4$, $b = 5$
and $b = 8$. Thus we can say, without doubt, that this is not a solution.
Now we must find another set of Lagrange multipliers that brings an increase
of $L_D$. Point 3 is farthest from the current pseudo-hyperplane (being correctly classified),
so it is a good candidate for decreasing its weight in the definition of $w$ (see
section 2.2). Suppose that the new Lagrange multiplier values are $\alpha_1 = 1$, $\alpha_2 = 1$, $\alpha_3 = 0$
(condition (10) must always hold).

$w_2 = 1(1, 1) - 1(2, 1) - 0(3, 1) = (-1, 0)$

$L_{D2} = 2 - \frac{1}{2}(-1 + 2) = 1.5$

We made a good choice because $L_D$ has increased. Nevertheless, we still do not
satisfy the KKT conditions. When we substitute into equation (7), we obtain $b = 2$ and $b = 1$
for the two points respectively (we have two support vectors only).

Figure 5. A linear separable set with margin and separator hyperplane



Now that we have two support vectors of different classes, their Lagrange
multipliers must change by the same amount for condition (10) to hold. We increase them,
for instance, to $\alpha_1 = 2$, $\alpha_2 = 2$, $\alpha_3 = 0$.

$w_3 = 2(1, 1) - 2(2, 1) - 0(3, 1) = (-2, 0)$

$L_{D3} = 4 - \frac{1}{2}(-4 + 8) = 2$

Again, $L_D$ has increased, so we have chosen wisely. Moreover, at this optimisation
step, the KKT conditions hold, with the same value of $b$ for all support vectors, $b = 3$.
We can assert without any doubt that the optimum has been reached. For instance,
if we continue to increase the multipliers to $\alpha_1 = 3$, $\alpha_2 = 3$, $\alpha_3 = 0$, the result would
not be valid. We would obtain:

$w_4 = 3(1, 1) - 3(2, 1) - 0(3, 1) = (-3, 0)$

$L_{D4} = 6 - \frac{1}{2}(-9 + 18) = 1.5$

The convexity required by the definition of the objective function holds:
$L_{D1} < L_{D2} < L_{D3} > L_{D4}$.
Moreover, as the example is so small, some degree of uniform quadratic convexity
can be seen, as $L_{D2} = L_{D4}$, both below the optimum.
During the optimisation process, while the KKT conditions do not hold, the unique
separator hyperplane does not exist. At each new step (new set of values of $\alpha_i$),
there is one hyperplane direction only, but as many separator hyperplanes as support
vectors in the training set (different values of b). These hyperplanes do not need to
have a geometric meaning; they do not try to separate the data, even though they
could. As we get closer to the optimum (increasing LD ), all support-vector-defined
hyperplanes will come closer to each other (less difference in the b value). The
limit is reached when LD gets to the optimum value, and all hyperplanes match up
with only one value of b: the separator hyperplane.
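The convergence of the per-support-vector values of $b$ can be reproduced with the four multiplier states of the example above. This is a minimal sketch; the helper `inspect_state` is ours, and $L_D$ is computed as $\sum_i \alpha_i - \frac{1}{2}\|w\|^2$, which is equation (11) rewritten with (9).

```python
def dot(u, v):
    """Euclidean dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Data and multiplier states taken from the worked example in this section.
points = [(1, 1), (2, 1), (3, 1)]
labels = [+1, -1, -1]
states = [(2, 1, 1), (1, 1, 0), (2, 2, 0), (3, 3, 0)]

def inspect_state(alphas):
    """Return w from (9), L_D from (11), and one b per support vector from (7)."""
    w = tuple(
        sum(a * y * x[k] for a, y, x in zip(alphas, labels, points))
        for k in range(2)
    )
    l_d = sum(alphas) - 0.5 * dot(w, w)
    # Equality in (7) gives b = y_i - w . x_i for each point with alpha_i > 0.
    b_values = [y - dot(w, x) for a, y, x in zip(alphas, labels, points) if a > 0]
    return w, l_d, b_values

results = [inspect_state(alphas) for alphas in states]
for alphas, (w, l_d, b_values) in zip(states, results):
    spread = max(b_values) - min(b_values)
    print(alphas, w, l_d, b_values, spread)
```

The spread between the candidate values of $b$ goes from 4 to 1 to 0 as $L_D$ rises to its maximum, and opens again to 1 once the multipliers overshoot the optimum.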
This concept differs greatly from the search process followed by other similar
methods, such as the perceptron. The latter always defines a separator hyperplane
that evolves at each training step trying to classify correctly all training data. For
that reason, it can reach a state in which all data points are correctly classified,
but whose margin is not the optimum. That is called a local minimum, where the
perceptron will be trapped and will not be able to continue. The SVM algorithm
performs a quadratic optimisation in which no intermediate state can be considered
as a valid solution. There will be one solution only, it will be global, and it will be
the best one available.
Even though the soft-margin SVM will be defined in the next sections,
this is a good place to see what happens when the optimisation algorithm is
applied to a non-separable data set. Suppose we have again the 3 points used
before, $(1, 1)$, $(2, 1)$, $(3, 1) \in \mathbb{R}^2$, but now with labels $+1$, $-1$, $+1$ (see
figure 6). We have changed the label of the third point, so the training set becomes
non-separable with a linear machine. Nevertheless, this information is not given to the
SVM algorithm.

Figure 6. A linear non-separable set

Suppose we initialise the values as $\alpha_1 = 1$, $\alpha_2 = 2$, $\alpha_3 = 1$ (condition (10) holds).

$w_1 = 1(1, 1) - 2(2, 1) + 1(3, 1) = (0, 0)$

$L_{D1} = 4 - \frac{1}{2}(0 - 0 + 0) = 4$

Of course, this cannot be a solution. We do not need to check the KKT conditions,
because $w = (0, 0)$ does not define a hyperplane. At this stage we cannot guess
which multipliers are best changed, so we do it randomly. Suppose we define a new
state $\alpha_1 = 1.5$, $\alpha_2 = 2$, $\alpha_3 = 0.5$ (there are not many more alternatives).

$w_2 = 1.5(1, 1) - 2(2, 1) + 0.5(3, 1) = (-1, 0)$

$L_{D2} = 4 - \frac{1}{2}(-1.5 + 4 - 1.5) = 3.5$

We obtain $L_{D2} < L_{D1}$, so we can be sure this is not a solution and, what is more,
that this way will take us nowhere. We choose another possible set, $\alpha_1 = 2$, $\alpha_2 = 4$,
$\alpha_3 = 2$.

$w_3 = 2(1, 1) - 4(2, 1) + 2(3, 1) = (0, 0)$

$L_{D3} = 8 - \frac{1}{2}(0 - 0 + 0) = 8$

As in the first case, this cannot be a solution. But $L_D$ has increased quite a lot,
and we might think this is getting us closer to the solution. However, it can be noted that
we could keep increasing the $\alpha$ multipliers in the same proportion, with $L_{Dn} = \alpha_1 + \alpha_2 + \alpha_3$, and so
the objective function increases without limit (note that in this example the problem
is not characterized by a quadratic function, but by a linear function, so there cannot
be an optimisation solution).
Therefore, if the objective function increases without limit, then we are applying
a linear separable machine to a linear non-separable training set.
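The unbounded growth can be verified directly on the non-separable example. In this sketch the multipliers are scaled along the direction $(1, 2, 1)$, an illustrative choice that keeps condition (10) satisfied and $w = (0, 0)$; the function name `dual_value` is ours.

```python
# Non-separable example of this section: same points, labels +1, -1, +1.
points = [(1, 1), (2, 1), (3, 1)]
labels = [+1, -1, +1]

def dual_value(alphas):
    """Return (L_D, w), with w from (9) and L_D = sum(alpha) - 1/2 ||w||^2."""
    w = tuple(
        sum(a * y * x[k] for a, y, x in zip(alphas, labels, points))
        for k in range(2)
    )
    return sum(alphas) - 0.5 * sum(c * c for c in w), w

# Scale the multipliers in the proportion 1 : 2 : 1 and watch L_D grow linearly.
growth = [dual_value((t, 2 * t, t)) for t in (1, 2, 10, 100)]
for l_d, w in growth:
    print(l_d, w)

# The intermediate state of the text, for comparison.
print(dual_value((1.5, 2, 0.5)))
```

Since $w$ stays at the origin along this direction, the quadratic term vanishes and $L_D$ equals the sum of the multipliers, so no finite maximum exists.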

1.5 Test Phase

As has been said, once we have trained an SVM, we obtain the values $w$ and
$b$. With these values, we define a separating hyperplane, $w \cdot x + b = 0$, parallel to
$H_1$ and $H_2$ and placed midway between them, at the same distance from both. To classify an
unseen pattern $x$, we just need to know on which side of the separator hyperplane the
point lies, i.e., the sign of $w \cdot x + b$. Note that in the test phase we may have data
points placed between $H_1$ and $H_2$; had they been used during training, the solution found
would have changed somehow. This concept may be useful when developing SVM
training algorithms, because it could identify support vectors a priori, before the whole
training, saving computational power.
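The test phase for the linear case reduces to a sign check, which can be sketched as follows. The values of $w$ and $b$ are those found in the optimisation example ($w = (-2, 0)$, $b = 3$); mapping a score of exactly zero to the positive class is our assumption, since the text does not specify a tie rule.

```python
def decision_value(w, b, x):
    """Signed score w . x + b of the pattern x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Return +1 or -1 from the sign of w . x + b (zero mapped to +1 by assumption)."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def inside_margin(w, b, x):
    """True when x falls strictly between H1 and H2, i.e. |w . x + b| < 1."""
    return abs(decision_value(w, b, x)) < 1

w, b = (-2.0, 0.0), 3.0
for x in [(0.0, 0.0), (3.0, 1.0), (1.4, 2.0)]:
    print(x, classify(w, b, x), inside_margin(w, b, x))
```

The last test pattern is classified as positive but lies inside the margin, the situation described above for unseen data.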
Up until now we have mentioned only the binary case, that is, data can only
have two classes. SVM classifiers can be easily extended to the multiple class case:
for n classes, we just need to generate n-1 binary classifiers which separate one
class from the rest. Nevertheless, this multiple classifier is O(n) times more complex in
time (memory resources are more difficult to estimate) than one binary classifier,
in the training as well as in the test phase. As this extension does not bring major
new advances, it will not be mentioned in the rest of this chapter.

1.6 Non-Separable Linear Case

Now that we know everything that is needed to create and use a simple SVM, we
will upgrade its definition so that it will be able to deal with any real-life problem.
When the above-described algorithm for separable data is used on non-separable
data (see figure 7), no solution will be found, as the value of $L_D$ will grow without
limit (see section 2.5).

Figure 7. A linear non-separable set, which needs a soft-margin classifier. The distribution is defined
as class = 1 if $x_1 + x_2 > 7.5$; class = −1 otherwise. The distribution has some noise

For the non-separable data to satisfy the initial constraints, we
have to introduce the concept of soft margin. This means that the algorithm will
allow some training points to violate those constraints, so that the rest of the training
data will be correctly classified (regardless of the violating points). For that purpose we
introduce positive slack variables $\xi_i$, one for each point, in such a way that the following
inequalities hold [Cortes and Vapnik, 1995]:

(12) $w \cdot x_i + b \geq +1 - \xi_i$ for $y_i = +1$

(13) $w \cdot x_i + b \leq -1 + \xi_i$ for $y_i = -1$

The values $\xi_i$ are not fixed prior to training; they are calculated during the
optimisation process. Because they are not fixed, we can be certain that all
points will satisfy inequalities (12) and (13): we just increase a point's $\xi_i$ until its inequality
holds. We have solved our troubles: now, there will always be a solution. But it
may be that the solution is not close enough to the true distribution underlying the data.
If that is so, then the solution is useless; so we have just changed the name of our
worries.
The introduction of these variables must be accompanied by an increase of the primal
objective function, so that classification errors during training are penalized. For a
training pattern to be misclassified, its associated $\xi_i$ must be greater
than 1, so

$\sum_{i=1}^{l} \xi_i$

is an upper bound on the number of training errors over the
complete training set. Therefore, the objective function to be minimized changes from
$\frac{1}{2} \|w\|^2$ to

$\frac{1}{2} \|w\|^2 + C \sum_{i=1}^{l} \xi_i$

where $C$ is a tunable non-negative real value. This value corresponds to the
global penalization given to training errors. This new objective function could
have been different: we could have devised other methods for forcing the $\xi_i$ values
to be as small as possible. This particular function is chosen for
simplicity: the problem continues to be convex quadratic, and neither the $\xi_i$ nor the
Lagrange multipliers associated with these new constraints appear in the dual
formulation of the problem. Therefore, we have to maximize $L_D$:

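(14) $L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i y_i \alpha_j y_j (x_i \cdot x_j)$

with constraints:

(15) $0 \leq \alpha_i \leq C$

(16) $w = \sum_{i=1}^{l} \alpha_i y_i x_i$

(17) $\sum_{i=1}^{l} \alpha_i y_i = 0$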

The only difference between the previous algorithm and this one is that now
the $\alpha_i$ have an upper bound $C$. The training algorithm will not allow any point to
increase its weight indefinitely, and so a solution will eventually be found.
The error term in the optimisation process comes from those points that have $\xi_i > 0$,
either because they are incorrectly classified or because they lie inside the margin.
For any point that satisfies $\xi_i > 0$, it can be stated that $\alpha_i = C$. It is still a support
vector, and it will be treated as such in the calculation of $w$, but in the optimisation
process its weight will grow no more. The soft margin philosophy (as opposed to the hard margin
defined in section 2.2) is not to forbid training errors, nor even to minimize
them alone. The idea is to minimize the whole objective function, in which errors
exert some pressure, as does the robustness of the hypothesis, identified with the margin
maximization between the well-classified points on each side of the separating
hyperplane (characterised by constraint (7)).
Suppose, for instance, the case shown in figure 7. A hard margin classifier cannot
be found, but many soft margin classifiers will satisfy the constraints, and the only
difference will be the C value. The first approach for newcomers is usually the
hardest soft-margin possible, one that looks like figure 8. It is a valid solution, but it
has a very small margin. By definition of structural risk minimization, if we increase
the margin, test errors would decrease (better generalization performance). On the
other hand, training errors should be avoided (or, at least, limited), so a balance must
be found between margin maximization and error permissibility. A small quantity
of noise may be accepted without modifying the generalization performance, by
creating a hypothesis that is developed after some common properties satisfied by
the data (the internal, true data distribution). In the case of figure 9, it is easily
seen that more training points become errors, but the classifier is much closer to
the underlying distribution concept.
The new parameter C becomes the only value (until now) that must be provided
in the SVM architecture. As it has been said, C serves as a balance between error
permissibility and generalization goodness.

Figure 8. The figure 7 set, with a rather hard soft margin classifier

Figure 9. The figure 7 set, with a softer margin classifier

– If C is small, then errors are cheap. The margin will grow, and so will the number
of training patterns that violate the margin.
– If C is big, then the value of $\|w\|$ has small relevance in the objective function
optimisation compared with training errors. We are approaching the hard margin philosophy.
Because the value of $\|w\|$ is closely related to margin maximization, decreasing its
relevance will take us to a smaller margin, and maybe to a worse generalization
ability.
To choose a good C value, model complexity and expected data noise must be
evaluated as a whole.
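The influence of $C$ can be illustrated by evaluating the soft-margin objective $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ for two fixed candidate hyperplanes. This is a sketch under stated assumptions: the one-dimensional data set, the two candidates and the function names are hypothetical, and no optimisation is performed; the slack of each point is taken as the smallest value satisfying (12) or (13).

```python
def dot(u, v):
    """Euclidean dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def slacks(w, b, points, labels):
    """Smallest xi_i >= 0 satisfying (12)/(13): max(0, 1 - y_i (w . x_i + b))."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in zip(points, labels)]

def soft_margin_objective(w, b, points, labels, C):
    """Primal objective 1/2 ||w||^2 + C * sum_i xi_i."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, points, labels))

# Hypothetical 1-D data: the positive point at 1.5 sits close to the negatives.
points = [(0.0,), (1.0,), (1.5,), (3.0,), (5.0,)]
labels = [-1, -1, +1, +1, +1]

narrow = ((4.0,), -5.0)   # separates every point, margin 2/4 = 0.5
wide = ((1.0,), -2.0)     # margin 2/1 = 2, but the point at 1.5 violates it

for C in (1.0, 10.0):
    cost_narrow = soft_margin_objective(*narrow, points, labels, C)
    cost_wide = soft_margin_objective(*wide, points, labels, C)
    print(C, cost_narrow, cost_wide)
```

With the small value of $C$ the wide-margin candidate has the lower objective despite its training error; with the large value the narrow, error-free candidate wins, which is the trade-off described in the two cases above.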

1.7 Non-Linear Case

In most real life cases, data cannot be separated using a linear hyperplane in input
space. Even the use of slack variables could lead to a poor classifier, in case the
linear deviations are caused by the hypothesis structure and not because of noisy
data. The next step is to introduce in the SVM algorithm non-linear separating
surfaces instead of hyperplanes (see figure 10).
For that purpose, we map the input data into another Euclidean space
$H$, whose dimension is higher than that of the input space. We use a mapping function $\Phi$,
such that:

$\Phi: \mathbb{R}^d \to H$

In the dual formulation of the problem, input data vectors appear only as inner products
$x_i \cdot x_j$ in the space they belong to. Now they will only appear as $\Phi(x_i) \cdot \Phi(x_j)$ in
space $H$.

Figure 10. Non-linear distribution set

Space $H$ will usually be a very high dimensional space. It could even be an infinite
dimensional space. Therefore, performing operations in this space could be too costly.
But if we could find a kernel function $K$ such that $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$, then
we would not need to explicitly map data vectors into space $H$; we would not even
need to know what $\Phi$ is. Now we just have to define a valid kernel function $K$, and
substitute $K(x_i, x_j)$ everywhere $x_i \cdot x_j$ appears in the algorithm.
When we use a much higher dimension space, many new data features, linear
and non-linear, arise. Each new dimension offers a new possible correlation view,
a new attribute with which we can separate the data, a new factor with which to
create the hypothesis. It will be the training process responsibility to discriminate
those attributes that contain useful hyperplane-definition information from those
that do not, by assigning a bigger weight in the linear combination of all features.
For those cases when there is some user information about data correlation, an
explicit mapping can be generated. Nevertheless this is not usual, and could lead
to an inefficient implementation, depending on the previous knowledge credibility.
Using generic mapping functions (we will see them later) offers the possibility of
generating an enormous number of new features, without regard to the meaning
of each one. In fact, these spaces usually have thousands, millions or
even infinitely many dimensions. It is difficult to picture such a big geometrical space. It
seems easier to identify it with a set of non-linear relations between input attributes,
which can be assembled with linear relations in the optimisation process to create
a surface (a hyperplane in feature space, an arbitrary curve in input space) capable
of separating the input data of one class from the other.
If we replace $x_i \cdot x_j$ by $K(x_i, x_j)$ everywhere in the training phase formulas, the
algorithm defined in section 2.2 will generate a linear SVM in a high dimensional
space (specified by the mapping function). Most importantly, it will do so with
roughly the same time complexity as a simple linear SVM created in input space
(without mapping). All further development stays the same, as we are still creating
a linear separator, although in a different space.
In the linear case, the training phase output was the value of $w$ and $b$, with which
the hyperplane was completely defined, and so the test phase just had to determine
on which side of the hyperplane the new pattern lay.
Now, we cannot explicitly calculate $w$, because it is defined in space $H$ only and
we do not know exactly how the mapping is made.

Through the support vector expansion, the value of $w$ can be written as:

(18) $w = \sum_{i=1}^{N} \alpha_i y_i \Phi(s_i)$

so we can write the classification function as:

(19) $f(x) = \sum_{i=1}^{N} \alpha_i y_i \Phi(s_i) \cdot \Phi(x) + b = \sum_{i=1}^{N} \alpha_i y_i K(s_i, x) + b$

where $s_i$ are the $N$ support vectors, identified in the training phase as those patterns
whose Lagrange multiplier is not zero. With this definition we avoid calculating the
mapping function $\Phi$ once more.
Note that soft margin concept still applies to a non-linear classifier. Actually, its
implementation remains very simple: Lagrange multipliers have an upper limit. In
this case soft margin applies to the linear classifier in high dimension space. The
clearest advantage is that we still assert there is a solution. The use of a non-linear
surface as separator functions does not guarantee a solution will be found at all,
even though it is more probable. Moreover, using the soft margin alternative gives
the classifier more robustness against noisy training patterns.
Training phase time complexity does not change, but the test phase is different. In
the linear case, having calculated $w$ explicitly, the algorithm's complexity is O(1), using
the inner product as the basic operation (which is O(d) if multiply-add is the basic
operation). For the non-linear case, we need to perform O(N) operations, where
N was previously defined as the number of support vectors.
Because of the relation between the number of support vectors and complexity,
algorithms have been devised that try to minimize, or even replace, support vectors
during and after training, so that this phase may be competitive with other
machine learning methods, such as neural networks.
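The classification function (19) and its O(N) cost can be sketched as below. The support vectors, multipliers and $b$ are those of the linear example of section 1.4, used with a plain dot-product kernel so that the result can be compared with $w \cdot x + b$; the function names are ours. With another kernel, the multipliers would have to be obtained by retraining.

```python
def linear_kernel(u, v):
    """K(u, v) = u . v, the kernel that reproduces the linear machine."""
    return sum(a * b for a, b in zip(u, v))

def decision_function(x, support_vectors, alphas, labels, b, kernel):
    """Equation (19): f(x) = sum_i alpha_i y_i K(s_i, x) + b, one kernel call per support vector."""
    return sum(
        a * y * kernel(s, x)
        for a, y, s in zip(alphas, labels, support_vectors)
    ) + b

# Support vectors and multipliers from the worked linear example.
support_vectors = [(1, 1), (2, 1)]
alphas = [2, 2]
labels = [+1, -1]
b = 3

scores = [
    decision_function(x, support_vectors, alphas, labels, b, linear_kernel)
    for x in [(1, 1), (2, 1), (3, 1)]
]
print(scores)
```

The two support vectors score exactly +1 and −1, as expected for points on $H_1$ and $H_2$, and the third point scores −3, the same value as $w \cdot x + b$ with $w = (-2, 0)$.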

1.8 Mapping Function Example


For a better understanding of how new useful features are generated, we will
show an example.
Suppose we have a data set $\{(x_i, c_i)\}$ in $\mathbb{R}^2 \times \{+1, -1\}$ as shown in figure 11.
It can be seen that this is not a linearly separable case, and a soft margin linear
separator is not enough. In this example, the training data has no noise.
We define a mapping function $\Phi: \mathbb{R}^2 \to \mathbb{R}^3$ of the form:

$(x_1, x_2) \to (x_1, x_2, x_1 x_2)$

Therefore, we have added a new feature to the input definition, which gives us
information about a specific kind of relation between the two initial variables. Thus,
we can calculate the kernel function:

$K(x, x') = \Phi(x) \cdot \Phi(x') = (x_1, x_2, x_1 x_2) \cdot (x'_1, x'_2, x'_1 x'_2) = x_1 x'_1 + x_2 x'_2 + x_1 x_2 x'_1 x'_2$

Figure 11. Non-linear distribution set. The distribution is defined as class = 1 if $x_1 x_2 < 14.5$;
class = −1 otherwise

Figure 12. Feature space view for the main points from figure 11. The margin ($h_1 - h_2$) is partially
shown using solid lines

We have defined the mapping function and the new space implicitly, using the
inner product in input space as the only valid operator.
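The equivalence between the explicit mapping and the kernel of this example can be checked with a few lines of code. In this sketch the names `phi`, `kernel` and `label` are ours, and the test points are arbitrary; `label` simply encodes the class rule of figure 11 on the third feature.

```python
def phi(x):
    """Explicit mapping (x1, x2) -> (x1, x2, x1 * x2)."""
    x1, x2 = x
    return (x1, x2, x1 * x2)

def dot(u, v):
    """Euclidean dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def kernel(x, z):
    """K(x, z) = x1 z1 + x2 z2 + x1 x2 z1 z2, computed in input space only."""
    return x[0] * z[0] + x[1] * z[1] + x[0] * x[1] * z[0] * z[1]

def label(x, threshold=14.5):
    """Class rule of figure 11 in feature space: +1 if the third feature is below the threshold."""
    return 1 if phi(x)[2] < threshold else -1

a, c = (2, 3), (4, 5)
print(kernel(a, c), dot(phi(a), phi(c)))   # both routes give the same number
print(label(a), label((5, 4)))
```

The kernel never builds the three-dimensional vectors, yet returns the same inner product as the explicit mapping.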
In figure 12, the most important points from the training data set have been
represented, as well as the separator hyperplane the SVM algorithm would find and
those points that become support vectors. The separator hyperplane is $z = 14.5$.
Note that in the final hypothesis only one of the three available features is required
to create the hyperplane (it is defined using just the third component). This
will be a very common case in non-linear SVMs: just a few features will form the
linear combination defining the separator hyperplane.
To represent the curve in input space that describes the generated hyperplane we
need to use the inverse mapping:

$\Phi^{-1}: (x_1, x_2, x_1 x_2) \to (x_1, x_2)$

Figure 13. Non-linear classifier for the figure 11 set. Support vectors are encircled, margin is shown
using dashed lines and the separator curve is shown with a solid line

As the new axis $z$ was defined as $z = x_1 x_2$ in the high dimensional space, those
points that lie on the hyperplane satisfy $x_1 x_2 = 14.5$, and so the curve in input space
can be defined as $x_2 = 14.5 / x_1$.
In figure 13, the final result can be observed, with the hyperbola $x_2 = 14.5 / x_1$ as
the non-linear class separator. Support vectors in this figure are those that
were identified during training and highlighted in figure 12. It should not be assumed
that those points that lie near the non-linear separator in input space must
become support vectors, although this is usually the case. The mapping function does
not necessarily preserve any relation among the input data, but the concept behind the
support vector is that of a “significant point”, and the points that carry most information
are those that lie near points of the other class in input space.
In real world cases, this function will not be useful, unless clear and simple a
priori information is given to the SVM engineer. Nevertheless, it is a valid mapping
function and generates a valid kernel function. For this to happen, the function $K(x, y)$
must satisfy some constraints, known as Mercer conditions.

1.9 Mercer Conditions

Not all kernel functions are valid, that is, not all of them describe a Euclidean space with the
properties required in previous sections. It is enough to satisfy the Mercer conditions
[Vapnik, 1995], which can be written as:
There exists a mapping $\Phi$ such that $K(x, y) = \Phi(x) \cdot \Phi(y)$ if and only if, for all $g(x)$ such
that $\int g(x)^2 \, dx$ is finite, the following inequality holds:

$\int\!\!\int K(x, y) \, g(x) \, g(y) \, dx \, dy \geq 0$

In most cases, this is a very complicated condition to check, because it must hold
‘for all g(x)’. It has been proven for

$K(x, y) = \sum_{p=1}^{P} C_p (x \cdot y)^p$

when $C_p$ is a positive real number and $p$ is a positive integer.

1.10 Kernel Examples


The first (and only) basic kernels used to develop pattern recognition as well as
non-linear regression and principal component analysis with SVM are (for any pair
of vectors $x, y \in \mathbb{R}^d$):

(20) $K(x, y) = (x \cdot y + 1)^p$

(21) $K(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$

(22) $K(x, y) = \tanh(\kappa \, x \cdot y - \delta)$

Kernel (20) is a non-homogeneous polynomial classifier of degree $p$ (another
commonly used variation is the homogeneous polynomial kernel, without the term ‘+1’). It creates
a space $H$ with as many dimensions (data features) as p-combinations of x and y. All
possible relations between input attributes up to degree $p$ appear in the new space.
The margin maximization algorithm will discriminate those carrying information
from those that do not (which should be most of them), so the number of adjustable
parameters required to obtain a good solution decreases.
Kernel (21) is a Gaussian radial basis function (RBF). The dimension of the new space
is not fixed: it depends on the actual data distribution, and it can be infinite. The
visual effect of this kernel is that nearby patterns form class clusters, as big as they can be.
Clusters have the support vectors as centres (in feature space), and the radius is
given by the value of $\sigma$ and the support vector weight, obtained during training.
Kernel (22) is similar to a two layer sigmoidal neural network. Using the neural
network kernel, the first layer is composed of N sets of weights, each set consisting
of d weights; the second layer is composed of N weights (the $\alpha_i$), so that an
evaluation requires a weighted sum of sigmoids evaluated on dot products. The
structure and weights (which define the related neural network architecture) are
given automatically by the training process. Not all values of $\kappa$ and $\delta$ satisfy the Mercer
conditions [Vapnik, 1995].
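The three basic kernels can be written as short functions. This is a minimal sketch, not a reference implementation: the parameter names `p`, `sigma`, `kappa` and `delta` and their default values are our assumptions, chosen to mirror the symbols in (20), (21) and (22).

```python
import math

def dot(u, v):
    """Euclidean dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(x, y, p=2):
    """Kernel (20): (x . y + 1)^p."""
    return (dot(x, y) + 1) ** p

def rbf_kernel(x, y, sigma=1.0):
    """Kernel (21): exp(-||x - y||^2 / (2 sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, delta=0.0):
    """Kernel (22): tanh(kappa * x . y - delta); not a Mercer kernel for every kappa, delta."""
    return math.tanh(kappa * dot(x, y) - delta)

x, y = (1, 2), (3, 4)
print(polynomial_kernel(x, y, p=2))
print(rbf_kernel(x, y, sigma=1.0))
print(sigmoid_kernel(x, y, kappa=0.1, delta=1.0))
```

Any of these functions can be passed wherever a dot product between two patterns appears in the training and test formulas.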
We say (20), (21) and (22) are basic functions because new kernel functions
can be formulated by combining them while still satisfying the Mercer conditions. A
linear combination of two Mercer kernels with non-negative coefficients is a Mercer kernel. This can be easily
demonstrated knowing that the integral operator is distributive with respect to
addition. Other slight changes to the basic functions can also be made, looking for
a kernel function that incorporates a priori information about the
internal distribution.

Nevertheless, it has been observed experimentally that, in many cases, kernel choice
is not a determining factor in machine performance. For a real-world problem
whose internal distribution is not particularly fitted to some kind of kernel, support
vector sets tend to be very similar, no matter what non-linear function is used. Of
course the weights are fairly different, as the evaluating function is different too. But the result,
the separating surface, tends to have a very similar geometrical shape, especially
where data density is high. As was said in previous sections, the reason could
be that those patterns that are important because they lie near other-class patterns
continue to be important regardless of the mapping function, so they become support
vectors.
Last, we define the kernel matrix as a symmetric square matrix of order M
(where M is the number of training patterns), whose entry (i,j) holds the kernel
function value $K(x_i, x_j)$.
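
As a minimal sketch of this definition, the kernel matrix can be built as follows. The Gaussian RBF kernel with `sigma = 1.0` and the three sample patterns are assumptions chosen only for illustration; any Mercer kernel could be passed instead.

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel; sigma = 1.0 is an illustrative choice."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_matrix(patterns, kernel):
    """Return the M x M matrix whose entry (i, j) is kernel(x_i, x_j)."""
    m = len(patterns)
    K = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i, m):
            # The matrix is symmetric, so each value is computed only once.
            K[i][j] = K[j][i] = kernel(patterns[i], patterns[j])
    return K

patterns = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]   # hypothetical training patterns
K = kernel_matrix(patterns, rbf_kernel)
```

Note that memory grows with the square of M, which is why the size of this matrix becomes the practical bottleneck for large training sets.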

1.11 Global Solutions and Uniqueness


As has been shown in previous sections, the result of SVM training is a global
solution of the optimisation process, i.e., the parameter set (values for w, b and $\alpha_i$)
that gives the optimum of the objective function. This term is opposed to ‘local solution’,
defined as a parameter set whose objective function value is optimal only when compared
with its vicinity. In the SVM algorithm, any local solution is also a global
solution because training is posed as a convex quadratic problem.
Nevertheless, the global solution may not be unique. There could be more than one
parameter set where the objective function takes the same, optimal value. This
is not inconsistent with the definition of a global solution. Solution uniqueness is
guaranteed only if the problem is strictly convex. The SVM training definition
ensures that the problem is convex, but the training data determine whether it is strictly
convex or not.
Non-uniqueness occurs in two different ways:
• When the w and b values are not unique. In this case all w and b values lying between
two solutions are also global solutions. This is easy to accept, as the problem is
convex.
• When the w and b values are unique, but the w value comes from different sets of
$\alpha_i$ values. Reaching one solution or the other depends on the randomness of the
training algorithm. Remember that there can be training data points that lie on the
margin hyperplane but are not support vectors. Much as three points in a row
define just one straight line, and throwing away any one of the three gives the
same line, it is easy to create a training set that generates a different
hard margin support vector set depending on the listing order, although
the separating hyperplane remains unchanged.
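
The second case can be illustrated with a toy example. The sketch below uses a made-up training set (three positive points on one margin line, three negative points on the other) and the standard hard margin expansion $w = \sum_i \alpha_i y_i x_i$. The two multiplier sets are chosen by hand for this illustration, not produced by a training algorithm.

```python
# Toy hard-margin set (hypothetical): three positive points on the line
# x2 = +1 and three negative points on the line x2 = -1.
X = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (0.0, -1.0), (1.0, -1.0), (2.0, -1.0)]
y = [1, 1, 1, -1, -1, -1]

def weight_vector(alpha):
    """Support vector expansion: w = sum_i alpha_i * y_i * x_i."""
    return tuple(sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X))
                 for k in range(2))

def dual_objective(alpha):
    """Hard-margin dual objective: sum_i alpha_i - 1/2 * ||w||^2."""
    w = weight_vector(alpha)
    return sum(alpha) - 0.5 * sum(c * c for c in w)

def balance(alpha):
    """Equality constraint of the dual, sum_i alpha_i * y_i, which must be 0."""
    return sum(a * yi for a, yi in zip(alpha, y))

# Two different multiplier sets: only the middle points, or only the end points.
alpha_a = [0.0, 0.5, 0.0, 0.0, 0.5, 0.0]
alpha_b = [0.25, 0.0, 0.25, 0.25, 0.0, 0.25]

w_a, w_b = weight_vector(alpha_a), weight_vector(alpha_b)
```

Both sets satisfy the dual constraint, give the same objective value (0.5, equal to the primal value $\frac{1}{2}\|w\|^2$ for w = (0, 1)) and the same hyperplane, yet they name different points as support vectors. In the first set the end points lie on the margin with zero multipliers.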

1.12 Generalization Performance Analysis


Mercer's condition tells us whether a kernel function defines a new Euclidean space
or not, but it does not define how the mapping function must be applied or
the morphology of the new space. For easy cases, the feature space dimension can be
deduced. For instance, for an input space of dimension d, the p-degree homogeneous
polynomial kernel has

$$\binom{d+p-1}{p}$$

new features or dimensions. For a 4-degree polynomial kernel using 16 × 16 pixel
images (256 initial features), the new space dimension is 183181376. In real-world
cases we will never have training sets that big. A classification machine with a
huge ‘features over data’ ratio would undoubtedly produce overfitting.
Let us use an easier example: a 3-degree polynomial with 8 × 8 pixel data. The new
space dimension is 45760. If you are using a simple multi-layer perceptron neural
net, the ratio of the number of weights to the number of data points should not be greater
than around 15%. Suppose you are generating a hidden layer with 45760 units (new
features), 64 units in the input layer and one unit as output. The number of weights
in the net is around 2974400 (almost 3 million). Therefore, the minimum training
data set should have 19829333 patterns (almost 20 million). Now, that is an awfully
big data set.
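
The figures quoted in the last two paragraphs can be reproduced with a few lines of arithmetic. This is only a check of the numbers in the text; the function name is hypothetical, and biases are ignored when counting weights, as the text seems to do.

```python
from math import comb

def poly_feature_dim(d, p):
    """Number of monomials of degree exactly p in d variables: C(d + p - 1, p)."""
    return comb(d + p - 1, p)

dim_16x16_p4 = poly_feature_dim(16 * 16, 4)   # 4-degree kernel, 256 input features
dim_8x8_p3 = poly_feature_dim(8 * 8, 3)       # 3-degree kernel, 64 input features

# Equivalent perceptron: 64 inputs -> dim_8x8_p3 hidden units -> 1 output.
n_weights = 64 * dim_8x8_p3 + dim_8x8_p3

# Training patterns needed if weights may be at most about 15% of the data points.
min_patterns = int(n_weights / 0.15)
```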
Of course, not all 45760 new features are important. Many of them will have
a null weight. But you cannot know at first which features will be needed and
which ones will not be. Some algorithms have been designed to decrease the neural
net while training, but even in this case the difference between useful feature and
disturbing feature is not easy to make.
A separating hyperplane in feature space H must have dim(H)+1 parameters. Any
classification system needing so many parameters to create a discrimination function
will be inefficient in resources and time. Nevertheless, SVM have good classification
and generalization performance, in spite of treating data in an enormous space,
which could even be infinite. The reason has not been formally demonstrated,
although the maximum margin requirement has much to say about it. Within the
SVM, the solution has at most l + 1 adjustable parameters, where l is the number of
training patterns. After training, the solution has N + 1 parameters, where N is the
number of support vectors, which is much smaller than the number of new features.
In section 2.1 we left open the question of which classifier is better out of two
possible choices. The answer is “the one having the lowest VC dimension”, which
is the same as saying “the simplest”. It was shown that the bound on the risk is
related to the VC dimension: the lower the VC dimension, the lower the risk bound.
However, this does not tell you which one will have the lowest actual risk. There is
no way to know it beforehand.
This approach is not only mathematically motivated; it can also be supported
on philosophical grounds. A 14th-century English philosopher, William
of Ockham, stated the principle known as Ockham’s razor: given some evidence and two
hypotheses, one simple and one complex, both satisfying the evidence, the
simplest hypothesis is the most likely to be true. It does not say which one is true, but
if you had to bet and you had no additional knowledge or evidence, you should go
for the first hypothesis. That is what learning is all about, be it machine or human: choose
the hypothesis which seems most probable given the current evidence. Whenever
you make a new assumption (using an unnecessarily complex hypothesis) you are
most probably farther from the truth.
That answers the big question: why is the generalization performance of support
vector machines good even when using a high-dimensional feature space? Because SVM
performance is not related to the dimension of the space where data is separated, but to the
VC dimension of the classifier. Therefore, the SVM classifier depends on the simplicity of the
data hypothesis, not on the number of available features. If a simple hypothesis can do
the separating job, the SVM will use it, with no overfitting.
There is no magic any more. The SVM algorithm gives the simplest hypothesis,
that is, the most probable one. But this does not mean there cannot be a better answer
for a given problem. In spite of our staunch SVM militancy, we do not deny that SVM
have been slightly outperformed (mostly by specific neural networks) in some
experimental benchmarks. The explanation is simple: luck. The SVM gave the most probable
answer after one general-purpose execution. But the true internal distribution may
have been slightly more complex, even though it did not show in the training data.
If you are trying a neural network architecture with a bit more complexity, which
way will you go? You cannot say unless you have additional information. The
successful network architect would most probably try all possible ways. This means
trying hundreds of different architectures and finally using the one with the best
error rate on the test set. But that approach fails in several respects: first, the
engineer must decide how much complexity the answer should have (not an easy
task at all); second, the training set must deviate slightly from the internal
distribution for the SVM to lag behind; third, if you generate many classifiers
and use the test set to decide which one is better, then the test set is no longer
a good validation set, because you are using it as a secondary training set (even
though it is used as a validation set for publishing the results); last, the engineer
spends a lot of time in the training phase.
And in spite of all this extra work, in cases where SVM are outperformed, they
are still very near to the highest results in this scientific ranking. Which means that
in the real world it is difficult to find the SVM outperformed.
Support Vector Machines are not easy to implement, but they are very easy to
use. Nevertheless, their use has some limits. As SVM is a statistical method, it is not
well suited to symbolic learning. For instance, the parity problem with few data
makes the SVM decide that all points are support vectors. This is a clear sign of
bad generalization performance, because it means: “one point has no relation with
any other point”. In those cases an SVM is no better than a simple Nearest Neighbour
classification algorithm.
Other Machine Learning paradigms, for instance C4.5, are able to work with
input data whose attributes may have an unknown value (C4.5 uses the ‘?’ symbol).
The algorithm identifies this value and treats the information accordingly. However,
the SVM algorithm does not allow unknown values, which reduces its applicability
to some data sets.
Inside the previously defined scope, SVM has a very light bias. It is a true
general-purpose machine learning method. Although a priori information can be
included inside the kernel function, the number of new features is so large that,
regardless of the internal data distribution, there will always be a nearby hypothesis
model using those new features. The training algorithm has an embedded
balance between using too few features (too simple a hypothesis) and using
too many (overfitting).
The basic achievement in using SVM is that you just choose a generic kernel
function (we won’t say “any kernel will do”, but it is not too far from the truth),
and the confidence degree C (up until now, mostly heuristics are used, but you
will soon find it is quite easy). Then you push the button, and after some time you
will have the best classification machine. No need for an experienced engineer or
scientist. No complicated architectures. No tailoring. No second thoughts. Child’s
play.

2. SVM MATHEMATICAL APPLICATIONS

The initial mathematical development for SVM has been applied to different
approaches within the scope of Machine Learning. All of them are based on the structural
risk minimization principle, on the Lagrange formulation of the problem, and on
the generalization to the non-linear case. For each approach you only need to define the
requirements all points must satisfy, their effect on the objective function and the
mathematical steps through the Lagrange formulation.

2.1 Pattern Recognition

The first approach to SVM was in the pattern recognition field. In fact, the search
for a new statistical paradigm able to optimise the class separation problem was what
drove V. Vapnik's research on quadratic programming.
For that reason, the SVM definition developed in the previous sections and the
implementation shown in the next sections apply specifically to pattern recognition.
Nevertheless, most concepts also apply to the other approaches defined in this
section.

2.2 Regression

Historically, the second approach the SVM had was non-linear regression and
function estimation, called SVRM (Support Vector Regression Machines) [Vapnik
et al, 1997].
This field can be divided into two parts: first, ‘function approximation’, which tries
to find the curve that best fits training data acquired without noise
(which makes it very similar to the usual interpolation methods); second, ‘function
estimation’ (regression), where data is noisy and its distribution is unknown, and
the method tries to estimate unseen data points with the simplest possible function,
including extrapolation.
The SVRM algorithm treats both cases in a very similar way. For each case, the cost
function can be slightly changed.

2.2.1 Definition
Suppose we have a training set with M data pairs $(x_i, y_i)$, where $x_i \in \mathbb{R}^d$, $i = 1, \ldots, M$
(up until now, just the same as the pattern recognition case), and where $y_i \in \mathbb{R}$ is
no longer a label but a real number representing the value of the function
we want to estimate at $x_i$, i.e. $y_i = f_r(x_i) + n_i$, where $n_i$ is the noise associated with
point i.
We want to find a function $f(x)$ having a maximum deviation of $\varepsilon$ with respect to
all training $y_i$. In the basic case, there can be no training points at a distance from
the expected value bigger than $\varepsilon$, so the resulting curve must fit all points. This case
can be used only when the data describe a linear function with a noise level $n_i < \varepsilon\ \forall i$.
The estimating function has the form:

(23) $f(x) = w \cdot x + b$

where w is the vector defining the curve in input space, and b is the free term (the bias).
As in the pattern recognition case, the structural risk minimization principle
demands the greatest possible simplicity of the approximation function. We will
try to minimize $\|w\|^2$, which will give us the flattest linear function among those
satisfying the constraints (unlike the margin maximization definition in pattern
recognition). Therefore, the optimisation problem is written as:
Minimize

$\frac{1}{2}\|w\|^2$

with respect to constraints:

$y_i - (w \cdot x_i) - b \le \varepsilon$
(24) $(w \cdot x_i) + b - y_i \le \varepsilon$
Nevertheless, following the same reasoning as in section 2, this inflexible formulation
is only valid when there is at least one solution satisfying conditions (24).
Because this is usually an unrealistic case, with no noise in the data (it could be used
for an interpolation approach), the soft margin idea must be introduced. We define
positive slack variables $\xi_i$ that measure how far the expected value is
from the true value for point i. Thus, we are giving the algorithm the ability
to admit errors (points not satisfying the constraints), while keeping the ability to find
a solution representing the data distribution well enough. Likewise, a new cost
function must be defined, giving a balance between the number of allowed errors
and the simplicity (and usefulness) of the final estimating function. This cost function,
$c(x, y, f)$, must fulfil some properties, discussed in the next section.
To continue with the formulation development through this section we will use
the $\varepsilon$-insensitive cost function [Vapnik et al, 1997], partly because it was the first
one proposed, and partly because it is the simplest to interpret and optimise. This cost
function is continuous but not differentiable, so variables must be duplicated (the formulation
gets longer but no more difficult). Now all $\alpha$ and $\xi$ become $\alpha, \alpha^*$ and $\xi, \xi^*$, where
the one without asterisk is associated with the $y_i \ge f(x_i)$ case, and the one with asterisk
is associated with the $y_i < f(x_i)$ case. Note that both cases cannot hold for the same
point, so for every training point at least one of the duplicated variables will be zero.
Thus, the objective function becomes:

(25) $L_P = \frac{1}{2}\|w\|^2 + C\,\frac{1}{M}\sum_{i=1}^{M} c(x_i, y_i, f)$

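After the primal and dual formulation development (just as in the pattern recognition
case), the problem can be written as:
Maximise

$L_D = -\frac{1}{2}\sum_{i=1}^{M}\sum_{j=1}^{M} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\,(x_i \cdot x_j) - \varepsilon\sum_{i=1}^{M} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{M} y_i\,(\alpha_i - \alpha_i^*)$

with respect to:

$\sum_{i=1}^{M} (\alpha_i - \alpha_i^*) = 0$

(26) $\alpha_i, \alpha_i^* \in [0, C]$

where C has the same meaning as in section 2: a parameter balancing error
permissibility.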
This development remains a convex quadratic optimisation problem,
which has to satisfy the Karush-Kuhn-Tucker conditions at optimality. Therefore,
the implementation methods defined for pattern recognition are applicable, although
with some differences caused by the cost function. In the case of the $\varepsilon$-insensitive cost
function, the duplicated Lagrange multipliers must be treated specifically. This will
happen with all non-differentiable cost functions.
Again, support vectors are those training points whose Lagrange multipliers are
not zero (in the case of duplication, it means one of the multipliers is non-zero).
Moreover, those points having a non-zero slack variable are considered training
errors, and have the corresponding multiplier set to the maximum, $\alpha_i^{(*)} = C$ (where
the symbol $^{(*)}$ means “either of the duplicated items, the applicable one”). Support
vectors having a Lagrange multiplier not at bound, $0 < \alpha_i^{(*)} < C$, are placed on the
$\varepsilon$ margin (they are needed to define the $\varepsilon$ margin) and have a zero slack variable.
Basically, the concepts behind the support vectors, weights and geometrical meaning
remain the same as in the pattern recognition case (see figure 14).
Figure 14. Linear regression machine. Support vectors are encircled, the margin/tube is shown with
dashed lines and the estimated function is shown with a solid line

The value of the bias b can be calculated from a non-bound support vector, i.e.
a non-error support vector. The equalities to be used are those obtained from inequalities (24):

(27) $b = y_i - (w \cdot s_i) - \varepsilon$  if $\xi_i = 0$ and $0 < \alpha_i < C$

(28) $b = y_i - (w \cdot s_i) + \varepsilon$  if $\xi_i^* = 0$ and $0 < \alpha_i^* < C$

In case all support vectors are errors (very unusual, and in any case, most probably
a bad solution) the b calculation method is much more complex, and can be done
during optimisation itself.
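
Equalities (27) and (28) translate directly into code. The sketch below is an illustration under stated assumptions: the function name, the tolerance `tol` and the averaging over all non-bound support vectors are choices of this example, not part of the text. The data form a hypothetical one-dimensional problem whose flattest solution inside the tube is f(x) = 2x + 1.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def bias_from_support_vectors(w, X, y, alpha, alpha_star, eps, C, tol=1e-9):
    """Estimate b from non-bound support vectors using equalities (27) and (28).

    Every non-bound support vector should give the same b; averaging them
    (an assumption of this sketch) only smooths numerical error.
    """
    estimates = []
    for xi, yi, a, a_s in zip(X, y, alpha, alpha_star):
        if tol < a < C - tol:        # point on the upper edge of the tube, eq. (27)
            estimates.append(yi - dot(w, xi) - eps)
        if tol < a_s < C - tol:      # point on the lower edge of the tube, eq. (28)
            estimates.append(yi - dot(w, xi) + eps)
    if not estimates:
        raise ValueError("no non-bound support vector available")
    return sum(estimates) / len(estimates)

# Hypothetical 1-D data with eps = 0.1: the last point lies on the upper
# edge of the tube, the first one on the lower edge, the middle one inside.
eps, C = 0.1, 10.0
X = [(0.0,), (1.0,), (2.0,)]
y = [0.9, 3.0, 5.1]
alpha = [0.0, 0.0, 1.0]
alpha_star = [1.0, 0.0, 0.0]

# w follows from the support vector expansion, sum_i (alpha_i - alpha_i^*) x_i.
w = (sum((a - a_s) * xi[0] for a, a_s, xi in zip(alpha, alpha_star, X)),)
b = bias_from_support_vectors(w, X, y, alpha, alpha_star, eps, C)
```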
Likewise, we can define a non-linear mapping from input space to a feature
space, where the algorithm will try to find the flattest function approximating the
data well enough. The ‘flat’ property can usually be seen in the corresponding
non-linear curve in input space: its shape is the one with the smallest slope
through the point set. The mapping concept and development are similar to those
described in previous sections: a kernel function K(x,y) performs all operations
implicitly in feature space, usually a space of much higher dimension (see figures 15a
and 15b).
Therefore, the non-linear problem is defined as:
Maximize

$L_D = -\frac{1}{2}\sum_{i=1}^{M}\sum_{j=1}^{M} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\,K(x_i, x_j) - \varepsilon\sum_{i=1}^{M} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{M} y_i\,(\alpha_i - \alpha_i^*)$

with respect to:

$\sum_{i=1}^{M} (\alpha_i - \alpha_i^*) = 0$

(29) $\alpha_i, \alpha_i^* \in [0, C]$

and the support vector expansion of w and the estimated function are written:

(30) $w = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*)\,\Phi(s_i)$

(31) $f(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*)\,K(s_i, x) + b$

where $\Phi$ is the mapping function, $s_i$ are the resulting N support vectors, and M is
the size of the complete training set.

Figure 15a. Non-Linear regression machine. The dots follow the sinc(x) function, the dashed lines are
the $\varepsilon$-tube, and the solid line is the SVRM function estimation. Note that support vectors are those
corresponding to the 3 tangent points (1 in the middle, x = 0, and the other two at the limits)

Figure 15b. Non-Linear regression machine. The dots follow the same sinc(x) function, and the other
elements follow the figure 15a notation. Note that as the $\varepsilon$-tube decreases, the function estimation gets
more accurate. At the limit, if noise allowance approaches 0, the function estimation error will also be
0 in this example
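
Once training has produced the support vectors and their multipliers, evaluating the estimated function is a weighted sum of kernel values plus the bias. The following sketch assumes a Gaussian RBF kernel with `sigma = 1.0`; the support vectors, coefficients and bias are made-up values for illustration, not the result of an actual optimisation.

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel; sigma = 1.0 is an illustrative choice."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def svr_predict(x, support_vectors, coeffs, b, kernel):
    """Evaluate f(x) = sum_i (alpha_i - alpha_i^*) K(s_i, x) + b.

    coeffs[i] holds the difference alpha_i - alpha_i^* of support vector s_i.
    """
    return sum(c * kernel(s, x) for s, c in zip(support_vectors, coeffs)) + b

# Hypothetical trained machine with N = 3 support vectors (values are made up;
# the coefficients add up to zero, as the equality constraint requires).
support_vectors = [(-2.0,), (0.0,), (2.0,)]
coeffs = [-0.5, 1.0, -0.5]
b = 0.1

f_at_zero = svr_predict((0.0,), support_vectors, coeffs, b, rbf_kernel)
```

The cost of one evaluation is N kernel computations, so it depends on the number of support vectors and not on the dimension of the feature space.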
It has been observed, through a number of experiments [Osuna and Girosi, 1998],
that SVRM tend to use a relatively low number of support vectors, compared to
other similar machine learning processes. The reason could be the flexibility allowed
while errors are below a threshold, which generates simpler surfaces that need
fewer support vectors to define them.
Moreover, it has been shown that the algorithm works well when a non-linear
kernel is applied in spite of having few training data. Other well-known methods
will easily overfit the data, while the SVRM dynamically controls its generalization
ability, generating a hypothesis simple enough to model the training data distribution
better.

2.2.2 Cost Functions and $\nu$-SVRM


The cost function is one of the key elements in SVRM. As was said in the
previous section, real data is usually acquired with a certain level of noise of
unknown distribution. The cost function is in charge of accepting noise deviations
and penalizing wide deviations, whether they are caused by noise or by a hypothesis
that is currently too simple. The question is how to tell noise apart from
hypothesis complexity.
Nevertheless, this function must satisfy certain properties. For the problem to be
solvable in practice, the cost function must be convex, thus maintaining problem
convexity and ensuring solution existence, uniqueness and globality. Moreover, for
the mathematical development to remain simple, it is required to be symmetric
and to have at most two discontinuities in the first derivative, at $\pm\varepsilon$, with $\varepsilon \ge 0$.
Therefore, even if we knew the noise distribution, it would be too complex to
introduce that additional information into the algorithm. We would then have to
find a convex cost function that adjusts to the noise distribution, but we would
still be using an approximation. Not to mention the mathematical development for the
new cost function, notably difficult for non-expert mathematicians. The conclusion
is: just use a general-purpose cost function and let the SVRM automatic learning
do the engineering job.
The development described in the previous subsection refers to the $\varepsilon$-insensitive
cost function, which is the most commonly used, and is defined as:

(32) $c(\xi) = \begin{cases} 0 & \text{if } |\xi| \le \varepsilon \\ |\xi| - \varepsilon & \text{if } |\xi| > \varepsilon \end{cases}$

where $\xi$ denotes the deviation between y and f(x). This kind of function has an
additional parameter, $\varepsilon$, which helps to adjust
the maximum allowable deviation for any given point. A validation process is
required to adjust this parameter, even though its value can be estimated from
any additional knowledge about noise or data distributions.
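
As a minimal sketch, cost function (32) and its average over a set of points can be written as follows. The function names and the sample values are hypothetical.

```python
def eps_insensitive(deviation, eps):
    """Cost (32): zero inside the eps-tube, growing linearly outside it."""
    return max(0.0, abs(deviation) - eps)

def empirical_cost(y_true, y_pred, eps):
    """Average eps-insensitive cost over a set of points."""
    total = sum(eps_insensitive(t - p, eps) for t, p in zip(y_true, y_pred))
    return total / len(y_true)

# Hypothetical targets and estimates: only the second and third points
# fall outside a tube of half-width 0.1.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.05, 2.5, 2.0, 4.0]
cost = empirical_cost(y_true, y_pred, eps=0.1)
```

The function is symmetric and convex, and its first derivative is discontinuous only at the two edges of the tube, which are the properties required above.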
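To finish with the SVRM section, we will summarize a variation of the $\varepsilon$-SVRM
(using the $\varepsilon$-insensitive cost function), called $\nu$-SVRM [Schölkopf et al, 1998].
The difference lies not in the cost function itself (which remains the
$\varepsilon$-insensitive one), but in the objective function. The $\varepsilon$-SVRM gave the objective
function as:

(33) $L_P = \frac{1}{2}\|w\|^2 + C\,\frac{1}{M}\sum_{i=1}^{M} (\xi_i + \xi_i^*)$

and now, in $\nu$-SVRM, the objective function is:

(34) $L_P = \frac{1}{2}\|w\|^2 + C\left(\nu\varepsilon + \frac{1}{M}\sum_{i=1}^{M} (\xi_i + \xi_i^*)\right)$

with respect to the same constraints as in (24).
The resulting dual formulation of the problem is:
Maximize

$L_D = \sum_{i=1}^{M} y_i\,(\alpha_i - \alpha_i^*) - \frac{1}{2}\sum_{i=1}^{M}\sum_{j=1}^{M} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\,K(x_i, x_j)$

with respect to

$\sum_{i=1}^{M} (\alpha_i - \alpha_i^*) = 0$

$\sum_{i=1}^{M} (\alpha_i + \alpha_i^*) \le C\nu$

(35) $\alpha_i, \alpha_i^* \in \left[0, \frac{C}{M}\right]$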

and leaving the estimating function in the same form as in (31). The values of b
and $\varepsilon$ can be calculated after training using constraints (24) for non-bound support
vectors. If the value of $\varepsilon$ increases, then the first term of the cost in (34), $\nu\varepsilon$,
will increase proportionally, while the second term will decrease, as some points will
benefit from the softer constraints and will fall inside the $\varepsilon$ bound (it also decreases
in proportion to the number of these newly lucky points). For the objective function to attain the
optimum, the value of $\varepsilon$ must increase until the fraction of error points (out of
bounds) is less than or equal to the value of $\nu$. Therefore the new parameter $\nu$ is an
upper limit on the fraction of training errors (which is related to the number of support vectors).
Obviously it must satisfy $\nu \in [0, 1]$.
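
The trade-off between the two terms can be illustrated numerically. The sketch below is a deliberate simplification: it keeps the estimating function fixed and only varies the tube width, whereas the real algorithm optimises w, b and the tube width together. The residuals, the value `nu = 0.25` and the function names are hypothetical.

```python
def nu_cost_term(abs_residuals, eps, nu):
    """Bracketed term of (34), nu*eps + (1/M) * sum of slacks, for a fixed f."""
    slack = sum(max(0.0, r - eps) for r in abs_residuals)
    return nu * eps + slack / len(abs_residuals)

def fraction_outside(abs_residuals, eps):
    """Fraction of points lying strictly outside the eps-tube (training errors)."""
    return sum(r > eps for r in abs_residuals) / len(abs_residuals)

# Hypothetical absolute residuals |y_i - f(x_i)| of a fixed estimating function.
abs_residuals = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
nu = 0.25

# The term is piecewise linear and convex in eps, so its minimum is reached
# at eps = 0 or at one of the residual values.
candidates = [0.0] + abs_residuals
best_eps = min(candidates, key=lambda e: nu_cost_term(abs_residuals, e, nu))
error_fraction = fraction_outside(abs_residuals, best_eps)
```

With these values the minimum is reached at a tube width of 0.8, which leaves two of the ten points outside the tube, a fraction of 0.2 that does not exceed `nu`.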
It seems easier to pick a good $\nu$ value than an $\varepsilon$ value. Moreover, $\nu$-SVRM
is a superset of $\varepsilon$-SVRM: after training with the first method we can calculate the
$\varepsilon$ parameter value, which can be used in an $\varepsilon$-SVRM algorithm giving exactly the
same solution obtained in the first place.

2.3 Principal Component Analysis


Support Vector Machines (regression included) and non-linear Principal Component
Analysis (PCA) were the first applications in Machine Learning developed under the
idea of mapping to a high-dimensional space using Mercer kernels. They differ
in the problem to solve even though they use similar means. SVM is a supervised
algorithm, i.e. the system state changes depending on whether the output for a given pattern is
equal to the expected correct value or not. On the other hand, kernel PCA is an
unsupervised algorithm, i.e. there are no labels, and the output is the covariance analysis
of the training data distribution [Schölkopf et al, 1998].
PCA is an efficient method to extract structure from the input data, and
can be carried out by calculating the system eigenvalues and eigenvectors. Formally
speaking, PCA is a basis transformation of the input space that diagonalizes the
estimate of the covariance matrix of the normalized input data, which has the form:

(36) $C = \frac{1}{M}\sum_{i=1}^{M} x_i x_i^T$

where M is the number of patterns $x_i$.


The new coordinates described by the eigenvectors as a basis are called principal
components, i.e. the orthogonal projections of the data vectors onto the eigenvectors.
Eigenvalues $\lambda$ and eigenvectors V must be non-zero and satisfy $\lambda V = CV$.
We introduce the usual non-linearity concept, with the mapping function $\Phi$ and its
corresponding kernel. We assume there exist coefficients $\alpha_1, \ldots, \alpha_M$ such that:

(37) $V = \sum_{i=1}^{M} \alpha_i\,\Phi(x_i)$

and the corresponding kernel matrix K (as defined in previous sections). Then we
arrive at the problem:

(38) $M\lambda\alpha = K\alpha$

where $\lambda$ are the eigenvalues and $\alpha = (\alpha_1, \ldots, \alpha_M)$ are the eigenvector coefficients.


To extract the principal components for a given pattern, data projections in feature
space are calculated in the following form, where $\Phi$ is the mapping function:

(39) $(V^k \cdot \Phi(x)) = \sum_{i=1}^{M} \alpha_i^k\,(\Phi(x_i) \cdot \Phi(x)) = \sum_{i=1}^{M} \alpha_i^k\,K(x_i, x)$

In this notation, k is a superscript denoting the k-th eigenvector and its
set of coefficients. Note that after the previous calculation process, k non-zero
eigenvalues and eigenvectors are obtained, each one of them with a set of M
coefficients.
To implement the kernel PCA algorithm, the following steps must be taken:
1. The kernel matrix, of size M × M, must be calculated:

$K_{ij} = K(x_i, x_j)$ for all $i, j \in \{1, \ldots, M\}$

Here comes the first problem when using kernel PCA. The resources needed for any
matrix calculation grow at least with the square of its size, so with current algorithms
and hardware no more than 5000 data should be used. If you are provided with
more data for training (which should never be seen as an unfortunate case), a
representative subset must be created heuristically.
2. Diagonalize matrix K to calculate the eigenvalues and eigenvectors of equation (38),
using traditional methods, and normalize these vectors. After this
calculation we obtain the coefficients $\alpha^k = (\alpha_1^k, \ldots, \alpha_M^k)$, to be used in the
projection phase.
3. To extract non-linear principal components from a given pattern, point projections
over the eigenvectors must be calculated using equation (39). The number of
principal components (non-zero eigenvectors) to be used is the designer’s choice.
But not all of them should be used; otherwise the process would be useless. Just choose
the first k principal components, those with a significant amount of information
and very little noise.
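
The three steps above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not a reference implementation: it uses a Gaussian RBF kernel with `sigma = 1.0`, random data, and hypothetical function names. It follows the text in omitting the centring of the data in feature space, which practical implementations usually add.

```python
import numpy as np

def rbf_kernel_matrix(A, B, sigma=1.0):
    """Gaussian RBF kernel values between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def kernel_pca_fit(X, n_components, sigma=1.0, tol=1e-10):
    """Steps 1 and 2: build K, diagonalize it and normalize the coefficients."""
    K = rbf_kernel_matrix(X, X, sigma)                 # step 1: M x M kernel matrix
    eigvals, eigvecs = np.linalg.eigh(K)               # step 2: K is symmetric
    order = np.argsort(eigvals)[::-1][:n_components]   # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > tol                               # drop numerically zero ones
    eigvals, eigvecs = eigvals[keep], eigvecs[:, keep]
    # Scale so that each feature-space eigenvector V^k has unit norm:
    # V^k . V^k = alpha^k' K alpha^k = eigval * (alpha^k . alpha^k) = 1.
    alphas = eigvecs / np.sqrt(eigvals)
    return alphas, eigvals

def kernel_pca_transform(X_train, X_new, alphas, sigma=1.0):
    """Step 3: projections (39), sum_i alpha_i^k K(x_i, x), for each new pattern."""
    return rbf_kernel_matrix(X_new, X_train, sigma) @ alphas

rng = np.random.default_rng(0)
X_train = rng.normal(size=(30, 2))      # hypothetical data: M = 30 patterns, d = 2
alphas, eigvals = kernel_pca_fit(X_train, n_components=3)
Z = kernel_pca_transform(X_train, X_train, alphas)    # 30 x 3 matrix of new features
```

The rows of `Z` are the explicit new features, which can then be fed to any linear or non-linear classifier. Note that `eigvals` holds the eigenvalues of K, which correspond to $M\lambda$ in equation (38).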
After this simple process, you get a change of data space. From input space we
changed to a k-dimensional space (k being a fraction of M), in which each dimension
gives a useful feature taken from non-linear correlations in the training data set.
That is a conceptual difference between SVM training and kernel PCA training:
in the first case new features are implicitly generated, and many are dropped after
training; in the second case k new features are explicitly generated, all of them
carrying a lot of information, ordered from the most important downwards. The value k
has an upper bound of M, the number of training patterns. New features are made
explicit, so, obviously, the number k must not be too high or the computation
would be inefficient. So only the first components should be used, those having the
greatest possible variance, i.e. the largest eigenvalues, i.e. the most discriminant
information.
The usefulness of non-linear PCA in pattern recognition has been tested thoroughly,
attaining classification performances as good as the best non-linear SVM and well
above neural networks. The process is very simple: first calculate the projection
coefficients and select the best ones; then transform all patterns (training, validation
and test) explicitly into the new space; afterwards use these data to train a linear or
non-linear classification machine (SVM, neural networks, decision trees, …, any
one will do), which will be the true supervised classification process. When using
kernel PCA for classification, usually a linear SVM is used for supervised training,
giving enough flexibility to solve any non-linear problem.
The described process is very much like a one-hidden-layer neural network, in
which the architecture and the first layer weights are obtained by optimised means:
the eigenvectors of the covariance matrix.
Also because the new features are calculated explicitly, a multiclass SVM can be
trained easily: the first layer would be common (as in neural networks), and the
second layer (linear discriminant) can be calculated using the hyperplane w value
(now it can be calculated because feature space is no longer implicit), giving O(1)
complexity.
3. SVM VERSUS NEURAL NETWORKS

Neural networks (NN) have led the Machine Learning field since the 1980s thanks to their
simplicity of development and interpretation, together with a very competitive generalization
ability. Nevertheless, after 20 years, design and development
complexity has increased considerably when trying to solve secondary problems such as
convergence speed, new error calculation concepts, new activation functions with
additional constraints, local minimum prevention, and so on. So many years of
active research have turned NN from their initial simplicity to their current complexity, fit
only for specialised engineers.
Probably, as time goes by, SVM will follow a similar path, from current simplicity
to some degree of complexity, needing a human expert to bring out all of its potential.
Complexity by itself is not bad: higher method complexity usually leads to better
performance or classification rates. But basic NN research is currently scarce. For
the most part, it is about new applications where NN give better results for specific
architectures, so a qualitative jump is needed: SVM. This is quite a natural step in
human research, and many examples can be shown. When some technology gets to
its limit, a new approach must be introduced. At first, the performance of both methods
may be similar, but the new one will eventually outperform the old one. We
believe we are currently at the beginning of a technology jump, so it is a good time
to change sides.
All along this chapter, the relation between SVM and NN has been widely
established. It can be stated that a SVM object topology can be developed as a
one-hidden-layer perceptron. It has been demonstrated in NN literature that the
family of one-hidden-layer perceptron can act as a universal discriminator, i.e. it
can approximate any function.
For the sigmoidal activation function, the similarity between NN and SVM with kernel
(22) is complete (see figure 16). After the SVM optimisation process, we obtain
a network having d units in the first layer (input space dimension), N units in
the hidden layer (number of support vectors), and one unit in the output layer
(binary classifier), with weights $\alpha$ connecting the last two layers. For other kernels,
the similarity is somewhat lesser, although the resulting topology is still that of a (d, N, 1)
one-hidden-layer perceptron. Only the kernel function makes the difference. On
the other hand, the SVM with RBF kernel is like an RBF classification network
in which the clusters and their characteristics have been calculated using an automatic
optimal algorithm.

Figure 16. A neural network approach for the SVM implicit architecture. Note that the layers are
completely connected (although not explicitly shown for figure clarity). Also, all weights are equal to 1
except the ones connecting the hidden layer and the output layer, which are equal to the corresponding
support vector’s $\alpha$
When using a similar kernel and activation function, an important difference can
be observed. SVM tend to have a larger number of support vectors (hidden-layer
units) when facing a complex or noisy training set. Neural networks can attain
similar classification performance with far fewer internal units. This is essential
for speed in the test phase, because it depends directly on the number of elements in
the hidden layer. The number of multiply-add operations done in the test phase of
either method is N(d + 1).
The reason for such difference is mainly that support vectors defining the hidden
layer are constrained to be training points. Neural networks do not have such
constraint, so they need less elements to model the same function (it has greater
freedom degree). This does not mean that NN solution is better; it is quicker in
test phase, and topology complexity is lower, but generalization performance is not
affected.
Moreover, note that the errors allowed in the training phase (including those points
lying inside the margin) become support vectors. When optimising complex or
noisy training sets with loose error penalization, the number of training errors can
be very large.
But this problem was also solved during the first steps of SVM research. In
[Burges, 1996] the “reduced support vector set” method is described. Given a trained
SVM, this method creates a smaller support vector set representing approximately
the same information as the whole support vector set. In this case the former
constraint is eliminated, because the new virtual support vectors need not be training
points. The result is very much like the NN approach topology.
This new expansion solves the classification speed problem, making the SVM
competitive against other Machine Learning methods. Nevertheless, it is seldom
used because it is considerably difficult to develop.
Even more similarity can be found between NN and SVM classifiers when kernel
PCA is used as the feature extraction step. Units in the hidden layer are calculated
explicitly using the eigenvector projection instead of the kernel calculation. These
units are not significant training points but true features, all of which capture the
concepts underlying the internal data distribution. Thus, because these units carry
heavy statistical meaning, the classifier topology should be very similar to the one
generated by an experienced NN architect.
The only flexibility that a NN offers and the SVM cannot reach is the multiple-
hidden-layer approach (using kernel PCA plus a non-linear SVM could give up to 2
hidden layers, but it is seldom used). Although a one-hidden-layer topology is a
universal discriminator, having more hidden layers can make the training process
much more efficient.
Using that capability, there may be fewer units in the net, or convergence may be
faster. But the training algorithms grow more complex, and the overfitting and
local-minimum problems will still be there. Therefore, the main differences
between both methods are:
• Training one SVM requires many more computational resources than training one
NN.
• Classification speed is usually slower in SVM.
• The SVM result is the optimum, while NN can get stuck in local minima. Therefore
SVM usually outperforms NN in classification performance.
• SVM parameters are few and easy to use, while NN requires an experienced
engineer to create and try the right architecture.
• SVM usually needs only one execution to give the best results, while NN usually
requires many tries to give its best.
Outside the scientific community, money rules. Expert engineer time is much more
expensive than computing resources, and the difference will keep growing. If Machine
Learning algorithms are to be introduced massively in commercial products such
as knowledge management or data mining, automatic methods must be used. In
the real world, new data is always coming; new profiles arise while others are no
longer valid. Neural network flexibility must be tailored by an expert to fit the
current state. But, for a company, it may not be worth the cost of tailoring a Machine
Learning system that will become obsolete within some months. It is unavoidable:
craftsmen will eventually be replaced by machines.

4. SVM OPTIMISATION METHODS

4.1 Optimisation Methods Overview


SVM development tries to solve the problem described in (14): maximize $L_D$ with
respect to the Lagrange multipliers, subject to constraints (15) and (16).
When SVM appeared, the first approach to solving this problem was to use standard
optimisation methods, such as gradient descent or quasi-Newton. These methods,
long established in the mathematical literature, mainly apply complex operators over
the Hessian matrix (the matrix of second partial derivatives). These one-step
processes are intensive in both computation and memory. Memory requirements for
the matrices are $O(M^2)$, while computational requirements are $O(M^3)$, M being
the number of patterns in the training set. For instance, a 5000-point set will
require 100 Mbytes of memory using single-precision floating-point numbers. Any
process over such an enormous data structure will be very inefficient, and beyond
the ability of many machines.
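The memory figure quoted above follows from simple arithmetic: a dense M × M matrix of 4-byte values. The short sketch below reproduces it; the function name `hessian_bytes` is only illustrative, and a dense, unstructured matrix is assumed.

```python
def hessian_bytes(num_patterns, bytes_per_value=4):
    """Bytes needed to store a dense M x M matrix (4 bytes = single precision)."""
    return num_patterns ** 2 * bytes_per_value

for m in (1_000, 5_000, 100_000):
    print(f"M = {m:>7}: {hessian_bytes(m) / 1e6:,.0f} MB")
```

For M = 5000 this gives 100 MB, and the quadratic growth makes sets of 100,000 patterns (40,000 MB) clearly impractical with this storage scheme.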
The main line of research in the first years of life of SVM was the search for
alternative optimisation methods, developed explicitly for SVM mathematical use.
Many new approaches were published before one of them satisfied all researchers
with its simplicity and efficiency. The main methods, in chronological order of
appearance, are the following:
• The chunking method, developed by Vapnik. Points that are not support vectors
do not affect the Hessian matrix calculation; therefore, if we take them out
before the matrix calculation begins, the resulting Lagrange multipliers
remain the same. At the same time, the matrix calculation itself is easier: its
complexity is now $O(N^3)$, N being the number of support vectors, with $N \ll M$.
The problem is that we cannot know beforehand which patterns will become support
vectors and which ones will not. Therefore, the algorithm is described as:
  • Divide the training set randomly into subsets of the same (heuristic) size.
  • Initialise the “support vector set” to empty.
  • Repeat until there are no more subsets:
    • Join the “support vector set” with the next subset.
    • Optimise the joined subset (as if it were a complete SVM training set)
      using basic optimisation techniques (gradient descent, etc.).
    • Assign to the “support vector set” the patterns identified as support
      vectors in the last optimisation execution.
The idea behind this algorithm is to eliminate, at each step, those points that are
not support vectors in the current subset and are therefore unlikely to become
support vectors in a complete set optimisation. However, you could eliminate in one
step significant patterns that would become support vectors only in the presence of
patterns not in the current set. Of course, if N and M are of about the same
magnitude, then this algorithm is even worse than the basic methods.
• The decomposition method, described in [Osuna et al, 1997]. Following the
chunking idea, we can optimise just a small subset of training points with respect
to the complete training set. The similarities with the chunking method are: first,
small subsets are trained at each step; second, basic optimisation algorithms
are used at each step. There are also differences: first, each step calculates
the optimisation state for a small subset with respect to the whole training set
(therefore, no support vectors will be missed); second, the values of N and M
are not so strictly constrained. The resulting value is an approximation with
stronger guarantees of being close to the result of the basic method (which gives
the exact result), but at the same time it can handle many more data patterns.
The size of the small subset is heuristics-driven, depending on the hypothesis
complexity, the number of data points or the noise figures.
• The SMO method, described in [Platt, 1999]. The decomposition method is
taken to the limit. The smallest set of points you can optimise is two, so that
condition (16) still holds. But the remarkable difference is that a two-point
optimisation step can be calculated analytically, using a simple equation.
Therefore, the optimisation method becomes very simple:
  • Until we are close enough to the correct solution:
    • Choose two points.
    • Optimise those two points with respect to the complete training set.
Generating a SVM optimisation algorithm is no longer a mathematician’s field.
Calculating the optimisation step is easy, and selecting the two-point set is done
using a fixed set of heuristics, accepted by the scientific community as the best you
could imagine (until a better one is published). The SMO algorithm is the quickest
and the most scalable of them all, so there is no more controversy about which
optimisation method should be used.
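As an illustration of the chunking loop listed above, the following sketch keeps a running “support vector set” and re-optimises it together with each new subset. It is a sketch under stated assumptions, not Vapnik's implementation: `chunking`, `solve` and `toy_solver` are hypothetical names, and `toy_solver` replaces a real quadratic optimiser with the known answer for separable one-dimensional data with the negative class on the left (the support vectors are the closest pair of opposite-class points).

```python
import random

def chunking(patterns, chunk_size, solve, seed=0):
    """Chunking loop: solve(working_set) must return the support vectors it finds."""
    shuffled = list(patterns)
    random.Random(seed).shuffle(shuffled)       # random division into subsets
    support_set = []                            # "support vector set" starts empty
    for start in range(0, len(shuffled), chunk_size):
        # Join the current support vector set with the next subset ...
        working_set = support_set + shuffled[start:start + chunk_size]
        # ... optimise it as if it were a complete training set,
        # and keep only the patterns identified as support vectors.
        support_set = solve(working_set)
    return support_set

def toy_solver(working_set):
    """Stand-in for a QP solver, valid only for separable 1-D (x, y) patterns."""
    negatives = [p for p in working_set if p[1] < 0]
    positives = [p for p in working_set if p[1] > 0]
    found = []
    if negatives:
        found.append(max(negatives))            # right-most negative point
    if positives:
        found.append(min(positives))            # left-most positive point
    return found

data = [(-5.0, -1), (-3.0, -1), (-1.0, -1), (2.0, 1), (4.0, 1), (6.0, 1)]
print(sorted(chunking(data, chunk_size=2, solve=toy_solver)))
```

In this toy case the result does not depend on how the subsets are drawn. With a real solver, a pattern discarded early could have become a support vector later, which is the weakness noted in the description of the method.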
The greatest disadvantages of SVM with respect to other Machine Learning
methods are the training speed and the limit on training set size. Using the
decomposition algorithm, the biggest training set successfully tried had around
100,000 patterns. Time complexity is a bit harder to calculate, but it has been
found empirically to be $O(N^2)$, while memory requirements decrease dramatically
using SMO (no matrices needed).
When facing a really big problem (more than 100,000 points), the SVM user
should try divide-and-conquer algorithms, allowing the SMO algorithm to perform
more time-efficient tasks.

4.2 SMO Algorithm

After describing the main features of the optimisation methods, we will show the
basics of the SMO algorithm in more detail. This section is divided into two parts:
the takeStep function, where the two-point optimisation equation is implemented; and
the point pair choice, where the heuristics are implemented and one usually ignored
(but important to efficiency) parameter appears.

4.2.1 The takeStep Function


After reading Platt’s paper [Platt, 1999], where the first SMO pseudo-code appears,
the non-mathematician reader will have that uncomfortable “this is too complex for
me to understand” feeling.
Because we do not want to lose your interest in SVM (after all, you have read all
the way up to here), we will follow Platt’s pseudo-code for the takeStep
function, adding some intermediate steps that will make it easier for engineers
without a very strong mathematical background.
Consider two points, A and B, each defined by an input vector x and a label y, as
$(x_A, y_A)$ and $(x_B, y_B)$. We distinguish two cases, one where $y_A = y_B$ and
another where $y_A \neq y_B$.
• Case $y_A = y_B$.
Because of constraint (16), $\alpha_A + \alpha_B = H$ before and after optimisation,
where H is a constant real value, calculated using the previous $\alpha_A$ and
$\alpha_B$ values.
The SVM tries to maximize $L_D$ with respect to all $\alpha_i$. In one SMO step we
want to maximize $L_D$ with respect to $\alpha_A$ and $\alpha_B$. All other Lagrange
multipliers are constants in this optimisation step.
The main condition for this step is $\partial L_D / \partial \alpha_A = 0$, where

$$
L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i y_i \alpha_j y_j (x_i \cdot x_j)
$$
We can separate this expression into two additive terms:

$$
\mathrm{Set}_0 = \sum_{i=1}^{l} \alpha_i
$$

$$
\mathrm{Set}_u = -\frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i y_i \alpha_j y_j (x_i \cdot x_j)
$$

Suppose the set L is defined as the set of all training patterns except A and B.
We can further separate the expression $\mathrm{Set}_u$ into nine additive terms:

Set 1: when $i = A$ and $j = A$
Set 2: when $i = A$ and $j = B$
Set 3: when $i = A$ and $j \neq B$ and $j \neq A$
Set 4: when $i = B$ and $j = B$
Set 5: when $i = B$ and $j = A$
Set 6: when $i = B$ and $j \neq B$ and $j \neq A$
Set 7: when $i \neq A$ and $i \neq B$ and $j = A$
Set 8: when $i \neq A$ and $i \neq B$ and $j = B$
Set 9: when $i \neq A$ and $i \neq B$ and $j \neq A$ and $j \neq B$

And substitute $\alpha_B$ by $(H - \alpha_A)$ everywhere it appears. Then

$$
\frac{\partial L_D}{\partial \alpha_A} = \frac{\partial\, \mathrm{Set}_0}{\partial \alpha_A} + \sum_{k=1}^{9} \frac{\partial\, \mathrm{Set}_k}{\partial \alpha_A} = 0
$$
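The complete expression, with $K_{ij}$ denoting the kernel value for the pair
$(x_i, x_j)$, is:

$$
\begin{aligned}
\frac{\partial L_D}{\partial \alpha_A} ={}& \frac{\partial \left( \alpha_1 + \dots + \alpha_A + (H - \alpha_A) + \dots + \alpha_l \right)}{\partial \alpha_A} \\
&- \frac{1}{2} \frac{\partial \left( \alpha_A^2 K_{AA} \right)}{\partial \alpha_A} \\
&- \frac{1}{2} \frac{\partial \left( (H \alpha_A - \alpha_A^2) K_{AB} \right)}{\partial \alpha_A} \\
&- \frac{1}{2} \frac{\partial \left( \sum_{j=1}^{L} \alpha_j \alpha_A y_j y_A K_{jA} \right)}{\partial \alpha_A} \\
&- \frac{1}{2} \frac{\partial \left( (H - \alpha_A)^2 K_{BB} \right)}{\partial \alpha_A} \\
&- \frac{1}{2} \frac{\partial \left( (H \alpha_A - \alpha_A^2) K_{AB} \right)}{\partial \alpha_A} \\
&- \frac{1}{2} \frac{\partial \left( \sum_{j=1}^{L} \alpha_j (H - \alpha_A) y_j y_B K_{jB} \right)}{\partial \alpha_A} \\
&- \frac{1}{2} \frac{\partial \left( \sum_{i=1}^{L} \alpha_i \alpha_A y_i y_A K_{iA} \right)}{\partial \alpha_A} \\
&- \frac{1}{2} \frac{\partial \left( \sum_{i=1}^{L} \alpha_i (H - \alpha_A) y_i y_B K_{iB} \right)}{\partial \alpha_A} \\
&- \frac{1}{2} \frac{\partial \left( \sum_{i=1}^{L} \sum_{j=1}^{L} \alpha_i \alpha_j y_i y_j K_{ij} \right)}{\partial \alpha_A}
\end{aligned}
$$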

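After partial derivation:

$$
\begin{aligned}
\frac{\partial L_D}{\partial \alpha_A} ={}& 0 \\
&- \frac{1}{2} \, 2 \alpha_A K_{AA} \\
&- \frac{1}{2} (H - 2\alpha_A) K_{AB} \\
&- \frac{1}{2} \sum_{i=1}^{L} \alpha_i y_i y_A K_{Ai} \\
&- \frac{1}{2} \left( -2 (H - \alpha_A) \right) K_{BB} \\
&- \frac{1}{2} (H - 2\alpha_A) K_{AB} \\
&- \frac{1}{2} (-1) \sum_{i=1}^{L} \alpha_i y_i y_B K_{Bi} \\
&- \frac{1}{2} \sum_{i=1}^{L} \alpha_i y_i y_A K_{Ai} \\
&- \frac{1}{2} (-1) \sum_{i=1}^{L} \alpha_i y_i y_B K_{Bi} \\
&+ 0
\end{aligned}
$$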
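Adding up members with equal $K_{mn}$ term, setting the derivative to zero and
changing the sign of the whole expression:

$$
\begin{aligned}
0 ={}& y_A \sum_{i=1}^{L} \alpha_i y_i K_{Ai} \\
&- y_B \sum_{i=1}^{L} \alpha_i y_i K_{Bi} \\
&+ \alpha_A K_{AA} \\
&+ (H - 2\alpha_A) K_{AB} \\
&- (H - \alpha_A) K_{BB}
\end{aligned}
$$

where index i runs over the set L (all training points except A and B).
Therefore,

$$
\alpha_A = \frac{y_B \sum_{i=1}^{L} \alpha_i y_i K_{Bi} - y_A \sum_{i=1}^{L} \alpha_i y_i K_{Ai} + H (K_{BB} - K_{AB})}{K_{AA} - 2 K_{AB} + K_{BB}} \tag{40}
$$

and $\alpha_B$ is very easy to compute, because $\alpha_B = H - \alpha_A$.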
• Case $y_A \neq y_B$.
Because of constraint (16), $\alpha_B = H + \alpha_A$ before and after optimisation,
where H is a constant real value, calculated using the previous $\alpha_A$ and
$\alpha_B$ values. The mathematical development is very similar to the first case.
The formulas to calculate the step are:

$$
\alpha_A = \frac{- y_B \sum_{i=1}^{L} \alpha_i y_i K_{Bi} - y_A \sum_{i=1}^{L} \alpha_i y_i K_{Ai} + H (K_{AB} - K_{BB}) + 2}{K_{AA} - 2 K_{AB} + K_{BB}} \tag{41}
$$

and $\alpha_B = H + \alpha_A$.
Note that in the soft margin algorithm the $\alpha$ values must be between 0 and
C. This is not guaranteed by the formulas written here. Therefore, a last step
is needed: clipping. The $\alpha_A$ value must be set inside the defined bounds prior
to the $\alpha_B$ calculation. Afterwards, the $\alpha_B$ value must be checked to be
inside the same bounds, and otherwise clipped. Note that this clipping procedure only
needs two checks, and that the constraint defined by H must always hold. Nevertheless,
numerical stability must be ensured in the clipping procedure by using only the add
and subtract arithmetic operations, avoiding inexact floating-point multiply
operations.
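One possible way to code the clipping step is sketched below. Instead of checking the two multipliers one after the other, this variant computes the interval of $\alpha_A$ values for which both multipliers lie in [0, C] while the constraint on H is kept, which is the way the bounds are stated in Platt's paper. The function name `clip_pair` and its argument layout are assumptions made for illustration only.

```python
def clip_pair(alpha_a, h, c, same_class):
    """Clip alpha_a so that alpha_a and alpha_b both lie in [0, c].

    same_class=True  : alpha_a + alpha_b = h   (case y_A = y_B)
    same_class=False : alpha_b = h + alpha_a   (case y_A != y_B)
    Only additions, subtractions and comparisons are used.
    """
    if same_class:
        low, high = max(0.0, h - c), min(c, h)
    else:
        low, high = max(0.0, -h), min(c, c - h)
    alpha_a = min(max(alpha_a, low), high)       # clip alpha_a into [low, high]
    alpha_b = h - alpha_a if same_class else h + alpha_a
    return alpha_a, alpha_b

# Same class, h = 1.5, C = 1: alpha_a = 0.25 would force alpha_b = 1.25 > C.
print(clip_pair(0.25, 1.5, 1.0, same_class=True))
# Different class, h = 0.5, C = 1: a negative alpha_a is moved up to 0.
print(clip_pair(-0.3, 0.5, 1.0, same_class=False))
```

Because $\alpha_B$ is recomputed from the clipped $\alpha_A$ and H, the equality constraint is preserved by construction.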
In both cases, you still have to calculate a couple of $\sum$ terms, so the
complexity of the step is related to the number of points in the training set.
Suppose you implement an internal cache where those terms are calculated a priori.
Then, the step complexity would be O(1). But this improvement is not free: now you
have to update the cache value for all training points after each successful step,
so the complexity becomes O(M) all over again. Actually, you can limit the cache
update to the current non-bound support vectors, that is, the points that will most
probably be optimised in the next steps. Then, the complexity decreases to $O(N')$,
N' being the number of non-bound support vectors, and usually

$$
N' < N \ll M
$$

The cache used is the error term (usually defined as E in code implementations),
being

$$
\begin{aligned}
E(x_A) = E_A &= \mathrm{SVMCurrentLearnedFunction}(x_A) - y_A \\
&= (w \cdot x_A) + b - y_A \\
&= \sum_{i=1}^{N} \alpha_i y_i K_{Ai} + b - y_A
\end{aligned}
$$
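The error term above can be computed directly from its definition. The following minimal sketch assumes a precomputed kernel matrix stored as a list of lists; the name `error_term` and the two-point toy problem are illustrative assumptions.

```python
def error_term(a, alphas, labels, kernel_matrix, bias):
    """E_A = sum_i alpha_i * y_i * K_Ai + b - y_A for the point with index a."""
    output = bias
    for i, (alpha, y) in enumerate(zip(alphas, labels)):
        if alpha > 0.0:                  # only support vectors contribute
            output += alpha * y * kernel_matrix[a][i]
    return output - labels[a]

# Toy problem: x = +1 (class +1) and x = -1 (class -1) with a linear kernel.
K = [[1.0, -1.0], [-1.0, 1.0]]
y = [1, -1]

print(error_term(0, [0.0, 0.0], y, K, bias=0.0))   # untrained machine
print(error_term(0, [0.5, 0.5], y, K, bias=0.0))   # after training
```

Before training the output is 0 everywhere, so each error equals minus the label; once the multipliers reach 0.5 both points are classified with output equal to their label and the error vanishes.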

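Therefore, the previously defined term becomes:

$$
\begin{aligned}
\sum_{i=1}^{L} \alpha_i y_i K_{Ai} &= \sum_{i=1}^{N} \alpha_i y_i K_{Ai} - \alpha_A y_A K_{AA} - \alpha_B y_B K_{AB} \\
&= E_A - b + y_A - \alpha_A y_A K_{AA} - \alpha_B y_B K_{AB}
\end{aligned}
$$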

Substituting this term in (40) and (41) and expanding H gives a common and
very simple formula for both cases (same and different class):

$$
\alpha_A^{new} = \alpha_A^{old} + \frac{y_A (E_B - E_A)}{K_{AA} - 2 K_{AB} + K_{BB}} \tag{42}
$$
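Formula (42) translates almost literally into code. The sketch below computes only the unclipped value of the new $\alpha_A$; the function name `unclipped_alpha_a` is an assumption, and returning `None` when the denominator is not positive is a simplification chosen here (Platt's paper treats that case separately).

```python
def unclipped_alpha_a(alpha_a_old, y_a, e_a, e_b, k_aa, k_ab, k_bb):
    """New value of alpha_A from formula (42), before clipping to [0, C]."""
    eta = k_aa - 2.0 * k_ab + k_bb       # denominator of (42)
    if eta <= 0.0:
        return None                      # degenerate pair: skipped in this sketch
    return alpha_a_old + y_a * (e_b - e_a) / eta

# Toy problem: A is x = +1 (class +1), B is x = -1 (class -1), linear kernel,
# starting from alpha = 0, where E_A = -1 and E_B = +1.
print(unclipped_alpha_a(0.0, 1, e_a=-1.0, e_b=1.0, k_aa=1.0, k_ab=-1.0, k_bb=1.0))
```

For this two-point problem a single step moves $\alpha_A$ from 0 to 0.5, which is the multiplier value for which the separating function has margin 1 on both points.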
This cache concept has come along with the SMO algorithm from its very
beginnings. It may seem that no great advantage is gained by using and updating the
cache instead of calculating the $\sum$ terms at each optimisation step. The reason
why it is done this way lies in the heuristics.

4.2.2 The Heuristics


The greatest cost in the SMO algorithm is the error cache update. The optimisation
step itself has a very low impact when using this cache. Therefore, as was said
in the previous section, the fewer cache values updated, the better.
So, a two-fold search pattern must be used: first, the best optimisation pair must
be tried at each step (the one giving the highest increase in $L_D$); second, the
fewer pairs inspected for one step, the better.
For the first issue, we could use some of the hand-made optimisation rule
examples given in section 2.2. In much the same way as those rules were figured
out, the main heuristic defined in SMO is: choose the pair of points whose error
cache difference is the greatest. Note that one of the effects of a two-point
optimisation is that both points become “well classified”, i.e., their error value
becomes 0. Looking at the $L_D$ formula analysis in section 2.2, it can be seen that
classification correctness is one of the main signs of reaching the optimum. If you
optimise the two points which are most poorly classified, then the increase in the
$L_D$ value is likely to be the highest. This approach is only a heuristic. It may
not be formally demonstrable, but it seems very plausible.
For the second issue, we need a search pattern using the fewest error
calculations possible. It has been shown that, at an intermediate optimisation step,
those points whose Lagrange multiplier is at bound ($\alpha_i = 0$ or $\alpha_i = C$)
will probably remain that way throughout the rest of the training. Thus, updating
the error cache of these points is not worthwhile.
Most of the time the non-bound points will be used for the optimisation procedure
(and therefore, only these points will have their cache updated). In that order of
usefulness, support vectors at bound ($\alpha_i = C$) and non-support-vectors
($\alpha_i = 0$) will most probably remain non-significant throughout the rest of the
training.
Therefore, the choice of both points follows the same search pattern:
• First, try successful steps using every non-bound point (fine-grain).
• If not successful, try on the whole set (coarse-grain).
The training procedure can be seen as a two-loop approach, each loop selecting
one of the points of the optimisation pair (called i1 and i2 in the original SMO
pseudo-code).
Loop until LD is high enough
    Choose next i1 following the two-set approach heuristic.
    Loop until an i2 choice is good enough or i1 cannot be optimised;
    choose i2 in one of three ways:
        First try the point having the greatest error (cache) difference.
        If not successful, try all support vectors.
        If not successful, try all patterns.
        If not successful, i1 cannot be optimised at this step.
    If the i2 choice is successful, then takeStep(i1, i2).
Actually, the implementation is slightly different, as the takeStep procedure is
used to determine whether the optimisation step is successful. But the idea behind
this simple pseudo-code is quite a close approximation.
Remember that points at bound do not have their error cache updated. Therefore,
when choosing i2, if the first try fails, then points with a non-updated cache are
used, decreasing the efficiency of the current step.
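The three-way choice of i2 in the pseudo-code above can be sketched as an ordered list of candidates that the caller tries in turn until takeStep succeeds. The function name `second_choice_order` is hypothetical, the first stage is restricted to non-bound points (the ones with an updated error cache), and the scan order inside each stage is deterministic here, whereas Platt's pseudo-code starts each scan at a random position.

```python
def second_choice_order(i1, errors, alphas, c):
    """Order in which candidate i2 points are tried for a given i1."""
    n = len(alphas)
    non_bound = [i for i in range(n) if i != i1 and 0.0 < alphas[i] < c]
    support = [i for i in range(n) if i != i1 and alphas[i] > 0.0]
    order, seen = [], set()

    def push(i):
        if i not in seen:
            seen.add(i)
            order.append(i)

    # 1) The point with the greatest error (cache) difference with respect to i1.
    if non_bound:
        push(max(non_bound, key=lambda i: abs(errors[i1] - errors[i])))
    # 2) All support vectors.
    for i in support:
        push(i)
    # 3) All remaining patterns.
    for i in range(n):
        if i != i1:
            push(i)
    return order

alphas = [0.5, 0.0, 1.0, 0.3, 0.7]
errors = [0.2, -1.0, 0.9, -0.4, 0.1]
print(second_choice_order(0, errors, alphas, c=1.0))
```

With the toy values above, point 3 comes first (largest cached error difference among non-bound points), then the remaining support vectors 2 and 4, and finally the non-support-vector 1.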
Along the same lines, there is a usually forgotten parameter in the SMO code
that is very important for attaining greater efficiency. In the previous
pseudo-code, there is no mention of when to switch the set used to choose i1. Current
implementations use a very simple decision: when you reach an optimisation state
where no first-set points can be optimised, switch to the bigger set. This condition
is also used for the algorithm termination: when you are already using the bigger
set and no points can be optimised, the current numerical approximation is good
enough.
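The sharp switching rule just described can be sketched as an outer loop that alternates between the whole training set and the non-bound subset. In this sketch `smo_outer_loop` and `examine` are hypothetical names: `examine(i1)` stands for the search for a partner i2 followed by takeStep, and is assumed to return True when a successful step was made. The `max_passes` guard is an extra safety limit added here.

```python
def smo_outer_loop(alphas, c, examine, max_passes=1000):
    """Alternate between full-set and non-bound passes; return the number of passes."""
    examine_all = True
    passes = 0
    while passes < max_passes:
        passes += 1
        if examine_all:
            candidates = list(range(len(alphas)))            # coarse-grain pass
        else:
            candidates = [i for i, a in enumerate(alphas)    # fine-grain pass
                          if 0.0 < a < c]
        changed = sum(1 for i1 in candidates if examine(i1))
        if examine_all:
            if changed == 0:
                break                    # nothing left to optimise: terminate
            examine_all = False          # continue with the non-bound set
        elif changed == 0:
            examine_all = True           # non-bound set exhausted: use the bigger set
    return passes

# Dummy examine() that reports success for its first three calls only.
budget = [3]
def fake_examine(i1):
    if budget[0] > 0:
        budget[0] -= 1
        return True
    return False

print(smo_outer_loop([0.0, 0.5, 1.0, 0.2], 1.0, fake_examine))
```

The dummy run performs one full pass with three successful steps, one non-bound pass with none, and a final full pass that confirms termination.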
The effect of these sharp limits is lower efficiency. Suppose you are at an early
optimisation step where at-bound points are still far from being in their correct state
(a very common case). A set of non-bound points is ready for optimisation. The
sharp-limit algorithm will optimise that small set until it is completely correct. That
approach is not at all cheap: it is like generating a correct smaller SVM using some
constant information. At each optimisation step, many other points could become
non-optimal. Then, another pair will be chosen, and the previous pair could become
non-optimal again. Fine-grain optimisation processes move at a very slow pace.
And after all that work has been performed, a good coarse-grain optimisation step
can make all the previous fine-grain computation useless. Of course, fine-grain steps
are needed as well as coarse-grain steps. A balance must be struck that optimises
the computational cost. One coarse-grain step always costs more than a fine-grain
step, but the objective function increase in the former can be much greater than
in the latter. In early steps, coarse-grain optimising performs better, while in
close-to-optimum steps, fine-grain optimising is more cost-efficient. However, this
analysis does not lead to a sharp rule; it is more like a fuzzy rule, where some
heuristics must be used.
Note that you can always know approximately how far you are from the solution
at any point in the optimisation procedure. At optimality, the difference
$L_P - L_D$ is 0. In complex problems, the difference will hardly ever be exactly 0,
as a numerical approximation is used and some degree of floating-point error is
allowed (using a precision parameter). But it gives a very good estimate of whether
a fine-grain or a coarse-grain optimisation step is more likely to give better
results. Note that the difference is non-negative, as $L_P$ is always greater than
(or, at optimality, equal to) $L_D$ (see section 2.2).
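A possible way to monitor this difference is sketched below. The $L_D$ expression is the one used throughout this section; for $L_P$ the sketch assumes the usual soft-margin primal, $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$, evaluated at the current multipliers and bias, which may differ in detail from the formulation of section 2.2. All function names are illustrative.

```python
def quadratic_term(alphas, labels, kernel_matrix):
    """sum_i sum_j alpha_i alpha_j y_i y_j K_ij, which equals ||w||^2."""
    n = len(alphas)
    return sum(alphas[i] * alphas[j] * labels[i] * labels[j] * kernel_matrix[i][j]
               for i in range(n) for j in range(n))

def dual_objective(alphas, labels, kernel_matrix):
    return sum(alphas) - 0.5 * quadratic_term(alphas, labels, kernel_matrix)

def primal_objective(alphas, labels, kernel_matrix, bias, c):
    # Assumed soft-margin primal: 0.5 * ||w||^2 + C * (sum of slack values).
    n = len(alphas)
    outputs = [sum(alphas[j] * labels[j] * kernel_matrix[i][j] for j in range(n)) + bias
               for i in range(n)]
    slack = sum(max(0.0, 1.0 - labels[i] * outputs[i]) for i in range(n))
    return 0.5 * quadratic_term(alphas, labels, kernel_matrix) + c * slack

def duality_gap(alphas, labels, kernel_matrix, bias, c):
    return (primal_objective(alphas, labels, kernel_matrix, bias, c)
            - dual_objective(alphas, labels, kernel_matrix))

# Toy problem: x = +1 (class +1) and x = -1 (class -1) with a linear kernel.
K = [[1.0, -1.0], [-1.0, 1.0]]
y = [1, -1]
print(duality_gap([0.0, 0.0], y, K, bias=0.0, c=1.0))   # far from the optimum
print(duality_gap([0.5, 0.5], y, K, bias=0.0, c=1.0))   # at the optimum
```

In the toy problem the gap starts at 2C for the untrained machine and drops to 0 at the optimum, so it can serve as a rough progress indicator when deciding between fine-grain and coarse-grain steps.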
The basic SMO has been enhanced in several ways, fulfilling the prediction
stated in section 4: SVM algorithms will get more complex. Keerthi proposed
in [Keerthi et al., 1999] two modifications to the SMO heuristics that greatly
improve performance, but they are a bit more difficult to follow. Also, C.J. Lin
created and maintains LIBSVM (Library for Support Vector Machines), a freely
available, open-source, unreadable, optimised implementation of SMO, considered
worldwide as the fastest SVM training library.

5. CONCLUSIONS

The SVM community is growing fast. Nowadays all commercial pattern recognition
toolboxes include the SVM algorithms as a standard Machine Learning method. In the
early years, from 1995 to 1999, its implementation was too hard for the engineering
community to develop and maintain. But since the publication of the SMO algorithm
in 1999, the whole Machine Learning community has been making great strides towards
accepting the SVM as a reference method.
SVM theory has very strong foundations. Statistical learning theory has its roots
deep in mathematical and logic knowledge, and SVM is just a new mathematically
developed appendix. Although its implementation may seem difficult, the solid
background makes the SVM basics very easy to understand, making the results
clear and useful for analysis.

6. ACKNOWLEDGEMENTS

The authors would like to thank Ignacio Melgar for text review, and the I + D
department at Sener Ingeniería y Sistemas for their support.

REFERENCES
C. J. C. Burges. Simplified support vector decision rules. In L. Saitta, editor, Proc. 13th International
Conference on Machine Learning, pages 71–77, San Mateo, CA, 1996.
C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery
and Data Mining, 2(2), pages 121–167, 1998.
C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273–297, 1995.
R. Fletcher. Practical methods of optimization. John Wiley and Sons, Inc, 2nd edition, 1987.
S. Keerthi, S. Shevade, C. Bhattacharyya and K. Murthy, Improvements to Platt’s smo algorithm for svm
classifier design. Technical Report CD-99-14, Control Division, Dept. of Mechanical and Production
Engineering, National University of Singapore, Singapore, August 1999.
E. Osuna, R. Freund and F. Girosi. An improved algorithm for support vector machines. In Proceedings
of the 1997 IEEE Workshop on Neural Networks for Signal Processing 7, pages 276–285, Amelia
Island, FL, 1997.
E. Osuna and F. Girosi. Reducing run-time complexity in SVMs. In Proceedings of the 14th International
Conf. on Pattern Recognition, pages 271–284, Brisbane, Australia, 1998.
J. Platt. Fast training of support vector machines using sequential minimal optimisation. In B. Schölkopf,
C. Burges and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 185–208.
MIT Press, Cambridge, MA, 1999.
B. Schölkopf, P. Y. Simard, A. J. Smola, and V. N. Vapnik. Prior knowledge in support vector kernels.
In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in Neural Information Processing
Systems, volume 10, pages 640–646, Cambridge, MA, 1998. MIT Press.
V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
V. Vapnik, S. Golowich, and A. Smola. Support vector method for function approximation, regression
estimation, and signal processing. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in
Neural Information Processing Systems 9, pages 281–287, Cambridge, MA, 1997. MIT Press.
CHAPTER 8
FRACTALS AS PRE-PROCESSING TOOL
FOR COMPUTATIONAL INTELLIGENCE APPLICATION

ANA M. TARQUIS (1), VALERIANO MÉNDEZ (1), JUAN B. GRAU (1),
JOSÉ M. ANTÓN (1), DIEGO ANDINA (1, 2)
(1) Dpto. de Matemática Aplicada, E.T.S. de Ingenieros Agrónomos, U.P.M., Av. Complutense s.n.,
Ciudad Universitaria, Madrid 28040, Spain.
(2) Dpto. de Señales, Sistemas y Radiocomunicaciones, E.T.S. Ingenieros de Telecomunicación, U.P.M.,
Av. Complutense s.n., Ciudad Universitaria, Madrid 28040, Spain.

Abstract: Preprocessing is the process of adapting the input of our Computational Intelligence (CI)
problem to the CI technique applied. Images are inputs of many problems, and fractal
processing of the images to extract relevant geometric characteristics is a very important
tool. This chapter is dedicated to fractal preprocessing. In Pedology, fractal models were
fitted to match the structure of soils, and techniques of multifractal analysis of soil images
were developed, as described in a state-of-the-art panorama. A box-counting method
and a gliding-box method are presented, both of which obtain sets of dimension
parameters from images. They are evaluated in a case study on images of soil samples,
in which the second method seems preferable. Finally, a comprehensive list of references
is given.

Keywords: pedology; soil structure; multifractal; soil images; box-counting method; gliding-box
method; capacity dimension; information dimension; correlation dimension; multiscale
heterogeneity; fractal models; porous media; partition function

INTRODUCTION

Methods of analysis based on fractal theories have been developed in Pedology to
describe the structure of soil. Owing to the combined action of geology, water, air
and organisms, natural soils present a structure down to very small scales that
is related to their physical, biological and agricultural properties. That structure
tends to correspond to fractal paradigms, as it maintains a fairly similar
structure when the scale is reduced to small microscopic ranges. Fractal models with
a reduced number of parameters have been developed to describe naturally complex
soils, and different experimental methods were created to compare theory with
reality or to evaluate parameters, including representative images of soils treated
so as to maintain natural structure features while marking pores, flow patterns,
etc. The next section contains a condensed panorama of the corresponding state of
the art, with references to authors and methods. The third section explains the most
common methods for subdividing an image to calculate fractal dimensions. Later, two
related methods of image description are presented with formulae, a box-counting
method and a gliding-box method, which assume a multifractal structure and yield
parameters that include the capacity, information and correlation dimensions. Next,
a case study of three images is presented; using both methods, a wide range of
generalized dimensions is plotted for the three samples, and these results are
described and discussed to assess the methods. Finally, conclusions and a list of
references are provided.

D. Andina and D.T. Pham (eds.), Computational Intelligence, 193–213.
© 2007 Springer.

1. STATE OF THE ART

Many parameters may be used in the attempt to describe a disordered morphology,
but the spatial arrangement of its most prominent features is a challenging problem
throughout a wide range of disciplines (Ripley, 1988; Griffith, 1988; Baveye and
Boast, 1998). In the case of 2-dimensional images of soil sections, several works
try to describe the spatial structure by applying fractal techniques. Some of them
studied the spatial arrangement of pore and solid spaces on images of sections of
resin-impregnated soil (Protz and VandenBygaart, 1998; VandenBygaart and Protz,
1999). Thin soil sections are analysed by transmitted light to obtain images from
which pores, filled with a resin, and solid spaces can be separated using image
analysis techniques (Morán et al., 1989; Vogel and Kretzschmar, 1996). In other soil
science areas, dye tracers are frequently used to study flow patterns in structured
soils, and with modern photographic techniques, they may provide excellent spatial
resolution of the flow paths (Flury and Fluhler, 1994; 1995). The most common
approach to describing dye patterns has been descriptive statistics of the vertical
variation in dye coverage or shape parameters (Flury et al., 1994), based on a black
and white image like the ones used for soil-pore structure.
A main objective was to extract fractal dimensions which characterize multi-
scale and self-similar geometric structures within the images. In other words, we
expect to see that the image viewed at different resolutions looks the same. At
a given resolution we should see the matrix as a collection of subsets similar to
each other and furthermore similar to the whole. If such a hierarchical organization
exists, it can be characterized by a mass fractal dimension D. These structures are
either one of the phases (black or white) or the interface between the two within
a 2-dimensional image. That such dimensions can be extracted from soil images
is evident, from soil pore structure (Brakensiek et al., 1992; Peyton et al., 1994;
Crawford et al., 1993; 1995; Anderson et al., 1996; 1998; Pachepsky et al., 1996;
Giménez et al., 1997; 1998; Oleschko et al., 1997; 1998a; 1998b; Hallet et al.,
1998; Bartoli et al., 1991; 1998; 1999; Dathe et al., 2001; Bird et al., 2000; 2003)
or flow paths in soils (Hatano and Booltink, 1992; Hatano et al, 1992; Booltink
et al., 1993).

More recently, interest has turned to multifractal analysis of soil images. A mul-
tifractal or more precisely a geometrical multifractal (Tel and Vicsek, 1987) is
a non-uniform fractal which unlike a uniform fractal exhibits local density fluc-
tuations. Its characterization requires not a single dimension but a sequence of
generalized fractal dimensions (Pachepsky et al., 2000). A multifractal analysis to
extract these dimensions from a soil image may have utility for more complex
distributions, if there is marked variation in local density or porosity.
Multifractal analysis (MFA) has been applied to images of rock pore systems
(Muller and McCauley, 1992; Saucier, 1992; Muller et al., 1995; Muller, 1996;
Saucier and Muller, 1999; Saucier et al., 2002), reviewed in the context of soils by
Tarquis et al. (2003) and recently applied to soils by Posadas et al. (2003). In other
areas, researchers have calculated fractal dimensions of dye patterns in horizontal
sections of undisturbed soil cores and related them to the outflow of percolating
dye solution. Baveye et al. (1998) calculated the information dimension ($D_1$) and
the correlation dimension ($D_2$) of dye stain patterns in vertical sections in a field
soil. Based on these calculations, models have approached the variability shown
in the dye images using diffusion limited aggregation techniques (Persson et al.,
2001).
Several authors have shown that obtaining the exact value of the generalized
dimension is not an easy task (Baveye et al., 1998; Baveye and Boast, 1998; Crawford
et al., 1999; Ogawa et al., 1999), pointing to practical difficulties in extracting
generalized dimensions (Vicsek, 1990; Buczkowski et al., 1998). Merits and limitations
of the multifractal analysis have been discussed by Aharony (1990), Beghdadi et al.
(1993), and Andraud et al. (1994). In the same way, Chhabra et al. (1989) pointed
out the risks in the estimation of the Hausdorff dimension and cited different sources
of errors in some physical cases. The difficulties arising in practice are due to
the fact that the relevant quantities used in the multifractal concept are estimated
asymptotically, and in image analysis these estimations are much coarser, limited
by the finite resolution of the image (Ahammer et al., 2003) and by the measure
built on it, such as the number of black points in a box (Buczkowski et al., 1998).
Some authors have also pointed out the influence of the percentage of black pixels
in a 2-dimensional image on the $D_q$ obtained (Dathe and Thullner, 2005; Bird et al.,
2005; Tarquis et al., 2005; Dathe et al., 2005).
On the other hand, MFA involves partitioning the space of study into boxes
to construct samples at multiple scales. The number of samples at a given
scale is restricted by the size of the partitioning space and the data resolution, which is
usually another main factor influencing statistical estimation in MFA (Cheng and
Agterberg, 1996).
The purpose of this chapter is to assess whether a spectrum
of generalized dimensions can be successfully extracted from a soil image, while avoiding
the restrictions of the box-counting method, and to discern whether a
multifractal distribution of black or white pixels exists within the image.
In the following sections, we review the box-counting algorithm used to obtain
the generalized fractal dimensions of multifractal analysis, as well as the gliding-box
method. We then show some examples that highlight the shortcomings of these methods.

2. FRACTAL CALCULATIONS
The main purpose of this section is to introduce some of the fractal methods used
in the context of black and white image analysis of soil structure. A complete
treatment of fractal and multifractal theories can be found, among others, in Feder
(1989) and Baveye and Boast (1998).

2.1 Box-counting Method


This methodology is classical in this field and has generated a large volume of work.
If a fractal line in a 2-dimensional space is covered by boxes of side length δ, the
number of such boxes, n(δ), needed to cover the line when δ → 0 is (Mandelbrot,
1982):

(1) $n(\delta) = c\,\delta^{-D_L}$

The length of the line studied (e.g., the pore-solid interface), L(δ), can be defined
at different scales and is equal to δ n(δ). At small δ values the method provides a
good approximation to the length of the line because the resolution of the image
is approached. At larger sizes the difference between δ n(δ) and the “true” length
increases. Thus, DL, or capacity dimension, is estimated using small δ values
(Gimenez et al., 1997b). The box-counting method is also used to obtain a fractal
dimension of pore space by counting boxes that are occupied by at least one pixel
belonging to the class “pore” (Gimenez et al., 1997b).
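
The box-counting estimate described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the cited works: it assumes a binary NumPy array in which True marks the object of interest (interface or pore pixels), the function names box_count and capacity_dimension are hypothetical, and the limit of vanishing box size is replaced by a least-squares slope over a finite set of box sizes.

```python
import numpy as np

def box_count(img, delta):
    """Number of delta x delta boxes that contain at least one object pixel."""
    img = np.asarray(img, dtype=bool)
    h, w = img.shape
    h, w = h - h % delta, w - w % delta            # crop to a multiple of delta
    blocks = img[:h, :w].reshape(h // delta, delta, w // delta, delta)
    return int(np.count_nonzero(blocks.any(axis=(1, 3))))

def capacity_dimension(img, deltas):
    """Minus the slope of log n(delta) against log delta."""
    counts = [box_count(img, d) for d in deltas]
    slope, _ = np.polyfit(np.log(deltas), np.log(counts), 1)
    return -slope

# Usage on a synthetic test object: a one-pixel-wide diagonal line.
line = np.eye(64, dtype=bool)
print(capacity_dimension(line, [1, 2, 4, 8]))
```

For the straight line the box count halves each time the box size doubles, so the estimate is close to 1; a completely filled image gives a value close to 2. Real soil images only approximate such scaling over a limited range of box sizes, which is why the choice of sizes matters.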

2.2 Dilation Method


Dathe et al. (2001) is the only published report of this method in soil science.
The dilation method follows essentially the same procedure as the box-counting
method, but instead of using boxes it uses other structuring elements to cover the
object under study, e.g., circles (Dathe et al., 2001). The image is formed by pixels,
which are either square or rectangular in shape. If circles are used, the measure
of scale is their diameter δ (as is the side length of the box in the box-counting
technique). To obtain the same dilation in every direction, the diagonal
increments should be weighted by √2 relative to the orthogonal ones, which corresponds
to the diagonal of a square of unit side length (Kaye, 1989). The length of the studied object, L(δ), is
counted in numbers of circles, and DL is then obtained from the slope of the regression line between
the log of the object length and the log of the circle diameter, according to the
relation:

(2) $L(\delta) = c\,\delta^{1-D_L}$

Dathe et al. (2001) applied the box-counting and dilation methods to the same
images and found no significant differences in the values of the fractal dimensions
obtained with the two methods. They pointed out, however, that the fractal dimensions
estimated with the two methods are conceptually different: the box-counting dimension is the
Kolmogorov dimension, while the dimension obtained with the dilation method is the
embedding dimension (Minkowski-Bouligand). For further details, see Takayasu’s
work (Takayasu, 1990).
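
A rough sketch of the dilation idea is given below. It is not the procedure of Dathe et al. (2001): as a simplifying assumption, the length at scale δ is approximated by the area of the object dilated with a disc of diameter δ, divided by δ, and the dilation is clipped to the image frame. The function names dilate and dilation_dimension are hypothetical.

```python
import numpy as np

def dilate(img, radius):
    """Dilate a binary image with a disc of the given pixel radius."""
    img = np.asarray(img, dtype=bool)
    h, w = img.shape
    out = np.zeros((h, w), dtype=bool)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy * dy + dx * dx > radius * radius:
                continue                           # offset lies outside the disc
            # OR in a copy of img shifted by (dy, dx), clipped to the frame.
            src = img[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
            out[max(0, dy):min(h, h + dy), max(0, dx):min(w, w + dx)] |= src
    return out

def dilation_dimension(img, radii):
    """Estimate D_L from the slope of log L(delta) against log delta."""
    diameters = np.array([2 * r + 1 for r in radii], dtype=float)
    # Assumption: L(delta) is approximated by (dilated area) / delta.
    lengths = np.array([dilate(img, r).sum() for r in radii]) / diameters
    slope, _ = np.polyfit(np.log(diameters), np.log(lengths), 1)
    return 1.0 - slope                             # slope = 1 - D_L

# Usage: a horizontal line spanning the full image width.
line = np.zeros((101, 101), dtype=bool)
line[50, :] = True
print(dilation_dimension(line, [1, 2, 3, 4]))
```

For this line the dilated area grows in proportion to the diameter, so the measured length does not depend on scale and the estimate is close to 1. Using discs instead of square boxes treats all directions alike, which is the motivation for the method given in the text.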

2.3 Random Walk

Fractal methods can also be used to describe the dynamic properties of fractal
networks (Crawford et al., 1993; Anderson et al., 1996). Fractals
involving space and time are characterized through the use of fractons (Kaye, 1989) or
the spectral dimension (Orbach, 1986). For example, Crawford et al. (1999) related
measurements of the spectral dimension d to diffusion through soil, associating d
with the degree to which the network delays a diffusing particle in a
given direction.
The determination of d is based on random walks, where in each walk the
number of steps taken (ns) and the number of different pore pixels visited (Sn) are
computed. At the beginning of the random walk a pore pixel is randomly chosen,
then a random step is taken to another pore pixel from the eight pixels surrounding
the present one (see Figure 1A). If the new pore pixel has not been visited by the
random walk, both Sn and ns are increased by one; otherwise only ns is increased.
The random walk stops when a certain number of null steps (steps into a
site that has already been used during the walk) is reached or the random walk

Figure 1. Possible steps taken in a: a) eight-connected random walk, and b) four-connected random
walk. The present position of the pore pixel is marked by an x, and the arrows indicate the possible
next pore pixel. (From Tarquis et al. In: Scaling Methods in Soil Physics, Pachepsky, Radcliffe and
Selim Eds., CRC Press, 2003. With permission)

arrives at an edge of the image (for further details see Crawford et al., 1990). A
graphical representation of these random walks is shown in Figure 2. The number
of walks and the maximum number of null steps for each walk can vary (Anderson
et al., 1996). A four-connected random walk (Figure 1B) can also be used instead
of an eight-connected one (Anderson et al., 1996).
For each random walk, d is calculated based on the relation:

(3) $S_n = c\,(n_s)^{d/2}$

where c is a constant. The mean value of the d calculated for each walk is the
spectral dimension.
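
The random-walk procedure can be sketched as follows. This is an illustrative reading of the description above, not the code of the cited authors. It assumes the usual scaling in which the number of distinct sites visited grows as the d/2 power of the number of steps, so that d is twice the slope of log Sn against log ns. As further assumptions, the start pixel is drawn from the interior of the image and is marked as visited without being counted in Sn. All names are hypothetical.

```python
import numpy as np

STEPS_8 = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
STEPS_4 = [(-1, 0), (0, -1), (0, 1), (1, 0)]

def random_walk(pore, rng, max_null_steps=50, steps=STEPS_8):
    """Return arrays (n_s, S_n) recorded after every step of one walk."""
    h, w = pore.shape
    # Assumption: start from an interior pore pixel so the walk does not
    # stop immediately at the image edge.
    ys, xs = np.nonzero(pore[1:-1, 1:-1])
    k = rng.integers(len(ys))
    y, x = int(ys[k]) + 1, int(xs[k]) + 1
    visited = {(y, x)}
    n_s = s_n = null_steps = 0
    n_hist, s_hist = [], []
    while null_steps < max_null_steps and 0 < y < h - 1 and 0 < x < w - 1:
        options = [(y + dy, x + dx) for dy, dx in steps if pore[y + dy, x + dx]]
        if not options:                            # isolated pixel: nowhere to go
            break
        y, x = options[rng.integers(len(options))]
        n_s += 1
        if (y, x) in visited:
            null_steps += 1                        # null step: site already used
        else:
            visited.add((y, x))
            s_n += 1
        n_hist.append(n_s)
        s_hist.append(s_n)
    return np.array(n_hist), np.array(s_hist)

def spectral_dimension(n_hist, s_hist):
    """Twice the slope of log S_n against log n_s for one walk."""
    slope, _ = np.polyfit(np.log(n_hist), np.log(s_hist), 1)
    return 2.0 * slope

# Usage: average d over several walks on a fully porous test image.
rng = np.random.default_rng(0)
pore = np.ones((201, 201), dtype=bool)
walks = [random_walk(pore, rng) for _ in range(20)]
d_values = [spectral_dimension(n, s) for n, s in walks if len(n) > 1]
print(np.mean(d_values))
```

Passing STEPS_4 as the steps argument gives the four-connected variant of Figure 1B. The number of walks and the null-step limit are free parameters, as noted in the text, and the estimate can be sensitive to both.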

Figure 2. A simplified example of one random walk through the pore space (Anderson et al., 1996)
(From Anderson et al. Soil.Sci.Soc. A. J., 60, 962,1996. With permission)

3. CALCULATION OF GENERALIZED FRACTAL DIMENSIONS

3.1 Box-counting Method

Generalized dimensions calculated using the box-counting technique basically
account for the mass contained in each box. An image is divided into n boxes of
size r (pixels in each dimension, making r · r pixels per box), designated as n(r),
and for each box the fraction μi of pore space in that box is defined and calculated
as

(4) $\mu_i = \dfrac{m_i}{M} = m_i \Big/ \sum_{i=1}^{n(r)} m_i$

where mi is the number of pore class pixels in the ith box and M is the total number of pore class
pixels in the image. In this case, the pore space area is the measure whereas the
support is the pore space itself. The next step is to define the generating function
χ(q, r) as:

(5) $\chi(q, r) = \sum_{i=1}^{n(r)} \mu_i(q, r), \quad q \in \mathbb{R}$

where

(6) $\mu_i(q, r) = \mu_i^{\,q} = \left( m_i \Big/ \sum_{i=1}^{n(r)} m_i \right)^{q}$

μi is a weighted measure that represents the percentage of pore space in the ith
box, and q is the weight or moment of the measure. When computing boxes of size
r, the possible values of mi are from 0 to r · r. Therefore, let Nj(r) be the number
of boxes containing j pixels of pore space in that grid. Equations (5) and (6) will
then be (Barnsley et al., 1988):

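(7) $\chi(q, r) = \sum_{i=1}^{n(r)} \mu_i^{\,q} = \sum_{i=1}^{n(r)} \left( \dfrac{m_i}{M} \right)^{q} = \sum_{j=1}^{r \cdot r} N_j(r) \left( \dfrac{j}{M} \right)^{q} = \sum_{j=1}^{r \cdot r} N_j(r) \left( \dfrac{j}{\sum_{k=1}^{r \cdot r} k\, N_k(r)} \right)^{q}$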

Using the distribution function Nj(r), calculations become simpler and
computational errors are smaller.
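
The equivalence between the box-by-box sum and the histogram form in Equation (7) can be checked numerically with the sketch below. It assumes a binary NumPy image in which True marks pore pixels; the helper names box_masses, chi_direct and chi_histogram are hypothetical. Empty boxes are excluded from both sums, which matches the index j starting at 1.

```python
import numpy as np

def box_masses(pore, r):
    """Pore-pixel count m_i in every non-overlapping r x r box."""
    pore = np.asarray(pore, dtype=bool)
    h, w = pore.shape
    h, w = h - h % r, w - w % r                    # crop to a multiple of r
    blocks = pore[:h, :w].reshape(h // r, r, w // r, r)
    return blocks.sum(axis=(1, 3)).ravel()

def chi_direct(pore, r, q):
    """Sum of mu_i**q over the occupied boxes."""
    m = box_masses(pore, r)
    m = m[m > 0]
    return np.sum((m / m.sum()) ** q)

def chi_histogram(pore, r, q):
    """Same quantity computed from N_j(r), the number of boxes with j pore pixels."""
    m = box_masses(pore, r)
    counts = np.bincount(m, minlength=r * r + 1)   # counts[j] = N_j(r)
    j = np.arange(1, r * r + 1)
    N = counts[1:]
    M = np.sum(j * N)                              # total number of pore pixels
    return np.sum(N * (j / M) ** q)

# Usage on a random test image with about 30% porosity.
rng = np.random.default_rng(0)
img = rng.random((64, 64)) < 0.3
print(chi_direct(img, 8, 2.0), chi_histogram(img, 8, 2.0))
```

The histogram form evaluates at most r · r powers per box size, whatever the number of boxes, which is one way to read the remark that the calculations become simpler.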
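A log-log plot of a self-similar measure, χ(q, r), vs. r at various values of q
gives

(8) $\chi(q, r) \sim r^{\tau(q)}$

where τ(q) is the qth mass exponent (Feder, 1989). We can express τ(q) as:

(9) $\tau(q) = \lim_{r \to 0} \dfrac{\log \chi(q, r)}{\log r}$

Then, the generalized dimension, Dq, can be introduced by the following scaling
relationship (Feder, 1989):

(10) $D_q = \lim_{r \to 0} \dfrac{1}{q - 1}\, \dfrac{\log \chi(q, r)}{\log r}$

And, therefore

(11) $\tau(q) = (q - 1)\, D_q$

so that a measure spread uniformly over the plane, for which χ(q, r) is proportional
to r^(2(q − 1)), gives Dq = 2 for every q.
For the case that q = 1, Equation (11) cannot be applied and the following
equation should be used:

(12) $D_1 = \lim_{r \to 0} \dfrac{\sum_{i=1}^{n(r)} \mu_i(1, r) \log \mu_i(1, r)}{\log r}$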
The generalized dimensions, Dq, for q = 0, 1 and 2 are known as the capacity,
the information, and the correlation dimensions, respectively (Hentschel and
Procaccia, 1983). The capacity dimension is the box-counting or fractal dimension.
The information dimension is related to the entropy of the system, whereas the
correlation dimension computes the correlation of measures contained in boxes of
various sizes (Posadas et al., 2003).
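
In practice the limit of vanishing box size is replaced by a regression over the available box sizes. The sketch below estimates Dq as the slope of log χ(q, r)/(q − 1) against log r, and uses the sum of μi log μi in place of that quantity when q = 1. It assumes a binary NumPy image with True for pore pixels and counts only occupied boxes; the function names are hypothetical.

```python
import numpy as np

def box_masses(pore, r):
    """Pore-pixel counts of the occupied r x r boxes."""
    pore = np.asarray(pore, dtype=bool)
    h, w = pore.shape
    h, w = h - h % r, w - w % r                    # crop to a multiple of r
    m = pore[:h, :w].reshape(h // r, r, w // r, r).sum(axis=(1, 3)).ravel()
    return m[m > 0]

def generalized_dimension(pore, q, sizes):
    """Regression estimate of D_q over the given box sizes."""
    y = []
    for r in sizes:
        m = box_masses(pore, r)
        mu = m / m.sum()
        if q == 1:
            y.append(np.sum(mu * np.log(mu)))      # information dimension case
        else:
            y.append(np.log(np.sum(mu ** q)) / (q - 1))
    slope, _ = np.polyfit(np.log(sizes), y, 1)
    return slope

# Usage: a completely filled image is a uniform measure on the plane.
filled = np.ones((64, 64), dtype=bool)
for q in (-2, 0, 1, 2):
    print(q, generalized_dimension(filled, q, [1, 2, 4, 8, 16]))
```

For the uniform test image every estimate is close to 2, as expected for a measure that fills the plane. For real images the result depends on which box sizes enter the regression, which is the issue examined in the case study.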
Given these definitions and the behaviour to be expected from a multifractal
measure, it is again instructive to seek the lower and upper bounds for χ(q, r) in
order to establish what scope exists for behaviour other than that associated with
a multifractal measure (for further explanation see Bird et al., 2005). Following
Bird et al. (2005), we briefly establish bounds for
χ(q, r). Four separate ranges of values of the parameter q should be considered:

Case q > 1
The smallest value that χ(q, r) can take corresponds to a uniform distribution of
pore phase over the image. The largest value that χ(q, r) can take corresponds to
the case in which each grid block covering the pore phase is entirely filled by pore
phase. The lower and upper bounds are then as follows:

(13) $\left( \dfrac{L}{r} \right)^{2(1-q)} < \chi(q, r) < f^{\,1-q} \left( \dfrac{L}{r} \right)^{2(1-q)}, \quad q > 1$

where f is the fraction of the image occupied by black pixels and L is the side length
of the image.

Case q = 1
The smallest value of the entropy corresponds to the case in which each grid
block covering the pore phase is entirely filled by pore phase. The largest value
corresponds to a uniform distribution. Lower and upper bounds are then as follows:

(14) $2 \ln\!\left( \dfrac{L}{r} \right) + \ln f < -\sum_{i=1}^{n(r)} \mu_i \ln \mu_i < 2 \ln\!\left( \dfrac{L}{r} \right), \quad q = 1$

Case 0 ≤ q < 1
The smallest value χ(q, r) can take corresponds to the case in which each grid block
covering the pore phase is entirely filled by pore. The largest value corresponds to
a uniform distribution of pore phase over the image. Lower and upper bounds are
as follows:

(15) $f^{\,1-q} \left( \dfrac{L}{r} \right)^{2(1-q)} < \chi(q, r) < \left( \dfrac{L}{r} \right)^{2(1-q)}, \quad 0 \le q < 1$

Case q < 0
The smallest value χ(q, r) can take corresponds to the case in which each grid block
covering the pore phase is entirely filled by pore. The function decreases
monotonically with r. Therefore, the value corresponding to r = 1 pixel can be selected as an
upper bound.

(16) $f^{\,1-q} \left( \dfrac{L}{r} \right)^{2(1-q)} < \chi(q, r) < L^{2(1-q)} f^{\,1-q}, \quad q < 0$

Having defined these bounds, we now examine their significance in
terms of extracting generalized dimensions from image data. For q > 1 and for
0 ≤ q < 1, the bounding functions, when plotted on the log-log plot used to extract
the dimension, yield two parallel lines with a vertical separation of (1 − q) ln f.
For q = 1, the bounding functions, when included in the plot of entropy against
ln r, again yield two parallel lines with a slope of magnitude 2 and a separation of ln f. Thus, in these
cases we reach the same impasse as with the fractal analysis: depending
on f, and independently of the actual geometry considered, the data can be so constrained
as to yield convincing straight-line fits with associated derived dimensions.
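
The bounds for q different from 1 can be verified numerically. The sketch below computes χ(q, r) for a random binary image and compares it with the bounding values, written here as a "uniform" value and an "entirely filled blocks" value. It assumes a square image whose side is a multiple of r; the names chi and chi_bounds are hypothetical.

```python
import numpy as np

def chi(pore, r, q):
    """Generating function over the occupied r x r boxes of a square image."""
    L = pore.shape[0]
    m = pore.reshape(L // r, r, L // r, r).sum(axis=(1, 3)).ravel()
    m = m[m > 0]
    return np.sum((m / m.sum()) ** q)

def chi_bounds(f, L, r, q):
    """(lower, upper) bounds on chi(q, r) for q != 1."""
    uniform = (L / r) ** (2 * (1 - q))             # pore phase spread evenly
    filled = f ** (1 - q) * uniform                # occupied blocks entirely pore
    if q > 1:
        return uniform, filled
    if q >= 0:
        return filled, uniform
    return filled, (L * L * f) ** (1 - q)          # q < 0: upper bound taken at r = 1

# Usage on a random image with about 30% porosity.
rng = np.random.default_rng(0)
pore = rng.random((64, 64)) < 0.3
f = pore.mean()
for q in (-2, 0, 0.5, 2):
    lo, hi = chi_bounds(f, 64, 8, q)
    print(q, lo, chi(pore, 8, q), hi)
```

For q ≥ 0 the ratio of the two bounds depends only on f and q, not on r, which is why they appear as parallel lines on the log-log plot. The closer f is to 1, the narrower the band into which any image, multifractal or not, is forced.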

3.2 Gliding Box Method

The gliding-box method was originally used for lacunarity analysis (Allain and
Cloitre, 1991). Later, it was modified by Cheng (1997a, 1997b) for estimating
τ(q) as follows:

(17) $\langle \tau(q) \rangle + D = -\dfrac{\log \langle M^{q}(r) \rangle}{\log (r / r_{\min})}$

where D is the dimension of the Euclidean space in which the image is embedded
(in this case D = 2) and M represents the multiplier measured on each pixel as:

(18) $M^{q}(r) = \left[ \dfrac{\mu(r_{\min})}{\mu(r)} \right]^{q}$

For further details see Grau et al. (2006). The advantage of using Equation (17)
in comparison with Equation (9) is that the estimation is independent of box size r,
which allows τ(q) to be estimated from only two successive box sizes. Equation (18)
requires that μ(rmin) be non-zero.
Once this estimation is done, Equation (11) can be applied to estimate Dq. For
the case of q = 1 the following relationship is applied, based on the work of
Saucier and Muller (1999):

(19) D̂1 = 2D2 − D3
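
A minimal sketch of the gliding-box estimate is given below. It reads Equation (18) as the ratio of the measure in the smallest box to the measure in a concentric box of size r, raised to the power q, and it is not the implementation of Cheng (1997a, 1997b) or Grau et al. (2006). The assumptions are: square windows of odd size centred on each pixel, the measure taken as the number of pore pixels in the window, and averaging over every window position that fits in the image and has a non-empty smallest window. The function names are hypothetical.

```python
import numpy as np

def window_sums(pore, r):
    """Number of pore pixels in every r x r window that fits in the image."""
    c = np.cumsum(np.cumsum(pore.astype(np.int64), axis=0), axis=1)
    c = np.pad(c, ((1, 0), (1, 0)))                # zero row and column in front
    return c[r:, r:] - c[:-r, r:] - c[r:, :-r] + c[:-r, :-r]

def gliding_box_tau(pore, r, q, r_min=1, D=2):
    """Estimate tau(q) from one box size r (r and r_min odd, r > r_min)."""
    mu_r = window_sums(pore, r)
    off = (r - r_min) // 2                         # align concentric windows
    H, W = mu_r.shape
    mu_min = window_sums(pore, r_min)[off:off + H, off:off + W]
    keep = mu_min > 0                              # the smallest box must not be empty
    M = mu_min[keep] / mu_r[keep]                  # multiplier at each position
    mean_Mq = np.mean(M ** q)
    return -np.log(mean_Mq) / np.log(r / r_min) - D

# Usage: convert tau(q) to D_q with tau(q) = (q - 1) D_q.
filled = np.ones((32, 32), dtype=bool)
for q in (-2, 0, 3):
    tau = gliding_box_tau(filled, r=5, q=q)
    print(q, tau / (q - 1))
```

For the completely filled test image every Dq is close to 2. At q = 0 the multiplier raised to the power zero is 1 at every position, so the estimate of D0 equals D whatever the image.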

4. IMAGES FOR THE CASE STUDY


Three soil samples were selected to represent a range of
void pattern distributions in soils and a wide range of porosity values, from 5%
to 47%. Each of the samples was prepared for image analysis following
the procedure described by Protz and VandenBygaart (1998).
The data were obtained by imaging thin sections with a Kodak 460 RGB camera
using transmitted and circularly polarized illumination. The images were cropped from
3060 × 2036 pixels to 3000 × 2000 pixels. Then, EASI/PACE software was used to classify
the data and separate the void bitmap; each individual pixel measured 18.6 × 18.6
microns. The images of these soils are shown in Figure 3.
To avoid any interference from edge effects in the calculations using the
box-counting method, an area of 1024 × 1024 pixels in the upper left corner of the
original images was selected.

5. RESULTS OF THE CASE STUDY AND DISCUSSION

5.1 Generating Function with the Box-counting Method


For the three binary images, χ(q, r) was calculated and then a bi-log plot of χ(q, r)
versus r was made to observe the behavior. All plots showed a clear pattern in the
data. In Figure 4, for example, at negative q there were two distinct regions, one
where there was a linear relationship between log r and log χ(q, r) and another
where the value of log χ(q, r) was almost constant versus log r. The box size at
which the behavior changes is around 64 pixels for the three images. These two
phases were not evident with positive q values (see Figure 4).
The existence of a plateau phase of log χ(q, r) can be explained by the nature
of the measure under consideration. At r values close to 1, the variation in the number
of black pixels is based on a few pixels, the simplest case being r = 1

[Figure 3: three binary soil images, panels Sample A, Sample B and Sample C]
Figure 3. Soil binary images, pore phase in black pixels, of: (a) ADS, (b) BUSO and (c) EHV1. The
images have 5.65%, 19.17% and 46.67% porosity, respectively

[Figure 4: panels A, B and C; x-axis Log(r), y-axis Log χ(q,r); one curve per q from −10 to 10 in steps of 2]
Figure 4. Bi-log plot of χ(q, r) versus box size r at different mass exponent q: A) ADS; B)
BUSO; C) EVH1

where the measure can only take the value 0 or 1. Thus, for small boxes of size r the
proportions among their values are mainly constant. However, when the box size
exceeds a certain value a scaling pattern begins.

5.2 Generalized Dimensions Using the Box-counting Method

If all of the regression points are considered, the Dq values obtained, mainly for
q < 0, are quite different from those obtained if only the regression points in
the linear range are chosen (Figure 5). Between these two criteria, almost any Dq can
be obtained, although for q ≥ 0 the differences are not significant. Many authors
have pointed out this fact since the first applications of multifractal analysis to
experimental results (Tarquis et al., 2005).
The changes in Dq, very noticeable in this case, make impossible any
comparison or calculation of the amplitude of the dimensions (D−10 − D+10), as
has been done in several works.
The differences among the Dq curves (Figure 5, filled circles) are
mainly found in the negative part. In particular, comparing ADS (Figure 5A, filled
circles) with the rest, it is evident that it does not show multifractal behavior.
All the D0 obtained have a value of 2 (plane dimension). This overestimation is
due to the fact that the studied range was selected to give an optimum fit for
all the q values. However, looking at the lower and upper bounds of the box-counting
plots for q = 0 (Figure 6), it is quite clear that, regardless of the structure in the image,
a linear fit will be obtained with a high r².
The standard errors (data not shown) of the Dq obtained in the linear
phase are minimal and the r² of the regression analysis is very high. However, this
is not surprising if we realize that only three points are being used. In addition, the
number of boxes of each size is very low: for boxes of 128 × 128 pixels the number
of boxes is 64, and for boxes of 256 × 256 pixels it is 16, when analyzing an
image of 1024 × 1024 pixels, which is considered a representative elementary area
(VandenBygaart and Protz, 1999).
This size restriction is avoided by using the gliding-box method, whose results
are discussed in the next section.
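
The sample-size argument can be made concrete with a short calculation. The sketch below compares the number of non-overlapping boxes that tile an L × L image with the number of positions available to a box that glides one pixel at a time; the function names are hypothetical.

```python
def n_fixed_boxes(L, r):
    """Non-overlapping r x r boxes that tile an L x L image."""
    return (L // r) ** 2

def n_gliding_boxes(L, r):
    """Positions of an r x r box moved one pixel at a time inside an L x L image."""
    return (L - r + 1) ** 2

for r in (128, 256):
    print(r, n_fixed_boxes(1024, r), n_gliding_boxes(1024, r))
```

For the 1024 × 1024 image this gives 64 fixed boxes against 804,609 gliding positions at r = 128, and 16 against 591,361 at r = 256. The gliding boxes overlap, so these samples are not independent, and a method that only uses boxes centred on pore pixels would use fewer positions than this count.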

5.3 Generalized Dimensions Using the Gliding Box Method

For the three binary images, <M^q(r)> was calculated and then a bi-log plot of
<M^q(r)> versus r/rmin was made. All plots showed a linear relationship, as
expected, with a large number of points available to calculate a linear regression
and to estimate Dq from the slope of the line (Figure 7). In the case of EHV1 for
q < −6 (Figure 7C), the linear relationship is not as clear as in the rest of the
images.
Finally, the Dq values obtained with both methods can be
compared in Figure 8. In all of the graphs, D0 appears again with a value of 2,
imposed by the gliding-box method as explained in Section 3.2.
For ADS (Figure 8A) both curves are similar. On purpose, the range of values
on the Dq axis has been narrowed to show how the appearance of the plot could lead to an error
in our conclusions, when it was evident in Figure 5 that Dq was almost
constant.

[Figure 5: panels A, B and C; x-axis q from −10 to 10, y-axis Dq]
Figure 5. Generalized dimensions (Dq) from q = −10 to q = +10 for all points of the regression line
(filled squares) and for the three selected points based on the bi-log plot of χ(q, r) (filled circles) of each
image: A) ADS; B) BUSO and C) EVH1

The differences between both methods for BUSO and EVH1 (Figures 8B
and 8C, respectively) are larger at negative q values, although at positive
values Dq shows a stronger decay (Grau et al., 2006).

6. CONCLUSIONS

In recent years, the concepts of fractal and multifractal analysis have been increasingly
applied to the analysis of porous materials, including soils, and to the development of
fractal models of porous media. In terms of modeling, it is important to characterize
the multiscale heterogeneity of soil structure in a useful way, but the blind
application of these analyses does not achieve this.

[Figure 6: panels (a) and (b); x-axis log r, y-axis log N]
Figure 6. Box counting plots for EHV1 soil images, q = 0, with upper and lower bounds (a) solid
phase (b) pore phase. (From Bird et al., J. of Hydrol., 322, 211, 2006. With permission)

[Figure 7: panels A, B and C; x-axis Log(r/rmin), y-axis Log(<M(r,q)>); one curve per q from −10 to 10 in steps of 2]
Figure 7. Bi-log plot of <M(r, q)> versus box size ratio r/rmin at different mass exponent (q): A)
ADS; B) BUSO; C) EVH1

[Figure 8: panels A, B and C; x-axis q from −10 to 10, y-axis Dq]
Figure 8. Generalized dimensions (Dq) from q = −10 to q = +10 based on the gliding-box method
(empty squares) and based on the box-counting method (filled circles) using the same range of box sizes:
A) ADS; B) BUSO; C) EVH1

The results obtained by the “box-counting” and “gliding-box” methods for
multifractal modeling of soil pore images show that “gliding-box” provides more
consistent results, as it creates a greater number of large boxes than the
box-counting method and avoids the restriction that the box-counting method imposes
on the partition function.

7. ACKNOWLEDGEMENTS

We thank Dr. Richard Heck of Guelph University for the soil images. We are very
indebted to Dr. N. Bird, Dr. Q. Cheng and Dr. D. Gimenez for helpful discussions.
This work was supported by Technical University of Madrid (UPM) and Madrid
Autonomous Community (CAM), Project No. M050020163.

REFERENCES
Aharony, A., 1990, Multifractals in physics – successes, dangers and challenges, Physica A. 168:
479–489.
Ahammer, H., De Vaney, T.T.J. and Tritthart, H.A., 2003, How much resolution is enough? Influence
of downscaling the pixel resolution of digital images on the generalised dimensions, Physica D. 181
(3–4):147–156.
Allain, C. and Cloitre, M., 1991, Characterizing the lacunarity of random and deterministic fractal sets,
Physical Review A. 44:3552–3558.
Anderson, A.N., McBratney, A.B. and FitzPatrick, E.A., 1996, Soil Mass, Surface, and Spectral Fractal
Dimensions Estimated from Thin Section Photographs, Soil Sci. Soc. Am. J. 60:962–969.
Anderson, A.N., McBratney, A.B. and Crawford, J.W., Applications of fractals to soil studies. Adv.
Agron., 63:1, 1998.
Barnsley, M.F., Devaney, R.L., Mandelbrot, B.B., Peitgen, H.O., Saupe, D. and Voss, R.F., 1988, The
Science of Fractal Images. Edited by H.O. Peitgen and D. Saupe, Springer-Verlag, New York.
Bartoli, F., Philippy, R., Doirisse, S., Niquet, S. and Dubuit, M., 1991, Structure and self-similarity in
silty and sandy soils; the fractal approach, J. Soil Sci. 42:167–185.
Bartoli, F., Bird, N.R., Gomendy, V., Vivier, H. and Niquet, S., 1999, The relation between silty soil
structures and their mercury porosimetry curve counterparts: fractals and percolation, Eur. J. Soil Sci.,
50(9).
Bartoli, F., Dutartre, P., Gomendy, V., Niquet, S. and Vivier, H., 1998. Fractal and soil structures. In:
Fractals in Soil Science, Baveye, Parlange and Stewart, Eds., CRC Press, Boca Raton, 203–232.
Baveye, P. and Boast, C.W. Fractal Geometry, Fragmentation Processes and the Physics of Scale-
Invariance: An Introduction. In Fractals in Soil Science, Baveye, Parlange and Stewart, Eds., CRC
Press, Boca Raton, 1998, 1.
Baveye, P., Boast, C.W., Ogawa, S., Parlange, J.Y. and Steenhuis, T., 1998. Influence of image resolution
and thresholding on the apparent mass fractal characteristics of preferential flow patterns in field soils,
Water Resour. Res. 34, 2783–2796.
Bird, N., Díaz, M.C., Saa, A. and Tarquis, A.M., 2006. Fractal and Multifractal Analysis of Pore-Scale
Images of Soil. J. Hydrol, 322, 211–219.
Bird, N.R.A., Perrier, E. and Rieu, M., 2000. The water retention function for a model of soil structure
with pore and solid fractal distributions. Eur. J. Soil Sci. 51, 55–63.
Bird, N.R.A. and Perrier, E.M.A., 2003. The pore-solid fractal model of soil density scaling. Eur. J. Soil
Sci. 54, 467–476.
Booltink, H.W.G., Hatano, R. and Bouma, J., 1993. Measurement and simulation of bypass flow in a
structured clay soil; a physico-morphological approach. J. Hydrol. 148, 149–168.
Brakensiek, D.L., W.J. Rawls, S.D. Logsdon and Edwards, W.M., 1992. Fractal description of macrop-
orosity. Soil Sci. Soc. Am. J. 56, 1721–1723.
Buczkowski, S., Hildgen, P. and Cartilier, L. 1998. Measurements of fractal dimension by box-counting:
a critical analysis of data scatter. Physica A 252, 23–34.
Cheng, Q. and Agterberg, F.P. (1996). Comparison between two types of multifractal modeling.
Mathematical Geology, 28(8), 1001–1015.
Cheng, Q. (1997a). Discrete multifractals. Mathematical Geology, 29(2), 245–266.
Cheng, Q. (1997b). Multifractal modeling and lacunarity analysis. Mathematical Geology, 29(7),
919–932.
Crawford, J.W., Baveye, P., Grindrod, P. and Rappoldt, C. Application of Fractals to Soil Properties,
Landscape Patterns, and Solute Transport in Porous Media, in Assessment of Non-Point Source
Pollution in the Vadose Zone. Geophysical Monograph 108, Corwin, Loague and Ellsworth, Eds.,
American Geophysical Union, Washington, DC, 1999, 151.
Crawford, J.W., Ritz, K. and Young, I.M. Quantification of fungal morphology, gaseous transport and
microbial dynamics in soil: an integrated framework utilising fractal geometry. Geoderma, 56, 1578,
1993.
Crawford, J.W., Matsui, N. and Young, I.M. 1995., The relation between the moisture-release curve and
the structure of soil. Eur. J. Soil Sci. 46, 369–375.
Dathe, A., Eins, S., Niemeyer, J. and Gerold, G. The surface fractal dimension of the soil-pore interface
as measured by image analysis. Geoderma, 103, 203, 2001.
Dathe, A., Tarquis, A.M. and Perrier, E., 2006. Multifractal analysis of the pore- and
solid-phases in binary two-dimensional images of natural porous structures. Geoderma,
doi:10.1016/j.geoderma.2006.03.024, in press.
Dathe, A. and Thullner, M., 2005. The relationship between fractal properties of solid matrix and pore
space in porous media. Geoderma, 129, 279–290.
Feder, J., 1989. Fractals. Plenum Press, New York. 283pp
Flury, M. and Fluhler, H., 1994. Brilliant blue FCF as a dye tracer for solute transport studies – A
toxicological overview. J.Environ. Qual. 23, 1108–1112.
Flury, M. and Fluhler, H., 1995. Tracer characteristics of brilliant blue. Soil Sci. Soc. Am. J. 59, 22–27.
Flury, M., Fluhler, H., Jury, W.A. and Leuenberger, J., 1994. Susceptibility of soils to preferential flow
of water: A field study, Water Resour. Res. 30, 1945–1954.
Giménez, D., R.R. Allmaras, E.A. Nater and Huggins, D.R., 1997a. Fractal dimensions for volume and
surface of interaggregate pores – scale effects. Geoderma 77, 19–38.
Giménez D., Perfect E., Rawls W.J. and Pachepsky, Y., 1997b. Fractal models for predicting soil
hydraulic properties: a review. Eng. Geol. 48, 161–183.
Gouyet, J.G. Physics and Fractal Structures. Masson, Paris, 1996.
Grau, J., Méndez, V., Tarquis, A.M., Díaz, M.C. and A. Saa, 2006. Comparison of gliding box and
box-counting methods in soil image analysis. Geoderma, doi:10.1016/j.geoderma.2006.03.009, in
press.
Griffith, D.A. Advanced Spatial Statistics. Kluwer Academic Publishers, Boston, 1988.
Hallett, P.D., Bird, N.R.A., Dexter, A.R. and Seville, P.K., 1998. Investigation into the fractal scaling
of the structure and strength of soil aggregates. Eur. J. Soil Sci. 49, 203–211.
Hatano, R. and Booltink, H.W.G., 1992. Using Fractal Dimensions of Stained Flow Patterns in a Clay
Soil to Predict Bypass Flow. J. Hydrol. 135, 121–131.
Hatano, R., Kawamura, N., Ikeda, J. and Sakuma, T. Evaluation of the effect of morphological features
of flow paths on solute transport by using fractal dimensions of methylene blue staining patterns.
Geoderma 53, 31, 1992.
Hentschel, H.G.R. and Procaccia, I. (1983). The infinite number of generalized dimensions of fractals
and strange attractors. Physica D, 8, 435.
Kaye, B.G. A Random Walk through Fractal Dimensions. VCH Verlagsgesellschaft, Weinheim,
Germany, 1989, 297.
Mandelbrot, B.B. The Fractal Geometry of Nature. W.H. Freeman, San Francisco, CA, 1982.
McCauley, J.L. 1992. Models of permeability and conductivity of porous media. Physica A 187, 18–54.
Moran, C.J., McBratney, A.B. and Koppi, A.J.,1989. A rapid method for analysis of soil macropore
structure. I. Specimen preparation and digital binary production. Soil Sci. Soc. Am. J. 53, 921–928.
Muller, J., 1996. Characterization of pore space in chalk by multifractal analysis. J. Hydrology, 187,
215–222.
Muller, J. and McCauley, J.L., 1992. Implication of Fractal Geometry for Fluid Flow Properties of
Sedimentary Rocks. Transp. Porous Media 8, 133–147.
Muller, J., Huseby, O.K. and Saucier, A., 1995. Influence of Multifractal Scaling of Pore Geometry on
Permeabilities of Sedimentary Rocks. Chaos, Solitons & Fractals 5, 1485–1492.
Ogawa, S., Baveye, P., Boast, C.W., Parlange, J.Y. and Steenhuis, T. Surface fractal characteristics of
preferential flow patterns in field soils: evaluation and effect of image processing. Geoderma, 88,
109, 1999.
Oleschko, K., Fuentes, C., Brambila, F. and Alvarez, R. Linear fractal analysis of three Mexican soils
in different management systems. Soil Technol., 10, 185, 1997.
Oleschko, K. Delesse principle and statistical fractal sets: 1. Dimensional equivalents. Soil&Tillage
Research, 49, 255,1998a.
Oleschko, K., Brambila, F., Aceff, F. and Mora, L.P. From fractal analysis along a line to fractals on
the plane. Soil&Tillage Research, 45, 389, 1998b.
Orbach, R. Dynamics of fractal networks. Science (Washington, DC) 231, 814, 1986.
Pachepsky, Y.A., Yakovchenko, V., Rabenhorst, M.C., Pooley, C. and Sikora, L.J. Fractal parameters
of pore surfaces as derived from micromorphological data: effect of long term management practices.
Geoderma, 74, 305, 1996.
Pachepsky, Y.A., Giménez, D., Crawford, J.W. and Rawls, W.J. Conventional and fractal geometry in
soil science. In Fractals in Soil Science, Pachepsky, Crawford and Rawls, Eds., Elsevier Science,
Amsterdam, 2000, 7.
Persson, M., Yasuda, H., Albergel, J., Berndtsson, R., Zante, P., Nasri, S. and Öhrström, P., 2001.
Modeling plot scale dye penetration by a diffusion limited aggregation (DLA) model. J. Hydrol. 250,
98–105.
Peyton, R.L., Gantzer, C.J., Anderson, S.H., Haeffner, B.A. and Pfeifer, P. Fractal dimension to
describe soil macropore structure using X ray computed tomography. Water Resources Research, 30,
691, 1994.
Posadas, A.N.D., Giménez, D., Quiroz, R. and Protz, R., 2003. Multifractal Characterization of Soil
Pore Spatial Distributions. Soil Sci. Soc. Am. J. 67, 1361–1369
Protz, R. and VandenBygaart, A.J. 1998. Towards systematic image analysis in the study of soil
micromorphology. Science Soils, 3. (available online at http://link.springer.de/link/service/journals/).
Ripley, B.D. Statistical Inference for Spatial Processes, Cambridge Univ. Press, Cambridge, 1988.
Saucier, A. Effective permeability of multifractal porous media. Physica A, 183, 381, 1992.
Saucier, A. and Muller, J. Remarks on some properties of multifractals. Physica A, 199, 350, 1993.
Saucier, A. and Muller, J. Textural analysis of disordered materials with multifractals. Physica A, 267,
221, 1999.
Saucier, A., Richer, J. and Muller, J., 2002. Statistical mechanics and its applications. Physica A, 311
(1–2): 231–259.
Takayasu, H. Fractals in the Physical Sciences. Manchester University Press, Manchester, 1990.
Tarquis, A.M., Giménez, D., Saa, A., Díaz, M.C. and Gascó, J.M., 2003. Scaling and Multiscaling of
Soil Pore Systems Determined by Image Analysis. In: Scaling Methods in Soil Physics, Pachepsky,
Radcliffe and Selim Eds., CRC Press, 434 pp.
Tarquis, A.M., McInnes, K.J., Keys, J., Saa, A., García, M.R. and Díaz, M.C., 2006. Multiscaling
Analysis In A Structured Clay Soil Using 2D Images. J. Hydrol, 322, 236–246.
Tel, T. and Vicsek, T., 1987. Geometrical multifractality of growing structures, J. Physics A. General,
20, L835–L840.
VandenBygaart, A.J. and Protz, R., 1999. The representative elementary area (REA) in studies of
quantitative soil micromorphology. Geoderma 89, 333–346.
Vicsek, T. 1990. Mass multifractals. Physica A, 168, 490–497.


Vogel, H.J. and Kretzschmar, A., 1996. Topological characterization of pore space in soil-sample
preparation and digital image-processing. Geoderma 73, 23–38.
